Natural Language Processing with Python & NLTK (Part 2)
Introduction
Working off my post from yesterday, I'll continue with word taggers, an incredibly important topic in natural language processing. Before you begin this tutorial, make sure you've gone through the environment setup portion of yesterday's tutorial.
3.0 Word Tagging and Models
Given any sentence, you can classify each word as a noun, verb, conjunction, or any other class of words. When there are hundreds of thousands of sentences, or even millions, doing this by hand is a large and tedious task. Fortunately, it's one that can be solved computationally.
3.1 NLTK Parts of Speech Tagger
NLTK is a Python package that provides libraries for different text processing techniques, such as classification, tokenization, stemming, parsing, and, most importantly for this example, tagging.
import nltk
text = nltk.word_tokenize("Python is an awesome language!")
nltk.pos_tag(text)
[('Python', 'NNP'), ('is', 'VBZ'), ('an', 'DT'), ('awesome', 'JJ'), ('language', 'NN'), ('!', '.')]
Not sure what DT, JJ, or any other tag means? Just try this in your Python shell:
nltk.help.upenn_tagset('JJ')
JJ: adjective or numeral, ordinal
third ill-mannered pre-war regrettable oiled calamitous first separable
ectoplasmic battery-powered participatory fourth still-to-be-named
multilingual multi-disciplinary ...
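If you'd rather see the whole tagged sentence in readable form, here is a minimal sketch that pairs each token with a short description of its tag. The `PENN_GLOSSES` table and the `gloss` helper are my own hand-written illustration covering only the tags in the example above; they are not part of NLTK.

```python
# Short glosses for the Penn Treebank tags that appear in the example output.
# This is a hand-written subset for illustration, not the full tagset.
PENN_GLOSSES = {
    "NNP": "proper noun, singular",
    "VBZ": "verb, present tense, 3rd person singular",
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    ".": "sentence terminator",
}

# The tagged sentence exactly as shown in the output above.
tagged = [("Python", "NNP"), ("is", "VBZ"), ("an", "DT"),
          ("awesome", "JJ"), ("language", "NN"), ("!", ".")]


def gloss(tagged_tokens, glosses=PENN_GLOSSES):
    """Pair each token with a readable description of its tag."""
    return [(word, glosses.get(tag, "unknown tag")) for word, tag in tagged_tokens]


for word, description in gloss(tagged):
    print(f"{word:10} {description}")
```

For any tag that isn't in this small table, nltk.help.upenn_tagset remains the place to look.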
3.1.1 Ambiguity
But what if a word can be tagged as more than one part of speech? Take the word "sink." Depending on the context of the sentence, it could be either a noun or a verb.
Furthermore, what if a piece of text uses a rhetorical device like sarcasm or irony? This can easily mislead a sentiment analyzer into misclassifying an otherwise ordinary expression.
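To make the "sink" ambiguity concrete, here is a small sketch that counts how often each word appears with each tag and reports the words seen with more than one. The three tagged sentences are made up for this example (they don't come from any corpus), and `tag_counts` is just an illustrative helper name.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sentences (Penn Treebank style tags), made up for this sketch.
toy_corpus = [
    [("The", "DT"), ("sink", "NN"), ("is", "VBZ"), ("full", "JJ")],
    [("Boats", "NNS"), ("sink", "VBP"), ("in", "IN"), ("storms", "NNS")],
    [("Clean", "VB"), ("the", "DT"), ("sink", "NN")],
]


def tag_counts(tagged_sents):
    """Count how often each lowercased word appears with each tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return counts


counts = tag_counts(toy_corpus)

# Keep only the words that were seen with more than one tag.
ambiguous = {word: dict(tags) for word, tags in counts.items() if len(tags) > 1}
print(ambiguous)
```

Here "sink" is the only ambiguous word: it shows up twice as a noun and once as a verb. A tagger has to decide between those options every time it sees the word.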
3.2 Unigram Models
Remember our bag of words model from earlier? One of its characteristics was that it didn't take the ordering of the words into account - that's why we were able to use dictionaries to map each word to a True value.
Unigram models work the same way: each word is considered on its own, so word order makes no difference. A unigram tagger simply assigns each word the tag it most often carried in the training data. You might be wondering why we care about unigram models since they seem so simple, but don't let their simplicity fool you - they're a foundational building block for many more advanced techniques in NLP.
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
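Conceptually, a unigram tagger is little more than a lookup table from each word to its most frequent tag. The sketch below shows that idea in plain Python. It is not NLTK's actual implementation, and the training sentences and function names (`train_unigram`, `tag_unigram`) are hypothetical ones made up for illustration.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_unigram(model, words):
    """Tag each word independently of its neighbours; unseen words get None."""
    return [(word, model.get(word)) for word in words]


# Hypothetical training data using Brown-style tags, made up for this sketch.
train = [
    [("the", "AT"), ("ground", "NN"), ("floor", "NN")],
    [("they", "PPSS"), ("ground", "VB"), ("the", "AT"), ("coffee", "NN")],
    [("the", "AT"), ("ground", "NN"), ("is", "BEZ"), ("wet", "JJ")],
]

model = train_unigram(train)
print(tag_unigram(model, ["the", "ground", "is", "dry"]))
```

In this toy data "ground" appears twice as NN and once as VB, so it is always tagged NN regardless of the words around it. The unseen word "dry" gets None, which is also what NLTK's UnigramTagger assigns to words it never saw in training.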
3.3 Bigram Models
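Here, ordering does matter. A bigram tagger picks the tag for each word by looking at the word itself together with the tag of the word before it.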
bigram_tagger = nltk.BigramTagger(brown_tagged_sents)
bigram_tagger.tag(brown_sents[2007])
Compare this with the unigram output for the same sentence. The only change is that "so" is now tagged CS (subordinating conjunction) instead of QL (qualifier):
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
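To see why the tag for "so" can depend on what precedes it, here is a plain-Python sketch of the bigram idea: the lookup key is the previous tag plus the current word. This is a simplified illustration rather than NLTK's implementation; the training sentences, the `START` marker, and the `fallback` dictionary are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

START = "<s>"  # marker for "no previous tag" at the start of a sentence


def train_bigram(tagged_sents):
    """Learn the most frequent tag for each (previous tag, word) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = START
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {context: tags.most_common(1)[0][0] for context, tags in counts.items()}


def tag_bigram(model, words, fallback=None):
    """Tag left to right; use `fallback` (a word -> tag dict) for unseen contexts."""
    fallback = fallback or {}
    tagged, prev_tag = [], START
    for word in words:
        tag = model.get((prev_tag, word), fallback.get(word))
        tagged.append((word, tag))
        prev_tag = tag
    return tagged


# Hypothetical training data using Brown-style tags, made up for this sketch.
train = [
    [("floor", "NN"), ("so", "CS"), ("that", "CS")],
    [("is", "BEZ"), ("so", "QL"), ("direct", "JJ")],
    [("was", "BEDZ"), ("so", "QL"), ("large", "JJ")],
]
model = train_bigram(train)

print(tag_bigram(model, ["floor", "so", "that"]))   # "so" after a noun
print(tag_bigram(model, ["is", "so", "direct"]))    # "so" after a form of "be"
print(tag_bigram(model, ["ground", "floor"]))       # contexts never seen in training
print(tag_bigram(model, ["ground", "floor"], fallback={"ground": "NN", "floor": "NN"}))
```

The third call shows the weakness of a pure bigram model: once one context is unseen, the tag is None, and that None becomes the previous tag for the next word, so the failure tends to cascade. NLTK's taggers accept a backoff argument for this reason, for example passing a unigram tagger as the backoff of a bigram tagger.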
4.0 Normalizing Text
The best data is data that's consistent - textual data usually isn't. But we can make it that way by normalizing it, and there are a number of ways to do this.
At the very least, we can convert all of the text to lowercase. You may have already done this before:
Given a piece of text,
raw = "OMG, Natural Language Processing is SO cool and I'm really enjoying this workshop!"
tokens = nltk.word_tokenize(raw)
tokens = [i.lower() for i in tokens]
['omg', ',', 'natural', 'language', 'processing', 'is', 'so', 'cool', 'and', 'i', "'m", 'really', 'enjoying', 'this', 'workshop', '!']
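Another common normalization step, depending on your task, is to drop tokens that are pure punctuation. Here is a minimal sketch using only the standard library; `drop_punctuation` is a hypothetical helper name, and whether you want this step at all is a judgment call for your own project.

```python
import string

# The lowercased tokens exactly as shown in the output above.
tokens = ['omg', ',', 'natural', 'language', 'processing', 'is', 'so', 'cool',
          'and', 'i', "'m", 'really', 'enjoying', 'this', 'workshop', '!']


def drop_punctuation(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]


words = drop_punctuation(tokens)
print(words)
```

The comma and the exclamation mark are removed, while "'m" survives because it still contains a letter.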
4.1 Stemming
But we can do more!
4.1.1 What is Stemming?
Stemming is the process of reducing each word in a sentence to its non-changing portion, the stem. For example, the words amusing, amusement, and amused all share the stem amus.
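To get a feel for how a stemmer arrives at amus, here is a deliberately crude suffix-stripping sketch. It is a toy written for this example, not the Porter or Lancaster algorithm, and the suffix list and the `min_stem` parameter are assumptions of mine.

```python
# Suffixes to try, in order. "ement" comes first so it wins over shorter matches.
SUFFIXES = ("ement", "ing", "ed", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least `min_stem` letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["amusing", "amusement", "amused"]])
```

All three words come out as amus. Real stemmers apply many more rules and conditions than this, which is why different stemmers can disagree on the same word.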
4.1.2 Types of Stemmers
You're probably wondering how to convert a series of words to their stems. Luckily, NLTK has a few built-in, well-established stemmers available for you to use! They work slightly differently since they follow different rules - which one you use depends on what you happen to be working on.
First, let's try the Lancaster Stemmer:
lancaster = nltk.LancasterStemmer()
stems = [lancaster.stem(i) for i in tokens]
This should have the output:
['omg', ',', 'nat', 'langu', 'process', 'is', 'so', 'cool', 'and', 'i', "'m", 'real', 'enjoy', 'thi', 'workshop', '!']
Next, let's try the Porter Stemmer:
porter = nltk.PorterStemmer()
stem = [porter.stem(i) for i in tokens]
Notice how the Porter Stemmer maps "natural" to "natur" and "really" to "realli", where the Lancaster Stemmer produced "nat" and "real":
['omg', ',', 'natur', 'languag', 'process', 'is', 'so', 'cool', 'and', 'i', "'m", 'realli', 'enjoy', 'thi', 'workshop', '!']
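With longer texts it gets hard to spot the differences by eye. Here is a small sketch that lines up the two outputs shown above and prints only the tokens where the stemmers disagree. The lists are copied from the outputs in this post, and `differing` is just an illustrative helper name.

```python
# Tokens and stemmer outputs copied from the examples above.
tokens = ['omg', ',', 'natural', 'language', 'processing', 'is', 'so', 'cool',
          'and', 'i', "'m", 'really', 'enjoying', 'this', 'workshop', '!']
lancaster_stems = ['omg', ',', 'nat', 'langu', 'process', 'is', 'so', 'cool',
                   'and', 'i', "'m", 'real', 'enjoy', 'thi', 'workshop', '!']
porter_stems = ['omg', ',', 'natur', 'languag', 'process', 'is', 'so', 'cool',
                'and', 'i', "'m", 'realli', 'enjoy', 'thi', 'workshop', '!']


def differing(tokens, stems_a, stems_b):
    """Return (token, stem_a, stem_b) for every position where the stems differ."""
    return [(t, a, b) for t, a, b in zip(tokens, stems_a, stems_b) if a != b]


for token, lanc, port in differing(tokens, lancaster_stems, porter_stems):
    print(f"{token}: Lancaster -> {lanc}, Porter -> {port}")
```

For this sentence the two stemmers disagree on three tokens: "natural", "language", and "really". In each case Lancaster cuts the word shorter.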
4.2 Lemmatization
4.2.1 What is Lemmatization?
Lemmatization is the process of converting each word in a sentence to its dictionary form, the lemma. For example, amusing and amused are both forms of the verb amuse, so their lemma is amuse. Unlike the stem amus, a lemma is always a real word.
4.2.2 WordNetLemmatizer
Once again, NLTK is awesome and has a built-in lemmatizer for us to use:
from nltk import WordNetLemmatizer
lemma = WordNetLemmatizer()
text = "Women in technology are amazing at coding"
ex = [i.lower() for i in text.split()]
lemmas = [lemma.lemmatize(i) for i in ex]
['woman', 'in', 'technology', 'are', 'amazing', 'at', 'coding']
Notice that "women" is changed to "woman"!
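You may also have noticed that "are" and "amazing" came back unchanged. The lemmatize method takes an optional pos argument that defaults to noun, so verb forms are only reduced when you tell the lemmatizer they are verbs. The sketch below imitates that behaviour with a tiny lookup table. The `LEMMAS` table and `toy_lemmatize` are hypothetical stand-ins written for this example; the real lemmatizer consults WordNet instead.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech). A real
# lemmatizer consults a full lexical database such as WordNet instead.
LEMMAS = {
    ("women", "n"): "woman",
    ("are", "v"): "be",
    ("amazing", "v"): "amaze",
}


def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged."""
    return LEMMAS.get((word, pos), word)


words = ["women", "are", "amazing"]
as_nouns = [toy_lemmatize(w) for w in words]           # every word treated as a noun
as_verbs = [toy_lemmatize(w, pos="v") for w in words]  # every word treated as a verb
print(as_nouns)
print(as_verbs)
```

Treated as nouns, only "women" changes. Treated as verbs, "are" and "amazing" are reduced but "women" is left alone. In practice this is where the tagger from section 3 comes in handy: tag the words first, then pass each word's part of speech to the lemmatizer.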