Artificial Intelligence

Coding the NLP Pipeline in Python

August 6, 2020

Introduction Coding the NLP Pipeline in Python

So how do we code this NLP pipeline in Python? Thanks to Python libraries like NLTK (Natural Language Toolkit), most of the work is already done. spaCy is another good library for the same job, but this article uses NLTK. The steps are all coded and ready for you to use.

Install NLTK
In a notebook:
!pip install nltk

After that, download the data (corpora and trained models) that NLTK depends on:
import nltk
nltk.download("all")

So, let's go through the pipeline step by step.

Sentence Segmentation

Break the text into individual sentences.
import nltk    # import the library
text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East."    # any text or string
sentences = nltk.sent_tokenize(text)    # split the text into sentences using sent_tokenize
# print each sentence followed by a blank line
for sentence in sentences:
    print(sentence)
    print()

Output:
Backgammon is one of the oldest known board games.

Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East.
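
To get a feel for what sentence segmentation involves, here is a minimal sketch in plain Python. It is not how nltk.sent_tokenize works internally (NLTK uses a trained Punkt model); the function name naive_sentence_split and its splitting rule are assumptions made up for this illustration.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]

text = ("Backgammon is one of the oldest known board games. "
        "Its history can be traced back nearly 5,000 years to "
        "archeological discoveries in the Middle East.")

for sentence in naive_sentence_split(text):
    print(sentence)

# The naive rule breaks on abbreviations: "Dr." is treated as a sentence end
print(naive_sentence_split("Dr. Smith plays backgammon. He wins often."))
```

The last line shows why a trained tokenizer is worth using: a simple punctuation rule cannot tell an abbreviation from the end of a sentence.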

Word Tokenization

Word tokenization, also called word segmentation, means separating a sentence into its individual words: a string of written language is divided into word and punctuation tokens. Use the nltk.word_tokenize function.

for sentence in sentences:
    words = nltk.word_tokenize(sentence)    # list of tokens for this sentence
    print(words)
    print()
Output:
['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']

['Its', 'history', 'can', 'be', 'traced', 'back', 'nearly', '5,000', 'years', 'to', 'archeological', 'discoveries', 'in', 'the', 'Middle', 'East', '.']
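
The idea behind tokenization can be sketched with a regular expression. This is only an illustration under simplified assumptions, not the rule set nltk.word_tokenize uses; TOKEN_PATTERN and simple_word_tokenize are names invented for this example.

```python
import re

# A token is one of:
#   - a number, optionally with '.' or ',' separators inside (e.g. 5,000)
#   - a run of word characters, optionally with one internal apostrophe
#   - a single punctuation character
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:'\w+)?|[^\w\s]")

def simple_word_tokenize(sentence):
    """Return the list of tokens found in the sentence, in order."""
    return TOKEN_PATTERN.findall(sentence)

print(simple_word_tokenize("Backgammon is one of the oldest known board games."))
print(simple_word_tokenize("Its history can be traced back nearly 5,000 years."))
```

Note that the number 5,000 stays in one piece and the final period becomes its own token, which matches the NLTK output shown above for these two sentences. Other inputs, contractions in particular, may be split differently by NLTK.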

Text Lemmatization and Stemming

This is the next step of the NLP pipeline in Python. For grammatical reasons, a text can contain different forms of the same word, such as drive, driving, driven, and drives. Lemmatization and stemming convert these forms to a common base form. The main aim of both is to reduce inflectional forms.

Example: bat, bats, bat's => bat

More examples: the word "better" has "good" as its lemma, which only lemmatization can find. The word "play" is the base form of "playing", and this is matched by both stemming and lemmatization.

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
word = "multiplying"
print(lemmatizer.lemmatize(word, "v"))    # "v" tells the lemmatizer the word is a verb
Output:
multiply
print(stemmer.stem(word))
Output:
multipli
Another example, using a function:
from nltk.stem import PorterStemmer,WordNetLemmatizer
from nltk.corpus import wordnet

def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
	print("stemmer:", stemmer.stem(word))
	print("Lemmatizer:", lemmatizer.lemmatize(word,pos))
	print()
    
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

compare_stemmer_and_lemmatizer(stemmer,lemmatizer, word = "seen", pos=wordnet.VERB)
Output:
stemmer: seen
Lemmatizer: see
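
The difference between the two approaches can be sketched without NLTK. The code below is a toy illustration, not the Porter algorithm and not WordNet: toy_stem strips a few suffixes by rule, while toy_lemmatize looks words up in a small hand-written table (LEMMA_TABLE). All three names and the table contents are assumptions for this example.

```python
# Hypothetical, hand-written lemma table; in NLTK, WordNet plays this role.
LEMMA_TABLE = {
    "drives": "drive", "driving": "drive", "driven": "drive",
    "better": "good", "seen": "see", "playing": "play",
}

def toy_stem(word):
    """Rule-based: chop off the first matching suffix, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Dictionary-based: return the known base form, or the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for word in ["playing", "driving", "driven", "better", "seen"]:
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

Rules alone turn "driving" into "driv" and leave "better" and "seen" untouched, while the lookup returns real words. That is the trade-off in miniature: a stemmer needs no dictionary but can produce non-words, and a lemmatizer is only as good as its vocabulary.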

Stop Words

Removing stop words is a filtering process. A text contains a lot of noise: very common words such as "the", "is", and "of" that carry little meaning on their own. We want to remove this irrelevant noise. NLTK ships a predefined list of stop words in its nltk.corpus package.

nltk.download("stopwords")
Once the download has finished, import the stop words directly from nltk.corpus:
from nltk.corpus import stopwords
print(stopwords.words("english"))
Output (the beginning of the list is cut off here):

"you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Remove the stop words:
stop_words = set(stopwords.words("english"))
sentence = 'Backgammon is one of the oldest known board games.'
words = nltk.word_tokenize(sentence)
print(words)
Output:
['Backgammon', 'is', 'one', 'of', 'the', 'oldest', 'known', 'board', 'games', '.']
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)
Output:
['Backgammon', 'one', 'oldest', 'known', 'board', 'games', '.']
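
One detail worth knowing: the NLTK stop word list is all lowercase, so a capitalized token such as "The" is not filtered by a plain membership test. The sketch below shows a case-insensitive filter. It uses a small hand-picked STOP_WORDS set so that it runs without NLTK data; that set and the remove_stop_words helper are assumptions for illustration only.

```python
# Small hand-picked set for illustration; with NLTK you would use
# set(stopwords.words("english")) instead.
STOP_WORDS = {"is", "of", "the", "a", "an", "and", "in", "to", "it"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["The", "history", "of", "the", "game", "is", "long", "."]

# Case-sensitive test: "The" slips through
print([token for token in tokens if token not in STOP_WORDS])

# Case-insensitive test: "The" is removed as well
print(remove_stop_words(tokens))
```

Lowercasing only for the comparison keeps the original spelling of the tokens that survive the filter.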

Bag of Words

The bag-of-words model is a simple feature extraction technique used when we work with text. It is based on a vocabulary of known words and a measure of the presence of those words in each document.

# read the sample text file, one document per line
with open("text_Sample", "r") as file:
    documents = file.read().splitlines()
print(documents)
Output:
["I like this movie, it's horrer.", "I don't like this movie.", 'This was awesome! I like it.', 'Nice one. I love it.']
# import the libraries needed to convert the text into numbers
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# design the vocabulary
count_vectorizer = CountVectorizer()
# create a bag-of-words
bag_of_words = count_vectorizer.fit_transform(documents)

# bag-of-words model as a pandas DataFrame
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(bag_of_words.toarray(), columns=feature_names)
Output: a table with one row per document and one column per vocabulary word; each cell holds the number of times that word appears in that document.
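
To make the counting concrete, here is a minimal plain-Python sketch of the same idea using the four sample documents. It imitates what I understand to be CountVectorizer's default behavior (lowercasing, and keeping only tokens of two or more word characters), but it is an illustration, not the library's implementation; the helper name tokenize is an assumption.

```python
import re
from collections import Counter

documents = [
    "I like this movie, it's horrer.",
    "I don't like this movie.",
    "This was awesome! I like it.",
    "Nice one. I love it.",
]

def tokenize(document):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", document.lower())

# Count the words of each document separately
counts = [Counter(tokenize(document)) for document in documents]

# The vocabulary is every distinct word across all documents, sorted
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word
matrix = [[count[word] for word in vocabulary] for count in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Single-character tokens such as "I" and the "s" in "it's" are dropped by the two-character rule, which is why they do not appear as columns. Word order is lost as well: each document is reduced to a row of counts.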
