Introduction to Email Spam Filtering with NLTK
Email spam filtering with NLTK means classifying an email as spam or ham (not spam) based on its text, which shows how much everyday tasks depend on text. Spam emails disrupt the daily routine, which is why most email accounts already come with a spam filter. A filter of this kind may be based on the subject line.
In this tutorial, we build a simple spam filter for emails.
Using the N-Gram Model
For email spam filtering with NLTK, and for text classification in general, N-grams are used for language modeling based on word prediction: the next word is predicted from the previous N-1 words. A bigram is the two-word case of an N-gram, so a bigram model predicts the next word of a sentence from the single previous word. Instead of considering the whole history of a sentence or word sequence, a model such as a bigram model approximates that history with a limited one.
Identifying a message as ‘ham’ or ‘spam’ is a classification task, since the target variable takes the discrete values ‘ham’ and ‘spam’. In this article, a bigram model is used to assign a given message to ‘spam’ or ‘ham’, although many more advanced techniques can be used for this purpose.
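To make the idea of a bigram concrete, here is a minimal sketch that pairs each word with the word that follows it. The helper name `make_bigrams` is hypothetical and not part of NLTK; it only uses the standard library and produces the same kind of pairs that `nltk.bigrams` yields.

```python
def make_bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

tokens = "six chances to win cash".split()
bigrams = make_bigrams(tokens)
print(bigrams)
# [('six', 'chances'), ('chances', 'to'), ('to', 'win'), ('win', 'cash')]
```

A sequence of n words always gives n-1 bigrams, which is the limited history the model works with.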
Let’s start the code:
Import the required libraries

from functools import reduce
import nltk
from nltk.stem import WordNetLemmatizer
import pandas as pd
import string
import re

Read the data

full_corpus = pd.read_csv('email_file.tsv', sep='\t', header=None, names=['label', 'msg_body'])

Note: a TSV (tab-separated) file is used here.

Create two empty lists

# Separating messages into ham and spam
ham_text = []
spam_text = []

These two lists store the ham and spam emails.

Define a function to separate the messages

def separate_msgs():
    for index, column in full_corpus.iterrows():
        # each row holds the label and the message body
        label = column['label']
        message_text = column['msg_body']
        if label == 'ham':
            ham_text.append(message_text)
        elif label == 'spam':
            spam_text.append(message_text)

separate_msgs()

Preprocessing of text

In this step, the text is cleaned: punctuation and special symbols are removed, and the text is tokenized and lemmatized.

# removing punctuation marks from the email messages
def remove_msg_punctuations(email_msg):
    punctuation_removed_msg = "".join([word for word in email_msg if word not in string.punctuation])
    return punctuation_removed_msg
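As an alternative sketch of the punctuation-removal step, the same result can be obtained with a translation table instead of a character-by-character join. The name `strip_punctuation` is hypothetical and introduced only for this illustration; like the function above, it removes ASCII punctuation only.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)

cleaned = strip_punctuation("Sorry, ..use your brain dear!")
print(cleaned)  # Sorry use your brain dear
```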
Converting Text into Lowercase and Word Tokenizing
Converting all characters in the text to a common case, such as lowercase, prevents two words from being treated as different merely because one is in lowercase and the other is not. For instance, ‘First’ and ‘first’ should be identified as the same word, so lowercasing all characters makes the task easier. Moreover, the stop words are also in lowercase, so lowercasing makes removing stop words later feasible as well.
def tokenize_into_words(text):
    tokens = re.split(r'\W+', text)
    return tokens

# lemmatizing
word_lemmatizer = WordNetLemmatizer()

def lemmatization(tokenized_words):
    lemmatized_text = [word_lemmatizer.lemmatize(word) for word in tokenized_words]
    return ' '.join(lemmatized_text)

def preprocessing_msgs(corpus):
    # a list of messages becomes a one-column DataFrame whose column is named 0
    categorized_text = pd.DataFrame(corpus)
    categorized_text['non_punc_message_body'] = categorized_text[0].apply(lambda msg: remove_msg_punctuations(msg))
    categorized_text['tokenized_msg_body'] = categorized_text['non_punc_message_body'].apply(lambda msg: tokenize_into_words(msg.lower()))
    categorized_text['lemmatized_msg_words'] = categorized_text['tokenized_msg_body'].apply(lambda word_list: lemmatization(word_list))
    return categorized_text['lemmatized_msg_words']
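The effect of lowercasing before tokenizing can be seen in a small standalone sketch. The helper `tokenize` is hypothetical; unlike the tokenizer above, it also drops the empty strings that `re.split` leaves behind when the text starts or ends with a non-word character, which is an addition assumed here for tidiness.

```python
import re

def tokenize(text):
    # Lowercase first so 'First' and 'first' map to the same token,
    # then split on runs of non-word characters and drop empty strings.
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

tokens = tokenize("First things first!")
print(tokens)  # ['first', 'things', 'first']
```

Both occurrences of the word now share one form, so they are counted together later on.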
Extracting Features (N-Grams)
After the preprocessing stage, features are extracted from the text. Features are the units that support the classification task, and in this message classification task the features are bigrams, extracted from the preprocessed text. First the unigrams are collected, and then those unigrams are used to obtain the bigrams in each corpus (‘ham’ and ‘spam’).
def feature_extraction(preprocessed_text):
    bigrams = []
    unigrams_lists = []
    for msg in preprocessed_text:
        # adding end of and start of a message
        msg = '<s> ' + msg + ' </s>'
        unigrams_lists.append(msg.split())
    # flatten the per-message token lists into a single list of unigrams
    unigrams = [uni_list for sub_list in unigrams_lists for uni_list in sub_list]
    bigrams.extend(nltk.bigrams(unigrams))
    return bigrams

Removing bigrams that consist only of stop words

stopwords = nltk.corpus.stopwords.words('english')

def filter_stopwords_bigrams(bigram_list):
    filtered_bigrams = []
    for bigram in bigram_list:
        # skip a bigram only when both of its words are stop words
        if bigram[0] in stopwords and bigram[1] in stopwords:
            continue
        filtered_bigrams.append(bigram)
    return filtered_bigrams
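The two steps above, padding a message with start and end markers and then discarding bigrams made up only of stop words, can be tried on a single message without NLTK. The stop word set and the helper names `padded_bigrams` and `drop_stopword_only` are assumptions made for this sketch; the article itself uses NLTK's English stop word list.

```python
# Tiny hand-picked stop word set, for illustration only.
STOPWORDS = {"to", "the", "a", "of"}

def padded_bigrams(message):
    """Add <s> and </s> markers, then pair adjacent tokens."""
    tokens = ["<s>"] + message.split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))

def drop_stopword_only(bigrams):
    """Discard a bigram only when both of its words are stop words."""
    return [bg for bg in bigrams
            if not (bg[0] in STOPWORDS and bg[1] in STOPWORDS)]

bigrams = padded_bigrams("chance to the win")
kept = drop_stopword_only(bigrams)
print(kept)
# [('<s>', 'chance'), ('chance', 'to'), ('the', 'win'), ('win', '</s>')]
```

Only ('to', 'the') is removed; bigrams with a single stop word are kept because the other word still carries information.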
Acquiring frequencies of features

The frequency distribution is used to obtain the frequency of occurrence of each vocabulary item in a given text.

def ham_bigram_feature_frequency():
    # features frequency for ham messages
    ham_bigrams = feature_extraction(preprocessing_msgs(ham_text))
    ham_bigram_frequency = nltk.FreqDist(filter_stopwords_bigrams(ham_bigrams))
    return ham_bigram_frequency

def spam_bigram_feature_frequency():
    # features frequency for spam messages
    spam_bigrams = feature_extraction(preprocessing_msgs(spam_text))
    spam_bigram_frequency = nltk.FreqDist(filter_stopwords_bigrams(spam_bigrams))
    return spam_bigram_frequency
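To see what a frequency distribution over bigrams looks like, here is a small sketch that uses `collections.Counter` from the standard library. `nltk.FreqDist` is a subclass of `Counter`, so lookups behave the same way; the bigram list below is made up for the example.

```python
from collections import Counter

# Made-up bigrams standing in for the output of feature extraction.
bigrams = [("to", "win"), ("win", "cash"), ("to", "win")]

frequency = Counter(bigrams)
print(frequency[("to", "win")])     # 2
print(frequency[("free", "cash")])  # 0, unseen bigrams count as zero
```

The zero count for unseen bigrams is what makes smoothing necessary when the counts are turned into probabilities.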
Building the Model
The model for classifying a given message as ‘ham’ or ‘spam’ works by calculating bigram probabilities within each corpus. The bigrams are extracted from the preprocessed message, and then the probability of the message belonging to each corpus, ‘ham’ or ‘spam’, is calculated.
Calculating bigram probabilities

def bigram_probability(message):
    probability_h = 1
    probability_s = 1

    # preprocessing input messages
    punc_removed_message = "".join(word for word in message if word not in string.punctuation)
    punc_removed_message = '<s> ' + punc_removed_message + ' </s>'
    tokenized_msg = re.split(r'\s+', punc_removed_message)
    lemmatized_msg = [word_lemmatizer.lemmatize(word) for word in tokenized_msg]

    # bigrams for message
    bigrams_for_msg = list(nltk.bigrams(lemmatized_msg))

    # bigrams of each corpus, used as the vocabulary
    ham_unigrams = [word for word in feature_extraction(preprocessing_msgs(ham_text)) if word not in stopwords]
    spam_unigrams = [word for word in feature_extraction(preprocessing_msgs(spam_text)) if word not in stopwords]

    # frequencies of bigrams extracted
    ham_frequency = ham_bigram_feature_frequency()
    spam_frequency = spam_bigram_feature_frequency()

    print('========================== Calculating Probabilities ==========================')
    print('----------- Ham Freuquencies ------------')
    for bigram in bigrams_for_msg:
        # denominator: built up from the corpus bigrams
        ham_probability_denominator = 0
        # count of the bigram (smoothed by adding 1)
        ham_probability_of_bigram = ham_frequency[bigram] + 1
        print(bigram, ' occurs ', ham_probability_of_bigram)
        for (first_unigram, second_unigram) in filter_stopwords_bigrams(ham_unigrams):
            ham_probability_denominator += 1
            # compare against the first word of the message bigram
            if first_unigram == bigram[0]:
                ham_probability_denominator += ham_frequency[first_unigram, second_unigram]
        probability = ham_probability_of_bigram / ham_probability_denominator
        probability_h *= probability
    print('\n')

    print('----------- Spam Freuquencies ------------')
    for bigram in bigrams_for_msg:
        # denominator: built up from the corpus bigrams
        spam_probability_denominator = 0
        # count of the bigram (smoothed by adding 1)
        spam_probability_of_bigram = spam_frequency[bigram] + 1
        print(bigram, ' occurs ', spam_probability_of_bigram)
        for (first_unigram, second_unigram) in filter_stopwords_bigrams(spam_unigrams):
            spam_probability_denominator += 1
            # compare against the first word of the message bigram
            if first_unigram == bigram[0]:
                spam_probability_denominator += spam_frequency[first_unigram, second_unigram]
        probability = spam_probability_of_bigram / spam_probability_denominator
        probability_s *= probability
    print('\n')

    print('Ham Probability: ' + str(probability_h))
    print('Spam Probability: ' + str(probability_s))
    print('\n')
    if probability_h >= probability_s:
        print('\"' + message + '\" is a Ham message')
    else:
        print('\"' + message + '\" is a Spam message')
    print('\n')

bigram_probability('Sorry, ..use your brain dear')
bigram_probability('SIX chances to win CASH.')
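Because the message probability is a product of many small numbers, it can become extremely small for long messages. A common way around this, sketched here as an optional variation and not as part of the code above, is to add logarithms instead of multiplying probabilities; the comparison between ‘ham’ and ‘spam’ comes out the same. The helper `log_score` and the probability values are hypothetical.

```python
import math

def log_score(bigram_probabilities):
    """Sum of logs, which equals the log of the product of the probabilities."""
    return sum(math.log(p) for p in bigram_probabilities)

# Hypothetical per-bigram probabilities, not taken from the article's data.
ham_probs = [0.001, 0.002, 0.0005]
spam_probs = [0.01, 0.02, 0.005]

ham_score = log_score(ham_probs)
spam_score = log_score(spam_probs)

# Same decision rule as above: ham wins ties.
label = "ham" if ham_score >= spam_score else "spam"
print(label)  # spam
```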
Smoothing algorithms are used to mitigate the zero-probability problem in language modeling applications. Here, Laplace (add-one) smoothing is used, which overcomes the zero-probability problem by pretending that every unseen bigram has been seen once before.
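The following sketch shows add-one smoothing in its usual textbook form, where one is added to the bigram count and the vocabulary size is added to the count of the first word. It is an illustration under that assumption and does not reproduce the exact denominator computed in the code above; the function name `laplace_bigram_probability` and the toy corpus are hypothetical.

```python
from collections import Counter

def laplace_bigram_probability(bigram, bigram_counts, first_word_counts, vocabulary_size):
    """Add-one estimate: (count(bigram) + 1) / (count(first word) + V)."""
    return (bigram_counts[bigram] + 1) / (first_word_counts[bigram[0]] + vocabulary_size)

# Toy corpus of bigrams, made up for the example.
corpus_bigrams = [("to", "win"), ("win", "cash"), ("to", "win"), ("to", "go")]
bigram_counts = Counter(corpus_bigrams)
first_word_counts = Counter(first for first, _ in corpus_bigrams)
vocabulary = {word for bigram in corpus_bigrams for word in bigram}  # 4 distinct words

seen = laplace_bigram_probability(("to", "win"), bigram_counts, first_word_counts, len(vocabulary))
unseen = laplace_bigram_probability(("cash", "prize"), bigram_counts, first_word_counts, len(vocabulary))
print(seen)    # (2 + 1) / (3 + 4) = 3/7
print(unseen)  # (0 + 1) / (0 + 4) = 0.25
```

The unseen bigram receives a small but non-zero probability, so a single unknown word pair no longer forces the whole message probability to zero.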
In Laplace smoothing, the bigram probability estimate is modified by adding one to each bigram count and enlarging the denominator accordingly, which also avoids a division-by-zero error.
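For reference, the standard textbook forms of the two estimates are given here. They are stated as an assumption about what the text describes, with C(·) denoting a count in the corpus and V the vocabulary size; the code in this article builds its denominator somewhat differently.

```latex
% Unsmoothed (maximum likelihood) bigram estimate
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}

% Laplace (add-one) smoothed bigram estimate
P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n) + 1}{C(w_{n-1}) + V}
```

In the unsmoothed form the denominator is zero whenever the first word never occurs in the corpus; adding V keeps it positive.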
- Whether a message is ‘ham’ or ‘spam’ depends only on the text of the message.
Output:

=======Calculating Probabilities =========
----------- Ham Freuquencies ------------
('<s>', 'SIX') occurs 1
('SIX', 'chance') occurs 1
('chance', 'to') occurs 3
('to', 'win') occurs 3
('win', 'CASH') occurs 1
('CASH', '</s>') occurs 1
----------- Spam Freuquencies ------------
('<s>', 'SIX') occurs 1
('SIX', 'chance') occurs 1
('chance', 'to') occurs 17
('to', 'win') occurs 18
('win', 'CASH') occurs 1
('CASH', '</s>') occurs 1
Ham Probability: 1.415066409862033e-29
Spam Probability: 1.1060464178520215e-23
"SIX chances to win CASH." is a Spam message