Email Spam Filtering using NLTK


Introduction to Email Spam Filtering with NLTK

Email spam filtering with NLTK means classifying an email as spam or ham (legitimate mail) based on its text. It is a good illustration of how much text processing matters in daily life. Spam emails disrupt the daily routine, which is why email accounts generally come with a spam filter already. Such a filter may rely on the subject line, among other signals.

In this tutorial, we build a simple spam filter for emails.

Using N-Gram model

For email spam filtering with NLTK, and for text classification in general, N-grams are used for language modeling based on word prediction: the model predicts the next word from the previous N-1 words. A bigram is the two-word case (N = 2), which predicts the next word of a sentence from the single preceding word. Instead of considering the whole history of a sentence or word sequence, a model such as the bigram model approximates that history with a limited one.
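As a rough illustration of what an N-gram is, the sketch below slides a window of size n over a list of tokens. The helper name make_ngrams and the sample sentence are made up for this example; for n = 2 it produces the same pairs that nltk.bigrams yields.

```python
# Sketch: build n-grams from a token list with plain Python.
def make_ngrams(tokens, n):
    """Return the n-grams (as tuples) found in tokens, left to right."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "six chances to win cash".split()
bigrams = make_ngrams(tokens, 2)
print(bigrams)
# [('six', 'chances'), ('chances', 'to'), ('to', 'win'), ('win', 'cash')]
```

Each bigram pairs a word with the word that follows it, which is the limited history the bigram model relies on.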

Identifying a message as ‘ham’ or ‘spam’ is a classification task, since the target variable takes discrete values: ‘ham’ or ‘spam’. This article uses the bigram model, although many more advanced techniques could be used for the purpose. To use a bigram model to label a given message as ‘spam’ or ‘ham’, several steps are needed: separating the messages, preprocessing the text, extracting the features, and acquiring the feature frequencies.

Let’s start with the code.

Import the required libraries

from functools import reduce
import nltk
from nltk.stem import WordNetLemmatizer
import pandas as pd
import string
import re
Read the data
full_corpus = pd.read_csv('email_file.tsv', sep='\t', header=None, names=['label', 'msg_body'])
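Note: the data file used here is a tab-separated (TSV) file without a header row. The first column holds the label and the second the message body.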
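The layout that the read_csv call expects can be sketched with two made-up rows. The sample rows below are hypothetical (the real email_file.tsv is not shown in this tutorial), and the standard-library csv module stands in for pandas only to keep the sketch dependency-free.

```python
import csv
import io

# Hypothetical sample rows: label, a tab, then the message body.
sample_tsv = "ham\tSee you at lunch\nspam\tWin a free prize now\n"

rows = list(csv.reader(io.StringIO(sample_tsv), delimiter="\t"))
labels = [row[0] for row in rows]
messages = [row[1] for row in rows]
print(labels)    # ['ham', 'spam']
print(messages)  # ['See you at lunch', 'Win a free prize now']
```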
Create two empty lists
# Separating messages into ham and spam
ham_text = [ ]
spam_text = [ ]
These two lists will hold the ham and spam messages.

Define a function to separate the messages
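def separate_msgs():
    # route each message into the ham or spam list according to its label
    for index, column in full_corpus.iterrows():
        label = column['label']
        message_text = column['msg_body']
        if label == 'ham':
            ham_text.append(message_text)
        elif label == 'spam':
            spam_text.append(message_text)

separate_msgs()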

Preprocessing the text
In this step, clean the text: remove punctuation and special symbols, convert the text to lowercase, tokenize it, and lemmatize the words.
# removing punctuation marks from the email messages
def remove_msg_punctuations(email_msg):
    # email_msg is iterated character by character
    punctuation_removed_msg = "".join([char for char in email_msg if char not in string.punctuation])
    return punctuation_removed_msg

Converting Text into Lowercase and Word Tokenizing

Converting all characters to a common case, such as lowercase, prevents two forms of the same word from being treated as different words. For instance, ‘First’ and ‘first’ should be identified as the same word, so lowercasing all the characters makes the task easier. Moreover, the stop words are also in lowercase, which makes removing stop words later straightforward.
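A small sketch of the effect, using a made-up sentence and a hypothetical helper named normalize. Unlike the tutorial's tokenizer, it also drops the empty strings that re.split can leave at the edges.

```python
import re

def normalize(text):
    """Lowercase the text and split it on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

print(normalize("First prize for the FIRST caller"))
# ['first', 'prize', 'for', 'the', 'first', 'caller']
```

Both ‘First’ and ‘FIRST’ end up as the same token, so they are counted together later on.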

def tokenize_into_words(text):
    # split on runs of non-word characters
    tokens = re.split(r'\W+', text)
    return tokens

# the lemmatizer needs the WordNet data: nltk.download('wordnet')
word_lemmatizer = WordNetLemmatizer()
def lemmatization(tokenized_words):
    lemmatized_text = [word_lemmatizer.lemmatize(word) for word in tokenized_words]
    return ' '.join(lemmatized_text)

def preprocessing_msgs(corpus):
    categorized_text = pd.DataFrame(corpus)
    categorized_text['non_punc_message_body'] = categorized_text[0].apply(lambda msg: remove_msg_punctuations(msg))
    categorized_text['tokenized_msg_body'] = categorized_text['non_punc_message_body'].apply(lambda msg: tokenize_into_words(msg.lower()))
    categorized_text['lemmatized_msg_words'] = categorized_text['tokenized_msg_body'].apply(lambda word_list: lemmatization(word_list))
    return categorized_text['lemmatized_msg_words']

Extracting features, i.e. n-grams

After the preprocessing stage, the features are extracted from the text. Features are the units that support the classification task, and in this message classification task the features are bigrams, extracted from the preprocessed text. First the unigrams are acquired, and then those unigrams are used to obtain the bigrams in each corpus (‘ham’ and ‘spam’).

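def feature_extraction(preprocessed_text):
    bigrams = []
    unigrams_lists = []
    for msg in preprocessed_text:
        # adding end of and start of a message
        msg = '<s> ' + msg + ' </s>'
        unigrams_lists.append(msg.split())
    # flatten the per-message token lists into one list of unigrams
    unigrams = [uni_list for sub_list in unigrams_lists for uni_list in sub_list]
    bigrams.extend(nltk.bigrams(unigrams))
    return bigrams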

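Removing bigrams made up only of stop words
# the stop word list needs to be downloaded once: nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
def filter_stopwords_bigrams(bigram_list):
    filtered_bigrams = []
    for bigram in bigram_list:
        # skip bigrams in which both words are stop words
        if bigram[0] in stopwords and bigram[1] in stopwords:
            continue
        filtered_bigrams.append(bigram)
    return filtered_bigrams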
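To make the filtering rule concrete, here is a minimal sketch with a tiny, made-up stop word set and a hypothetical function name. The tutorial itself uses NLTK's English stop word list.

```python
# Tiny illustrative stop word set, not NLTK's list.
STOP_WORDS = {"to", "the", "of", "a", "in"}

def drop_stopword_only_bigrams(bigram_list):
    """Keep a bigram unless both of its words are stop words."""
    return [bg for bg in bigram_list
            if not (bg[0] in STOP_WORDS and bg[1] in STOP_WORDS)]

sample = [("chance", "to"), ("to", "the"), ("to", "win"), ("of", "a")]
print(drop_stopword_only_bigrams(sample))
# [('chance', 'to'), ('to', 'win')]
```

A bigram with only one stop word, such as ('to', 'win'), is kept because the other word still carries information.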
Acquiring frequencies of features
A frequency distribution is used to obtain the number of occurrences of each vocabulary item in a given text.
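The idea can be sketched with collections.Counter from the standard library, which nltk.FreqDist builds on. The bigrams below are made up for the example.

```python
from collections import Counter

bigrams = [("to", "win"), ("win", "cash"), ("to", "win")]
frequency = Counter(bigrams)

print(frequency[("to", "win")])     # 2
print(frequency[("free", "gift")])  # 0, unseen bigrams count as zero
```

The zero count for unseen bigrams is what makes smoothing necessary when these counts are turned into probabilities.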
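def ham_bigram_feature_frequency():
    # features frequency for ham messages
    ham_bigrams = feature_extraction(preprocessing_msgs(ham_text))
    ham_bigram_frequency = nltk.FreqDist(filter_stopwords_bigrams(ham_bigrams))
    return ham_bigram_frequency

def spam_bigram_feature_frequency():
    # features frequency for spam messages
    spam_bigrams = feature_extraction(preprocessing_msgs(spam_text))
    spam_bigram_frequency = nltk.FreqDist(filter_stopwords_bigrams(spam_bigrams))
    return spam_bigram_frequency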


Building the Model

The model classifies a given message as ‘ham’ or ‘spam’ by calculating bigram probabilities within each corpus. The bigrams are extracted from the preprocessed text of the message, and the probability of the message under each corpus (‘ham’ or ‘spam’) is then calculated.
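The approach can be sketched end to end on two toy corpora. Everything in this sketch is an assumption made for illustration: the function names, the toy messages, and the textbook add-one formula (count + 1) / (first-word count + vocabulary size), which is not necessarily the exact arithmetic of the tutorial's own function.

```python
from collections import Counter

def bigrams_of(tokens):
    return list(zip(tokens, tokens[1:]))

def train(messages):
    """Count bigrams and first-word occurrences over whitespace-split messages."""
    bigram_counts, first_counts, vocab = Counter(), Counter(), set()
    for msg in messages:
        tokens = ["<s>"] + msg.lower().split() + ["</s>"]
        vocab.update(tokens)
        for first, second in bigrams_of(tokens):
            bigram_counts[(first, second)] += 1
            first_counts[first] += 1
    return bigram_counts, first_counts, len(vocab)

def message_probability(message, model):
    """Product of add-one smoothed bigram probabilities for the message."""
    bigram_counts, first_counts, vocab_size = model
    tokens = ["<s>"] + message.lower().split() + ["</s>"]
    probability = 1.0
    for first, second in bigrams_of(tokens):
        numerator = bigram_counts[(first, second)] + 1
        denominator = first_counts[first] + vocab_size
        probability *= numerator / denominator
    return probability

# Hypothetical toy corpora, for illustration only.
ham_model = train(["see you at lunch", "call me when you can"])
spam_model = train(["win cash now", "six chances to win cash"])

def classify(message):
    p_ham = message_probability(message, ham_model)
    p_spam = message_probability(message, spam_model)
    return "ham" if p_ham >= p_spam else "spam"

print(classify("chances to win cash"))  # spam
print(classify("see you at lunch"))     # ham
```

The message is assigned to whichever corpus gives it the higher probability, with ties going to ham.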

Calculating bigram probabilities
def bigram_probability(message):
    probability_h = 1
    probability_s = 1
    # preprocessing input messages
    punc_removed_message = "".join(word for word in message if word not in string.punctuation)
    punc_removed_message = '<s> ' + punc_removed_message + ' </s>'
    tokenized_msg = re.split(r'\s+', punc_removed_message)
    lemmatized_msg = [word_lemmatizer.lemmatize(word) for word in tokenized_msg]
    # bigrams for message
    bigrams_for_msg = list(nltk.bigrams(lemmatized_msg))
    # corpus features used as the vocabulary; feature_extraction returns bigram
    # tuples, so the stop word test on whole tuples keeps every item
    ham_unigrams = [word for word in feature_extraction(preprocessing_msgs(ham_text)) if word not in stopwords]
    spam_unigrams = [word for word in feature_extraction(preprocessing_msgs(spam_text)) if word not in stopwords]
    # frequencies of bigrams extracted
    ham_frequency = ham_bigram_feature_frequency()
    spam_frequency = spam_bigram_feature_frequency()
    print('========================== Calculating Probabilities ==========================')
    print('----------- Ham Freuquencies ------------')
    for bigram in bigrams_for_msg:
        # denominator built from the bigrams that share the first word
        ham_probability_denominator = 0
        # add-one smoothed count of the bigram
        ham_probability_of_bigram = ham_frequency[bigram] + 1
        print(bigram, ' occurs ', ham_probability_of_bigram)
        for (first_unigram, second_unigram) in filter_stopwords_bigrams(ham_unigrams):
            ham_probability_denominator += 1
            if first_unigram == bigram[0]:
                ham_probability_denominator += ham_frequency[first_unigram, second_unigram]
        probability = ham_probability_of_bigram / ham_probability_denominator
        probability_h *= probability
    print('----------- Spam Freuquencies ------------')
    for bigram in bigrams_for_msg:
        # denominator built from the bigrams that share the first word
        spam_probability_denominator = 0
        # add-one smoothed count of the bigram
        spam_probability_of_bigram = spam_frequency[bigram] + 1
        print(bigram, ' occurs ', spam_probability_of_bigram)
        for (first_unigram, second_unigram) in filter_stopwords_bigrams(spam_unigrams):
            spam_probability_denominator += 1
            if first_unigram == bigram[0]:
                spam_probability_denominator += spam_frequency[first_unigram, second_unigram]
        probability = spam_probability_of_bigram / spam_probability_denominator
        probability_s *= probability
    print('Ham Probability: ' + str(probability_h))
    print('Spam Probability: ' + str(probability_s))
    # the message goes to whichever corpus gives it the higher probability
    if probability_h >= probability_s:
        print('\"' + message + '\" is a Ham message')
    else:
        print('\"' + message + '\" is a Spam message')

bigram_probability('Sorry,  ..use your brain dear')
bigram_probability('SIX chances to win CASH.')


Smoothing algorithms are used to mitigate the zero-probability problem in language modeling applications. Here, Laplace (add-one) smoothing is used, which overcomes the zero-probability problem by pretending that every unseen bigram has been seen once before.

In Laplace smoothing, the bigram probability equation is modified by adding one to every bigram count, which also avoids division-by-zero errors.
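For reference, the textbook forms of the unsmoothed and the Laplace-smoothed bigram estimates are given below. They are the standard formulation rather than a transcription of this tutorial's own equations; C(·) denotes a count in the corpus and V the vocabulary size.

```latex
% Unsmoothed (maximum-likelihood) bigram estimate
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}

% Laplace (add-one) smoothed estimate
P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}\, w_n) + 1}{C(w_{n-1}) + V}
```

With the added terms, the numerator is never zero for an unseen bigram and the denominator is never zero for an unseen first word.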


  • Assumption: whether a message is ‘ham’ or ‘spam’ depends only on the text of the message.

Output for the second test message:
=======Calculating Probabilities =========
----------- Ham Freuquencies ------------
('<s>', 'SIX')  occurs  1
('SIX', 'chance')  occurs  1
('chance', 'to')  occurs  3
('to', 'win')  occurs  3
('win', 'CASH')  occurs  1
('CASH', '</s>')  occurs  1

----------- Spam Freuquencies ------------
('<s>', 'SIX')  occurs  1
('SIX', 'chance')  occurs  1
('chance', 'to')  occurs  17
('to', 'win')  occurs  18
('win', 'CASH')  occurs  1
('CASH', '</s>')  occurs  1

Ham Probability: 1.415066409862033e-29
Spam Probability: 1.1060464178520215e-23

"SIX chances to win CASH." is a Spam message

