Spam Filtering using Bag of Words

Spam Filtering using Bag of Words

10 mins read1.4K Views Comment
Updated on Mar 13, 2023 17:12 IST

The purpose of this blog is to explain you the concept of spam filtering using bag of words.


We daily receive many emails, some of which are spam and useless to us. So, what if we can make a spam filtering system using text classification that classifies whether the email, we have received is Spam or not? That will save us a lot of time. Sounds cool? Let’s jump into spam filtering using a bag of words.

In this post we will discuss: –

How Natural Language Data is Processed?

Have you ever wondered how Google and YouTube understand what you are trying to search for? So, a very simple answer to this question is Natural Language Processing so now your question is what is Natural Language Processing or NLP?

It is the domain of Artificial Intelligence where we humans help Computers understand what we are trying to say and when I say Computers, the only prerequisite for Computers to Understand anything is, they need Numbers, just raw Numbers.

So now our first question will be how we can create numbers out of our natural language data.

To understand the problem let’s take a scenario. If you want to go and buy a brand-new mobile phone, then the first thing you will do is you’ll check reviews regarding mobile phones.

So, say you have read 3 reviews about a particular phone: –

  • R1 : Awesome camera a must buy
  • R2 : Good quality screen and camera
  • R3 : Overheating issues

Now in actual cases, you will have Hundreds or maybe Thousands of reviews about a single phone. Now it will be difficult for a Human to read it all and make a decision out of it whether you should buy it or not.

So now comes NATURAL LANGUAGE PROCESSING for your rescue but the only prerequisite is – For doing any processing of natural language on computers they need numbers.

What is Bag of Words?

So, the process of converting text to a group/vector of numbers is called Vectorisation.
So, our very first algorithm for this purpose of vectorisation is Bag of Words, and in this basically, we will count how many times a word appears in a sentence so that sentence can be converted to a vector.

Steps for doing Bag of Words: –

1. Calculate all the unique words present in the set of sentences you are having.

a. Set of sentences in natural language lingo is called corpus.
b. In our case we have 3 reviews and we have and a total of 11 unique words from them

camera Awesome a must buy Good quality screen and Overheating issues

2. Calculate how many times a word appears in a review.

  Awesome camera a must buy Good quality screen and Overheating issues
R1 1 1 1 1 1 0 0 0 0 0 0
R2 0 1 0 0 0 1 1 1 1 0 0
R3 0 0 0 0 0 0 0 0 0 1 1

So now reviews have been converted to an appropriate form so that a Computer can process it and tell you whether you should buy this mobile phone or not.
Now reviews are now looking like this: –

R1 = [ 1,1,1,1,1,0,0,0,0,0,0 ]

R2 = [ 0,1,0,0,0,1,1,1,1,0,0 ]

R3 = [ 0,0,0,0,0,0,0,0,0,1,1 ]

And this is the complete idea of Bag of words.

Implementation of Bag of Words (BoW)

Before starting with the implementation of BoW, let’s revise some definitions

1. Corpus = Set of all sentences.
2. Vectorisation = Process of converting natural language sentences to a vector.

Now let’s jump to the implementation part. We’ll use Keras Tokenizer for this purpose.


And you’ll get the following output: –


And this output is what we expected. Notice that the first column of every vector is ZERO. This detail is limited to Keras library so we can ignore it.

Problem Statement: Spam Filtering

Now to understand how Bag of Words works and how we can make a Spam filter using it, we will first cover some basic analogies, and after going through the complete problem statement, you will understand: –

  1. What is Tokenization
  2. What is Text Pre-processing
  3. How to apply BOW using the sklearn library

How our data is looking: –
In our training dataset, we have 2 columns: –
1. “Message_body” → Actual data on which we will train our Bag of words, model
2. “Label” → It has values “Spam” and “Not-spam.”

Test data has only 1 column: –
1. “Message_body” → Text on which we have to predict whether it is spam or not.

What is Tokenization?

To understand what Tokenization is. Let’s take an example: –

“James has a cat and a dog. “

Now this sentence is made up of words and these words are called Tokens the process of converting a text too small useful tokens is called Tokenization.

But your question can be Why it is even Important: –
If I say these 2 sentences: –

  1. “James having cat dog.”
  2. “James is having a cat and a dog “

Both of these sentences will frame nearly the same picture in your mind that “James is having a cat and a dog” so this is the power of tokenization. Tokens are building blocks of any text and if we input good tokens into Machine Learning Algorithms, we will get good results.

Tokenization can be done both at the word level and sentence level. We can either split a paragraph into small token sentences or we can split a sentence into small token words.

Example code: –

# importing libraries
import nltk
# importing whitespace tokenizer
from nltk.tokenize import WhitespaceTokenizer
#creating object of class
to = WhitespaceTokenizer()
text  = to.tokenize("He is playing cricket")
# output is ['He', 'is', 'playing', 'cricket']
Copy code

What is Text Pre-processing

Text pre-processing is the series of steps that we take to clean our data before feeding it into a machine learning algorithm. There are many techniques for text processing and for this problem we will apply a few of them to our text data.

1. Removal of Punctuations
2. Lowering the text
3. Removal of Stopwords
4. Stemming

1. Removal of Punctuations

Remove all the punctuation marks from a sentence. Punctuation marks are: –

#library that contains punctuation
import string
# output → !”#$%&'()*+,-./:;<=>?@[]^_`{|}
Copy code

2. Lowering the text 

Lower all the text before giving it to the Bag of Words model. The reason for this is because if we do not lower the text, we will have more words in our vocabulary differentiated just on the basis of the casing and hence unnecessarily longer input vectors to
BOW model.

  • If not lowering text: –
    ■ Example → “I have a big cat, small cat, dog, and a small Dog”
    ■ Our tokens will be: –
#importing nlp library
import nltk
from nltk.tokenize import WhitespaceTokenizer
tk = WhitespaceTokenizer()
text = tk.tokenize("I have a big Cat a small Cat a Dog and a small Dog")
# output is ['I', 'have', 'a', 'big', 'Cat', 'a', 'small', 'cat', 'a', 'dog', 'and', 'a', 'small', 'Dog']
Copy code

Now we have these (Cat, cat) and (Dog, dog) tokens. These pairs of tokens are representing the same words but if we will not lower the text, we will have these unnecessary tokens. That is why it is important to lower text before inputting it into the Bag of Words model.

3. Removal of Stopwords

Before going to the removal of stopwords Let’s understand what are stop words.
Stopwords = Words that are very frequent in a sentence and do not add any specific context to that sentence.
So, it’s better to remove these stopwords from our sentences.
And It is not mandatory to use only the list of stopwords from a specific library. In some cases, we have to create our own set of words looking at our corpus

import nltk'stopwords')
#Stop words present in the library
stopwords = nltk.corpus.stopwords.words('english')
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your'......]
Copy code

4. Stemming

It is a text standardisation technique where a word is reduced to its stem/base word. Example: “playing” → “play” and “kicking” → “kick”. The main aim for stemming is that we can reduce the vocab size before inputting it into the BOW model.

# importing nlp library
import nltk
from nltk.tokenize import WhitespaceTokenizer
#importing the Stemming function from nltk library
from nltk.stem.porter import PorterStemmer
tk = WhitespaceTokenizer()
#defining the object for stemming
porter_stemmer = PorterStemmer()
text = tk.tokenize("I am playing cricket")
text = [porter_stemmer.stem(word) for word in text]
# stemmed output: - ['I', 'am', 'play', 'cricket']
Copy code

The Disadvantage of stemming is sometimes, after the stemming word loses its meaning.
Example: – “copying” → “copi” and there is no word “copi” in English vocab.

So these were all text pre-processing steps we’ll take before inputting our data into our Bag of Words model.

How to Apply Bag of Words using Sklearn

Now that we have covered the basics of text pre-processing let’s apply it along with inputting our data to the BoW model and then finally make a complete Spam Classifier for Emails.

1. Reading our Training and Testing data and Setting All Our Imports

\n \n <span style="color: #ff7700; font-weight: bold;">\n \n import numpy \n \n <span style="color: #ff7700; font-weight: bold;">\n \n as np\n \n
\n \n <span style="color: #ff7700; font-weight: bold;">\n \n import pandas \n \n <span style="color: #ff7700; font-weight: bold;">\n \n as pd\n \n
df \n \n <span style="color: #66cc66;">\n = pd. \n <span style="color: black;">\n read_csv( \n <span style="color: #483d8b;">\n 'train-dataset.csv' \n <span style="color: black;">\n ) \n
\n <span style="color: #808080; font-style: italic;">\n # shuffling all our data \n
df  \n <span style="color: #66cc66;">\n = df. \n <span style="color: black;">\n sample(frac \n <span style="color: #66cc66;">\n = \n <span style="color: #ff4500;">\n 1 \n <span style="color: black;">\n ) \n
\n <span style="color: #808080; font-style: italic;">\n # reading only Message_body and label column \n
df  \n <span style="color: #66cc66;">\n = df \n <span style="color: black;">\n [[ \n <span style="color: #483d8b;">\n 'Message_body' \n <span style="color: #66cc66;">\n , \n <span style="color: #483d8b;">\n 'Label' \n <span style="color: black;">\n ]] \n
\n <span style="color: #808080; font-style: italic;">\n # reading our test data \n
df_test  \n <span style="color: #66cc66;">\n = pd. \n <span style="color: black;">\n read_csv( \n <span style="color: #483d8b;">\n 'test-dataset.csv' \n <span style="color: #66cc66;">\n ,encoding \n <span style="color: #66cc66;">\n = \n <span style="color: #483d8b;">\n 'cp1252' \n <span style="color: black;">\n ) \n
df_test. \n <span style="color: black;">\n head() \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: #483d8b;"> \n </span style="color: #66cc66;"> \n </span style="color: #66cc66;"> \n </span style="color: #483d8b;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: #808080; font-style: italic;"> \n </span style="color: black;"> \n </span style="color: #483d8b;"> \n </span style="color: #66cc66;"> \n </span style="color: #483d8b;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: #808080; font-style: italic;"> \n </span style="color: black;"> \n </span style="color: #ff4500;"> \n </span style="color: #66cc66;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: #808080; font-style: italic;"> \n </span style="color: black;"> \n </span style="color: #483d8b;"> \n </span style="color: black;"> \n </span style="color: #66cc66;">\n \n </span style="color: #ff7700; font-weight: bold;">\n \n </span style="color: #ff7700; font-weight: bold;">\n \n </span style="color: #ff7700; font-weight: bold;">\n \n </span style="color: #ff7700; font-weight: bold;">
Copy code

2. Text Preprocessing

a. Removal of Punctuation

#library that contains punctuation
import string
# list of all punctuations we have
#defining the function to remove punctuation
def remove_punctuation(text):
    punctuationfree="".join([for i in text if i not in string.punctuation])
    return punctuationfree
#storing the punctuation free text for both training and testing data
df['clean_msg']= df['Message_body'].apply(lambda x:remove_punctuation(x))
df_test['clean_msg'] = df_test['Message_body'].apply(lambda x:remove_punctuation(x))
Copy code

b. Lower the Text

df['clean_msg']= df['clean_msg'].apply(lambda x: x.lower())
df_test['clean_msg']= df_test['clean_msg'].apply(lambda x: x.lower())
Copy code

c. Remove StopWords

#defining function for tokenization
import re
#whitespace tokenizer from nltk 
from nltk.tokenize import WhitespaceTokenizer
def tokenization(text):
    tk = WhitespaceTokenizer()
    return tk.tokenize(text)
#applying function to the column for making tokens in both Training and Testing data
df['tokenised_clean_msg']= df['clean_msg'].apply(lambda x: tokenization(x))
df_test['tokenised_clean_msg'] = df_test['clean_msg'].apply(lambda x: tokenization(x))
Copy code

d. Stemming

#importing the Stemming function from nltk library
from nltk.stem.porter import PorterStemmer
#defining the object for stemming
porter_stemmer = PorterStemmer()
#defining a function for stemming
def stemming(text):
  stem_text = [porter_stemmer.stem(word) for word in text]
  return stem_text
# applying function for stemming
df['cleaned_tokens']=df['cleaned_tokens'].apply(lambda x: stemming(x))
df_test['cleaned_tokens']=df_test['cleaned_tokens'].apply(lambda x: stemming(x))
Copy code

After stemming we have tokens of text so we will join our tokens back again so that we can input them to our Bag of Words model

def join_text(text):
  return " ".join(text)
# join text because till now we have tokens
df['final_txt'] = df['cleaned_tokens'].apply(lambda x  : join_text(x))
df_test['final_txt'] = df_test['cleaned_tokens'].apply(lambda x  : join_text(x))
And our predictor column is categorical in nature so let’s convert our categorical column back to Numeric values.
# extracting only useful columns
df = df[['final_txt','Label']]
df_test = df[['final_txt',"Label"]]
# as our target variable is categorical it is important to convert our categorical variable to Numeric variable
# Non-Spam --> 0
# SPam --> 1
final_dict = {'Non-Spam':0,'Spam':1}
df['Label'] = df['Label'].map(final_dict)
df_test['Label'] = df_test['Label'].map(final_dict)
Copy code

3. Applying Bag of Words model using Sklearn

\n \n <span style="color: #808080; font-style: italic;">\n \n # importing our Bag of Words model from sklearn\n \n
\n \n <span style="color: #ff7700; font-weight: bold;">\n \n from sklearn.\n \n <span style="color: black;">\n \n feature_extraction.\n \n <span style="color: black;">\n \n text \n \n <span style="color: #ff7700; font-weight: bold;">\n import CountVectorizer \n
train_documents_for_bow  \n <span style="color: #66cc66;">\n = df \n <span style="color: black;">\n [ \n <span style="color: #483d8b;">\n 'final_txt' \n <span style="color: black;">\n ]. \n <span style="color: black;">\n tolist() \n
test_docs  \n <span style="color: #66cc66;">\n = df_test \n <span style="color: black;">\n [ \n <span style="color: #483d8b;">\n 'final_txt' \n <span style="color: black;">\n ]. \n <span style="color: black;">\n tolist() \n
\n <span style="color: #808080; font-style: italic;">\n # Create a BoW model Object \n
vectorizer  \n <span style="color: #66cc66;">\n = CountVectorizer \n <span style="color: black;">\n (max_features \n <span style="color: #66cc66;">\n = \n <span style="color: #ff4500;">\n 100 \n <span style="color: black;">\n ) \n
\n <span style="color: #808080; font-style: italic;">\n # fitting our Bag of Words Model \n
vectorizer. \n <span style="color: black;">\n fit(train_documents_for_bow  \n <span style="color: black;">\n ) \n
\n <span style="color: #808080; font-style: italic;">\n # Printing the identified Unique words along with their indices \n
\n <span style="color: #ff7700; font-weight: bold;">\n print \n <span style="color: black;">\n ( \n <span style="color: #483d8b;">\n "Vocabulary: " \n <span style="color: #66cc66;">\n , vectorizer. \n <span style="color: black;">\n vocabulary_) \n
\n <span style="color: #808080; font-style: italic;">\n # Encode the Document \n
X_train  \n <span style="color: #66cc66;">\n = vectorizer. \n <span style="color: black;">\n fit_transform(train_documents_for_bow  \n <span style="color: black;">\n ) \n
\n <span style="color: #808080; font-style: italic;">\n # Naive Bayes for Classification { Spam or Not-Spam } \n
\n <span style="color: #ff7700; font-weight: bold;">\n from sklearn. \n <span style="color: black;">\n naive_bayes  \n <span style="color: #ff7700; font-weight: bold;">\n import GaussianNB \n
classifier  \n <span style="color: #66cc66;">\n = GaussianNB \n <span style="color: black;">\n () \n
classifier. \n <span style="color: black;">\n fit(X_train. \n <span style="color: black;">\n toarray() \n <span style="color: #66cc66;">\n , df \n <span style="color: black;">\n [ \n <span style="color: #483d8b;">\n 'Label' \n <span style="color: black;">\n ]) \n
X_test  \n <span style="color: #66cc66;">\n = vectorizer. \n <span style="color: black;">\n transform(test_docs \n <span style="color: black;">\n ) \n
\n <span style="color: #808080; font-style: italic;">\n # Predict Class \n
y_pred  \n <span style="color: #66cc66;">\n = classifier. \n <span style="color: black;">\n predict(X_test. \n <span style="color: black;">\n toarray()) \n
\n <span style="color: #808080; font-style: italic;">\n # Accuracy  \n
\n <span style="color: #ff7700; font-weight: bold;">\n from sklearn. \n <span style="color: black;">\n metrics  \n <span style="color: #ff7700; font-weight: bold;">\n import accuracy_score \n
accuracy  \n <span style="color: #66cc66;">\n = accuracy_score \n <span style="color: black;">\n (df_test \n <span style="color: black;">\n [ \n <span style="color: #483d8b;">\n 'Label' \n <span style="color: black;">\n ]. \n <span style="color: black;">\n tolist() \n <span style="color: #66cc66;">\n , y_pred \n <span style="color: black;">\n ) \n
\n <span style="color: #ff7700; font-weight: bold;">\n print \n <span style="color: black;">\n ( \n <span style="color: #483d8b;">\n "Accuracy is --> " \n <span style="color: #66cc66;">\n ,accuracy* \n <span style="color: #ff4500;">\n 100 \n <span style="color: black;">\n ) \n
\n </span style="color: black;"> \n </span style="color: #ff4500;"> \n </span style="color: #66cc66;"> \n </span style="color: #483d8b;"> \n </span style="color: black;"> \n </span style="color: #ff7700; font-weight: bold;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: #483d8b;"> \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: #ff7700; font-weight: bold;"> \n </span style="color: black;"> \n </span style="color: #ff7700; font-weight: bold;"> \n </span style="color: #808080; font-style: italic;"> \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: #808080; font-style: italic;"> \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: black;"> \n </span style="color: #483d8b;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: #ff7700; font-weight: bold;"> \n </span style="color: black;"> \n </span style="color: #ff7700; font-weight: bold;"> \n </span style="color: #808080; font-style: italic;"> \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: #808080; font-style: italic;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: #483d8b;"> \n </span style="color: black;"> \n </span style="color: #ff7700; font-weight: bold;"> \n </span style="color: #808080; font-style: italic;"> \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: #808080; font-style: italic;"> \n </span style="color: black;"> \n </span style="color: #ff4500;"> \n </span style="color: #66cc66;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: #808080; font-style: italic;"> \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: #483d8b;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: black;"> \n </span style="color: black;"> \n </span style="color: #483d8b;"> \n </span style="color: black;"> \n </span style="color: #66cc66;"> \n </span style="color: #ff7700; font-weight: bold;">\n \n </span style="color: black;">\n \n </span style="color: black;">\n \n </span style="color: #ff7700; font-weight: bold;">\n \n </span style="color: #808080; font-style: italic;">
Copy code
 Applying BoW model using Sklearn

Hyperparameters of Bag of Words

In our model of Bag of Words, we have used a Hyperparameter max_features =100 this means we are choosing the top 100 most frequent words from our vocabulary.

Output: –

After fitting our bag of words model and checking on testing data, we have got an accuracy of nearly 90% which is very impressive considering just the BOW model.

Introduction to Natural language processing
Introduction to Natural language processing
Natural Language Processing (NLP) refers to the branch of artificial intelligence that gives machines the ability to read, understand, and derive meaning from human language. In this article we more
Text Pre-processing For Spam Filtering (With Codes)
Text Pre-processing For Spam Filtering (With Codes)
“Garbage in Garbage Out” this quote is one of the key fundamentals when it comes to machine learning and Natural language processing. Natural language processing is no different, it more
Tokenization in NLP | Techniques to Apply Tokenization in Python
Tokenization in NLP | Techniques to Apply Tokenization in Python
In this blog, we will discuss the process of tokenization, which involves breaking down a string of text into smaller units called tokens. This is an important step in more

Pros and Cons of Bag of Words

Pros: –

1. It is easy to implement and takes very little computation power compared to other Natural Language Processing algorithms.

Cons: –

In our above example, our corpus has only 11 words, but we’ll start facing problems when our corpus size is huge and we have a lot of unique words.

1. In cases where we have a huge corpus, our vector length will be very long and will contain most of the 0’s, thereby resulting in a sparse matrix.
2. We are not retaining any information related to the placement of words in sentences.
a. Examples: –
i. Say we have two sentences in our corpus: –

S1 = Cat killed a mouse.
S2 = Mouse killed a Cat.

Now in our corpus, we have 4 words = [ “Mouse” , “killed” , “a”, “cat” ]

And our BoW will look like this

S1 = [ 1, 1, 1, 1 ]
S2 = [ 1, 1, 1, 1 ]

But as we know after reading 2 sentences from our corpus, we can say that they are 2 completely different sentences but just because we have used BoW our sentences after vectorisation has lost meaning.

This is one of the major drawbacks of the BoW algorithm

3. The BOW model captures the occurrence of words in a sentence but fails to understand the semantics of a text.
a. Example: –
i. “He lives near riverbank”
ii. “He has a bank account”

When we will create a BOW model out of it we will see the word “bank” got the same number in both sentences but actually the meaning of the word “bank” is completely different in both sentences. So, it tells BOW fails to capture any semantic meaning of a sentence

Next Steps

In our next blog, we’ll discuss what other algorithms we can use to tackle the drawbacks of BoW algorithms and we’ll learn about TF-IDF.

Contributed By: Sahib Singh

Download this article as PDF to read offline

Download as PDF
About the Author

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio


We use cookies to improve your experience. By continuing to browse the site, you agree to our Privacy Policy and Cookie Policy.