by Sahib Singh
“Garbage in, garbage out” is one of the fundamental principles of machine learning, and natural language processing (NLP) is no different. NLP is a branch of data science in which natural language data is fed to machine learning algorithms for purposes such as:
- Sentiment Analysis
- Spam filtering
- Named entity recognition
- Part of speech tagging etc.
In this blog, I will be covering some very useful text pre-processing techniques and we will see how text pre-processing can make a difference in your final results.
The pre-processing steps that we will cover in this blog are:
- Lower casing
- Removal of punctuations
- Removal of stop words
- Removal of frequent words
- Stemming
- Lemmatization
- Conversion of Emoticons to words
- Removal of URLs
- Removal of HTML tags
- Spelling Correction
- Removal of rare words
Remember that we do not have to use every text pre-processing technique on our data. For each use case, we have to carefully select the techniques that work best for us.
For example, in sentiment analysis it is better not to remove emoticons, since they carry important information; instead, it is useful to convert them to text.
For this blog, we will use an email spam filtering dataset, where the goal is to classify whether an email is spam or not. Let’s look at our data:
    import numpy as np
    import pandas as pd

    df = pd.read_csv('/content/train-dataset.csv')

    # shuffling all our data
    df = df.sample(frac=1)

    # reading only Message_body and Label
    df = df[['Message_body', 'Label']]
    df
1. Lower Casing
- Lowercasing is a text pre-processing technique where all words are converted to lowercase so that words like ‘cat’ and ‘CAT’ are treated the same way. It comes in handy when we use Bag of Words or TF-IDF to build features from our natural language data.
- It may not help with Part of Speech Tagging (where nouns can be differentiated based on case) or Sentiment Analysis (where words in capitals often signal anger). In such cases, it is better not to lowercase the text.
    # 'clean_msg' does not exist yet, so we create it from the raw 'Message_body' column
    df['clean_msg'] = df['Message_body'].apply(lambda x: x.lower())
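To see why lowercasing matters for count-based features such as Bag of Words, here is a minimal sketch. The sample messages and the `vocabulary` helper are made up for illustration and are not taken from the spam dataset.

```python
from collections import Counter

# Toy messages (hypothetical, not from the spam dataset)
messages = ["FREE entry today", "Free entry now", "free ENTRY"]

def vocabulary(texts):
    """Count whitespace-separated tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

raw_vocab = vocabulary(messages)
lower_vocab = vocabulary(m.lower() for m in messages)

print(len(raw_vocab))       # distinct tokens before lowercasing
print(len(lower_vocab))     # distinct tokens after lowercasing
print(lower_vocab["free"])  # all case variants of "free" now share one count
```

Without lowercasing, "FREE", "Free", and "free" each become a separate feature, which spreads the same signal across several columns.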
2. Removal of Punctuations
- Removing punctuation is another text pre-processing technique: we strip punctuation symbols when they add nothing meaningful to our text data.
For example, ‘yippee’ and ‘yippee!’ both convey happiness and excitement, so the exclamation mark adds little here.
- We will remove the punctuation marks listed in string.punctuation.
- You can always add more symbols based on your use case.
    # library that contains punctuation
    import string

    # list of all punctuation characters we have
    print(string.punctuation)

    # defining the function to remove punctuation
    def remove_punctuation(text):
        punctuationfree = "".join([i for i in text if i not in string.punctuation])
        return punctuationfree

    # storing the punctuation-free text
    df['clean_msg'] = df['clean_msg'].apply(lambda x: remove_punctuation(x))
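If you want to add more symbols for your use case, one possible approach is a translation table built with `str.maketrans`. This is a sketch rather than the method used in the notebook; the `EXTRA_SYMBOLS` string and the function name are illustrative choices.

```python
import string

# Extra symbols beyond string.punctuation (an illustrative choice, adjust as needed)
EXTRA_SYMBOLS = "“”‘’…–"

# Map every listed character to None so that translate() deletes it
PUNCT_TABLE = str.maketrans("", "", string.punctuation + EXTRA_SYMBOLS)

def remove_punctuation_extended(text):
    """Delete ASCII punctuation plus the extra symbols defined above."""
    return text.translate(PUNCT_TABLE)

print(remove_punctuation_extended("“Yippee!” – you won…"))
```

Note that removing a symbol that sits between two spaces leaves a double space behind, so you may want to normalize whitespace afterwards.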
3. Removal of Stopwords in Text Pre-processing
- Stop words are words that add little value to a text, for example ‘a’, ‘an’, and ‘the’. They occur very frequently in text data but carry little information. Many libraries ship compiled stop word lists for various languages that we can use directly, and for a specific use case we can extend the list with our own stop words.
    import nltk
    from nltk.corpus import stopwords

    # the stop word lists must be downloaded once before use
    nltk.download('stopwords')

    ", ".join(stopwords.words('english'))
Code for removal of stop words
Before removing stop words, we need to tokenize the text:
    # defining function for tokenization
    import re

    # whitespace tokenizer
    from nltk.tokenize import WhitespaceTokenizer

    def tokenization(text):
        tk = WhitespaceTokenizer()
        return tk.tokenize(text)

    # applying the function to the column to create tokens
    df['tokenised_clean_msg'] = df['clean_msg'].apply(lambda x: tokenization(x))
Now we can remove the stop words:
    # importing nlp library
    import nltk
    nltk.download('stopwords')

    # stop words present in the library (a set makes membership checks fast)
    stop_words = set(nltk.corpus.stopwords.words('english'))

    # defining the function to remove stop words from tokenized text
    def remove_stopwords(text):
        output = [i for i in text if i not in stop_words]
        return output

    # applying the function for removal of stop words
    df['cleaned_tokens'] = df['tokenised_clean_msg'].apply(lambda x: remove_stopwords(x))
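Adding a more specific set of stop words for a use case could look like the sketch below. To keep it self-contained, it uses a tiny hand-written base set instead of the NLTK list; both sets and the function name are hypothetical.

```python
# A small hand-written base set (illustrative; NLTK's English list is much longer)
BASE_STOPWORDS = {"a", "an", "the", "is", "to", "in"}

# Hypothetical domain-specific additions for email data
CUSTOM_STOPWORDS = {"subject", "re", "fwd"}

# Union of the general and the domain-specific stop words
ALL_STOPWORDS = BASE_STOPWORDS | CUSTOM_STOPWORDS

def remove_stopwords_custom(tokens):
    """Keep only the tokens that are not in the combined stop word set."""
    return [t for t in tokens if t not in ALL_STOPWORDS]

tokens = "subject re the prize is waiting".split()
print(remove_stopwords_custom(tokens))
```

With the NLTK list, the same idea applies: build a set from the library's words and union it with your own additions.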
4. Removal of Frequent Words
We have already removed stop words, but in some cases it also helps to remove the most frequent words in the data itself when they carry no useful signal. The most frequent words in our corpus are:
    from collections import Counter

    cnt = Counter()
    for text in df["cleaned_tokens"].values:
        for word in text:
            cnt[word] += 1

    cnt.most_common(10)
    from collections import Counter

    cnt = Counter()
    for text in df["cleaned_tokens"].values:
        for word in text:
            cnt[word] += 1

    FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])

    def remove_freqwords(text):
        """custom function to remove the frequent words"""
        return " ".join([word for word in text if word not in FREQWORDS])
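The same idea on a toy corpus, as a minimal sketch: the token lists and `top_n` are made up, and this variant returns token lists rather than joined strings so that later token-based steps can still run on the result.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical)
token_lists = [
    ["call", "free", "prize"],
    ["call", "now"],
    ["call", "free", "offer"],
]

# Count every token across the whole corpus
counts = Counter()
for tokens in token_lists:
    counts.update(tokens)

top_n = 2  # how many of the most frequent words to drop (illustrative)
frequent = {word for word, _ in counts.most_common(top_n)}

# Drop the frequent words but keep each message as a list of tokens
cleaned = [[t for t in tokens if t not in frequent] for tokens in token_lists]
print(frequent)
print(cleaned)
```

Before removing anything, it is worth inspecting the frequent words: in a spam dataset, a very common word such as "free" may be exactly the signal the classifier needs.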
5. Stemming in Text Pre-processing
Stemming is a text standardization technique where a word is reduced to its stem or base form. Example: “jabbing” → “jab” and “kicking” → “kick”. The main aim of stemming is to reduce the vocabulary size before feeding the text into a machine learning model.
    # importing the stemming class from the nltk library
    from nltk.stem.porter import PorterStemmer

    # defining the object for stemming
    porter_stemmer = PorterStemmer()

    # defining a function for stemming
    def stemming(text):
        stem_text = [porter_stemmer.stem(word) for word in text]
        return stem_text

    # applying the function for stemming
    df['cleaned_tokens'] = df['cleaned_tokens'].apply(lambda x: stemming(x))
The disadvantage of stemming is that a word sometimes loses its meaning after being stemmed. Example: “copying” → “copi”, and there is no word “copi” in the English vocabulary.
Also, the Porter stemmer is for the English language. If we are working with other languages, we can use the Snowball stemmer.
6. Lemmatization
- Lemmatization is very similar to stemming; the difference is that the word is reduced to its lemma, a form that is a valid word with meaning in its language. Because this requires a vocabulary lookup, lemmatization is generally slower than stemming.
- Let us use the WordNetLemmatizer in nltk to lemmatize our sentences.
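    import nltk
    from nltk.stem import WordNetLemmatizer

    # the WordNet corpus must be downloaded once before the lemmatizer can be used
    nltk.download('wordnet')

    lemmatizer = WordNetLemmatizer()

    def lemmatize_words(text):
        return " ".join([lemmatizer.lemmatize(word) for word in text.split()])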
7. Conversion of Emoticons in Text Pre-processing
- The use of emoticons on social media keeps increasing, so it is better to convert them back to natural text so that we can get useful content out of them.
- This method can be useful for some use cases.
- For implementation refer to the notebook at the end.
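    import re

    # EMOTICONS is assumed to be a dict mapping an emoticon regex pattern to its
    # text description; it is defined in the notebook and not shown here.
    def convert_emoticons(text):
        for emot in EMOTICONS:
            text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
        return text

    text = "Hello :-) :-)"
    convert_emoticons(text)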
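Since the emoticon dictionary itself is not shown here, the following self-contained sketch uses a tiny made-up mapping (`EMOTICONS_DEMO`) to show the conversion end to end. Unlike the snippet above, it stores plain emoticon strings and escapes them with `re.escape`, which is an assumption about how you might build your own mapping.

```python
import re

# Tiny illustrative mapping (hypothetical); a real mapping would cover far more emoticons
EMOTICONS_DEMO = {
    ":-)": "Happy face smiley",
    ":-(": "Frown, sad",
    ";)": "Wink",
}

def convert_emoticons_demo(text):
    """Replace each known emoticon with an underscore-joined description."""
    for emoticon, description in EMOTICONS_DEMO.items():
        # "Frown, sad" -> "Frown_sad"
        label = "_".join(description.replace(",", "").split())
        # re.escape treats characters such as ")" and "(" literally
        text = re.sub(re.escape(emoticon), label, text)
    return text

print(convert_emoticons_demo("Hello :-) :-)"))
```

Joining the description with underscores keeps each emoticon as a single token after whitespace tokenization.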
8. Removal of URLs
- The next pre-processing step is to remove any URLs present in the data. For example, if we are analysing news data, there is a very good chance that the articles contain URLs, and we will probably need to remove them before further analysis. We can use the code snippet below to do that.
- In this example, we remove links that start with http://, https://, or www.
    import re

    def remove_urls(text):
        # \S+ matches a run of non-whitespace characters; the dot after www is escaped
        url_pattern = re.compile(r'https?://\S+|www\.\S+')
        return url_pattern.sub(r'', text)
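A quick usage sketch of this kind of URL pattern on a made-up sentence. The sample text, the `URL_PATTERN` constant, and the `strip_urls` name are illustrative.

```python
import re

# http:// or https:// followed by non-whitespace, or www. followed by non-whitespace
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Remove anything that looks like a URL from the text."""
    return URL_PATTERN.sub("", text)

sample = "Claim your prize at https://example.com/win or www.example.org today"
print(strip_urls(sample))
```

The surrounding spaces are left in place, so the result contains double spaces where the URLs used to be; collapse whitespace afterwards if that matters for your tokenizer.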
9. Remove HTML tags
While scraping data from websites, there is a high chance of picking up HTML tags along with the text, and it is useful to remove them before any further processing. We’ll use regular expressions to remove HTML tags.
    def remove_html(text):
        html_pattern = re.compile('<.*?>')
        return html_pattern.sub(r'', text)

    text = """<div> <h1> Data</h1> <p> News articles</p> """
    print(remove_html(text))
Data News articles
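Regular expressions work for simple markup, but they do not decode entities such as `&amp;`. As an alternative (a swapped-in technique, not the one used above), here is a minimal sketch built on Python's standard `html.parser` module. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; entities are already decoded
        self.parts.append(data)

def strip_html(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    # Join the pieces and collapse runs of whitespace into single spaces
    return " ".join("".join(parser.parts).split())

print(strip_html("<div> <h1> Data</h1> <p> News articles</p> </div>"))
```

For heavily malformed pages, a dedicated HTML library may be more robust than either approach.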
10. Spelling Correction
- Typos are very common on social media and other platforms where users type quickly and unintentionally misspell words. Correcting these misspellings is extremely useful for making a better analysis of textual data.
- If we are interested in writing a spelling corrector of our own, we can start with the famous code from Peter Norvig.
    from spellchecker import SpellChecker

    spell = SpellChecker()

    def correct_spellings(text):
        corrected_text = []
        misspelled_words = spell.unknown(text.split())
        for word in text.split():
            if word in misspelled_words:
                # fall back to the original word if no correction is found
                corrected_text.append(spell.correction(word) or word)
            else:
                corrected_text.append(word)
        return " ".join(corrected_text)
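To give a feel for the approach in Peter Norvig's corrector, here is a condensed sketch limited to candidates one edit away. The `WORD_COUNTS` table is a tiny made-up stand-in; a real corrector would count words in a large corpus and also consider words two edits away.

```python
from collections import Counter

# Toy word frequencies (hypothetical); normally built from a large text corpus
WORD_COUNTS = Counter({"spelling": 10, "correct": 8, "the": 50, "free": 12, "prize": 5})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings that are one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the words that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}

def correct(word):
    """Prefer the word itself, then known words one edit away, else give up."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])

print(correct("speling"))
print(correct("teh"))
```

When several candidates are known, the most frequent one wins, which is why the quality of the frequency table matters so much.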
11. Removal of Rare words
This step removes the least frequent words in the corpus. It is very use-case specific, but in some cases it can improve your scores on the metrics you care about.
    from collections import Counter

    cnt = Counter()
    for text in df["cleaned_tokens"].values:
        for word in text:
            cnt[word] += 1

    n_rare_words = 10
    # the reversed slice picks the n_rare_words least frequent words
    Rare_words = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])

    def remove_rarewords(text):
        """custom function to remove the rare words"""
        return " ".join([word for word in str(text).split() if word not in Rare_words])
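A variation you might consider is a minimum-count threshold instead of a fixed number of rare words, so that every word seen fewer than `min_count` times is dropped. This is a different rule from the snippet above, and the toy corpus and threshold are made up.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical)
token_lists = [
    ["win", "cash", "now"],
    ["win", "cash"],
    ["win", "xqzt"],
]

# Count every token across the corpus
counts = Counter(token for tokens in token_lists for token in tokens)

min_count = 2  # words seen fewer times than this are treated as rare (illustrative)
rare = {word for word, count in counts.items() if count < min_count}

# Drop the rare words from every message
cleaned = [[t for t in tokens if t not in rare] for tokens in token_lists]
print(rare)
print(cleaned)
```

A threshold scales with corpus size, while a fixed count of ten rare words removes only a sliver of a large vocabulary.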
These are some of the text pre-processing techniques that you can use whenever you are dealing with textual data. The right set of techniques depends on your data, and you can also design your own.
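Because the right combination differs per use case, it can help to keep each step as a small function and compose only the ones you need. The sketch below is one possible way to do that with simplified, self-contained versions of a few steps; `make_pipeline` and the step names are hypothetical helpers.

```python
import re
import string

def remove_html_tags(text):
    return re.sub(r"<.*?>", "", text)

def remove_url_links(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def to_lowercase(text):
    return text.lower()

def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def collapse_whitespace(text):
    return " ".join(text.split())

def make_pipeline(*steps):
    """Return a function that applies the given steps in order."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run

# Order matters: URLs are removed before punctuation so the pattern can still match
spam_pipeline = make_pipeline(
    remove_html_tags, remove_url_links, to_lowercase, strip_punctuation, collapse_whitespace
)
print(spam_pipeline("<p>WIN a prize!</p> Visit https://example.com NOW"))
```

For a sentiment analysis task, you could build a different pipeline from the same pieces, for example leaving out `to_lowercase`.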
Why Are They Useful?
To understand the impact of these techniques, we will pick a problem and compare the results with and without text pre-processing.
The problem we will look at is spam filtering in emails, using the Bag of Words model that we described in detail in a separate post. Briefly, Bag of Words is a vectorization algorithm that converts a text document into a vector by counting how many times each word appears in it.
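As a minimal sketch of the counting idea (the toy documents are made up, and a real project would more likely use a library vectorizer), Bag of Words can be written in a few lines:

```python
from collections import Counter

# Toy documents (hypothetical)
docs = ["free prize free entry", "call me now", "free call now"]

# The vocabulary is the sorted set of all distinct words in the corpus
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc) for doc in docs]
print(vocab)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary word, so every extra spelling or case variant of a word adds a column, which is where pre-processing pays off.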
For this problem, I used a few of the text pre-processing techniques from above, and we got a good improvement in our results.
- Without any text pre-processing, we got an accuracy of about 88%.
- With text pre-processing, we got an accuracy of about 92%. That is a gain of about 4 percentage points (roughly 4.5% in relative terms), which is fair enough for this dataset.
I hope this blog helps you understand what text pre-processing is, how to examine your own data, and how to build your own text pre-processing techniques.
For your convenience, I have attached both notebooks below, with and without text pre-processing.