
text-preprocessing



tags: bag-of-words, TF-IDF
categories: NLP, data preprocessing


Ways to preprocess text before Feature Extraction in NLP

When preprocessing, we can perform the following:

  1. Eliminate handles and URLs.
  2. Tokenize the string into words.
  3. Remove stop words like "and", "is", "a", "on", etc.
  4. Stemming, i.e. converting every word to its stem.
    • For example, dancer, dancing, and danced all become ‘danc’. You can use the Porter stemmer for this.
  5. Convert all your words to lower case.

For example:

  • tweet

    @YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai!!!
    
  • After preprocessing, this becomes

    [tun,great,ai,model]
    

    Hence we can see how we eliminated the handles and the URL, tokenized the string into words, removed stop words, performed stemming, and converted everything to lower case. A compact sketch of this whole pipeline follows.
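
As a quick preview, here is a minimal sketch of that pipeline in Python, using NLTK's TweetTokenizer and PorterStemmer (both introduced step by step in the rest of this post). The function name process_tweet and the URL regex are illustrative choices, not a fixed API:

import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

nltk.download('stopwords')                 # stop word list used in step 3

def process_tweet(tweet):
    # step 1: eliminate URLs; handles are stripped by the tokenizer below
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    # steps 2 and 5: tokenize into words and convert to lower case
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tokens = tokenizer.tokenize(tweet)
    # step 3: remove stop words and punctuation
    stopwords_english = stopwords.words('english')
    tokens = [w for w in tokens
              if w not in stopwords_english and w not in string.punctuation]
    # step 4: stem every remaining word
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in tokens]

print(process_tweet('@YMourri and @AndrewYNg are tuning a GREAT AI model '
                    'at https://deeplearning.ai!!!'))
# close to the example above; NLTK's Porter stemmer may return 'tune' rather than 'tun'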

Stemming

Stemming is the process of converting a word to its most general form, or stem. This helps in reducing the size of our vocabulary.

Consider the words:

  • learn
  • learn-ing
  • learn-ed
  • learn-t

All of these words share the common root learn. However, in some cases the stemming process produces stems that are not correct spellings of the root word, for example happi and sunni. That is because the stemmer chooses the most common stem among related words. For example, we can look at the set of words that make up the different forms of happy:

  • happ-y
  • happi-ness
  • happi-er

We can see that happi is the more common stem across these forms. We cannot choose happ, because it is also the stem of unrelated words like happen.
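
A quick way to check this behaviour is to run a few of these forms through NLTK's PorterStemmer (introduced properly below). The stems in the comments are what the Porter algorithm typically produces; exact outputs can vary slightly between NLTK versions:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('happy'))      # 'happi' -- a trailing y after a consonant becomes i
print(stemmer.stem('happiness'))  # 'happi' -- the -ness suffix is dropped
print(stemmer.stem('sunny'))      # 'sunni'
print(stemmer.stem('happen'))     # 'happen' -- the unrelated word keeps its own stem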

NLTK has different modules for stemming; we will use the PorterStemmer module, which implements the Porter stemming algorithm. Let's see how to use it in the cells below.

Code

import nltk                                # Python library for NLP
from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt            # library for visualization
import random

Download stopwords from NLTK

# download the stop words list from NLTK
nltk.download('stopwords')

Imports

import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

Remove hyperlinks, Twitter marks and styles

# the example tweet from the beginning of the post
tweet = '@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai!!!'

# remove old style retweet text "RT"
tweet2 = re.sub(r'^RT[\s]+', '', tweet)

# remove hyperlinks
tweet2 = re.sub(r'https?://[^\s\n\r]+', '', tweet2)

# remove hashtags
# only removing the hash # sign from the word
tweet2 = re.sub(r'#', '', tweet2)

print(tweet2)

Tokenize the string

# instantiate tokenizer class
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)

# tokenize tweets
tweet_tokens = tokenizer.tokenize(tweet2)

print()
print('Tokenized string:')
print(tweet_tokens)

Stop words and punctuation

# import the English stop words list from NLTK
stopwords_english = stopwords.words('english')

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

Remove stop words and punctuation

tweets_clean = []

for word in tweet_tokens:  # go through every word in the tokens list
    if (word not in stopwords_english and    # remove stop words
            word not in string.punctuation):  # remove punctuation
        tweets_clean.append(word)

print('removed stop words and punctuation:')
print(tweets_clean)

Stemming

# Instantiate stemming class
stemmer = PorterStemmer()

# Create an empty list to store the stems
tweets_stem = []

for word in tweets_clean:
    stem_word = stemmer.stem(word)  # stemming word
    tweets_stem.append(stem_word)  # append to the list

print('stemmed words:')
print(tweets_stem)
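
Since this post is tagged with bag-of-words and TF-IDF, here is one illustrative next step (not part of the walkthrough above): counting the stemmed tokens with the standard library's collections.Counter gives a simple bag-of-words representation of the tweet.

from collections import Counter

# bag-of-words counts over the stemmed tokens
bow = Counter(tweets_stem)
print(bow)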