Information

Section: More complex transformations of text
Goal: Learn and understand more complex methods of text preprocessing, such as stemming, negation handling, stopword removal, …
Time needed: 30 min
Prerequisites: None

More complex transformations of text

# Get example tweets
tweet_1 = 'Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB'
tweet_2 = 'Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e'
tweet_3 = 'This happened in Adelaide the other day. #koala #Adelaide https://t.co/vAQFkd5r7q'

Negations and contractions

Words such as “can’t” and “don’t”, in other words, words containing a contracted negation, could in principle be recognized by our algorithm as they are. However, we can simplify its task by expanding the contracted forms in the text (“can not”, “do not”): a plain “not” is easier to interpret, as it occurs far more frequently than each individual contracted form.

Of course, one can argue that, with a bag-of-words method, having two words instead of one, including one that means the opposite of what we want to express, can be misleading (for example, “not happy” contains the word “happy” but means the opposite). This can be true in some cases, so, as always, we should keep a good overview of the data and the problem we are dealing with.

Depending on the text-processing methods you use for your algorithm, replacing negations will be more or less useful.
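
As a minimal sketch, such a replacement can be done with a small lookup table. The mapping below is illustrative only, not a complete list of English contractions:

import re

# Illustrative mapping of contracted negations to their expanded forms;
# a real application would need a much more complete list.
# Note: curly apostrophes (’) would need to be normalized to straight ones first.
contractions = {
    "can't": "can not",
    "don't": "do not",
    "won't": "will not",
    "isn't": "is not",
    "didn't": "did not",
}

def expand_contractions(text):
    # Replace each known contraction, ignoring case.
    for contraction, expanded in contractions.items():
        text = re.sub(re.escape(contraction), expanded, text, flags=re.IGNORECASE)
    return text

print(expand_contractions("We don't do that, it can't work."))
We do not do that, it can not work.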

Stopwords removal

A lot of words serve syntactic purposes but, considering their meaning, add no information for the task at hand. Such words, for example “a”, “the”, “for”, etc. (see the list below), should be removed for a more efficient analysis; otherwise they will likely have a high frequency and pollute the results of the bag-of-words method. Removing them also lets the whole experiment run faster, since there is less data to process.

In Python, some libraries come with a built-in list of stopwords, making it easy to remove them. For example, the nltk.corpus module provides a stopwords corpus whose words one can simply remove from the dataset. Note that this corpus has to be downloaded once with nltk.download('stopwords') before it can be used.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # download the stopword lists (only needed once)

print(stopwords.words('english'))

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')  # tokenizer model, needed once for word_tokenize

print(tweet_1)

# Build a new list instead of removing tokens from the list we are
# iterating over, which would silently skip some of them.
stop_words = set(stopwords.words('english'))
tokens = [token for token in word_tokenize(tweet_1) if token not in stop_words]

print(tokens)
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB
['Hospitalizations', 'COVID-19', 'increased', 'nearly', '90', '%', 'California', 'officials', 'say', 'could', 'triple', 'Christmas', '.', 'https', ':', '//t.co/hrBnP04HnB']

Emoji transformation

Emojis can be a great source of information for sentiment analysis, and it would be a shame to simply discard them. Instead, we can use a mapping function that gives a corresponding text translation for each emoji. With this approach, the emoji “:-)” would be translated into “simple_smile”.

A “cheat sheet” of emojis transformed into text can be seen here.

This emoji transformation has to be done before we remove all special characters, as we showed on the previous page.

A few resources exist for emoji transformation. Depending on how the emoji is represented in the text we want to process, we will use different functions.

For example, the emoji library offers a few functions for transforming emojis to text.
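
A minimal sketch with this library, assuming the input contains Unicode emojis (the exact text aliases it produces depend on the library version):

import emoji

# demojize() replaces each Unicode emoji with a text alias,
# e.g. 'Christmas is coming :Christmas_tree:'.
print(emoji.demojize('Christmas is coming 🎄'))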

Stemming and lemmatization

  • Stemming is a linguistic operation that reduces words to their stem, or root, by removing suffixes. The stem might not be a real word.

  • Lemmatization is a similar operation that groups all words sharing the same root into one entity, the lemma, which is a real word.

They often give the same result, but not always.

For example:

Words       Stemming    Lemmatization
written     writ        write
writing     writ        write
gives       giv         give
finally     final       final
expected    expect      expect
picky       pick        pick

The effect is that all words with the same root are treated as synonyms, since they are mapped to a single word. This makes sense for an algorithm that tries to capture the general idea of a text rather than studying each word one by one. The operation also reduces the number of distinct words in the data and therefore helps build a stronger model.

A stemming algorithm is easier to write, as the word is only cut and no dictionary or lookup table is needed, but it can give inconsistent results: for example, the word “given” will be reduced to “giv”, while the word “gave” will stay “gave”, even though the two words have the same root and should be grouped together.
A lemmatization algorithm requires more preliminary resources.

In the nltk library, we can use the PorterStemmer class, which implements a specific stemming algorithm, the Porter stemming algorithm.

Try changing the words in the list below to see how other words are stemmed.

Note that, although the algorithm performs stemming, for these examples it returns real words rather than truncated roots.

from nltk.stem import PorterStemmer

# Create a stemmer implementing the Porter algorithm.
porter = PorterStemmer()

list_words = ['writing', 'gives', 'expected']

# Print each word next to its stem.
for word in list_words:
    print(word, '-', porter.stem(word))
writing - write
gives - give
expected - expect
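
For comparison, nltk also ships lemmatizers. A minimal sketch using WordNetLemmatizer on the same words (the WordNet data has to be downloaded once; pos='v' tells the lemmatizer to treat the words as verbs):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data, needed once for the lemmatizer

lemmatizer = WordNetLemmatizer()

# Lemmatize the same words as above, treated as verbs.
for word in ['writing', 'gives', 'expected']:
    print(word, '-', lemmatizer.lemmatize(word, pos='v'))
writing - write
gives - give
expected - expect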

Quiz

from IPython.display import IFrame
IFrame("https://blog.hoou.de/wp-admin/admin-ajax.php?action=h5p_embed&id=67", "959", "309")
