Information

Section: Put everything together
Goal: Apply all the seen methods together to see the transformation of the text.
Time needed: 10 min
Prerequisites: Chapter 3

Put everything together

Now that we have covered each transformation individually, let's see how the tweets change when we apply all the methods one after another.

Tweet examples

# Put everything in a single list
tweet_1 = 'Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB'
tweet_2 = 'Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e'
tweet_3 = 'This happened in Adelaide the other day. #koala #Adelaide https://t.co/vAQFkd5r7q'
list_tweets = [tweet_1, tweet_2, tweet_3]

Result

Change the value of the tweets to see how any text is transformed by our preprocessing.

# Function for tweet preprocessing with what we saw in the chapter

def preprocess_tweet(tweet):
    '''
    Takes a tweet as input and returns a list of tokens.
    '''
    
    # `emoji` is a third-party package (pip install emoji); nltk also
    # needs its 'punkt' and 'stopwords' data (via nltk.download)
    import emoji
    import re
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    
    # Initialization
    new_tweet = tweet
    
    ## Changes on string
    
    # Remove urls
    new_tweet = re.sub(r'https?://[^ ]+', '', new_tweet)
    
    # Remove usernames
    new_tweet = re.sub(r'@[^ ]+', '', new_tweet)
    
    # Remove hashtags
    new_tweet = re.sub(r'#', '', new_tweet)
    
    # Character normalization
    new_tweet = re.sub(r'([A-Za-z])\1{2,}', r'\1', new_tweet)
    
    # Emoji transformation
    new_tweet = emoji.demojize(new_tweet)
    
    # Punctuation and special characters
    new_tweet = re.sub(r' 0 ', ' zero ', new_tweet)
    new_tweet = re.sub(r'[^A-Za-z ]', '', new_tweet)
    
    # Lower casing
    new_tweet = new_tweet.lower()
    
    
    ## Changes on tokens
    
    # Tokenization
    tokens = word_tokenize(new_tweet)
    
    porter = PorterStemmer()
    
    # Stopword removal and stemming
    # (build a new list: calling remove() on a list while iterating
    # over it skips elements)
    tokens = [porter.stem(token) for token in tokens
              if token not in stopwords.words('english')]
    
    return tokens
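To see what the string-level substitutions do in isolation, here is a minimal sketch using only the standard-library `re` module; the input string is a made-up example, not one of the tweets above:

```python
import re

# Hypothetical tweet-like string
text = 'Sooo cool!!! Thanks @user #koala https://t.co/abc123'

text = re.sub(r'https?://[^ ]+', '', text)       # remove URLs
text = re.sub(r'@[^ ]+', '', text)               # remove usernames
text = re.sub(r'#', '', text)                    # keep hashtag words, drop '#'
text = re.sub(r'([A-Za-z])\1{2,}', r'\1', text)  # 'Sooo' -> 'So'
text = re.sub(r'[^A-Za-z ]', '', text)           # strip punctuation and digits
text = text.lower()

print(text)  # so cool thanks  koala
```

Note the leftover double and trailing spaces; they are harmless here because `word_tokenize` ignores extra whitespace.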
# Use function on our list of tweets

list_tweets2 = []
for tweet in list_tweets:
    print(tweet)
    tokens = preprocess_tweet(tweet)
    print(tokens)
    list_tweets2.append(tokens)  # tokens is already a list
# Beginner version: cell to hide

import ipywidgets as widgets
from ipywidgets import interact

def preprocess_tweet(tweet):
    '''
    Takes a tweet as input and prints the list of tokens.
    '''
    
    import emoji
    import re
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    
    new_tweet = tweet
    new_tweet = re.sub(r'https?://[^ ]+', '', new_tweet)
    new_tweet = re.sub(r'@[^ ]+', '', new_tweet)
    new_tweet = re.sub(r'#', '', new_tweet)
    new_tweet = re.sub(r'([A-Za-z])\1{2,}', r'\1', new_tweet)
    new_tweet = emoji.demojize(new_tweet)
    new_tweet = re.sub(r' 0 ', ' zero ', new_tweet)
    new_tweet = re.sub(r'[^A-Za-z ]', '', new_tweet)
    new_tweet = new_tweet.lower()
    
    tokens = word_tokenize(new_tweet)
    porter = PorterStemmer()
    tokens = [porter.stem(token) for token in tokens
              if token not in stopwords.words('english')]
            
    print(tokens)

interact(preprocess_tweet, tweet = widgets.Textarea(
    value = tweet_1,
    description = 'Tweet:',
    disabled = False
))

Conclusion

This chapter showed some simple text transformations for a machine-learning experiment based on text analysis. This was only a simple case (bag-of-words), where we treat each word as an independent entity.
Other, more sophisticated, methods also take into account the role of a word in the sentence (part-of-speech tagging, for example) and perform deeper language-based analysis. To go further on the theory, you can have a look at this good article or this one. For some Python-oriented resources, have a look here or there.

from IPython.display import IFrame
IFrame("https://blog.hoou.de/wp-admin/admin-ajax.php?action=h5p_embed&id=65", "959", "332")