Information

Section: Simple tweet preprocessing
Goal: Learn and understand the simple methods of text preprocessing and normalization using regular expressions.
Time needed: 30 min
Prerequisites: None

Simple tweet preprocessing

On this page, we examine some basic ways of transforming text to make it easier for a natural language processing algorithm to handle.

# Get example tweets
tweet_1 = 'Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB'
tweet_2 = 'Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e'
tweet_3 = 'This happened in Adelaide the other day. #koala #Adelaide https://t.co/vAQFkd5r7q'

Remove URLs

URLs carry no information when we analyze text word by word, especially on Twitter, where they are shortened to a code to save space. One of the first reasonable things to do is therefore simply to remove them from the text.

To do this, we can simply remove every sequence of characters starting with http.

To remove all substrings matching a certain pattern in a string, we use regular expressions.
In Python, we will use the function sub() from the library re, which lets us work with regular expressions. This function replaces each occurrence of a substring matching a pattern with another specified string. In our case, as we want to remove the URL, the replacement will be an empty string, i.e. ''.
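As a minimal illustration of how re.sub() works (using a made-up string rather than one of the tweets above):

```python
import re

# Every occurrence of the pattern 'cat' is replaced by an empty string.
text = 'one cat, two cats'
print(re.sub(r'cat', '', text))  # → one , two s
```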

The regular expression we use here is: https?://[^ ]+. The part http is there because we expect a URL to start with those four characters; then we add s?, because some URLs use ‘https’ but not all of them. Then we match :// literally (in Python's regex syntax, / needs no escaping). We finish with [^ ]+, meaning any character except a space, one or more times.

Change the string in tweet_1 to see how the transformation changes.

import re

print(tweet_1)
transf = re.sub(r'https?://[^ ]+', '', tweet_1)
print(transf)
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. https://t.co/hrBnP04HnB
Hospitalizations from COVID-19 have increased nearly 90% and California officials say they could triple by Christmas. 
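Note that the substitution leaves a trailing space behind. If that matters for later steps, one possible sketch is to chain the built-in str.strip() after the substitution (shown here on a shortened excerpt of the tweet):

```python
import re

tweet = 'They could triple by Christmas. https://t.co/hrBnP04HnB'
# Remove the URL, then trim whitespace left at the edges.
transf = re.sub(r'https?://[^ ]+', '', tweet).strip()
print(transf)  # → They could triple by Christmas.
```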

Remove usernames

Just as with URLs, a username in a tweet won’t give any valuable information, because it won’t be recognized as a word carrying meaning. We will therefore remove it as well.

Specifically on Twitter, all usernames start with the character @. To remove them, we only have to remove every sequence of characters starting with @.

The regular expression here will be: @[^ ]+, matching any sequence that starts with @ and runs up to the next space.

print(tweet_2)
transf = re.sub(r'@[^ ]+', '', tweet_2)
print(transf)
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching  https://t.co/rHovIA3u5e

Hashtags

Hashtags are hard to handle, but they usually contain useful information about the context and content of a tweet. The problem with hashtags is that the words are written one after the other, without spaces. This kind of token is hard to interpret with a basic word-extraction algorithm. However, most of the time, hashtags consist of only one word, preceded by the symbol #. It can then be useful to keep the part following the #. If the hashtag is made of two or more words, it will remain as noise in the data.

To deal with hashtags, we only remove the character #.

The regular expression is very simple in that case: #.

print(tweet_3)
transf = re.sub(r'#', '', tweet_3)
print(transf)
This happened in Adelaide the other day. #koala #Adelaide https://t.co/vAQFkd5r7q
This happened in Adelaide the other day. koala Adelaide https://t.co/vAQFkd5r7q
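Multi-word hashtags are often written in CamelCase. Going one step beyond simply dropping the #, a possible sketch (using a made-up hashtag) inserts a space before every capital letter that follows a lower-case one:

```python
import re

tag = '#NewYearsEve'
no_hash = re.sub(r'#', '', tag)                       # drop the # symbol
words = re.sub(r'([a-z])([A-Z])', r'\1 \2', no_hash)  # split CamelCase
print(words)  # → New Years Eve
```

This only helps for CamelCase hashtags; all-lower-case multi-word hashtags still stay as noise.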

Character normalization

As Twitter is used mostly informally, irregularly written words are very common. One example is repeating characters to emphasize a statement, for example: “It starts todaaaaaaaaay”. The word “todaaaaaaaaay” won’t be recognized by our algorithm, while the word “today” would be, and could convey important information.

We can replace each character that is repeated more than twice in a row with a single occurrence.

Here, we use a regular expression to match a letter repeated more than twice: ([A-Za-z])\1{2,}. This one is a bit more complicated than the previous ones. First, we use [A-Za-z] to match a single letter. This group is placed between parentheses so that we can later write \1, which matches the same character that was captured, and not just any letter. Finally, we add {2,} to require at least two further repetitions, i.e. the letter must appear three or more times in a row.

In the function re.sub(), as the second parameter, we use r'\1' to replace the whole matched run with the single captured character.

Change the value of string to see what happens with other examples.

string = 'todaaaaaaaaay'
print(string)
print(re.sub(r'([A-Za-z])\1{2,}', r'\1', string))
todaaaaaaaaay
today
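Note that {2,} only triggers on three or more identical letters in a row, so legitimate double letters as in “cool” are left alone. A common variant, sketched here, replaces the run with two characters instead of one, so that words whose correct spelling contains a double letter are not over-shortened:

```python
import re

# Collapse runs of 3+ identical letters down to two instead of one.
print(re.sub(r'([A-Za-z])\1{2,}', r'\1\1', 'coolllll'))   # → cooll
print(re.sub(r'([A-Za-z])\1{2,}', r'\1\1', 'todaaaaay'))  # → todaay
```

Neither rule recovers the exact spelling in every case (“cooll” and “todaay” are still not real words); choosing one or two kept characters is a trade-off.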

Punctuation, special characters and numbers

In the same way, punctuation and single characters do not add any information with the method we use to process the text, as the algorithm for sentiment analysis only detects words.
The same goes for numbers: they are not processed, understandably, as they do not represent a sentiment. An exception could be made for the number 0, as it can convey a negative sense. To account for that, we keep the number 0 and translate it into its textual form, “zero”.

We decide to detect every standalone 0, transform it into zero, and otherwise keep only letters. This has the effect of getting rid of all the special characters and digits.

First, we change the zeros. For that, we select every 0 that is preceded and followed by a space, so that we only catch standalone zeros. This is simply done with the regular expression ' 0 ', replaced with ' zero ' (keeping the surrounding spaces).

Then, it is easy to remove all characters that are not letters or blank spaces: [^A-Za-z ], where ^ at the start of the character class negates it, matching every character that is not listed.

print(tweet_2)
transf = re.sub(r' 0 ', ' zero ', tweet_2)
transf = re.sub(r'[^A-Za-z ]', '', transf)
print(transf)
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e
Something for the afternoon slump  journey home  after school  cooking dinner  a special  minute mix of cool Christmas tunes intercut with Christmas film samples and scratching BBCSounds httpstcorHovIAue
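Since tweet_2 contains no standalone 0, the zero replacement has no visible effect above. A made-up example string shows both steps working in sequence:

```python
import re

string = 'I got 0 replies to 10 tweets'
transf = re.sub(r' 0 ', ' zero ', string)   # only standalone zeros
transf = re.sub(r'[^A-Za-z ]', '', transf)  # drop remaining digits and symbols
print(transf)  # → I got zero replies to  tweets
```

The 0 in 10 is untouched by the first substitution and simply removed by the second.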

Lower casing

To make sure that all words are analyzed equally, we cannot keep the distinction between upper and lower case. To that end, we transform all capital letters into their lower-case equivalents.

This is a very simple transformation in Python, as there is a built-in method for it. No need for regular expressions here; we just use string.lower().

print(tweet_2)
transf = tweet_2.lower()
print(transf)
Something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool Christmas tunes intercut with Christmas film samples and scratching @BBCSounds https://t.co/rHovIA3u5e
something for the afternoon slump / journey home / after school / cooking dinner ... a special 30 minute mix of cool christmas tunes intercut with christmas film samples and scratching @bbcsounds https://t.co/rhovia3u5e
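The steps above can be combined into a single helper. This is only a sketch: the function name preprocess_tweet is our own choice, and a final strip() is added to trim spaces left at the edges.

```python
import re

def preprocess_tweet(tweet):
    # Apply all preprocessing steps from this page, in order.
    tweet = re.sub(r'https?://[^ ]+', '', tweet)       # remove URLs
    tweet = re.sub(r'@[^ ]+', '', tweet)               # remove usernames
    tweet = re.sub(r'#', '', tweet)                    # drop the hashtag symbol
    tweet = re.sub(r'([A-Za-z])\1{2,}', r'\1', tweet)  # normalize repeated letters
    tweet = re.sub(r' 0 ', ' zero ', tweet)            # spell out standalone zeros
    tweet = re.sub(r'[^A-Za-z ]', '', tweet)           # keep only letters and spaces
    return tweet.lower().strip()                       # lower-case, trim edge spaces

print(preprocess_tweet('This happened in Adelaide the other day. #koala #Adelaide https://t.co/vAQFkd5r7q'))
# → this happened in adelaide the other day koala adelaide
```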

Now that we have gone through the simple techniques for tweet preprocessing, we can move on to the next page to see some more advanced algorithms.

Quiz

from IPython.display import IFrame
IFrame("https://blog.hoou.de/wp-admin/admin-ajax.php?action=h5p_embed&id=66", "959", "309")
