Photo by Nitin Sharma from Pexels


How to take your model from unremarkable to amazing simply by cleaning and preprocessing your data

Data cleaning done right will change your life.

If you have a model that has acceptable results but isn’t amazing, take a look at your data! Taking the time to clean and preprocess your data the right way can make your model a star.

Photo by Burst from Pexels

In order to look at scraping and preprocessing in more detail, let’s look at some of the work that went into “You Are What You Tweet: Detecting Depression in Social Media via Twitter Usage.” That way, we can really examine the process of scraping Tweets and then cleaning and preprocessing them. We’ll also do a little exploratory visualization, which is an awesome way to get a better sense of what your data looks like! We’re going to do some of the most basic cleaning and preprocessing work here: it’s up to you to really get these Tweets in order when you’re building your model!



A little background

More than 300 million people suffer from depression and only a fraction receive adequate treatment. Depression is the leading cause of disability worldwide and nearly 800,000 people every year die due to suicide. Suicide is the second leading cause of death in 15–29-year-olds. Diagnoses (and subsequent treatment) for depression are often delayed, imprecise, and/or missed entirely.

It doesn’t have to be this way! Social media provides an unprecedented opportunity to transform early depression intervention services, particularly in young adults.

Every second, approximately 6,000 Tweets are posted on Twitter, which corresponds to over 350,000 Tweets per minute, 500 million Tweets per day, and around 200 billion Tweets per year. Pew Research Center states that currently, 72% of the public uses some type of social media. This project captures and analyzes linguistic markers associated with the onset and persistence of depressive symptoms in order to build an algorithm that can effectively predict depression. By building an algorithm that can analyze Tweets exhibiting self-assessed depressive features, it will be possible for individuals, parents, caregivers, and medical professionals to analyze social media posts for linguistic clues that signal deteriorating mental health far earlier than traditional approaches currently do. Analyzing linguistic markers in social media posts also allows for a low-profile assessment that can complement traditional services.


Where do we start?

We need data!


Photo by Quang Nguyen Vinh from Pexels


Gathering Data


Building a depression detector required two kinds of Tweets: random Tweets that do not necessarily indicate depression, and Tweets suggesting that the user may have depression and/or depressive symptoms. A dataset of random Tweets can be sourced from the Sentiment140 dataset available on Kaggle, but for this binary classification model, a dataset that builds on Sentiment140 and offers a set of binary labels proved to be the most effective for building a robust model. There are no publicly available datasets of Tweets indicating depression, so “depressive” Tweets were retrieved using the Twitter scraping tool TWINT. Tweets were collected by searching for terms specifically related to depression, drawing on the unigrams identified by De Choudhury et al. The scraped Tweets were then manually checked for relevance (for example, keeping Tweets about emotional rather than economic or atmospheric depression), cleaned, and processed.

TWINT is a remarkably simple tool to use! 

You can download it right from the command line with:

pip install twint

If you want to, for example, search for the term “depression” in Tweets posted since July 20, 2019 and store the data as a CSV under the name “depression,” you would run a command like:

twint -s "depression" --since 2019-07-20 -o depression --csv

Once you’ve gathered the Tweets, you can start cleaning and preprocessing them. You’ll probably wind up with a ton of information that you don’t need, like conversation ids and so on. You may decide to create multiple CSVs that you want to combine. We’ll get to all of that!
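If you plan to repeat the search for several terms, it can help to generate the commands rather than type each one. Here is a minimal sketch, assuming the same flags as the command above; `build_twint_command` is a hypothetical helper (not part of TWINT), and the sketch only prints the commands instead of running them.

```python
import shlex


def build_twint_command(term, since):
    """Return the argument list for one TWINT search (hypothetical helper)."""
    # Same flags as the single command shown earlier: search term, start date,
    # output name, and CSV export.
    return ["twint", "-s", term, "--since", since, "-o", term, "--csv"]


search_terms = ["depression", "hopeless", "lonely"]
commands = [build_twint_command(term, "2019-07-20") for term in search_terms]

for command in commands:
    # Print each command for inspection; to actually run one, you could pass
    # the list to subprocess.run(command, check=True).
    print(shlex.join(command))
```

Keeping each command as a list of arguments avoids quoting problems if a search term ever contains spaces.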


How did the model perform?


At first? Not that impressively. After a basic cleaning and preprocessing of the data, the best results (even after spending time fine-tuning the model) hovered around 80% accuracy.

The reason for that really made sense after I examined word frequency and bigrams. Explore your data! Once I looked at the words themselves, I realized that it was going to take a lot of work to clean and prepare the dataset the right way, and that doing so was an absolute necessity. Part of the cleaning process had to be done manually, so don’t be afraid to get in there and get your hands dirty. It takes time, but it’s worth it!

In the end? The accuracy of the model was evaluated and compared to a binary classification baseline model using logistic regression. The models were analyzed for accuracy and a classification report was run to determine precision and recall scores. The data were split into training, testing, and validation sets and the accuracy for the model was determined based on the model’s performance with the testing data, which were kept separate. While the performance of the benchmark logistic regression model was 64.32% using the same data, learning rate, and epochs, the LSTM model performed significantly better at 97.21%.
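If accuracy, precision, and recall are new to you, here is a small sketch of how they are computed for binary labels. It is not the evaluation code used for this project: the `classification_metrics` function and the labels are made up purely for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up labels, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can hide a model that misses most of the positive class, which is why the classification report also matters.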

So how did we get from the scraped Tweets to the results?

Practice, practice, practice! (And some serious work.)


Photo by DSD from Pexels

Basic Data Cleaning and Preprocessing


Let’s say we scraped Twitter for the search terms “depression,” “depressed,” “hopeless,” “lonely,” “suicide,” and “antidepressant” and saved each set of scraped Tweets under its own search term, so that, for example, the “depression” Tweets live in “depression/tweets.csv” and so on.

We’ll start with a few imports.

import re

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from nltk.tokenize import WordPunctTokenizer

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

tok = WordPunctTokenizer()

We’ll read one of our CSV files and take a look at the head.


hopeless_tweets_df = pd.read_csv('hopeless/tweets.csv')
hopeless_tweets_df.head()



First of all, we should get rid of any information stored in the datasets that isn’t necessary. We don’t need names, ids, conversation ids, geolocations, and so on for this project. We can get those out of there with:


hopeless_tweets_df.drop(['date', 'timezone', 'username', 'name', 'conversation_id', 'created_at', 'user_id', 'place', 'likes_count', 'link', 'retweet', 'quote_url', 'video', 'user_rt_id', 'near', 'geo', 'mentions', 'urls', 'photos', 'replies_count', 'retweets_count'], axis = 1, inplace = True)

Now we have this, which is much easier to deal with!



Now just do that with all of the CSVs you created with your search terms and we can combine our separate datasets into one! 


df_row_reindex = pd.concat([depression_tweets_df, hopeless_tweets_df, lonely_tweets_df, antidepressant_tweets_df, antidepressants_tweets_df, suicide_tweets_df], ignore_index=True)



Before we go any further, let’s drop the duplicates 


depressive_twint_tweets_df = df_row_reindex.drop_duplicates()


And save our dataset as a new CSV!


depressive_twint_tweets_df.to_csv('depressive_unigram_tweets_final.csv')

More Advanced Preprocessing


Before the data could be used in the model, it was necessary to expand contractions and to remove links, hashtags, capitalization, punctuation, and extra whitespace. Negations also needed to be dealt with, which meant creating a dictionary of negations so that negated words could be handled effectively. Additionally, stop words beyond the standard NLTK stop words needed to be removed to make the model more robust. These words included days of the week and their abbreviations, month names, and the word “Twitter,” which surprisingly showed up as a prominently featured word when the word clouds were created. The Tweets were then tokenized and stemmed with NLTK’s PorterStemmer.
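To make those steps concrete, here is a rough sketch of a single cleaning function using only the standard library. The lookup tables are deliberately tiny and hypothetical (the dictionaries actually used for the project are not shown here), the sketch assumes that a hashtag is dropped entirely rather than just its “#” symbol, and stemming is left out.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables for illustration
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "isn't": "is not",
    "i'm": "i am",
}
EXTRA_STOP_WORDS = {"monday", "mon", "january", "jan", "twitter"}


def clean_tweet(text):
    """Lowercase a tweet, strip links/hashtags/punctuation, expand contractions."""
    text = text.lower().replace("’", "'")               # normalize curly apostrophes
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"#\w+", " ", text)                   # remove hashtags (assumption: whole tag)
    # Expand contractions so that negations such as "can't" survive punctuation removal
    text = " ".join(CONTRACTIONS.get(word, word) for word in text.split())
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop the extra stop words and collapse leftover whitespace
    return " ".join(word for word in text.split() if word not in EXTRA_STOP_WORDS)


example = clean_tweet("I can't sleep on Monday #sad https://t.co/abc Twitter!")
print(example)
```

Expanding contractions before stripping punctuation matters: otherwise “can't” turns into “cant” and the negation is lost.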

Let’s take out all of the stuff that isn’t going to help us!

Imports, of course

import re
import itertools
import collections
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import nltk
from nltk import bigrams
from nltk.corpus import stopwords
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Download the tokenizer models and stop word lists
nltk.download(['punkt', 'stopwords'])

analyzer = SentimentIntensityAnalyzer()

%matplotlib inline
%config InlineBackend.figure_format = 'retina'


Read in your new CSV as a Pandas dataframe.

df_new = pd.read_csv('depressive_unigram_tweets_final.csv')


Now let’s see if there are any null values. Let’s clean it up!
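One way to check for and drop null values might look like the sketch below. It runs on a toy dataframe rather than the real scrape; the `tweet` column name matches the one used in this article, while `toy_df` and `cleaned_df` are placeholder names.

```python
import pandas as pd

# Toy stand-in for the scraped data
toy_df = pd.DataFrame({"tweet": ["feeling hopeless today", None, "so lonely", "  "]})

# Count missing values per column
null_counts = toy_df.isnull().sum()
print(null_counts)

# Drop rows with no tweet text, then rows that contain only whitespace
cleaned_df = toy_df.dropna(subset=["tweet"])
cleaned_df = cleaned_df[cleaned_df["tweet"].str.strip() != ""].reset_index(drop=True)
print(cleaned_df)
```

Whitespace-only rows are not null as far as Pandas is concerned, so they need their own check.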



We’ll quickly remove stopwords from the Tweets with


stop_words = set(stopwords.words('english'))
df_new['clean_tweet'] = df_new['tweet'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))

If you want to, you can score the Tweets with VADER sentiment analysis!


df_new['vader_score'] = df_new['clean_tweet'].apply(lambda x: analyzer.polarity_scores(x)['compound'])

From there, you can also create labels. For a binary classification model, you may want a binary labelling system. However, be aware of your data! Sentiment scores alone do not indicate depression and it is far too simplistic to assume that a negative score indicates depression. In fact, anhedonia, or loss of pleasure, is an extremely common symptom of depression. Neutral, or flat, Tweets are at least as likely, if not more likely, to be an indicator of depression and should not be ignored. 

For the purposes of experimentation, you may want to set a sentiment analysis label like this. Feel free to play around with it!


positive_num = len(df_new[df_new['vader_score'] >= 0.05])
negative_num = len(df_new[df_new['vader_score'] < 0.05])
df_new['vader_sentiment_label'] = df_new['vader_score'].map(lambda x: 1 if x >= 0.05 else 0)
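The binary label above lumps neutral and negative Tweets together. If you would rather keep neutral Tweets visible as their own group, a three-way bucket is one option. This is only a sketch for experimentation: `vader_bucket` is a hypothetical helper, and the ±0.05 cut-offs follow the thresholds commonly suggested for VADER compound scores.

```python
def vader_bucket(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'neutral', or 'negative'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores, for illustration only
scores = [0.6, 0.0, -0.4, 0.03]
buckets = [vader_bucket(score) for score in scores]
print(buckets)
```

With a dataframe, the same function could be applied to the `vader_score` column with `.map(vader_bucket)`.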

If you need to, drop what you don’t need


df_new = df_new[['Unnamed: 0', 'vader_sentiment_label', 'vader_score', 'clean_tweet']]


Go ahead and save a csv!



Let’s keep playing!

df_new['text'] = df_new['clean_tweet']


We can remove URLs


def remove_url(txt):
    # Strip URLs and any character that is not a letter, digit, space, or tab
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())

all_tweets_no_urls = [remove_url(tweet) for tweet in df_new['text']]


Now let’s make everything lowercase and split the Tweets.

#lower_case = [word.lower() for word in df_new['text']]

sentences = df_new['text']
words_in_tweet = [tweet.lower().split() for tweet in all_tweets_no_urls]


Data cleaning done manually


It’s not fun and it’s not pretty, but manual cleaning was critical. It took hours, but getting rid of references to things like tropical depressions and economic depressions improved the model. Removing Tweets that were movie titles improved the model (you can see “Suicide Squad” in the bigrams below). Removing quoted news headlines that included the search terms improved the model. It felt like it took an eternity to do, but this step made an enormous difference in the robustness of the model.
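Manual review can’t be automated away, but a simple keyword pass can shrink the pile you have to read. The sketch below is not the process used for this project: the phrase list is a hypothetical starting point based on the examples above, and `split_for_review` only sets Tweets aside for a human to check rather than deleting them.

```python
import re

# Hypothetical phrases based on the examples mentioned above; extend after
# inspecting your own data
OFF_TOPIC_PATTERNS = [
    r"\btropical depression\b",
    r"\beconomic depression\b",
    r"\bsuicide squad\b",
]
OFF_TOPIC_REGEX = re.compile("|".join(OFF_TOPIC_PATTERNS), flags=re.IGNORECASE)


def split_for_review(tweets):
    """Separate tweets into (keep, review) based on the off-topic phrases."""
    keep, review = [], []
    for tweet in tweets:
        if OFF_TOPIC_REGEX.search(tweet):
            review.append(tweet)
        else:
            keep.append(tweet)
    return keep, review


tweets = [
    "Tropical Depression forms off the coast",
    "my depression is getting worse",
    "just watched Suicide Squad again",
]
keep, review = split_for_review(tweets)
print(keep)
print(review)
```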


Exploratory Visualization and Analysis


Now let’s look at character and word frequency!

It’s fairly easy to analyze the most common words found in the dataset. After removing the stop words, it was apparent that certain words appeared much more frequently than others.

Let’s count our most common words!


# List of all words
all_words_no_urls = list(itertools.chain(*words_in_tweet))

# Create counter
counts_no_urls = collections.Counter(all_words_no_urls)



And turn them into a dataframe.


clean_tweets_no_urls = pd.DataFrame(counts_no_urls.most_common(15),
                                    columns=['words', 'count'])



Hmmm. Too many stopwords. Let’s deal with those.

stop_words = set(stopwords.words('english'))
# Remove stop words from each tweet list of words
tweets_nsw = [[word for word in tweet_words if word not in stop_words]
              for tweet_words in words_in_tweet]


Let’s take another look.


all_words_nsw = list(itertools.chain(*tweets_nsw))
counts_nsw = collections.Counter(all_words_nsw)
counts_nsw.most_common(15)


Better, but not great yet. Some of these words don’t tell us much. Let’s make a few more adjustments.

collection_words = ['im', 'de', 'like', 'one']
tweets_nsw_nc = [[word for word in tweet_words if word not in collection_words]
                 for tweet_words in tweets_nsw]



# Flatten list of words in clean tweets

all_words_nsw_nc = list(itertools.chain(*tweets_nsw_nc))

# Create counter of words in clean tweets
counts_nsw_nc = collections.Counter(all_words_nsw_nc)



Much better! Let’s save this as a dataframe.

clean_tweets_ncw = pd.DataFrame(counts_nsw_nc.most_common(15),
                                columns=['words', 'count'])

What does that look like? Let’s visualize it!

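fig, ax = plt.subplots(figsize=(8, 8))

# Plot horizontal bar graph
clean_tweets_ncw.sort_values(by='count').plot.barh(x='words', y='count', ax=ax)

ax.set_title("Common Words Found in Tweets (Without Stop or Collection Words)")

plt.show()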


Let’s look at some bigrams!

from nltk import bigrams

# Create list of lists containing bigrams in tweets
terms_bigram = [list(bigrams(tweet)) for tweet in tweets_nsw_nc]

# View bigrams for the first tweet
terms_bigram[0]

# Flatten list of bigrams in clean tweets
all_bigrams = list(itertools.chain(*terms_bigram))

# Create counter of words in clean bigrams
bigram_counts = collections.Counter(all_bigrams)

bigram_df = pd.DataFrame(bigram_counts.most_common(20),
                         columns=['bigram', 'count'])

bigram_df


Certain bigrams were also extremely common, including smile and wide, appearing 42,185 times, afraid and loneliness, appearing 4,641 times, and feel and lonely, appearing 3,541 times.
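If you’re curious what the bigram counting is doing under the hood, here is a tiny standard-library sketch on made-up tokens. It pairs each word with the word that follows it, which mirrors what a bigram is; `tweet_bigrams` is a hypothetical helper and not the NLTK function.

```python
from collections import Counter


def tweet_bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


# Made-up, already-cleaned tweets for illustration
toy_tweets = [
    ["feel", "lonely", "tonight"],
    ["feel", "lonely", "again"],
    ["smile", "wide"],
]

bigram_counter = Counter(
    pair for tokens in toy_tweets for pair in tweet_bigrams(tokens)
)
top = bigram_counter.most_common(1)
print(top)
```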



This is just the beginning of cleaning, preprocessing, and visualizing the data. We can still do a lot from here before we build our model!

Once the Tweets were cleaned, it was easy to see the difference between the two datasets by creating a word cloud with the cleaned Tweets. With only an abbreviated TWINT Twitter scraping, the differences between the two datasets were clear:


Random Tweet Word Cloud:



Depressive Tweet Word Cloud:



Early in the process, it became clear that the most important part of refining the model to get more accurate results would be the data gathering, cleaning, and preprocessing stage. Until the Tweets were appropriately scraped and cleaned, the model had unimpressive accuracy. By cleaning and processing the Tweets with more care, the accuracy of the model improved to 97%.



If you’re interested in learning about the absolute basics of data cleaning and preprocessing, take a look at this article!



Thanks for reading! As always, if you do anything cool with this information, let everyone know about it in the comments below or reach out any time!
