12 Twitter Sentiment Analysis Algorithms Compared


Sentiment analysis is used to determine whether the sentiment expressed in a piece of text is positive, negative, or neutral.  It is a form of natural language processing (NLP) and belongs to a subcategory of NLP techniques known as information extraction.

One job of a data scientist is to choose the optimal algorithm for a given task.  Often, the best approach is to try many different algorithms to see what works best.

In this article, I’ll compare a dozen sentiment analysis techniques using the Python programming language, including out-of-the-box sentiment analysis services from Google and Amazon.  The Python code for each of these techniques can be found in this GitHub repository.

Twitter Sentiment Analysis Data Source

The dataset used for the analyses in this article is a set of tweets regarding consumer impressions of airline performance.  Each tweet is labeled as having a positive, negative, or neutral sentiment.  Each of the 12 sentiment analysis algorithms computes a sentiment for each tweet, and the computed sentiment is then scored as correct or incorrect against the label in the dataset.  The overall measure used is accuracy: the percentage of tweets that the algorithm classified correctly.
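As a minimal sketch of this scoring, accuracy is simply the fraction of computed sentiments that match the dataset labels (the variable names here are illustrative, not taken from the repository):

```python
from sklearn.metrics import accuracy_score

# true_labels: sentiments recorded in the airline-tweet dataset
# predicted_labels: sentiments computed by one of the 12 algorithms
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {accuracy:.0%}")
```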

Sentiment Analysis Algorithms

The 12 sentiment analysis algorithms can be broken down into four categories:

  • Use of sentiment lexicons
  • Off-the-shelf sentiment analysis systems including Amazon Comprehend, Google Cloud Services, and the Stanford CoreNLP system
  • Classical machine learning algorithms
  • Deep learning algorithms

Sentiment Lexicons

The first algorithm compares each word in a tweet to a database of words that are labeled as having positive or negative sentiment.  There are many such databases.  For this analysis, I downloaded a list of positive and negative sentiment words from Kaggle.

Before comparing the words in a tweet to the list of positive and negative words, it is first necessary to split the tweet into a list of tokens (mostly words).  This was done using the NLTK word-tokenizer.  NLTK is one of the more popular natural language processing toolkits for the Python language.

Each token is then run through a pipeline (i.e., a series of code conversions) that modifies or removes tokens.  The steps used in the pipeline are listed below, followed by a code sketch:

  • Convert to lower case
  • Remove @ mentions in tweets
  • Remove hyperlinks
  • Expand contractions (e.g. convert “won’t” to “will” and “not”)
  • Remove punctuation
  • Convert each token into its base form, a process known as lemmatization.  For example, “moving” is converted to “move”, and “feet” is converted to “foot”.  The WordNet Lemmatizer available in NLTK was used for this purpose.  This lemmatizer takes as input a token and whether it is a verb, noun, or adjective, a part-of-speech tag obtained with NLTK’s part-of-speech tagger.
  • Finally, all common words like “a” and “the” that don’t contribute to the sentiment are removed.  This list of “stop words” was obtained from the NLTK stopwords corpus.
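A minimal sketch of this pipeline using NLTK is shown below; the contraction handling and regular expressions are simplified illustrations rather than the exact code from the repository (the NLTK punkt, averaged_perceptron_tagger, wordnet, and stopwords resources must be downloaded first):

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags from nltk.pos_tag to the tags the WordNet lemmatizer expects
    if treebank_tag.startswith("V"):
        return "v"
    if treebank_tag.startswith("J"):
        return "a"
    return "n"

def preprocess(tweet):
    text = tweet.lower()                                              # convert to lower case
    text = re.sub(r"@\w+", "", text)                                  # remove @ mentions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                 # remove hyperlinks
    text = text.replace("won't", "will not").replace("n't", " not")   # simplified contraction expansion
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    lemmas = [lemmatizer.lemmatize(tok, wordnet_pos(tag)) for tok, tag in tagged]
    return [tok for tok in lemmas if tok not in stop_words]           # drop stop words
```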

Each word in the positive and negative lists was also run through this pipeline in order to effect an “apples to apples” comparison. A tweet with more positive words than negative words was scored as positive, one with more negative words was scored as negative, and a tweet with no positive or negative words (or an equal number of each) was scored as neutral.
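A sketch of this counting rule, assuming positive_words and negative_words are the preprocessed lexicon word sets:

```python
def lexicon_sentiment(tokens, positive_words, negative_words):
    # Count how many tokens appear in each lexicon and compare the totals
    pos = sum(token in positive_words for token in tokens)
    neg = sum(token in negative_words for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```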

This approach produced an accuracy of 46% on a little over 15,000 tweets.  Chance accuracy is 33%.

Twitter Sentiment Analysis Using Off-The-Shelf Systems

The second category of algorithms is off-the-shelf systems that don’t require any preprocessing of the data.  You supply the text and the system calculates the sentiment.  I tested the sentiment analysis services from Google Cloud and Amazon Comprehend.
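As a rough sketch of what calling these two services looks like (this assumes the google-cloud-language and boto3 client libraries are installed and that credentials for both clouds are already configured):

```python
import boto3
from google.cloud import language_v1

tweet = "My flight was delayed three hours and nobody told us anything."

# Google Cloud Natural Language: returns a continuous score (negative to positive)
# that still has to be thresholded into positive / negative / neutral labels.
gcp_client = language_v1.LanguageServiceClient()
document = language_v1.Document(content=tweet, type_=language_v1.Document.Type.PLAIN_TEXT)
gcp_score = gcp_client.analyze_sentiment(request={"document": document}).document_sentiment.score

# Amazon Comprehend: returns a label directly (POSITIVE, NEGATIVE, NEUTRAL, or MIXED).
comprehend = boto3.client("comprehend")
aws_label = comprehend.detect_sentiment(Text=tweet, LanguageCode="en")["Sentiment"]
```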

These services are at a bit of a disadvantage relative to the machine learning algorithms discussed below because they have to work for all types of text.  In contrast, the machine learning algorithms have the opportunity to learn what makes tweets different from reviews and other text.

In this section, I also tested Stanford’s CoreNLP sentiment analyzer.  This tool is at even more of a disadvantage because it first tries to analyze the syntactic structure of each sentence.  However, tweets are often ungrammatical, so it wasn’t surprising that this tool didn’t perform well.

Each of these tools was also tested with a little more than 15,000 tweets.  The Google sentiment analysis tool did best at 59%, with the Amazon tool close behind at 58% and the Stanford tool at 47%.  The Google and Amazon tools performed much better than the sentiment lexicon algorithm.

Twitter Sentiment Analysis Using Machine Learning Algorithms

Machine learning algorithms for sentiment analysis should be the best performers because they have the opportunity to tailor their decision-making to a specific type of data like tweets or reviews.

However, machine learning algorithms require much larger datasets than either the sentiment lexicon algorithm or the off-the-shelf algorithms.  In addition to the test set of tweets, there must also be a set of training data.

To create the training and test sets, I started with 30,000 tweets.  Each of these was then preprocessed using the pipeline discussed above in the sentiment lexicon section.

The standard approach is to then split the data into a training set and a test set.  I used a 70%/30% train/test split.

However, the distribution of positive, negative, and neutral tweets in the training set was far from even.  There were far more positive tweets than negative or neutral tweets.  This imbalanced data would likely have led the machine learning systems to discover that most of the tweets were positive and to rely on guessing positive for every tweet.

To counter this, I used SMOTE oversampling to add synthetic negative and neutral examples to the training set so that it contained approximately 12,000 positive, 12,000 negative, and 12,000 neutral tweets.  It should be noted that it is critical to do the oversampling after the train-test split.  If it is done before the split, synthetic examples derived from the same original tweets end up in both the training and test sets, which would lead to misleadingly high accuracy numbers.
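A minimal sketch of this step using imbalanced-learn’s SMOTE, applied only to the training portion of the data (X here is the bag-of-words feature matrix described in the next paragraph):

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first (70% train / 30% test), then oversample only the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```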

Another modification that was necessary for machine learning was to transform the tokens in each tweet into a set of features that could be analyzed by the machine learning algorithms.  There are many ways to do this, but I chose to use a bag-of-words (BOW) approach.  The features in the BOW approach were the 2000 most common words in the tweets, so each tweet had 2000 features.  Each feature value was simply the number of times the word appeared in the tweet.  Of course, the feature vector for each tweet was quite sparse, i.e. most of the features had a zero value.
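This representation can be built with scikit-learn’s CountVectorizer, limiting the vocabulary to the 2,000 most common words; the exact settings in the repository may differ:

```python
from sklearn.feature_extraction.text import CountVectorizer

# tweets_clean: the preprocessed tweets, each joined back into a single string
vectorizer = CountVectorizer(max_features=2000)
X = vectorizer.fit_transform(tweets_clean)  # sparse matrix: rows are tweets, columns are word counts
```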

I then input these BOW features to several machine learning algorithms using scikit-learn (a minimal training sketch appears after the list), including:

  • Naive Bayes:  This algorithm is known to work well for many text classification problems and requires relatively few training examples.
  • Support Vector Machine:  Like Naive Bayes classifiers, support vector classifiers also work well for text classification and require relatively few training examples.
  • Decision Tree:  Decision trees often do a good job of learning to classify and have the additional benefit of producing easily explainable results in the form of a readable tree of decisions.
  • XGBoost:  This algorithm uses an ensemble of decision trees built with gradient boosting.  It is known to be fast and often achieves very high accuracy.  However, it is not as interpretable as a single decision tree.
  • k-Nearest Neighbors:  This algorithm works by finding the training examples closest to the test example and taking a majority vote of their labels.

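Below is a minimal sketch of fitting and scoring these classifiers on the balanced training data; the hyperparameters are library defaults and may differ from those used in the repository:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Labels are assumed to be integer-encoded (0 = negative, 1 = neutral, 2 = positive),
# which XGBClassifier requires for multi-class problems.
classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVC": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "XGBoost": XGBClassifier(),
    "k-Nearest Neighbors": KNeighborsClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train_bal, y_train_bal)
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {accuracy:.1%}")
```
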
The highest accuracies among these ML classifiers were achieved by XGBoost and Naive Bayes, both at 73%.  The Linear SVC algorithm was close behind at 71%.  Decision trees came in at 63%, and k-nearest neighbors was far behind at 38%.

Twitter Sentiment Analysis Using Deep Learning

Deep learning algorithms often outperform the more classical machine learning algorithms discussed in the previous section.  However, they often require far more data.  Nonetheless, for comparison purposes, I used the same training and test data for the deep learning algorithms as for the machine learning algorithms in the previous section.

Three deep learning algorithms were tested:

  • Keras:  Keras is an easy-to-use layer on top of TensorFlow and other deep learning frameworks.  I used a 3-layer sequential network (a sketch appears after this list).  I tried 10, 20, and 50 epochs, though there was little difference in accuracy.
  • fastText:  fastText is an NLP library developed by Facebook AI Research.  It is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.  It works on standard, generic hardware, and models can later be reduced in size to fit even on mobile devices.  Here also, I used 10, 20, and 50 epochs and found little difference in accuracy.
  • DistilBERT:  DistilBERT is a smaller, faster version of the well-known BERT language model.  The last layer of a pre-trained DistilBERT model was used as a set of feature inputs to a logistic regression classifier, following the basic scheme used in this Google Colab demonstration notebook.  This was also the only algorithm that I did not run on my own computer, as it was too resource-intensive.  Instead, I ran it on my Google Colab Pro account.  Even there, I had to limit the dataset to 10,000 tweets to avoid running out of memory.
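A minimal sketch of the 3-layer Keras network trained on the 2,000-dimensional bag-of-words features; the layer sizes, optimizer, and epoch count are illustrative assumptions rather than the exact configuration from the repository:

```python
from tensorflow import keras

# Bag-of-words features (2,000 dimensions) and integer-encoded labels
# (0 = negative, 1 = neutral, 2 = positive) are assumed.
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(2000,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Keras expects dense arrays, so the sparse BOW matrix is densified here.
model.fit(X_train_bal.toarray(), y_train_bal, epochs=20, validation_split=0.1)
```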

The fastText algorithm performed the best with 71% accuracy.  The Keras and DistilBERT networks both scored 68%.

The Winner

The XGBoost and Naive Bayes algorithms were tied for the highest accuracy of the 12 Twitter sentiment analysis approaches tested.  There might not have been enough data for optimal performance from the deep learning systems.  That said, I’ve seen XGBoost outperform deep learning systems in at least one other bake-off.

There are likely many ways to improve the overall performance.  The DistilBERT approach probably would have performed better if I had had enough memory available to run the full dataset.  Moreover, instead of just using the pretrained features, one could do further training to fine-tune the system on the tweet dataset.  Finally, instead of using the BOW approach, using word embeddings as features might produce better overall accuracy.  Perhaps this will be the subject of a future post.