12 Twitter Sentiment Analysis Algorithms Compared


Sentiment analysis is used to determine whether the sentiment expressed in a piece of text is positive, negative, or neutral.  It is a form of natural language processing (NLP) and belongs to a subcategory of NLP techniques known as information extraction.

One job of a data scientist is to choose the optimal algorithm for a given task.  Often, the best approach is to try many different algorithms to see what works best.

In this article, I’ll compare a dozen sentiment analysis techniques using the Python programming language, including out-of-the-box sentiment analysis services from Google and Amazon.  The Python code for each of these techniques can be found in this GitHub repository.

Twitter Sentiment Analysis Data Source

The dataset used for the analyses in this article is a set of tweets regarding consumer impressions of airline performance.  Each tweet is labeled as having a positive, negative, or neutral sentiment.  Each of the 12 sentiment analysis algorithms computes a sentiment for each tweet, and the computed sentiment is then scored as correct or incorrect against the label in the dataset.  The overall measure used is accuracy: the percentage of tweets that the algorithm classified correctly.
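As a minimal sketch of this scoring, accuracy is simply the fraction of computed sentiments that match the dataset labels (the variable names here are illustrative, not taken from the repository):

```python
from sklearn.metrics import accuracy_score

# true_labels: sentiments recorded in the airline-tweet dataset
# predicted_labels: sentiments computed by one of the 12 algorithms
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {accuracy:.0%}")
```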

Sentiment Analysis Algorithms

The 12 sentiment analysis algorithms can be broken down into four categories:

  • Use of sentiment lexicons
  • Off-the-shelf sentiment analysis systems including Amazon Comprehend, Google Cloud Services, and the Stanford CoreNLP system
  • Classical machine learning algorithms
  • Deep learning algorithms

Sentiment Lexicons

The first algorithm compares each word in a tweet to a database of words that are labeled as having positive or negative sentiment.  There are many such databases.  For this analysis, I downloaded a list of positive and negative sentiment words from Kaggle.

Before comparing the words in a tweet to the list of positive and negative words, it is first necessary to split the tweet into a list of tokens (mostly words).  This was done using the NLTK word-tokenizer.  NLTK is one of the more popular natural language processing toolkits for the Python language.

Each token is then run through a pipeline (i.e., a series of code conversions) that modifies or removes tokens.  The steps used in the pipeline are listed below, followed by a code sketch:

  • Convert to lower case
  • Remove @ mentions in tweets
  • Remove hyperlinks
  • Expand contractions (e.g. convert “won’t” to “will” and “not”)
  • Remove punctuation
  • Convert each token into its base form, a process known as lemmatization.  For example, “moving” is converted to “move”, and “feet” is converted to “foot”.  The WordNet Lemmatizer available in NLTK was used for this purpose.  This lemmatizer takes as input a token and whether it is a verb, noun, or adjective, a part-of-speech tag obtained with NLTK’s part-of-speech tagger.
  • Finally, all common words like “a” and “the” that don’t contribute to the sentiment are removed.  This list of “stop words” was obtained from the NLTK stopwords corpus.
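A minimal sketch of this pipeline using NLTK is shown below; the contraction handling and regular expressions are simplified illustrations rather than the exact code from the repository (the NLTK punkt, averaged_perceptron_tagger, wordnet, and stopwords resources must be downloaded first):

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags from nltk.pos_tag to the tags the WordNet lemmatizer expects
    if treebank_tag.startswith("V"):
        return "v"
    if treebank_tag.startswith("J"):
        return "a"
    return "n"

def preprocess(tweet):
    text = tweet.lower()                                              # convert to lower case
    text = re.sub(r"@\w+", "", text)                                  # remove @ mentions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                 # remove hyperlinks
    text = text.replace("won't", "will not").replace("n't", " not")   # simplified contraction expansion
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    lemmas = [lemmatizer.lemmatize(tok, wordnet_pos(tag)) for tok, tag in tagged]
    return [tok for tok in lemmas if tok not in stop_words]           # drop stop words
```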

Each word in the positive and negative lists was also run through this pipeline in order to effect an “apples to apples” comparison. A tweet with more positive words than negative words was scored as positive, one with more negative words was scored as negative, and a tweet with no positive or negative words (or an equal number of each) was scored as neutral.
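A sketch of this counting rule, assuming positive_words and negative_words are the preprocessed lexicon word sets:

```python
def lexicon_sentiment(tokens, positive_words, negative_words):
    # Count how many tokens appear in each lexicon and compare the totals
    pos = sum(token in positive_words for token in tokens)
    neg = sum(token in negative_words for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```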

This approach produced an accuracy of 46% on a little over 15,000 tweets.  Chance accuracy is 33%.

Twitter Sentiment Analysis Using Off-The-Shelf Systems

The second category of algorithms is off-the-shelf systems that don’t require any preprocessing of the data.  You supply the text and the system calculates the sentiment.  I tested the sentiment analysis services from Google Cloud and Amazon Comprehend.
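As a rough sketch of what calling these two services looks like (this assumes the google-cloud-language and boto3 client libraries are installed and that credentials for both clouds are already configured):

```python
import boto3
from google.cloud import language_v1

tweet = "My flight was delayed three hours and nobody told us anything."

# Google Cloud Natural Language: returns a continuous score (negative to positive)
# that still has to be thresholded into positive / negative / neutral labels.
gcp_client = language_v1.LanguageServiceClient()
document = language_v1.Document(content=tweet, type_=language_v1.Document.Type.PLAIN_TEXT)
gcp_score = gcp_client.analyze_sentiment(request={"document": document}).document_sentiment.score

# Amazon Comprehend: returns a label directly (POSITIVE, NEGATIVE, NEUTRAL, or MIXED).
comprehend = boto3.client("comprehend")
aws_label = comprehend.detect_sentiment(Text=tweet, LanguageCode="en")["Sentiment"]
```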

These services are at a bit of a disadvantage relative to the machine learning algorithms discussed below because they have to work for all types of text.  In contrast, the machine learning algorithms have the opportunity to learn what makes tweets different from reviews and other text.

In this section, I also tested Stanford’s CoreNLP sentiment analyzer.  This tool is at even more of a disadvantage because it first tries to analyze the syntactic structure of each sentence.  However, tweets are often ungrammatical, so it wasn’t surprising that this tool didn’t perform well.

Each of these tools was also tested with a little more than 15,000 tweets.  The Google sentiment analysis tool did best at 59%, with the Amazon tool close behind at 58% and the Stanford tool at 47%.  The Google and Amazon tools performed much better than the sentiment lexicon algorithm.

Twitter Sentiment Analysis Using Machine Learning Algorithms

Machine learning algorithms for sentiment analysis should be the best performers because they have the opportunity to tailor their decision-making to a specific type of data like tweets or reviews.

However, machine learning algorithms require much larger datasets than either the sentiment lexicon algorithm or the off-the-shelf algorithms.  In addition to the test set of tweets, there must also be a set of training data.

To create the training and test sets, I started with 30,000 tweets.  Each of these was then preprocessed using the pipeline discussed above in the sentiment lexicon section.

The standard approach is to then split the data into a training set and a test set.  I used a 70%/30% train/test split.

However, the distribution of positive, negative, and neutral tweets in the training set was far from even.  There were far more positive tweets than negative or neutral tweets.  This imbalanced data would likely have led the machine learning systems to discover that most of the tweets were positive and to rely on guessing positive for every tweet.

To counter this, I used SMOTE oversampling to add synthetic negative and neutral examples to the training set so that it contained approximately 12,000 positive, 12,000 negative, and 12,000 neutral tweets.  It should be noted that it is critical to do the oversampling after the train-test split.  If it is done before the split, synthetic examples derived from the same original tweets end up in both the training and test sets, which would lead to misleadingly high accuracy numbers.
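A minimal sketch of this step using imbalanced-learn’s SMOTE, applied only to the training portion of the data (X here is the bag-of-words feature matrix described in the next paragraph):

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Split first (70% train / 30% test), then oversample only the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```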

Another modification that was necessary for machine learning was to transform the tokens in each tweet into a set of features that could be analyzed by the machine learning algorithms.  There are many ways to do this, but I chose to use a bag-of-words (BOW) approach.  The features in the BOW approach were the 2000 most common words in the tweets, so each tweet had 2000 features.  Each feature value was simply the number of times the word appeared in the tweet.  Of course, the feature vector for each tweet was quite sparse, i.e. most of the features had a zero value.
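This representation can be built with scikit-learn’s CountVectorizer, limiting the vocabulary to the 2,000 most common words; the exact settings in the repository may differ:

```python
from sklearn.feature_extraction.text import CountVectorizer

# tweets_clean: the preprocessed tweets, each joined back into a single string
vectorizer = CountVectorizer(max_features=2000)
X = vectorizer.fit_transform(tweets_clean)  # sparse matrix: rows are tweets, columns are word counts
```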

I then input these BOW features to several machine learning algorithms using scikit-learn (a minimal training sketch appears after the list), including:

  • Naive Bayes:  This algorithm is known to work well for many text classification problems and requires relatively few training examples.
  • Support Vector Machine:  Like Naive Bayes classifiers, support vector classifiers also work well for text classification and require relatively few training examples.
  • Decision Tree:  Decision trees often do a good job of learning to classify and have the additional benefit of producing easily explainable results in the form of a readable tree of decisions.
  • XGBoost:  This algorithm uses an ensemble of decision trees built with gradient boosting.  It is known to be fast and often achieves very high accuracy.  However, it is not as interpretable as a single decision tree.
  • k-Nearest Neighbors:  This algorithm works by finding the training examples closest to the test example and taking a majority vote of their labels.

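Below is a minimal sketch of fitting and scoring these classifiers on the balanced training data; the hyperparameters are library defaults and may differ from those used in the repository:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Labels are assumed to be integer-encoded (0 = negative, 1 = neutral, 2 = positive),
# which XGBClassifier requires for multi-class problems.
classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVC": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "XGBoost": XGBClassifier(),
    "k-Nearest Neighbors": KNeighborsClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train_bal, y_train_bal)
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {accuracy:.1%}")
```
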
The highest accuracies among these ML classifiers were achieved by XGBoost and Naive Bayes, both at 73%.  The Linear SVC algorithm was close behind at 71%.  Decision trees came in at 63%, and k-nearest neighbors was far behind at 38%.

Twitter Sentiment Analysis Using Deep Learning

Deep learning algorithms often outperform the more classical machine learning algorithms discussed in the previous section.  However, they often require far more data.  Nonetheless, for comparison purposes, I used the same training and test data for the deep learning algorithms as for the machine learning algorithms in the previous section.

Three deep learning algorithms were tested:

  • Keras:  Keras is an easy-to-use layer on top of TensorFlow and other deep learning frameworks.  I used a 3-layer sequential network (a sketch appears after this list).  I tried 10, 20, and 50 epochs, though there was little difference in accuracy.
  • fastText:  fastText is an NLP library developed by Facebook AI Research.  It is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.  It works on standard, generic hardware, and models can later be reduced in size to fit even on mobile devices.  Here also, I used 10, 20, and 50 epochs and found little difference in accuracy.
  • DistilBERT:  DistilBERT is a smaller, faster version of the well-known BERT language model.  The last layer of a pre-trained DistilBERT model was used as a set of feature inputs to a logistic regression classifier, following the basic scheme used in this Google Colab demonstration notebook.  This was also the only algorithm that I did not run on my own computer, as it was too resource-intensive.  Instead, I ran it on my Google Colab Pro account.  Even there, I had to limit the dataset to 10,000 tweets to avoid running out of memory.
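A minimal sketch of the 3-layer Keras network trained on the 2,000-dimensional bag-of-words features; the layer sizes, optimizer, and epoch count are illustrative assumptions rather than the exact configuration from the repository:

```python
from tensorflow import keras

# Bag-of-words features (2,000 dimensions) and integer-encoded labels
# (0 = negative, 1 = neutral, 2 = positive) are assumed.
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(2000,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Keras expects dense arrays, so the sparse BOW matrix is densified here.
model.fit(X_train_bal.toarray(), y_train_bal, epochs=20, validation_split=0.1)
```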

The fastText algorithm performed the best with 71% accuracy.  The Keras and DistilBERT networks both scored 68%.

The Winner

The XGBoost and Naive Bayes algorithms were tied for the highest accuracy of the 12 Twitter sentiment analysis approaches tested.  There might not have been enough data for optimal performance from the deep learning systems.  That said, I’ve seen XGBoost outperform deep learning systems in at least one other bake-off.

There are likely many ways to improve the overall performance.  The DistilBERT approach probably would have performed better if I had had enough memory available to run the full dataset.  Moreover, instead of just using the pretrained features, one could do further training to fine-tune the system on the tweet dataset.  Finally, instead of using the BOW approach, using word embeddings as features might produce better overall accuracy.  Perhaps this will be the subject of a future post.