Given some document (in plain text, preferably), how can we determine the emotional state or sentiment of the writer towards a particular entity at the time of production? Provided we have a sizable corpus of documents — static or streaming — the answer to this question could be useful for several reasons. Ideally, a Sentiment Analysis tool should be able to estimate an answer to the following time-constrained, yet important questions: what’s the present favorability rating for Obama? How likable was the president just before the election? Knowing the answers to these questions could significantly sway political strategies of different organizations. Similarly, knowing answers to several other comparably phrased yet diverse questions could be very useful to a lot of organizations, brands, and people.
Let’s back up a little bit. We’ve been talking about determining the sentiment of a document in a very broad sense. What answer do we expect to get out of Sentiment Analysis? An extended description of the writer’s sentiments at the time of writing or a Yes/No/Neutral answer? It turns out that the latter is easier, and in some cases even preferable. The Yes/No/Neutral answer is much less opaque and much more suitable for analysis. There are probably other answering models but I’ll from hence concentrate on the Yes/No/Neutral answer.
You can consider the Yes/No/Neutral answer as a three-option, one-choice answer. Loosely speaking, according to this model, there are three classes: Yes, No, Neutral. But for reasons to be uncovered shortly, we’ll use only two classes: Yes, No. A document that has been determined to be positive-enough — with positivity above a certain threshold — will be placed in the Yes class. On the other hand, documents negative-enough — with negativity above a certain threshold — will be placed in the No class. Documents not in the Yes/No class can be safely placed in the Neutral class. How do we determine the thresholds? Well, through training our model on pre-marked data. Binary classification is much easier to interpret and more amenable to several statistical models. At this point, we’ve been able to formulate the problem as a binary classification problem. The problem of statistical classification is well-known and well-researched.
Out of a cornucopia of classification methods, the best methods (judged by popularity, effectiveness, and ease-of-use) in my humble opinion are:
- Naive Bayes
- Maximum Entropy (Logistic Regression)
- Support Vector Machines
Using any (or a combination) of these methods, we can classify documents into the Yes-No(-Neutral) classes using a trained model. The method that should be used in production is the one with the highest accuracy on the test set.
Nick Jones and I, for our final project in our NLP class, built a twitter Sentiment Analysis tool. We tried two classifiers: Naive Bayes, and Maximum Entropy. Naive Bayes turned out to be more accurate (83.01% accuracy on the test set). Since our training set is a bit old (based on tweets made in 2006), we made a web application to classify more recent tweets in real-time via the twitter API. For more information on this, check out our github repo. There are several improvements we hope to make to this tool. But this is a start. The goal of this project was to experiment with some of the methods used in Sentiment Analysis. We sure did.