In case you missed the first post in this series, it explains the purpose of this project. In this post, we’ll talk about the more technical aspects of the project, including software architecture and coding.
To get started, we broke the problem down into pieces. We found that the project had three modules: data collection, storage, and data analysis.
Some pieces of the project were well documented and straightforward. We got our data from Twitter, which has a really nice API. Specifically, we used the Twitter streaming API, which allowed us to collect tweets as they were posted. With data coming in, we set up a SQL database to store the information.
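To make the storage step concrete, here is a minimal sketch of what a tweet table could look like. It uses Python and SQLite purely for illustration; the table name, the column names, and the helper functions are our own assumptions and not the schema the project actually used.

```python
import sqlite3

# Hypothetical schema: column names are illustrative, not the project's own.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id         TEXT PRIMARY KEY,
    user_name        TEXT NOT NULL,
    profile_location TEXT,
    text             TEXT NOT NULL,
    created_at       TEXT NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_tweet(conn, tweet):
    """Insert one tweet dict; a tweet with an already-stored id is skipped."""
    conn.execute(
        "INSERT OR IGNORE INTO tweets "
        "VALUES (:tweet_id, :user_name, :profile_location, :text, :created_at)",
        tweet,
    )
    conn.commit()
```

Keying the table on the tweet id means a tweet delivered twice by the stream is stored only once.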
Analyzing the tweets required several different services. Natural Language Processing is surprisingly underdeveloped in .NET. The most commonly used library is OpenNLP, which is mostly written in Java, but it worked well for topic extraction. Sentiment analysis would have been difficult and time-consuming to build on our own. It would have required a large amount of training data, labeled as positive or negative and drawn from a wide range of topics, in order to make a well-rounded system. We didn’t have time for this but, luckily, others offer services we could use. Microsoft Azure and Aylien both offer extensive text-processing services that include APIs for sentiment analysis. We decided to use both services in order to compare the results later.
One of our objectives was to be able to focus sentiment analysis on a specific region, but we hit a snag in trying to geolocate our tweeters. Tweets have a couple of fields for storing a user’s location, but most of the time they are empty because of users’ security settings. Thankfully, the internet is creepy enough to provide other ways of getting at least an idea of a user’s potential location. Most Twitter users specify their hometown in their profile, and the Google Maps API can then be used to turn that text into coordinates. Obviously this isn’t ideal. Besides the fact that some people list their location as Mars, there’s the problem that we aren’t actually tracking where the tweet is coming from, but rather where the tweeter says they are from.
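As a rough illustration of the profile-location lookup, the sketch below sends the free-text location to the Google Maps Geocoding web service and reads the first match. It is written in Python for brevity and is not the project's code; the function names are hypothetical, and the caller is assumed to supply their own API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(profile_location, api_key):
    """Build the request URL for a free-text profile location."""
    query = urlencode({"address": profile_location, "key": api_key})
    return GEOCODE_URL + "?" + query


def parse_geocode_response(payload):
    """Return (lat, lng) of the first match, or None when nothing matched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


def geocode(profile_location, api_key):
    """Look up coordinates for a profile location; empty profiles are skipped."""
    if not profile_location or not profile_location.strip():
        return None
    with urlopen(build_geocode_url(profile_location, api_key)) as response:
        return parse_geocode_response(json.load(response))
```

Returning None for empty or unmatched locations lets the caller drop those tweets from the regional analysis instead of guessing.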
Sentiment API Comparison
Both APIs were great to work with. Microsoft’s Azure is a behemoth that required a lot more code on our end to achieve our goal, but it has a wide range of capabilities and returns a single, simple value, making its responses easy to work with. Returned scores range from -1 (negative) to 1 (positive).
Aylien returns a label of positive, negative, or neutral, along with a value indicating its confidence in that label. It also returns a value for the subjectivity or objectivity of the text. I found it super handy to have a “neutral” label for some applications, but unfortunately, for this project we really needed a way to represent each text on a scale from -1 to 1. Because a neutral result comes with no indication of the confidence levels for the other two states, we filled in a 0 for each neutral tweet. We ended up with a lot of zeros, which meant we threw away a lot of information. We found Aylien’s SDK far simpler to get up and running with, and it required less code. They also offer a very cool News API, which will likely be the subject of a future blog post.
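To show what fitting a label-plus-confidence result onto a -1 to 1 scale might look like, here is a small Python sketch. Only the zero for neutral comes from our description above; using the signed confidence for positive and negative labels is an assumption made for illustration, and the function and parameter names are hypothetical.

```python
def signed_score(polarity, confidence):
    """Map a (label, confidence) pair onto a -1..1 scale.

    Assumption: confidence is a value between 0 and 1 and is used
    directly as the magnitude of the score.
    """
    if polarity == "positive":
        return confidence
    if polarity == "negative":
        return -confidence
    if polarity == "neutral":
        # The confidence of a neutral label says nothing about which way
        # the text leans, so it is discarded here.
        return 0.0
    raise ValueError("unexpected polarity label: %r" % (polarity,))
```

For example, `signed_score("negative", 0.9)` gives -0.9, while a neutral tweet scores 0.0 no matter how confident the label was, which is where the information gets lost.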
The final stage of the project was to find a way to display the information that was collected. We thought the best way to do this was with a world map. With each country given a sentiment score, we used an area-value map, which displays the countries using a gradient of colors. Darker colors showed that a country’s score was negative, and lighter colors represented a positive score. We did discover that if tweets weren’t collected from enough countries, a world map was not the best representation. With enough data, though, the map showed which areas had the most positive and negative scores, as well as how the world as a whole reacted to the subject matter.
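A per-country score can be produced in several ways. The Python sketch below simply averages the scores of all tweets geolocated to each country; treating the country score as a plain mean is an assumption for illustration, and the function name and input shape are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def country_scores(scored_tweets):
    """Average tweet scores per country.

    scored_tweets is an iterable of (country, score) pairs,
    where each score is already on the -1..1 scale.
    """
    by_country = defaultdict(list)
    for country, score in scored_tweets:
        by_country[country].append(score)
    return {country: mean(scores) for country, scores in by_country.items()}
```

Countries with no tweets never appear in the result, which is why a sparse data set leaves most of the map blank.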
Using a graph, we could show how many countries had a score that was very negative, negative, positive, or very positive. We included another graph that shows the data by region, averaging the sentiment scores of each continent for a less detailed view of the data.
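The two summaries behind these graphs could be computed along the following lines. This is a Python sketch under stated assumptions: the 0.5 cutoff between "negative" and "very negative" (and likewise on the positive side) is made up for illustration, a score of exactly 0 is counted as positive, and the country-to-continent lookup is a hypothetical dictionary supplied by the caller.

```python
from collections import Counter, defaultdict
from statistics import mean


def bucket(score, strong=0.5):
    """Name the bucket for one score; `strong` is an illustrative cutoff."""
    if score <= -strong:
        return "very negative"
    if score < 0:
        return "negative"
    if score < strong:
        return "positive"  # a score of exactly 0 lands here
    return "very positive"


def bucket_counts(scores_by_country, strong=0.5):
    """Count how many countries fall into each bucket."""
    return Counter(bucket(s, strong) for s in scores_by_country.values())


def continent_averages(scores_by_country, continent_of):
    """Average country scores per continent using a caller-supplied lookup."""
    grouped = defaultdict(list)
    for country, score in scores_by_country.items():
        grouped[continent_of[country]].append(score)
    return {continent: mean(scores) for continent, scores in grouped.items()}
```

Note that the continent figure here is an average of country scores, so a country with a handful of tweets weighs as much as one with thousands.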
The visuals used for this stage of the project were an overall success. We accomplished our goal of taking a data set and turning it into an informative visual that makes the information easy to understand and consume.