Calculate Trending Topics and Sentiment Trends from Live Stream of Messages

Abstract

Popularity of the social media and the amount of importance given by an individual to social media has significantly increased in last few years. As more and more people become part of the social networks like Twitter, Facebook, information which flows through the social network, can potentially give us good understanding about what is happening around in our locality, state, nation or even in the world. The conceptual motive behind the project is to quantify the information that flows through Twitter via finding Trending Topics from live Twitter Stream, with a hidden technical motive of building a scalable system, and face and solve the challenges encountered during the system construction. Thus, the project aims at building a system which finds trending topics from live Twitter Stream. Also, for each trending topic, the system also shows a sentiment graph showing how positive and negative sentiments are trending as the topic is getting trended. The system uses Storm for handling and processing the Twitter data stream in the distributed fashion. The built system is deployed on the Amazon EC2 server and is available to public with free access.

Keywords: Trending Topics, Sentiment, Twitter, Storm

Introduction

Popularity of the social media and the amount of importance given by an individual to social media has significantly increased in last few years. Recently, Facebook announced that 1.11 billion people use their website each month while 665 million people are active each day [1]. Similar is the story with another popular social website, Twitter. Twitter has 554 million active registered users per month, 190 million unique Twitter site visitors every month, and 58 million tweets tweeted every day [2]. From this data one can clearly see the importance of social media in our day-to-day life. Social media provides a platform via which people can easily communicate their thoughts, ideas, beliefs, and share it with other people (called as Friends in Facebook, or Followers in Twitter). Advancement in the technology and scalable distributed systems has made the propagation of information shared by users to reach out to other people in merely fraction of seconds. I personally find it amazing. Social Network can be thought as one giant network just like internet network, with people as nodes which are constantly feeding information in the network via messages/tweets. Thus, information which flows through the social media, can potentially give us good understanding about what is happening around in our locality, state, nation or even in the world.

As part of this project, I studied the way of quantifying the information which flows through social network site, via finding Trending Topics from live Twitter Streams. The notion of “Trending” is bit ambiguous and need more precise definition which is provided in subsequent sections of the report, but for now we can assume that “Trending Topics” corresponds to dominant concepts that is currently been flown or talked about in the social network (Twitter). “Trending Topics” are time dependent and change with time, and new concepts might start to dominate or talked about in the social network as time passes by. Here, notion of “talking about” corresponds to the mention of the concept in the tweets. Though, Twitter already provides the functionality of letting users know about current “Trending Topics”, in this project, I have implemented the similar system which does the similar thing, but with technical motive of understanding how these systems can be built and building one to understand technical challenges associated with it.

Whenever people tweet a tweet, tweeted message is never random, it is associated with some topic/concept. And generally as a human, whenever we write/tweet something worth sharing, we associate some emotion to it. The words chosen to write a tweet give a good idea of the sentiment tone of the user about the concept/topic. So, if we do basic sentimental analysis on the tweets associated with some topic, we can get a fair amount of idea if the topic is positively talked about or negatively talked about in the Twitter. Hence, also as part of the project, apart from finding trending topic, the developed systems also show how sentiment associated with the topic trends for all the current trending topics. This broadly helps us to mentally classify the current trending topic in one of three groups: Positively talked about, negatively talked about, or controversial topic that have mix of positive and negative opinions. This analysis is something that is not given by Twitter, and would help users to get more quality information about the trending topics.

Technical Motivation

Introduction section mentions the high level goals of the project. This section describes technical motivation for doing the project. Below are technical goals:

Learning about Data Streams as a field: In recent years, data streams have received much attention because of the large amount data getting generated and the requirement for real time processing and analyzing the generated data. Example of such applications include financial applications (stock monitoring), network monitoring (packet monitoring), security, telecommunications data management, web applications, manufacturing, sensor networks, and others [3]. It is difficult to use traditional data analysis approaches for the data stream, primarily because of below reasons:

o Unbounded data size in the data stream, requires the algorithm to be independent of data size

o Generally because of the size of the data, it is not possible to persist the incoming data; hence any algorithm operating on the data stream should avoid having criteria of looking/processing the data more than once.

o Most of the data stream related problems expects real time answer, hence all the computation is needed to be done on the fly, and give real time response.

o Traditional algorithms are developed to give accurate results. In data streams, generally the approximations are accepted.

Finding Trending Topics from live Twitter stream also falls into this bucket of problems and the project provides an opportunity to learn about this field and develop an approach that have above mentioned characteristics and be robust.

Twitter API: These days, most of the study in the field of social network, is done using data from popular social network size like Twitter, Facebook. Project offers an opportunity to learn about Twitter Streaming API [4], and how application can be developed which run-time consumes the data retrieved from the Streaming API.
Storm: Storm [5] is open source an open source, real time distributed system developed for processing unbounded stream of data via providing reliability of processing of data. Storm was originally developed at Backtype (company) which was acquired by Twitter, and later Twitter made this project as an open source project [6]. One of the main learning goals of the project is learning about Storm. More details about Storm can be found in the Background Section of this report.
Sentiment Analysis: One of the reasons for aiming at doing Sentiment Analysis for the trending topic was to learn how this analysis can be done on the real time data. The project provides an opportunity to survey different ways of doing sentiment analysis and to learn about different popular sentiment score dictionaries.
Building end-to-end system: In the Introduction section, it is mentioned that Twitter itself provides the information about topics that are currently trending in Twitter. One of the aims for doing the project is developing an end-to-end system which is robust, handles and processes the data from live twitter stream, and summarizes/ranks the trending topic and sentiment results and shows them to the end user on UI which can be intuitively interpreted.
Server deployment on Amazon EC2 [7]: One of my personal goals from last half year was learning about EC2 and deploying some system on EC2 which runs 24×7. This project provided me an opportunity to accomplish my long pending goal.

Background

Before going into details of the built system, this section provides required background about the topics/concepts which would be used in subsequent sections of the report. Readers, who have knowledge about Storm and Z-Score, can safely skip this section.

3.1 Storm

Storm [5] is distributed real time computation system. Storm can be thought as a system like Hadoop, but main difference between Hadoop and Storm is that Hadoop does data processing in batches on the data present in the Hadoop Distributed File System (HDFS), whereas Storm does real time processing of unbounded stream of data. Storm is fast, and benchmark result shows that Storm processes one million tuples per second per node [5].

A Storm cluster is something similar to Hadoop cluster. On Hadoop cluster, “Map-Reduce jobs” are run, whereas in Storm terminology, we run “Topologies” on Storm cluster. Key difference between “Topologies” and “Jobs” is that, Jobs are expected to be completed after some finite amount of time, while Topologies processes data forever [8].

Storm cluster has two kinds of nodes: master node and worker node. “Nimbus” daemon runs on master node, which is responsible for distributing code across cluster machines, assigning tasks to different machines, and monitoring machine/task failure. Worker node runs a daemon called as “Supervisor” which is responsible for listening to the “Nimbus” for its task assignment, and is responsible for starting and stopping processes on worker node as necessary. Each worker is responsible for execution of some part of the topology. Typical storm topology would have many worker nodes.

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services [9]. Storm uses Zookeeper for all coordination between Nimbus and Supervisor. All state related information is kept in Zookeeper, and Nimbus and Supervisor are all stateless. Thus, if any worker node or nimbus node crashes, the crashed nodes can be restarted with their backups and gives the impression that as if nothing has happened thus providing incredible stability to Storm cluster. Below figure, figure 1, shows the high-level interactions:

Figure 1: Storm: Interaction between Nimbus, Zookeeper and Supervisor [8]

In Storm, “Topology” is nothing but a graph of computation. Each node in the topology contains the logic that is to be processed. Links between the nodes of topology indicate how the data should flow between the nodes. So a topology is created and is submitted to Storm cluster for processing. Another core abstraction in Storm is “Stream”. “Stream” is an unbounded sequence of tuples. Storm provides different primitives like transforming one stream to another stream reliably and in distributed fashion.

Very important primitive for Stream transformations are “spouts” and “bolts”. A “spout” is the source of stream. For example, Twitter Streaming API can be implemented in “spout” which would take incoming tweets, and create “stream” in storm “topology”. A “bolt” consumes any number of input “streams”, does some processing, and possibly emits new streams. Thus using “spouts” and “bolts” complex topologies can be designed in Storm. Below figure, figure 2, shows an example:

Figure 2: Some Storm Topology [8]

Once the topology is created, different number of worker nodes can be assigned to different types of spouts/bolts. All assigned worker nodes to a specific bolt/spout would execute the same bolt/spout logic in parallel.

Another important concept in Storm is of shuffling of data between two bolts/spouts. Shuffling of data determines how the data is to be passed between worker nodes of one spout/bolt to another spout/bolt. For example, consider a simple topology which has one spout and one bolt. Spout emits words, and bolt is responsible for doing word count (where we need to know the number of occurrence of each word in the data stream). So, as we have multiple worker nodes executing the same bolt logic, each worker node of bolt would be responsible for tracking some set of word occurrence. And hence what we expect is that the worker nodes of the spouts send a specific word to the specific worker node. There are two important types of shuffling supported by Storm: Random Shuffling and Field Shuffling. In Random shuffling the data is sent to different worker nodes randomly. In Field shuffling, a data that contains specific field always go the same worker (similar way how data with same key goes to the same reducer in Hadoop). In above word count example, Field shuffling can be used on the word, in order to make sure that same worker node processes occurrence of a specific word.

In this project, storm topology is created in order to find trending topics from the live twitter stream. Subsequent sections would provide more information about the same.

3.2 Z-Score

In statistics, Z-Score, also called as “Standard Score”, indicates by how many standard deviations an observation is above or below the mean [10]. Z-Score is dimensionless quantity, and can be easily computed for any raw score (x) using below formula:

Figure 3: Formula for calculating Z-score

Z-Score can be used to model time-series data. Given set of time-series points for an observation, mean and standard deviation can be easily calculated. Now, when we receive a new time-series point, we can use the mean and standard deviation computed to find the Z-score for the new time point. High Z score would indicate that the new point is significantly higher than the expectation while a high negative Z-score indicate that the new point is significantly below the expectation.

In this project, Z-score is used for ranking the trending topics. In next section ranking process is explained in detail.

Approach and Design of the System

This section describes the approach used for building the system for finding trending topics and trending sentiment from live twitter stream. To simplify the implementation, the system only considers “hashtags” as candidates for “topics”. Twitter defines “hashtags” as “The # symbol, called a hashtag, is used to mark keywords or topics in a Tweet. It was created organically by Twitter users as a way to categorize messages” [11]. So in other words, system would find trending “hashtags” from live twitter stream. But, in discussion section, it is explained that how the system can be trivially extended for considering any topic (group of words) as a trending topic candidate. Henceforth, the notion of “topic” and “hashtag” would mean the same and would be used interchangeably unless explicitly specified.

Below figure, figure 4, shows high level data flow and main components of the systems:

Figure 4: High Level Data Flow [12]

System has a module, “Storm Topology” which has storm topology created and is executing the storm topology on the storm cluster. The topology has a spout, named “twitter-spout” which implements the logic for consuming Twitter Streaming API. The spout receives the tweets from the Twitter Streaming API and it creates a new “stream” in the topology with one tweet per tuple. Storm topology is explained in detail in the following sub section. At the last step of the storm topology, top topics are persisted in the H2 database [13] with related information in JSON format.

Whenever a client wants to retrieve latest trending topics, it would send a request to the Jetty Server [14]. The server would contact H2 database and retrieve latest trending topics and related information (sentiment scores). Server would send the retrieved trending topics to the “Ranking Topics” module, which is responsible for ranking the trending topics based on some criteria (criteria would dependent upon the notion of ‘trending’). The Ranking Topics module would rank the topics and return the ranked topic list. Server would then send the ranked trending topics to the client.

In the subsequent sub sections, each component is described in detailed.

4.1 Storm Topology

4.1.1 Window

What does “Top Trending Topic” mean? Top trending topic means the topics which have maximum occurrence in tweets. But, this definition is incomplete as it doesn’t provide any temporal information. For example, Top Trending Topics in last 60 minutes may not be same as Top Trending Topics in last 1 day. Hence, while asking the question about finding top trending topics, it is essential to specify the time period. In other words, the question should be “Top Trending Topics since …”. In the project, the time period is assumed as one hour. So, the system would provide the list of topics that are trending in last one hour.

To model the time period, system uses the notion of “Window”. Window is a logical unit which monitors topic for specified time period. So Window of 1 hour would keep track of topic for 1 hour time period. To keep track of different topics, we would need one window per topic. Window is sub-divided into equal time period chunks called as “pane”. So, if we divide the Window into 6 equal chunks then each pane would be of 10 minutes. Below figure, figure 5, shows window of 1 hour and 6 panes of 10 minutes each, for topic, say, #hangover.

Figure 5: 1 hour Window divided into 6 equal non-overlapping panes [12]

Each pane of the window would keep track of the total occurrence of the topic (count) in the pane time period, positive sentiment score (PS Sco) calculated from all the tweets received during the pane time period and negative sentiment score (NS Sco) calculated from all the tweets received during the pane time period. “Green Check” in the above figure means that the “pane” is a valid pane.

For monitoring topic activity (count, sentiment score) for last one hour, at every 10 minutes, the oldest pane (leftmost) would expire (making pane invalid), and a new pane would be added at rightmost end. This mechanism is called as “Window Slide”. Before invalidating the oldest pane, an aggregation operation is performed on complete window which finds total count, total positive and negative sentiment score, and gives the summary about the topic during last one hour. This provides good approximation for last hour activity for a topic. Below figure, figure 6, shows the window slide mechanism.

Figure 6: Dotted Lines show the position of new window. Red cross indicates that the pane is invalid while green check indicates valid pane [12]

For implementing the “Window” concept, Circular Linked List data structure is used. Below figure, figure 7, shows how the window can be modeled as Circular Linked List.

Figure 7: Window visualized as Circular Link List

Elegant thing about the representation of the Window as a Circular Linked List is that circular linked list implementation only need to expose two APIs:

//API to add data to the latest pane

Void addData(Object data);

//API to return aggregate information and also do sliding at end of aggregation

Object getAggregateInfoAndSlide();

Thus, the client which consumes this, does not need to know how many “panes” or “nodes” are present during add, slide, aggregate operation. New panes can be added without client dependency. Even without knowing underline window/pane implementation, client can still control the length of the pane (pane length would be equal to time difference between two consecutive calls to “getAggregateInfoAndSlide()” function). This makes integration of circular link list data structure very easy, useful and simple.

Window mechanism described only tracks activity for one topic. But, for finding top topics it is required to monitor activity for all the topics which are encountered in last one hour time period. For tracking activity for all the topics, a Map data structure is used, where key is the topic name and value is the pointer to the “Window” or the Circular Linked List. Below figure, figure 8, shows the data structure.

Figure 8: Data structure for tracking activity of all the topics

4.1.2 Modules in Storm Topology

We now understand that how we can model Windows for all topics and keep track of topic occurrence and other information like sentiment scores associated with each topic. As overall goal is to make the implementation scalable, hence we would create Storm Topology for finding top topics from the Twitter Stream, and run the topology on the Storm cluster. As explained in the background section, Storm Topology is created by wiring different spouts and bolts which are responsible for executing some part of the overall logic. Before discussing the storm topology, lets first break down the tasks in the modules (independent of Storm) to understand end to end flow, and then we can map modules to appropriate bolts and spout where the modules would be executed.

Figure 8 shows the flow and the approach for finding top topics from the twitter stream.

Figure 8: Approach for finding Top Topics from the Twitter Stream

4.1.2.1 Buffer Incoming Data

This module is responsible for integrating with Twitter Streaming API via subscribing to the twitter stream. twitter4j [15] library is used for the subscription. Once subscribed, it would receive live tweets from Twitter, and the modules buffers the incoming tweets, retrieves the tweet message, and then pass the tweet to the next module (Filter Content) for the processing of the tweets. Figure 9 shows the buffered tweets and the output from the module

Figure 9: Buffered Tweets in “Buffer Incoming Data” module

4.1.2.2 Filter Content

This module receives a tweet message and is responsible for extracting relevant information from the tweet message and passing the relevant information to the next phase. Relevant information extracted from the tweet:

Topic Names: Extracts all the hashtags that are present in the tweet
Positive Sentiment Score: Send the tweet message to the “Sentiment Calculator” and if the sentiment score is positive then stores the score as positive sentiment score
Negative Sentiment Score: Send the tweet message to the “Sentiment Calculator” and if the sentiment score is negative then store the score as negative sentiment score

For each topic retrieved, pass the topic name, positive and negative sentiment score to the next module. Figure 10 describes the module:

Figure 10: Extracted Information from a tweet

4.1.2.3 Sentiment Calculator

This module receives tweet message and it calculates the sentiment score for the message and returns the sentiment score to the caller. For calculating the sentiment score, SentiWordNet [16] dictionary is used. SentiWordNet is a dictionary that provides a sentiment score for every word in the range of [-1, 1] (real number scale). For every word in the tweet, a lookup is done to retrieve the score associated with the word in SentiWordNet dictionary. If a word is not found in the dictionary, then score of that word is considered as 0. For calculating score of complete tweet, individual word scores are added, and resulting score is sent back to the caller.

4.1.2.4 Update HashTag Summary

This module maintains the topic summaries for all the topics seen in the Twitter Stream. The data structure used for maintaining the topic summaries is the same as that is shown in the figure 8. The module receives an object (topic name, positive sentiment score, and negative sentiment score) as input. It then does a lookup in its Map to find the window/circular link list, for the topic name. If no record found then a new entry is added to the map with the topic name and the object is added to the latest pane. If the record exists then the object is added to the latest pane.

Figure 11: Storing the Topic Information

4.1.2.5 Slide and Send Window Summary

This is not a separate module, but a logical different part of “Update HashTag Summary” module. After fixed specified time, 10 minutes, on every element present in the map, which is maintained by “Update HashTag Summary”, getAggregateInfoAndSlide() function is called. This method returns aggregated information (total count, total positive and negative sentiment score) for every topic for complete window. This information (object) is passed to the next phase which finds top 20 topics (based on total count) from all the summaries.

4.1.2.6 Find Top 20 HashTags

This module receives topic summaries and every 10 minutes, it persist top20 topics that have maximum count, in the H2 database with current timestamp.

4.1.3 Building Storm Topology

This section describes Storm topology and its correspondence with the modules mentioned in above section. As mentioned in the background section, two important building blocks of the storm topology are spouts and bolts. As spout acts as a source for the stream in the Storm topology, a spout is created for the module “Buffer Incoming Data”. A bolt is created with the logic of “Filter Content” module with incoming data stream from the “Buffer Incoming Data” spout. As the “Filter Content” module heavily interacts with the “Sentiment Score” module, the “Sentiment Score” module is also placed in the same bolt as that of “Filter Content”. Output stream of the “Filter Content” bolt goes as input stream of “Summary” bolt where the logic of both “Update Hashtag Summary” and “Slide and Send Window Summary” modules is placed. Every 10 minutes, “Summary” bolt would emit a data stream with the summary of all the topics which would go as input stream to the “Ranking” Bolt. It is possible that the “Ranking Bolt” might get too much of information in small time period, and hence the logic of ranking the topic is spread across two different types of bolts, namely “Partial Ranking” and “Final Ranking”. Output stream from “Summary” bolt is splitted into multiple streams depending upon the number of worker nodes that are executing “Partial Ranking”. Each “Partial Ranking” bolt calculates top 20 topics and sends the stream with top 20 topics to the “Final Ranking” bolt which calculates top 20 topics among all the top 20 topics sent by “Partial Ranking” bolt. Top 20 topics found by “Final Ranking” bolt is persisted in the H2 database.

Key feature of the Storm is that the logic of each kind of spout and bolt can be executed on different worker nodes, and it let the application developer choose the number of worker nodes that are to be assigned to each bolt and spout. Thus final number of assignment of number of worker nodes to the type of bolt or spout would depend upon the number of worker nodes available, but we can do a relative estimation on the number of worker nodes needed for each type of bolt and spout.

Apart from the logical flow between the modules, Figure 8 also has arrows in different colors (red and green). Red arrow indicate that the link connecting the two modules would have heavy data load while green arrows indicate that the modules would have relatively less load then red arrows. This distinction between the links can act as good criteria for deciding upon the number of worker nodes assignment to each type of bolt/spout.

In the topology we have only spout. Twitter Streaming API only permits one active connection from specific credentials [4]. Thus, if we assign multiple worker nodes to the spout, it would mean that we would be opening multiple active connections with the Twitter Streaming API with same credentials, and Twitter won’t allow that. Hence, in this case, because of restriction from source (Twitter), the spout cannot be assigned more than one worker node. Figure 12 shows the Storm Topology.

Figure 12: Storm Topology for finding top 20 topics

One more important thing to consider is the shuffling type between every spout and bolt. Two most common shuffling type of Storm are: Random Shuffling and Field Shuffling. These shuffling types are explained in the background section. For the first connection between “Buffer Incoming Data” spout and “Filter Content” bolt, it doesn’t matter which worker node in the bolt executes which tweets, all the worker nodes does the same job. Hence, random shuffling is used between them. For the second connection which is between “Filter Content” bolt and “Update Hashtag & Send Summary” bolt, we know that every worker node of the “Update Hashtag & Send Summary” bolt is responsible for having a Map for the topics in order to monitor activities of the topics. Hence, we expect that a specific hashTag should always go to the same worker node which is keeping track of hashTag’s activity. Hence, we would use “Field Shuffling” on the “Topic” field for the link between “Filter Content” and “Update HashTag & Send Summary” bolts, in order to make sure that irrespective which worker node of “Filter Content” bolt processes a topic, the same topic is always processed by same worker node of the “Update HashTag & Send Summary” bolt.

For the next connection which is between “Update HashTag & Send Summary” bolt and “Partial Ranking” bolt, we would again use “Field Shuffling” on Topic as a shuffling mode. And for the final connection which is between “Partial Ranking” and “Final Ranking” bolts, we can use any type of shuffling as there is only one worker node which is executing the logic of the “Final Ranking” and hence it doesn’t matter as all the topic need to go to the same worker node. Figure 13 shows different shuffling modes between different spouts and bolts.

Figure 13: Shuffling modes in Storm Topology

4.2 H2 Database

Second component in the data flow diagram as shown in the figure 4 is H2 Database. H2 database is an open source, light weight java database which can be embedded into the java application. By embedded it is meant that we do not need any pre-installation or configurations on the machine that is hosting the system for H2 database to work. H2 database is created on the fly by the system and data of the database is persisted in only one file on the local disk. Thus, it makes development of the system easy and makes the system easily runnable on any machine.

Apart from H2 database, there are other java based, light weight databases which too can be embedded into the applications. Popular light weight java databases are HSQLDB, ObjectDB, Derby, OrientDB, and Neo4j. One of my blog [17] describes comparison between different light weight java databases, and H2 and ObjectDB were the two candidate chosen for the project. Initial implementation of the system used ObjectDB, but because of unresolved technical difficulties, the ObjectDB implementation was dropped and was replaced with H2 database.

The database has very simple schema, only one table with three columns. First column is the time stamp column, second column is the topic name column and the last column is of CLOB datatype, which stores the topic information/summary of the entire window in the JSON format.

4.3 Rank Topics

“Rank Topics” module ranks the topics on the basis of the notion of “trending”. The topics which are relatively more aggressively trending are given higher rank. In order to implement this module, a precise definition of “trending” is required. In the project we shall consider a topic as a trending topic if it is among top k maximum occurred topics in last one hour (window time). This is a baseline criterion for a topic to be considered as a trending topic. In order to rank the top k topics, intuitively a topic which is “growing fast” in the latest panes of the window should be ranked higher than the topics which grew in the older panes of the window and are dying off in the latest window panes. This notion would be formalized and would become clearer as different ranking strategies are discussed.

Top k topics that are found using baseline criteria can be ranked using below strategies:

4.3.1 Pure Count Based Ranking

One of the naïve, easy to implement, strategy would be to rank the topics based upon the count of occurrence in the “Window” time period. If a topic, A, has occurred more number of times in the window period than topic B then topic A would be given higher rank than topic B. I.e., we can sort the topics in descending order of their count of occurrence in the complete window time period and then rank the topics as per the sorted ranking.

This was the first strategy implemented and tested. As it turned out, this strategy had a major drawback. Figure 14A and Figure 14B shows graphs of two topics, and with the current strategy, topic in figure 14A would be ranked higher than the topic in figure 14B, which is somewhat against our intuition.

Figure 14A: Total window count ~85

Figure 14B: Total window count ~80

4.3.2 Local-Z Score Based Ranking

One of the main reason for “Pure Count” approach for not giving the results as per our intuition of trending topic was because of the fact that the approach didn’t considered or modeled fluctuations in the window while making a decision. To solve this problem, a new strategy, “Local-Z”, was developed.

In Local-Z based ranking, for every topic, its window is divided into two equal non-overlapping parts called as segments. So each segment would have 3 panes (as window as total 6 panes). First segment, also called as history segment, would consist of oldest three panes of the window, while the second segment, also called as trend segment, would consist of latest three panes of the window. History segment would act as a baseline for qualitative interpretation of the counts in the trend segment. As per our intuition, we can expect that a higher ranked trending topic would have low values in its history segment in comparison to the values in the trend segment. In order to quantify the comparison, using every topic’s history segment, mean and standard deviation is calculated for all the points (pane count) present in the history segment. Using the calculated mean and standard deviation, Z-Score is calculated for all the three point (pane count) of the trend segment, using below formula, and mean is taken of all the three values. This mean is called as Local Z-mean of the topic.

Z = (X – Mean)/StandardDeviation, where X -> DataPoint

Figure 15: Shows history and trend segments, and Local Z-mean Calculation

Higher Z-mean value, implies that the points in the trend segment are further away from the baseline (history segment points), and thus intuitively the topic is trending fast. Negative Z-mean value would mean that the points in the trend segment are below the base line and the topic is dying off, so such topics can be ranked lower. All the topics are sorted with their Local Z-mean value in the descending order, and are ranked as per their position in the sorted order.

This strategy solved the problem that was faced by the “Pure Count” based ranking approach. This would rank the topic in figure 14B higher than the topic in figure 14A. But, after implementation, a new problem was discovered, which was again somewhat against our intuition of the trending topics. Consider below two figures, figure 16A and figure 16B. Using Local-Z based approach, topic shown in the figure 16A would be ranked higher than the topic shown in the figure 16B.

Figure 16A: High local Z-mean Score

Figure 16B: Local Z-mean score close to 0

The reason for failure of this approach was that in order for a topic to be high ranked, it always needs to have points in the trend segment much higher than the history segment points. Now consider the topics which initially had high Z-mean value, and as time passed, the topic became popular and reached at apex level, where the count is high and is constant for all the panes for the window, depicting that the topic has reached the peak. Such topics, intuitively, are still trending topics and should be ranked high. But, in the Local-Z based ranking, such topic would be assigned Z-mean value close to 0, thus penalizing the rank. What Local-Z does is correct, alas its “local” and it doesn’t have information about the mean and standard deviations of other topics, and hence it would give value as approximately 0. To resolve this issue, a new strategy was developed which is discussed in the next section.

4.3.3 Global Z-Score and Local Z-Score Based Ranking

As mentioned in the earlier section, main reason for failure for the Local-Z was that the Local-Z based ranking didn’t account for the mean and standard deviations of other topics which are competing for the ranking. Hence, a new notion of Global Z-Score was introduced. Topic ranking obtained from the combination of Global Z-Score information with the Local-Z score gave the results which were as per our intuition of the trending topics.

Using counts (data points) present in all the 20 topic’s history segments, mean and standard deviation was calculated and these values are called as global mean and global standard deviation. For every point in every topic’s trend segment, another Z-score, called as global Z-score, was calculated using the global mean and global standard deviation. And mean was taken of all the calculated global Z-scores of a topic; this mean is called as global Z-mean.

Now, every topic would have two scores, local Z-mean and global Z-mean. All the topics are grouped into three groups. First group, called as “All Positive” group, contains those topics which have positive values for both local Z-mean and global Z-mean. Second group, called as “Positive-Negative” group, contains those topics which either have, positive local Z-mean and negative global Z-mean, or negative local Z-mean and positive global Z-mean. Final third group, called as “All Negative” group, contains those topics which have negative values for both local Z-mean and global Z-mean.

For ranking, all the topics in the “All Positive” group are placed higher than the topics in the “Positive-Negative” group. And all the topics in “Positive-Negative” group are placed higher than the topics present in “All Negative” group. All topics in the specific group are ranked among themselves using below criteria:

All Positive Group: Topics in this group were ranked on the basis of global Z-mean score. Topics are sorted in the descending order of the global Z-mean score, and ranked as per their sorted rank.
Positive-Negative Group: Topics in this group were ranked on the basis of local Z-mean score. Topics which had higher local Z-mean score were placed higher than the topics that had lower local Z-mean score. Topics are sorted in the descending order on the basis of local Z-mean score, and ranked as per their sorted rank.
All Negative Group: Topics in this group were ranked on the basis of global Z-mean score. Topics are sorted in the descending order of the global Z-mean score, and ranker as per their sorted rank.

4.4 Jetty Server

Jetty [14] is an open source, pure Java-based HTTP Server and provides Java Servlet container. One of the key features of the Jetty is that the server itself can be embedded into the applications. I.e., as a part of the application, server can be started at a specific port, stopped when required. As the server is embedded within application, no pre-configuration on the machine hosting the application is required. The application when started launches the Jetty Server and handles all the HTTP Requests.

Jetty Server receives all the client requests for the trending topics. Once request is received, it fetches the latest top 20 record persisted in the H2 database with other information (positive and negative sentiment score). It passes the retrieved list of topics to the Ranking Topic module for ranking the topics using different strategies (as mentioned in earlier section). Ranking Module returns the topics in the ranked order. Server then converts the results in the JSON format and sends the result back to the client. At client end, the received JSON is parsed and displayed as charts on the browser using Chartjs [18] HTML5 library.

4.5 F2/Homogeneity/Second Moment Score

So far, we have seen that how we can find trending topics from the live Twitter Stream, and rank them as per our intuitive notion of ‘trending’. One question, we would still like to have answer for is, “Can we compare the quality of the trending topics retrieved during some time point, say at 5PM, to the trending topics retrieved during some other time point, say 2 AM? ”.

So for finding the quality we need to consider two factors:

Number of occurrence of the topic in the tweet messages. Continuing the above mentioned example, say the average count of occurrence of the ranked topics was 500 at 5PM, while say the average was 50 at 2AM. We can clearly see that the quality of the trending topics at 5PM is better than the quality of trending topics at 2AM, as the topics during 5PM were much more talked about in the Twitter.
Distribution of the count. Again, continuing the example, for simplicity consider top 5 topics, and same number of occurrence of topics at 5PM and 2AM. I.e., let’s say that we found top 5 topics at 5PM, where topic1 and topic2 occurred 8 times, topic 3 occurred 2 times, while topic 4 and topic 5 occurred 1 time, making total count of 20. Also, even at 2AM we found that the total count of top 5 topics was 20, but all topics had count of 4. In this scenario, we would expect the trending topics at 5PM are better than trending topics found at 2AM, because of the skewed distribution of the counts even when total count is the same.

(PS: In the example we never assumed that the topics that trend during 5PM and 2AM are same. They can be different topics and we are interested in quality comparison).

Above two mentioned criteria for doing qualitative analysis can be modeled via calculating second moment of the data points also called as homogeneity or F2 score [3]. Second moment can be calculated using below formula:

F2 Score = ∑(M_i * M_i), where M_i is the number of occurrence of the i^th trending topic.

The above mentioned simple formula, would give higher F2 score if the occurrence count is higher (criteria 1), and even when the overall count of occurrence is same, it prefers skewed data (criteria 2: {8² + 8² + 2² + 1² + 1²= 134} > {4² + 4²+ 4² + 4² + 4² = 80})

Hence, if we compare the F2 score at different time points, we can get fair amount of idea about the quality of trending topics. The system implemented, calculates F2 score of the Twitter Stream for top 20 trending topics, every 10 minutes (when the topics are persisted in the database), and keep track of F2 score for last 24 hours (10 minutes intervals -> 6/hour * 24 hours = 144 F2 scores), and plot a graph. The user can look at the graph, and can compare the current F2 score with one day history of F2 scores for qualitative comparison of the trending topics.

Experimental Setup and Results

5.1 System Setup

5.1.1 Storm Topology and Cluster Setup

As part of the project, complete new system is built as per the approach mentioned in the previous section. One of the important components of the system is the Storm Topology which is to be run on the Storm cluster. Storm supports cluster setup in distributed and local mode. Distributed setup is the real setup where there are number of available machines and each machine is configured for a specific job (worker/nimbus/zookeeper nodes). For the purpose of development, Storm also supports Local cluster mode, which behaves in similar way like distributed mode, but all the jobs/tasks are executed on one machine with different JVMs, making it independent of other tasks. These two modes are similar to the two modes pseudo-distributed mode and full distributed mode in Hadoop [19].

As the goal of the project was to learn about Storm and because of unavailability of machines, the storm topology that was created as part of the project was ran on the Local Storm cluster. Number of logical tasks chosen to run in parallel for:

Buffer Incoming Data Spout – 1 task
Filter Content Bolt – 5 tasks
Update HashTag and Send Summary Bolt – 8 tasks
Partial Ranking Bolt – 3 tasks
Final Ranking Bolt – 1 task

All the tasks run in parallel without been aware of the fact that other tasks too are running on the same machine thus simulating the Storm Distributed Topology.

The developed system is deployed on Amazon EC2 t1.micro instance running Ubuntu 12.04 operating system. All the results mentioned in the subsequent sections are obtained from the same instance.

5.1.2 Twitter API

For finding trending topics from live Twitter Stream, it is required to subscribe to the Twitter Streaming API which samples and send some fraction of the live twitter stream to the subscriber [20]. By default the tweets sent by the Twitter API to the subscriber contains tweets that are tweeted around the globe. Twitter Streaming API also supports “Filter” option, where the subscriber can specify geographical region, and the API only sends the tweets tweeted in the mentioned geographical region. The built system, while subscribing to Twitter, has provided the filter in order to receive tweets that were tweeted only from United States. So, the trending topics that are shown by the system, indicates the topics that are trending in United States.

5.1.3 Filtering Irrelevant HashTags

There are few set of hashtags which are stated more as an abbreviation, and their occurrence in the tweet message doesn’t suggest trending of the topic. For example, one of the popular hash tag is “#OOMF” which is abbreviation for “One Of My Follower”. People use this hash tag as an abbreviation in the tweet message, another such hash tag is “#RT” which is mentioned by some users as an abbreviation for Retweet, and clearly such topics are something we were not interested in finding under the trending topic notion.

In order to tackle this problem, a black list of the hashtags is made, and such hashtags are filtered out and are not considered as candidates for trending topics. Hashtags currently blacklisted by the system are: #OOMF, #RT, #JOB, #JOBS, #TWEETMYJOBS. These hashtags are found during the system testing and topic verification process.

5.1.4 Browser Compatibility

Front end of the system, which displays the results and the trending count and sentiment graph for all top 20 trending topics, uses chart js library. Chartjs requires browser to support HTML5. So, in order to see the results on browser, it is expected that the client has HTML5 browser.

5.2 Results

As part of the project, complete system was developed, and more information about obtaining the system URL can be found in “Running System Instance” section of the report.

5.2.1 Results as shown on browser

Figure 17 shows a screen shot of the Trending Topics and Trending positive and negative sentiments as displayed on browser.

Figure 17: Trending Topics as displayed in the browser

In the figure 17, first column indicates the Trending Topic Rank. The second column mentions the topic name. Third column shows the occurrence (count) of the topic since last 1 hour, rightmost point on the axis is the current time. And fourth column shows a graph of trending positive and negative sentiments. Red curve in the graph of fourth column shows how negative sentiment trended, while blue curve in the graph shows trending of positive sentiment.

Figure 18 shows the F2/Homogeneity score graph for the span of 1 day. Graph contains 144 points on the X – axis and shows F2 score on the Y – axis.

Figure 18: Homogeneity Score graph as displayed by the system on 27^th May, 2013 at 4:50PM

5.2.2 Results Comparison with the Twitter Trending Topic

As mentioned in the introduction section, even Twitter has a functionality of letting users know about the current trending topics. So, as part of the system evaluation, Twitter predicted trending topics were compared to the Trending Topics predicted by our system. (PS: The system built only considers hashtags as topics while the Twitter also finds trend of non hashtag topics, so for comparison, we compare the hashtag shown by the Twitter in the Twitter trending section to the hash tags predicted by the system).

Figure shows the trending topics as displayed by Twitter on 20^th May, 2013 (4:10PM). Figure 20A and Figure 20B shows the trending topics as predicted by the system during the same time.

Figure 19: Trending Topics shown by Twitter on 20^th May, 2013 (4:10PM)

Figure 20A: Trending Topic as displayed by the system on 20^th May, 2013 (4:10PM)

Figure 20B: Trending Topic as displayed by the system on 20^th May, 2013 (4:10PM)

Discussion and Future Directions

The project aimed at designing a system for finding trending topic also with the trending sentiment from the live Twitter Streaming API. The system is deployed on modest hardware, Amazon EC2 micro instance having 613 MB of memory, and total 8GB of disk space. System is running 24×7 for past 12 days without any issue. As shown in the results section, the system was able to identify the trending topics and topics predicted by the system were also shown by the Twitter as the trending topics. There were couple of hash tags that were shown by the Twitter as trending but were not identified by the system as trending. Few reasons for this might be:

Implemented system has a short term memory of 1 hour for any given topic with 10 minute of window slide. These parameters are not fined tuned, and may need more sophisticated approaches.
Twitter only provides small fraction of the tweets to the subscribers and the system uses this fraction to predict the trending topic. In contrast, Twitter might be using all the Tweets for finding trending topics and hence the mismatch between the predictions.

Moreover the developed system also provides the sentiment trends, which is not something shown by the Twitter. Sentiment trend might help the users to get better qualitative knowledge of the trending topics.

As mentioned in the Experiment Setup and Result section, there are some hash tag that are black listed. What makes the problem of identification of abbreviation hashtags tougher is the fact that there is no universal set of abbreviations or rules for using abbreviations. Moreover abbreviations evolve over time. So, such tags can only be found by trending topic evaluation process.

Current system implementation only considers hash tag as the candidates for the trending topics. The system can be trivially extended to support finding non hash tag trending topics. In order to achieve this, a Natural Language Processing library can be used in the “Filter Content” bolt. On receiving a tweet for filtering a call can be made to the NLP library for the identification of all Noun phrases in the tweet messages. The identified noun phrases can be treated as potential trending topic candidate. Rest implementation of the system need not be even touched. This can act as possible future work for the project.

System currently uses simple Z-score based ranking mechanism by dividing the window into two halves and using former half as a history for finding fluctuations in the later half. More sophisticated ranking techniques like regression models can be used for the ranking the trending topics.

As recommended by Professor Junghoo Cho, the system can be modified to finding trending Wikipedia topics from live twitter stream. In order to achieve this, again, only change required would be at “Filter Content” bolt. Currently filter content bolt identifies hashtags from the tweet messages; this can be changed to identify the phrases which are topics in the Wikipedia. This extension when coupled with longer window time period can help to identify the Wikipedia topics whose popularity is evolving in the similar fashion.

Though the current system is been developed for handling only Twitter Streams; most of the system modules are independent of the Twitter Stream. Only, “Filter Content” bolt and the spout are Twitter specific. The system can be easily extended to find topic trends from other data sources like web forum, blogs, Quora.

Current system implementation uses the Storm in the Local mode. Possible future work could be test out the system on the distributed Storm cluster.

Running System Instance

Currently the system is running on Amazon EC2 and the system can be accessed via below URLs:

Trending Topics and Sentiments: http://23.20.37.11:8082/

Homogeneity Score: http://23.20.37.11:8082/html/homogenity.html

If system is unavailable for some reasons on the above mentioned URLs then please contact the author for getting a valid URL via which the system can be accessed.

Conclusion

As part of the project, a complete end to end system was developed that would find trending topics along with the trending sentiments from live Twitter Stream. The project was tremendous learning experience both on conceptual and technical fronts. Project provided me an immense opportunity to explore different technologies, figure out merits and demerits, and use them to solve the problem at hand. All the points mentioned in the technical motivation for doing the project were achieved. Moreover, the project has large gamut of future extensions, and more sophisticated systems can be built by using current system as a base to answer lot of interesting problems.