The subject of my diploma thesis was designing and implementing a system that given a dataset of tweets, it recognises real world events in it. Ideally, it also recognises the relationships between the extracted events, generating storylines as a result. Results are visualized on a web browser. You can see the implemented user interface with some results here. The describing essay is written in greek.
Using twitter to learn the news offers advantages compared to traditional digital news sources (digital newpapers, blogs, etc). News spread on a faster pace and are also presented from multiple standpoints. Since twitter users also post notable links to traditional sources, we can still take advantage of those too.
The problem though is that twitter is a social media platform. That means that there is alot of spam and social noise. Also the sheer amount of tweets posted for some topics is hard to manage. The tweets of news kind we are interested in amount to about 4-5% of the total posted tweets [source]. It is thus very important to filter out tweets and try to be as memory-efficient as possible.
Another serious problem we have to consider is the short length of tweets and the fact that tweets come from multiple and diverse sources (ie users).
Filter and Pre-process
The first thing to do is filter the tweets entering the system. We keep tweets that:
- are in the supported language (ie english), determined by the supported language of thelanguage analysis tools used)
- are not a retweet or has a mention to another user, because those kind of tweets tend to be mostly social noise. This condition is optional but recomended
- have a length of at least 5 words and 20 characters
- have less than 4 urls and 4 hashtags. We do this to eliminate some spam tweets
- have a noun followed by a verb
The last condition evaluation is made possible by using a Part of Speech (POS) tagger. A POS tagger is a tool that given a series of tokens, assigns a tag to every token, that represents what part of speech they are. The POS tagger is also used to extract as candidate named entities the tokens that are tagged as proper nouns.
After that we strip any unused information from the kept input tweets, which are described in JSON format, and save them in a line per tweet JSON text file format. This is the format that the next phase, the analysis phase reads from.
If we choose to run the program that retrieves historical data from the twitter REST API, we have the option to perfom a query expansion step. This is completely optional and not really necessary. The methodology for the query expansion is as follows:
- Extract candidates with Rapid Automatic Keyword Extraction (RAKE) [source]
- Ignore those with verbs, adverbs and adjectives
- Choose those that exceed a threshold in frequency
- From those, discard the unigrams that appear in digrams and trigrams
- The expanded query is the disjuction from the remaining keywords
We can then re-run the retrieval program using generated query from the expansion. Example result:
original query: korea
expanded query: north korea OR south korea OR kim jong OR north korean leader kim jong
At this point we enter the analysis phase. The first step is to split the tweets in groups, based on their timestamp. The time ranges that define those groups are determined by the peaks in the time-tweet volume histogram. To detect the peaks in the histogram we use a moving median-based filter to generate a critical curve. If there is a number of in sequence histogram points over the critical curve, then we mark those points as a peak region. Example result of the method is shown in the image below.
Essentially, what we end up with are high and low traffic groups of tweets. The reasoning behind this partitioning is to reduce computation complexity for the following steps, while still grouping relevant tweets together.
We then apply the DBSCAN clustering algorithm on every time group of tweets we generated from the previous step. Out implementation of the algorithm is of quadratic time complexity.
Before applying the alorithm on the groups of tweets, we must first generate representations of the texts that support the notion of distance. The representation we use is bag-of-words. The bag-of-words representation is essentialy a sparse vector, where every dimension represents a word (or token etc) from a dictionary. When a word is not present in a text the assigned dimension weight for the word is zero. If a word is present, then the assigned weight is 1 for regular words and b (usually > 1) for words in the detected named entities set. We then apply a tf-idf model on the vector and normalize it to unit length.
As distance D between vectors A, B we use the formula:
Essentialy we calulate the complementary of the cosine of the angle between two bag-of-words vectors. We can calculate the cosine by the dot product formula and since we have unit vectors, we don’t have to divide by the product of their norm.
To mitigate the fact that our implementation of DBSCAN has quadratic time complexity, we can optionally sample every group. Sampling is done, by setting a ceiling on the number of tweets that are used from every group. That way we set an upper limit ot execution time per tweet group. We also normalize the size of the generated clusters to take sampling into account.
At this point we attach several attributes to the generated clusters by pulling data from file. Those attributes are:
- Centroid: normalized vector sum of the bag-of-words in the cluster
- Terms: centroid in readable format
- Texts: Nearest 10 to the centroid unique texts
- Title: Top ranking text from the above attribute
- Time: Minimun/Start, Maximum/End timestamp of tweets + mini volume histogram
- Informal: Count of informal tweets (detected with regular expressions)
- Size: The size of the group, normalized if from sampled group
We then merge clusters using the formula:
Where a and threshold are parameters, sim is the cosine of the angle between the centroids of clusters e1,e2 and dt their distance on the timeline in seconds. To put simply, we merge clusters whose cosine distance is less than a threshold after applying a damping factor depending on temporal distance. The main reasoning behind this is that we want to ‘fix’ any inapropriate splits we did during temporal partitioning.
Now that we have our clusters in their expanded form, we eliminate those that we deem as of bad quality. Selection is done by thresholding on a ranking heuristic and a limit on the ratio of informal/size that is acceptable. The ranking heuristic favor clusters with:
- Bigger size
- Less time duration
- 1-2 urls
- More terms
- Less ‘informal’ tweets
For the last step of analysis, we attemp to extract relationships between clusters. Relationship extraction is done in a simple manner. Relationship is determined by thresholding on the cosine distance between the centers of the clusters.
The problem that arises from this approach, except for that if fails to detect some hidden relationships, is that we have redudant relationships. To fix this problem we eliminate relationships if there is another path of relationships that can express said relationship.
One can think of the clusters as the vertices of a directed graph. Relationships are represented on the graph as edges. The direction of the edges is based on the place on the timeline.
As said earlier, the results are presented to the user through a website. There are three methods of representation used. First, we use a regular feed format. Then, a graph visualization is presented where relationships are more clear. Finally, the clusters are placed on a timeline. Also, on the bottom of the page, we present the volume histogram along with the peak region and the critical curve.
Evaluation is done through the website, using the checkboxes on the side of the feed items to mark them as relevant or ignored. Eliminated clusters from the selection stage are also available and presented in red color, by checking the eliminated checkbox on the top. The mark checkbox on the top can also hide the color of clusters, to eliminate bias. After marking is done, click the evaluate link, calculates the evaluation metrics.
A working version with some example results can be viewed here.
Conclusions and Possible Improvements
The results are quite good and the methodology as a proof of concept seems to work on different types of datasets. Of course there is a lot of room for improvement and extensions.
The system can support other languages if used in conjuction with language analysis tools that support languages other than english. Most of the work for this extension is concentrated at the preprocess step. Latent language analysis (ie synonyms) can also be used a as bridge between different languages. As of now, other languages can be used by the system but with degraded perfomance since the POS tagger and the stemmer used have no effect on laguages other than english.
Other social media networks can be supported as well if the appropriate wrappers are implemented.
Finally, the methodology can be modified to support real-time streams of tweets. Relevance feedback can be used to improve the results. An hierarchical/agglomerative clustering algorithm is more appropriate in the real-time case. The peak detection filter also introduce a delay in the system which could be an issue. A issue that also arises is the initialization and updating of the tf-idf models and the dictionary.