04 December 2012

Data Science DC: Implicit Sentiment Mining in Twitter Streams

Summary and more

I have been attending the Data Science DC meetup pretty regularly, as it’s an interesting meetup that often has quite good talks. The most recent was a very interesting presentation called "Implicit Sentiment Mining in Twitter Streams" by Maksim (Max) Tsvetovat. He discussed a number of ideas that relate to semantic discovery, which are of interest to me as I am doing research into related areas, including applying semantic ideas to software naming. So I thought it would be nice to do a little review, augmented with links to references that were made and additional ones that I found in the course of researching what was discussed. I am also including some more introductory links and material, as it helps me and hopefully others who are not fully versed in the world of NLP. The meetup was held at Google’s downtown DC office; the irony of holding a meetup about Twitter there was pointed out humorously by David Lieber of Google as he introduced Harlan for the usual Data Science DC Harlan-Mark introduction.

Max starts by talking about a system that his team at George Mason built to map sentiment during the 2012 presidential election, which was then used to mine sentiment from current news, in this case the media coverage of the recent conflict in Gaza. This work has yielded an algorithm to show media bias.

He points out that there are a number of things people are trying to mine and predict using Twitter; the example he cites is the Wired article "Twitter Can Predict the Stock Market". He sees Twitter not as a social network but as a wire, an analogue of the physical broadcast ether. It’s not a media carrier but a wire that other things go into, with a real-time nature where things can change very quickly.

He moves on to sentiment analysis, mentioning a paper called "Sentiment Analysis is Hard but Worth it" by Michelle deHaaff. He contrasts this title with what he describes as easy "old school" sentiment analysis: you want to know what people think, so you take a corpus of words and a stream of data and you look for occurrences of good words vs. bad words. You use an average or apply some formula to create a measure of sentiment. This is a naïve approach that might be used in a CS curriculum, but it does not really work in practice due to the complexity of human emotions and of language that can have double and triple entendres. He refers to a computational linguistics paper about "She said" jokes, which I believe is "That’s What She Said: Double Entendre Identification". Some examples he gives of possibly deceptive and/or ambiguous statements in terms of sentiment are:

  • This restaurant would deserve highest praise if you were a cockroach (a real Yelp review ;-)
  • This is only a flesh wound! (Monty Python and the Holy Grail)
  • This concert was f**ing awesome!
  • My car just got rear-ended! F**ing awesome!
  • A rape is a gift from God (he lost! Good ;-)

He summarizes the challenges that these examples pose for machine learning:

  • Ambiguity is rampant
  • Context matters
  • Homonyms are everywhere
  • Neutral words become charged as discourse changes, charged words lose their meaning

The field of computational linguistics has developed a number of techniques to handle some of the complexity issues above by parsing text using POS (parts-of-speech) identification, which helps with homonyms and some ambiguity. He gives the following example:

Create rules with amplifier words and inverter words:

  • This concert (np) was (v) f**ing (AMP) awesome (+1) = +2
  • But the opening act (np) was (v) not (INV) great (+1) = -1
  • My car (np) got (v) rear-ended (v)! F**ing (AMP) awesome (+1) = +2??

Here he introduces two concepts which modify the sentiment, which might fall under the concept of sentiment "polarity classification" or detection. One is the amplifier (AMP), which makes the sentiment stronger, and the other is the inverter (INV), which creates the opposite sentiment. I found this idea of "sentiment modification" intriguing, did a little searching, and came across a paper called "Multilingual Sentiment Analysis on Social Media", which describes these ideas [page 12] and a few more, including an attenuator, which is the opposite of an amplifier. It also describes some other modifiers that control sentiment flow in the text. The paper looks quite interesting, although I only read the first few pages.
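The amplifier and inverter rules can be sketched in a few lines of Python. This is only a minimal illustration: the word lists, the doubling factor, and the name `score` are my own assumptions, not the lexicon or code from the talk. It reproduces the three scores in the example, including the wrong +2 for the rear-ended car, which shows why rules alone are not enough.

```python
# Hypothetical lexicons for illustration only; not the word lists from the talk.
POLARITY = {"awesome": 1, "great": 1, "terrible": -1}
AMPLIFIERS = {"f**ing", "very", "really"}
INVERTERS = {"not", "never"}


def score(sentence):
    """Sum word polarities, letting modifiers act on the next charged word."""
    total = 0
    multiplier = 1  # accumulated effect of pending amplifiers and inverters
    for raw in sentence.lower().split():
        token = raw.strip(".,!?")
        if token in AMPLIFIERS:
            multiplier *= 2   # assumed: an amplifier doubles the next polarity
        elif token in INVERTERS:
            multiplier *= -1  # an inverter flips the sign
        elif token in POLARITY:
            total += multiplier * POLARITY[token]
            multiplier = 1    # modifiers are consumed by the word they modify
    return total


print(score("This concert was f**ing awesome!"))            # 2
print(score("But the opening act was not great"))           # -1
print(score("My car just got rear-ended! F**ing awesome!")) # 2, although the speaker is not happy
```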

He cites a paper "Cognitive map dimensions of the human value system extracted from the natural language" by Alexei Samsonovich and Giorgio Ascoli.  This paper defines the following dimensions:

  • Valence (good vs. bad)
  • Relevance (me vs. others)
  • Immediacy (now/later)
  • Certainty (definitely/maybe)
  • And about 9 more less-significant dimensions

One result which is quite interesting is that these dimensions are pretty much language independent. While searching this out I also came across "Computing Semantics of Preference with a Semantic Cognitive Map of Natural Language: Application to Mood Sensing from Text" and "Principal Semantic Components of Language and the Measurement of Meaning" by the same authors.

Max’s work seems to draw pretty heavily on social networking theory, which includes an O'Reilly book: Social Network Analysis for Startups. He also mentions having quite a bit of exposure to social psychology, consisting of "half a degree" as he put it, which also shows in his work. He mentions a couple of human psychological aspects that are somewhat NLP related but also somewhat divergent: the ideas of mirroring and marker words.

Mirroring is the idea that when people interact and the interaction is positive (the example given was a successful flirtation), one person will mimic the other's body language. He extends this concept to the language used by various parties, in this case the tweets they emit.

Marker words are unique words an individual speaker tends to use. The idea can also extend to common expression between speakers. His description of marker words is:

  • All speakers have some words and expressions in common (e.g. conservative, liberal, party designation, etc)
  • However, everyone has a set of trademark words and expressions that make him unique.

He extends this to the idea of linguistic mannerisms. The examples he cites are that calling health care "Obama care" would mark you as conservative, and calling Hamas "freedom fighters" would mark you as siding with Hamas. He uses these markers to observe mirroring:

  • We detect marker words and expressions in social media speech and compute sentiment by observing and counting mirrored phrases

The next part of the talk gets into the details of how to do the analysis of the raw text. One idea that he talks about is text cleaning, pointing out that Twitter data is very noisy. The text is cleaned in part using stop words, which are words that are common and have little lexical meaning; some examples are {a, on, the, to}. His full list, which he pilfered from WordNet, is here.
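A minimal sketch of this kind of cleaning is below. The tiny stop list, the choice to drop URLs and @mentions, and the name `clean` are assumptions made for illustration; the list used in the talk is much longer and the actual cleaning steps may differ.

```python
import re

# A tiny illustrative stop list; a real one has hundreds of entries.
STOP_WORDS = {"a", "an", "and", "is", "on", "the", "to"}

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
WORD = re.compile(r"[a-z']+")


def clean(tweet):
    """Lowercase a tweet, drop URLs and @mentions, and remove stop words."""
    text = URL_OR_MENTION.sub(" ", tweet.lower())
    return [word for word in WORD.findall(text) if word not in STOP_WORDS]


print(clean("The debate is on tonight http://t.co/abc @cnn"))
# ['debate', 'tonight']
```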

Another important NLP concept is stemming, which comes from linguistic morphology, as illustrated by his example:

  • Stemming identifies root of a word, stripping away: Suffixes, prefixes, verb tense, etc
  • "stemmer", "stemming", "stemmed" ->> "stem"
  • "go","going","gone" ->> "go"

He takes his stemming code from the Python project Natural Language Toolkit.
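To make the idea concrete without depending on any library, here is a deliberately crude suffix stripper. It is a toy of my own (the suffix list and the name `toy_stem` are made up) and is not the NLTK stemmer used in the talk; real stemmers such as Porter's apply many more rules.

```python
SUFFIXES = ("ing", "ed", "er", "s")  # tried in this order


def toy_stem(word):
    """Strip one common suffix and undo a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[: -len(suffix)]
            # "stemm" -> "stem": drop a doubled trailing consonant
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for w in ["stemmer", "stemming", "stemmed", "going", "gone"]:
    print(w, "->", toy_stem(w))
# stemmer -> stem
# stemming -> stem
# stemmed -> stem
# going -> go
# gone -> gone
```

Note that pure suffix stripping leaves "gone" untouched; mapping an irregular form like that back to "go" needs a dictionary-based approach (lemmatization) rather than a stemmer.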

Since the data being mined comes from the internet, which is used by people all over the globe, language detection is important. While the semantic concepts outlined in the above work by Samsonovich and Ascoli may be language independent, the stemming and stop words are not. These techniques apply to most other languages, but the specific tools and data do not, so the goal is to filter out other languages. He sums this up as:

  • Language identification is pretty easy...
  • Every language has a characteristic distribution of tri-grams (3-letter sequences);
    • E.g. English is heavy on "the" trigram
  • Use open-source library "guess-language"

The Python library he uses is guess-language, which is based on some other implementations. There is also a Java library, language-detection on Google Code, which was written by Nakatani Shuyo. All of these use a trigram approach to language detection, which uses n-grams of characters and their probabilities to identify languages. This approach is described in "N-Gram-Based Text Categorization".
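The trigram idea can be sketched with nothing but the standard library. This is a simplified toy: it scores a text by plain overlap with the trigrams seen in one short sample sentence per language, whereas the Cavnar and Trenkle paper ranks n-grams by frequency and compares the rankings. The sample sentences and function names are my own assumptions and have nothing to do with how guess-language is implemented.

```python
def trigrams(text):
    """Return the character trigrams of a text, padded with spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


# Toy training samples; real profiles are built from large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the cat sat on the mat with the other animals",
    "es": "el perro y el gato que viven en la casa de la ciudad "
          "son los mejores amigos del mundo",
}
PROFILES = {lang: set(trigrams(text)) for lang, text in SAMPLES.items()}


def guess(text):
    """Pick the language whose profile contains the most of the text's trigrams."""
    grams = trigrams(text)
    return max(PROFILES, key=lambda lang: sum(g in PROFILES[lang] for g in grams))


print(guess("the dog and the cat"))  # en
print(guess("el gato en la casa"))   # es
```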

After the text is filtered to English, cleaned, and stemmed, what is left are the roots of big words, words that carry more meaning. These are used to create term vectors. Term vectors are a way to map documents into a vector space; this is known as the Vector Space Model (VSM) and is a fairly common approach, used in Lucene and its derivatives like Solr and ElasticSearch. Term vectors can be built with different levels of granularity: in Lucene this is generally done at the document level, but it can also be done at the sentence level.
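As a rough illustration of the Vector Space Model, the sketch below builds term vectors as raw term counts and compares two of them with cosine similarity. Using raw counts (rather than a weighting such as tf*idf) and the example tokens are assumptions of mine; the talk did not specify how its vectors were weighted.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Map a list of cleaned, stemmed tokens to a sparse term-count vector."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity of two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical token lists standing in for two cleaned tweets.
v1 = term_vector(["romney", "tax", "plan"])
v2 = term_vector(["obama", "tax", "plan"])

print(sorted(set(v1) & set(v2)))  # ['plan', 'tax'], the terms the vectors share
print(round(cosine(v1, v2), 3))   # 0.667
```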

I was hoping to better describe what he is doing with the term vectors and how they relate to the graphs that he creates, but I am unclear as to whether his term vectors are built at the document (tweet) level or the sentence level. I believe it is the sentence level, as he refers to a common word in two sentences being the intersection of two different term vectors. He then starts talking about bigrams and linking speakers to bigrams, and I am not sure how these relate to the term vectors. In this case the bigrams (n-grams of order 2) are made of words, as opposed to the trigrams mentioned above for language detection, which are made of letters.

Regardless of how they are created, the system he describes uses bigrams of words linked to speakers, which form a two-mode network, a concept that I was unfamiliar with and which is described in "2-Mode Concepts in Social Network Analysis". This two-mode graph technique drives the final graphs for the two sets of speakers, {Santorum, Gingrich, Romney} and {IDF, Hamas}. He also points out that counting occurrences shows which bigrams are most common, and that these reveal the structure of the discourse.
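My understanding of the two-mode structure can be sketched as a mapping from each word bigram to the set of speakers who used it. The data, the function names, and the use of shared bigrams as a stand-in for mirrored phrases are assumptions of mine, not the code from the talk.

```python
from collections import defaultdict


def bigrams(tokens):
    """Return consecutive word pairs from a token list."""
    return list(zip(tokens, tokens[1:]))


def build_two_mode(tweets):
    """Build bigram -> set of speakers from (speaker, tokens) pairs."""
    edges = defaultdict(set)
    for speaker, tokens in tweets:
        for bigram in bigrams(tokens):
            edges[bigram].add(speaker)
    return edges


def shared_bigrams(edges, a, b):
    """Bigrams used by both speakers, a crude proxy for mirrored phrases."""
    return {bg for bg, speakers in edges.items() if a in speakers and b in speakers}


# Hypothetical speakers and already-cleaned tokens.
tweets = [
    ("alice", ["obama", "care", "repeal"]),
    ("bob", ["repeal", "obama", "care"]),
    ("carol", ["health", "care", "reform"]),
]
edges = build_two_mode(tweets)
print(shared_bigrams(edges, "alice", "bob"))    # {('obama', 'care')}
print(shared_bigrams(edges, "alice", "carol"))  # set()
```

The two node types are the speakers and the bigrams; an edge only ever connects a speaker to a bigram, which is what makes the network two-mode.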

Counting bigrams also makes it possible to throw out bigrams that occur only once in a given time period; purging single occurrences cuts out the noise. The number of co-occurrences is power-law distributed, so most bigrams are rare and can be purged, which reduces this from a big data problem to something that runs on an Amazon micro instance. Dates were also recorded for each occurrence, which allowed noncurrent topics to be purged from the current data over time.
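The purging described above might look something like the following sketch. The seven-day window, the minimum count of two, the sample observations, and the name `prune` are all assumptions for illustration; the talk did not give the actual thresholds.

```python
from collections import Counter
from datetime import datetime, timedelta


def prune(observations, now, window=timedelta(days=7), min_count=2):
    """Keep bigrams seen at least min_count times within the recent window.

    observations is an iterable of (bigram, timestamp) pairs.
    """
    recent = Counter(bg for bg, seen in observations if now - seen <= window)
    return {bg: n for bg, n in recent.items() if n >= min_count}


# Hypothetical observations, not data from the talk.
now = datetime(2012, 11, 20)
observations = [
    (("cease", "fire"), datetime(2012, 11, 19)),
    (("cease", "fire"), datetime(2012, 11, 18)),
    (("iron", "dome"), datetime(2012, 11, 19)),  # seen once: treated as noise
    (("tax", "plan"), datetime(2012, 10, 1)),    # too old: no longer current
    (("tax", "plan"), datetime(2012, 10, 2)),
]
print(prune(observations, now))  # {('cease', 'fire'): 2}
```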

The algorithm to detect media bias, which he warned is naïve, yielded:

  • NPR: 58% favorable to IDF
  • Aljazeera: 53% favorable to IDF
  • CNN: 59% favorable to IDF
  • BBC: 54% favorable to IDF
  • FOX: 51% favorable to Hamas
  • CNBC: 60% favorable to IDF

I hope others find this useful. I sure learned a lot digging into the presentation and researching the ideas presented. This post ran longer than I had originally thought it would, which I attribute to the broad subject area that the talk covered. Semantics is a complex and deep topic with many facets and approaches. I was hoping to throw in some order theory related ideas as well, as they are quite applicable, but that will have to wait for another time.


The following references are a mix of works referenced in the presentation and works that I came across while writing this. Many, but not all, are linked above:

Data Science DC meetup

Data Science DC: Implicit Sentiment Mining in Twitter Streams

Event Audio

Code and Slides

Maksim Tsvetovat Publications

The Math of Search and Similarity, Part One: Lucene, the Boolean Model, tf*idf, and the Vector Space Model

Sentiment Analysis is Hard but Worth it by Michelle deHaaff.

That’s What She Said: Double Entendre Identification by Chloe Kiddon and Yuriy Brun


Stanford NLP

Natural Language Toolkit, github

Multilingual Sentiment Analysis on Social Media by Erik Tromp

Opinion mining and sentiment analysis by Bo Pang and Lillian Lee

Tracking Sentiment Analysis through Twitter by Thomas Carpenter and Thomas Way

Sentiment Analysis: An Overview by Yelena Mejova

N-Gram-Based Text Categorization (1994) by William B. Cavnar and John M. Trenkle

2-Mode Concepts in Social Network Analysis by Stephen P. Borgatti

Basic notions for the analysis of large two-mode networks by Matthieu Latapy, Clemence Magnien, and Nathalie Del Vecchio


  1. It was six years ago when we attended this!?!
