15 March 2014

Are Your Developers Treating Your Project as Their Own Personal Science Project?

Years ago I worked on a project that had a custom role based security system and for this system we needed a way to load menus according to the users’ assigned roles.  Two of the dominant developers on the project decided that this was a perfect use for AJAX mainly because they wanted to learn and use AJAX.  So they build a separate deployable menu application to serve up the menu by emitting XML from a Spring Controller URL, it was very comparable to a simple Restful service. It was called directly from JavaScript in the application’s web page.  Since there would be a lot of calls for any given user’s menu they built a cache for each user’s menu options, it was a custom cache where objects never expired.  This work was completed as I joined the project and as I got acquainted with this project my first thought was why do all this work?  Why not just load the menu when the user is authenticated and then cache it into the HTTP Session, I raised the issue but was ignored.  It was their pet project and it was sacrosanct. 

Later, while testing the apps that used the AJAX menu call the testers ran into a problem, the cache was interfering with testing because it was keeping the database changes from be propagated and testing the applications created a need to continuously restart the AJAX menu app.  So a new admin interface was added to the AJAX menu application to allow the testers to clear the cache.  This created a whole separate application that the client would have to support just to enable this unnecessary AJAX menu delivery mechanism.  So instead of just caching the data in the HTTP Session which is easier and any cache problem would be fixed by just logging out and logging back in, these two developers built a pet project which  consists of an extra application and codebase that has to be understood and maintained.  Ultimately the biggest problem this creates boils down to simple economics, it more expensive to maintain a solution that requires a whole separate application.

I think there is a tendency for all developers both good and bad to sometimes want to make projects more interesting and there can be a lot of temptations.  It is often tempting to create a custom implementation of something that could be found already written because it’s a fun project or the temptation to shoehorn a new hot technology into a project or even the temptation to add a solution to a problem that doesn’t need to be solved or just strait up over engineering.    I am guilty of some of these transgressions myself over the years. 

My anecdote about the menu app is just one of at least a dozen of egregious examples of these types of software project blunders that I have seen over the years.  My anecdote shows a manifestation of this problem and it also shows that these types of software project issues have a real tangible cost.  They create more code and often more complex code.  In some cases they create the need for specialized technical knowledge that wasn’t even necessary for the project.  While these approaches may only result in a small cost increases during the construction of the software the real consequence will be the long term maintenance costs of a software system that was needlessly overcomplicated.   In some cases these costs will be staffing costs but they can also be time costs in that changes and maintenance take considerably longer.  These problems may also result in costs to business as unwieldy systems cause the loss of customers to competitors and can impede the ability to gain new customers.

These types of decisions can be viewed as software project blunders.  In some cases the developers know better but choose to be selfish in that they put their own desires above the project, in these cases these decisions can potentially be viewed as unethical.  The two developers from the anecdote were extreme examples and were pushing into unethical territory since they treated every new project as a personal opportunity to use some new technique or library or some custom idea that they were interested in.  However, I think most of these blunders are simply that, mistakes in judgment or lack of good oversight.  It is always easier in hindsight to see these problems, but when one is in the heat of the moment on the project one can lose perspective.  The real challenge is trying to avoid these blunders. Really these are problems that fall into a larger spectrum of software project management problems and sadly I have mostly seen terrible software project management throughout my career.

I wish I could say I have a solution to this problem.  I think developers have some responsibility here especially senior developers, well at least the good ones.  Software development is really about decisions.  Developers make many decisions every day including naming, project structure, tools, libraries, frameworks, languages, etc.  Too often these decisions are made in isolation without any review and sometimes they can have serious long term consequences.  When making so many decisions in such a complex domain mistakes are inevitable.  Good software developers and good teams make an effort to deliberate and review their choices.

11 December 2013

Haskell DC

I am proud and excited to announce the meetup group Haskell DC.  Our first meetup agenda is open so I envision some discussion and maybe some basic hacking perhaps using Real World Haskell or Learn You a Haskell for Greater Good. Both are freely available online.  In this post I will put forth some ideas that we might consider for future meetups.  I hope to see this grow into a community and am looking forward to contributions from members especially those that are currently working with Haskell.  I hope to see a mix of practical and theoretical discussions and topics. I will also try to grow the group in terms of sponsorship.  We actually have our first sponsor, the local coworking space company Uberoffices is providing us space for our meetups.  I recently added our meetup to haskell.org under East Coast, which makes it seem a little more official.  I have setup social media accounts on Twitter @HaskellDC also use #HaskellDC for HaskellDC tweets.  I created a Google Plus account with the email haskelldc0 at gmail,  haskelldc was taken. I have also created a Community in Google Plus for HaskellDC. I hope to do a Google Hangout for our meetups which was requested by one of our remote members.  I even created a Facebook account if for no other reason but to claim it.  I admit to not being great at the social media stuff but will do my best or perhaps I will get help there from other members.

Learning Haskell has been on my list of things that I want to do for quite a while.  Haskell is attractive to me because it was developed in academia and embodies some mathematical concepts and now it seems to be gaining ground for commercial use.  So it is interesting to both the computer scientist and the working programmer.  I have a few ideas about future meetup topics both practical and theoretical that I wanted to list out in this post. Of course the future is wide open and I am hoping that as a group we develop future ideas with opportunities for anyone to present or lead a coding session.  In researching this post I became overwhelmed by the copious amount of material that there is out there for Haskell.  I plan to write at least one if not two more posts about Haskell in terms of resources, research, and the ecosystem.

The following are some lists of Programming Language Features, Software Development Related Topics, Practical applications, Advanced Programming Areas and Theoretical Topics. This list is derived from areas of my own interests along with areas that I feel are useful to practical software development.  I put these forth as possible topics for consideration for future meetups:

Haskell Language


  • IDE Support
  • Building and Deploying, Production deployment
  • Continuous Integration
  • Package Management
  • Testing
  • Implementations, GHC, etc.

Practical Uses

  • Web Development
  • XML, HTML Processing and parsing
  • Network Client/Server, Web Services, Restful, JSON, SOAP
  • Desktop GUI
  • Graphics support
  • Embedded programming
  • Relational Database Access
  • MongoDB
  • Neo4j
  • Hadoop/HBase
  • Lucene, Elastic Search, Solr
  • Other NoSQL: Casandra, CouchDB, Riak, etc.

Advanced Topics and Uses


Interesting (Haskell Related) Publications

Ok, that’s quite a list. I tried to categorize these as best as I could. Monads show up under theory and language features, although I am not sure but I think they may be considered more of a language idiom.  The list is clearly ambitious.  If we have one meetup per month it would probably take several years to cover all of these topics if that were the plan.  As some of the theoretical topics are pretty advanced, and I hope to someday understand them.  Hopefully as a group maybe we make some progress as the theoretical aspects are important for better application of the practical.

I have heard many positive things said about Haskell in regards to good software development practices and that it facilitates the creation of better quality software which is what I, and most likely all good software developers, are striving for.   One topic that is of great interest to me is software reuse, I have written about it in relation to complexity and generic programming.  The generic programming post takes its title from a paper of the same name, listed above as a Haskell related paper.  Interestingly Edward Kmett mentions in his Haskellcast about his Lens Library that using Haskell facilitates a higher ease of reuse that would require significant discipline, especially for a team, when using an OO language like C++.  Another area is that of testing, while I think testing is important, my experience with TDD is limited, but it seems that TDD increases the amount of code that has to be written and understood.  I have read that Haskell’s type system, like any statically typed language, allows for less testing because of static type checking.  This rings true to me as dynamically typed languages like Python, Javascript, and Ruby need tests to enforce things that a type system would give you. Also, purity and referential transparency eliminate side effects. State still exists but it is managed more cleanly.  Haskell’s concise syntax is another potential benefit which allows more complexity to be expressed and read more easily.  Powerful features like list comprehensions, pattern matching, and abstract datatypes (ADTs) allow for smaller idiomatic code.  Of course this requires programmers that understand all of this.

CS research publications are a primary area of interest for me and I have encountered many that use and mention Haskell.  The list above provides some papers covering some fairly pragmatic topics that should be of interest to working programmers like Generic Programming, Design Patterns, and Data Structures.     I also include Brent Yorgey’s Typeclassopedia which gets a lot of mentions from people giving advice about learning Haskell and starts with the following advice:

 “The standard Haskell libraries feature a number of type classes with algebraic or category-theoretic underpinnings. Becoming a fluent Haskell hacker requires intimate familiarity with them all, yet acquiring this familiarity often involves combing through a mountain of tutorials, blog posts, mailing list archives, and IRC logs. The goal of this article is to serve as a starting point for the student of Haskell wishing to gain a firm grasp of its standard type classes. The essentials of each type class are introduced, with examples, commentary, and extensive references for further reading.

In the course of my own general research and what I did for this post, I can attest to the large amount and overwhelming Haskell related topics both theoretical and practical that are available online.

In my above list of possible theoretical interests to the meetup group, I list Homotopy Type Theory.  The relevance here is of course that Type Theory is relevant to Haskell.  The theory and the book are new, coming out this year.  I wanted to give it special mention because there is a lot of excitement about it, including claims that the theory is potentially revolutionary to both mathematics and computer science.  The book is a daunting 600 pages and very heavy going, however, Robert Harper has done a series of lectures.  I have watched the first lecture and found it very enlightening so I would recommend it, so far anyway.     

So for the meetup, as I mentioned we will probably start off with some easy learning and hacking. I thought it might be fun to try to implement something simple, maybe Fizz Buzz.  Ultimately it would be nice to work to get set up with the ecosystem to do productive things in Haskell. Maybe eventually a group project or some higher level hacking, I have thought it might be interesting to something like work though some of the problems in Think Stats possibly using a Haskell statistics package.

Well those are my ideas, feel free to suggest others or correct me if I got anything wrong, this is all new to me.

11 March 2013

Programming and Order Theory

Covariance, Contravariance and Order Theory

In this post I make the observation that covariance and contravariance in programming are what are known as Order Duals.  I am not the first person to make this observation, however, these ideas often tend to be buried in academic research papers like "Adding Axioms to Cardelli-Wegner Subtyping" by Anthony J H Simons, don’t get me wrong I love these types of papers, they give me hope and inspiration that software engineering will someday become a first class engineering citizen.  Unfortunately, these types of papers tend to be too theoretical and thus not very accessible for the average developer.  This is unfortunate as the idea of covariance and contravariance as order duals both puts these concepts into the mathematical context of order theory and possibly gives programmers some native context for order theory.  So hopefully this post will make these ideas more programmer friendly.

I previously wrote a post about lattice theory, which is part of the more general order theory, where I talked about some basic order theory ideas such as duality.   Order theory occurs quite a lot in software and programming and this is part of a series of posts to talk about those occurrences.

Covariance and contravariance receive a fair amount of attention, as they should, in software blogs and some of these posts include some interesting observations. One, perhaps slightly off topic observation, is "Liskov Substitution Principle is Contravariance" which is an interesting observation and interesting post if you overlook the disdainful tone towards OO.  Another more relevant post which is a nice post about "Covariance and Contravaiance in Scala" relates these ideas to category theory which is relevant especially since apparently you can think of "Category Theory as Coherently Constructive Lattice Theory", warning heavy going in that paper.

Defining Covariance and Contravariance

To me one of the most striking and perhaps apropos examples of order theory in software is that of covariance and contravariance which Eric Lippert defines on his blog as:

The first thing to understand is that for any two types T and U, exactly one of the following statements is true:

  • T is bigger than U.
  • T is smaller than U.
  • T is equal to U.
  • T is not related to U.

For example, consider a type hierarchy consisting of Animal, Mammal, Reptile, Giraffe, Tiger, Snake and Turtle, with the obvious relationships. (Mammal is a subclass of Animal, etc.) Mammal is a bigger type than Giraffe and smaller than Animal, and obviously equal to Mammal. But Mammal is neither bigger than, smaller than, nor equal to Reptile, it’s just different.

He has an eleven part series on covariance and contravariance, his posts cover some C# implantation details but the ideas are generally applicable and looking at one language’s details can help with comparing and contrasting to other languages.

Wikipedia includes the following definition, the animal example is pretty popular:

  • Covariant: converting from wider (Animals) to narrower (Cats).
  • Contravariant: converting from narrower (Triangles) to wider (Shapes).
  • Invariant: Not able to convert.

Including this is slightly redundant but this definition captures the conversion aspect and defines the relationships explicitly. 

Covariance and Contravariance as Order Duals

The above are definitions that have order theory written all over them.  In fact that is pretty much a text book definition of an order Relation in that it is reflexive, transitive, and antisymmetric. It is reflexive since Animal = Animal, transitive since Animal ≤ Mammal ≤ Cat implies Animal ≤ Cat, and antisymmetric since Animal ≤ Mammal implies not Animal ≥ Mammal and in the animal example there are cases of both comparability and incomparability as you would find in a partial order.

As you can see from the above definitions both sets of terms, wider/narrower or bigger/smaller, which are the same, define an order dual for comparison.  To write it more formally we will call the set C classes of various types in an OO hierarchy. So covariant would be represented by less than or equals ≤ and contravariant would be represented by greater than or equals ≥ and a set of classes with these order relations can be written with mathematical notation as (C, ≤) = (C, ≥)d .

Types as Sets of Fields

It was in researching this post that I came across the paper "Adding Axioms to Cardelli-Wegner Subtyping". These kinds of discoveries are one of the reasons I write these posts.  In that paper they quote another paper "On understanding types, data abstraction and polymorphism" by Luca Cardelli and Peter Wegner:

a type A is included in, or is a subtype of another type B when all the values of type A are also values of B, that is, exactly when A, considered as a set of values, is a subset of B

The ideas about types and subtypes covered in these papers extend beyond which fields a class or object has, however, I thought it would be interesting and beneficial to limit the discussion to that case.  One reason is that if you take the fields of an object or class then all subtype collections of fields will be found in the powerset and all subtypes will be a subset relation and these can be drawn as my favorite lattice, yes I have favorite lattice, the powerset lattice.  Also in this case covariance and contravariance are now defined as the subset and superset operations on the powerset lattice.

Types as Sets of Fields in the Real World

Now I always feel that a real example helps quite a bit so I have created a set of example classes, Scala traits actually, which illustrate the above ideas using a quasi-real-world example.  Please note that these code examples are designed for the purposes of illustrating these ideas and may contain design issues that one would not implement in the real world, but they should be close enough to bridge the conceptual gap, if you will.  Also this first example is should be applicable to dynamically typed languages that might use duck typing or structural typing.

The following Scala traits define a possible domain object hierarchy that could be used to persist data to a database and render it back to a web page among other possible uses:

trait BaseDomain {

override def equals(that: Any) : Boolean

override def hashCode : Int



trait PersonInfo extends BaseDomain {

var firstName : String

var lastName : String



trait Address extends BaseDomain {

var street : String

var street2 : String

var city : String

var state : String

var country : String

var zipCode : String



trait PhoneNumber extends BaseDomain {

var phoneNumber : String

var extension : String



trait Person extends BaseDomain {

var personInfo : PersonInfo

var address : Address

var phoneNumber : PhoneNumber


These traits yield the following field powerset lattice:

Now suppose we would want to define a type for each of the above lattice points, which we probably would not do but there may be cases to do similar types of things in the real world.  Let’s define the following Scala traits that wrap the above domain object hierarchy elements:

trait PersonInfoTrait extends BaseDomain {

var personInfo : PersonInfo



trait AddressTrait {

var address : Address



trait PhoneNumberTrait extends BaseDomain {

var phoneNumber : PhoneNumber



trait PersonInfoAddressTrait extends AddressTrait with PersonInfoTrait {



trait AddressPhoneNumberTrait extends AddressTrait with PhoneNumberTrait {



trait PersonInfoPhoneNumberTrait extends PhoneNumberTrait with PersonInfoTrait {



trait PersonTrait extends PersonInfoAddressTrait with AddressPhoneNumberTrait with PersonInfoPhoneNumberTrait {


Since Scala traits support multiple inheritance we can define the above type hierarchy which can be drawn as the powerset lattice that it is:

Again we can see covariant and contravariant types defined on this lattice and each relation is actually the subset superset relation on fields.

I feel the basic observations and the above examples on order duals and covariance and contravariance make the ideas pretty strait forward for field set context.  In writing this I delved into a number of papers, adding some of them to my "to read list", on Types in programming and Type Theory and I feel there are probably some deeper insights and implications of all this. 

04 December 2012

Data Science DC: Implicit Sentiment Mining in Twitter Streams

Summary and more

I have been attending the Data Science DC meetup pretty regularly as it’s an interesting meetup often with quite good talks, the most recent was a very interesting presentation called "Implicit Sentiment Mining in Twitter Streams" by Maksim (Max) Tsvetovat.  He discussed number of ideas that relate to semantic discovery which are of interest to me as I am doing research into related areas including applying semantic ideas to software naming.  So I thought it would be nice to do a little review augmented with links to references that were made and additional ones that I found in the course of researching what was discussed.  I am also including some more introductory links and material as it helps me and hopefully others who are not fully versed in the world of NLP.  The meetup was held at Google’s downtown DC office, the irony being that the meetup was about Twitter was pointed out humorously by David Lieber of Google as he introduced Harlan for the usual Data Science DC Harlan-Mark introduction.

Max starts by talking about a system that his team at George Mason built to map sentiment during the 2012 presidential election which was then used to mine sentiment from current news, in this case the media coverage of recent conflict in Gaza. This work has yielded an algorithm to show media bias.

He points out that there are a number of things people are trying to mine and predict using twitter, the example he cites is the wired article "Twitter Can Predict the Stock Market".  He sees twitter not as a social network but as a wire, an analogue of physical broadcast ether.  It’s not a media carrier but a wire that other things go into with a real-time nature where things can change very quickly.

He moves on to Sentiment analysis, mentioning a paper called "Sentiment Analysis is Hard but Worth it" by Michelle deHaaff.   He contrasts this title with what he describes as an easy "old school" sentiment analysis.  It's where you want to know what people think, so you take a corpus of words and a stream of data and you look for occurrences of good words vs. bad words. You use an average or apply some formula to create a measure of sentiment, which is a naïve approach that might be used in a CS curriculum, but it does not really work in practice due to the complexity of human emotions and language that can have double and triple entendres.  He refers to a computational linguistics paper about "She said" jokes, which I believe is this "That’s What She Said: Double Entendre Identification".  Some examples he gives of possibly deceptive and/or ambiguous statements in terms of sentiment are:

  • This restaurant would deserve highest praise if you were a cockroach (a real Yelp review ;-)
  • This is only a flesh wound! (Monty Python and the Holy Grail)
  • This concert was f**ing awesome!
  • My car just got rear-ended! F**ing awesome!
  • A rape is a gift from God (he lost! Good ;-)

He summarizes these ideas which are challenges to machines learning these things:

  • Ambiguity is rampant
  • Context matters
  • Homonyms are everywhere
  • Neutral words become charged as discourse changes, charged words lose their meaning

The field of computational linguistics has developed a number of techniques to handle some the complexity issues above by parsing text using POS (parts-of-speech) identification which helps with homonyms and some ambiguity. He gives the following example:

Create rules with amplifier words and inverter words:

  • This concert (np) was (v) f**ing (AMP) awesome (+1) = +2
  • But the opening act (np) was (v) not (INV) great (+1) = -1
  • My car (np) got (v) rear-ended (v)! F**ing (AMP) awesome (+1) = +2??

Here he introduces two concepts which modify the sentiment, which might fall under the concept of sentiment "polarity classification" or detection.  One idea is of an amplifier (AMP) which makes the sentiment stronger and an inverter (INV) which creates an opposite sentiment.  I found this idea of "sentiment modification" intriguing and did a little searching and came across a paper called "Multilingual Sentiment Analysis on Social Media" which describes these ideas [page 12] and a few more including an attenuator which is the opposite of an amplifier.  It also describes some other modifiers that control sentiment flow in the text, pretty interesting concepts, actually the paper looks quite interesting, I only read the first few pages.

He cites a paper "Cognitive map dimensions of the human value system extracted from the natural language" by Alexei Samsonovich and Giorgio Ascoli.  This paper defines the following dimensions:

  • Valence (good vs. bad)
  • Relevance (me vs. others)
  • Immediacy (now/later)
  • Certainty (definitely/maybe)
  • And about 9 more less-significant dimensions

One result which is quite interesting is that these dimensions are pretty much language independent. While searching this out I also came across "Computing Semantics of Preference with a Semantic Cognitive Map of Natural Language: Application to Mood Sensing from Text" and "Principal Semantic Components of Language and the Measurement of Meaning" by the same authors.

Max’s work seems to run pretty heavy in social networking theory, which includes an Orielly book: Social Network Analysis for Startups.  He also mentions having quite a bit of exposure to social psychology, consisting of "half a degree" as he put it, which also shows in his work.  He mentions a couple of human psychological aspects, somewhat NLP related but also somewhat divergent, these are the idea of mirroring and marker words.

Mirroring is the idea that when people interact, if the interaction is positive, the example that was given was a successful flirtation, then one person will mimic the others body language.  He extends this concept to the language used by various parties, in this case the tweets they emit.

Marker words are unique words an individual speaker tends to use. The idea can also extend to common expression between speakers. His description of marker words is:

  • All speakers have some words and expressions in common (e.g. conservative, liberal, party designation, etc)
  • However, everyone has a set of trademark words and expressions that make him unique.

He extends this idea to the idea of linguistic mannerisms he cites are calling health care "Obama care" would mark you as conservative, calling Hamas "freedom fighters" would mark you as siding with Hamas.  Which he uses to observe mirroring:

  • We detect marker words and expressions in social media speech and compute sentiment by observing and counting mirrored phrases

The next part of the talk gets into the details of how to do the analysis of the raw text.   One idea that he talks about is text cleaning pointing out that Twitter data is very noisy.  The text is cleaned in part using stop words which are words that are common and have little lexical meaning, some examples are {a, on, the, to}. His full list which he pilfered from WordNet is here.  

Another important NLP concept is stemming a linguistic morphology related concept, given by his example:

  • Stemming identifies root of a word, stripping away: Suffixes, prefixes, verb tense, etc
  • "stemmer", "stemming", "stemmed" ->> "stem"
  • "go","going","gone" ->> "go"

He takes his stemming code from the python project: Natural Language Toolkit.

Since the data being mined is coming from the internet which is used by people all over the globe, language detection is important. While the semantic concepts as outlined in the above work by Samsonovich and Ascoli may be language independent, the stemming and stop words are not, these techniques apply to most other languages but the specific tools and data do not, so the goal is to filter out other languages.  He sums this up as:

  • Language identification is pretty easy...
  • Every language has a characteristic distribution of tri-grams (3-letter sequences);
    • E.g. English is heavy on "the" trigram
  • Use open-source library "guess-language"

The Python library he uses is guess-language which is based on some other implementations.  There is also a java library: language-detection on Google code which was written by Nakatani Shuyo.  All of these use a trigram approach to language detection which uses an n-gram of characters and their probabilities to identify languages this approach is described in "N-Gram-Based Text Categorization".

After the text is filtered to English, cleaned, and stemmed this leaves roots of big words, words that carry more meaning.  These are used to create term vectors.  Term vectors are a way to map documents into a vector space, this is known as the Vector Space Model (VSM), and is a fairly common approach, it is used in Lucene and its derivatives like Solr and ElasticSearch.  Term vectors can be built with different levels of granularity, generally in Lucene this done at the document level but it can also be done at the sentence level.

I was hoping to better describe what he is doing with the term vectors and how they relate to the graphs that he creates but I am unclear as to whether his term vectors are built at the document (tweet) level or sentence level, I believe it is the sentence level as he refers to a common word in two sentences being the intersection of two different term vectors.  He then starts talking about bigrams and linking speakers to bigrams, I am not sure how these relate to the term vectors.  In this case the bigrams, n-grams order 2, refer to words as opposed to the trigrams mentioned above for language detection which were for letters.

Regardless of how they are created the system he describes uses bigrams of words linked to speakers which form a two-mode network, a concept that I was unfamiliar with which is described in "2-Mode Concepts in Social Network Analysis".  This two-mode graph technique drives the final graphs for the sets, in the cases of {Santorum, Gingrich, Romney} and {IDF, Hamas}.   He also points out by counting of the occurrence of bigrams the most common bigrams give the nature of discourse structure.

Counting bigrams enables a technique to throw out bigrams that only occur once in a certain time period, purging single occurrences cuts out the noise.  The number of co-occurrences are power law distributed which reduces this from a big data problem to something that runs on an Amazon micro instances.  Also dates were recorded for each occurrence which allowed noncurrent topics to be purged from the current data over time.

The algorithm to detect media bias, which he warned is naïve, yielded:

NPR58% favorable to IDF
Aljazeera53% favorable to IDF
CNN59% favorable to IDF
BBC54% favorable to IDF
FOX51% favorable to Hamas
CNBC60% favorable to IDF

I hope others find this useful, I sure learned a lot digging into the presentation and researching the ideas presented.  This post ran longer than I had originally thought. I attribute this to the broad subject area that this talk covered.  Semantics is a complex and deep topic with many facets and approaches, I was hoping to throw some order theory related ideas in as well, as they are quite applicable, but that will have to wait for another time.


The following references are a mix of works referenced in the presentation and that I came across while writing this, many are linked above but not all are:

Data Science DC meetup

Data Science DC:Implicit Sentiment Mining in Twitter Streams

Event Audio

Code and Slides

Maksim Tsvetovat Publications

The Math of Search and Similarity, Part One: Lucene, the Boolean Model, tf*idf, and the Vector Space Model

Sentiment Analysis is Hard but Worth it by Michelle deHaaff.

That’s What She Said: Double Entendre Identification by Chloe Kiddon and Yuriy Brun


Stanford NLP

Natural Language Toolkit, github

Multilingual Sentiment Analysis on Social Media by Erik Tromp

Opinion mining and sentiment analysis by Bo Pang and Lillian Lee

Tracking Sentiment Analysis through Twitter by Thomas Carpenter and Thomas Way

Sentiment Analysis: An Overview by Yelena Mejova

N-Gram-Based Text Categorization (1994) by William B. Cavnar , John M. Trenkle

2-Mode Concepts in Social Network Analysis by Stephen P. Borgatti

Basic notions for the analysis of large two-mode networks by Matthieu Latapy, Clemence Magnien, and Nathalie Del Vecchio