28 May 2012

The Math of Search and Similarity

Part one: Lucene, The Boolean Model, tf*idf, and the Vector Space Model

Lucene is built on classic IR (Information Retrieval) Math concepts that are actually pretty old by current standards in our industry, with most of its roots from the 1970’s and some of them like the Boolean Model being quite a bit older. Still that’s a solid quarter century older than search techniques like Pagerank and yet it can be a central and almost necessary technology used in many systems.  The mathematical domain of searching in Lucene relies on ideas like similarity measures which also relates to other disciplines that are used in Data Mining and Machine learning.  The basic ideas of Lucene and IR provide a nice introduction to these ideas and I want to focus on the following classic IR topics:

Tokenizing Text

Lucene is built on the idea of Documents, which can be thought of as a collection of terms.  Think of a document as a web page or book and think of the terms as the words and in some cases phrases.  Lucene uses an Analyzer to tokenize the text, for example you may want not want to include Stop Words like [the, and, to, a, ...] in your index also indexes can be built using the concept of N-Grams take the expression "Information Retrieval" for example, you might want to treat this as a single term as opposed to breaking it up, there are also Analyzers that do Stemming and even phonetic analysis.  The tokenizing process is involved but that should give you a basic idea, more can be found here.  Tokenization allows the terms to be defined, identified and parsed and it is the properties of terms in relation to the documents which drive the Boolean Model, the Vector Space Model and tf*idf.

The Boolean Model

The Boolean Model treats documents as vectors of boolean indicators, this is a brief summary of ideas discussed in chapter one of Introduction to Information Retrieval, I am borrowing part of the example from that chapter:

  Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 1 1 0 0 0 1
Brutus 1 1 0 1 0 0
Caesar 1 1 0 1 1 1
Calpurnia 0 1 0 0 0 0
Cleopatra 1 0 0 0 0 0

This is an incidence matrix that shows which terms are in which documents.  This representation can be viewed as either a set of columns which form binary vectors indicating which term occurs in each document or it can be viewed as a set of binary term-vectors which shows which shows which documents contain a given term.  This example is also a bit deceiving in that these vectors will be pretty sparse, The Tempest and Othello document vectors (columns) and the Cleopatra and Calpurnia term vectors (rows) are more indicative of this sparseness at this small scale.  Whether we store and search this data as a full matrix or individual vectors it would require a lot of space and the space would be mostly zeros, so a better structure to store and search this is used, the inverted index, the details of which are covered in chapter one of Introduction to Information Retrieval and Apendix B of Lucene in Action.

The queries on this structure take the form of standard Boolean logic including: Logical Conjunction, Logical Disjunction and Logical Negation. An example is (again taken from Introduction to Information Retrieval with some notational modification):

To answer the query we create a vector (q) [Brutus AND Caesar AND NOT Calpurnia], we take the Boolean vectors for Brutus (b), Caesar (cs) and Calpurnia (cl), complement the last, and then do a bitwise AND, where all Boolean vector operation are bitwise operations, this can all be denoted as follows:

The resulting term vector for this query yields: [Antony and Cleopatra, Hamlet] vector positions one and four.

In the above example you can see using the column vectors shows how the inverted index structure works. Here we are indexing into the corpus based on the terms not on the documents. So where a document is usually a collection terms (a column) we have inverted this document/term relationship and are now using the term/document relationship to link to the documents (rows). The inverted index structure in Lucene is built on listing terms and then linking to the documents, in Lucene each document is given an integer document id, so each term has a list of document ids that contain the document. The inverted index structure can include other information such as the position in the document or other metadata.

The inverted index is the same concept as something known as a concordance, which is a list of words and their positions in a work used by scholars. A famous example is one that was published for the Dead Sea scrolls by the team who exclusively controlled them, but the published images of the scrolls at a very slow pace which left many unpublished and this created a researcher monopoly on the scripts which upset many researchers who did not have access. In 1988 Ben Zion Wacholder used a computer to reconstruct 17 documents from the concordance, these were published in 1991.  Coincidently the same month a complete set of facsimiles were discovered at the Huntington Library and published breaking the monopoly on the scrolls, they are now available free on online, as they should be.

tf*idf Weighting (Term Frequency*Inverse Document Frequency)

tf*idf stands for Term Frequency – Inverse Document Frequency which is basically the Term Frequency multiplied by the Inverse Document Frequency.  In 1972 Karen Sparck Jones published a paper proposing what has become known as Inverse Document Frequency (idf) it is defined as:

Where |D| is the size of the set of all of the documents and |dt| is the number of documents that contain the term, this formula is effectively the log of the inverse of the estimated (uniform) probability that a term will occur in a document, which would be given by the formula:

This can be rewritten as:

Using the logarithmic identity, this is needed since a log of a probability, which is less than or equal to one, will always be either a negative value or zero:

The interesting thing about using logs is that it makes the idf terms addable for example if we want to compute the Inverse Document Frequency for term one and(∧) term two, assuming they are mutually exclusive which means the probability P(x ∧ y) = P(x)P(y) is multiplicative:

The additive property is due to another logarithmic identity, which also forms a group isomorphism between multiplication of positive real numbers and addition of real numbers:

Now to get tf*idf we need to multiply the Term Frequency which is the number of times the term occurs in the document with the Inverse Document Frequency, the complete formula is:

When I learned about this I immediately thought of the formula for Information Theory:

It turns out that this has been thought of and debated and is not really correct, a topic that is fairly involved and you can read more about it this paper by Stephen Robertson. The ideas above about the logarithmic properties of addition in idf are taken from this paper.

Basically tf*idf is a term weighting mechanism which maps documents and terms to real numbers and we need this for the next piece of the puzzle.

The Vector Space Model

One problem with the Boolean model is that it either gives too many results or too few.  In the example above a Shakespearean "micro-corpus" was used to demonstrate how it works but in real corpus a query like that could potentially retrieve thousands of documents which would make it impractical to attempt to wade through the result set.  This is where a weighting mechanism like tf*idf comes in, it effectively maps the boolean values to values in the domain of Real Numbers, well technically floating point numbers.  The vector space model uses the columns of the incidence matrix above which are the vectors of terms for each document where each term position in the vector is calculated using the tf*idf weighting scheme.  These document vectors exist in a Vector Space and a query can be constructed as a vector and each document can be compared to the query vector using the angles between the query vector and the document vectors (α, θ) to determine which documents (d1,d2) best answer a query (q), shown visually above.  Also you should note that this is a simple two dimensional example, in reality there will be a dimension for each additional term which becomes a hyperdimensional vector space consisting of hundreds or even thousands of dimensions. 

It is not the angle between vectors that is calculated but the cosine of the angle, which is known as the Cosine Similarity, this is used as it is easier to calculate than the angle itself, the formula is:

The numerator is the Dot Product of the two vectors, a query vector and a document vector, the denominator is the product of the Euclidean Norm or Magnitude of the two vectors.  The above equation can be written as follows with the expanded Dot Product and Euclidean Norm calculations:

tf*idf is designed to increase as the number of occurrences of the term increase in a given document It should also be noted that the Vector Space Model can be built with other term weighting mechanisms other than tf*idf.  More information can be found in "Cosine Similarity and Term Weight Tutorial", in chapter six of Introduction to Information Retrieval and Lucene specific information on scoring.

Extending our example from above we can calculate our weights, first the idf values:

       D      dt      Idf
Antony 6 3 0.69314718
Brutus 6 3 0.69314718
Caesar 6 5 0.18232155
Calpurnia 6 1 1.79175946
Cleopatra 6 1 1.79175946

From a Shakespearean concordance we can get the number of occurrences for each term:

  Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 157 61 0 0 0 1
Brutus 3 112 0 1 0 0
Caesar 159 145 0 2 1 1
Calpurnia 0 10 0 0 0 0
Cleopatra 56 0 0 0 0 0

Using the idf values and term frequencies we can calculate tf*idf for our documents:

  Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth
Antony 108.82410 42.281978 0 0 0 0.69314718
Brutus 2.0794415 77.632484 0 0.69314718 0 0
Caesar 28.989127 26.436625 0 0.36464311 0.18232155 0.18232155
Calpurnia 0 17.917594 0 0 0 0
Cleopatra 100.33853 0 0 0 0 0

I will leave calculating a query as an exercise for you, if you are feeling so inclined or masochistic.

Lucene uses a combination of the Boolean Model and the Vector Space Model, when the query is run the documents are first "approved", selected, by the Boolean Model and then scored by The Vector Space Model.  I believe this approach offers two advantages.  First it would reduce the number of calculations to search the Vector Space since you aren’t really searching it and limiting your calculations to a subset.  Secondly the Vector Space Model does not handle negation in queries, and I am not sure how negation is handled in Lucene in regards to scoring.  Actually there have been a number of IR tweaks and improvements for flexibility and efficiency that have been incorporated into Lucene and I confess I have only scratched the surface in both my coverage here and current knowledge.

13 May 2012

Software System Morphology

On the Importance of Naming in Software Systems, part III

In my previous posts on naming "A Survey of the Conventional Wisdom on Software Naming" and "More Thoughts on Formal Approaches to Naming in Software" I discussed several ideas about defining a way to talk about software system naming, I feel the name "Software System Morphology" captures this "big picture" concept. My objective is to develop a comprehensive way to define and analyze the semantic nature (morphology) of the construction of a software system. It’s an ambitious undertaking and whether or not I succeed in this endeavor I am hoping that these ideas will contribute to a better solution to this problem.

To review, I am defining three attributes that a name of signifier can have, the name scope context, i.e. where the name is used or occurs, the type or class of the system referent signified by the name, and the lexical structure of the name i.e. the n-gram of name units.  I feel that these three qualities can be used to determine and possibly measure aspects of the morphological structure of software and by determining the morphological structure of a software system that structure might be "comprehended" and refined which could yield higher quality software. 

In Linguistics a morpheme is the smallest unit of meaning and it is considered, by some, to be a complete sub-discipline of the study of Linguistics. I admit that I am in the process of learning more about it, I am a few chapters into What is Morphology by Mark Aronoff and Kirsten Fudeman it is an excellent book for linguistic neophytes like me.  Software System Morphology will both be divergent and dependent on natural language morphological concepts, which are quite complex.  However a general understanding of linguistic morphology will make these ideas more accessible, I suggest looking into it on your own, it’s fascinating and worth it solely on its own merits.  A quick example of English Morphology can be given with the morphologically complex word autobiography which consists of the morphemes: auto/bio/graph/y, where [auto] means "self", [bio] means "life", [graph] means "drawn" and [y] means "characterized by".

In my first post of this series I explored various ideas proposed by a few prominent software engineers, one idea that I discussed was the use of what were denoted as qualifiers which included the scope qualifier, the type qualifier and the computed-value qualifier. These qualifiers which are prefixed or suffixed to names can be thought of more generally as morphemes. So we can now incorporate those ideas into a more general way of looking at meaning in software naming. Now the natural conclusion, if you have been paying attention, may be to assume that the name unit concept will map to the morpheme concept, it does not, at least in terms of only software morphology and this is in interesting distinction, software morphemes may consist of n-grams of name units which can be natural language morphemes two examples are CommandPattern and MedicalRecord where separating these into individual name units potentially become meaningless in certain software contexts.

Software System Morphology is primarily influenced by three things, the semantic nature of the problem domain(s), the semantic nature of the solution domain(s), which include languages ORMs, paradigms, design patterns, data structures, and finally the natural language(s) in which both the problem domain and solution domains are expressed. The previous example items CommandPattern and MedicalRecord illustrate these ideas as well.  

Ok now let’s take our concepts further with a more detailed example.  Assume we have a Hospital Medical system which has the following problem domain concepts: [Medical Record, Patient, Visit, Doctor, Hospital, Address, Invoice]. Obviously there are more. For the solution domain we will assume a JEE/Spring style implementation which might include the following concepts: [Entity, Dao, ServiceLayer, Controller, Map, View].  Now some expected names (signifiers) might be: MedicalRecordEntity, PatientEntity, PatientServiceLayer, PatientDao, PatientController, PatientView, VisitEntity, VisitDao, VisitServiceLayer, VisitController, VisitView, PatientVisitMap, PatientHospitalVisitor,... Now as you can see the names are composed of morphemes from both the problem domain and the solution domain.  This is to be expected as each name tells what it is/does in the problem domain and what it is/does in the solution domain. In fact any developer with knowledge of JEE Spring Web application developer can look at these names and know what to expect and where they fit in the system.  Additionally it is easy to look at the above set of domain concepts and anticipate named system referents that I omitted.

Previously I pointed out that the idea referred to as a qualifier is really a morpheme and while this is true the term qualifier seems to denote a more contextual setting in terms of the domain concepts.  In the above examples the solution domain morphemes act as qualifiers telling where and how each named system referent fits into the solution Entity tells that it is a domain object, Dao tells that it relates to persistence, ServiceLayer tells that one would expect to see a separation of concerns that would put the business logic here. Structurally one would expect the domain objects (with the Entity suffix) to be passed back from the Dao’s and ServiceLayer’s and there might be transformative manipulation or aggregation of the domain objects in the ServiceLayer.  The idea of qualification is meaning based but in this case it also indicates structural aspects of the solution domain. In this example I intentionally use the Entity suffix, in many systems the domain concepts would be used alone i.e. [Patient, Visit], I have mixed feelings about this in one sense it can be viewed as potentially unnecessary, in another sense it represents a qualifier that can possibly improve comprehension. Qualifiers do make names longer and while I tend to favor longer names it can become problematic if they get too long.  These ideas towards naming are powerful for building comprehensible maintainable code as you can see from the above name examples that they follow a consistent pattern, admittedly these are fairly easy naming tasks, but you would be surprised how many developers fail to even do these types of things consistently.

Unlike others who advocate specific practices in regards to naming I feel that naming should be dealt with in a flexible framework that can be adapted to a specific project’s needs.  While I personally agree and disagree with ideas previously discussed, my philosophy is that naming should be done consistently and comprehensively reflecting deliberate choices using conceptual understanding and analysis of the Domains. Also it should be tracked in a Lexicon, an idea that I am working on and hope to explore in a later post.  Also as I have said before I feel that tooling could be used to allow better conceptual understanding and navigation and the use of analytics and possibly visualization could be used to ease naming decisions in practice by providing improvements to both static analysis and interactive autosuggest/intellisense aids.

I feel that these ideas are an attempt to create a more "formalized" approach to what Eric Evan’s does in Domain Driven Design which also covers a number of other structural ideas, I hope to extend my ideas here to better integrate some of those as well.  The next steps for me are to undertake an analytic approach I also hope to explore the application of ideas like formal concept analysis and spectral graph theory techniques which may add some mathematical rigor to these ideas and perhaps allow some empirical validation and probably refinement. Now it gets really hard so it might be a while for those follow-ups.