Part one: Lucene, The Boolean Model, tf*idf, and the Vector Space Model
Lucene is built on classic IR (Information Retrieval) math concepts that are actually pretty old by the standards of our industry, with most of their roots in the 1970s and some of them, like the Boolean Model, quite a bit older. That makes them a solid quarter century older than search techniques like PageRank, and yet Lucene remains a central, almost necessary technology in many systems. The math behind searching in Lucene relies on ideas like similarity measures, which also show up in other disciplines such as Data Mining and Machine Learning. The basic ideas of Lucene and IR provide a nice introduction to these concepts, and I want to focus on the following classic IR topics:
- The Boolean Model
- The Vector Space Model
- tf*idf weighting (Term Frequency * Inverse Document Frequency)
Tokenizing Text
Lucene is built on the idea of Documents, which can be thought of as collections of terms. Think of a document as a web page or a book, and think of the terms as the words and, in some cases, phrases. Lucene uses an Analyzer to tokenize the text. For example, you may not want to include Stop Words like [the, and, to, a, ...] in your index. Indexes can also be built using the concept of N-Grams: take the expression "Information Retrieval", you might want to treat this as a single term as opposed to breaking it up. There are also Analyzers that do Stemming and even phonetic analysis. The tokenizing process is involved, but that should give you a basic idea; more can be found here. Tokenization allows the terms to be defined, identified and parsed, and it is the properties of terms in relation to documents which drive the Boolean Model, the Vector Space Model and tf*idf.
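To make the tokenizing step a bit more concrete, here is a minimal sketch in Python. It only illustrates stop word removal and simple two-word n-grams; it is not how Lucene's Analyzers are actually implemented, and the stop word list here is made up for the example.

```python
import re

# A tiny, made-up stop word list just for illustration.
STOP_WORDS = {"the", "and", "to", "a", "of", "in"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bigrams(tokens):
    """Produce two-word phrases (n-grams with n=2), e.g. 'information retrieval'."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

terms = tokenize("An Introduction to Information Retrieval and the Boolean Model")
print(terms)           # ['an', 'introduction', 'information', 'retrieval', 'boolean', 'model']
print(bigrams(terms))  # ['an introduction', 'introduction information', 'information retrieval', ...]
```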
The Boolean Model
The Boolean Model treats documents as vectors of boolean indicators. This is a brief summary of ideas discussed in chapter one of Introduction to Information Retrieval, and I am borrowing part of the example from that chapter:
|           | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|-----------|----------------------|---------------|-------------|--------|---------|---------|
| Antony    | 1 | 1 | 0 | 0 | 0 | 1 |
| Brutus    | 1 | 1 | 0 | 1 | 0 | 0 |
| Caesar    | 1 | 1 | 0 | 1 | 1 | 1 |
| Calpurnia | 0 | 1 | 0 | 0 | 0 | 0 |
| Cleopatra | 1 | 0 | 0 | 0 | 0 | 0 |
This is an incidence matrix that shows which terms are in which documents. It can be viewed either as a set of columns, binary document vectors indicating which terms occur in each document, or as a set of rows, binary term vectors indicating which documents contain a given term. This example is also a bit deceiving in that, in practice, these vectors will be pretty sparse; the document vectors (columns) for The Tempest and Othello and the term vectors (rows) for Cleopatra and Calpurnia are more indicative of this sparseness, even at this small scale. Whether we store and search this data as a full matrix or as individual vectors, it would require a lot of space and that space would be mostly zeros, so a better structure is used to store and search it: the inverted index, the details of which are covered in chapter one of Introduction to Information Retrieval and Appendix B of Lucene in Action.
The queries on this structure take the form of standard Boolean logic including: Logical Conjunction, Logical Disjunction and Logical Negation. An example is (again taken from Introduction to Information Retrieval with some notational modification):
To answer the query [Brutus AND Caesar AND NOT Calpurnia] we create a query vector (q): we take the Boolean term vectors for Brutus (b), Caesar (cs) and Calpurnia (cl), complement the last, and then AND them together, where all of the Boolean vector operations are bitwise. This can be denoted as follows:

q = b ∧ cs ∧ ¬cl = 110100 ∧ 110111 ∧ 101111 = 100100

The resulting vector for this query, 100100, selects [Antony and Cleopatra, Hamlet], vector positions one and four.
In the above example you can see how using the term (row) vectors demonstrates the inverted index structure at work: we are indexing into the corpus based on the terms, not on the documents. Where a document is usually a collection of terms (a column), we have inverted this document/term relationship and are now using the term/document relationship (a row) to link to the documents. The inverted index structure in Lucene is built by listing terms and linking each to its documents; in Lucene each document is given an integer document id, so each term has a list of the document ids that contain that term. The inverted index structure can also include other information, such as the position of the term in the document or other metadata.
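As a rough sketch of the concept (Lucene's actual index structures are far more involved), here is the toy corpus above turned into an inverted index in Python, with the Brutus AND Caesar AND NOT Calpurnia query answered using set operations in place of the bitwise operations on the incidence vectors:

```python
# Toy corpus: document id -> the terms it contains (mirroring the incidence matrix above).
docs = {
    1: {"antony", "brutus", "caesar", "cleopatra"},   # Antony and Cleopatra
    2: {"antony", "brutus", "caesar", "calpurnia"},   # Julius Caesar
    3: set(),                                         # The Tempest
    4: {"brutus", "caesar"},                          # Hamlet
    5: {"caesar"},                                    # Othello
    6: {"antony", "caesar"},                          # Macbeth
}

# Invert the document/term relationship: term -> set of document ids (a postings list).
inverted_index = {}
for doc_id, terms in docs.items():
    for term in terms:
        inverted_index.setdefault(term, set()).add(doc_id)

# Brutus AND Caesar AND NOT Calpurnia, with set intersection and difference
# playing the role of the bitwise operations.
all_docs = set(docs)
result = (inverted_index["brutus"]
          & inverted_index["caesar"]
          & (all_docs - inverted_index["calpurnia"]))
print(sorted(result))  # [1, 4] -> Antony and Cleopatra, Hamlet
```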
The inverted index is the same concept as a concordance, a list of words and their positions in a work, used by scholars. A famous example is the one that was published for the Dead Sea Scrolls by the team who exclusively controlled them; they published images of the scrolls at a very slow pace, which left many unpublished and created a researcher monopoly on the scripts, upsetting many researchers who did not have access. In 1988 Ben Zion Wacholder used a computer to reconstruct 17 documents from the concordance, and these were published in 1991. Coincidentally, that same month a complete set of facsimiles was discovered at the Huntington Library and published, breaking the monopoly on the scrolls. They are now available free online, as they should be.
tf*idf Weighting (Term Frequency*Inverse Document Frequency)
tf*idf stands for Term Frequency – Inverse Document Frequency, which is basically the Term Frequency multiplied by the Inverse Document Frequency. In 1972 Karen Sparck Jones published a paper proposing what has become known as Inverse Document Frequency (idf), which is defined as:

idf(t) = log( |D| / |dt| )
Where |D| is the size of the set of all documents and |dt| is the number of documents that contain the term. This formula is effectively the log of the inverse of the estimated (uniform) probability that a term will occur in a document, which would be given by the formula:

P(t) = |dt| / |D|
This can be rewritten as:

idf(t) = -log( P(t) )
This uses the logarithmic identity below; expressing idf as the log of an inverse probability is needed since the log of a probability, which is less than or equal to one, will always be either negative or zero:

log( 1/x ) = -log( x )
The interesting thing about using logs is that it makes the idf values addable. For example, if we want to compute the Inverse Document Frequency for term one and (∧) term two, assuming they are statistically independent, which means the probability is multiplicative, P(x ∧ y) = P(x)P(y):

idf(t1 ∧ t2) = log( 1 / (P(t1)P(t2)) ) = log( 1/P(t1) ) + log( 1/P(t2) ) = idf(t1) + idf(t2)
The additive property is due to another logarithmic identity, which also forms a group isomorphism between multiplication of positive real numbers and addition of real numbers:

log( x * y ) = log( x ) + log( y )
Now, to get tf*idf we multiply the Term Frequency, the number of times the term occurs in the document, by the Inverse Document Frequency; the complete formula is:

tf*idf(t, d) = tf(t, d) * log( |D| / |dt| )
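In code the weighting is only a couple of lines; here is a quick Python sketch, using the natural log, which is what reproduces the numbers in the tables further below:

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency: log( |D| / |dt| )."""
    return math.log(total_docs / docs_with_term)

def tf_idf(term_freq, total_docs, docs_with_term):
    """Term frequency multiplied by inverse document frequency."""
    return term_freq * idf(total_docs, docs_with_term)

# Antony appears in 3 of the 6 plays, and 157 times in Antony and Cleopatra:
print(idf(6, 3))          # 0.6931471805599453
print(tf_idf(157, 6, 3))  # 108.8241073...
```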
When I learned about this I immediately thought of the entropy formula from Information Theory:

H(X) = -Σ P(x) log( P(x) )
It turns out that this connection has been thought of and debated and is not really correct, a topic that is fairly involved; you can read more about it in this paper by Stephen Robertson. The ideas above about the additive logarithmic properties of idf are taken from that paper.
Basically tf*idf is a term weighting mechanism which maps documents and terms to real numbers and we need this for the next piece of the puzzle.
The Vector Space Model
One problem with the Boolean Model is that it either gives too many results or too few. In the example above a Shakespearean "micro-corpus" was used to demonstrate how it works, but against a real corpus a query like that could potentially retrieve thousands of documents, which would make it impractical to wade through the result set. This is where a weighting mechanism like tf*idf comes in: it effectively maps the boolean values to values in the domain of Real Numbers, well technically floating point numbers. The Vector Space Model uses the columns of the incidence matrix above, the vectors of terms for each document, where each term position in the vector is calculated using the tf*idf weighting scheme. These document vectors exist in a vector space, and a query can be constructed as a vector as well; each document can then be compared to the query using the angle between the query vector and the document vector (α, θ) to determine which documents (d1, d2) best answer the query (q), as shown visually above. Also note that this is a simple two dimensional example; in reality there is a dimension for each additional term, which yields a high-dimensional vector space consisting of hundreds or even thousands of dimensions.
It is not the angle between the vectors that is calculated but the cosine of the angle, known as the Cosine Similarity; it is used because it is easier to calculate than the angle itself. The formula is:

cos θ = ( q · d ) / ( ||q|| ||d|| )
The numerator is the Dot Product of the two vectors, a query vector and a document vector, and the denominator is the product of the Euclidean Norms (Magnitudes) of the two vectors. The above equation can be written as follows with the Dot Product and Euclidean Norm calculations expanded:

cos θ = ( Σi qi di ) / ( √(Σi qi²) × √(Σi di²) )
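Here is a minimal Python version of the expanded formula, just to make the arithmetic explicit; the vectors are plain dense lists of weights, whereas a real engine would work with sparse representations:

```python
import math

def cosine_similarity(q, d):
    """Dot product of q and d divided by the product of their Euclidean norms."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

# Two toy vectors: same direction gives 1.0, orthogonal vectors give 0.0.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0 (within floating point error)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```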
tf*idf is designed to increase as the number of occurrences of the term in a given document increases. It should also be noted that the Vector Space Model can be built with term weighting mechanisms other than tf*idf. More information can be found in "Cosine Similarity and Term Weight Tutorial", in chapter six of Introduction to Information Retrieval, and in Lucene specific information on scoring.
Extending our example from above, we can calculate our weights, starting with the idf values (using the natural log):
| Term      | \|D\| | \|dt\| | idf        |
|-----------|-------|--------|------------|
| Antony    | 6     | 3      | 0.69314718 |
| Brutus    | 6     | 3      | 0.69314718 |
| Caesar    | 6     | 5      | 0.18232155 |
| Calpurnia | 6     | 1      | 1.79175946 |
| Cleopatra | 6     | 1      | 1.79175946 |
From a Shakespearean concordance we can get the number of occurrences for each term:
|           | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|-----------|----------------------|---------------|-------------|--------|---------|---------|
| Antony    | 157 | 61  | 0 | 0 | 0 | 1 |
| Brutus    | 3   | 112 | 0 | 1 | 0 | 0 |
| Caesar    | 159 | 145 | 0 | 2 | 1 | 1 |
| Calpurnia | 0   | 10  | 0 | 0 | 0 | 0 |
| Cleopatra | 56  | 0   | 0 | 0 | 0 | 0 |
Using the idf values and term frequencies we can calculate tf*idf for our documents:
|           | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet     | Othello    | Macbeth    |
|-----------|----------------------|---------------|-------------|------------|------------|------------|
| Antony    | 108.82410 | 42.281978 | 0 | 0          | 0          | 0.69314718 |
| Brutus    | 2.0794415 | 77.632484 | 0 | 0.69314718 | 0          | 0          |
| Caesar    | 28.989127 | 26.436625 | 0 | 0.36464311 | 0.18232155 | 0.18232155 |
| Calpurnia | 0         | 17.917594 | 0 | 0          | 0          | 0          |
| Cleopatra | 100.33853 | 0         | 0 | 0          | 0          | 0          |
I will leave calculating a query as an exercise for you, if you are feeling so inclined or masochistic.
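That said, if you would like to see roughly how such a calculation might go, here is one possible sketch in Python using the tf*idf vectors from the table above. The query weighting convention here, giving each query term its idf weight and every other term zero, is just one common textbook choice and an assumption on my part, not necessarily what Lucene does:

```python
import math

# tf*idf document vectors taken from the table above.
# Term order: [antony, brutus, caesar, calpurnia, cleopatra]
doc_vectors = {
    "Antony and Cleopatra": [108.82410, 2.0794415, 28.989127, 0.0, 100.33853],
    "Julius Caesar":        [42.281978, 77.632484, 26.436625, 17.917594, 0.0],
    "The Tempest":          [0.0, 0.0, 0.0, 0.0, 0.0],
    "Hamlet":               [0.0, 0.69314718, 0.36464311, 0.0, 0.0],
    "Othello":              [0.0, 0.0, 0.18232155, 0.0, 0.0],
    "Macbeth":              [0.69314718, 0.0, 0.18232155, 0.0, 0.0],
}

# A query for "Brutus Caesar": each query term is given its idf weight (an assumption).
query = [0.0, 0.69314718, 0.18232155, 0.0, 0.0]

def cosine_similarity(q, d):
    """Same calculation as the earlier sketch, guarding against all-zero vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norms = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norms if norms else 0.0

scores = {name: cosine_similarity(query, d) for name, d in doc_vectors.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.4f}  {name}")
# Hamlet scores highest (~0.97), followed by Julius Caesar (~0.87).
```

Notice that Hamlet, which contains little besides the query terms, ends up with the highest cosine score even though its raw term counts are tiny; that is the length normalization in the denominator at work.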
Lucene uses a combination of the Boolean Model and the Vector Space Model: when a query is run, the documents are first "approved", i.e. selected, by the Boolean Model and then scored by the Vector Space Model. I believe this approach offers two advantages. First, it reduces the number of calculations, since you are not really searching the whole vector space but limiting your calculations to a subset of it. Secondly, the Vector Space Model does not handle negation in queries, and I am not sure how negation is handled in Lucene in regards to scoring. Actually, a number of IR tweaks and improvements for flexibility and efficiency have been incorporated into Lucene, and I confess I have only scratched the surface in both my coverage here and my current knowledge.