
28 May 2012

The Math of Search and Similarity


Part one: Lucene, The Boolean Model, tf*idf, and the Vector Space Model


Lucene is built on classic IR (Information Retrieval) math concepts that are actually pretty old by current standards in our industry, with most of its roots in the 1970s and some of them, like the Boolean Model, being quite a bit older. Still, that's a solid quarter century older than search techniques like PageRank, and yet it remains a central and almost necessary technology in many systems.  The mathematics of searching in Lucene relies on ideas like similarity measures, which also relate to other disciplines used in Data Mining and Machine Learning.  The basic ideas of Lucene and IR provide a nice introduction to these ideas, and I want to focus on the following classic IR topics:

Tokenizing Text


Lucene is built on the idea of Documents, which can be thought of as collections of terms.  Think of a document as a web page or book, and think of the terms as the words and in some cases phrases.  Lucene uses an Analyzer to tokenize the text; for example, you may not want to include Stop Words like [the, and, to, a, ...] in your index.  Indexes can also be built using the concept of N-Grams: take the expression "Information Retrieval", you might want to treat this as a single term as opposed to breaking it up.  There are also Analyzers that do Stemming and even phonetic analysis.  The tokenizing process is involved, but that should give you a basic idea; more can be found here.  Tokenization allows the terms to be defined, identified, and parsed, and it is the properties of terms in relation to the documents which drive the Boolean Model, the Vector Space Model, and tf*idf.
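To make this concrete, here is a minimal sketch in Python of the kind of work an Analyzer does; the stop list and the bigram handling are toy choices of mine, not Lucene's actual defaults:

```python
import re

STOP_WORDS = {"the", "and", "to", "a", "of"}  # a toy stop list

def tokenize(text):
    # lowercase, split on non-word characters, drop stop words
    words = [w for w in re.split(r"\W+", text.lower()) if w]
    return [w for w in words if w not in STOP_WORDS]

def ngrams(tokens, n=2):
    # group adjacent tokens, e.g. treat "information retrieval" as one term
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("An Introduction to Information Retrieval")
print(tokens)          # ['an', 'introduction', 'information', 'retrieval']
print(ngrams(tokens))  # ['an introduction', 'introduction information', ...]
```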

The Boolean Model


The Boolean Model treats documents as vectors of Boolean indicators. This is a brief summary of ideas discussed in chapter one of Introduction to Information Retrieval; I am borrowing part of the example from that chapter:

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony               1                   1              0           0        0         1
Brutus               1                   1              0           1        0         0
Caesar               1                   1              0           1        1         1
Calpurnia            0                   1              0           0        0         0
Cleopatra            1                   0              0           0        0         0

This is an incidence matrix that shows which terms are in which documents.  This representation can be viewed either as a set of columns, binary vectors indicating which terms occur in each document, or as a set of binary term vectors (rows) showing which documents contain a given term.  This example is also a bit deceiving in that these vectors will generally be pretty sparse; The Tempest and Othello document vectors (columns) and the Cleopatra and Calpurnia term vectors (rows) are more indicative of this sparseness at this small scale.  Whether we store and search this data as a full matrix or as individual vectors, it would require a lot of space and that space would be mostly zeros, so a better structure is used to store and search it: the inverted index, the details of which are covered in chapter one of Introduction to Information Retrieval and Appendix B of Lucene in Action.

The queries on this structure take the form of standard Boolean logic: Logical Conjunction, Logical Disjunction, and Logical Negation. An example (again taken from Introduction to Information Retrieval, with some notational modification):

To answer the query [Brutus AND Caesar AND NOT Calpurnia] we create a vector (q): we take the Boolean term vectors for Brutus (b), Caesar (cs), and Calpurnia (cl), complement the last, and then do a bitwise AND, where all the Boolean vector operations are bitwise.  This can be denoted as follows:

$$q = b \land cs \land \lnot cl = 110100 \land 110111 \land \lnot 010000 = 110100 \land 110111 \land 101111 = 100100$$
The resulting vector for this query yields [Antony and Cleopatra, Hamlet]: vector positions one and four.

In the above example you can see how using the term vectors (the rows of the matrix) shows the way the inverted index structure works: here we are indexing into the corpus based on the terms, not on the documents. Where a document is usually a collection of terms (a column), we have inverted this document/term relationship and are now using the term/document relationship to link to the documents. The inverted index structure in Lucene is built by listing terms and then linking them to documents; in Lucene each document is given an integer document id, so each term has a list of ids for the documents that contain the term. The inverted index structure can also include other information, such as positions within the document or other metadata.
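Here is a minimal sketch in Python of the inverted index idea, with sets of document ids as the postings lists; the Boolean query from above then becomes plain set operations (the toy corpus is three of the plays, with ids matching the incidence matrix column positions: 1 = Antony and Cleopatra, 2 = Julius Caesar, 4 = Hamlet):

```python
from collections import defaultdict

docs = {
    1: "antony brutus caesar cleopatra",
    2: "antony brutus caesar calpurnia",
    4: "brutus caesar",
}

# build the inverted index: term -> set of doc ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Brutus AND Caesar AND NOT Calpurnia:
# intersection for AND, set difference for NOT
all_ids = set(docs)
result = index["brutus"] & index["caesar"] & (all_ids - index["calpurnia"])
print(sorted(result))  # [1, 4] -> Antony and Cleopatra, Hamlet
```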

The inverted index is the same concept as a concordance, a list of words and their positions in a work, used by scholars. A famous example is the one published for the Dead Sea Scrolls by the team that exclusively controlled them; they published images of the scrolls at a very slow pace, which left many unpublished and created a researcher monopoly on the scripts, upsetting the many researchers who did not have access. In 1988 Ben Zion Wacholder used a computer to reconstruct 17 documents from the concordance, and these were published in 1991.  Coincidentally, the same month a complete set of facsimiles was discovered at the Huntington Library and published, breaking the monopoly on the scrolls; they are now available free online, as they should be.

tf*idf Weighting (Term Frequency*Inverse Document Frequency)


tf*idf stands for Term Frequency * Inverse Document Frequency, which is just what it says: the Term Frequency multiplied by the Inverse Document Frequency.  In 1972 Karen Spärck Jones published a paper proposing what has become known as Inverse Document Frequency (idf); it is defined as:

$$idf_t = \log\frac{|D|}{|d_t|}$$
Where |D| is the size of the set of all documents and |dt| is the number of documents that contain the term. This formula is effectively the log of the inverse of the estimated (uniform) probability that a term will occur in a document, which would be given by the formula:

$$P(t) = \frac{|d_t|}{|D|}$$
This can be rewritten as:

$$idf_t = \log\frac{1}{P(t)} = -\log P(t)$$
Using the logarithmic identity below; this is needed since the log of a probability, which is less than or equal to one, will always be either negative or zero:

$$\log\frac{1}{x} = -\log x$$
The interesting thing about using logs is that it makes the idf terms addable. For example, suppose we want to compute the Inverse Document Frequency for term one and (∧) term two; assuming they are statistically independent, which means the probability is multiplicative, P(x ∧ y) = P(x)P(y), we get:

$$idf_{t_1 \land t_2} = \log\frac{1}{P(t_1)P(t_2)} = \log\frac{1}{P(t_1)} + \log\frac{1}{P(t_2)} = idf_{t_1} + idf_{t_2}$$
The additive property is due to another logarithmic identity, which also forms a group isomorphism between multiplication of positive real numbers and addition of real numbers:

$$\log(xy) = \log x + \log y$$
Now to get tf*idf we multiply the Term Frequency, the number of times the term occurs in the document, by the Inverse Document Frequency; the complete formula is:

$$tfidf_{t,d} = tf_{t,d} \times \log\frac{|D|}{|d_t|}$$
When I learned about this I immediately thought of the entropy formula from Information Theory:

$$H(X) = -\sum_{i} P(x_i) \log P(x_i)$$
It turns out that this connection has been thought of and debated and is not really correct, a topic that is fairly involved; you can read more about it in this paper by Stephen Robertson. The ideas above about the logarithmic properties of addition in idf are taken from that paper.

Basically tf*idf is a term weighting mechanism which maps documents and terms to real numbers, and we need this for the next piece of the puzzle.



The Vector Space Model

One problem with the Boolean model is that it either gives too many results or too few.  In the example above a Shakespearean "micro-corpus" was used to demonstrate how it works, but in a real corpus a query like that could potentially retrieve thousands of documents, which would make it impractical to wade through the result set.  This is where a weighting mechanism like tf*idf comes in: it effectively maps the Boolean values to values in the domain of Real Numbers, well, technically floating point numbers.  The Vector Space Model uses the columns of the incidence matrix above, the vectors of terms for each document, where each term position in the vector is calculated using the tf*idf weighting scheme.  These document vectors exist in a Vector Space, and a query can be constructed as a vector as well; each document can then be compared to the query vector using the angles between the query vector (q) and the document vectors (d1, d2), the angles (α, θ), to determine which documents best answer the query.  Also note that this is a simple two-dimensional example; in reality there will be a dimension for each additional term, which becomes a hyperdimensional vector space consisting of hundreds or even thousands of dimensions.

It is not the angle between vectors that is calculated but the cosine of the angle, which is known as the Cosine Similarity; this is used because it is easier to calculate than the angle itself. The formula is:

$$\cos\theta = \frac{q \cdot d}{\|q\|\,\|d\|}$$
The numerator is the Dot Product of the two vectors, a query vector and a document vector; the denominator is the product of the Euclidean Norms (Magnitudes) of the two vectors.  The above equation can be written as follows with the Dot Product and Euclidean Norm calculations expanded:

$$\cos\theta = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}$$
tf*idf is designed to increase as the number of occurrences of the term increases in a given document.  It should also be noted that the Vector Space Model can be built with term weighting mechanisms other than tf*idf.  More information can be found in "Cosine Similarity and Term Weight Tutorial", in chapter six of Introduction to Information Retrieval, and in the Lucene-specific information on scoring.
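As a sketch, the cosine similarity calculation is only a few lines of plain Python (no libraries), matching the expanded formula above:

```python
import math

def cosine_similarity(q, d):
    # dot product over the product of the Euclidean norms
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

print(cosine_similarity([1.0, 0.5, 0.0], [2.0, 1.0, 0.0]))  # 1.0, same direction
```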

Extending our example from above we can calculate our weights, first the idf values:

            |D|   |d_t|   idf
Antony       6      3     0.69314718
Brutus       6      3     0.69314718
Caesar       6      5     0.18232155
Calpurnia    6      1     1.79175946
Cleopatra    6      1     1.79175946



From a Shakespearean concordance we can get the number of occurrences for each term:

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony              157                  61             0           0        0         1
Brutus                3                 112             0           1        0         0
Caesar              159                 145             0           2        1         1
Calpurnia             0                  10             0           0        0         0
Cleopatra            56                   0             0           0        0         0



Using the idf values and term frequencies we can calculate tf*idf for our documents:

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet       Othello      Macbeth
Antony           108.82410           42.281978          0        0            0            0.69314718
Brutus           2.0794415           77.632484          0        0.69314718   0            0
Caesar           28.989127           26.436625          0        0.36464311   0.18232155   0.18232155
Calpurnia        0                   17.917594          0        0            0            0
Cleopatra        100.33853           0                  0        0            0            0
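These tables are easy to reproduce with a few lines of Python; note the use of the natural log, which is what gives the 0.69314718 value for ln(6/3) (the term frequencies here are the Antony and Cleopatra column):

```python
import math

N = 6  # total number of documents
doc_freq = {"Antony": 3, "Brutus": 3, "Caesar": 5, "Calpurnia": 1, "Cleopatra": 1}
tf = {"Antony": 157, "Brutus": 3, "Caesar": 159, "Calpurnia": 0, "Cleopatra": 56}

for term in doc_freq:
    idf = math.log(N / doc_freq[term])
    print(term, round(idf, 8), round(tf[term] * idf, 5))
# Antony 0.69314718 108.82411 ... (matches the tables up to rounding)
```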


I will leave calculating a query as an exercise for you, if you are feeling so inclined or masochistic.

Lucene uses a combination of the Boolean Model and the Vector Space Model: when a query is run, the documents are first "approved" (selected) by the Boolean Model and then scored by the Vector Space Model.  I believe this approach offers two advantages.  First, it reduces the number of calculations, since you aren't really searching the whole Vector Space but limiting your calculations to a subset.  Second, the Vector Space Model does not handle negation in queries, and I am not sure how negation is handled in Lucene in regards to scoring.  Actually, a number of IR tweaks and improvements for flexibility and efficiency have been incorporated into Lucene, and I confess I have only scratched the surface in both my coverage here and my current knowledge.

29 January 2012

Abstract Algebra: The Logic of Odd and Even




Steven Strogatz wrote a series of math related articles on the New York Times website, one of which is called "Rock Groups", where he looks at adding integers visually, with respect to odd and even, ventures into the territory of triangles and squares, and visually portrays the closed formula for computing triangular numbers. My previous post on triangular numbers in some ways parallels his "Rock Groups" article, at least in respect to the visual portrayal of triangular numbers.  This post also parallels ideas in that article, but in contrast to his geometric observations, these are algebraic observations about, among other things, adding and multiplying even and odd numbers, which work as follows:



+                        *
Even + Even = Even       Even * Even = Even
Odd + Even = Odd         Odd * Even = Even
Odd + Odd = Even         Odd * Odd = Odd




Now this doesn't really give one a sense of what is going on here. However, by using a Cayley Table, which can also be called an operation table in that it shows the details of an operator, it becomes more interesting: we now have a better depiction of the Group for Addition (+) with Even as the Identity Element and the Monoid for Multiplication (*) with Odd as the Identity Element:



+      Even   Odd
Even   Even   Odd
Odd    Odd    Even




*      Even   Odd
Even   Even   Even
Odd    Even   Odd



Now that still may not be very intuitive; however, if you convert Even to Zero (0) and Odd to One (1), you get the following, possibly more familiar, Group with 0 as the identity and Monoid with 1 as the identity:



+   0   1
0   0   1
1   1   0




*   0   1
0   0   0
1   0   1



In the Group, 1 is its own inverse: 1 + 1 = 0.  In general these "types" of algebraic structures consist of a binary operator and a set that is closed under the operation, so I am using the notation (Operator, {Set}) to denote them; since the sets are small I can just list the elements. You could also write (+, Z), where Z is the set of integers, and in fact this is a well known group, addition on the set of integers, although that set is infinite in contrast to our finite sets.  We can describe the above as the Group (+, Z2) and the Monoid (*, Z2), where Z2 is the set {0,1} of the Integers Modulo 2, or simply the Binary digits, which are Base 2.


Now let's do another transformation on our Monoid and Group: we will map Zero (0) to False (F) and One (1) to True (T).  We will also change our operators: + now becomes Exclusive-Or (Xor) and * now becomes Logical Conjunction, aka Logical And.



Xor   F   T
F     F   T
T     T   F




And   F   T
F     F   F
T     F   T



All of these transformations are actually Isomorphisms, and the two structures combine to form a Ring, which is also a Field: the smallest Finite Field, GF(2). I have a follow-up planned for this post where I will get into these and other ideas in more detail.
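A quick Python check that Xor and And on {0, 1} really do behave like addition and multiplication mod 2, which is the isomorphism being claimed:

```python
for a in (0, 1):
    for b in (0, 1):
        assert a ^ b == (a + b) % 2   # Xor is addition mod 2
        assert a & b == (a * b) % 2   # And is multiplication mod 2
print("Xor/And on {0,1} match arithmetic mod 2")
```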


Some basic algebraic structures are:


Magma or Groupoid – A closed binary operation on a set.


Semigroup – A closed binary operation on a set, with associativity. (An associative Groupoid)


Monoid – A closed binary operation on a set, with associativity and an identity element. (A Semigroup with an identity element)


Group – A closed binary operation on a set, with associativity and an identity element, where each element has an inverse element. (A Monoid with inverses)


Commutative (Abelian) – used to describe any of the above with the commutative property; these two terms are interchangeable, but the term abelian1 in abstract algebra is used mostly just for groups.




 

            Closed   Associative   Identity   Inverse
Magma         X
Semigroup     X          X
Monoid        X          X            X
Group         X          X            X          X



Now Commutative is more a property of the above structures: as we saw, the String Monoid is Non-Commutative, whereas the Group (+, Z2) and the Monoid (*, Z2) are both Commutative.


Going back to Boolean Logic: if you have had any exposure to Boolean Logic, things may now look amiss in regards to operators, such as: what about the Or operator? Actually, Xor can be defined in terms of Logical Conjunction and Logical Disjunction (along with negation) as follows:

$$x \oplus y = (x \lor y) \land \lnot(x \land y)$$
Our Cayley Table for Logical Disjunction (Or) is:



Or   F   T
F    F   T
T    T   T



Now this table is almost the same as Xor with one exception: (T Or T) = T vs. (T Xor T) = F.  We can still call F an identity since it combines with T to give T, but you can see that with that one change we lost our inverse for T, so this is now "downgraded" to Monoid status.


Now if we add Logical Conjunction (And) back, we get the familiar values in Cayley tables:



Or   F   T
F    F   T
T    T   T




And   F   T
F     F   F
T     F   T



Above we noted that the Group (+, Z2) and the Monoid (*, Z2) formed a Ring.  Since Or only forms a Monoid, these combine to form a slightly different structure, a Semiring.


Another interesting observation about these sets of operators comes from a logical standpoint. The combination of these two operators, {And, Or}, is not functionally complete, which means that there exist truth functional statements that cannot be expressed by these two operators alone. This can be fixed by adding negation, which yields a truth functionally complete set of two dyadic and one monadic2 operators: {And, Or, Not}, a pretty well known and familiar set of logical operators.  The Ring operators from above, {Xor, And}, also form a functionally complete set. It turns out that just {And, Not} and just {Or, Not} are also each functionally complete, as are the single-operator sets {Nand} and {Nor}, which is why digital logic circuits can be constructed solely with Nand Gates.
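As a quick Python check of one such construction, here is Or built out of the Ring operators {Xor, And}, using the identity a Or b = (a Xor b) Xor (a And b):

```python
for a in (False, True):
    for b in (False, True):
        # exhaustively verify the identity over all truth values
        assert (a or b) == (a ^ b) ^ (a and b)
print("Or is expressible with just Xor and And")
```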


The thing that bothered me the most was the fact that And and Xor had this nice mapping from Boolean Logic with {T,F} to Binary Integer Arithmetic on {0,1}, but Or seemed to be an oddball. Then I realized that you can remap both Logical Conjunction and Logical Disjunction to the numerical functions Min and Max, respectively:



Max(0,0) = 0
Max(0,1) = Max(1,0) = 1
Max(1,1) = 1

Min(0,0) = 0
Min(0,1) = Min(1,0) = 0
Min(1,1) = 1



In Cayley table form, which better shows the isomorphisms (≅ means is isomorphic to) from our Cayley tables above: (And, {T,F}) ≅ (Min, Z2) and (Or, {T,F}) ≅ (Max, Z2):



Max   0   1
0     0   1
1     1   1




Min   0   1
0     0   0
1     0   1



Interestingly, this mapping is effectively the definition of Fuzzy Logic restricted to the discrete case (two-valued logic).  Boolean Logic also maps to Lattices, with conjunction being meet and disjunction being join; the min and max ideas give a nice intuition in relation to ordering.


To recap, we saw a number of Group and Monoid Isomorphisms (remember Z2 = {0, 1}):



(+, Z2) ≅ (+, {Odd, Even})


 (*, Z2) ≅ (*, {Odd, Even})


(+, Z2) ≅ (Xor, {F, T})


 (*, Z2) ≅ (And, {F, T})


(Or, {F, T}) ≅ (Max, Z2)


 (And, {F, T}) ≅ (Min, Z2)



We also saw Ring Isomorphisms and a Semiring Isomorphism:



[(+, Z2), (*, Z2)] ≅ [(+, {Odd, Even}), (*, {Odd, Even})]


[(+, Z2), (*, Z2)] ≅ [(Xor, {F, T}), (And, {F, T})]


[(Or, {F, T}), (And, {F, T})] ≅ [(Max, Z2), (Min, Z2)]



I have always felt that examples are helpful, and I think these give some sense of the diversity that can be found in just a small little corner of math. In researching this I came across some interesting facts which also highlight this diversity; for example, the Nand operation is not associative but is commutative, so it's a commutative Groupoid.  The Octonions, which are hypercomplex numbers, are neither associative nor commutative under multiplication but have an identity of 1, making them a non-commutative Groupoid with an identity. So my point here is that the above table of basic algebraic structures, which makes everything look so hierarchical, is deceiving, as the diversity does not seem to be constrained in any way, as far as I can tell.


1 This term is named after the Norwegian mathematician Niels Henrik Abel, who developed Group Theory independently of Galois and, like Galois, tragically died young (under 30).


2 Dyadic means takes two operands, e.g. (And, Or), and monadic means takes one operand, in this case the negation (Not) operator; this should not be confused with other uses of the term Monad.

05 December 2011

Responding to "Eleven Equations" Comments, Math Notation, and More





Wow, that was quite an experience. I have been looking at sites like Hacker News, Reddit, and DZone for years, but to see my own post go to #1 on Hacker News and rise high on Reddit and DZone was quite a thrill, with over 70K hits at the time of this writing.  It has completely changed my life: the phone is ringing off the hook with endorsement offers, and it looks like I will only be wearing Nike for the next year.


The best part was the over 200 comments on my Blog, Hacker News, and Reddit. The post spawned some spirited discussion in all three locations, and many people added their own equations, some of which I was unfamiliar with, so I learned a few new things, excellent!  The various discussions were for the most part great and very enjoyable, and I'd like to thank everyone who contributed.  In the post I commented that I wondered how I would feel about the list in a year, and I already think I would change the list by one equation.  I was tempted to respond to the comments on all of the forums, and I knew I should at least respond on my own blog, so I figured I'd do it as a post; that way I could hit a number of topics.


One thing that people objected to was the title and the presumptuous tone it implied. One redditor summed it up as:


Someone presuming, yet again, to know what "every" Computer Scientist should know? A title like "my favourite computer science equations" might be better. This list contains a lot of ideas which are only relevant in very narrow domains.

Yes, probably, on the title comment, but the title was an intentional "spoof", if you will, of the title of an article on Wired, and if you read my post you would have gotten that.  Actually, I confess you might even accuse me of "Hacker News post popularity hacking": I saw the Wired post rise up and figured that there was a high probability that a similar CS related post could achieve a similar placement. I think it exceeded my expectations: it shot to #1 in under two hours of my submitting it and stayed on the front page for almost a day.  I felt the tone I was setting, by using the term "rip off" and the Spinal Tap joke (this one goes to eleven) that I don't think anyone got, was subtly light hearted; perhaps it was too subtle.  I recall reading that the title is key to getting people to read posts, and clearly this title was somewhat controversial, which probably helped it. Actually, most of my titles are fairly boring, or references that make them less popular, but I generally title things the way I want, not for popularity, well, with this one exception.


Another thing that people took umbrage at was the use of the equations; as I just explained, since I had set my title accordingly, my format was now set.  One commenter, in two different comments on a thread on Hacker News, added:


I think this is a pretty silly post, to be honest. CS covers so much, and everytime I see a list of "things you should know", I have to resist the urge to roll my eyes and ignore it. But then I read it, and inevitably roll my eyes anyway.


...


I disagree. Programming is math. Highly advanced math, in fact. It's just a different type of math. And the 11 equations in the OP's article just barely touches on what CS is about. There is far more to it than that.


I mostly agree with this commenter. I admit that the sentiment that this is not really about the equations but about general domain knowledge is implied, again perhaps too subtly.  At one point someone mentions memorizing these equations, which again means missing the point: it's about knowing what these equations mean and how to use them.  I feel that the math domain knowledge for our field is growing and becoming more important. Do I use any of this for my day job? Well, no, but I hope to one day.  As to the broad domain, I actually have a post "Math For Programmer TNG" that covers a very broad domain of math that I think is potentially relevant.  Also, not all areas of math are going to be used in every math related job.


As to the narrowness insinuated by the two comments above, the post touched on the following domains: Combinatorics, Logic, Boolean Algebra, Set Theory, Lattice Theory, Probability Theory, Statistics, Category Theory, Number Theory including Complex Numbers, Relational Algebra, Abstract Algebra, Linear Algebra, Shannon Information Theory, Algorithmic Information Theory, Kolmogorov complexity, Graph Theory, Formal Language Theory, Automata Theory, Combinatory Logic, Lambda Calculus, Machine Learning, Data Mining, Encryption, Limits, Topology, Fourier Analysis and The Simpsons. Not exactly narrow, or am I missing something here?  In fact the list was picked in part for diversity.


The Pumping Lemma turned out to be fairly controversial. I do admit that the Pumping Lemma was perhaps a bit "contrived" as an equation, and as one person put it: "that's the most hideous formulation of the pumping lemma I have ever seen". That formulation came from Wikipedia.  As I mentioned, I really did want something from Formal Language Theory or Automata Theory, and it's not so much about determining which language is regular as about knowing the difference between a Regular Language and a Context Free Language, so that you are not the cargo cult coder who tries to parse html with regular expressions, because you understand these distinctions. I think the following exchange summed up what I was going for:


(except the Pumping Lemma, WTF?)


Reply:"Seriously, you never learnt theoretical computer science? As in, theory of computation? Automata theory? Complexity theory? Nothing like that?"



There were a number of comments about the use of mathematical notation including the following:


"Shannon's Information Theory, Eigenvector, DeMorgan's Laws, etc. None of those names are meaningful or descriptive. And then the greek letters and made up symbols. Math could learn something from Computer Science:


"https://www.google.com/search?q=readable+code



This commenter seemed to have an issue with the actual names of things in math.  I found this to be very narrow minded. Math is many things, including a domain of knowledge; that's like complaining about the names Ytterbium in Chemistry, Bose–Einstein Condensate in Physics, Borneo in world Geography, or Impressionism in art. Really!?!


Now the complaints about the notation are more understandable but still unreasonable in my opinion. Math in some respects is a language, perhaps somewhere between a natural language and a programming language; it has a notation, and to really be able to understand it you need to learn this notation (language).  Someone remarked that the symbols were "unfriendly".  Well, I find Chinese, Arabic1, etc. to consist of unfriendly symbols, but that's because I don't know them.  Also, like natural languages, math has inconsistencies and multiple meanings for symbols.  One person complained about my notation for the Set version of De Morgan's laws. This may be a nonstandard notation; I saw it in some course notes on probability and set theory, I liked it, and I just used it without thinking about it. I do think it's more elegant.  This makes a good point about math: if you learn it and read from diverse sources you will encounter notational differences.  This has been my experience; in fact, if you look on the Wikipedia page for De Morgan's laws you will find the following example from Modal Logic:

$$\lnot\Diamond p \equiv \Box\lnot p \qquad \lnot\Box p \equiv \Diamond\lnot p$$
I am used to the notation used in Hughes and Cresswell:

$${\sim}Mp \equiv L{\sim}p \qquad {\sim}Lp \equiv M{\sim}p$$

That's just how things work.  You can get hung up on it and complain or you can accept it and move on.  Don't get me wrong, it would be nice if there were a notational standard that everyone followed for all math.


Like natural languages, math has its equivalents of "homographs"; for example, the following symbols can have multiple context dependent meanings: |b| can mean the absolute value of b if it is a number, the cardinality of b if it is a set, or the length of b if it is a string; ∂ can mean Partial Derivative or Topological Boundary; and as we saw, ∧ and ∨ can mean meet and join or "logical and" and "logical or".


Just as math has "homographs", as in natural languages it also has its equivalents of "polysemes", meaning that there are multiple ways to express the same ideas. For example, set difference can be expressed as (S – T) or (S \ T), the Powerset notation can be either 2^S or P(S), and meet and join can be expressed with either of the following pairs of symbols:

$$(\land, \lor) \quad \text{or} \quad (\sqcap, \sqcup)$$
The list of both math "Homographs" and "Polysemes" is much longer.


For De Morgan's laws, someone added the following generalization using the Existential and Universal Quantifiers, which is really interesting and also illustrates the duality:


It's not much of a generalization, but I prefer the predicate version of De Morgan's:

~∀x p(x) ≡∃x ~p(x)
~∃x p(x) ≡∀x ~p(x)


If you read the Wikipedia article it includes this generalization as well, in both quantifier and infinitary set forms:

$$\lnot\forall x\,P(x) \equiv \exists x\,\lnot P(x) \qquad \lnot\exists x\,P(x) \equiv \forall x\,\lnot P(x)$$

$$\overline{\bigcup_{i \in I} A_i} = \bigcap_{i \in I} \overline{A_i} \qquad \overline{\bigcap_{i \in I} A_i} = \bigcup_{i \in I} \overline{A_i}$$

The above equations do generalize De Morgan's laws within Logic and Set Theory respectively, but my point was to further generalize them in terms of Lattices, which is a different concept altogether and gets at the deeper intuition about the interrelation of Logic and Set Theory. Quoting from Combinatorics: The Rota Way by Kung and Rota, an amazing book I am trying to read, emphasis on the word trying:


"As John Von Neumann put it, the theory of Boolean Algebras is "pointless" set theory.

This set version would be written using the complement exponentiation notation as follows:

$$\left(\bigcup_{i \in I} A_i\right)^{c} = \bigcap_{i \in I} A_i^{c} \qquad \left(\bigcap_{i \in I} A_i\right)^{c} = \bigcup_{i \in I} A_i^{c}$$
I really do prefer this notation. I originally encountered it in "Introduction to Probability Theory" by Ali Ghodsi and Cosmin Arad, lecture 1 (pdf).  This notation is also used in Combinatorics: The Rota Way. So this notation is now the official notation of my blog, get used to it or hit the road. ;)


A commenter (who was also the De Morgan's notation complainer) responded with the following:


...

For the rest, they seem more likely to be used by someone doing something related to machine learning/AI/information theory than you run of the mill I-spend-all-day-parsing-user-input programmer.

Thank you for making me feel like I'm not a 'true' computer scientist. Next would you like to tell me about how I'm not a 'man' because I don't like to work on cars?


Well, I hate to break it to you, but if this list makes you feel that way then chances are you are like me: a software developer with a CS degree, and that does not make you or me a Computer Scientist. This is from Wikipedia:


"The general public sometimes confuses computer scientists with other computer professionals having careers in information technology, or think that computer science relates to their own experience with computers, ...

Oh, and yes you are not a man if you do not at least know how to change your oil, including the filter. ;)


Also, a number of people remarked that I did not explain the equations in enough detail and did not provide code examples.  It was meant as a high level post; if I were to explain all of these equations the post would have been excessively long, maybe even a whole book.  As to the code examples, again it would be too long. If you want code examples that relate to math, I have posts for that; I recommend the following posts on my blog, some of which were linked in the original post: "O(log(n))", "Triangles, Triangular Numbers, and the Adjacency Matrix", "The Combinatorial and Other Math of the Java Collections API", "Math You Can Use: Refactoring If Statements with De Morgan's Laws", and "Monoid for the Masses".


Regrettably, some people responded with the sentiment of "I'm not good at math" or "math is too hard, I don't think I can get a CS degree".  Fortunately there were a number of comments of encouragement to counter these sentiments; my favorite was this one:


Don't let this be intimidating to you - instead of asking yourself "how come I don't know this?" ask yourself "how can I learn more about this?". This might sound cheesy and simplified but it's as simple as "nobody was born knowing this". I'm 31 and I wish I had the money to go back in education and learn all of this with the official way but for now I'm just picking resources on the web and who knows? It just might happen...

I have a post planned to talk about my experiences and thoughts about learning math. I will say that if you are a programmer you are already doing math in a sense; you just need to formalize your understanding of the underlying concepts.  If it is any consolation, I find math to be very hard at times, and I pretty much suck at it in the traditional sense, but it is so beautiful, and when you grok a concept, at least for me, it is an incredible buzz.  Every new concept that you learn changes how you see everything. I see so much math in programming and the world now it's scary; I just wish I could figure out how to express it.  Everything is math.  Over the last few years my ability to understand things has greatly increased and continues to do so.  Yes, it's hard, but it is so worth it.


A number of commenters added their own equations to the discussion, and this was the best part, as it included a few new things for me. In general my list was mostly directed at more pure math oriented equations that I felt were fairly central to a specific discipline and encapsulated certain principles.  Here are a few recommendations and thoughts:


One person pointed out the absence of P=NP or P ≠ NP, to which I would respond "D’oh!" If I was doing this over I might even replace the Pumping Lemma with P=NP.


One poster recommended triangular numbers:


Here's a real simple and practical equation: 1 + 2 + 3 + 4 ... N = N(N+1)/2

This equation represents the number of edges on a complete graph with N+1 vertices or the number of possible pairings given N+1 objects. Useful when estimating O(N) for certain algorithms that involve comparing an item to every other item in a set.


I love Triangular Numbers and this is a great equation. I wrote a whole post, "Triangles, Triangular Numbers, and the Adjacency Matrix", on exactly what he is mentioning, and I considered that equation but rejected it because triangular numbers are sort of a special case of the Binomial Coefficient equation:

$$1 + 2 + \cdots + N = \binom{N+1}{2} = \frac{N(N+1)}{2}$$
Here are some of the other equations and theorems that were suggested, I wouldn’t really consider these because they didn’t fit my theme and criteria but that doesn’t mean they aren’t interesting:



One commenter accused me of shoehorning in Euler's Identity.  This is partially true.  He also accused me of "name dropping". That's not true.  A defender supplied the following course notes:


Analytic Combinatorics: A Primer


I followed up on that and found that you can download a copy of Analytic Combinatorics by Flajolet and Sedgewick. Warning: pretty advanced stuff.


As the popularity of my post was waning on Hacker News, a math oriented post called "Fuzzy string search" was making its way up the ranks. It's an interesting post and I think it helps illustrate a couple of points I have been making here. First of all, it uses mathematical notation, which is not uncommon when writing about these types of topics.  Additionally, the post includes the equation for the Triangle Inequality:

$$d(x, z) \leq d(x, y) + d(y, z)$$
It is an equation that crossed my mind when writing my equations post.  It is not the equation itself but the concept: if you become familiar with the idea of a Metric or Metric Spaces, you instantly recognize this equation without having to read the associated text, not that you shouldn't read the associated text.  Knowing these ideas gives you the ability to better understand more complex ideas that are based on these concepts. I think this is the case in all knowledge domains.  Also, I noticed that there was not one comment on that post complaining about math notation.


A lot of what went on is part of an ongoing conversation; I blogged about it in "The Math Debate", which links to other famous blogs of the same ilk.  Ultimately the equations probably don't apply to most programming jobs; the title, as some commenters pointed out, contained "Computer Science Geeks", not "Corporate Software Developers".  A few people castigated the commenters who seemed to be advocating positions of ignorance, thanks, BTW.  I personally find it tragic; the following comment sums this up:


"Someone with a CS degree who knows nothing of De Morgan's law should have their credit removed and be demoted to making websites or performing tech support tasks.


Reply: "There goes 98% of your CS graduates then. I wish I was joking, but alas.."



Of course it's a bit ironic that as I am wrapping this up, "A co-Relational Model of Data for Large Shared Data Banks" by Erik Meijer and Gavin Bierman just popped up on Hacker News; the relevant quote is:


Every programmer is familiar with the two dual forms of De Morgan’s laws

The writing quality of the post was criticized. I don't know what people will think of this post; I feel like I just wrote a clip show. I admit that the previous post was not my best effort. While it was fun putting it together, it was not easy; I had hoped to learn all of the equations in more detail, and I did to some degree, though I confess I still don't fully get the Y-Combinator, but I will.  I finally decided to just push it out and move on.  That post, while being relevant to my mission of blogging about math, was something of a social experiment in popularity. It was interesting, but I will be returning to the true mission of my blog, which is my journey through math and the pursuit of a real Software Engineering discipline.


Ultimately I am interested in the readers, like a young colleague of mine who was too busy to comment because he was following the links to learn new things he didn't know, not the complainers and naysayers. There were many positive comments and again, thanks. I did feel a need to refute some of the negative comments. The bottom line is if you want to read about math, you need to get used to the notation. It probably won't help you to build CRUD web apps, but it will hopefully help you take your career to the next level.


 


1 Actually this is not true; I find them to be quite beautiful, especially Arabic.

27 November 2011

Eleven Equations True Computer Science Geeks Should (at Least Pretend to) Know




This idea is a complete rip off of an article that appeared in Wired a little while ago, and it got me thinking: what would my list for Computer Science look like?  Plus I thought it might be a fun post, and unlike the Wired list this one goes to eleven.  So here they are, in no particular order:



Binomial Coefficient

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

The Binomial Coefficient equation generates Pascal's Triangle and gives you the coefficients for the Binomial Theorem. These ideas are often attributed to Pascal, but in fact they have been known in part for over a millennium.


As I mentioned, this list is in no particular order and I don't wish to play favorites, but if there is one equation here that you should really consider learning and committing to memory, this is it. It is central to Combinatorics, which has connections to many areas of math. I guess I should qualify that: if you are serious about computer science and math related programming topics then you should be able to recite and apply it from memory, even when you are drunk, sleep deprived, or being held at gunpoint. Ok, that might be a bit much, but you get the idea.  If you are a programmer and you haven't learned it, or need a refresher, I have a post that relates it to the Java Collections API.
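As a quick sketch, the equation translates directly into Python via the factorial form (math.comb does the same thing in recent Python versions):

```python
from math import factorial

def binomial(n, k):
    # n! / (k! * (n - k)!)
    return factorial(n) // (factorial(k) * factorial(n - k))

print(binomial(5, 2))                       # 10, the ways to choose 2 of 5
print([binomial(4, k) for k in range(5)])   # row 4 of Pascal's Triangle: [1, 4, 6, 4, 1]
```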



De Morgan's Laws

Logical form:

$${\sim}(P \land Q) \equiv {\sim}P \lor {\sim}Q \qquad {\sim}(P \lor Q) \equiv {\sim}P \land {\sim}Q$$

Set form:

$$(A \cap B)^{c} = A^{c} \cup B^{c} \qquad (A \cup B)^{c} = A^{c} \cap B^{c}$$

Ok, so that's like four "equations" for De Morgan's Laws. One can't help but be struck by the similarity between the two sets, and this is not a coincidence: these two sets of formulas are essentially the same thing, special cases of complemented distributive lattices, which means technically it's really just two formulas:

$$\lnot(x \lor y) = \lnot x \land \lnot y \qquad \lnot(x \land y) = \lnot x \lor \lnot y$$

In this case the ∨ symbol means the lattice join operation, the ∧ symbol means the lattice meet operation, and ¬ means lattice complementation; I used this to differentiate from the tilde (~) for Boolean complementation.  Also, these two formulas are categorical duals of each other, which means that if one were smart and had a good grasp of Category Theory one could probably be really OCD about it and get it down to one formula.


Understanding Set Theory and Boolean Algebra are very important basic concepts in our profession, and the Boolean version can make you a better programmer, as you can use it to refactor logical if statements.
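Here is a minimal illustration of that refactoring in Python; the function and its conditions are hypothetical, but the two forms are equivalent by De Morgan's laws:

```python
def should_retry(connected, authenticated):
    # original condition:  not (connected and authenticated)
    # De Morgan refactor:  (not connected) or (not authenticated)
    assert (not (connected and authenticated)) == \
           ((not connected) or (not authenticated))
    return (not connected) or (not authenticated)

print(should_retry(True, False))  # True
```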


Eigenvector and Eigenvalue

$$A\vec{v} = \lambda\vec{v}$$

This equation and these concepts are not only central to Linear Algebra; these Linear Algebraic ideas are used in and extend into many other areas of math, including Linear Transformations and Graphs in terms of matrices like the Adjacency Matrix, and much more.


Linear Algebra is becoming increasingly central to the types of things we are doing in our profession. It is used heavily in Statistics and Machine Learning, not to mention that the PageRank Algorithm is based in part on this equation.
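A short sketch using NumPy: find the eigenvalues and eigenvectors of a small symmetric matrix, then check that Av = λv holds for the first pair:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
v = eigenvectors[:, 0]   # columns of the result are the eigenvectors
lam = eigenvalues[0]

print(np.allclose(A @ v, lam * v))  # True: A v == lambda v
```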



Pumping Lemma for Regular Languages

$$(\forall L \subseteq \Sigma^{*})\,\Bigl(\mathrm{regular}(L) \Rightarrow (\exists p \geq 1)(\forall w \in L)\bigl(|w| \geq p \Rightarrow (\exists x, y, z \in \Sigma^{*})(w = xyz \land |y| \geq 1 \land |xy| \leq p \land (\forall n \geq 0)(xy^{n}z \in L))\bigr)\Bigr)$$

No Comp Sci list would be complete without at least one formula from either Formal Language Theory or Automata Theory.  The problem is that these two are pretty different from other areas of math in terms of "equations". I tried to think of some basic ideas, the Pumping Lemma came to mind, and I found the above formula concisely expressed on Wikipedia.  This version of the Pumping Lemma is a more limited version of another theorem which is used to check whether a Language is Regular. Also, this is maybe not the best choice as it is pretty dense; if you want a better explanation I recommend this.


While this "equation" is pretty specialized, and is actually a special case of the more general Iteration Theorem, the real point here is about more general domain knowledge.  It is central to things you use every day you are programming, such as compilers and regular expressions, not to mention the almost ubiquitous Kleene Star.


Information Entropy

I confess, after all my machinations to try to reduce De Morgan's Laws down to fewer equations, I am totally cheating here by pulling a twofer and including two formulas. The first is Shannon's Information Entropy:

$$H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)$$

And the second is the formula for Chaitin's Constant, which is associated with Algorithmic Information Theory and Kolmogorov Complexity:

$$\Omega = \sum_{p \,\in\, P_{halt}} 2^{-|p|}$$
Interestingly, this area has a parallel in the physical world in terms of Thermodynamic Entropy, which also parallels the Boltzmann Equation mentioned in the Wired article.  Seth Lloyd draws some interesting connections between computation and the physical world, including Entropy and Boltzmann's Equation.


In programming and computer science, information is our business, and these ideas are central to many things we deal with, especially the storage, transmission, and computational transformation (like compression) of information.  We've already seen the possible relationship between both of these theories and Reusability and Software Frameworks.
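A sketch of Shannon's entropy formula in Python; a fair coin comes out to exactly one bit:

```python
import math

def entropy(probs):
    # H(X) = -sum p * log2(p), skipping zero-probability outcomes
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit for a fair coin
print(entropy([0.9, 0.1]))  # ~0.469 bits, less uncertainty
```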



Bayes' Theorem

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Of all of equations on the list, Bayes' Theorem may stand out on its own merits as being one of the most recognizable and it might even qualify as a poster child for the statistical revolution sweeping our industry.  It rose from obscurity to being used for spam detection and with such applications as classifiers, Machine Learning, Data Mining and more.


Mathematically it is part of a rift in statistics and probability, and I confess I may not yet call myself "an initiate of the Bayesian conspiracy", but it is hard to deny its utility. Plus there seems to be a more axiomatic basis that relates to Boolean Logic and Set Theory, which makes it all the more alluring.
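Bayes' Theorem is a one-liner in Python; here it is with toy spam-filter style numbers (all made up for illustration):

```python
def bayes(p_b_given_a, p_a, p_b):
    # P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# toy numbers: P(spam) = 0.2, P("free" | spam) = 0.6, P("free") = 0.16
p_spam_given_free = bayes(0.6, 0.2, 0.16)
print(p_spam_given_free)  # 0.75
```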



Fermat's Little Theorem

$$p \in \mathbb{P} \;\Rightarrow\; a^{p} \equiv a \pmod{p}$$

$$(p \in \mathbb{P} \,\land\, a \perp p) \;\Rightarrow\; a^{p-1} \equiv 1 \pmod{p}$$

These two equations of Fermat's Little Theorem are really the same equation written in two different forms, where (a ⊥ p) means coprime and (p ∈ P) means prime.


This is a beautiful little theorem in Number Theory which can be generalized and written even more succinctly using Euler's Totient Function, represented by φ(n):

$$a \perp n \;\Rightarrow\; a^{\varphi(n)} \equiv 1 \pmod{n}$$
These ideas are crucial for understanding encryption algorithms like RSA, and there's this cool result.
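Python's three-argument pow does modular exponentiation, the same operation at the heart of RSA, and makes the theorem easy to check:

```python
p = 101  # a prime
for a in (2, 3, 10, 100):
    assert pow(a, p - 1, p) == 1   # a^(p-1) = 1 (mod p), since gcd(a, p) = 1
    assert pow(a, p, p) == a % p   # a^p = a (mod p)
print("Fermat's Little Theorem holds for these cases")
```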


Natural Join

$$R \bowtie S = \pi_{A}\bigl(\sigma_{C}(R \times S)\bigr)$$

Natural Join, like all of the Relational Algebra join operations, is a composition of the more basic operations: Selection (σ), Projection (π), Cartesian Product (×), Set Difference (−), and Union (∪).  Natural Join is the Cartesian Product of two tables, selecting matching rows and eliminating duplicate columns.  (In the formula above, the projection (π) is over the set (A) of attributes with duplicates removed, after the selection (σ) keeps the rows where the set (C) of common attributes match and are equal in both tables.)


The Relational Algebra is, in my opinion, a bit of an odd beast when it comes to algebraic structures, but if you use SQL you are already using an implementation of it.  As data persistence becomes more heterogeneous, I think understanding these concepts will become more important.  Additionally, Relational Algebra has strong ties to Abstract Algebra and Category Theory.
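Here is a sketch of Natural Join over lists of dicts in Python, joining on whatever attribute names the two "tables" share, mirroring the select-then-project composition above (the tables are made-up examples):

```python
def natural_join(r, s):
    common = set(r[0]) & set(s[0])  # shared attribute names
    return [
        {**row_r, **row_s}          # merged row; duplicate columns collapse
        for row_r in r
        for row_s in s
        if all(row_r[c] == row_s[c] for c in common)
    ]

employees = [{"dept": 1, "name": "Ada"}, {"dept": 2, "name": "Alan"}]
depts = [{"dept": 1, "dname": "Math"}, {"dept": 2, "dname": "CS"}]
print(natural_join(employees, depts))
# [{'dept': 1, 'name': 'Ada', 'dname': 'Math'}, {'dept': 2, 'name': 'Alan', 'dname': 'CS'}]
```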



The Fixed-Point (Y) Combinator

$$Y = \lambda f.(\lambda x.f\,(x\,x))\,(\lambda x.f\,(x\,x))$$
Just as we needed an "equation" from Formal Language Theory or Automata Theory, we really should have one related to the Lambda Calculus; in this case the equation is in the untyped lambda calculus.  Even though the Fixed-Point Combinator stands on its own, it is fairly well known, at least in name, due to Paul Graham's use of it for his incubator (Y Combinator).


The Fixed-Point Combinator allows one to implement anonymous recursion which is a powerful programming technique.  It also ties into some deeper theoretical aspects of computer science and logic such as Combinatory Logic.
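In a strictly evaluated language like Python, the equation above needs the eta-expanded variant (often called the Z combinator) to avoid infinite recursion; here is a sketch using it to define factorial without the function naming itself:

```python
# Z combinator: the strict-evaluation form of the fixed-point combinator
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# anonymous recursion: factorial defined without referring to its own name
fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
print(fact(5))  # 120
```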



O(n)

Many CS majors may cringe at this, and another possible reaction is that it's not much of an equation.  There's actually quite a bit to it; for example, did you know there are some substantial and quite beautiful equations associated with this seemingly simple formula?  One is the limit definition:

$$f(n) \in O(g(n)) \iff \limsup_{n \to \infty} \left|\frac{f(n)}{g(n)}\right| < \infty$$
Also, the following is a definition of (lim sup) Limit Superior:

$$\limsup_{n \to \infty} x_n = \lim_{n \to \infty}\Bigl(\sup_{m \geq n} x_m\Bigr) = \inf_{n \geq 0}\,\sup_{m \geq n} x_m$$
Where inf is infimum and sup is supremum, two concepts that seem to show up quite a bit, especially when you are concerned with bounds, as with Lattices, and they also show up in relation to Topology.


And we all know why we should care about this one: Algorithms, Algorithms, Algorithms.



Euler's Identity

$$e^{i\pi} + 1 = 0$$
Euler's Identity is also listed in Wired's article, and I have to agree that it is one of the most beautiful and intriguing formulas in math.


Now this isn't really as relevant to CS as the other formulas, but if you are a true computer geek you are hopefully some kind of math geek too, and this is such a great formula, not to mention that it appears in The Simpsons' "Homer3". It's actually a specific case (where θ = π) of Euler's Formula:

$$e^{i\theta} = \cos\theta + i\sin\theta$$
These formulas will give one a better understanding of complex numbers, which are needed in other areas of advanced math; these formulas also relate to Fourier Analysis, which has applications in Computer Science.
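Python's cmath module gives a quick numerical check of both the identity and the general formula:

```python
import cmath, math

print(cmath.exp(1j * math.pi) + 1)  # ~0 (up to floating point error)

theta = 0.75
lhs = cmath.exp(1j * theta)
rhs = complex(math.cos(theta), math.sin(theta))
print(abs(lhs - rhs) < 1e-12)       # True: e^(i*theta) = cos(theta) + i*sin(theta)
```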


I am curious to see how well this list holds up; how will I feel about it in a year?  Also, in writing this I realized that I still have quite a bit more to learn about at least half of this list, if not all of it, which made it hard to write, as it was quite a little (unfinished) research project. I have already written individual articles on several of these equations and have at least two in the works for Linear Algebra, at least two on the Relational Algebra, and one planned for Pascal's Triangle, and those are just the ideas I've already had; who knows what the future holds.  I hope you enjoyed reading it as much as I did writing it.


The Set version may not apply to all cases in set theory; this assumes Union and Intersection over the Power Set of a Set Universe, which would be closed under these operations and would include the complements of all elements.  The Set Universe is the upper bound and the empty set is the lower bound.