Elegant Coding

31 July 2011

Software Frameworks: Resistance isn’t Futile

As I have previously discussed, in my opinion there are three main framework components that can be described succinctly as, libraries, rules, and templates. It is the library component that I wanted to talk about here, perhaps in a context that might be seen as more evidence in support of frameworks in certain cases. Now to recap, building a framework does not make sense for all projects, the two main scenarios that I have seen that are extremely conducive to it are an organization with multiple similar projects with similar problem domains. The second case is a medium to large project with a lot of commonality that would favor reusability across the project. One of my complaints about the Framework debate is that it is debated as a black and white argument. Either frameworks are absolutely required or the worst thing you can do. Now I am sure that many developers can anecdotally cite either side of this argument, which is what I feel really drives this debate and there is no doubt that I do this as well, but the goal, in my opinion, is to step back and look at this problem from a bigger perspective.

One place that I have found some interesting perspective is in a paper by Todd L. Veldhuizen titled: "Software Libraries and Their Reuse:Entropy, Kolmogorov Complexity, and Zipf’s Law", there is a slide version here, now a word of caution this paper is very math intensive, but it should be possible to read it and gain some insights without understanding the math, for example in the paper he states the following:

A common theme in the software reuse literature is that if we can only get the right environment in place— the right tools, the right generalizations, economic incentives, a "culture of reuse" — then reuse of software will soar, with consequent improvements in productivity and software quality. The analysis developed in this paper paints a different picture: the extent to which software reuse can occur is an intrinsic property of a problem domain, and better tools and culture can have only marginal impact on reuse rates if the domain is inherently resistant to reuse.

I think this is a good observation, many projects that I have worked on have exhibited characteristics that are favorable to reuse, but I have read a number of counter arguments especially by people in fast paced startups where the flux of the system evolution is potentially resistant to reusability, actually in this case it’s probably the SDLC, not necessarily the problem domain that is resistant. Also I would suspect the probability of reuse is in part inversely proportional to system size, so for small systems it’s less likely or at least will have a smaller set of reusable components, so an investment in reuse may not be seen as justified.

Another interesting observation from the paper is:

Under reasonable assumptions we prove that no finite library can be complete: there are always more components we can add to the library that will allow us increase reuse and make programs shorter. To make this work we need to settle a subtle interplay between the Kolmogorov complexity notion of compressibility (there is a shorter program doing the same thing) and the information theoretic notion of compressibility (low entropy over an ensemble of programs).

This is especially interesting if you have some familiarity with Information Theory, and if you don’t I recommend learning more about it. Here he is comparing characteristics of both Algorithmic Information Theory [Kolmogorov complexity] and Shannon’s Information Theory [information theoretic notion]. Roughly, Algorithmic Information Theory is concerned with the smallest algorithm to generate data and Shannon’s Information Theory is about how to represent the data in the most compact form. These concepts are closely related to data compression and in the paper this is paralleled to the idea that reusing code, will make the system smaller in terms of lines of code, or more specifically: symbols, which effectively "compresses" the codebase. In Algorithmic Information theory you can never really know if you have the smallest algorithm so I may be taking some liberty here, but I think the takeaway is that when trying to create reuse you can probably do it forever so one needs to temper this desire with practicality. In other words there is probably a point where any subsequent work towards reuse is a diminishing return.

I find the paper compelling and I confess that perhaps I am being bamboozled by math that I still do not fully understand, but intuitively these ideas feel right to me. Also the application of Zipf’s law is interesting and should be pretty intuitive, once again roughly, Zipfs law relates to power curve distributions, also related to the 80/20 rule, the prime example is the frequency of English words in text, words like [and, the, some, in, etc.] are much, perhaps orders of magnitude, more common than words like [deleterious, abstruse, etc.]. This distribution shows up in things like the distribution of elements in the universe, think hydrogen vs. platinum, the wealth distribution of people you vs. Bill Gates, how many followers people have on twitter, etc. and to a smaller scale, the curve is scale invariant, in software, often some components will have a fair amount of reuse, things string copy functions, entity base classes, etc., whereas others may only have a couple of reuses.

On the lighter side, Power curves relate to Chaos Theory, I have seen a number of people including Andy Hunt draw parallels between Agile and Chaos theory, although these are usually pretty loose, it does strike me that one way to model chaos is through iterated maps, which is reminiscent of the iterative process of agile, also the attractor and orbit concepts seem to parallel the target software system as well.

Another place, and this is one that most developers will find more accessible, I would have lead with this but the title and flow leaned the other way, is Joshua Bloch’s "How To Design A Good API and Why it Matters". Actually I think this a presentation that every developer should watch especially if you are involved in high level design and architecture related work, and don’t worry, there is no math to intimidate the average developer. A summary article version can be found here, but I would still recommend watching the full video at least once if not multibple times. Slides to an older version of the talk can be found here.

In his presentation he talks about getting the API right the first time because you are stuck with it. I think this helps illuminate one very important and problematic aspect of framework design and software development. The problem can be illustrated with a linguistic analogy, in linguistics there are two ways to define rules for a language: prescriptive, is where you apply (prescribe) the rules on to the language vs. descriptive where the rules describe the language as it exists, I strongly favor the descriptive. One of the most famous English language rules: "not to end sentences with prepositions" is a prescriptive rule thought to be from Latin which was introduced by John Dryden and famously mocked by Winston Churchill "This is the sort of nonsense up with which I will not put.", pointing out that it really doesn’t always fit a Germanic language like English. I know I’m a bit off topic again with my "writer’s embellishment", not to mention that the same idea is discussed in depth by Martin Fowler which he terms "Predictive versus Adaptive".

It is a common problem which Agile attempts to address and it is common in general software design and construction, it also occurs API and framework design. Software construction as we know is an organic process and I feel that frameworks are best developed in part out of that process though Martin Fowler’s Harvesting which can be termed as descriptive framework creation. What Joshua Bloch in part describes and to some degree cautions against can be described as a prescriptive approach to API/Frameworks. I think many developers including me have attempted to create framework components and API’s early in a project usually driven by a high level vision of what the resulting system will look like only to find out later that certain assumptions were not valid or certain cases were not accounted for¹. What he talks about is a pretty rigid prescriptive model which is in many ways at odds with an adaptive agile approach, I feel that the more adaptive agile approach is really what is needed for the framework approach and we do see this via versioning, for example the differences between Spring 1.x and Spring 3.x are substantial and no one would want to use 1.x now, but there are apps that are tied to it now. Also this approach of complete backwards compatibility was used with Java Generics, specifically Type Erasure which has lead to a significant weakening of the implementation of that abstraction. It is my understanding that Scala has recently undergone some substantial changes from version to version leading some to criticize its stability while others cite that it is the only way to avoid painting yourself into a corner like with Java. The harvesting approach will often involve refinement and changes to the components which are extracted and this can lead to the need for refactorings that can potentially affect large amounts of already existing code. It’s a real chicken and egg problem.

He starts his presentation with the following Characteristics of a Good API:

Characteristics of a Good API
Easy to learn

Easy to use, even without documentation

Hard to misuse

Easy to read and maintain code that uses it

Sufficiently powerful to satisfy requirements

Easy to extend

Appropriate to audience

He also makes the following point:

APIs can be among a company's greatest assets

I think this is sentiment is reflected in Paul Graham’s "Beating the Averages" of course that is more about Lisp but underlying principle is the same, actually an interesting language agnostic point comes from Peter Norvig, I can’t find the reference, but he said he had a similar attitude towards Lisp until he got to Google and saw good programmers who were incredibly productive in C++. I feel that this is all just the framework argument, maximizing reusability by building reusable high level abstractions within your problem domain that allow you to be more productive and build new high level components more quickly, it’s all about efficiency.

In regards to API’s he adds:

API Should Be As Small As Possible But No Smaller

To which he adds:

When in doubt leave it out.

Conceptual weight is more important than the bulk - The number of concepts.

The most important way to reduce weight is reusing interfaces.

He attributes this sentiment to Einstein, but he wasn’t sure about it, I did some follow up the paraphrasing is "Everything should be made as simple as possible, but no simpler." or "Make things as simple as possible, but not simpler." More about that can be found here. These are good cautions about over-design or over-engineering APIs, Frameworks, and Software in general, I have definitely been guilty of this at times, once again this is something that needs to be balanced.

The following is what I consider to be seminal advice:

All programmers are API designers because good programming is inherently modular and these inter modular boundaries are API’s and good API’s tend to get reused.

As stated in his slides:

Why is API Design Important to You?
If you program, you are an API designer
Good code is modular–each module has an API

Useful modules tend to get reused
Once module has users, can’t change API at will

Good reusable modules are corporate assets

Thinking in terms of APIs improves code quality

The next two quotes are pretty long, and it was no easy task transcribing them, he talks really fast, I tried to accurately represent this as best as possible, also I feel that he really nails some key ideas and I wanted to have a written record of it to reference, since I am not aware of any other:

Names matter a lot, there are some people that think that names don’t matter and when you sit down and say well this isn’t named right, they say don’t waste your time let’s just move on, it’s good enough. No! Names, in an API, that are going to be used by anyone else that includes yourself in a few months mater an awful lot. The idea is that every API is kind of a little language and people who are going to use your API needs to learn that language and then speak in that language and that means that names should be self explanatory, you should avoid cryptic abbreviations so the original Unix names, I think, fail this one miserably.

He augments these ideas by adding consistency and symmetry:

You should be consistent, it is very important that the same word means the same thing when used repeatedly in your API and you don’t have multiple words meaning that same thing so let us say that you have a remove and a delete in the same API that is almost always wrong what’s the difference between remove and delete, Well I don’t know when I listen to those two things they seem to mean the same thing if they do mean the same thing then call them both the same thing if they don’t then make the names different enough to tell you how they differ if they were called let’s say delete and expunge I would know that expunge was a more permanent kind of removal or something like that. Not only should you strive for consistency you should strive for symmetry so if you API has two verbs add and remove and two nouns entry and key, I would like to see addEntry, addKey, removeEntry, removeKey if one of them is missing there should be a very good reason for it I am not saying that all API’s should be symmetric but the great bulk of them should. If you get it right the code should read like prose, that’s the prize.

From the Slides:

Names Matter–API is a Little Language
Names Should Be Largely Self-Explanatory
Avoid cryptic abbreviations

Be consistent–same word means same thing
Throughout API, (Across APIs on the platform)

Be regular–strive for symmetry

Code should read like prose

I feel that this hits some essential concepts which really resonate with me, in fact I have follow on posts planned to further deconstruct and develop these ideas more generally. From the framework perspective this also gets at some of the variety of the Framework code components, they can be the Paul Graham’s functional Lisp abstractions, they can be DSL’s, they can be Object Oriented like Spring, Hibernate and Java API, etc. Any framework built for a domain will have their conceptual vocabularies or API languages that are a higher level abstraction of the problem domain, Domain Specific Abstractions, and they all benefit from concepts like consistency and symmetry as appropriate to the domain.

The following is a very common and widespread problem that often inhibits reuse and leads to less efficient production of lower quality software:

Reuse is something that is far easier to say than to do. Doing it requires both good design and very good documentation. Even when we see good design, which is still infrequently, we won’t see the components reused without good documentation.
- D. L. Parnas, Software Aging. Proceedings of the 16^th International Conference on Software Engineering, 1994.

He adds:

Example Code should be Exemplary, one should spend ten times as much time on example code than production code.

He references a paper called "Design fragments" by George Fairbanks also here which looks interesting but I have not had time to read it yet.

This is also, I believe, to be a critical point that is possibly symptomatic of problems with many software projects. I feel that projects never explicitly allow for capturing reuse in terms of planning, schedule and developer time. Often reuse is a lucky artifact if you have proactive developers who make the extra effort to do it and it can often be at odds with the way I have seen projects run (mismanaged). I have some follow up planned for this topic as well.

In terms of design he adds this interesting point:

You need one strong design lead to that can ensure that the api that you are designing is cohesive and pretty and clearly the work of one single mind or at least a single minded body and that’s always a little bit of a trade off being able to satisfy the needs of many costumers and yet produce something that is beautiful and cohesive.

I have to confess that I do not recall how I came across Todd Veldhuizen’s paper or Joshua Bloch’s talk, but I felt that they were really about similar ideas, in writing this and finding all of the references again I realized that my association of these two was not coincidental at all. For they are both part of the Library-Centric Software Design LCSD'05 workshop for Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'05) with Joshua Bloch delivering the same talk as the Keynote Address.

Now I admit that one goal of mine is to put these ideas, much of which was borrowed from the two referenced works, into a written form that I can reference back to, I hope this stands up by itself but is really written for my future entries. Also, for the record this is not the first time I have "leached" off of Joshua Bloch’s work.

Ultimately my ideas will diverge from some of those in regards to API design, Joshua Bloch speaks from a position of greater responsibility in terms of APIs. I see them as part of a framework continuum used to construct software and I think many of his ideas apply directly and more generally to that. Also I see framework and the system as interlocked and feel that frameworks can drive structure and consistency for example Object-Oriented API’s aka frameworks can in turn drive the structure of software, the Spring Framework is a good example of this, Spring is build heavily around IOC which is one of the SOLID principles.

There will always be arguments against frameworks, the classic one is it creates more code that is more complex and it requires more learning, my counter argument to this is twofold: First any code that is in production probably has the need to be known and understood which might require it to be learned regardless of how efficiently it is created. Also if a framework creates reusability and consistency the initial learning curve will be higher but each subsequent encounter with a codebase that constructed this way should be easier. Also highly redundant inconsistent code is potentially (much) more difficult to learn and maintain because there is more of it. The second is if your framework API is well defined and well documented it should make the resultant code much easier to understand and maintain aka "read like prose". This will be due to the fact that much of the "generic" complexity is "abstracted" downwards into the framework level. For example compare the code to implement a Spring Controller to the underlying classes that do the work such as AnnotationMethodHandlerAdapter Now if you have defects and issues at that lower level they will be harder to fix and changing common code can have side effects to other dependant code, it’s not a perfect world.

I think the issue with reuse and the framework approach is asking the right question: How resistant (or favorable) is you domain and your SDLC to reuse? I think most domains have some if not a fair amount of favorability to reuse and I see reuse as increased efficiency in software construction.

¹Perhaps: "...certain cases for were not accounted." Dryden blows.

24 July 2011

Triangles, Triangular Numbers, and the Adjacency Matrix

One of the cool things about math is that there are many interconnections and interrelations between the various ideas and concepts and looking at things from a broader perspective can often be enlightening.

The area of a triangle is something that we learn in elementary school, its formula, often stated as one half the base times the height, is:

It’s pretty basic stuff, perhaps even boring.

In Number Theory and Discrete Math there is a sequence of numbers known as the Triangular numbers, each triangular number (T_n) is the sum of all the integers up to an including itself. For example

T₁ = 1 = 1

T₂ = 1 + 2 = 3

T₃ = 1 + 2 + 3 = 6

T₄ = 1 + 2 + 3 + 4 = 10

…

T_n = 1 + 2 + 3 + ... + (n -1) + n

These can be generalized by the following formula:

The capital Sigma notation is Summation, which is equivalent to the following simple Java code snippet:

public int triangularNumber(int n)

int result = 0;

for(int k = 1; k <= n; k++) {

result += k;

}

return result;

}

Triangular numbers are the additive analog to the factorial function, i.e. the addition of 1..n vs. the multiplication of 1..n to and can also computed recursively:

T_n = n + T_n-1

T_n = n + ((n - 1) + T_n-2)

. . .

T_n = n + (n - 1) + ... + 2 + 1

Which can be implemented in Java as:

public int triangularNumber(int n)

if(1 == n)

return n;

else

return n + triangularNumber(n - 1);

}

Triangular numbers can be drawn visually as a triangle made up of dots, the first 5 triangular numbers are 1, 3, 6, 10 and 15:

There is another, better way, to compute triangular numbers, a closed formula that Gauss¹ came up with when he was a child:

If you assign (n + 1) to be the base and (n) to the height, the formula bears a striking resemblance to the formula for the area of a triangle, seen visually as:

As you can see the base of the triangle ends being one larger than the actual size of the triangle and results in an "area" is larger than would be expected by the continuous area. I think this might be doing the equivalent to what the ceiling function does to round up and acquire the "extra bits" of the discrete case. This triangle was referred to by Fermat as the Collateral Triangle, although I think the term "discrete triangle" is better, and this concept relates to the differences between discrete and continuous mathematics. This is explored in more detail by David J. Pengelley specifically in the following the paper: "The bridge between the continuous and the discrete via original sources" and partial chapter: "The Bridge Between Continuous and Discrete". Actually his work, especially in regards to the Collateral Triangle aka discrete triangle was something of a missing link for my ideas here. The following graph was derived from a graph in one of his papers. I refactored it to use the function y=x and pimped it out a bit to better demonstrate my triangle theme, visually shows the differences between the discrete and continuous "areas" in terms of Summation and Integration:

The "area" of discrete vs. continuous triangles

The diagram shows how the discrete area picks up more in that jagged edge than the continuous area does and ends up being a larger quantity, n/2 to be exact as seen in the formulas, this is the type of graph that is usually used as an introduction to integral calculus to show how integration works, where in the continuous world the rectangular sections are infinitesimal. In this example the function y = x creates an area under its curve that is a unique² triangle, drawn in red, the famous 45-45-90 triangle which is both right and isosceles and in this case the height and base are equal, both the value of n so the area is (n*n)/2 which is the result of the above integral which measures the area of the red triangle where height = n and base = n. The discrete triangle, Fermat’s Collateral Triangle, is superimposed into each 1 x 1 cell of the graph as light blue circles with its area represented by the summation below the diagram and to the left. Also note that integration and Summation are continuous and discrete analogs of each other. I think there’s probably a lot more to this but that’s all I need for my purposes so I’m going to have to leave it there.

Triangular numbers have the property if you add two consecutive triangular numbers they add to a square which can be expressed as the following equation:

T_n-1 + T_n = n²

This addition can be seen visually as:

T₄ + T₅ = 10 + 15 = 5² = 25

Since, from our recursive definition, T_n= n + T_n-1 we can rewrite (T_n-1 + T_n = n²) from above as:

T_n-1 + T_n-1 + n = 2 ∙ T_n-1 + n = n²

This can be seen visually as:

2 ∙ T₄ + 5 = 2 ∙ 10 + 5 = 5² = 25

Another pattern in drawing triangular numbers is to replace the dots with ones (hold this thought for a moment):

Triangular numbers show up in Graph Theory, the Complete Graph, the graph where all vertices are connected to every other vertex exactly once, will have T_{n -1} edges for a graph with n vertices, the complete graph is denoted as K_n. The first three examples are:

T₁, T₂, and T₃ are the count of edges for K₂, K₃, and K₄. So if we define the function E(G) where G is a graph and E(G) is the edge count, i.e. number of edges in the graph, then E(K_n)=T_n-1, this only applies to the Complete Graphs K_n not all graphs.

The formula for the E(K_n) and T_n-1 is:

Let’s consider the graph K₆:

Now in looking at the fully constructed graph we don’t really get a true sense intuitively about the relation to triangular numbers so let’s look the construction from an inductive and combinatorial perspective:

So let’s build K₆. We’ll start with vertex 1 which is blue and draw an edge (a blue line) to each other vertex which results in 5 blue edges.

Now we move (clockwise and numerically increasing) to vertex 2 in green, since we already have a blue line back to vertex 1 we don’t draw another, remember: no repeated edges. So we will only draw edges to vertices that we have not visited yet which leaves 4 remaining which are drawn in green.

As we continue this we add 3 red edges, 2 yellow, 1 black, and 0 purple since we have connected to all vertices upon reaching this last vertex.

As shown in the above diagram in the appropriate colors, the edges are added as 5 + 4 + 3 + 2 + 1 which is the 5^th Triangular Number for K₆ with 6 vertices. Also this inductive construction is similar to the above recursive function; remember recursion and induction are related, except you would call it with (n - 1) e.g. triangularNumber(n -1).

The Adjacency Matrix is a way to represent the edges between vertices in a Matrix. For a graph of n vertices it has the form:

Each position a_r,c where r is the row and c is the column represents an edge, or possibly a count of edges, that connects vertex number r to vertex number c. The following gives the template structure for the matrix for K₆:

As you can see from the above matrix that each colored edge maps to the corresponding adjacency element, edge 1 –> 2 is represented by the blue a_1,2 and that all of the 1 -> n and n -> 1 edge elements appear in blue and this pattern repeats for all the colored edges. For this example we are only going to consider Simple Graphs, usually just referred to as Graphs, which only have one edge between any two edges as opposed to Multigraphs which allow edges to repeat between vertices so all elements of the matrix can only be either 0 or 1. Also we are not considering graphs which have Loops so the diagonal where each the row equals the column a_i,i will always be 0. So our template will now take the (très triangular) form:

We are only considering the Adjacency Matrix for the Undirected Graph which will have the property that the upper right triangular portion will be symmetric to the lower left triangular portion, or in other words the terms a_i,j and a_j,i are equal: a_i,j = a_j,i, this is referred to as a Symmetric Matrix. From our colorful K₆ example from above we have the following adjacency matrix. The color which maps to the edges to their representation in the matrix also helps illuminate the symmetry, if you visualize rotating the matrix around the diagonal of zeros you can see that they are the same.

Also you may have noticed, and this is not a coincidence, that this structure is very similar to our square constructed from n and the two T_n-1 triangles (T_n-1 + T_n-1 + n). If you rotate the above square by 90 degrees either in a clockwise or counterclockwise direction and replace the black dots with zeros and the blue dots with ones you get the adjacency matrix for the K_n graph.

Well this is my first, but not last "pure" math post, except perhaps for the two code snippets. I feel that these concepts are important and relevant to programmers and probably willso all positions of the graph can either be 0 or 1 only be more so in the future, the Adjacency Matrix is pretty important regarding anything relating to graph Theory like the Page Rank Algorithm or Social Network Theory, there is a lot more to say about it and I have at least one follow on post planned in this area as well. The goal here was to present the Adjacency Matrix in a context that makes it more interesting and intuitive, and to convey some of the beauty of these structures, hopefully I succeeded in some small measure to do that. I admit my approach here is somewhat unconventional and I may have taken some artistic license with some ideas such as the terms combinatorial and inductive, but I think they fit. Also I am not sure I fully understand the concept of area in the discrete context. I hope you found this as enjoyable to read as I found it to write. When I first had the idea I realized that I needed to draw a number of images, Processing really came in handy and was a blast to use for it, and it has given me a whole new perspective on geometric/graphical programming and visualization, I hope to do more with it including writing about it. Additionally LaTex has been great for creating math expressions.

¹This was also known to ancient mathematicians.

²Only one right-isosceles triangle exists in Euclidean Geometry.

18 July 2011

The Certificate Industry

I have been thinking about writing this for a while now and as I write this there is something of an education debate in part spurred on by remarks made by Peter Thiel. I have no interest in being part of that debate. Actually this is more of a rant that is on the periphery of that discussion. My complaint is that there seems to be a whole industry that surrounds the idea of certifying people as knowledgeable or learned in certain areas whereas my experience is that many people who carry these certifications then think themselves knowledgeable and or qualified to do certain tasks yet they are often severely lacking in knowledge and ability. I have met many people who look at education solely as a means to a job and then rest on the laurels of their prior educational accomplishments and see no need to further their education beyond college. I cannot tell you how many people I have met who seem to have no interest in the very subject that they majored in college, it makes me feel as though the degree was the only goal not the knowledge gained to receive the degree and picking a major was a mandatory option to attain the degree. It is this type of goal oriented education that seems to be the bane of our educational system which is nicely, albeit nervously, articulated ¹ by Erica Goldson.

I am lumping college degrees including advanced degrees and corporate, often product related, certifications together, not to mention the mentality by corporations that these, as Erica Goldson puts it, "pieces of paper that tell us that we are smart enough to do so" seem to be seen as the end all in qualifications. Now obviously not all companies think this way, in fact determining actual skills and knowledge during interviews, especially of programmers, seems to be a subject of much protracted debate on the inter-tubes. I am not saying that are not people who effectively avail themselves of their educational opportunities and are not passionate about their fields, I just feel these people are not in the majority.

The certificate that I find the most irksome is PMI’s PMP. In my experience I have dealt with many managers who proudly append this suffix to their names and companies seem to regard this as a concrete qualification for managing a project, however, my experience is that it usually is not. The most egregious example of this in my experience was using twenty-something liberal arts majors with PMP certifications to manage software projects as I have previously mentioned. I know someone who is pretty involved in the PMI community and she points out that this deficiency is not the methodology itself but due to the fact that the processes are either not applied or improperly applied, so I am in no position to comment on the efficacy of their methodology. Regardless of that fact it seems like PMI has a pretty lucrative business selling these certifications.

On the other end of the spectrum I have encountered people who hold masters degrees in Computer Science, who seem rather lacking in the fundamentals, I have met relatively recent graduates who strain to tell you the difference between an NDFA and DFA. In one case a young recent graduate saw a document on by desk with the title "Lambda Calculus" and asked me why I was reading about Calculus, when I responded "it’s not Calculus it’s Lambda Calculus", he replied with "what’s that?" I then said, "You know what a Turing Machine is" thinking I would explain it in terms Turing Equivalence, and he replies "No." At that point I was dumbfounded and retorted with "How the hell can you have a Masters in Computer Science and not know what a Turing Machine is?" I then pulled it up on Wikipedia and he claimed to know it, he probably did, it turned out he was pretty bright with a fair amount of potential but kind of an airhead, still it struck me as pretty lame, plus shouldn’t someone with a Masters in CS at least have heard of Lambda Calculus and know about Turing Machines off of the top of their head? Now this doesn’t only apply to theoretical knowledge, I have worked with some Masters degree holders who were terrible programmers. Once again this not to say that there aren’t bright knowledgeable Masters Degree holders who are knowledgable and good programmers, but clearly having that piece of paper does not make it so.

Now these are just my experiences but you will find other mention of these types of experiences also related to Microsoft, Sun and other corporate certifications, remember selling certifications are businesses which can be fairly profitable.

I recently came across this article on the New York Times website about how the movie "The Social Network" has renewed interest in Computer Science Curriculums because many starry eyed college students are seeing themselves as possibly being the next Mark Zuckerberg. This reminds me of a previous coworker who had graduated around the time of the last tech bubble, this guy was actually pretty sharp but he had sporadic interest in the day to day work and would often gripe that he really didn’t like the field, and he was only in it because it was a well paying career. I admit as someone who is passionate about my career and my field of study I find this mentality extremely frustrating, I feel that these are the exact people that you do not want to hire, now obviously not all of these new enthusiasts will be like that, but as a word of advice, most people working in startups tend to make less money than salaried employees, because even if your company is successful that doesn’t mean that it will have some huge market valuation and if it fails you are out of the street to start anew or find a job, of course if you want try it go for it. Now there are diverse areas that you can go into in IT but be prepared to find something that you like to do every day, don’t just do it because it pays well or you might get rich, both may happen especially the former which is a lot less glamorous. I’m just saying for every Mark Zuckerberg there thousands of us workaday developers who are just out there making a living at our "craft". And you may find that your life in this field ends up being more like the movie "Office Space" than the "Social Network".

Just to be clear here, I really enjoy and admire academia and academic work, but unfortunately I think, perhaps unjustly, that many CS curriculums are broken in terms of practicality and how and what math they teach you. So if you want to learn programming and start a company, you might want to: just do it, but if you are serious about the field and the discipline of software then you might want to go the education route and avail yourself of the possible opportunities that it offers.

Maybe it’s just because I’ve become jaded over the years but I am really skeptical about this Hollywood/Rockstar mentality that seems to now taint our industry and I think that in some ways this is indicative of the real problem with education in our society in that we view education solely as a means to a better job and we do not truly respect knowledge and learning. It’s all about the Benjamins baby.

¹I admire her Courage for speaking out like that.

19 June 2011

Free Math Resources You Can Use

In my opinion, this actually may be the best time in human history to learn math, or just about anything for that matter. If you are of sufficient means to have access to the internet, which unfortunately not all people are, you have access to an unprecedented amount of information. The problem is the overwhelming amount of information that is out there and how to find it. So I wanted to share some of my tricks to finding good math information, which I call "Math Mining", of course these can be used for any academic information, so maybe "Academic Mining" may be more apropos.

One of the reasons that the present time is a good time to learn math is due to the diversity of sources for information, Wikipedia is one of those sources, Wiki Surfing, as has been previously discussed, but that is just the tip of the math iceberg and the really cool thing is that there are many resources that can give you entirely new and fresh perspectives on things that may sometimes seem dull and obfuscated by more traditional approaches found in books, not that there aren't a lot of great books too. It really is an exciting time.

Many professors have their publications, notes and course resources freely available on the internet and some of these include full books in pdf or ps or html format. In fact that leads me to my little google trick, let's assume you want to find some information about a math subject, we'll use Linear Algebra as an example. Then if you use the Google search:

"Linear Algebra" inurl:pdf

You will get a lot of hits that are academic pages, these will be a mix of publications and course related material. Once you find a document, you can use that url to find more information. For example let's say that our above search leads us to the following (fictitious) url:

www.math.umars.edu/~hjfarnsworth/math420/fall2010/chapter10.pdf

After you click the link and get your reward, you should realize that this is a potential gateway to much more information. Now admittedly this might be seen as a moral gray area, because sometimes I get the feeling that some of these resources are not as openly exposed as they could be so it may be that the instructors do not want to openly share their work and they are practicing security through obscurity, but in my opinion if directory browsing is enabled and/or your documents are indexed by Google, then they're fair game, so if you are someone who this applies to, I suggest that you either share it openly it or lock it down. I encourage anyone who is sharing their work to do it freely and openly regardless of whether people are taking your classes. After all it's for the greater good. "The greater good." And if you openly share it then people like me can read it, learn it, know it and talk about how awesome you and your work are. It's a win-win.

Hack the Site

Hack #1 Url Suffix Removal

By removing "chapter10.pdf" yielding www.math.umars.edu/~hjfarnsworth/math420/fall2010/ will expose more resources if this directory has browsing enabled or if it has a default page. You can progressively remove directories to find one that is useful, and actually sometimes it is worth it to jump directly to www.math.umars.edu/~hjfarnsworth/ which will often be a professor home page which can yield links to publications, course pages with documents, and other potentially interesting information.

Hack #2 File Name Enumeration

So you looked at chapter10.pdf and it's awesome but Hack #1 did not yield it or the related chapters. Due to the naming convention try: www.math.umars.edu/~hjfarnsworth/math420/fall2010/chapter09.pdf or www.math.umars.edu/~hjfarnsworth/math420/fall2010/chapter9.pdf, often this approach will yield other related documents.

Hack #3 Invoke the Power of Google

Let's say the hack #1 didn't work and the resultant url had a random characteristic like:

www.math.umars.edu/~hjfarnsworth/math420/fall2010/larnd42-chap10.pdf

The following Google search will ferret out those pesky hard to find pdf's:

site:www.math.umars.edu/~hjfarnsworth/ inurl:pdf

Also you can use .ps and .ps.gz in place of .pdf for file type searches. If you feel that this is crossing some kind of moral line then don't do it, but I like to say all is fair in Love and Math.

I would like to give another example of this technique, I recently came across "Mapreduce & Hadoop Algorithms in Academic Papers (4th update - May 2011)" which linked to "Max-cover algorithm in map-reduce" which caught my interest, and of course the ACM is charging for it, but no worries, there is usually no need to pay them, actually I recommend boycotting them. I employed the above tricks but they didn't work, simply Googling one of the authors did (always pick the most unique name(s)):

"Flavio Chierichetti"

Pulled up his web site which had a free copy of the paper, now all I have to do is find the time to read it. Also the above techniques yielded the paper's "cliff notes".

Of course you can just look up someone by name, for example, you can find some of Donald Knuth's publications here.

In regards to academic publications there are two excellent repositories with a wealth of information these are Citeseer out of Penn State, this site can be a little flaky in terms of availability, at least that's been my experience in the past and the other is arXiv run by Cornell University. These mostly contain research oriented work but you can often find relevant information even for neophytes, actually a lot of advanced papers and books for that matter start out with introductory sections that can be worth looking at.

Encyclopedic and other Miscellaneous Resources

Wikipedia, obviously, as previously mentioned. Also the oft controversial Stephen Wolfram provides an excellent resource called Wolfram Mathworld.

Project Euler is a site dedicated to collaboratively solving math oriented problems programmatically more about it can be found here.

Math on the Web by category here provides some interesting links, I believe this is run by the American Mathematical Society but I am not sure.

The National Institute of Standards and Technology site: NIST Digital Library of Mathematical Functions.

Also there is Mathoverflow which is a Stackoverflow type of question and answer community devoted to Math.

Blogs

There are a number of blogs that blog about both math and programming related math. Actually if your primary interest is machine learning, I recommend Bradford Cross's Measuring Measures blog, it is hard to find things on his site and it was recently restyled with a magenta/maroon background which I now find a little bit harder to read. The relevant links here are: Learning About Network Theory, Learning About Statistical Learning, and Learning About Machine Learning, 2nd ed. Additionally Ravi Mohan did a follow-up: Learning about Machine Learning.

Good Math Bad Math by Mark Chu-Carroll has lots of good articles about math including some for beginners in various areas. Catonmat by Peteris Krumins has some nice entries with notes about the online MIT courses that he has worked through which currently covers Algorithms and Linear Algebra also mentioned above. The Unapologetic Mathematician has a lot of nice articles, this is a bit more advanced though. Math-Blog has a lot of articles as well. They tend to focus on more traditional areas of math. Math blog's abound and there are too many to mention, here's a few:

The n-Category Café
Michi's blog
A Neighborhood of Infinity
XOR's Hammer
Gödel's Lost Letter and P=NP
Terrance Tao's Blog
Arsmathematica
Steven Strogatz
Devlin's Angle by Keith Devlin
The programming/CS Theoretical Lambda The Ultimate

Lastly I will mention a blog by Jeff Moser, actually he only has a few math related posts, but his Computing your Skill on Probability and Statistics is a beautiful work of art well worth looking at.

Online Courses

The well known Khan Academy offers a number of courses including several math courses.

MIT Open Courseware has many online courses most notably for CS majors Introduction to Algorithms by the venerable Charles Leiserson and Erik Demaine videos here and Linear Algebra by Gilbert Strang.

On Stanford Engineering Everywhere the following might be of interest:

Artificial Intelligence | Machine Learning

Artificial Intelligence | Natural Language Processing

Linear Systems and Optimization | The Fourier Transform and its Applications

The Mechanical Universe is primarily dedicated to physics, but several math topics such as Calculus and Vectors are covered explicitly. It's also a nice series of lectures on the topic in spite of being a little dated in productions values.

Math Software

There are many math related software packages and libraries three of which are covered in more detail here.

Math library Sage written in Python

GNU Octave

The R project for Statistical Computing

Scilab

Maxima, a Computer Algebra System

Various Books and Academic Stuff

Here are a bunch of interesting courses and books that I have encountered during my searching which you might find interesting as well. These are presented in no particular order:

Algorithims

Algorithms by S. Dasgupta, C.H. Papadimitriou, and U.V. Vazirani.

Jeff Erickson has some Algorithms Course Materials

Steven Skiena author of the The Algorithm Design Manual offers some pretty comprehensive course notes for his cse541 LOGIC for COMPUTER SCIENCE not to mention the opportunity to learn how to bet on Jai-alai in the Cayman Islands.

Gregory Chaitin's Algorithmic Information Theory.

Computer Science

Foundations of Computer Science by Jeffrey Ullman and Al Aho.

The Haskell Road to Logic, Math and Programming by Kees Doets and Jan van Eijck

Discrete Math

Discrete Mathematics with Algorithms by M. O. Albertson and J. P. Hutchinson.

Analysis

Analysis WebNotes is a self-contained course in Mathematical Analysis for undergraduates or beginning graduate students.

Introduction to Analysis Lecture Notes by Vitali Liskevich.

Applied Analysis by John Hunter and Bruno Nachtergaele.

REAL ANALYSIS by Gabriel Nagy.

Probability Theory

Introduction to Probability Theory by Ali Ghodsi.

Introduction to Probability by Charles M. Grinstead.

The first three chapters of Probability Theory: The Logic of Science by E. T. Jaynes. Can be found here.

Think Stats: Probability and Statistics for Programmers by Allen B. Downey.

LECTURE NOTES MEASURE THEORY and PROBABILITY by Rodrigo Bañuelos.

Principles of Uncertainty by by Chapman and Hall.

Information Theory, Inference, and Learning Algorithms by David MacKay.

Machine Learning/Date Mining

Machine Learning Module ML(M) by M. A .Girolami.

Alexander J. Smola's and and S.V.N. Vishwanathan's draft of Introduction to Machine Learning.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition) by Trevor Hastie, Robert Tibshirani and Jerome Friedman.

Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

Mining of Massive Datasets by Jeffrey Ullman.

Fourier Theory

Lecture Notes for EE 261 The Fourier Transform and its Applications pdf By Brad Osgood.

Abstract Algebra

Abstract Algebra by Thomas W. Judson.

James Milne has a number of sets of extensive notes on Algebraic topics like goup theory here.

ABSTRACT ALGEBRA: A STUDY GUIDE FOR BEGINNERS by John A. Beachy.

Elements of Abstract and Linear Algebra Edwin H. Connell.

Abstract Algebra by Elbert A. Walker.

A series of chapters on groups by Christopher Cooper.

Linear Algebra

Linear Algebra by Robert A. Beezer.

A course on Linear Algebra with book chapters.

Really cool interactive tutorial on Singular Value Decomposition by Todd Will.

Model Theory

Fundamentals of Model Theory pdf by William Weiss and Cherie D'Mello.

Set Theory

A book on Set Theory pdf by William Weiss.

Graph Theory

Reinhard Diestel makes his excellent and comprehensive book Graph Theory available, pdf here.

Logic

You can find Introduction to Mathematical Logic by J. Adler, J. Schmid, Model Theory, Universal Algebra and Order by J. Adler, J. Schmid, M. Sprenger and other goodies here.

Introduction to Logic by Michal Walicki.

Logic for Computer Science: Foundations of Automatic Theorem Proving by Jean Gallier.

The Novel Research Institute has a number of free academic books including: Logic and Metalogic:Logic, Metalogic, Fuzzy and Quantum Logics and Algebraic Topology, Category Theory and Higher Dimensional Algebra-Results and Applications to Quantum Physics

Category Theory

Some course notes on Category Theory by Tom Leinster.

Basic Category Theory pdf by Jaap van Oosten.

Abstract and Concrete Categories The Joy of Cats by Jiri Adámek, Horst Herrlich, George E. Strecker.

A gentle introduction to category theory --- the calculational approach pdf by Maarten M. Fokkinga.

Steve Easterbrook's An introduction to Category Theory for Software Engineers.

Algebraic Topology/Topos Theory

Eugenia Cheng of Catsters fame has a course in Algebraic Topology with some substantial notes.

The above links of Eugenia Cheng refer to Algebraic Topology by Allen Hatcher.

An informal introduction to topos theory pdf by Tom Leinster.

Topology

A free, protected, password available by request, e-book on topology: Topology without Tears by Sidney A. Morris.

Chapters for a topology course by Anatole Katok can be found here.

Computational Topology

Jeff Erickson has some nice notes on Computational Topology, pdf's can be found on the schedule page.

Afra Zomorodian has some nice resources on Computational Topology including a nice introductory paper.

Spectral Graph Theory

Fan Chung Graham has a lot interesting stuff, some pretty advanced, relating to graph theory including social graph theory and spectral graph theory.

Dan Spielman has some course notes on Spectral Graph Theory.

A course in game theory by Martin J. Osborne and Ariel Rubinstein, requires registration.

Algebraic/Enumerative Combinatorics

MIT Open Courseware in Algebraic Combinatorics

An uncompleted book and notes on Enumerative Combinatorics by the Late Kenneth P. Bogart also here.

Lionel Levine's notes on Algebraic Combinatorics

Richard P. Stanley new edition of Enumerative Combinatorics Volume one.

A Course in Universal Algebra by Stanley N. Burris and H.P. Sankappanavar

Misc

Pat Hanrahan's CS448B: Visualization.

Sean Luke's "Essentials of Metaheuristics".

It's all a click away

The links in this entry, especially the academic links are susceptible to link rot, people move from institution to institution or leave academia for jobs in the private sector. I will endeavor to revisit this entry and try to keep these up to date and perhaps even add to them, however, if you encounter this page and have any interest in any or all of these resources I recommend downloading them now so that you have them.

Using the resources of this blog you should be able to get your hands on a huge amount of free resources on a wide range of topics. This can be helpful if you are on a budget or just want to try before you buy an expensive book on a topic. I hope you avail yourself of some of these, there's lots of great stuff and if you know of some that I do not please add them in the comments.