Elegant Coding: 2011

24 December 2011

Metaphorically Thinking about Software Engineering

Please note this is a rewrite previous post that I was unhappy with.

Years ago I heard a comparison of software engineering to chemical engineering. I believe the intent of the analogy between the two fields was due to the fact that Software Engineering is a very young discipline just as Chemical Engineering was in relatively recent history, circa 1900.

Metaphors get used a lot in our industry, in Chapter 2: “Metaphors for a Richer Understanding of Software Development” of Steve McConnell’s Code Complete they are discussed in some detail. In that chapter he cites the following:

A confusing abundance of metaphors has grown up around software development. David Gries says writing software is a science (1981). Donald Knuth says it's an art (1998). Watts Humphrey says it's a process (1989). P. J. Plauger and Kent Beck say it's like driving a car, although they draw nearly opposite conclusions (Plauger 1993, Beck 2000). Alistair Cockburn says it's a game (2002). Eric Raymond says it's like a bazaar (2000). Andy Hunt and Dave Thomas say it's like gardening. Paul Heckel says it's like filming Snow White and the Seven Dwarfs (1994). Fred Brooks says that it's like farming, hunting werewolves, or drowning with dinosaurs in a tar pit (1995). Which are the best metaphors?

I am not sure how apt the chemical engineering metaphor is either, but there are some interesting parallels. The early chemical manufacturing process to produce sulfuric acid is called the Lead chamber process and it was invented in the mid 1700s. The process was not very well understood and was inefficient in that it required the use of expensive nitrates often from Chile which were partially lost as nitric oxide vapor, and it took over a hundred years before a way to recycle the part of nitrates was added to increase the efficiency of the process. The discipline of chemical engineering was codified in the late 1800s when college curricula and associations for chemical engineering were created. Over time the science of chemistry and stoichiometry advanced as did thermodynamics which made major steps forward around the same time. I am over simplifying here but eventually little understood processes that were discovered and tweaked empirically became understood and could be controlled and improved by applying the laws of science and mathematics. Another parallel is the focus on reusable processes, in that if you are building a new plant to manufacture sulfuric acid you don’t want to design it from scratch each time. An interesting and abstract parallel is that both disciplines have their entropies, i.e. thermodynamic vs. information. A social parallel may be the transformative effect that software is having on our lives and our society is as profound as that of industrial scale chemical production, this keyboard and almost thing I can see right now has been influenced in some way by chemical engineering. Also what are the negative sides of these industries, what problems are we creating by the unregulated and somewhat opaque policies in regards to chemicals and what might some of the issues with software and IT be?

Now admittedly there is pretty big difference between a physical chemical process or the construction of something physical like bridge and software construction. There are obvious parallels to building things in that people have to come together to build something that is greater than what one single individual could build. And there are similarities to Chemical processes in that once you build it that you have keep it up and running and you will probably need to do maintenance and make improvements while it is operational and even in some cases take it off line or build a new one in parallel while the old one is still in production. Also might we possibly view Computer Science and Math as related to Software Engineering as Chemistry, stoichiometry and thermodynamics are related to Chemical Engineering?

Like Donald Knuth I too think there is an artistic element to software, this idea is not incongruent with building or bridge architecture which is a process that organizes groups of people to build something beautiful. Perhaps a possible analogy for software engineering would be the construction of a bridge where we still don’t really understand all of the physics behind it or the best way to go about it. As a result when we are done it might stay up and it might not and it will probably take longer than we thought to build it.

I guess the question is it possible to create a discipline of software engineering that is as effective as the other engineering disciplines? If we use the chemical engineering metaphor, where are we in the time line of understanding the underlying science and applying it to our process? Perhaps we are at the equivalent point to the mid to late 1800s in terms of process and our understanding of the underlying science but I very much doubt the years will map linearly, it’s probably more like an exponential mapping. The elusive software engineering is probably its own entity, are our metaphors confining our thinking and hindering our ability to objectively look at this problem?

The software industry is also different from almost any other technical industry in history just in the sheer number of people who now participate in it worldwide and it seems that the barriers to enter the field can be lower than that of other engineering and scientific disciplines in regards to formal degrees.

05 December 2011

Responding to "Eleven Equations" Comments, Math Notation, and More

Wow that was quite an experience, I have been looking at sites like Hacker News, Reddit and DZone for years but to see my own post go to #1 on Hacker News and rise high on Reddit and DZone was quite a thrill, with over 70K hits at the time of this writing. It has completely changed my life the phone is ringing off the hook with endorsement offers, it looks like I will only be wearing Nike for the next year.

The best part was the over 200 comments on my Blog, Hacker News, and Reddit, it spawned some spirited discussion on all three locations and many people added their own equations some that I was unfamiliar with and I learned a few new things, excellent! The various discussions for the most part were great and very enjoyable and I’d like to thank everyone who contributed. In the post I commented that I wondered how I would feel about the list in a year and I already think I would change the list by one equation. I was tempted to respond to the comments on all of the forums, I know I should at least respond on my own blog, I just figured I’d do it as a post that way I could hit a number of topics.

One thing that people objected to was the title, which induced some inferences in the tone of the title. One redditer summed it up as:

Someone presuming, yet again, to know what "every" Computer Scientist should know? A title like "my favourite computer science equations" might be better. This list contains a lot of ideas which are only relevant in very narrow domains.

Yes, probably on the title comment, but the title was an intentional "spoof" if you will of a title of an article on Wired, and if you read my post you would have gotten that. Actually I confess that you might even accuse me of "Hacker News Post Popularity Hacking" as I saw the Wired post rise up and figured that there was a high probability that a similar CS related post could achieve a similar placement, actually I think it exceeded my expectations and went even higher, it shot to #1 in under two hours of my submitting it and it stayed on the front page for almost a day. I felt the tone I was setting, by using the term "rip off" and the Spinal tap joke (This one goes to eleven) that I don’t think anyone got, was subtly light hearted, perhaps it was too subtle. I recall reading that the title is key to getting people to read posts, and clearly this title was somewhat controversial which probably helped it, actually most of my titles are fairly boring or references that make them less popular, but I generally title things the way want not for popularity, well with this one exception.

Another thing that people took umbrage with was the use of the equations, as I just explained since I had set my title accordingly my format was now set. One commenter, in two different comments on a thread on Hacker News added:

I think this is a pretty silly post, to be honest. CS covers so much, and everytime I see a list of "things you should know", I have to resist the urge to roll my eyes and ignore it. But then I read it, and inevitably roll my eyes anyway.

...

I disagree. Programming is math. Highly advanced math, in fact. It's just a different type of math. And the 11 equations in the OP's article just barely touches on what CS is about. There is far more to it than that.

I mostly agree with this commenter. I admit that the sentiment that this not really about the equations but the general domain knowledge is implied, again perhaps too subtle. At one point someone mentions memorizing these equations which again means you are missing the point. It’s about knowing what these equations mean and how to use them. I feel that the math domain knowledge for our field is growing and becoming more important. Do I use any of this for my day job? Well no, but I hope to one day. As to the broad domain I actually have a post "Math For Programmer TNG" that covers a very broad domain of math that I think is potentially relevant. Also not all areas of math are going to be used in every math related job.

As to the narrowness insinuated by the two comments above, the post touched on the following domains: Combinatorics, Logic, Boolean Algebra, Set Theory, Lattice Theory, Probability Theory, Statistics, Category Theory, Number Theory including Complex Numbers, Relational Algebra, Abstract Algebra, Linear Algebra, Shannon Information Theory, Algorithmic Information Theory, Kolmogorov complexity, Graph Theory, Formal Language Theory, Automata Theory, Combinatory Logic, Lambda Calculus, Machine Learning, Data Mining, Encryption, Limits, Topology, Fourier Analysis and The Simpsons. Not exactly narrow, or am I missing something here? In fact the list was picked in part for diversity.

The Pumping Lemma turned out to be fairly controversial, I do admit that the Pumping Lemma was perhaps a bit "contrived" as an Equation and as one person put it: "that's the most hideous formulation of the pumping lemma I have ever seen". That formulation came from Wikipedia. As I mentioned I really did want something from Formal Language Theory or Automata Theory and it’s not so much about determining which language is regular but knowing the difference between a Regular Language and a Context Free Language so that you are not the cargo cult coder who tries to parse html with regular expressions because you understand these distinctions. I think the following exchange summed up what I was going for:

(except the Pumping Lemma, WTF?)

Reply:"Seriously, you never learnt theoretical computer science? As in, theory of computation? Automata theory? Complexity theory? Nothing like that?"

There were a number of comments about the use of mathematical notation including the following:

"Shannon's Information Theory, Eigenvector, DeMorgan's Laws, etc. None of those names are meaningful or descriptive. And then the greek letters and made up symbols. Math could learn something from Computer Science:

"https://www.google.com/search?q=readable+code

This commenter seemed to have an issue with the actual names of things in math. I found this to be very narrow minded. Math is many things including being a domain of knowledge, that’s like complaining about the names: Ytterbium in Chemistry, Bose–Einstein Condensate in Physics, Borneo in world Geography, or Impressionism in art. Really!?!

Now the complaints about the notation are more understandable but still unreasonable in my opinion, math in some respects is a language, perhaps somewhere between a natural language and programming language, it has a notation and to really be able to understand it you need to learn this notation (language). Someone remarked that symbols were "unfriendly". Well I find Chinese, Arabic¹, etc. to consist of unfriendly symbols, but that’s because I don’t know them. Also like natural languages you have inconsistencies, and multiple meanings for symbols. One person complained about my notation for the Set version of De Morgan's laws, this may be a nonstandard notation, I saw it in some course notes on probability and set theory and I liked it and I just used it without thinking about it. I do think it’s more elegant. This makes a good point about math. If you learn it and read from diverse sources you will encounter notational differences. This has been my experience, in fact if you look on the Wikipedia page for De Morgan's laws you will find the following example from Modal Logic:

I am used to the notation used in Hughes and Cresswell:

That’s just how things work. You can get hung up on it and complain or you can accept it and move on. Don’t get me wrong it would be nice if there was a notational standard that every one followed for all math.

Like natural languages Math has its equivalences of "Homographs", for example the following symbols can have multiple context dependent meanings: |b| can mean absolute value of b if it is a number or cardinality of b if it is a set or length of b if it is string, ∂ can mean Partial Derivative or Topological Boundary, and as we saw ∧ and ∨ can mean meet and join or "logical and" and "logical or".

Just as Math has "Homographs", as in natural languages it also has its equivalences of "Polysemes", meaning that there are multiple ways to express the same ideas. For example set difference can be expressed as (S – T) or (S \ T), the Powerset notation can be either 2^S or P(S), and meet and join can be expressed as the either of the following sets of symbols:

The list of both math "Homographs" and "Polysemes" is much longer.

For De Morgan's laws, someone added the following generalization using the Existential and Universal Quantifiers, which is really interesting and also illustrates the duality:

It's not much of a generalization, but I prefer the predicate version of De Morgan's:
~∀x p(x) ≡∃x ~p(x)
~∃x p(x) ≡∀x ~p(x)

If you read the Wikipedia Article it includes this generalization as well:

The above equations do generalize De Morgan's laws within Logic and Set Theory respectively but my point was to further generalize them in terms of Lattices which is a different concept all together and gets at the deeper intuition about the interrelation of Logic and Set Theory, quoting from Combinatorics the Rota way: by Rota and Kung, an amazing book I am trying to read, emphasis on the word trying:

"As John Von Neumann put it, the theory of Boolean Algebras is "pointless" set theory.

This set version would be written using the complement exponentiation notation as follows:

I really do prefer this notation, I originally encountered it in "Introduction to Probability Theory" by By Ali Ghodsi and Cosmin Arad, lecture 1(pdf). This notation is also used in Combinatorics the Rota way. So this notation is now the official notation of my blog, get used to it or hit the road. ;)

A commenter (who was also the demorgans notation complainer) responded with the following:

...
For the rest, they seem more likely to be used by someone doing something related to machine learning/AI/information theory than you run of the mill I-spend-all-day-parsing-user-input programmer.

Thank you for making me feel like I'm not a 'true' computer scientist. Next would you like to tell me about how I'm not a 'man' because I don't like to work on cars?

Well I hate to break it to you, but if this list makes you feel that way then chances are you are like me: a software developer with a CS Degree and that does not make you or me a Computer Scientist. This is from Wikipedia:

"The general public sometimes confuses computer scientists with other computer professionals having careers in information technology, or think that computer science relates to their own experience with computers, ...

Oh, and yes you are not a man if you do not at least know how to change your oil, including the filter. ;)

Also a number of people remarked that I did not explain the equations in enough detail and I did not provide code examples. It was meant as high level post, if I were to explain all of these equations the post would have been excessively long, maybe even a whole book. As to the code examples again it would be too long, if you want code examples that relate to math, I have posts for that, I recommend the following posts on my blog, some were linked in the original post: "O(log(n))", "Triangles, Triangular Numbers, and the Adjacency Matrix", "The Combinatorial and Other Math of the Java Collections API", "Math You Can Use : Refactoring If Statements with De Morgan's Laws", and "Monoid for the Masses".

Regrettably some people responded with the sentiment of: I’m not good at math or that math is too hard, I don’t think I can get a CS degree. Fortunately there were a number of comments of encouragement to counter these sentiments, my favorite was this one:

Don't let this be intimidating to you - instead of asking yourself "how come I don't know this?" ask yourself "how can I learn more about this?". This might sound cheesy and simplified but it's as simple as "nobody was born knowing this". I'm 31 and I wish I had the money to go back in education and learn all of this with the official way but for now I'm just picking resources on the web and who knows? It just might happen...

I have a post planned to talk about my experiences and thoughts about learning math, I will say if you are a programmer you are doing math in a sense. You just need to formalize your understanding of the underlying concepts. If it is any consolation I find Math to be very hard at times, I pretty much suck at it in the traditional sense, but it is so beautiful and when you grok a concept, at least for me, it is an incredible buzz. Every new concept that you learn changes how you see everything, I see so much math in programming and the world now it’s scary, I just wish I could figure out how to express it. Everything is math. Over the last few years my ability to understand things has greatly increased and continues to do so. Yes it’s hard but it is so worth it.

A number of commenters added their own equations to the discussion, and this was the best part, which included a few new things for me. In general my list was mostly directed at more pure math oriented equations that I felt were fairly central to a specific discipline and encapsulated certain principles. Here are a few recommendations and thoughts:

One person pointed out the absence of P=NP or P ≠ NP, to which I would respond "D’oh!" If I was doing this over I might even replace the Pumping Lemma with P=NP.

One poster recommended triangular numbers:

Here's a real simple and practical equation: 1+2+3+4 . . . N = N(N+1)/2
This equation represents the number ofedges on a complete graph with N+1 vertices or the number of possible pairings given N+1 objects. Useful whenestimating O(N) for certain algorithms that involve comparing an item to everyother item in a set.

I love Triangular Numbers and this is a great equation, I wrote a whole post "Triangles, Triangular Numbers, and the Adjacency Matrix" on exactly what he is mentioning, and considered that equation but rejected it because triangular numbers are sort of a special case of the Binomial Coefficient equation:

Here are some of the other equations and theorems that were suggested, I wouldn’t really consider these because they didn’t fit my theme and criteria but that doesn’t mean they aren’t interesting:

One commenter accused me of shoehorning in Euler’s Identity. This is partially true. He also accused me of "Name Dropping". That’s not true. A defender supplied the following course notes:

Analytic Combinatorics: A Primer

I followed up on that and found that you can download a copy of Analytic Combinatorics:
by Flajolet and Sedgewick. Warning: pretty advanced stuff.

As the popularity of my post was waning on Hacker News a math oriented post called "Fuzzy string search" was making its way up the ranks. It’s an interesting post and I think it helps illustrate a couple of points I have been making here, first of all it uses mathematical notation, which is not uncommon when writing about these types of topics. Additionally the post includes the equation for the Triangle Inequality:

An equation that crossed my mind when writing my equations post. It is not the equation itself, but the concept, if you become familiar with the idea of a Metric or Metric Spaces, you instantly recognize this equation concept without having to read the associated text, not that you shouldn’t read the associated text. Knowing these ideas gives you the ability to better understand more complex ideas that are based on these concepts. I think this is the case in all knowledge domains. Also I noticed that there was not one comment on that post complaining about math notation.

A lot of what went on is part of an ongoing conversation, I blogged about it in "The Math Debate" which links to other famous blogs of the same ilk. Ultimately the equations probably don’t apply to most programming jobs, the title contained as some commenters pointed out "Computer Science Geeks" not "Corporate Software Developers". A few people castigated the commenters who seemed to be advocating positions of ignorance, thanks, BTW. I personally find it tragic, the following comment sums this up:

"Someone with a CS degree who knows nothing of De Morgan's law should have their credit removed and be demoted to making websites or performing tech support tasks.

Reply: "There goes 98% of your CS graduates then. I wish I was joking, but alas.."

Of course it’s a bit I ironic that as I am wrapping this up, "A co-Relational Model of Data for Large Shared Data Banks" by Erik Meijer, Gavin Bierman just popped up on Hacker News, the relevant quote is:

Every programmer is familiar with the two dual forms of De Morgan’s laws

The writing quality of the post was criticized. I don't know what people will think of this post, I feel like I just wrote a clip show. I admit that previous post was not my best effort, while it was fun putting it together it was not easy, I had hoped to learn all of the equations in more detail, I did to some degree, I confess I still don’t fully get the Y-Combinator, but I will. I finally decided to just push it out and move on. This post, while being relevant to my mission of blogging about math was something of a social experiment in popularity, it was interesting but I will be returning to the true mission of my blog which is my journey through math and the pursuit of a real Software Engineering discipline.

Ultimately I am interested in the readers, like a young colleague of mine who was too busy to comment because he was following the links to learn new things he didn't know, not the complainers and naysayers. There were many positive comments and again, thanks. I did feel a needed to refute some of the negative comments. The bottom line is if you want to read about math, you need to get used to the notation, it probably won't help you to build CRUD Web apps but it will hopefully help you take your career to the next level.

¹Actually this is not true I find them to be quite beautiful, especially Arabic.

27 November 2011

Eleven Equations True Computer Science Geeks Should (at Least Pretend to) Know

This idea is a complete rip off an article that appeared in Wired a little while ago and it got me thinking what would my list for Computer Science look like? Plus I thought it might be a fun post and unlike the Wired list this one goes to eleven. So here they are in no particular order:

Binomial Coefficient

The Binomial Coefficient equation generates Pascal’s Triangle and gives you the coefficients for the Binomial Theorem these ideas are often attributed to Pascal but in fact they have been known in part for over a millennia.

As I mentioned that this list is no particular order and I don’t wish to play favorites but if there is one equation here that you should really consider learning and committing to memory this is it. It is central to Combinitorics which has connections to many areas of math, I guess I should qualify that if you are serious about computer science and math related programming topics then you should be able to recite and apply it from memory, even when you are drunk, sleep deprived or being held at gun point. Ok that might be a bit much but you get the idea. If you are programmer and you haven’t learned it, or need a refresher, I have a post that relates it to the Java Collections API.

Demorgan’s Laws

Logical form:

Set Form:

Ok so that’s like four "equations" for DeMorgan’s Laws, one can’t help but to struck by the similarity between the two sets and this is not a coincidence these two sets of formulas are essentially the same thing, they are special cases of complemented distributive lattices^† which means technically it’s really just two formulas:

In this case the ∨ symbol means lattice join operation and the ∧ symbol is the lattice meet operation and the dash with the right downward mark means lattice complementation, I used this to differentiate from the tilde for Boolean complementation. Also these two formulas are categorical duals of each other, which means that if one were smart and had a good grasp of Category Theory we could probably be really OCD about it and get it down to one formula.

Understanding Set Theory and Boolean Algebra are very important basic concepts in our profession and the Boolean version can make you better programmer, as you can use it to refactor logical if statements.

Eigenvector and Eigenvalue

This equation and these concepts are not only central to Linear Algebra but these Linear Algebraic ideas and others are both used and extend into many other areas of math including Linear Transformations and Graphs in terms of matrices like the Adjacency Matrix and much more.

Linear Algebra is becoming increasingly central to the types of things we are doing in our profession, it is used heavily in Statistics and Machine Learning not to mention the Page Rank Algorithm is based in part on this equation.

Pumping Lemma for Regular Languages

No Comp Sci list would be complete with at least one formula from either Formal Language Theory or Automata Theory. The problem is that these two are pretty different from other areas of math in terms of "Equations" I tried to think of some basic Ideas and The Pumping Lemma came to mind, and I found the above formula concisely expressed on Wikipedia. This version of the Pumping Lemma is a more limited version of Another Theory which is used to check whether a Language is Regular. Also this is maybe not the best choice as it is pretty dense, if you want a better explanation I recommend this.

While this "equation" is pretty specialized, it is actually a special case of the more general Iteration Theorem, the real point here is really about more general domain knowledge. It is central to what you do every day you are programming such as compilers and regular expressions not to mention the almost ubiquitous Kleene Star.

Information Entropy

I confess after all my machinations to try to reduce Demorgan’s Laws down to less equations I am totally cheating here by pulling a twofer in including two formulas, both Shannon’s Information Theory:

And the formula for Chaitin’s Constant, which is associated with Algorithmic Information Theory and Kolmogorov Complexity:

Interestingly this area has a parallel in the physical word, in terms of Thermodynamic Entropy, which also parallels the Boltzman Equation mentioned in the Wired Article. Seth Loyd draws some interesting connections between computation and the physical world including Entropy and Boltzman’s Equation.

In programming and computer science information is our business and these ideas are central to many things that we deal with especially the storage, transmission, computational transformation (like compression), and computation of information. We’ve already seen the possible relationship between both of these theories and Reusability and Software Frameworks.

Bayes' Theorem

Of all of equations on the list, Bayes' Theorem may stand out on its own merits as being one of the most recognizable and it might even qualify as a poster child for the statistical revolution sweeping our industry. It rose from obscurity to being used for spam detection and with such applications as classifiers, Machine Learning, Data Mining and more.

Mathematically it is part of a rift in statistics and probability and I confess I may not yet call myself "an initiate of the bayesian conspiracy" but it hard to deny its utility plus there seems to be a more axiomatic basis that relates to Boolean Logic Set Theory, which makes it all the more alluring.

Fermat’s Little Theorem

These two equations of Fermat’s Little Theorem are really the same equation written in two different forms, where (a ⊥ p) means coprime and (p ∈ P) means prime.

This is a beautiful little theorem in Number Theory which can be generalized and written even more succinctly using Eulers Totient Function which is represented by φ (n):

These ideas are crucial for understanding encryption algorithms like RSA and there’s this cool result.

Natural Join

Natural Join, like all of the Relational Algebra Join Operations is a composition of the more basic, operations: Selection (σ), Projection (π), Cartesian Product (×), Set Difference (-) and Union (∪). Natural Join is the Cartesian Product of two tables selecting matching rows and eliminating duplicate columns. (In the formula above the projection (π) is the set (A) of attributes with duplicates removed and then selects (σ) the set (C) of common attributes which match in both sets and are equal in both sets.)

The Relational Algebra is in my opinion a bit of an odd beast when it comes to algebraic structures, but if you use SQL you are already using an implementation of it. As data persistence becomes more heterogeneous I think understanding these concepts, will become more important. Additionally Relational Algebra also has strong ties to Abstract Algebra[ and Category Theory[.

The Fixed-Point (Y) Combinator

Just as we need a "equation" from Formal Language Theory or Automata Theory we really should have one that is related to Lambda Calculus, in this case this equation is in the untyped lambda calculus. Even though the Fixed-point Combinator stands up on its own, it is fairly well known, well at least in name due to the use of it by Paul Graham for his incubator (Y-Combinator).

The Fixed-Point Combinator allows one to implement anonymous recursion which is a powerful programming technique. It also ties into some deeper theoretical aspects of computer science and logic such as Combinatory Logic.

O(n)

Many CS majors may cringe at this, and another possible reaction is that’s not much of an equation. There’s actually quite a bit to this for example did you know there are some substantial and quite beautiful equations associated with this seemingly simple formula. They are:

Also the following is a definition of (lim sup) Limit Superior:

Where inf is infimum and sup is supremum two concepts that seem to show up quite a bit especially when you are concerned with bounds like with Lattices and they also seem to show up in relation to Topology.

And we all know why we should care about his one: Algorithms, Algorithms, Algorithms.

Euler’s Identity

Euler’s Identity is also listed in Wired’s article and I have to agree that it is one the most beautiful and intriguing formulas in math.

Now this isn’t really as relevant to CS as the other formulas but if you a true computer geek you are hopefully some kind of math geek, and this is such a great formula, not to mention that it appears in The Simpsons' Homer³. It’s actually a specific case (where θ = π) of Euler’s Formula:

These formulas will give one a better understanding complex numbers which are needed in other areas of advanced math also these formulas relate to Fourier Analysis which does have applications in Computer Science.

I am curious to see how well this list holds up, how will I feel about it in a year? Also in writing this I realized that I still have quite a bit more to learn about at least half of this list, if not all of it, which made it hard to write as it was quite a little (unfinished) research project. I have already written individual articles on several of these equations and have at least two in the works for Linear Algebra, at least two on the Relational Algebra and one planned for Pascal’s Triangle, and those are just the ideas I’ve already had, who knows what the future holds. I hope you enjoyed it as much as I did writing it.

^† The Set version may not apply to all cases in set theory, this assumes Union and Intersection over the Power Set of a Set Universe, this would be closed under these operations and would include complements of all elements. The Set Universe is the upper bound and the empty set is the lower bound.

22 October 2011

A Confederacy of Cargo Cult Coders

I hate to say it but I feel that one of the biggest problems with Software Engineering is the prevalence of programmers who are Cargo Cult Coders. Now I know this may seem a bit extreme but it is my feeling that many software projects suffer in part due to the fact that most developers are mediocre at best and there are quite a few developers some who are more senior who are "faking it" . So before you call me a cynic or worse, think about this, it’s pretty well known that really good developers are really rare and good developers are rare. So let’s assume for sake of argument that developer ability falls on the Normal Distribution, I am not asserting that it does, that means the vast majority of developers fall into the average category it also runs along the lines of the 80/20 rule in that most developers range from average to bad.

Now I have seen these types of developers a fair amount and this is one possible down side of frameworks such as Spring and Hibernate. If you read my blog you know that I am a big fan of frameworks and the concept in general, but they can be abused as can any other technique or methodology. The plus side of technologies like Spring and Hibernate is that good developers can quickly create large complex well engineered systems, of course mediocre and bad programmers can, by blindly implementing framework patterns, create large complex monstrosities. I have seen cases where through the use of cutting and pasting of example code, bad programmers can create working systems with a limited understanding of the underlying technologies, I often use googled code snippets myself, but at least I understand the fundamentals and if I don’t fully understand what I am doing at the time I try to go back and learn more about how the underlying technology works. Essentially these Cargo Cult Programmers are mostly just configuring boilerplate code and code snippets by trial and error within the confines of a framework.

On one of my previous contracts I had the misfortune to work with perhaps one of the worst developers I have ever met, he was truly a consummate cargo cult coder, he was musician turned programmer¹ with about eight years of professional experience. He had zero understanding of Computer Science fundamentals and seemingly no understanding of good Software Engineering principles and yet during his career he had accumulated enough basic knowledge of Web Design, Javascript, Java, Spring, Hibernate, SQL and various other technologies and components to cobble together enterprise applications. Ironically his ability to create these applications was actually somewhat impressive, as long as you didn’t look at the code. I think the scariest thing about this developer is that he had garnered a false sense of the level of his abilities which he would project, which in turn would cause others to falsely believe that he was a highly competent developer.

I use this particular developer as an example, but this is a trend that I am seeing, developers who now claim themselves as "senior level" because they have gained proficiency as cargo cult coders, but when you see their non boilerplate code it will often demonstrate an egregious lack of knowledge of basic concepts like cohesion, coupling, inheritance, basic OOP design, threading, the underlying workings of the framework technologies themselves, etc. I think the biggest danger is the complacency and perhaps self delusion or naivety which can even manifest itself as hubris, that these developers acquire from this limited perspective, in fact it is ultimately self limiting behavior because all new technologies and languages are then viewed in the same narrow context, usually as another resume bullet, practitioners of what I call RDD (Resume Driven Development).

I believe that you can boil down what makes a good developer to two relatively simple things, one is ability and the other is desire. Ability is a complex web of inherent skill, intellect including a good memory, experience etc. and desire is the hunger for knowledge and aspiration to want to improve one’s skill and the passion to find better ways to do things. In some ways the two go hand in hand but not always, the "consummate cargo cult coder" from above had a high degree of passion, but he was stubborn and did not work well with others and was unable to take advantage of the opportunity to benefit from the knowledge of others, I actually found his situation to be somewhat tragic. The other side is people who are talented who lack desire, ironically I have been accused of this one, it’s usually not due to my lack of desire, I just sometimes find what I am working on to be boring which in turn can cause me to lag in terms of productivity, fortunately this is fairly aberrant behavior for me.

Sadly I have recently been on some pretty dysfunctional teams and I feel the biggest tragedy is that on those teams there have been developers in whom I saw that they could be more than they were on the project, but they mainly lacked guidance, which I could not provide due to team’s dynamics. This type of situation is really a double tragedy, it is tragic for the developer in that a better opportunity for the developer is missed, and it is a missed opportunity for project, the team, and the management of the team as they could have a had a better developer who was producing better quality code. I guess what I am saying is that there should be a way to try to avoid wasting developer potential, I think this is some of what Agile Process tries to achieve. "Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done." I think this should be taken further to try maximize each developer’s skill and inspire their inherent passion and to kindle increased passion to maximize the efficiency of the team.

If you are lucky the cargo cult coders on your team are just plodding along and hopefully not messing up your codebase too much and are the ones who just need that extra guidance or encouragement. Unfortunately they can do far worse damage, there are many out there who have been getting by for many years and have formed overrated opinions of themselves, these are the worst, not only can they disrupt the software structure they can often disrupt the team dynamics and create more substantial problems, in my Driven Development post I mentioned developers who I was in conflict with creating a CDD (Cognitive Dissonance Development) environment, they were cargo cult coders who had taken very defensive and recalcitrant positions, you couldn’t reason with them because if you tried to talk about good design principles or standard practices you were speaking a language they did not understand, they only saw software development as learning some new technology that they could then claim to be experts in. They will pollute your system with redundant inconsistent shoddy code that will degrade its quality, performance and maintainability. Worst yet, they will be excessively defensive in the face of criticism, which makes code reviews hard if not impossible, and since they don’t read or understand good engineering principles they often argue against them. Also they will often disrupt the team dynamics and poison it with their defensive and sometimes arrogant behavior this can result in the team becoming highly dysfunctional which can create a toxic working environment that is hostile to good developers and is the opposite of the optimal environment in that the team will not function as whole that is greater than the sum of its parts it will function as whole that is less than the sum of its parts.

¹ I have worked with plenty of non-technical/non-CS converts and there are some who are good and go the extra mile to learn the basics and beyond.

18 October 2011

O(log(n))

I find O(log(n)) to be a very interesting algorithm complexity, it’s also highly desirable and if you achieve it you are probably doing well, or you are just using a couple of common time tested algorithms and data structures such as Binary Search or [Self] Balanced Binary Trees like Red Black or AVL trees. The order of Big O Complexities is:

O(1)	constant
O(log log n)	Double Logarithmic
O(log n)	Logarithmic
O(n^m) 0 < m < 1	fractional power
O(n)	Linear
O(n log n) = O(log n!)	Loglinear
O(n²)	Quadratic
O(n³)	Cubic
O(n^m) m > 1	(general) polynomial (m is a fixed, non-negative integer; e.g. n⁴, n⁵)
O(mⁿ)	exponential (m >= 2; ; e.g. 2ⁿ, 3ⁿ)
O(n!)	Factorial

When I was growing up and hand held calculators were quickly evolving much like smart phones are now, although perhaps not as rapidly, my father would bring these calculators home and I would play with them, this was my first exposure to logarithms when I got to high school I learned more about them including some of the identities. Actually I find the Logarithmic Identities to be quite interesting and have found them important in understanding logarithms which are useful in understanding other areas of science and math. I even used to use log base 10 to calculate the number of digits of numbers for print formatting way back before we had better ways to do number formatting^†, ⌈log₁₀(n)⌉, of any base 10 number is its number of digits, where ⌈x⌉ is the ceiling function. Also many common measurements like the Richter Scale, pH and Decibel among other things are logarithmic scale. Also we have previously encountered logs in relation to information entropy . Some useful logarithmic identities are:

y = log_b(x) if and only if x = b^y
log_b(1) = 0
log_b(b) = 1
-log_b(x) = log_b(1/x)
log_b(x*y) = log_b(x) + log_b(y)
log_b(x/y) = log_b(x) - log_b(y)
log_b(xⁿ) = n log_b(x)
log_b(x) = log_b(c) * log_c(x)
log_b(x) = log_c(x) / log_c(b)

The first one demonstrates the relation between the functions of logarithm and exponentiation, as you can see the value y, which is log base b of the value x, is the exponent to raise b to get the value x, so a logarithm is just a way to get the exponent. Log base 10 of 100 is 2, log₁₀(100) = 2, since 10 to second power is 100, 10² = 100. Log base b of y, x = log_b(y) is inverse function of raising b to the y power, x=b^y, so:

b^log_b(x) = x

Another interesting observation that one can draw from the logarithmic identities is the answer to the question: What is the difference between O(log₂(x)) and O(log₁₀(x))? It’s a little trick question that I like. The answer is they are the same. Using the second to last identity (#8):

O(log₂(x)) = O(log₂(10) * log₁₀(x))

Since log₂(10) is a constant:

O(log₂(x)) = O(log₁₀(x))

Pretty cool!

This would also work with the second to last identity (#9) as well and remember that Big O notation is about [asymptotic] growth, so equals in the case of Big O notation is not the same as equals e.g. log₂(x) ≠ log₁₀(x). Also for small quantities like the examples in this article, the base does make a difference.

Simlarily exponetiation has its identities as well:

b⁰ = 1
b¹ = b
b^-n = 1/bⁿ
b^mbⁿ= b^m+n
b^m/bⁿ= b^m-n
(b^m)ⁿ = b^mn
(b/b)ⁿ = bⁿ/bⁿ
(b/a)^-n = (a/b)ⁿ

The binary tree structure (a rooted acyclic graph) visually illustrates O(log(n)) and the inverse function relationship between logarithms and exponentiation. The diagram above shows a complete^‡ and therefore balanced binary tree structure. As you can see the number of items in each row grows exponentially as powers of two, shown in red, also the total number of elements in the tree as each row is added grows in the same exponential fashion, denoted in square braces. So in a balanced tree you will fill all of the n rows up to the 2ⁿ – 1 item, and when that number is exceeded (greater than or equal to 2ⁿ) a new (n + 1) row will be added. Now the growth is not necessarily exponential, you can add items at any rate, but the structure in which items are stored are broken down along “exponential row boundaries”. So to search 2⁴ – 1 (15) items for the number 7 value we would traverse 8-4-6-7, shown in green, which is 4 nodes, this is the maximum searches for this tree, a search can also be 1,2, or 3 depending on the depth of the item. Since 4 the exact exponent of the total size of the graph, which is [2⁴-1], and therefore the log: [log₂(2⁴-1) = 3.9…], almost 4, our max traversal, O(log(2ⁿ-1)) = O(log(2ⁿ)). Since we know that these two functions are inverses this illustrates O(log(n)) in visual terms.

The Binary Search algorithm has similar log base 2 characteristics, if you remember the algorithm it takes list which has to be ordered, and it repeatedly halves the length of the remaining items and checks the element at each new position, if it is equal you are done, if it is not then depending on whether the search value is greater or less than the midpoint value you then half the bottom or top of the list respectively and repeat the process. This is better illustrated by the following Java code:

public static int binarySearch(int value, int[] array) {

int min = 0;

int max = array.length - 1;

while (min <= max) {

int mid = min + (max - min) / 2;

if (value < array[mid])

max = mid - 1;

else if (value > array[mid])

min = mid + 1;

else

return mid;

}

return -1;

}

The following recursive code example is both more elegant and gives a more intuitive feel for the Divide and Conquer nature of the algorithm. Also it is this Divide and Conquer behavior that breaks the list apart logarithmically.

public static int binarySearch(int value, int[] array) {

returnbinarySearch(value, array, 0, array.length – 1);

}

public static intbinarySearch(int value, int[] array, int min, int max) {

if (max < min)

return -1;

int mid = min + (max - min) / 2;

if (array[mid] > value)

return binarySearch(value, array, min, mid - 1);

else if (key > array[mid])

return binarySearch(value, array, mid + 1, max);

return mid;

}

Let's view a search visually:

In our search for the value 7 in our 16 item list we first go to 8, which is too large, then to 4 which is too small and then to 6, and then 7, at each point in our search the size of what remains to be searched is half of the size of the previous search space.

In our visual example needed to search 4 positions (8, 4, 6, 7) and log₂(16) = 4 which is O(log(n)).

An interesting side note is that many binary search algorithms are broken due to possible overflow problems with integers, as pointed out in a blog post by Joshua Bloch, I account for the issue, you can read more about it here.

^† Due to computing a logarithm, this may not be optimally efficient.

^‡ Complete in this context means that all positions up to and including 2ⁿ-1 positions are filled, i.e. have a node. Also it should be noted that Self Balancing algorithms may not yield such a well balanced tree and it might have more rows for 15 items, but it is still O(log(n)).

21 August 2011

You Can Control What You Can’t Measure

A couple of years ago, Tom Demarco published a brief article titled: "Software Engineering:An Idea Whose Time Has Come and Gone?". Now I know this is old news, and in it he claims that the role of measurement and hence control has not been a factor in some notable successful projects, he cites Google Earth and Wikipedia, now I cannot comment on those or many famous successful projects as I have no firsthand knowledge, however, I suspect that control has been important it just was not employed as he has anticipated and as he points out he is no longer involved in hands on development. So perhaps the premise of the famous quote "You can’t control what you can’t measure" is what is flawed. I believe that the majority of successful software projects and successful companies built around successful software have exhibited at least some if not a large degree of control over the both the software construction process and the software structure. I admit my assertion is tenuous at best as I have very little data to support this but I do have what I consider some interesting anecdotal support.

My own experience also tells me this is true, I am not well versed in the world of software metrics but what we have see m to be largely unwieldy and even potentially dubious (KSLOC) not to mention a lack of tooling and methodologies to apply them. Still I feel that I can look at code base and make "aesthetic" judgments about good and bad, cohesion and coupling are fairly abstract, understanding them can shed light on how to refactor code to improve it and to give a general indication of quality. In looking at code, both good and bad patterns emerge and can be utilized to assess and improve a code base. One thing I can often do is reduce code size by refactoring general code into a framework and refactor it to more effectively use API’s like the Java API and API’s like Apache’s StringUtil and DateUtil classes or things like effectively employing the Spring Framework’s data binding. Also code can often be better abstracted by composition, inheritance and general modularization and the use of patterns like the SOLID principles. These are all things that even the best developers occasionally miss, I often apply these refactoring to my own code as well, and if you are a good practitioner, what I am saying here is probably what you do every day.

While my personal control has mostly been limited to small and midsize projects and I can tell you that I have received some complements about my work to create cleaner more structured systems, my professional experience pales in comparison to that of others and I find their stories more interesting as you probably do also. I’ll start with Steve Yegge’s "Done, and Gets Things Smart" post where he talks, not about his role, but his observations about what went on at Google:

... but I've realized that one of the Google seed engineers (exactly one) is almost singlehandedly responsible for the amazing quality of Google's engineering culture. And I mean both in the sense of having established it, and also in the sense of keeping the wheel spinning. ... Done, and Gets Things Smart folks aren't necessarily your friends. They're just people you're lucky enough to have worked with.

At first it's entirely non-obvious who's responsible for Google's culture of engineering discipline: the design docs, audited code reviews, early design reviews, readability reviews, resisting introduction of new languages, unit testing and code coverage, profiling and performance testing, etc. You know. The whole gamut of processes and tools that quality engineering organizations use to ensure that code is open, readable, documented, and generally non-shoddy work.
But if you keep an eye on the emails that go out to Google's engineering staff, over time a pattern emerges: there's one superheroic dude who's keeping us all in line.

The black bolding is his and the blue bolding is my highlighting. Clearly this demonstrates a pretty high degree of control. The next comes from Martin Fowler’s "Who Needs an Architect?":

Architectus Oryzus [is the] kind of architect must be very aware of what's going on in the project, looking out for important issues and tackling them before they become a serious problem.
...
In many ways, the most important activity of Architectus Oryzus is to mentor the development team, to raise their level so that they can take on more complex issues. Improving the development team's ability gives an architect much greater leverage than being the sole decision maker and thus running the risk of being an architectural bottleneck. This leads to the satisfying rule of thumb that an architect's value is inversely proportional to the number of decisions he or she makes.

A similar sentiment is found in Lean Software Development: An Agile Toolkit By Mary Poppendieck and Tom Poppendieck:

Master Developers
In an extensive study of large system design, _[22] Bill Curtis and his coauthors found that for most large systems, a single or small team of exceptional designers emerge to assume primary responsibility for the design. Exceptional designers exercise leadership through their superior knowledge rather than bestowed authority. Their deep understanding of both the customers and the technical issues gain them the respect of the development team. Exceptional designers are people who are extremely familiar with the application domain and are skilled at communicating their technical vision to the development team. They are usually consumed with the success of their systems.
^[22]Curtis, Kransner, and Iscoe, "A field Study of the Software Design Process for Large Systems," 1272.

Another person who seems to support this is Joshua Block in his talk about API design, I have already blogged about that and the relevant quote is here. Also John Carmack of Id Software talks about controlling the software development process, among other things, and he makes many points that I think jibe with my control argument.

I my opinion these all show that good control does occur but it’s probably done more intuitively and it is more art than science and probably occurs without traditional metrics. That is why software projects can be insanely successful without measurement. Actually it is my experience that control and success is not a binary relation in that control does not unconditionally imply success but rather it is a continuum such that control increases the probability of success. Also I am not saying that software metrics is an idea whose time has come and gone, actually I believe quite the opposite, but that is another topic for another time. Another takeaway from these quotes is that having the right people and empowering them in that very critical role is a very important point in software management, again another entire topic.