31 July 2011

Software Frameworks: Resistance isn’t Futile

As I have previously discussed, in my opinion there are three main framework components, which can be described succinctly as libraries, rules, and templates. It is the library component that I want to talk about here, in a context that might be seen as further evidence in support of frameworks in certain cases. To recap, building a framework does not make sense for all projects. The two main scenarios that I have seen to be extremely conducive to it are, first, an organization with multiple similar projects in similar problem domains and, second, a medium to large project with a lot of commonality that would favor reusability across the project. One of my complaints about the framework debate is that it is argued in black and white: either frameworks are absolutely required or they are the worst thing you can do. I am sure that many developers can cite anecdotes for either side of this argument, which is what I feel really drives the debate, and there is no doubt that I do this as well. The goal, in my opinion, is to step back and look at the problem from a bigger perspective.

One place where I have found some interesting perspective is a paper by Todd L. Veldhuizen titled "Software Libraries and Their Reuse: Entropy, Kolmogorov Complexity, and Zipf’s Law"; there is also a slide version. A word of caution: this paper is very math intensive, but it should be possible to read it and gain some insights without understanding the math. For example, in the paper he states the following:

A common theme in the software reuse literature is that if we can only get the right environment in place— the right tools, the right generalizations, economic incentives, a "culture of reuse" — then reuse of software will soar, with consequent improvements in productivity and software quality. The analysis developed in this paper paints a different picture: the extent to which software reuse can occur is an intrinsic property of a problem domain, and better tools and culture can have only marginal impact on reuse rates if the domain is inherently resistant to reuse.

I think this is a good observation. Many projects that I have worked on have exhibited characteristics that are favorable to reuse, but I have read a number of counterarguments, especially from people in fast-paced startups where the rate of system evolution makes reuse difficult. In that case, though, it is probably the SDLC, not necessarily the problem domain, that is resistant. I would also suspect that the potential for reuse grows with system size, so a small system is less likely to have reusable components, or at least will have a smaller set of them, and an investment in reuse may not be seen as justified.

Another interesting observation from the paper is:

Under reasonable assumptions we prove that no finite library can be complete: there are always more components we can add to the library that will allow us to increase reuse and make programs shorter. To make this work we need to settle a subtle interplay between the Kolmogorov complexity notion of compressibility (there is a shorter program doing the same thing) and the information theoretic notion of compressibility (low entropy over an ensemble of programs).

This is especially interesting if you have some familiarity with information theory, and if you don’t, I recommend learning more about it. Here he is comparing characteristics of both Algorithmic Information Theory [Kolmogorov complexity] and Shannon’s Information Theory [information theoretic notion]. Roughly, Algorithmic Information Theory is concerned with the smallest program that can generate a given piece of data, while Shannon’s Information Theory is about how compactly data from a given source can be represented on average. Both are closely related to data compression, and the paper draws a parallel to the idea that reusing code makes the system smaller in terms of lines of code, or more specifically symbols, which effectively "compresses" the codebase. In Algorithmic Information Theory you can never really know whether you have the smallest program, so I may be taking some liberty here, but I think the takeaway is that you can probably keep creating reuse forever, so one needs to temper this desire with practicality. In other words, there is probably a point beyond which further work towards reuse yields diminishing returns.
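To make the compression analogy a little more concrete, here is a minimal Python sketch. It is my own illustration, not something from the paper: it uses zlib as a crude, computable stand-in for "shortest description" (true Kolmogorov complexity cannot be computed), and the repeated snippet is a made-up example of copy-and-paste code.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data.

    This is only a rough upper-bound proxy for Kolmogorov complexity,
    which itself is not computable.
    """
    return len(zlib.compress(data, 9))


# A hypothetical "codebase" that repeats the same snippet 200 times,
# the way copy-and-paste code does (high redundancy, lots to reuse).
snippet = b"if value is None:\n    raise ValueError('value is required')\n"
redundant = snippet * 200

# Random bytes of the same length: essentially nothing to factor out.
incompressible = os.urandom(len(redundant))

redundant_size = compressed_size(redundant)
random_size = compressed_size(incompressible)

print(f"original size:       {len(redundant)} bytes")
print(f"redundant, packed:   {redundant_size} bytes")
print(f"random data, packed: {random_size} bytes")
```

The redundant text shrinks to a small fraction of its size while the random data barely shrinks at all, which is the intuition behind a domain being favorable or resistant to reuse: a compressor, like a library author, can only factor out what actually repeats.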

I find the paper compelling, and I confess that perhaps I am being bamboozled by math that I still do not fully understand, but intuitively these ideas feel right to me. The application of Zipf’s law is also interesting and should be pretty intuitive. Once again roughly, Zipf’s law describes power law distributions and is related to the 80/20 rule. The prime example is the frequency of English words in text: words like [and, the, some, in, etc.] are much more common, perhaps by orders of magnitude, than words like [deleterious, abstruse, etc.]. This distribution shows up in things like the abundance of elements in the universe (think hydrogen vs. platinum), the distribution of wealth (you vs. Bill Gates), how many followers people have on Twitter, etc. The curve is scale invariant, so it also shows up at a smaller scale in software: often some components will see a fair amount of reuse, things like string copy functions, entity base classes, etc., whereas others may only have a couple of reuses.
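As a rough illustration of what a Zipf-like curve implies for reuse, the sketch below builds an idealized distribution in which the component at rank r is used in proportion to 1/r. The function names, the 100 components and the exponent of 1 are all assumptions of mine for illustration; they are not measurements from any real codebase or figures from the paper.

```python
def zipf_weights(n: int, s: float = 1.0) -> list[float]:
    """Idealized Zipf weights: rank r gets weight 1 / r**s, normalized to sum to 1."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def top_share(weights: list[float], fraction: float) -> float:
    """Share of the total held by the top `fraction` of items, ranked by weight."""
    cutoff = max(1, round(len(weights) * fraction))
    return sum(sorted(weights, reverse=True)[:cutoff])


# 100 hypothetical components, ranked from most reused to least reused.
weights = zipf_weights(100)
share = top_share(weights, 0.20)

print(f"Top 20% of components account for {share:.0%} of all uses")
print(f"Rank 1 is used {weights[0] / weights[9]:.0f}x as often as rank 10")
```

Under these assumptions a fifth of the components carry roughly two thirds of the uses, which is in the neighborhood of the 80/20 rule, and the long tail of rarely reused components is where the investment in reuse gets harder to justify.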

On the lighter side, power laws also relate to chaos theory. I have seen a number of people, including Andy Hunt, draw parallels between Agile and chaos theory. Although these are usually pretty loose, it does strike me that one way to model chaos is through iterated maps, which is reminiscent of the iterative process of Agile, and the attractor and orbit concepts seem to parallel the target software system as well.

Another place, and one that most developers will find more accessible (I would have led with it, but the title and flow leaned the other way), is Joshua Bloch’s "How To Design A Good API and Why it Matters". I think this is a presentation that every developer should watch, especially if you are involved in high-level design and architecture work, and don’t worry, there is no math to intimidate the average developer. There is a summary article version, but I would still recommend watching the full video at least once, if not multiple times. Slides from an older version of the talk are also available.

In his presentation he talks about getting the API right the first time because you are stuck with it. I think this helps illuminate one very important and problematic aspect of framework design and software development. The problem can be illustrated with a linguistic analogy. In linguistics there are two ways to define rules for a language: prescriptive, where you apply (prescribe) the rules onto the language, and descriptive, where the rules describe the language as it exists. I strongly favor the descriptive. One of the most famous English language rules, "do not end sentences with prepositions", is a prescriptive rule thought to be borrowed from Latin. It was introduced by John Dryden and famously mocked in a line usually attributed to Winston Churchill, "This is the sort of nonsense up with which I will not put.", pointing out that it really doesn’t always fit a Germanic language like English. I know I’m a bit off topic again with my "writer’s embellishment", not to mention that the same idea is discussed in depth by Martin Fowler, who terms it "Predictive versus Adaptive".

This is a common problem which Agile attempts to address. It is common in general software design and construction, and it also occurs in API and framework design. Software construction, as we know, is an organic process, and I feel that frameworks are best developed in part out of that process through Martin Fowler’s Harvesting, which can be termed descriptive framework creation. What Joshua Bloch in part describes, and to some degree cautions against, can be described as a prescriptive approach to APIs and frameworks. I think many developers, including me, have attempted to create framework components and APIs early in a project, usually driven by a high-level vision of what the resulting system will look like, only to find out later that certain assumptions were not valid or certain cases were not accounted for [1].

What he talks about is a pretty rigid prescriptive model, which is in many ways at odds with an adaptive agile approach. I feel that the more adaptive approach is really what is needed for frameworks, and we do see it via versioning. For example, the differences between Spring 1.x and Spring 3.x are substantial, and no one would want to use 1.x now, but there are apps that are still tied to it. The opposite approach, complete backwards compatibility, was used with Java Generics, specifically type erasure, which has led to a significant weakening of the implementation of that abstraction. It is my understanding that Scala has recently undergone some substantial changes from version to version, leading some to criticize its stability while others argue that this is the only way to avoid painting yourself into a corner as Java did. The harvesting approach will often involve refinement of and changes to the components which are extracted, and this can lead to refactorings that potentially affect large amounts of existing code. It’s a real chicken-and-egg problem.

He starts his presentation with the following Characteristics of a Good API:

Characteristics of a Good API
  • Easy to learn
  • Easy to use, even without documentation
  • Hard to misuse
  • Easy to read and maintain code that uses it
  • Sufficiently powerful to satisfy requirements
  • Easy to extend
  • Appropriate to audience

He also makes the following point:

APIs can be among a company's greatest assets

I think this sentiment is reflected in Paul Graham’s "Beating the Averages". Of course that is more about Lisp, but the underlying principle is the same. An interesting language-agnostic point comes from Peter Norvig. I can’t find the reference, but he said he had a similar attitude towards Lisp until he got to Google and saw good programmers who were incredibly productive in C++. I feel that this is all just the framework argument: maximizing reusability by building reusable high-level abstractions within your problem domain that allow you to be more productive and build new high-level components more quickly. It’s all about efficiency.

With regard to APIs, he states:

API Should Be As Small As Possible But No Smaller

To which he adds:

  • When in doubt leave it out.
  • Conceptual weight is more important than the bulk - The number of concepts.
  • The most important way to reduce weight is reusing interfaces.

He attributes this sentiment to Einstein, though he wasn’t sure about it. I did some follow-up, and the usual paraphrasing is "Everything should be made as simple as possible, but no simpler." or "Make things as simple as possible, but not simpler." These are good cautions against over-designing or over-engineering APIs, frameworks, and software in general. I have definitely been guilty of this at times. Once again, this is something that needs to be balanced.

The following is what I consider to be seminal advice:

All programmers are API designers, because good programming is inherently modular, these inter-modular boundaries are APIs, and good APIs tend to get reused.

As stated in his slides:

Why is API Design Important to You?
  • If you program, you are an API designer
    • Good code is modular–each module has an API
  • Useful modules tend to get reused
    • Once module has users, can’t change API at will
    • Good reusable modules are corporate assets
  • Thinking in terms of APIs improves code quality

The next two quotes are pretty long, and transcribing them was no easy task since he talks really fast. I tried to represent them as accurately as possible. I feel that he really nails some key ideas, and I wanted to have a written record to reference, since I am not aware of any other:

Names matter a lot. There are some people that think that names don’t matter, and when you sit down and say, well, this isn’t named right, they say don’t waste your time, let’s just move on, it’s good enough. No! Names in an API that are going to be used by anyone else, and that includes yourself in a few months, matter an awful lot. The idea is that every API is kind of a little language, and people who are going to use your API need to learn that language and then speak in that language. That means that names should be self explanatory. You should avoid cryptic abbreviations, so the original Unix names, I think, fail this one miserably.

He augments these ideas by adding consistency and symmetry:

You should be consistent. It is very important that the same word means the same thing when used repeatedly in your API, and that you don’t have multiple words meaning the same thing. So let us say that you have a remove and a delete in the same API: that is almost always wrong. What’s the difference between remove and delete? Well, I don’t know. When I listen to those two things they seem to mean the same thing. If they do mean the same thing, then call them both the same thing. If they don’t, then make the names different enough to tell you how they differ. If they were called, let’s say, delete and expunge, I would know that expunge was a more permanent kind of removal or something like that. Not only should you strive for consistency, you should strive for symmetry. So if your API has two verbs, add and remove, and two nouns, entry and key, I would like to see addEntry, addKey, removeEntry, removeKey. If one of them is missing there should be a very good reason for it. I am not saying that all APIs should be symmetric, but the great bulk of them should. If you get it right the code should read like prose, that’s the prize.

From the Slides:

Names Matter–API is a Little Language
  • Names Should Be Largely Self-Explanatory
    • Avoid cryptic abbreviations
  • Be consistent–same word means same thing
    • Throughout API, (Across APIs on the platform)
  • Be regular–strive for symmetry
  • Code should read like prose

I feel that this hits some essential concepts which really resonate with me; in fact I have follow-on posts planned to further deconstruct and develop these ideas more generally. From the framework perspective, this also gets at the variety of framework code components: they can be Paul Graham’s functional Lisp abstractions, they can be DSLs, they can be object-oriented like Spring, Hibernate and the Java API, etc. Any framework built for a domain will have its own conceptual vocabulary or API language that is a higher-level abstraction of the problem domain, its Domain Specific Abstractions, and all of them benefit from concepts like consistency and symmetry as appropriate to the domain.
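To show what the add/remove and entry/key symmetry from the quote might look like in practice, here is a small Python sketch. The `Catalog` class and the semantics I gave each method are hypothetical and entirely my own; the talk only names the four methods. I have used Python’s snake_case rather than the camelCase names from the quote.

```python
class Catalog:
    """Hypothetical container with a symmetric vocabulary:
    two verbs (add, remove) crossed with two nouns (key, entry)."""

    def __init__(self) -> None:
        self._items: dict[str, object] = {}

    def add_key(self, key: str) -> None:
        """Register a key that has no value yet."""
        self._items.setdefault(key, None)

    def add_entry(self, key: str, value: object) -> None:
        """Register a key together with its value."""
        self._items[key] = value

    def remove_key(self, key: str) -> None:
        """Remove a key and whatever value it holds, if the key is present."""
        self._items.pop(key, None)

    def remove_entry(self, key: str, value: object) -> None:
        """Remove the key only if it currently maps to exactly this value."""
        if key in self._items and self._items[key] == value:
            del self._items[key]

    def keys(self) -> list[str]:
        """Return the registered keys in sorted order."""
        return sorted(self._items)


catalog = Catalog()
catalog.add_key("draft")
catalog.add_entry("title", "Resistance isn't Futile")
catalog.remove_entry("title", "some other title")  # value differs, nothing removed
catalog.remove_key("draft")
print(catalog.keys())
```

Because every verb is available for every noun, a reader who has seen `add_entry` can guess that `remove_entry` exists and roughly what it does, which is the "little language" quality he is describing. There is also exactly one word for removal, so nobody has to wonder how a delete would differ from a remove.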

The following is a very common and widespread problem that often inhibits reuse and leads to less efficient production of lower quality software:

Reuse is something that is far easier to say than to do. Doing it requires both good design and very good documentation. Even when we see good design, which is still infrequently, we won’t see the components reused without good documentation.

- D. L. Parnas, Software Aging. Proceedings of the 16th International Conference on Software Engineering, 1994.

Bloch adds:

Example code should be exemplary; one should spend ten times as much time on example code as on production code.

He references a paper called "Design Fragments" by George Fairbanks, which looks interesting, but I have not had time to read it yet.

I believe this is also a critical point, one that is possibly symptomatic of problems with many software projects. I feel that projects never explicitly allow for capturing reuse in terms of planning, schedule and developer time. Often reuse is a lucky artifact of having proactive developers who make the extra effort to do it, and it can often be at odds with the way I have seen projects run (or mismanaged). I have some follow-up planned for this topic as well.

In terms of design he adds this interesting point:

You need one strong design lead who can ensure that the API that you are designing is cohesive and pretty and clearly the work of one single mind, or at least a single-minded body, and that’s always a little bit of a trade-off: being able to satisfy the needs of many customers and yet produce something that is beautiful and cohesive.

I have to confess that I do not recall how I came across Todd Veldhuizen’s paper or Joshua Bloch’s talk, but I felt that they were really about similar ideas. In writing this and finding all of the references again, I realized that my association of the two was not coincidental at all: both were part of the Library-Centric Software Design (LCSD'05) workshop at Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'05), with Joshua Bloch delivering the same talk as the keynote address.

Now I admit that one goal of mine is to put these ideas, many of which were borrowed from the two referenced works, into a written form that I can refer back to. I hope this stands up by itself, but it is really written for my future entries. Also, for the record, this is not the first time I have "leeched" off of Joshua Bloch’s work.

Ultimately my ideas will diverge from some of his with regard to API design, since Joshua Bloch speaks from a position of greater responsibility in terms of APIs. I see APIs as part of a framework continuum used to construct software, and I think many of his ideas apply directly and more generally to that. I also see the framework and the system as interlocked, and feel that frameworks can drive structure and consistency. For example, object-oriented APIs, aka frameworks, can in turn drive the structure of software. The Spring Framework is a good example of this: Spring is built heavily around Inversion of Control (IoC), which is closely related to the Dependency Inversion Principle, one of the SOLID principles.

There will always be arguments against frameworks. The classic one is that a framework creates more code that is more complex and requires more learning. My counterargument is twofold. First, any code that is in production probably needs to be known and understood, which means it has to be learned regardless of how efficiently it was created. If a framework creates reusability and consistency, the initial learning curve will be higher, but each subsequent encounter with a codebase that is constructed this way should be easier. Highly redundant, inconsistent code is also potentially (much) more difficult to learn and maintain because there is more of it. Second, if your framework API is well defined and well documented, it should make the resultant code much easier to understand and maintain, aka "read like prose". This is because much of the "generic" complexity is "abstracted" downwards into the framework level. For example, compare the code needed to implement a Spring Controller to the underlying classes that do the work, such as AnnotationMethodHandlerAdapter. Now, if you have defects and issues at that lower level they will be harder to fix, and changing common code can have side effects on other dependent code. It’s not a perfect world.

I think the issue with reuse and the framework approach comes down to asking the right question: how resistant (or favorable) are your domain and your SDLC to reuse? I think most domains have some, if not a fair amount of, favorability to reuse, and I see reuse as increased efficiency in software construction.

[1] Perhaps: "...certain cases for were not accounted." Dryden blows.
