Cyclomatic method complexity is a scam

Thomas McCabe’s 1976 paper A Complexity Measure [pdf] suggests that we can “measure” the complexity of a program by counting the branch points. Later authors “show” that McCabe’s number correlates strongly with program size, and that it is therefore worthless. Still later authors [pdf] show that McCabe complexity is much lower (typically halved) for programs written in object-oriented languages. And although they don’t describe or assess the “quality” of their code samples, this latest “result” does suggest that complexity might be a useful tool in evaluating the habitability of a program.

(The ironic quote marks are there as a shorthand, to indicate that I question the science in these papers. In particular, no account was taken of design or coding style. In terms of habitability, though, the last study does offer a little chink of light.)

However, something irks me about the complexity measuring tools I’ve tried: they all work at the method level.

McCabe’s work describes program complexity. Not method (or function) complexity. His reasoning, if it is valid at all, is valid only for the entire call graph of a program. Such a thing is difficult to calculate in programs written using modern languages, due to late binding, polymorphism and/or dynamic typing, and may even be impossible without dynamic analysis. Indeed, a running program is likely to have different call graphs in different test scenarios, rendering the cyclomatic complexity of the entire program a moot point. So it is quite natural to re-cast the measure in terms of methods, purely in order to get something potentially useful.

But here’s the rub: how should we then roll those method complexities up so that we have an idea of the complexity of bigger things, such as classes or programs? I contend that we can’t, meaningfully.

Note that each method has at least one path through it, but two methods do not necessarily mean there are two code paths. Suppose I have this Ruby method, which has cyclomatic complexity 1:

def total_price
  base_price * 1.20

I may wish to extract a method to wrap the tax value:

def total_price
  base_price * tax_rate

def tax_rate

Now I have two methods, each of which has “cyclomatic” complexity 1. The total complexity appears to have increased, and yet the program’s call graph is no more complex. Thus any tool that sums method complexities will penalise the ExtractMethod refactoring, because the tool is now using invalid graph theory.

True cyclomatic complexity is meaningful only (if at all) in the context of the call graph of the entire program. Anything that counts branch points for individual methods (and then adds 1) is not calculating “cyclomatic” complexity, and runs a high risk of penalising useful refactorings.

How to measure habitability

I have argued that in order to measure the habitability of a codebase, we should focus on attempting to measure how well the code conforms to the Once And Only Once rule. So what does that rule mean? It means that every important concept in the domain is represented (named) explicitly — not buried inside another concept (Once), and not spread out in the space between several other concepts (Only Once).


How can we write a tool that will detect when a concept is buried inside another? Well, let’s flip that question around: How can we tell when a concept contains another buried inside it? I’m not sure that pure size is a reliable guide, because design concepts have no standard size. Some will require only a few lines of code, while others will need dozens. But it is probably not unreasonable to assume that a file which is significantly larger than the others in a codebase is probably harbouring too much responsibility.

What about complexity? It is usually true that a class with many conditionals or many levels of indentation is asking to be simplified. And very often the best response to over-use of branching is to break out the Strategy pattern, which gives explicit names to two or more of the responsibilities hidden inside the original class.

Lopez et al, in Relevance of the Cyclomatic Complexity Threshold for the Java Programming Language [pdf], propose that the McCabe complexity of object-oriented software is lower (less than half) that of procedural software. While their study does not appear to take the degree of habitability into account, it seems suggestive to me that some measure of branching might be a reasonable reflection of whether a class is pregnant with unnamed concepts.

Efferent coupling [pdf] may also be indicative here. It seems intuitively correct that a class that has to import/require many others is probably doing too many things.

Only Once:

How could we write a tool that will detect when a concept is spread out in the space between two or more others? Well, again the presence of branches may be an indication of Connascence of Meaning (think of the Null Check and Special Case code smells). Primitive Obsession is also a symptom of Connascence of Meaning, but I suspect we could only detect that automatically in statically typed languages (without actually running the code).

Again, efferent coupling may be indicative. When two classes both import/require another, there may be some Connascence of Algorithm going on — or at least a missing abstraction wrapping the imported stuff.

So while I have no evidence, my instincts tell me that we may be able to get a decent hint at habitability by finding some workable combination of coupling and branching measures. Soon I will begin exploring these ideas with code samples in a variety of languages, to see whether I can work up a metric (or a system of metrics) that can predict whether I will like the code…

Choosing what to measure

Suppose we want metrics that help us write “better” code — what does “better” mean? I’m going to fall back on Extreme Normal Form, more commonly known as the four rules of Simple Design. These say that our priorities, in order, when choosing what to improve are that the code should:

  1. be correct and complete
  2. express the design clearly
  3. express every concept once and only once
  4. be small

Let’s agree to call code that meets these criteria “habitable code”, in that it is easier to understand and maintain than uninhabitable code. How can we design a metric that rewards the creation of habitable code?

Now it seems to me that most teams have processes in place for ensuring correctness and completeness. These might be requirements traceability procedures, suites of (manual or automated) tests, QA procedures, user testing, focus groups, etc etc. The first rule of simple design helps to ensure that we are only working on valid code, but doesn’t help us differentiate “better” code. Only the remaining three rules can help us with that. In a sense, the first rule can be taken as a “given” when our projects are set up with appropriate QA processes and procedures in place. I will therefore not consider test coverage when searching for a metric (indeed I believe that test coverage is an unhelpful measure in the context of code habitability).

Let’s move on to the second rule of Simple Design. Joe Rainsberger re-casts this rule as No bad names, because in the huge majority of current programming languages the names we give to variables, functions, classes etc are our primary (often only) means of expression in the code. I believe it is impossible to directly measure how well code expresses intent, or whether the names chosen are “good” in any aesthetic sense. But the beauty, coherence and honesty of the names chosen by the programmer is only one aspect of applying this rule. Implicit in the statement “No bad names” is the notion that the programmer has chosen to give names to the “correct” set of entities. That is, choosing the things to reify and name in the code is at least as important as, if not more important than, choosing the best name to give those things.

This is where rules 2 and 3 begin to trip over each other (and perhaps why some authors place them in the opposite order). Consider for example the Primitive Obsession code smell. This occurs when two different classes communicate using (messages containing parameters of) primitive types (string, integer, array etc). The primitive(s) used for the message parameter(s) represent some domain concept that must be known to both classes; but this code is smelly because both classes must also understand the particular representation chosen for that concept in this particular message. The two classes exhibit Connascence of Meaning, and the third rule of Simple Design is violated because the representation of the domain concept is understood in two different places in the code. But we could also consider this to be a violation of rule 2, because that domain concept has not been reified into a class and named — we have a missing name.

Based solely on this type of logic, I am therefore going to boldly assert that any metric that gives a good indication of the degree to which rule 3 is followed, will also be a good indication of how well rule 2 is followed (allowing for the fact that we cannot measure aesthetics). So as a proxy measure I claim we can focus solely on the Once And Only Once rule as a measure of the habitability of our code.


Inspired by Uncle Bob’s use of crap4j, and egged on somewhat by various members of the Twitterverse, Marty Andrews and I have spiked crap4r on Github. This version looks for Rspec examples in all files called spec/**/*_spec.rb; then it runs them using Rcov and calculates the CRAP metric using code from Roodi. Very neat.

It’s all a bit scrappy and difficult to use right now, but we’ve proved the concept, and over the next few weeks we’ll get it licked into shape and (hopefully) published as a gem.

the big DIP

Several times today my online reading has brought me to Henrik Mårtensson’s Kallokain blog. Henrik has a great knack of taking concepts from lean manufacturing or the Theory of Constraints and creating worked examples to show how they play out in day-to-day software development.

Take for example what Henrik calls the DIP – design-in-process – the software development equivalent of inventory. DIP is any work that has been begun, but not yet shipped to end users.

An easy, and highly visible, way to measure DIP, is to make a Project Control Board. Use a whiteboard, or a large mylar sheet on a wall. Make a table with one column for each hand-off in your development process. […] At the start of each iteration, write sticky notes for each story (use case, feature description, or whatever you use). […] Whenever a developer begins to work on a new story, the developer moves the story from the stack of sticky notes to the first column in the table. The developer is then responsible for moving the story to a new column each time the story goes into a new phase of development. […] The sticky notes in the table are your DIP. Track it closely. If you have a process problem, you will soon see how sticky notes pile up at one of your process stages.

This fits with everything I’ve written about task boards, but Henrik then goes on to show how to calculate the current value of your DIP – ie. the investment you have made in the unshipped work on the board. In his worked example, a requirements change causes 20 story points worth of DIP to be scrapped; and as each story point costs 1,000 EUR to develop, the requirements change means we must write off 20,000 EUR of development costs.

It is worth noting that some of the 1,000 EUR per story point is “carrying costs” for the work that was scrapped. Just as manufacturing inventory costs money to store and move around, unshipped software (not to mention uncoded designs) has to be regularly read, searched, built and tested. It was there all the time, and we had to respect it and live with it. So it slowed us down; we had to step around it, and we would have tripped along more lightly if it had never been there. Looked at another way: with lower DIP inventory our cost per story point would be lower; equivalently, we could achieve higher throughput for the same expense.

(Of course, there are two obvious reactions to this write-off cost: make changes really difficult, or reduce inventories and cycle times so that DIP is always lower. Each has its sphere of applicability, I guess…)

one foot in the past?

[I wrote this piece on May 12th 2007. I found it today in my Drafts folder, and it still feels relevant. What do you think?]

While collating this week’s Carnival of the Agilists I very much wanted to break with tradition. In the usual format we use the four values from the Agile Manifesto as the main headings in the digest. These headings (“people over process” etc) represent the rallying cry of the agile movement, and express a set of desires in adversarial terms: we prefer this compared to that. That kind of thinking was all well and good in 2001, when the great and good in Snowbird felt they had a war to win. But I grow less comfortable with that approach over time, and I feel the need for something different.

Moreover, I wanted to lift the digest’s eyes a little, away from the nitty-gritty of what colour task cards to use and onto the reasons I feel agile software development is important and necessary: delivering business value. I wanted to change the headings to the three metrics suggested by Mary Poppendieck a while ago:

  1. reduce cycle time;
  2. increase business case realisation; and
  3. increase net promoter score.

Guess what: I couldn’t do it. In particular, I couldn’t fill those first two sections. The blogsphere is full of the minutiae of task-based planning, test-driven development, customer collaboration and so on. But there’s very little being said about business drivers or measuring benefits. I have found a small number of bloggers who write about cycle time – usually from a Theory of Constraints perspective – such as Pascal van Cauwenberghe and Clarke Ching. And there are a number of good blogs in the “user experience” category – notably Kathy Sierra, the folks at 47signals and those at Planet Argon.

But compared to the hundreds of blogs about task cards (and I’ve been as guilty as any), those mentioned above are a drop in the ocean. It’s as if most of the software development community is fixated on “being” agile, instead of being fixated on delivering. So come on, prove me wrong. By all means talk about what you do and how you do it, but try also to relate that to why you do it – map everything back to those metrics (or anything similar). Let me report in my next turn at the Carnival that we’ve lifted our eyes towards our goal.

throughput accounting measurement units

dollar-key Earlier today I was reading about the key business measures used in the Theory of Constraints, and running some examples in my head while I tried to make sure I understood exactly what was being said. Here’s what I learned…

To recap, the only business measures considered by TOC are

Throughput (T)
Cash receipts for actual sales, minus the cost of the raw materials used towards those sales.
Operating Expense (OE)
The expenditure required to produce what we sold, including heat/light, payroll etc.
Investment (I)
The money invested up front, or tied up in the operation. This includes depreciation on machinery, buildings, and other assets and liabilities.

These can be combined to give other measures; for example return on investment:

ROI = (T - OE)/I

(For more detail see throughput accounting on Wikipedia, for example.)

What struck me were the units used to express these measures. The only one I’ve seen explicitly stated was Throughput, which is expressed as “cash receipts per unit of time” – let’s call it dollars-per-day. So in order for the ROI calculation to make arithmetical sense, operating expense (OE) must also be measured in dollars-per-day. So far so good.

Now I had been expecting Investment (I) to simply be a sum of money. But in that case ROI would turn out to have units of “per-day”, which is clearly nonsense. So Investment must also be measured in dollars-per-day, and now I see that makes sense too. And in turn, this all implies that ROI is just a number, a scalar value. (Better perhaps to express it as a ratio or a percentage – so that an ROI of 1.50 is easier to understand as a 3:2 return or a 150% return.) Other traditional measures are calculated as follows:

Net profit (NP) = T-OE
Return on investment (ROI) = NP/I
Productivity (P) = T/OE
Investment turns (IT) = T/I

If my logic above is correct, Net Profit is expressed in dollars-per-day and all the others turn out to be scalar, unit-less numbers – and that alone helps me understand them much more.

Bug bounties

Brent Strange reports that some major companies – notably Microsoft, Mozilla and VeriSign – have begun rewarding their testers with cash for finding serious defects prior to release. It seems to me that this approach is seriously flawed, in at least two respects.

First, it further promotes the traditional antagonism between developers and testers. There’s now a clear reward for testers to find the developers’ work wanting. How does that help to build trust or teamwork?

And second, it rewards the testers for not helping the developers get it right sooner. Sure, the cash will be less than the cost of releasing with a serious defect, but it will also be less than the cost of rework due to finding the defect late in the value stream.

The solution? Both developers and testers should be rewarded when the pre-release testing finds no defects.* Instead of rewarding antagonism, reward collaboration. And reward the reduction in rework. Have the testers engaged at the front of the value stream, creating automated self-checking tests that will help the developers get it right – and complete – first time.

* (Footnote: this needs to be balanced by penalties of some kind if the pre-release tests are skimped in any way!)

Lack of visibility can be the bottleneck

Making targets and their associated metrics more visible can be the most important factor in achieving those targets. That’s the message behind the agile notion of Information Radiators or Big Visible Charts: the visibility itself is a critical component of their success. Today Shawn Callahan tells an anecdote in which it seems the increased visibility of a target and its metric served to dramatically change employee behaviour. He relates a story from Pfeffer and Sutton in Hard Facts, Dangerous Half-Truths, and Total Nonsense in which a shipping company wished to save money by bundling small orders into bigger boxes. The policy wasn’t being followed, so:

“So the company announced a new program that provided rewards such as praise—not financial rewards—for improvement. On the first day, the proportion of packages placed in the larger containers increased to 95 percent in about 70 percent of the company’s offices. The speed of this overwhelming improvement suggests that a change in performance derived not just from the rewards that were offered, but also from the information provided that the current performance level was poor and this action—consolidating shipments—was important to the company.”

As Jack Vinson suggests, the relative invisibility of the target behaviour and its measures could well have been the constraint on that business at that time. The anecdote is from the 1970s, too early for the business to have used the ToC thinking tools to diagnose the situation, or for the metric’s increased visibility to have been the result of applying the five focusing steps (identify, exploit, subordinate, elevate, repeat).

For me, the lesson from this story is that the constraint is as likely to be in knowledge management or communication as anywhere…

yet more on metrics for coaching

A month ago I began asking whether it was possible, meaningful or sensible to measure the performance of an agile coach. Herewith a couple of further thoughts on that topic.

vernier Firstly, there’s been some discussion of metrics over on the scrumdevelopment Yahoo group. The original question posed to the group was:

“What metrics do I collect that tell me this agile stuff is actually doing my group any good”

Mary Poppendieck weighed in with three metrics that should be taken together as a package:

  1. cycle time;
  2. business case realisation;
  3. and net promoter score.

Her post goes on to describe these measures in detail, and why they are important. They do seem spot on to me – though I’ve only tried the first for real, and even then with only limited success. It seems to me that there’s a strong case for measuring the coach according to the results of the organisation being coached, and these look to be measures I’d be willing to adopt for myself. But then…

Secondly, I have now finished Value-based Fees: How to Charge and Get What You’re Worth by Alan Weiss, as recommended by Deb. While not about coaching per se, the book does make recommendations as to how a consultant should establish performance measures with his or her clients. Weiss divides measures into quantitative and qualitative. Regarding the former:

“[Y]ou cannot base your contribution on what I call a ‘magic number’. While a 6 percent sales increase might be highly desirable and even, in your opinion, very achievable, it’s never a good idea to peg your success to that magic number. The reason is that there are far too many variables that can affect that number adversely for you to take that risk. […] It is better to state that you will assist in maximising the sales increase.”

And regarding qualitative measures:

“Qualitative measures can be among the most powerful [but] “I’ll know it when I see it” […] is insufficient for a consulting project. In fact, here are the parameters for creating a successful, anecdotal sereies of measures for a project: the buyer himself or herself will judge the result; the effectiveness will rely on observed behaviour and factual evidence; there will be gradations of success, not success or failure; there must be reasonable time limits.”

Weiss goes on to give examples that are very much like the kind of anecdotal measures I’ve been using on my own projects. So now I’m back to where I started! On reflection, I think that’s just fine. Start with something like:

  1. assist in minimising cycle time;
  2. assist in maximising business case realisation;
  3. and assist in maximising net promoter score

and then augment those with any other anecdotal measures that are valuable to the team, manager or organisation I’m working with.

What measures do you use to assess your own performance at work?