Cyclomatic method complexity is a scam

Thomas McCabe’s 1976 paper A Complexity Measure [pdf] suggests that we can “measure” the complexity of a program by counting the branch points. Later authors “show” that McCabe’s number correlates strongly with program size, and that it is therefore worthless. Still later authors [pdf] show that McCabe complexity is much lower (typically halved) for programs written in object-oriented languages. And although they don’t describe or assess the “quality” of their code samples, this latest “result” does suggest that complexity might be a useful tool in evaluating the habitability of a program.

(The ironic quote marks are there as a shorthand, to indicate that I question the science in these papers. In particular, no account was taken of design or coding style. In terms of habitability, though, the last study does offer a little chink of light.)

However, something irks me about the complexity measuring tools I’ve tried: they all work at the method level.

McCabe’s work describes program complexity. Not method (or function) complexity. His reasoning, if it is valid at all, is valid only for the entire call graph of a program. Such a thing is difficult to calculate in programs written using modern languages, due to late binding, polymorphism and/or dynamic typing, and may even be impossible without dynamic analysis. Indeed, a running program is likely to have different call graphs in different test scenarios, rendering the cyclomatic complexity of the entire program a moot point. So it is quite natural to re-cast the measure in terms of methods, purely in order to get something potentially useful.

But here’s the rub: how should we then roll those method complexities up so that we have an idea of the complexity of bigger things, such as classes or programs? I contend that we can’t, meaningfully.

Note that each method has at least one path through it, but two methods do not necessarily mean there are two code paths. Suppose I have this Ruby method, which has cyclomatic complexity 1:

def total_price
  base_price * 1.20

I may wish to extract a method to wrap the tax value:

def total_price
  base_price * tax_rate

def tax_rate

Now I have two methods, each of which has “cyclomatic” complexity 1. The total complexity appears to have increased, and yet the program’s call graph is no more complex. Thus any tool that sums method complexities will penalise the ExtractMethod refactoring, because the tool is now using invalid graph theory.

True cyclomatic complexity is meaningful only (if at all) in the context of the call graph of the entire program. Anything that counts branch points for individual methods (and then adds 1) is not calculating “cyclomatic” complexity, and runs a high risk of penalising useful refactorings.

How to measure habitability

I have argued that in order to measure the habitability of a codebase, we should focus on attempting to measure how well the code conforms to the Once And Only Once rule. So what does that rule mean? It means that every important concept in the domain is represented (named) explicitly — not buried inside another concept (Once), and not spread out in the space between several other concepts (Only Once).


How can we write a tool that will detect when a concept is buried inside another? Well, let’s flip that question around: How can we tell when a concept contains another buried inside it? I’m not sure that pure size is a reliable guide, because design concepts have no standard size. Some will require only a few lines of code, while others will need dozens. But it is probably not unreasonable to assume that a file which is significantly larger than the others in a codebase is probably harbouring too much responsibility.

What about complexity? It is usually true that a class with many conditionals or many levels of indentation is asking to be simplified. And very often the best response to over-use of branching is to break out the Strategy pattern, which gives explicit names to two or more of the responsibilities hidden inside the original class.

Lopez et al, in Relevance of the Cyclomatic Complexity Threshold for the Java Programming Language [pdf], propose that the McCabe complexity of object-oriented software is lower (less than half) that of procedural software. While their study does not appear to take the degree of habitability into account, it seems suggestive to me that some measure of branching might be a reasonable reflection of whether a class is pregnant with unnamed concepts.

Efferent coupling [pdf] may also be indicative here. It seems intuitively correct that a class that has to import/require many others is probably doing too many things.

Only Once:

How could we write a tool that will detect when a concept is spread out in the space between two or more others? Well, again the presence of branches may be an indication of Connascence of Meaning (think of the Null Check and Special Case code smells). Primitive Obsession is also a symptom of Connascence of Meaning, but I suspect we could only detect that automatically in statically typed languages (without actually running the code).

Again, efferent coupling may be indicative. When two classes both import/require another, there may be some Connascence of Algorithm going on — or at least a missing abstraction wrapping the imported stuff.

So while I have no evidence, my instincts tell me that we may be able to get a decent hint at habitability by finding some workable combination of coupling and branching measures. Soon I will begin exploring these ideas with code samples in a variety of languages, to see whether I can work up a metric (or a system of metrics) that can predict whether I will like the code…

Choosing what to measure

Suppose we want metrics that help us write “better” code — what does “better” mean? I’m going to fall back on Extreme Normal Form, more commonly known as the four rules of Simple Design. These say that our priorities, in order, when choosing what to improve are that the code should:

  1. be correct and complete
  2. express the design clearly
  3. express every concept once and only once
  4. be small

Let’s agree to call code that meets these criteria “habitable code”, in that it is easier to understand and maintain than uninhabitable code. How can we design a metric that rewards the creation of habitable code?

Now it seems to me that most teams have processes in place for ensuring correctness and completeness. These might be requirements traceability procedures, suites of (manual or automated) tests, QA procedures, user testing, focus groups, etc etc. The first rule of simple design helps to ensure that we are only working on valid code, but doesn’t help us differentiate “better” code. Only the remaining three rules can help us with that. In a sense, the first rule can be taken as a “given” when our projects are set up with appropriate QA processes and procedures in place. I will therefore not consider test coverage when searching for a metric (indeed I believe that test coverage is an unhelpful measure in the context of code habitability).

Let’s move on to the second rule of Simple Design. Joe Rainsberger re-casts this rule as No bad names, because in the huge majority of current programming languages the names we give to variables, functions, classes etc are our primary (often only) means of expression in the code. I believe it is impossible to directly measure how well code expresses intent, or whether the names chosen are “good” in any aesthetic sense. But the beauty, coherence and honesty of the names chosen by the programmer is only one aspect of applying this rule. Implicit in the statement “No bad names” is the notion that the programmer has chosen to give names to the “correct” set of entities. That is, choosing the things to reify and name in the code is at least as important as, if not more important than, choosing the best name to give those things.

This is where rules 2 and 3 begin to trip over each other (and perhaps why some authors place them in the opposite order). Consider for example the Primitive Obsession code smell. This occurs when two different classes communicate using (messages containing parameters of) primitive types (string, integer, array etc). The primitive(s) used for the message parameter(s) represent some domain concept that must be known to both classes; but this code is smelly because both classes must also understand the particular representation chosen for that concept in this particular message. The two classes exhibit Connascence of Meaning, and the third rule of Simple Design is violated because the representation of the domain concept is understood in two different places in the code. But we could also consider this to be a violation of rule 2, because that domain concept has not been reified into a class and named — we have a missing name.

Based solely on this type of logic, I am therefore going to boldly assert that any metric that gives a good indication of the degree to which rule 3 is followed, will also be a good indication of how well rule 2 is followed (allowing for the fact that we cannot measure aesthetics). So as a proxy measure I claim we can focus solely on the Once And Only Once rule as a measure of the habitability of our code.

I don’t measure test coverage

I believe that test coverage is an unhelpful measure. I have two main reasons behind this belief: Firstly, test coverage often gives false positives and false negatives; and secondly, too many badly written tests will slow a project down (that is, higher coverage can mean lower habitability).

It is relatively easy to write tests that exercise code without actually checking it for correctness. So high test coverage does not necessarily correlate with correctness and completeness. Conversely, low automated test coverage does not imply that the software is untested, nor does imply the presence of defects.

It is also relatively easy to write tests that enshrine bad design. In codebases with high coupling and low cohesion, high test coverage probably means that those poor designs are difficult to improve without breaking tests. For example, in code that is procedural, incohesive and highly coupled, tests that simply “mock out” every one of an object’s dependencies (read: use a mock library to stub out the dependencies) will likely increase the cost of changing the design later; whereas tests that have been used to drive roles into the design will ultimately help to make the code easier to change — but only if the fundamental design structures in the code mirror those in the domain.

So it is my belief that good design trumps test coverage, and I therefore prefer to focus on design measures first. (More on those later…) Note that I am not saying that automated testing is bad, nor indeed that high automated test coverage is bad. I am saying that high test coverage is not a good measure of software habitability.