why i broke 89 tests

A couple of days ago I made a little change to Reek and to my surprise 89 tests broke, which is bad. I tweeted about it, and soon @logosity asked me to write about why it had happened. I know the same sort of thing happens to people all the time, and it can be quite traumatic in the otherwise placid flow of test-driven development. So I’m hoping it will be somewhat helpful to uncover the path that led me from blissful ignorance to test calamity with one keystroke. Without further ado, then, let’s do some root cause analysis…

Undesirable event:
I deleted an attr_reader that was called in only one place and 89 of Reek’s 360-odd rspec examples broke. Of these broken tests, only three were tests of SimulatedPolymorphism#examine_context, which was the deleted attribute’s one caller.
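
For concreteness, the change was of roughly this shape; the names below are invented for illustration and aren’t Reek’s actual code:

class SimulatedPolymorphism < SmellDetector
  attr_reader :max_identical_ifs   # the deleted line (hypothetical attribute name)

  def examine_context(klass)
    # the attribute's only caller; with the reader gone, every test
    # that reaches this method blows up
    report_smell(klass) if conditional_count(klass) > max_identical_ifs
  end
end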

Why did the other 86 tests break unexpectedly?

Because each of them indirectly invoked the one method that could no longer work.

Why did they invoke the broken method?

Because they are integration tests, checking the co-ordinated behaviour of a dozen or so classes.

The broken method sits on a relatively new class, an instance of which is automatically included in the framework that those tests invoke.

The central class in Reek is the Sniffer, which obtains a syntax tree from a Source (e.g. a File, a String, or an IO stream) and then traverses it, firing events at registered SmellDetector objects as it encounters the syntax elements they subscribe to. The broken method is in one of these listeners, and it is fired whenever the parser encounters a new class definition in the syntax tree.
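
In outline the mechanism looks something like this; the code below is a simplified sketch of that publish/subscribe shape, not Reek’s exact source:

class Sniffer
  def initialize(source)
    @syntax_tree = source.syntax_tree
    @listeners = Hash.new { |hash, key| hash[key] = [] }
  end

  # each detector subscribes to the syntax elements it cares about
  def register(detector)
    detector.subscribed_events.each { |event| @listeners[event] << detector }
  end

  def run
    traverse(@syntax_tree)
  end

  private

  def traverse(node)
    # fire an event at every detector subscribed to this element...
    @listeners[node.type].each { |detector| detector.examine_context(node) }
    # ...then walk the rest of the tree
    node.children.each { |child| traverse(child) }
  end
end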

Reek has a couple of hundred tests that each run a Sniffer over a code snippet, checking that the detectors find a specific set of code smells (and don’t find any others). The broken detector (SimulatedPolymorphism) is fired on any code fragment that includes a class definition; it turns out that only 86 of these couple hundred tests do that. The damage could have been much worse.

All of which means that the failing tests are “integration tests”, checking the overall behaviour of over a dozen classes under a variety of specific circumstances. Under some circumstances one of those classes is broken, so some of these integration tests fail.

Why are there so many integration tests in the unit test suite?

Because they are much easier to write than the tests for a single class would be. And they express the desired behaviour in ways that unit tests can’t. So I never took the time to write the unit tests.

Here’s an example of one of these tests:

describe ControlCouple do
  it 'should report a ternary check on a parameter' do
    code = 'def simple(arga) arga ? @ivar : 3 end'
    code.should reek_only_of(:ControlCouple, /arga/)
  end
end

The corresponding unit tests would confirm that:

  • the code parser and method context objects conspire together to record the method’s parameters
  • the code parser sends an event to the control couple detector when, and only when, it sees a conditional expression
  • the control couple detector can obtain the enclosing method’s parameters
  • the control couple detector records a smell warning only when the conditional expression tests the value of a method parameter
  • the smell warning includes the correct information
  • no other detector records any warnings for the given code fragment
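
For contrast with the integration style above, the unit-level equivalent of the third bullet might look something like this; the context interface shown is an assumption for illustration, not Reek’s exact API:

describe ControlCouple do
  it 'obtains the enclosing method parameters from its context' do
    # a stand-in for the real method context object; interface assumed
    ctx = mock('method context')
    ctx.should_receive(:parameters).and_return([:arga])
    ControlCouple.new.examine_context(ctx)
  end
end

A suite of such focused examples would have pinpointed the deleted attr_reader immediately, instead of implicating 86 distant tests.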

Unit tests exist for some of these cases and not for others. The ease with which I can write readable tests has made me lazy: I often don’t write the unit tests, because I get new features into Reek much more quickly this way. The penalty occurs when a change breaks one of the detectors or (worse) something in the Sniffer, because then I get dozens of broken tests and it can be difficult to pinpoint the problem quickly.

There’s another reason I got lazy: Reek’s internal architecture stabilized many months ago, and most of the new code I write now consists of adding new smell detector classes, which usually slot right into the existing stable framework with few surprises or problems.

So the root cause of my 89 test failures is the combination of a well-tested, stable, practically defect-free framework together with a great way to quickly write readable integration test examples. Is that so bad after all, I wonder.

Am I going to fix the problem? Yes and no. Each time this happens in future I will add the missing unit tests that should have failed. But I don’t want to lose the expressive value of these integration tests, so I plan to (gradually) organise them into a different test suite more closely related to Reek’s cucumber tests, perhaps along the lines sketched below.
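
A minimal version of that separation, assuming RSpec 1.x’s Rake task and invented directory names:

# Rakefile sketch; the paths and task names here are made up
require 'spec/rake/spectask'

desc 'fast, isolated unit examples'
Spec::Rake::SpecTask.new(:units) do |t|
  t.spec_files = FileList['spec/reek/**/*_spec.rb']
end

desc 'slower end-to-end examples, cousins of the cucumber features'
Spec::Rake::SpecTask.new(:integration) do |t|
  t.spec_files = FileList['spec/integration/**/*_spec.rb']
end

What would you do?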

people problems vs process problems

A good deal has been written recently on the origin of problems that occur during software development. Pascal started the ball rolling by noticing an apparent conflict between Weinberg and Toyota:

“… Gerald Weinberg tells us that ‘Whatever the problem is, it’s always a people problem.’ […] The Toyota Way states that ‘Whatever the problem is, it’s always a process problem.’”

Pascal goes on to resolve the conflict by noting that it is a mirage: fix the “people problem” immediately, and then fix the “process problem” so that the same class of problem can’t recur. Aside from the dubious use of the term “people problem”, these are the classical elements of the jidoka method: detect, stop, fix, prevent.

Then Marc followed up by proposing that it’s a false dichotomy:

“Processes (by which I mean the things and actions that actually happen in a system) are the manifestation of what the people involved have learned – of both their explicit and tacit knowledge. People and processes are two sides of the same coin. The way to improve your processes is to improve your people.”

Dave, though, brings the question around to organizational culture:

“IMO the key question is, How does the organization — which comprises people (who have a culture) — react reflexively when things are going badly? Is the tendency to fall back on adherence to formal processes, or to facilitate people’s ability to find creative solutions?”

I’ve raised Pascal’s original dilemma a number of times recently during my jidoka workshops. At some point I’ll inevitably make a bald statement that I believe every problem is a process problem (and no problem is a “people problem”). But then I find the resulting discussion revolves around some participants’ interpretation of the word “process” as being equivalent to “procedure”. And my point is lost forever. So here I want to re-cast the debate, hopefully in terms that are unambiguous.

First, what do we mean by “problem”? I take the term to refer to any reduction of productivity in a system or organisation. In TOC (theory of constraints) or lean terms this usually translates to a loss of throughput or flow – the problem boils down essentially to one of the seven types of muda – but it can also apply to an increase in operating costs without a corresponding increase in throughput. And I think we’re using the word “problem” above to refer to two different things:

incidents
The bad things that happen and which trigger us to look hard at our system. TOC calls these UDEs – Undesirable Effects. I prefer to call them incidents, to reflect their uniqueness and their immediacy.
root causes
Those attributes of our system that lie at the heart of the observed incidents. Any given incident will have a single root cause, if we broaden the scope of our system far enough.

And by system I mean that collection of processes, procedures, situations, knowledge, and structures that make up the environment in which the incident occurred. The system may be a single department and all its workings, or it may be the broader organisation comprising the whole company.

So above, when I talked about “process problems”, I intended to refer to any root cause that is ultimately responsible for causing an incident. This will be some attribute of the system we’re operating within. It may be that the system didn’t encourage two people to talk to each other, or didn’t ensure adequate knowledge in its workers, or comprised an incorrect flow of materials, or failed to embody a quality check at some point, or whatever.

So, trying to be perfectly clear: I believe that the reductions in a system’s productivity occur because of the structure and behaviours of that system. Randomly switching people around might cause different incidents to actually occur, but in every case it is the system that permitted those incidents, and the system’s attributes that ultimately caused them. It seems to me that looking for people problems is about as smart as blaming the laser printer for a badly spelled document. People are not the problem. Ever.

I leave the final word to the Lean FAQ of the Northwest Lean Networks:

“If your ‘5 Why’ exercise seems to be pointing to ‘operator error’ as the root cause, you are going down the wrong path. Operators only do what our production systems allow them to do, so the root cause is in our systems, not our workers.”