why i broke 89 tests

A couple of days ago I made a little change to Reek and to my surprise 89 tests broke, which is bad. I tweeted about it, and soon @logosity had asked me to write about why it had happened. I know the same sort of thing happens to people all the time, and it can be quite traumatic in the otherwise placid flow of test-driven development. So I’m hoping it will be somewhat helpful to uncover the path that led me from blissful ignorance to test calamity with one keystroke. So without further ado, let’s do some root cause analysis…

Undesirable event:
I deleted an attr_reader that was called in only one place and 89 of Reek’s 360-odd rspec examples broke. Of these broken tests, only three were tests of SimulatedPolymorphism#examine_context, which was the deleted attribute’s one caller.

Why did the other 86 tests break unexpectedly?

Because they each indirectly invoked the one method that now could no longer work.

Why did they invoke the broken method?

Because they are integration tests, checking the co-ordinated behaviour of a dozen or so classes.

The broken method sits on a relatively new class, an instance of which is automatically included in the framework that those tests invoke.

The central class in Reek is the Sniffer, which obtains a syntax tree from a Source (eg. a File or a String or an IO stream) and then traverses it, firing events at registered SmellDetector objects as it encounters syntax elements they subscribe to. The broken method is in one of these listeners, and it is fired whenever the parser encounters a new class definition in the syntax tree.

Reek has a couple of hundred tests that each run a Sniffer over a code snippet, checking that the detectors find a specific set of code smells (and don’t find any others). The broken detector (SimulatedPolymorphism) is fired on any code fragment that includes a class definition; it turns out that only 86 of these couple hundred tests do that. The damage could have been much worse.

All of which means that the failing tests are “integration tests”, checking the overall behaviour of over a dozen classes under a variety of specific circumstances. Under some circumstances one of those classes is broken, so some of these integration tests fail.

Why are there so many integration tests in the unit test suite?

Because they are much easier to write than the tests for a single class would be. And they express the desired behaviour in ways that unit tests can’t. So I never took the time to write the unit tests.

Here’s an example of one of these tests:

describe ControlCouple do
  it 'should report a ternary check on a parameter' do
    code = 'def simple(arga) arga ? @ivar : 3 end'
    code.should reek_only_of(:ControlCouple, /arga/)

The corresponding unit tests would confirm that:

  • the code parser and method context objects conspire together to record the method’s parameters
  • the code parser sends an event to the control couple detector when, and only when, it sees a conditional expression
  • the control couple detector can obtain the enclosing method’s parameters
  • the control couple detector records a smell warning only when the conditional expression tests the value of a method parameter
  • the smell warning includes the correct information
  • no other detector records any warnings for the given code fragment

Unit tests exist for some of these cases and not for others. The ease with which I can write readable tests has made me lazy: I often don’t write the unit tests, because I get new features into Reek much more quickly this way. The penalty occurs when a change breaks one of the detectors or (worse) something in the Sniffer, because then I get dozens of broken tests and it can be difficult to pinpoint the problem quickly.

There’s also another reason I got lazy: As Reek’s internal architecture stabilized many months ago, most of the new code I write consists of adding new smell detector classes, which usually slot right into the existing stable framework with few surprises or problems.

So the root cause of my 89 test failures is the combination of a well-tested, stable, practically defect-free framework together with a great way to quickly write readable integration test examples. Is that so bad after all, I wonder.

Am I going to fix the problem? Yes and no. Each time this happens in future I will add the missing unit tests that should have failed. But I don’t want to lose the expressive value of these integration tests, so I plan to (gradually) organise them into a different test suite more closely related to Reek’s cucumber tests. What would you do?


6 thoughts on “why i broke 89 tests

  1. So, once you fix SimulatedPolymorphism your 86 tests will pass again, right?

    I don’t think this is anything to get too worked up about. I know people complain a lot about how fragile integration tests can be, but if the tests are valid and maintainable they are valuable checks on the behavior of the system. If you have a reliable integration environment, they can be very quick to write as well.

    I would tend to segregate them from more focused unit tests, because they tell you different things. It sounds like you’re approaching this the same way I would — write a focused unit test that shows the problem, then fix it. All your tests pass again, and life is good!

    • Yes, I test-drove a fix for SimulatedPolymorphism and then set about writing this up.

      I do think it’s always worth paying attention to the little nagging voice whenever I’m developing, because that voice is almost always telling me something important. When I can figure out what it’s saying…

  2. FWIW, I have the same attitude, and the same problem, in the Roodi code base. The integration tests are much more expressive, and seem to be where I focus now that the code is reasonably stable.

  3. Thank you for the writeup. It’s always instructive to dig deeper into a situation that surprised you.

    I examined the results of running 200M tests over 50 person years and found a clear power law distribution in the number of failures per test run. Sounds like you hit the fat tail (or the fat tail hit you).

    I agree with your analysis of the dilemma–readability vs. jackpot failures. Like you, I’ll take readability every time. 40% of the tests for JUnit are at the API level and we don’t regret it, in spite of the occasional multiple failure.

  4. I agree with your general approach. In fact, the problem is not that 89 tests broke. People often complain about that type of thing, but in truth it is not the issue. What can be problematic is if it is difficult to find your code error among the failing tests. If you were able to immediately say, “aha, the problem is in SimulatedPolymorphism” then it doesn’t matter how many other tests broke.
    Your plan to write a unit test when this happens is one good way to solve this. Another is to try to always make sure some integration tests hit every class in a meaningful way. If you change a class and 200 tests fail, but one of the failures is in a suite named for the class you just changed, then you immediately know where to look.
    Of course, if your integration tests run slowly, that is a different issue. I am assuming that yours run as quickly as typical unit tests, in which case I find the extra test coverage to be a bonus, not a smell. Again, given that you have solved the issue of rapid location of the underlying error.

    Troy Frever

  5. Pingback: Caffeine Driven Development » Blog Archive » L33t Links #35

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s