microtests for exceptions

Over on the Dev@Pulse blog Jim raises some interesting points about how and when to microtest the messages in thrown exceptions. Jim’s discussion got me thinking about how I solve the same problem myself, and on reflection I discovered I use a slightly different tactic than Jim’s:

I write one test for each important piece of information that must be conveyed by the exception. For example, there might be a test that checks the exception’s message for the incorrect business value that caused the error; and maybe another that checks for the name of the throwing object. Etc etc.

Here’s a simple example, based on some of Reek‘s microtests:

context 'when the parser fails' do
  before :each do
    @parser_error = 'Error message'
    parser = mock('parser')
    parser.should_receive(:parse).and_raise(SyntaxError.new(@parser_error))
    @source_name = 'source/file/path.rb'
    @src = SourceCode.new('def unused() end', @source_name, parser)
  end
  it 'raises a SyntaxError' do
    lambda {@src.syntax_tree}.should raise_exception(SyntaxError)
  end
  it 'reports the source name' do
    lambda {@src.syntax_tree}.should raise_exception(SyntaxError, /#{@source_name}/)
  end
  it 'reports the parser error message' do
    lambda {@src.syntax_tree}.should raise_exception(SyntaxError, /#{@parser_error}/)
  end
end

This scheme has two advantages: First, there are fewer direct dependencies on incidental text in the error message, which means the microtests are less brittle, and hence less likely to be affected by simple changes of wording. And second, the tests clearly indicate which parts of the error are “important” – so if one of them later fails I get a reminder that I’ve lost information, and thus a gentle prod to re-consider the most recent code change.

Everyone writes code that throws exceptions, so I’m sure these aren’t the only tactics for microtesting them; what do you do?

extract class or substitute algorithm?

I have a dilemma. In fact, I frequently have this dilemma, so I thought I’d ask your opinion on which way you would address it.

I just completed a CRC session for the next feature I want to add to my codebase. One of the classes identified for the new scenarios is a particular kind of counting collector, and I realised that the same behaviour already occurs elsewhere in the codebase. So one of the first steps in coding up this new scenario will be to extract a new class from an existing class. The existing class has reasonable micro-tests, and therein lies the dilemma:

  • I could simply use the Extract Class refactoring recipe, and I’d be done quite quickly; the new collector class would be tested indirectly, via the micro-tests for the original class (which would now be the extracted class’s only client).
  • Or I could develop the collecting class test-first, from scratch, so that it had its own micro-tests, and then use Substitute Algorithm, reimplementing the old class to use the new.

The advantage of the second approach is that it gives me precise micro-tests of the new class. But while developing that class, I could drift away from the required usage, or I could make it harder to switch the original class over to using the new one. (I tend to think of Substitute Algorithm as being riskier than Extract Class.)
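To make the trade-off concrete, here is a minimal sketch of the first route; CountingCollector and SmellTally are names invented for illustration, not the real classes from my codebase:

class CountingCollector
  def initialize
    @counts = Hash.new(0)
  end

  def record(item)
    @counts[item] += 1
  end

  def count(item)
    @counts[item]
  end
end

class SmellTally                          # the existing, already-tested class
  def initialize
    @collector = CountingCollector.new    # was: @counts = Hash.new(0)
  end

  def record(smell)
    @collector.record(smell)              # was: @counts[smell] += 1
  end
end

With Extract Class, the existing micro-tests of SmellTally continue to exercise all of that behaviour indirectly through the delegation. The second route would instead grow CountingCollector test-first under its own specs, and only afterwards rewire SmellTally to delegate to it.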

So which is better — use Extract Class and know that I have a working client-supplier relationship; or TDD from scratch and risk going off track? In practice, I tend to choose Extract Class if I’m feeling lazy or distracted, whereas I go with TDD followed by Substitute Algorithm when I’m feeling organized; in the latter case I usually get stressed by the “substitute” step…

(It seems to me that if we always chose the latter approach, many refactorings would fall out of use altogether. That can’t be right, can it?)

why i broke 89 tests

A couple of days ago I made a little change to Reek and, to my surprise, 89 tests broke, which is bad. I tweeted about it, and soon @logosity asked me to write about why it had happened. The same sort of thing happens to people all the time, and it can be quite traumatic in the otherwise placid flow of test-driven development, so I’m hoping it will be helpful to uncover the path that led me from blissful ignorance to test calamity with a single keystroke. Without further ado, then, let’s do some root cause analysis…

Undesirable event:
I deleted an attr_reader that was called in only one place and 89 of Reek’s 360-odd rspec examples broke. Of these broken tests, only three were tests of SimulatedPolymorphism#examine_context, which was the deleted attribute’s one caller.

Why did the other 86 tests break unexpectedly?

Because they each indirectly invoked the one method that now could no longer work.

Why did they invoke the broken method?

Because they are integration tests, checking the co-ordinated behaviour of a dozen or so classes.

The broken method sits on a relatively new class, an instance of which is automatically included in the framework that those tests invoke.

The central class in Reek is the Sniffer, which obtains a syntax tree from a Source (eg. a File or a String or an IO stream) and then traverses it, firing events at registered SmellDetector objects as it encounters syntax elements they subscribe to. The broken method is in one of these listeners, and it is fired whenever the parser encounters a new class definition in the syntax tree.
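The shape being described is roughly the following; this is a sketch rather than Reek’s actual code, and the traversal details, node interface and method signatures are all stand-ins:

class Sniffer
  def initialize(source, detectors)
    @source, @detectors = source, detectors
  end

  def run
    walk(@source.syntax_tree)
  end

  private

  # Walk the tree, offering each node to every detector that subscribes
  # to that kind of syntax element.
  def walk(node)
    @detectors.each do |detector|
      detector.examine_context(node) if detector.listens_to?(node.node_type)
    end
    node.children.each { |child| walk(child) }
  end
end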

Reek has a couple of hundred tests that each run a Sniffer over a code snippet, checking that the detectors find a specific set of code smells (and don’t find any others). The broken detector (SimulatedPolymorphism) is fired on any code fragment that includes a class definition; it turns out that only 86 of these couple hundred tests do that. The damage could have been much worse.

All of which means that the failing tests are “integration tests”, checking the overall behaviour of over a dozen classes under a variety of specific circumstances. Under some circumstances one of those classes is broken, so some of these integration tests fail.

Why are there so many integration tests in the unit test suite?

Because they are much easier to write than the tests for a single class would be. And they express the desired behaviour in ways that unit tests can’t. So I never took the time to write the unit tests.

Here’s an example of one of these tests:

describe ControlCouple do
  it 'should report a ternary check on a parameter' do
    code = 'def simple(arga) arga ? @ivar : 3 end'
    code.should reek_only_of(:ControlCouple, /arga/)
  end
end

The corresponding unit tests would confirm that:

  • the code parser and method context objects conspire together to record the method’s parameters
  • the code parser sends an event to the control couple detector when, and only when, it sees a conditional expression
  • the control couple detector can obtain the enclosing method’s parameters
  • the control couple detector records a smell warning only when the conditional expression tests the value of a method parameter
  • the smell warning includes the correct information
  • no other detector records any warnings for the given code fragment

Unit tests exist for some of these cases and not for others. The ease with which I can write readable tests has made me lazy: I often don’t write the unit tests, because I get new features into Reek much more quickly this way. The penalty occurs when a change breaks one of the detectors or (worse) something in the Sniffer, because then I get dozens of broken tests and it can be difficult to pinpoint the problem quickly.
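To make that concrete, here is roughly what one of the missing unit tests (for the first bullet above) might look like; MethodContext and record_parameters are names invented for the sketch, not Reek’s real internals:

describe MethodContext do
  it 'records the parameters of the method definition it represents' do
    ctx = MethodContext.new
    ctx.record_parameters([:arga, :argb])
    ctx.parameters.should == [:arga, :argb]
  end
end

Small, fast and precise when it fails; but it says nothing about the overall behaviour the user cares about, which is why the integration-style examples are so much more tempting to write.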

There’s also another reason I got lazy: As Reek’s internal architecture stabilized many months ago, most of the new code I write consists of adding new smell detector classes, which usually slot right into the existing stable framework with few surprises or problems.

So the root cause of my 89 test failures is the combination of a well-tested, stable, practically defect-free framework together with a great way to quickly write readable integration test examples. Is that so bad after all, I wonder.

Am I going to fix the problem? Yes and no. Each time this happens in future I will add the missing unit tests that should have failed. But I don’t want to lose the expressive value of these integration tests, so I plan to (gradually) organise them into a different test suite more closely related to Reek’s cucumber tests. What would you do?

today’s kata

In Ruby, one easy way to check for palindromes would be to write a method such as this:


class String
  def palindrome?
    self == self.reverse
  end
end

But the C++ part of my brain screams that this is likely to be inefficient: it allocates a reversed copy of the whole string and compares every character, even when a mismatch near the ends could settle the question immediately. So, today’s kata: test-drive a palindrome? method for String, ensuring (with tests) that it is much faster than the above.
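For what it’s worth, the shape the C++ part of my brain has in mind is a two-pointer scan that allocates nothing and bails out at the first mismatch; treat this as an illustration of the idea, not a test-driven solution:

class String
  def palindrome?
    low, high = 0, length - 1
    while low < high
      return false unless self[low] == self[high]
      low += 1
      high -= 1
    end
    true
  end
end

Whether that really is “much faster” is exactly what the kata’s tests (perhaps using Ruby’s Benchmark module) would have to demonstrate.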

Reflections on shotgun surgery

I’ve just finished refactoring the code of Reek to (almost) eliminate a large dose of Shotgun Surgery, so I thought the time was ripe for a little reflection on what occurred and why.

What is Shotgun Surgery?

This is a code smell that occurs when you have to change loads of existing code in order to make a single change to the overall codebase.

In this case I wanted to change the reports generated by Reek’s spec matchers so that test failures are described more succinctly. To do so I had to change 7 classes, adding 4 methods (three of which all had the same name) and modifying several others. I pushed quickly to GREEN, at which point the code looked like I had driven a bulldozer through it from one end to the other and back again.

Ok, that happens all the time. Why is it a problem?

Because it creates drag. I had known for a while that certain kinds of change in this codebase would be painful to make, and the pain I imagined was subconsciously steering me away from making those kinds of change. I had been spending my time making changes that I knew were easier, putting off tackling the no-go areas of the codebase.

Furthermore, as I was tackling this smell I discovered a few defects in the code. They were edge cases, but they were definitely bugs. They had festered there in the dark corners where I had been afraid to go; and for all I know they had tripped up some new user of Reek and put them off forever.

And perhaps worst of all, it indicates that there’s a single design decision whose implementation (and consequences) is distributed throughout the codebase. So any similar or related decisions are going to be just as difficult and messy, and pretty soon they’ll all pile up together and render the code completely unmaintainable.

Why did it get so bad?

I think the main reason is that I’m not particularly well tuned to noticing Shotgun Surgery. (I believe I’m not alone in this.) We write a test, and it takes a bulldozer to get to green. We get to green, and we make a few localized refactorings to tidy up some of the methods we touched. But we often don’t step back and see the bigger picture: “Hey, I just had to touch 6 classes to get that done; there’s a design issue here.” I’m sure I had failed to notice this and related smells on several previous occasions.

I had also failed to listen to my inner demon. I had “seen” several pieces of the eventual puzzle many weeks earlier, but I hadn’t addressed them because they didn’t seem serious. For example, I knew that some classes violated one of my central tenets of a good hexagonal architecture. But I told myself that was just a guideline — I shouldn’t push the code towards that shape unless the code told me it wanted to go. It turned out that the code had been screaming at me for some time!

How did I fix it?

The first, and most important, step was to identify the decision that hadn’t been properly and uniquely represented in the code. The aim was to refactor the code so that if I wished to change that decision again I would only need to make a single change to a single class. This is the hardest part of fixing Shotgun Surgery, and it’s a design step.

In this case I wanted to create a different kind of report, and I wanted to do it in the spec matcher. But at that point quiet reports were a special case within full reports, implemented by reading an attribute from a global variable containing the user’s preferences. And the report itself was created way over on the other side of the system, in a lazily evaluated cache.

I won’t bore you with the details. I decided to do this one “by the book”, inverting dependencies one by one until the Report object was created once, at the identified decision point in the code. Over the course of a dozen hours I changed all of the above 7 classes again; I introduced 4 new classes; and at the close of play none of the methods I had touched in the original change remained.
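To give a flavour of the before and after, without pretending this is the real code (Report, QuietReport, SpecMatcher and $user_options below are all stand-ins):

# Before: the report reads the user's preferences from a global to decide
# how much to say, so the quiet-vs-full decision lives far from its callers.
class Report
  def initialize(warnings)
    @warnings = warnings
  end

  def text
    if $user_options[:quiet]
      @warnings.join(', ')
    else
      @warnings.map { |w| "warning: #{w}" }.join("\n")
    end
  end
end

# After: a small report class per format, and the one caller that owns the
# decision (the spec matcher) simply constructs the kind of report it wants.
class QuietReport
  def initialize(warnings)
    @warnings = warnings
  end

  def text
    @warnings.join(', ')
  end
end

class SpecMatcher
  def failure_message_for(warnings)
    QuietReport.new(warnings).text
  end
end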

Could this cost have been avoided?

I think this is a natural process — reasonably well-refactored code suddenly receives a change request that goes against the grain, so there’s some origami required before the old and the new sit comfortably together. That much is inevitable.

But I’m sure I missed fainter warning signs earlier in the code’s life. Had I responded to them, this week’s change might have been much less traumatic. So in future I will try to be much more vigilant: the next time a test causes me to change more than one class, I’ll sort it out right there and then.

a different use for CruiseControl

Until recently I had always thought of CruiseControl as a backstop, something that would check for silly mistakes during commit, or certain classes of environmental dependency in build, tests or deployment. And then, during December, two things happened to change that view; and now I use CruiseControl quite differently.

As you know, I’ve been developing this tool called reek, which looks for code smells in Ruby source files and prints warnings about what it finds. reek has been developed largely test-first, and now has a couple of hundred tests. The first thing that happened was that I began to be irritated by the huge runtime of the full test suite — 15 seconds is a long time when you’re sitting waiting for feedback, and I noticed I would sometimes dip off into the browser while waiting for green (or red). I needed the coverage, but without the waiting around.

The second thing that happened was that during one of those interminable 15-second waits I read about the Saff squeeze. Then I tried it, enjoyed it, and found that the tests it produced were significantly different from those I wrote “myself”.

So now I use CruiseControl differently. I began by splitting the tests into two suites: fast and slow. The fast tests comprise over 90% of the total, and yet run in under 1 second; I use them as normal during my development cycle — no more waiting around! The slow tests consist of runs of the whole tool on known source code; there are only a dozen of them, but they take 14 seconds to run. (You might call them “unit tests” and “integration tests”, and indeed that’s what the two suites comprise right now. But the key attribute is their speed, not their scope.)
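Mechanically the split can be as simple as two Rake tasks keyed off directory layout. This is a sketch, not reek’s actual Rakefile; it assumes RSpec 1.x’s spec command and a spec/fast / spec/slow directory split:

desc 'the sub-second specs I run constantly while developing'
task :fast do
  sh 'spec spec/fast'
end

desc 'the whole-tool specs that CruiseControl looks after'
task :slow do
  sh 'spec spec/slow'
end

task :default => :fast            # a plain `rake` gives fast feedback
task :cruise => [:fast, :slow]    # the task CruiseControl is set up to run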

So now the slow tests only get run by CruiseControl, not by me (it does run the fast tests too, btw). I’ve delegated the 14 seconds of waiting, and now my development state of flow is much easier to maintain. And if the slow tests fail, it means I have one or more missing fast tests — so I whip out the Saff squeeze to go from one to the other. I’m using CruiseControl as a batch test runner, and the Saff squeeze as a mechanical tool to help generate a missing fast test from a failing slow test. (The process of squeezing also tends to tell me something quite interesting about my code. More on that another time.)

I’m sure this use of CruiseControl isn’t new or original, but I’m pleased with how well it works. Let me know if you try something similar?

the saff squeeze

In Hit ’em High, Hit ’em Low: Regression Testing and the Saff Squeeze, Kent Beck outlines a very nice technique for isolating a defect using regression tests:

“To effectively isolate a defect, start with a system-level test and progressively inline and prune until you have the smallest possible test that demonstrates the defect.”

Kent provides a step-by-step example of the technique, which he attributes to David Saff. I’ve always done the “High” part of the process, but the successive inlining and pruning to get to the “Low” is new and very clever. Tucked into my bag of tricks now.

(Link via PierG.)

red, green, refactor … commit

As time passes I find myself getting more and more nervous about the amount of time between passing tests, and between commits to the source code repository. So when Neil asked me today where I put the “commit” step in my TDD practice, I said:

red
green
   commit
refactor
   commit

That extra “commit” at the green bar captures the safe state, in case I get distracted by the kids or forget to take my medication. It also shares that working code with the rest of the team, which in turn means I get feedback on whether it builds. And it means those tests get run more frequently, and sooner in environments other than my own.

Recently I’ve been learning to use Git (which I like a lot, although some of the differences from Subversion still baffle me a little). Anyroadup, with Git I can get even more safety from the process:

red
   commit
green
   commit
   push
refactor
   commit
   push

The extra “commit” at the red bar is local to my repository. It means I can “stop on a red bar”, as used to be recommended in the early days of TDD, and yet still have all my code checked in.

I wonder whether I’m becoming more paranoid as Old Age now has me in its grip…

write tests for old or new features?

You have a pile of legacy code that’s hard to modify, hard to test, and never seems to be bug-free. You have some testers who are available to be moved to the front of the flow to write regression tests. Is it more cost-effective to have them write regression tests for current features or for new features? Does the answer depend on aspects of the environment, or the tools available?

I must admit I’ve always tended to favour testing the new stuff; what would you do?

TDD and random numbers in ruby

I’m about to TDD a Ruby class whose behaviour will involve the use of random numbers. I expect the algorithms within the class to evolve as I implement new stories, so I don’t want to design and build a testing mechanism that will be brittle when those changes occur. But before I can write the next example, I need to figure out how to control the random numbers needed by the code under test. Off the top of my head I can think of four options:

  1. One way would be to set a fixed seed just before each test and then simply let the random algorithm do its thing. But for each new behaviour I would need to guess the appropriate seed, which is likely to be time-consuming. Furthermore, the relationship between each magic seed and the specific behaviour tested is likely to be obscure, possibly requiring comments to document it for the future reader. And finally, if the algorithm later evolves in such a way as to consume random numbers in a different order or quantity, the seed may turn out to be inappropriate, leading to broken tests or, worse, tests that pass but which no longer test the planned behaviour.
  2. Alternatively I could redefine Kernel::rand — but that could potentially interfere with stuff I don’t know about elsewhere in the object space.
  3. Within the class under test I could self-encapsulate the call to Kernel::rand, and then override the encapsulating method in a subclass for the tests. But then I’m not testing the class itself.
  4. Finally, I could parameterize the class, passing to it an object that generates random numbers for it. This option appears to give me complete control, without being too brittle or trampling on anyone else in the object space.

So I’ll go with option 4. Right now, though, I’m not sure what interface the randomizer object should provide to the calling class. Looking ahead, I expect I’ll most likely want to select a random item from an array, which means selecting a random integer in the range 0...(array.length). And for this next example all I’ll need is a couple of different randomizers that return 0 or 1 on every request, so I’ll simply pass in a proc:

obj.randomize_using { |ceil| 0 }

And if ever I need to provide a specific sequence of varying random numbers, I can do it like this:

rands = [1, 0, 2]
obj.randomize_using { |ceil| rands.shift || 0 }
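For context, the receiving side of randomize_using might look something like this; Chooser and pick_from are invented names, a guess at the shape rather than the actual class from this session:

class Chooser
  # Store a block that, given a ceiling, returns an integer in 0...ceiling.
  # Production code would simply pass { |ceil| rand(ceil) }.
  def randomize_using(&block)
    @randomizer = block
  end

  # Pick an element, delegating all randomness to the injected block.
  def pick_from(array)
    array[@randomizer.call(array.length)]
  end
end

chooser = Chooser.new
chooser.randomize_using { |ceil| 0 }
chooser.pick_from(%w{a b c})      # always "a" under this test randomizer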

Later that same day…

The class I’m developing has evolved quite a lot and split into three. And suddenly, with the most recent change, three of the tests have begun failing. A little investigation reveals that the code is now consuming a random number when it didn’t need to in the past, and so some of my randomizer procs now provide inappropriate values. It turns out that two of the failing examples actually boil down to a single test of behaviour that has now been refactored out into another class; by refactoring the tests to match, I can remove their dependency on random numbers altogether. And the last broken test is fixed by providing a randomizer that respects the ceiling passed to it (not an unreasonable request):

obj.randomize_using { |ceil| [2, ceil-1].min }

This works, and I get no more surprises during the session.