fun and games at xp-manchester

Last night was the October 2010 meeting of XP-Manchester, a local group set up by Jim McDonald and me. As always, the meeting consisted of two halves: a workshop (this time led by me) followed by a coding dojo.

For the workshop this month I ran a version of James Shore’s Offing the Offsite Customer game, as described by Kane Mar and using Kane’s drawings as the requirements. I hadn’t run the session before, and it turned out really well. We had 19 participants, so we split into two roughly equal-sized teams, with each team further split equally between a group of Product Owners and a group of Developers.

In the first run-through neither team managed to create a diagram that looked anything like the requirement; whereas in the second attempt both teams produced very good diagrams, and well inside the allotted time. The difference was born in the team retrospectives between the two runs. Both teams independently decided to work much more iteratively and interactively second time around, and it paid off. It could be said that in the first run, the teams focussed on perfecting the written spec, whereas in the second run the teams focussed on perfecting the working diagram. This focus on evolving a diagram using direct feedback was “invented” independently by both teams, and towards the end of the second run they even had free time available for fine-grained polishing.

The second part of the evening was a dojo. Jim introduced us to the Minisculus challenge set by Eden Development at the recent Software Craftsmanship 2010 event. Jim decided we should attempt the Mark I problem in Ruby, and given the relative lack of Ruby knowledge in the room, Mike Josephson did most of the driving. We didn’t get very far, but we did have some very interesting discussions about TDD style: after you’ve faked a return value to get quickly to GREEN, what’s the best next step? Is it better to add another test in order to triangulate towards a more general solution, or is it better to treat the fake return value as duplication and fix that by moving the specifics up into the test? We explored the latter approach last night, and no doubt we’ll continue the debate next month.
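For anyone who wasn’t at the dojo, here’s a tiny made-up illustration of the two options. This isn’t the Minisculus code we wrote; the encode example is invented purely to show the shape of the choice, written in the same old-style RSpec as the other snippets on this blog:

describe 'encode' do
  it 'shifts a letter forward by one place' do
    encode('a').should == 'b'
  end
end

def encode(letter)
  'b'   # faked return value: just enough to get to GREEN
end

# Option 1 (triangulate): add a second example, e.g. encode('c').should == 'd',
# and let the two tests together force a more general implementation.
#
# Option 2 (remove duplication): the 'b' in the test and the 'b' in the method
# are the same knowledge, so generalise the method (e.g. letter.succ) so that
# the specific values live only in the test.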

Many thanks to Jim and Mike for running things, and to everyone else for joining in!

XP-Manchester happens after work on the second Thursday of every month at Madlab in Manchester’s Northern Quarter. Everyone is welcome, and if you want to come along you can get details of upcoming meetings by joining the mailing list at http://groups.google.com/group/xp-manchester.

microtests for exceptions

Over on the Dev@Pulse blog Jim raises some interesting points about how and when to microtest the messages in thrown exceptions. Jim’s discussion got me thinking about how I solve the same problem myself, and on reflection I realised I use a slightly different tactic from Jim’s:

I write one test for each important piece of information that must be conveyed by the exception. For example, there might be a test that checks the exception’s message for the incorrect business value that caused the error, and perhaps another that checks for the name of the throwing object; and so on.

Here’s a simple example, based on some of Reek’s microtests:

context 'when the parser fails' do
  before :each do
    @parser_error = 'Error message'
    # Stub the parser so that it always raises, with a message we can look for
    parser = mock('parser')
    parser.should_receive(:parse).and_raise(SyntaxError.new(@parser_error))
    @source_name = 'source/file/path.rb'
    @src = SourceCode.new('def unused() end', @source_name, parser)
  end
  it 'raises a SyntaxError' do
    lambda {@src.syntax_tree}.should raise_exception(SyntaxError)
  end
  it 'reports the source name' do
    lambda {@src.syntax_tree}.should raise_exception(SyntaxError, /#{@source_name}/)
  end
  it 'reports the parser error message' do
    lambda {@src.syntax_tree}.should raise_exception(SyntaxError, /#{@parser_error}/)
  end
end

This scheme has two advantages: First, there are fewer direct dependencies on incidental text in the error message, which means the microtests are less brittle, and hence less likely to be affected by simple changes of wording. And second, the tests clearly indicate which parts of the error are “important” – so if one of them later fails I get a reminder that I’ve lost information, and thus a gentle prod to re-consider the most recent code change.

Everyone writes code that throws exceptions, so I’m sure these aren’t the only tactics for microtesting them; what do you do?

extract class or substitute algorithm?

I have a dilemma. In fact, I frequently have this dilemma, so I thought I’d ask your opinion on which way you would address it.

I just completed a CRC session for the next feature I want to add to my codebase. One of the classes identified for the new scenarios is a particular kind of counting collector, and I realised that the same behaviour already occurs elsewhere in the codebase. So one of the first steps in coding up this new scenario will be to extract a new class from an existing class. The existing class has reasonable micro-tests, and therein lies the dilemma:

  • I could simply use the Extract Class refactoring recipe, and I’d be done quite quickly; the new collector class would be tested indirectly, via the micro-tests for the original class (which would now be the extracted class’s only client).
  • Or I could develop the collector class test-first, from scratch, so that it would have its own micro-tests, and then use Substitute Algorithm to reimplement the old class in terms of the new one.

The advantage of the second approach is that it gives me precise micro-tests of the new class. But while developing that class, I could drift away from the required usage, or I could make it harder to switch the original class over to using the new one. (I tend to think of Substitute Algorithm as being riskier than Extract Class.)
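To make the trade-off more concrete, here’s a rough sketch of what the Extract Class route might produce; the class and method names below are invented for illustration and aren’t the real code in question:

# Invented names, purely illustrative of the Extract Class route.
# The counting behaviour is pulled out into its own class...
class CountingCollector
  def initialize
    @counts = Hash.new(0)
  end

  def record(item)
    @counts[item] += 1
  end

  def count(item)
    @counts[item]
  end
end

# ...and the original class simply delegates to it, so the new collector is
# exercised only indirectly, via the original class's existing micro-tests.
class SmellCounter
  def initialize
    @collector = CountingCollector.new
  end

  def record(smell)
    @collector.record(smell)
  end

  def count(smell)
    @collector.count(smell)
  end
end

The second route would instead begin with micro-tests driving CountingCollector directly, and only then swap the original class’s innards over to use it.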

So which is better — use Extract Class and know that I have a working client-supplier relationship; or TDD from scratch and risk going off track? In practice, I tend to choose Extract Class if I’m feeling lazy or distracted, whereas I go with TDD followed by Substitute Algorithm when I’m feeling organised; in the latter case I usually get stressed by the “substitute” step…

(It seems to me that if we choose the latter approach in general, many refactorings should cease to be used. That can’t be right, can it?)

why i broke 89 tests

A couple of days ago I made a little change to Reek and, to my surprise, 89 tests broke, which is bad. I tweeted about it, and soon @logosity had asked me to write about why it had happened. I know the same sort of thing happens to people all the time, and it can be quite traumatic in the otherwise placid flow of test-driven development, so I’m hoping it will be somewhat helpful to uncover the path that led me from blissful ignorance to test calamity with one keystroke. Without further ado, then, let’s do some root cause analysis…

Undesirable event:
I deleted an attr_reader that was called in only one place and 89 of Reek’s 360-odd rspec examples broke. Of these broken tests, only three were tests of SimulatedPolymorphism#examine_context, which was the deleted attribute’s one caller.

Why did the other 86 tests break unexpectedly?

Because they each indirectly invoked the one method that now could no longer work.

Why did they invoke the broken method?

Because they are integration tests, checking the co-ordinated behaviour of a dozen or so classes.

The broken method sits on a relatively new class, an instance of which is automatically included in the framework that those tests invoke.

The central class in Reek is the Sniffer, which obtains a syntax tree from a Source (e.g. a File, a String or an IO stream) and then traverses it, firing events at registered SmellDetector objects as it encounters the syntax elements they subscribe to. The broken method is in one of these listeners, and it is fired whenever the parser encounters a new class definition in the syntax tree.
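The following sketch shows roughly the shape of that interaction; it’s a simplification with invented method names, not Reek’s actual implementation:

# Simplified sketch with invented names; not Reek's real code.
class Sniffer
  def initialize(source)
    @source = source
    @listeners = Hash.new { |hash, key| hash[key] = [] }
  end

  # Each SmellDetector registers for the syntax elements it cares about.
  def register(detector)
    detector.subscribed_node_types.each do |node_type|
      @listeners[node_type] << detector
    end
  end

  def run
    walk(@source.syntax_tree)
  end

  private

  # Depth-first traversal, firing an event at each subscribed detector.
  def walk(node)
    @listeners[node.type].each { |detector| detector.examine_context(node) }
    node.children.each { |child| walk(child) }
  end
end

Delete something one of the listeners relies on and every traversal that visits a node of that type will break, which is exactly what happened here.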

Reek has a couple of hundred tests that each run a Sniffer over a code snippet, checking that the detectors find a specific set of code smells (and don’t find any others). The broken detector (SimulatedPolymorphism) is fired on any code fragment that includes a class definition; it turns out that only 86 of these couple hundred tests do that. The damage could have been much worse.

All of which means that the failing tests are “integration tests”, checking the overall behaviour of over a dozen classes under a variety of specific circumstances. Under some circumstances one of those classes is broken, so some of these integration tests fail.

Why are there so many integration tests in the unit test suite?

Because they are much easier to write than the tests for a single class would be. And they express the desired behaviour in ways that unit tests can’t. So I never took the time to write the unit tests.

Here’s an example of one of these tests:

describe ControlCouple do
  it 'should report a ternary check on a parameter' do
    code = 'def simple(arga) arga ? @ivar : 3 end'
    # the custom reek_only_of matcher runs a Sniffer (and all its detectors)
    # over the snippet and passes only if exactly this one smell is reported
    code.should reek_only_of(:ControlCouple, /arga/)
  end
end

The corresponding unit tests would confirm that:

  • the code parser and method context objects conspire together to record the method’s parameters
  • the code parser sends an event to the control couple detector when, and only when, it sees a conditional expression
  • the control couple detector can obtain the enclosing method’s parameters
  • the control couple detector records a smell warning only when the conditional expression tests the value of a method parameter
  • the smell warning includes the correct information
  • no other detector records any warnings for the given code fragment
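For contrast, here’s roughly what the third of those unit tests might look like; the detector’s interface and its collaborator are assumptions made for illustration, not Reek’s real internals:

# Illustrative only: the collaborator and its interface are assumed, not real.
describe ControlCouple do
  it 'obtains the enclosing method parameters' do
    method_context = mock('method context')
    method_context.should_receive(:parameters).and_return(['arga'])

    detector = ControlCouple.new
    detector.examine_context(method_context)
  end
end

Each test at this level pins down one collaboration, so a broken detector fails a handful of its own tests instead of dozens of whole-tool runs.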

Unit tests exist for some of these cases and not for others. The ease with which I can write readable tests has made me lazy: I often don’t write the unit tests, because I get new features into Reek much more quickly this way. The penalty occurs when a change breaks one of the detectors or (worse) something in the Sniffer, because then I get dozens of broken tests and it can be difficult to pinpoint the problem quickly.

There’s also another reason I got lazy: As Reek’s internal architecture stabilized many months ago, most of the new code I write consists of adding new smell detector classes, which usually slot right into the existing stable framework with few surprises or problems.

So the root cause of my 89 test failures is the combination of a well-tested, stable, practically defect-free framework together with a great way to quickly write readable integration test examples. Is that so bad after all, I wonder.

Am I going to fix the problem? Yes and no. Each time this happens in future I will add the missing unit tests that should have failed. But I don’t want to lose the expressive value of these integration tests, so I plan to (gradually) organise them into a different test suite more closely related to Reek’s cucumber tests. What would you do?

today’s kata

In Ruby, one easy way to check for palindromes would be to write a method such as this:


class String
  def palindrome?
    self == self.reverse
  end
end

But the C++ part of my brain screams that this is likely to be very inefficient. So, today’s kata: test-drive a palindrome? method for String, ensuring (with tests) that it is much faster than the above.
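One possible direction (and only one; whether it actually wins is exactly what the kata’s tests should establish) is to walk inwards from both ends and stop at the first mismatch, avoiding the reversed copy altogether. A rough sketch:

# A rough sketch of one candidate approach, not a benchmarked answer to the kata.
class String
  def palindrome?
    left = 0
    right = length - 1
    while left < right
      return false unless self[left] == self[right]
      left += 1
      right -= 1
    end
    true
  end
end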

a different use for CruiseControl

Until recently I had always thought of CruiseControl as a backstop, something that would catch silly mistakes at commit time, or certain classes of environmental dependency in the build, tests or deployment. Then, during December, two things happened to change that view, and now I use CruiseControl quite differently.

As you know, I’ve been developing this tool called reek, which looks for code smells in Ruby source files and prints warnings about what it finds. reek has been developed largely test-first, and now has a couple of hundred tests. The first thing that happened was that I began to be irritated by the huge runtime of the full test suite — 15 seconds is a long time when you’re sitting waiting for feedback, and I noticed I would sometimes dip off into the browser while waiting for green (or red). I needed the coverage, but without the waiting around.

The second thing that happened was that during one of those interminable 15-second waits I read about the Saff squeeze. Then I tried it, enjoyed it, and found that the tests it produced were significantly different from those I wrote “myself”.

So now I use CruiseControl differently. I began by splitting the tests into two suites: fast and slow. The fast tests comprise over 90% of the total, and yet run in under 1 second; I use them as normal during my development cycle — no more waiting around! The slow tests consist of runs of the whole tool on known source code; there are only a dozen of them, but they take 14 seconds to run. (You might call them “unit tests” and “integration tests”, and indeed that’s what the two suites comprise right now. But the key attribute is their speed, not their scope.)
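The split itself is straightforward to automate; here’s roughly how it might look as a pair of rake tasks (the directory layout, the task names and the use of the current RSpec rake task are my assumptions, not reek’s actual build files):

# Illustrative Rakefile sketch; names and layout are assumptions, not reek's.
require 'rspec/core/rake_task'

# The fast suite: run constantly during development.
RSpec::Core::RakeTask.new(:fast) do |t|
  t.pattern = 'spec/fast/**/*_spec.rb'
end

# The slow suite: whole-tool runs over known source, delegated to CruiseControl.
RSpec::Core::RakeTask.new(:slow) do |t|
  t.pattern = 'spec/slow/**/*_spec.rb'
end

task :default => :fast      # what I run locally
task :ci => [:fast, :slow]  # what the build server runs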

So now the slow tests only get run by CruiseControl, not by me (it does run the fast tests too, btw). I’ve delegated the 14 seconds of waiting, and now my development state of flow is much easier to maintain. And if the slow tests fail, it means I have one or more missing fast tests — so I whip out the Saff squeeze to go from one to the other. I’m using CruiseControl as a batch test runner, and the Saff squeeze as a mechanical tool to help generate a missing fast test from a failing slow test. (The process of squeezing also tends to tell me something quite interesting about my code. More on that another time.)

I’m sure this use of CruiseControl isn’t new or original, but I’m pleased with how well it works. Let me know if you try something similar?

the saff squeeze

In Hit ’em High, Hit ’em Low: Regression Testing and the Saff Squeeze, Kent Beck outlines a very nice technique for isolating a defect using regression tests:

“To effectively isolate a defect, start with a system-level test and progressively inline and prune until you have the smallest possible test that demonstrates the defect.”

Kent provides a step-by-step example of the technique, which he attributes to David Saff. I’ve always done the “High” part of the process, but the successive inlining and pruning to get to the “Low” is new and very clever. Tucked into my bag of tricks now.

(Link via PierG.)

red, green, refactor … commit

As time passes I find myself getting more and more nervous about the amount of time between passing tests, and between commits to the source code repository. So when Neil asked me today where I put the “commit” step in my TDD practice, I said:

red
green
   commit
refactor
   commit

That extra “commit” at the green bar captures the safe state, in case I get distracted by the kids or forget to take my medication. It also shares the working code with the rest of the team, which in turn means I get feedback on whether it builds, and those tests get run more often, sooner, and outside my own environment.

Recently I’ve been learning to use Git (which I like a lot, although some of the differences from Subversion still baffle me a little). Anyroadup, with Git I can get even more safety from the process:

red
   commit
green
   commit
   push
refactor
   commit
   push

The extra “commit” at the red bar is local to my repository. It means I can “stop on a red bar”, as used to be recommended in the early days of TDD, and yet still have all my code checked in.

I wonder whether I’m becoming more paranoid as Old Age now has me in its grip…

write tests for old or new features?

You have a pile of legacy code that’s hard to modify, hard to test, and never seems to be bug-free. You have some testers who are available to be moved to the front of the flow to write regression tests. Is it more cost-effective to have them write regression tests for current features or for new features? Does the answer depend on aspects of the environment, or the tools available?

I must admit I’ve always tended to favour testing the new stuff; what would you do?