TDD for teams

I strongly suspect that TDD for teams is different than TDD for individuals.

There’s a proverb in software development to the effect that:

“TDD is a design technique, not a testing technique”

I agree. But that doesn’t mean it’s the only design technique we need. And it doesn’t also mean that everyone will use it equivalently or get the same results with it. For example, take a look at the approaches used by Seb Rose, Ron Jeffries and Alistair Cockburn to solving the Letter Diamond kata. (Click on their names to read their blog posts, then hop back here if you still have any energy left.) They each tackled the same set of requirements in completely different ways. Each used TDD, and yet their designs were completely different.

In fact, while I was drafting this post, George Dinwiddie had a go too, and Ron made a second attempt. So now we have 5 different design approaches to one toy kata from 4 developers, all in the space of one weekend. (And there are probably more that I haven’t heard about.) Of course, it would have been weird if they had all produced identical designs. We are all different, and that’s definitely a good thing. But I worry that this can cause problems for teams doing TDD.

A couple of years ago I remember doing a performance kata in which I paired with Mark Kirschstein to tackle the supermarket checkout kata. My role was to write the tests and Mark’s was to make them pass. At the beginning of the session I made the bold claim that I could get Mark to implement the checkout in a way he had never seen before, with a design that he had not conceived. The audience, and Mark, were skeptical. They were used to thinking of the problem as framed by the tests in Dave Thomas’ original problem statement. And so they were expecting the first test to be something like this:

public class CheckoutTests {
  public void oneA() {
    Checkout checkout = new Checkout();

But in fact my first test was this:

public class CheckoutTests2 implements ScannerListener {
  int priceReported = 0;
  String productReported = null;

  public void oneA() {
    Scanner scanner = new Scanner(this);
    scanner.scan(new SKU("A"));
    assertEquals(30, priceReported);
    assertEquals("A", productReported);

  public void itemScanned(String product, int price) {
    productReported = product;
    priceReported = price;

(expressed using the SelfShunt pattern, as anyone who has attended any of my training courses will recognise immediately). Mark, to his surprise, was gradually led to creating a checkout implementation based on notifications and listeners, and with no getters or setters.

[Supplementary challenge: implement the supermarket checkout kata without conditionals!]

While this was (I hope) great theatre, there’s a deeper message in all of this when it comes to whole teams working on non-trivial problems: If a team is to produce software that exhibits internal consistency and is understandable by all of its developers, then somehow these individual differences must be surrendered or homogenized in some way. Somehow the team must create designs — and a design style — that everyone agrees on.

Does it make sense for teams to operate as isolated pairs, each programming away on their specific tasks, without regard for how their designs will integrate with those of their team-mates? I see too many teams doing just that; ditching design sessions on the basis of reading TDD books and blogs in which a single person or pair did all of the thinking. I see far too many codebases in which personal style is the major design force; where the same domain concept is implemented in two or more radically different ways; where duplication is thus very hard to spot, and even harder to remove.

Perhaps we need more published examples of team-based TDD, showing techniques for creating and sharing Just Enough Design Up FrontTM.

XP includes the key practices of Coding Standard and System Metaphor; are they enough to solve the problem? How can pairs Refactor Mercilessly if there is no team consensus as to what constitutes “good” and “consistent”?

What does your team do to balance the needs of “enough design” and “too much design up front”?

I don’t measure test coverage

I believe that test coverage is an unhelpful measure. I have two main reasons behind this belief: Firstly, test coverage often gives false positives and false negatives; and secondly, too many badly written tests will slow a project down (that is, higher coverage can mean lower habitability).

It is relatively easy to write tests that exercise code without actually checking it for correctness. So high test coverage does not necessarily correlate with correctness and completeness. Conversely, low automated test coverage does not imply that the software is untested, nor does imply the presence of defects.

It is also relatively easy to write tests that enshrine bad design. In codebases with high coupling and low cohesion, high test coverage probably means that those poor designs are difficult to improve without breaking tests. For example, in code that is procedural, incohesive and highly coupled, tests that simply “mock out” every one of an object’s dependencies (read: use a mock library to stub out the dependencies) will likely increase the cost of changing the design later; whereas tests that have been used to drive roles into the design will ultimately help to make the code easier to change — but only if the fundamental design structures in the code mirror those in the domain.

So it is my belief that good design trumps test coverage, and I therefore prefer to focus on design measures first. (More on those later…) Note that I am not saying that automated testing is bad, nor indeed that high automated test coverage is bad. I am saying that high test coverage is not a good measure of software habitability.

TDD: three easy mistakes

I run introductory training in test-driven development quite frequently these days. And each time I do, I find the same basic mistakes cropping up every time, even among teams who already claim to practice TDD. Here are the three mistakes I see most often:

1. Starting with edge cases (empty string, null parameter) or error cases:

Imagine you have to test-drive the design of an object that can count the number of occurrences of every word in a string. I will often see some or all of these tests written before any others:

public class WordCounterTest {

  public void wordCounterCanBeCreated() {
    assertNotNull(new WordCounter());

  public void nullInputCausesExeption() {
    new WordCounter().count(null);

  public void emptyInputGivesEmptyOutput() {
    Map<String, Integer> actual = new WordCounter().count("");
    assertEquals(new HashMap<String, Integer>(), actual);


These feel easy to write, and give a definite feeling of progress. But that is all they give: a feeling of progress. These tests really only prove to ourselves that we can write a test.

When written first like this, they don’t deliver any business value, nor do they get us closer to validating the Product Owner’s assumptions. When we finally get around to showing him our work and asking “Is this what you wanted?”, these tests turn out to be waste if he says “Actually, now I see it, I think I want something different”.

And if the Product Owner decides to continue, that is the time for us to advise him that we have some edge cases to consider. Very often it will turn out to be much easier to cope with those edge cases now, after the happy path is done. Some of them may now already be dealt with “for free”, as it were, simply by the natural shape of the algorithm we test drove. Others may be easy to implement by adding a Decorator or modifying the code. Conversely, if we had started with the edge cases, chances are we had to work around them while we built the actual business value — and that will have slowed us down even more.

So start with tests that represent business value:

public void singleWordIsCounted() {
  Map<String, Integer> expected = new HashMap<String, Integer>();
  expected.put("happy", 2);
  assertEquals(expected, new WordCounter().count("happy happy"));

This way you will get to ask the Product Owner that vital question sooner, and he will invest less before he knows whether he wants to proceed. And you will have a simpler job to do, both while developing the happy path, and afterwards when you come to add the edge cases.

2. Writing tests for invented requirements:

You may think that your solution will decompose into certain pieces that do certain things, and so you begin by testing one of those and building upwards from there.

For example, in the case of the word counter we may reason along the following lines: “We know we’ll need to split the string into words, so let’s write a test to prove we can do that, before we continue to solve the more difficult problem”. And so we write this as our first test:

public void countWords() {
  assertEquals(2, new WordCounter().countWords("happy monday"));

No-one asked us to write a method that counts the words, so yet again we’re wasting the Product Owner’s time. Equally bad, we’ve invented a new requirement on our object’s API, and locked it in place with a regression test. If this test breaks some time in the future, how will someone looking at this code in a few months’ time cope with that: A test is failing, but how does he know that it’s only a scaffolding test, and should have been deleted long ago?

So start at the outside, by writing tests for things that your client or user actually asked for.

3. Writing a dozen lines of code in order to get the next test to pass:

When the bar is red and the path to green is long, TDD beginners often soldier on, writing an entire algorithm just to get one test to pass. This is highly risky, and also highly stressful. It is also not TDD.

Suppose you have these tests:

public void singleWordIsCounted() {
  assertEquals("happy=1", new WordCounter().counts("happy"));

public void repeatedWordIsCounted() {
  assertEquals("happy=2", new WordCounter().counts("happy happy"counts));

And suppose you have been writing the simplest possible thing that works, so your code looks like this:

public class WordCounter {
  public String counts(String text) {
    if (text.contains(" "))
      return "happy=2";
    return "happy=1";

Now imagine you picked this as the next test:

public void differentWords() {
  assertEquals("happy=1 monday=1", new WordCounter().counts("happy monday"));

This is a huge leap from the current algorithm, as any attempt to code it up will demonstrate. Why? Well, the code duplicates the tests at this point (“happy” occurs as a fixed string in several places), so we probably forgot the REFACTOR step! It is time to remove the duplication before proceeding; if you can’t see it, try writing a new test that is “closer” to the current code:

public void differentSingleWordIsCounted() {
  assertEquals("monday=1", new WordCounter().counts("monday"));

We can now make this simpler set of tests pass easily, effectively by removing the duplication between the code and the tests:

public class WordCounter {
  public String counts(String text) {
    String[] words = text.split(" ");
    return words[0] + "=" + words.length;

After making this relatively simple change, we have now test-driven part of the algorithm with which we struggled earlier. At this point we can try the previous test again; and this time if it is still too hard, we may wish to ask whether our chosen result format is helping or hindering…

So if you notice that you need to write or change more than 3-4 lines of code in order to get to green, STOP! Revert back to green. Now either refactor your code in the light of what just happened, so as to make that test easier to pass, or pick a test closer to your current behaviour and use the new test to force you to do that refactoring.

The step from red bar to green bar should be fast. If it isn’t, you’re writing code that is unlikely to be 100% tested, and which is prone to errors. Choose tests so that the steps are small, and make sure to refactor ALL of the duplication away before writing the next test, so that you don’t have to code around it whilst at the same time trying to get to green.

A testing strategy

The blog post Cucumber and Full Stack Testing by @tooky sparked a very interesting Twitter conversation, during the course of which I realised I had fairly clear views on what tests to write for a web application. Assuming an intention to create (or at least work towards creating) a hexagonal architecture, here are the tests I would ideally aim to have at some point:

  • A couple of end-to-end tests that hit the UI and the database, to prove that we have at least one configuration in which those parts join up. These only need to be run rarely, say on CI and maybe after each deployment.
  • An integration test for each adapter, proving that the adapter meets its contract with the domain objects AND that it works correctly with whatever external service it is adapting. This applies to the views-and-controllers pairings too, with the service objects in the middle hexagon stubbed or mocked as appropriate. These will each need to run when the adapters or the external services change, which should be infrequent once initial development of an adapter has settled out.
  • Unit tests for each object in the middle hexagon, in which commands issued to other objects are mocked (see @sandimetz‘s testing rules, for which I have no public link). And for every mocked interaction, a contract test proving that the mocked object really would respond as per the mocked interaction (see @jbrains‘s Integrated tests are a scam articles). These will be extremely fast, and will be run every few seconds during the TDD cycle.

I’ve never yet reached this goal, but that’s what I’m striving for when I create tests. It seems perfectly adequate to me, given sufficient discipline around the creation of the contract tests. Have I missed anything? Would it give you confidence in your app?

Why I don’t use spork

Spork is great. And so is guard and its family of plugins. Early this year I spent a while converting all of my rails projects to use spork, and we even had a team standard tmux setup that ran spork in one of the start-up screens. So every time we saved a file, guard/spork ran some of our specs. We even had growl/notify pop up a little message telling us the result, so we didn’t have to go to the trouble of switching screen to find out. How very efficient!

But now I’ve turned spork off, and here’s why: Spork solves the wrong problem.

We enabled guard/spork because our specs were slow, and during the few months we had spork, those specs became even slower. But we never noticed. Most of the time guard/spork guessed correctly which specs to run, and we became quite good at configuring guard to re-load Rails when critical files changed. So we were reasonably comfortable, and our specs ran quickly most of the time. But our design was getting worse every day.

Slow specs are a sign of too much coupling, and in the case of a Rails app that usually means not enough separation between the domain objects and the adapters. So we’ve stopped running spork, so that we can feel the pain of the 15-second Rails load time, and so that we can feel the pain of all that coupling. And slowly we are making the app’s specs faster. Many spec files now run in under a hundredth of a second, and more will follow as we gradually peel domain code out from the Rails infrastructure. We don’t need to test the Rails components, and we have a very good suite of cucumber features that act as end-to-end integration tests. So our specs should be fast, and only need to tell us that our own objects each do their thing. That’s what we are now working towards, and that’s the pot of gold that spork was hiding from us.

Faster Rails controller specs

One of the Rails apps I work on has this:

$ rspec spec
Finished in 61.82 seconds
475 examples, 0 failures

61 seconds!  (And on top of that I have to wait another 15 seconds for Rails load; that’s a whole other story, and I hope to come back to that in a future post.) A quick dig into the 61 seconds reveals this:

$ rspec spec/controllers
Finished in 22.82 seconds
114 examples, 0 failures

It happens that almost every controller action in our app needs the user to be logged in, so every controller spec needs that too. We’re using Devise, so every spec has something equivalent to this:

before :each do
  sign_in Factory(:user)

This implies hits on the database to create the user object, record the IP address and last sign-in time, etc etc. We can do much better by using a mock_model for the User, and stubbing out Warden‘s authentication completely.

First, in the interests of DRYness, I made some helper methods to do some of the jobs that factory_girl does:

module RandomHelpers
  def random_number() SecureRandom.random_number(10000); end
  def random_id() random_number; end
  def random_name() SecureRandom.hex(20); end

Next, a method that creates a test double representing the User who will be logged in:

module ControllerHelpers
  def user_double(attrs = {})
    user_attrs = {
      :first_name => random_name,
      :last_name => random_name,
      :authenticatable_salt => 'x'
    mock_model(User, user_attrs)

This is the place to fill the User double with default attribute values. Our app requires every User to have a first and last name, so I’ve stubbed these with random strings; they can be overridden by the caller if specific values are needed for any test. In your own app, change this method to set any mandatory attributes your user model needs. Note also the stubbed authenticatable_salt() method, which wasn’t required when mocking earlier versions of Devise.

I also need a method that logs a user in:

module ControllerHelpers
  def stub_sign_in_with(user)
    request.env['warden'] = double(Warden,
                                   :authenticate => user,
                                   :authenticate! => user,
                                   :authenticate? => true)
    return user

This replaces Warden with a test double, and Devise is none the wiser.
I also create a method to glue these together for the simplest (and most common) case:

module ControllerHelpers
  def stub_sign_in(attrs = {})
    stub_sign_in_with user_double(attrs)

(Note that this returns the User double; you’ll see why that’s useful in a moment.)

Finally, I configure the specs so they have access to these helper methods. To achieve that, spec/spec_helper.rb needs to look something like this:

Dir[Rails.root.join("spec/support/**/*.rb")].each {|f| require f}

RSpec.configure do |config|
  config.include RandomHelpers
  config.include Devise::TestHelpers, :type => :controller
  config.include ControllerHelpers, :type => :controller

Putting all this together allows me to write controller specs with:

before do

or, if I need access to the user double:

let(:user) { stub_sign_in }

And the payoff?

$ rspec spec/controllers
Finished in 7.36 seconds
114 examples, 0 failures

I saved as whole 15 seconds! There’s still some work to do in my specs, but stubbing out Warden has made a massive difference to every test run that involves this project’s controller microtests.

Note that the above has been tested only with the following gem versions:

  • rails 3.0.3
  • devise 1.3.4
  • warden 1.0.4
  • rspec-core 2.6.4
  • rspec-mocks 2.6.0
  • rspec-rails 2.6.1
  • factory_girl 1.3.3

A condensed gist for the above code is here.

fun and games at xp-manchester

Last night was the October 2010 meeting of XP-Manchester, a local group set up by me and Jim McDonald. As always the meeting consisted of two halves, the first being a workshop (this time led by me) and the second being a coding dojo.

For the workshop this month I ran a version of James Shore’s Offing the Offsite Customer game, as described by Kane Mar and using Kane’s drawings as the requirements. I hadn’t run the session before, and it turned out really well. We had 19 participants, so we split into two roughly equal-sized teams, with each team further split equally between a group of Product Owners and a group of Developers.

In the first run-through neither team managed to create a diagram that looked anything like the requirement; whereas in the second attempt both teams produced very good diagrams, and well inside the allotted time. The difference was born in the team retrospectives between the two runs. Both teams independently decided to work much more iteratively and interactively second time around, and it paid off. It could be said that in the first run, the teams focussed on perfecting the written spec, whereas in the second run the teams focussed on perfecting the working diagram. This focus on evolving a diagram using direct feedback was “invented” independently by both teams, and towards the end of the second run they even had free time available for fine-grained polishing.

The second part of the evening was a dojo. Jim introduced us to the Minisculus challenge set by Eden Development at the recent Software Craftsmanship 2010 event. Jim decided we should attempt the Mark I problem in Ruby, and due to the relative lack of Ruby knowledge in the room last night this meant that Mike Josephson did most of the driving. We didn’t get very far, but we did have some very interesting discussions about TDD style: After you’ve faked a return value to get quickly to GREEN, what’s the best step to take next? Is it better to add another test in order to triangulate towards a more general solution, or is it better to treat the fake return value as duplication and fix that by moving specifics up into the test? We explored the latter approach last night, and no doubt we’ll continue the debate next month.

Many thanks to Jim and Mike for running things, and to everyone else for joining in!

XP-Manchester happens after work on the second Thursday of every month at Madlab in Manchester’s Northern Quarter. Everyone is welcome, and if you want to come along you can get details of upcoming meetings by joining the mailing list at

microtests for exceptions

Over on the Dev@Pulse blog Jim raises some interesting points about how and when to microtest the messages in thrown exceptions. Jim’s discussion got me thinking about how I solve the same problem myself, and on reflection I discovered I use a slightly different tactic than Jim’s:

I write one test for each important piece of information that must be conveyed by the exception. For example, there might be a test that checks the exception’s message for the incorrect business value that caused the error; and maybe another that checks for the name of the throwing object. Etc etc.

Here’s a simple example, based on some of Reek‘s microtests:

context 'when the parser fails' do
  before :each do
    @parser_error = 'Error message'
    parser = mock('parser')
    @source_name = 'source/file/path.rb'
    @src ='def unused() end', @source_name, parser)
  it 'raises a SyntaxError' do
    lambda {@src.syntax_tree}.should raise_exception(SyntaxError)
  it 'reports the source name' do
    lambda {@src.syntax_tree}.should raise_exception(SyntaxError, /#{@source_name}/)
  it 'reports the parser error message' do
    lambda {@src.syntax_tree}.should raise_exception(SyntaxError, /#{@parser_error}/)

This scheme has two advantages: First, there are fewer direct dependencies on incidental text in the error message, which means the microtests are less brittle, and hence less likely to be affected by simple changes of wording. And second, the tests clearly indicate which parts of the error are “important” – so if one of them later fails I get a reminder that I’ve lost information, and thus a gentle prod to re-consider the most recent code change.

Everyone writes code that throws exceptions, so I’m sure these aren’t the only tactics for microtesting them; what do you do?

extract class or substitute algorithm?

I have a dilemma. In fact, I frequently have this dilemma, so I thought I’d ask your opinion on which way you would address it.

I just completed a CRC session for the next feature I want to add to my codebase. One of the classes identified for the new scenarios is a particular kind of counting collector, and I realised that the same behaviour already occurs elsewhere in the codebase. So one of the first steps in coding up this new scenario will be to extract a new class from an existing class. The existing class has reasonable micro-tests, and therein lies the dilemma:

  • I could simply use the Extract Class refactoring recipe, and I’d be done quite quickly; the new collector class would be tested indirectly, via the micro-tests for the original class (which would now be the extracted class’s only client).
  • Or I could develop the collecting class test-first, from scratch, so that it had its own micro-tests, and then use Substitute Algorithm, reimplementing the old class to use the new.

The advantage of the second approach is that it gives me precise micro-tests of the new class. But while developing that class, I could drift away from the required usage, or I could make it harder to switch the original class over to using the new one. (I tend to think of Substitute Algorithm as being riskier than Extract Class.)

So which is better — use Extract Class and know that I have a working client-supplier relationship; or TDD from scratch and risk going off track? In practice, I tend to choose Extract Class if I’m feeling lazy or distracted, whereas I go with TDD followed by Substitute Algorithm when I’m feeling organized; in the latter case I usually get stressed by the “substitute” step…

(It seems to me that if we choose the latter approach in general, many refactorings should cease to be used. That can’t be right, can it?)