I’m about to TDD a Ruby class whose behaviour will involve the use of random numbers. I expect the algorithms within the class to evolve as I implement new stories, so I don’t want to design and build a testing mechanism that will be brittle when those changes occur. But before I can write the next example, I need to figure out how to control the random numbers needed by the code under test. Off the top of my head I can think of four options:

  1. One way would be to set a fixed seed just before each test and then simply let the random algorithm do its thing. But for each new behaviour I would need to guess the appropriate seed, which is likely to be time-consuming. Furthermore, the relationship between each magic seed and the specific behaviour tested is likely to be obscure, possibly requiring comments to document it for the future reader. And finally, if the algorithm later evolves in such a way as to consume random numbers in a different order or quantity, the seed may turn out to be inappropriate, leading to broken tests or, worse, tests that pass but which no longer test the planned behaviour.
  2. Alternatively I could redefine Kernel::rand — but that could potentially interfere with stuff I don’t know about elsewhere in the object space.
  3. Within the class under test I could self-encapsulate the call to Kernel::rand, and then override the encapsulating method in a subclass for the tests. But then I’m not testing the class itself.
  4. Finally, I could parameterize the class, passing to it an object that generates random numbers for it. This option appears to give me complete control, without being too brittle or trampling on anyone else in the object space.
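
The fragility of option 1 can be seen in a small sketch. The seed value 1234 and the range 6 are arbitrary choices made for illustration; nothing here comes from the class under test.

```ruby
# Two generators built from the same seed replay the same sequence.
a = Random.new(1234)
b = Random.new(1234)
same_start = (a.rand(6) == b.rand(6))   # always true for equal seeds

# Suppose the code under test evolves and consumes one extra number...
a.rand(6)

# ...then the next draws come from different positions in the sequence,
# so a seed chosen for the old behaviour may no longer select the case
# the test was written for.
next_a = a.rand(6)
next_b = b.rand(6)
```

The test would still run, but whether next_a and next_b agree is now a matter of luck rather than design.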

So I’ll go with option 4. Right now, though, I’m not sure what interface the randomizer object should provide to the calling class. Looking ahead, I expect I’ll most likely want to select a random item from an array, which means selecting a random integer in the range 0...(array.length). And for this next example all I’ll need is a couple of different randomizers that return 0 or 1 on every request, so I’ll simply pass in a proc:

obj.randomize_using { |ceil| 0 }

And if ever I need to provide a specific sequence of varying random numbers, I can do it like this:

rands = [1, 0, 2]
obj.randomize_using { |ceil| rands.shift || 0 }
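
For concreteness, here is a minimal sketch of what the receiving side might look like. The class name Picker, its pick method and the default randomizer are assumptions made for this illustration; only randomize_using and the block's ceil argument come from the examples above.

```ruby
# Hypothetical class under test: picks an item from a list using an
# injected randomizer block.
class Picker
  def initialize(items)
    @items = items
    # Fall back to the real random source when no block is supplied.
    @randomizer = ->(ceil) { rand(ceil) }
  end

  # Replace the random source with any block that accepts a ceiling.
  def randomize_using(&block)
    @randomizer = block
    self
  end

  # Ask the randomizer for an index in 0...(items.length).
  def pick
    @items[@randomizer.call(@items.length)]
  end
end

picker = Picker.new(%w[a b c])

picker.randomize_using { |ceil| 0 }
first = picker.pick                       # "a"

rands = [1, 0, 2]
picker.randomize_using { |ceil| rands.shift || 0 }
sequence = Array.new(4) { picker.pick }   # ["b", "a", "c", "a"]
```

Because the production default simply delegates to rand, the tests never need to touch Kernel or a seed.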

Later that same day…

The class I’m developing has evolved quite a lot and split into three. And suddenly, with the most recent change, three of the tests have begun failing. A little investigation reveals that the code is now consuming a random number when it didn’t need to in the past, and so some of my randomizer procs now provide inappropriate values. It turns out that two of the failing examples actually boil down to being a single test of a piece that has now been refactored out into another class; by refactoring the tests to match I can remove the dependency on random numbers altogether. And the last broken test is fixed by providing a randomizer that respects the ceiling passed to it (not an unreasonable request):

obj.randomize_using { |ceil| [2, ceil-1].min }

This works, and I get no more surprises during the session.
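
With hindsight, the two ideas (a scripted sequence and a respected ceiling) could be combined in one small test helper. The sketch below is only an illustration; the name scripted_randomizer is hypothetical and was not part of the session described above.

```ruby
# Hypothetical test helper: builds a randomizer that replays the given
# values in order, clamps each one to the ceiling it is passed, and
# falls back to 0 once the values run out.
def scripted_randomizer(*values)
  queue = values.dup
  ->(ceil) { [queue.shift || 0, ceil - 1].min }
end

randomizer = scripted_randomizer(2, 5)
randomizer.call(10)   # 2
randomizer.call(3)    # 2, because 5 is clamped to ceil - 1
randomizer.call(4)    # 0, because the script is exhausted
```

It would be passed to the object in the same way as the bare procs, for example obj.randomize_using(&scripted_randomizer(2)).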

The more I think about it, the more astonished I become. Maintenance contracts for (bespoke) software: Buying insurance to cover against the possibility that the software doesn’t work.

I know the consumer electronics industry does the same, and I always baulk at the idea of spending an extra fifty quid in case the manufacturer cocked up. I wonder what percentage of purchasers buy the insurance? And I wonder what percentage of goods are sent back for repairs? Perhaps the price could be increased by 10% and all defects fixed for free. Or perhaps the manufacturer could invest a little in defect prevention.

It seems to me that software maintenance contracts are an addiction. Software houses undercut each other to win bids, and then rely on the insurance deal to claw back some profits. So no-one is incentivised to improve their performance, and in fact the set-up works directly against software quality. Perhaps it’s time now to break that addiction…

If a software house were able to offer to fix all defects for free, would that give them enough of a market advantage to pay for the investment needed to prevent those defects? Is “zero defects or we fix it for free” a viable vision? (Does any software house offer that already?) And how many software companies would have to do it before the buyer’s expectations evolved to match?

As an industry, do we now know enough to enable a few software houses to compete on the basis of quality?

This latest edition of the Carnival focuses on what is rapidly becoming a cornerstone of agile methods: Test-Driven Development. Or Test-Driven Design if you prefer. Or Example-Driven Development. Or Behaviour-Driven Development.

First up, Jeremy Miller discusses Designing for Testability: “I have yet to see a codebase that wasn’t built with TDD that was easy to test.”

New blogger Eric Mignot gets a lot of Pleasure from introducing a developer to the joys of TDD: “He had discovered that even if you don’t use it from the beginning of your project, TDD is the most fun and efficient way to correct bugs.”

And David Laribee speaks about the suspension of disbelief to those who have never tried TDD, afraid to make the leap of faith: “Check your disbelief at the door, stay in the cycle, and wait for the payoff. A bit further down the path, when you’ve finished the story and tests are passing, you’ll all of a sudden have a working, shippable, F5-able feature without ever having hit the debugger.”

Brian Marick laments the standard pitfall of demonstrating TDD using an existing codebase: “The usual reason it fails is that the code wasn’t written test-first, so it’s hard to do anything without instantiating eighteen gazillion objects.” So Brian has created a (draft) TDD workbook, in which you get to add features to a real application that has already been developed for you using TDD; potentially the best of both worlds.

When it comes to writing good tests (as opposed to just writing tests), many bloggers - most recently James Newkirk - have mentioned or reviewed Gerard Meszaros’ xUnit Test Patterns, which Eugene Wallingford says is really three books in one: “I have the best of two worlds: a relatively short, concise, well-written story that shows me the landscape of automated unit testing and gets me started writing tests, plus a complete reference book to which I can turn as I need”.

If you prefer behaviour-driven development (BDD) to TDD, Dan North has recently developed rbehave for Ruby: “Inspired by [rspec], I wanted to find a simple and elegant way in Ruby to describe behaviour at the application level.” Moments later Joe Ocampo ported the idea to NUnit for C# too! The BDD world is fast-moving right now, with new tools and new experience posts popping up all over; is it the “next big thing”?

And finally, please nominate someone for one of this year’s Gordon Pask awards. If you don’t know what they are, or what they mean to the software development community, read the thoughts of Laurent Bossavit, one of last year’s winners. And then nominate someone from your corner of that community.

(While putting this carnival together I was shocked to discover that many recent “TDD” blog posts involved writing code and then writing tests. As so often, it seems the buzzword has greater velocity than the practice itself…)

To suggest items for a future carnival - especially from a blog we haven’t featured before - email us at agilists.carnival@gmail.com. All previous editions of the Carnival are referenced at the Agile Alliance website. The next carnival is due to appear around July 19, hosted by Pete Behrens.

In Quantify the cost of the things you wouldn’t have had to do, if only… Dave Nicolette attempts to calculate the overall cost to his project of a single defect. He arrives at a total of $8,712.50! First, there’s the time spent by the latest developer who got caught by the bug; then there’s the time spent by his two teammates when they helped him debug and fix it; and then there’s the multiplier for all the times this same defect had occurred, but gone unfixed, during the previous two years.

Dave’s figures are compelling, and quite possibly fall on the conservative side - for example, there may have been occasions in the past in which the application hung up during testing, but no-one reported the problem.

I prefer to leave the figure in developer-hours (102.5 in this case), so that we have an indication of the amount of backlog that wasn’t addressed due to this bug: the broken window cost the project almost three developer weeks of progress! If that time had been used to add just one more story, or to release just a little sooner, I wonder what the knock-on costs of this bug would turn out to be, in terms of lost opportunity revenues etc…

I’ve just noticed that one of my practices has changed subtly, probably sometime during the last six months or so:

I’ve always (i.e. at least for the last twenty-five years) kept a to-do list during each programming episode. The list begins with one entry, describing the goal I’m working towards. And as I read and write code, new entries are added to make sure I go back and fix or refactor everything I see along the way. Until recently, those additional entries tended to describe an outcome - “split class Foo to create Bar and Xyzzy”, or “wrap x and y into a Mouse class”, for example. But recently I’ve noticed myself tending to write entries that describe the problem - “class Foo too large”, “x,y travel around together”.

I have no idea how, why or when this change happened, but I’m pleased it did. Between the time I write a particular to-do item and the time I finally get around to executing it, I may significantly change the code in that area. Or I may learn something about the code and where it wants to go next. Or the problem may disappear, as a side-effect of some other change. Whichever happens, the solution I might have on my to-do list has a chance of being inappropriate now that the code and I have moved on. So by listing the problem as I originally saw it, I’m giving myself a much better chance of creating the right solution for it - because I’m deciding on that solution in the presence of the full facts.

I’ve recently been re-doing a few kata that I first tried a couple of years ago. I’ve noticed that my results are dramatically better now, mostly I think due to delaying the moment at which I decide how to solve each micro-problem. This effect is an example of Mary Poppendieck’s “decide as late as possible” maxim. There are clear benefits in terms of the quality of my designs; and there are psychological benefits too, because I don’t have to spend time trying to remember why I wanted to do each item on the list. Everyone wins.

a testing dilemma

March 31, 2007

When unit test coverage is low, situations can arise in which those unit tests do not perform well as regression tests.

One of the tenets of agile development is that the time spent automating unit tests is repaid many times over, because those tests are supposed to become a regression test suite that will catch recurring defects before they escape to the customer. But it seems to me that effect only occurs when test coverage is high – with sparse tests a defect can recur without setting off the alarms.

I came to that conclusion after a development episode the other day. At first I found it surprising, because it seems to cut across one of the central concepts of agile development. Turn it around, though, and it becomes another supporting argument for test-driven development: sparse, ad-hoc tests just aren’t good enough, and won’t help to reduce cycle time – only near-100% coverage will do. I’ll tell the whole story, and please let me know what you conclude after reading it…

I have an application consisting of 30,000 lines of high quality C++ code that has only a few dozen automated unit tests – and even fewer reported defects. The code is hard to test, as it runs in the background as a server, and interacts heavily with Windows and with hardware devices. It makes use of a small number of configuration files, which it locates by assuming they are all in the same folder as the executable itself.

A week or so ago I spent a day getting this server into a test harness. As a result I now have one Watir test, with the server process itself hooked up to various loopbacks, emulators and mock configuration files. It is easy to see why the server’s tests were all manual before now, but I’m hoping that this harness breaks the ice and encourages the development of a reasonably thorough automated acceptance test suite (in the fullness of time). I also have an Ant build file that constructs this Heath-Robinson contraption and then runs that one test.

You won’t be surprised to hear that the server has never been run in this manner before, and consequently the test found a new defect. Under the conditions required by the harness, the server looks in the wrong place for its configuration files, because the algorithm that calculates their location doesn’t cater for every circumstance. An important point to note is that the bug in the algorithm took me over an hour to locate: beginning with a Watir test reporting that the title of a webpage was incorrect, through to running the whole shebang with the server sitting in the debugger and watching string comparisons trudge past. In this situation the cause (a broken regular expression for matching filenames) and the effect (an HTML page with the wrong title) are a long way apart, and the internal design of the server’s business objects doesn’t support rapid problem solving.

I dutifully added the defect to my project backlog, and yesterday the customer pushed it to the top of the queue. So today I set about fixing it. The faulty code consisted of two lines buried in a 200-line method. The method had no tests, and was far too complex to even contemplate writing any. So I performed ExtractMethod on the two lines, in order to isolate the (quite simple) changes I wanted to make. I then wrote a couple of tests of this new method, to characterize those cases in its current behaviour that were already correct; and then I TDD’d in the cases necessary to drive the bugfix. Done and dusted in fifteen minutes (excluding the hour spent diagnosing the problem the other day). So now I have an untested 199-line method that calls my extracted, fixed and heavily tested method. And just to make sure, the Watir test now works too.

It seems as if this entire episode has gone pretty much by the book. After all, the defect was found by an automated acceptance test, and there are now some unit tests that will also fail if the bug ever reappears. Moreover, the pin-point nature of those unit tests will very accurately identify any problems in my new little method, saving hours of harness-unpicking and server debugging for some future developer. Well, maybe.

My approach to fixing the defect was to isolate and extract the faulty code into a new method, and then test and fix that extracted method. But recall that tests for the server are few and far between (these were the first tests for this class, for example). In particular there are no tests for the caller - and therefore no tests to check that the new method is ever used, or that it is used correctly. It would therefore be quite easy for some future developer to change the (untested) calling method, replace the call to my (tested) bugfix method, and thereby unwittingly re-introduce the bug. My little method will still be there, and its tests will still be passing. But it won’t be called from anywhere in the server, and so it will no longer be doing its job. Of course, sometime later the next acceptance test run will fail due to some webpage’s title being wrong. But no unit tests will fail, so there will be no indication that this defect ever occurred before (and was fixed). So then the hunt for the problem will begin again, incurring roughly the same amount of harness-unpicking and debugger-watching as it cost me. My unit tests – my TDD’d bugfix - will have saved that future developer no time at all.

The problem arises because there are no intermediate “integration” tests between my TDD’d bugfix (deep in the bowels of the server’s initialisation code) and the acceptance test (which looks at the behaviour of the server’s client many seconds later). So although TDDing helped me to be sure I wrote correct code, the resulting unit tests won’t serve very well for regression. Testing that one method in isolation has ensured the method works, but cannot ensure that it is ever used.
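
The gap can be sketched in a few lines. The real server is C++; the Ruby below is only a stand-in, and the class, the method names and the backslash-only regular expression are all invented for illustration rather than taken from the actual code.

```ruby
# Hypothetical stand-in for the server; every name here is illustrative.
class Server
  # The extracted, fixed and well-tested method.
  def config_dir(exe_path)
    File.dirname(exe_path)
  end

  # The long, untested caller. A later edit has replaced the call to
  # config_dir with an inline calculation that brings the old bug back.
  def config_file(exe_path, name)
    dir = exe_path.sub(/\\[^\\]*$/, "")   # only understands backslashes
    File.join(dir, name)
  end
end

server = Server.new

# The unit test for the extracted method still passes...
unit_test_passes = server.config_dir("/opt/app/server") == "/opt/app"

# ...but the caller no longer uses that method, so the defect is back
# and only an end-to-end check would notice it.
actual = server.config_file("/opt/app/server", "server.ini")
bug_is_back = actual != "/opt/app/server.ini"
```

Both unit_test_passes and bug_is_back end up true: the tested method is correct, and it is also dead code.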

So it seems that when test coverage is low, the lack of intermediate-level tests means that fixed defects could recur. It also means that we cannot be sure of finding the bug more quickly the second time around: tests for unused code don’t help to light the way, because they will never fail.

Is this just an artefact of the general lack of unit tests for the server? That is, if the server had near 100% unit test coverage, would this problem evaporate? I don’t know. However, I do believe that this problem wouldn’t arise if the server had been developed using TDD - because late details such as the existence of configuration files would have been inserted into existing TDD’d code as extensions, and therefore the call to my modified algorithm would very likely already be tested.

Perhaps this means that the value of unit tests increases geometrically as their coverage approaches 100%. Perhaps greater coverage means that each test is harder to bypass, which in turn increases the value of each test for regression and debugging…?

breaking the ice

February 25, 2007

A new project and a new pile of legacy code always present me with a large psychological barrier: where should I begin testing, and where should I begin trying to understand what’s where?

Bill Wake, in the Planning Game Simulator chapter of his Refactoring Workbook, has a great answer. Bill suggests that we begin by writing just one test - any test - for every class in the system. The tests can be as simple as you like; their only purpose is to break the ice. Not only do these tests get every class into the test harness, they also prove to me that it’s possible to get every class under test - and that removes the psychological barrier completely. From there it is always a lot easier to continue writing more tests.

I saw this effect for real recently on a client project, and we also experienced it at this month’s AgileNorth coding dojo. We wanted to refactor a smell away from some of the “GUI” code in Wake’s planning game simulator exercise, and someone suggested we needed to make the refactoring safe by writing a test first. But this was GUI code, so the temptation was to shrug and claim it was impossible. Undeterred, Guy wrote a couple of lines that created the window, pushed a button and checked the resulting number of cards in the backlog, and suddenly we were off and running. The refactoring we had in mind was now safe, and the code now seemed like it was ours.

At that point another interesting effect occurred. We looked at the few tests we now had, and we saw smells in the product code - smells we hadn’t seen when we reviewed it at the start of the evening. The test code was revealing usability smells in the class’ interfaces, and our understanding of what to do and where the code wanted to go deepened.

Even a two-line test can have a dramatic impact on the stress of dealing with “untamed” code.

test-driven installation

January 2, 2007

Although I’ve been administering (to) a Joomla site for a while now, I recently had the opportunity to install Joomla from scratch for the first time. The process was remarkably smooth, and with a little help from friends around the web it was done in no time at all. And surely one of the keys to the simplicity of the installation was this: After untarring the distribution files, the next step is to fire up a browser and load the URL of the installation directory, which shows the first page of the Joomla installation process. The page is essentially a test report; in my case it was mostly green, with a couple of red markers that showed things I needed to fix before installation could proceed.

This is rather cool - test-driven installation - and it’s becoming more common these days (I saw the same thing when I installed MovableType for this blog). The page tells me what the software expects from its environment, and also where I currently fail to meet those expectations. Neat. In the Joomla case I had test failures telling me that some files weren’t writeable, and that I had no MySQL support in PHP. I fixed those and refreshed the page - green! On to the next page in the Joomla installation sequence, and the installation worked fine.
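
The idea is easy to sketch. The following is not how Joomla implements its checks; it is a hypothetical pre-flight report in Ruby, with made-up check names, that shows the shape of the technique: name each expectation, test it, and report green or red.

```ruby
require "tmpdir"

# Each entry pairs a description of an expectation with a block that
# returns true when the environment meets it. The checks themselves
# are illustrative only.
def preflight(checks)
  checks.map { |name, check| [name, check.call ? :green : :red] }.to_h
end

# A temporary directory stands in for the installation directory.
report = Dir.mktmpdir do |dir|
  preflight(
    "install directory is writable" => -> { File.writable?(dir) },
    "configuration file exists"     => -> { File.exist?(File.join(dir, "config.php")) }
  )
end

report.each { |name, status| puts "#{status}: #{name}" }

# Installation should proceed only when every check is green.
ready = report.values.all? { |status| status == :green }
```

Fixing the environment and re-running the report plays the same role as refreshing the installer page.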

As an aside, this is one case of a genuinely test-driven process. The distributed package is shrink-wrapped, intended to be installed many times under varying conditions, and the tests are a kind of poka-yoke. Their aim is to ensure that the upstream process (preparing your environment) is completed correctly without passing errors on to the later step. (Contrast this with the example-driven processes we use in “test-driven development”. Here we create sample situations in order to assist - and constrain - the design of something new, during a process that essentially occurs only once.)

why refactor?

November 14, 2006

The (current) canon of dogmatic Agile (with a capital ‘A’) and test-driven development demands that we RefactorMercilessly at every turn, but that seems somewhat simplistic to me. When I refactor, I do indeed strive to refactor mercilessly; but sometimes I don’t feel the need to refactor at all…

Given a suite of acceptance tests for a software system, there is clearly a (countably) infinite set of possible program sources that will pass those tests. (If you don’t see that, consider that every variable in the program has an infinite set of possible names; the same is also true of function parameter ordering, curly brace positioning, sorting algorithm, method breakdown, class naming, whitespace…) And as the system grows during development and maintenance, we navigate this infinite space of programs. Every time we write, change or remove a line of code, our system moves to a new point in that space; every time we add or change an acceptance test, we define a new (countably infinite) space of possible programs, and our system instantly sits at one point in that space.

And so every time we add a new feature, or fix a defect, we have an infinite number of possible points to which we could move our system within the program space and still pass all the acceptance tests. Furthermore, these “neighbouring” points are not all equivalent: the points in the program space have attributes, or qualities. Each program possesses different amounts of readability, maintainability, resistance to certain directions of change, coupling, cohesion, and so on. Which means that each time we write a line of code we make choices about the qualities of the resulting system. And all too often we forget to make those choices consciously.

Now those qualities I mentioned all have one thing in common: they directly influence both the speed and cost of change of the system. That is, each point in the program space, given a fixed set of acceptance tests, will support some kinds of change cheaply and other kinds more expensively. So as we develop and maintain we should consciously select programs which align with our system’s intended lifecycle; in particular, a write-once-and-forget-about-it system will need different cost-of-change qualities than a system that is intended to survive and be supported for more than a few months.

Refactoring is the means by which we move around the program space. As soon as we have a passing test, we can cast around for a nearby program whose qualities are a good-to-great match for our business needs. How much time we spend doing this is an investment trade-off against those future needs, and should be judged accordingly. Sometimes any old thing is fine; other times it is worth chasing down every last ounce of duplication.

The point of all this is that every day the programmers make choices which affect the system’s qualities. We should do so consciously, and according to the needs of our business and of our users.

I’ve been stretching my TDD practice recently by dipping into the murky world of C++ again after a break of a few years. After reading a number of comparisons - notably this one by Noel Llopis - we settled on CxxTest as the project’s TDD framework. I then spent a few days converting and refactoring a few existing tests, and writing a load of new ones, to get an initial test suite going.

I discovered one big surprise along the way: unlike JUnit et al, each test suite class is only instantiated once, regardless of how many testXXX() methods it has. This means that the suite’s setUp() and tearDown() methods are always acting on the same fixture instance throughout the test run; and the test suite’s constructor is only invoked once. So for test independence it’s essential to hold the fixture’s objects only as pointers in the suite’s state, creating them in setUp() and deleting them in tearDown(), so that every test starts with fresh objects and nothing leaks from one test into the next.

Another minor irritation is that CxxTest doesn’t easily allow me to run a subset of the tests. This is a useful feature when working with legacy code, because it’s sometimes hard to locate a crash among a few hundred tests.

What else should I know about? Are there any other gotchas? Have you found other styles of working with CxxTest?

(Note: CxxTest is also commonly, and mistakenly, known as CxxUnit.)