a testing dilemma

When unit test coverage is low, situations can arise in which those unit tests do not perform well as regression tests.

One of the tenets of agile development is that the time spent automating unit tests is repaid many times over, because those tests are supposed to become a regression test suite that will catch recurring defects before they escape to the customer. But it seems to me that effect only occurs when test coverage is high – with sparse tests a defect can recur without setting off the alarms.

I came to that conclusion after a development episode the other day. At first I found it surprising, because it seems to cut across one of the central concepts of agile development. Turn it around, though, and it becomes another supporting argument for test-driven development: sparse, ad-hoc tests just aren’t good enough, and won’t help to reduce cycle time – only near-100% coverage will do. I’ll tell the whole story, and please let me know what you conclude after reading it…

I have an application consisting of 30,000 lines of high quality C++ code that has only a few dozen automated unit tests – and even fewer reported defects. The code is hard to test, as it runs in background as a server, and interacts heavily with Windows and with hardware devices. It makes use of a small number of configuration files, which it locates by assuming they are all in the same folder as the executable itself.

A week or so ago I spent a day getting this server into a test harness. As a result I now have one Watir test, with the server process itself hooked up to various loopbacks, emulators and mock configuration files. It is easy to see why the server’s tests were all manual before now, but I’m hoping that this harness breaks the ice and encourages the development of a reasonably thorough automated acceptance test suite (in the fullness of time). I also have an Ant build file that constructs this Heath-Robinson contraption and then runs that one test.

You won’t be surprised to hear that the server has never been run in this manner before, and consequently the test found a new defect. Under the conditions required by the harness, the server looks in the wrong place for its configuration files, because the algorithm that calculates their location doesn’t cater for every circumstance. An important point to note is that the bug in the algorithm took me over an hour to locate: beginning with a Watir test reporting that the title of a webpage was incorrect, through to running the whole shebang with the server sitting in the debugger and watching string comparisons trudge past. In this situation the cause (a broken regular expression for matching filenames) and the effect (an HTML page with the wrong title) are a long way apart, and the internal design of the server’s business objects doesn’t support rapid problem solving.

I dutifully added the defect to my project backlog, and yesterday the customer pushed it to the top of the queue. So today I set about fixing it. The faulty code consisted of two lines buried in a 200-line method. The method had no tests, and was far too complex to even contemplate writing any. So I performed ExtractMethod on the two lines, in order to isolate the (quite simple) changes I wanted to make. I then wrote a couple of tests of this new method, to characterize those cases in its current behaviour that were already correct; and then I TDD’d in the cases necessary to drive the bugfix. Done and dusted in fifteen minutes (excluding the hour spent diagnosing the problem the other day). So now I have an untested 199-line method that calls my extracted, fixed and heavily tested method. And just to make sure, the Watir test now works too.

It seems as if this entire episode has gone pretty much by the book. After all, the defect was found by an automated acceptance test, and there are now some unit tests that will also fail if the bug ever reappears. Moreover, the pin-point nature of those unit tests will very accurately identify any problems in my new little method, saving hours of harness-unpicking and server debugging for some future developer. Well, maybe.

My approach to fixing the defect was to isolate and extract the faulty code into a new method, and then test and fix that extracted method. But recall that tests for the server are few and far between (these were the first tests for this class, for example). In particular there are no tests for the caller – and therefore no tests to check that the new method is ever used, or that it is used correctly. It would therefore be quite easy for some future developer to change the (untested) calling method, replace the call to my (tested) bugfix method, and thereby unwittingly re-introduce the bug. My little method will still be there, and its tests will still be passing. But it won’t be called from anywhere in the server, and so it will no longer be doing its job. Of course, sometime later the next acceptance test run will fail due to some webpage’s title being wrong. But no unit tests will fail, so there will be no indication that this defect ever occurred before (and was fixed). So then the hunt for the problem will begin again, incurring roughly the same amount of harness-unpicking and debugger-watching as it cost me. My unit tests – my TDD’d bugfix – will have saved that future developer no time at all.

The problem arises because there are no intermediate “integration” tests between my TDD’d bugfix (deep in the bowels of the server’s initialisation code) and the acceptance test (which looks at the behaviour of the server’s client many seconds later). So although TDDing helped me to be sure I wrote correct code, the resulting unit tests won’t serve very well for regression. Testing that one method in isolation has ensured the method works, but cannot ensure that it is ever used.

So it seems that when test coverage is low, that lack of intermediate-level tests means that fixed defects could recur. It also means that we cannot be sure to find the bug more quickly second time around: tests for unused code don’t help to light the way, because they will never fail.

Is this just an artefact of the general lack of unit tests for the server? That is, if the server had near 100% unit test coverage, would this problem evaporate? I don’t know. However, I do believe that this problem wouldn’t arise if the server had been developed using TDD – because late details such as the existence of configuration files would have been inserted into existing TDD’d code as extensions, and therefore the call to my modified algorithm would very likely already be tested.

Perhaps this means that the value of unit tests increases geometrically as their coverage approaches 100%. Perhaps greater coverage means that each test is harder to bypass, which in turn increases the value of each test for regression and debugging…?

2 thoughts on “a testing dilemma

  1. I think the problem with this refactoring is that it didn’t go far enough. Changing the structure by 1% won’t change how the future programmer thinks of it, and (s)he will probably bring the code back in to “save on a method call”. I think the problem is that the method you started with is 200 lines long. That is not insignificant for a whole program: I think the Ruby-Talk <-> comp.lang.ruby gateway was that long in total when it was created. Printed out, it would be over 3 pages of 66-line paper. I used to think that the instructions to keep methods short (one screen long, so about 25 lines) were just silly: clearly written code can be much longer and still well-understood. I’m finding now that when I move code out of long methods I suddenly see new ways in which the new, short methods can fail, which never occurred to me when the code was “in context”. I think this is because the brain is so poor at managing logic, especially paths through code. That ties in with the short term memory limits of 7 “chunks”. [Or maybe I’m just saving face here, and should have the courage to admit to being thick :-)] If code is about communicating with other humans, then some of them will be less bright than you, so simplifying the method by creating more methods is probably good.
    Your points about testing unused methods are interesting, and I don’t see a way around that which allows refactoring to work. Your points about test coverage make sense, though I could not assert that the relationship is geometric.

  2. Agree with all you say about long methods, although I don’t think that’s the source of the problem in this case. Even if it were broken into twenty 10-line pieces, the lack of test coverage at that level (and the calling levels above) would still mean that the same accident could occur. Indeed, it could be argued that a flotilla of small classes and methods might serve to further obscure the importance of my tested method – and thus render the accident more likely to occur.

    But just one test of that method – did we find the configuration files, for example – would kill my worry stone dead. The long method’s lack of testability is the real problem. Or the fact that I couldn’t quickly see a way to test it!

    (BTW, I meant “geometric” coloquially, in the sense of “rising faster than linearly”. Apologies for the confusion…)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s