When unit test coverage is low, situations can arise in which those unit tests do not perform well as regression tests.
One of the tenets of agile development is that the time spent automating unit tests is repaid many times over, because those tests are supposed to become a regression test suite that will catch recurring defects before they escape to the customer. But it seems to me that effect only occurs when test coverage is high – with sparse tests a defect can recur without setting off the alarms.
I came to that conclusion after a development episode the other day. At first I found it surprising, because it seems to cut across one of the central concepts of agile development. Turn it around, though, and it becomes another supporting argument for test-driven development: sparse, ad-hoc tests just aren’t good enough, and won’t help to reduce cycle time – only near-100% coverage will do. I’ll tell the whole story, and please let me know what you conclude after reading it…
I have an application consisting of 30,000 lines of high quality C++ code that has only a few dozen automated unit tests – and even fewer reported defects. The code is hard to test, as it runs in background as a server, and interacts heavily with Windows and with hardware devices. It makes use of a small number of configuration files, which it locates by assuming they are all in the same folder as the executable itself.
A week or so ago I spent a day getting this server into a test harness. As a result I now have one Watir test, with the server process itself hooked up to various loopbacks, emulators and mock configuration files. It is easy to see why the server’s tests were all manual before now, but I’m hoping that this harness breaks the ice and encourages the development of a reasonably thorough automated acceptance test suite (in the fullness of time). I also have an Ant build file that constructs this Heath-Robinson contraption and then runs that one test.
You won’t be surprised to hear that the server has never been run in this manner before, and consequently the test found a new defect. Under the conditions required by the harness, the server looks in the wrong place for its configuration files, because the algorithm that calculates their location doesn’t cater for every circumstance. An important point to note is that the bug in the algorithm took me over an hour to locate: beginning with a Watir test reporting that the title of a webpage was incorrect, through to running the whole shebang with the server sitting in the debugger and watching string comparisons trudge past. In this situation the cause (a broken regular expression for matching filenames) and the effect (an HTML page with the wrong title) are a long way apart, and the internal design of the server’s business objects doesn’t support rapid problem solving.
I dutifully added the defect to my project backlog, and yesterday the customer pushed it to the top of the queue. So today I set about fixing it. The faulty code consisted of two lines buried in a 200-line method. The method had no tests, and was far too complex to even contemplate writing any. So I performed ExtractMethod on the two lines, in order to isolate the (quite simple) changes I wanted to make. I then wrote a couple of tests of this new method, to characterize those cases in its current behaviour that were already correct; and then I TDD’d in the cases necessary to drive the bugfix. Done and dusted in fifteen minutes (excluding the hour spent diagnosing the problem the other day). So now I have an untested 199-line method that calls my extracted, fixed and heavily tested method. And just to make sure, the Watir test now works too.
It seems as if this entire episode has gone pretty much by the book. After all, the defect was found by an automated acceptance test, and there are now some unit tests that will also fail if the bug ever reappears. Moreover, the pin-point nature of those unit tests will very accurately identify any problems in my new little method, saving hours of harness-unpicking and server debugging for some future developer. Well, maybe.
My approach to fixing the defect was to isolate and extract the faulty code into a new method, and then test and fix that extracted method. But recall that tests for the server are few and far between (these were the first tests for this class, for example). In particular there are no tests for the caller – and therefore no tests to check that the new method is ever used, or that it is used correctly. It would therefore be quite easy for some future developer to change the (untested) calling method, replace the call to my (tested) bugfix method, and thereby unwittingly re-introduce the bug. My little method will still be there, and its tests will still be passing. But it won’t be called from anywhere in the server, and so it will no longer be doing its job. Of course, sometime later the next acceptance test run will fail due to some webpage’s title being wrong. But no unit tests will fail, so there will be no indication that this defect ever occurred before (and was fixed). So then the hunt for the problem will begin again, incurring roughly the same amount of harness-unpicking and debugger-watching as it cost me. My unit tests – my TDD’d bugfix – will have saved that future developer no time at all.
The problem arises because there are no intermediate “integration” tests between my TDD’d bugfix (deep in the bowels of the server’s initialisation code) and the acceptance test (which looks at the behaviour of the server’s client many seconds later). So although TDDing helped me to be sure I wrote correct code, the resulting unit tests won’t serve very well for regression. Testing that one method in isolation has ensured the method works, but cannot ensure that it is ever used.
So it seems that when test coverage is low, that lack of intermediate-level tests means that fixed defects could recur. It also means that we cannot be sure to find the bug more quickly second time around: tests for unused code don’t help to light the way, because they will never fail.
Is this just an artefact of the general lack of unit tests for the server? That is, if the server had near 100% unit test coverage, would this problem evaporate? I don’t know. However, I do believe that this problem wouldn’t arise if the server had been developed using TDD – because late details such as the existence of configuration files would have been inserted into existing TDD’d code as extensions, and therefore the call to my modified algorithm would very likely already be tested.
Perhaps this means that the value of unit tests increases geometrically as their coverage approaches 100%. Perhaps greater coverage means that each test is harder to bypass, which in turn increases the value of each test for regression and debugging…?