Last week I posted about using tests as a failsafe. Silly me, I should have known just how many people would comment out the failing tests! So what makes it easier to say “oh, let’s just comment them out so we can get on with our real work”, instead of fixing the tests? I’ve seen it happen many times, and I’ve done it myself – but why?
I think time is the driving force: if we try to fix the tests we’ll miss our deadline. Why? In every case I’ve seen, those failing tests were also refactoring no-go areas. The test code is in such a mess that either it would take an age to fix, or there is no-one left on the project who understands it. So it becomes much easier (and therefore quicker) to comment them out than it would be to fix them.
Why? How did the tests get into that state? In general, it seems to me that there are two main reasons: either the team didn’t know they had to keep the tests adaptable and maintainable; or the “unit” tests actually exercise chunks that are way too big.
Why? What causes the tests to be too chunky? I think in most cases the tests were written after the production code, when it's too late to make the design decisions that lead to de-coupling and testability. In fact, whenever I see big chunky unit tests, or tests that are not malleable, I view them as a signpost to smelly production code: too much coupling, or leaked responsibilities, or inappropriate abstractions.
So it pays to work hard on the “ilities” of the test code. Keep the tests simple to set up; keep them independent from each other; remove duplication from them; and have each test check a very few things. (Oh, and don’t over-use mocks … but that’s for another day.) Refactor them just as you would refactor production code, and just as frequently. Because then they are less likely to break unexpectedly in response to application changes that are apparently unrelated. And then maybe you’ll be able to use them as a quality control…
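As a minimal sketch of what those “ilities” look like in practice, here is a tiny test class with a simple shared setup, independent tests, no duplication, and one check per test. The ShoppingCart class is invented purely for illustration, and I’ve used Minitest so the sketch runs on a modern Ruby; the same shape applies in Test::Unit.

```ruby
require 'minitest/autorun'

# Hypothetical class under test -- stands in for any domain object.
class ShoppingCart
  def initialize
    @items = []
  end

  def add(name, price)
    @items << [name, price]
  end

  def total
    @items.sum { |_, price| price }
  end
end

class ShoppingCartTest < Minitest::Test
  # Simple to set up: each test gets a fresh cart, so no test
  # depends on what another test did.
  def setup
    @cart = ShoppingCart.new
  end

  # Each test checks one thing, so a failure points at one behaviour.
  def test_new_cart_totals_zero
    assert_equal 0, @cart.total
  end

  def test_total_sums_item_prices
    @cart.add('tea', 3)
    @cart.add('cake', 4)
    assert_equal 7, @cart.total
  end
end
```

When a test like this breaks, its name and its single assertion tell you immediately which behaviour changed — which is exactly what makes fixing it cheaper than commenting it out.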
Yesterday I wrote about tolerating the red bar, when a few scary tests fail only occasionally. And today it strikes me that one of the contributors to the persistence of this situation is our tool-set.
As I said, we run the tests, get a red bar (sometimes), and then open up the test list to check that only the expected tests have red dots against them. If we see nothing untoward we just get on with the next task. But of course we aren’t really looking at the reason for the failure. Perhaps the tool itself is making it too easy for us. Or perhaps we interpret those dots too liberally?
So here’s the thought: What would life be like if our tools actually prevented check-ins while there are any failing tests? This would effectively “stop the line” until the problem was sorted out. And it would force us to address each problem while that part of the codebase was fresh in our minds.
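As a rough sketch of how a tool could “stop the line”, here’s the core of a pre-commit hook written in Ruby. The hook path and the `rake test` command are assumptions for illustration; the point is just that a non-zero exit status from the hook makes the version-control tool refuse the check-in.

```ruby
# Sketch of a "stop the line" pre-commit hook (hypothetical; the
# hook path .git/hooks/pre-commit and the test command are
# assumptions for illustration).

# Kernel#system returns true only when the command exits with
# status 0 -- i.e. only when every test passed.
def commit_allowed?(test_command)
  system(test_command) ? true : false
end

# In the real hook, the script would end with something like:
#
#   unless commit_allowed?('rake test')
#     warn 'Red bar: fix the failing tests before checking in.'
#     exit 1   # non-zero exit makes the tool refuse the check-in
#   end
```

With something like this in place, a red bar is no longer a dot to be glanced at and ignored: it physically blocks the next check-in until someone looks at the failure.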
I also suspect that peer pressure (“hey, I can’t check anything in now!”) might quickly cause us to develop a culture in which we tried to eradicate the root causes of test failures. Instead of relying on CruiseControl to “deodorise” our stinky practices…
(If you’ve tried this I’d love to hear your experiences. Drop me a line.)
So I’m sitting pairing for the first time on a project that’s been going for a couple of years. We tweak a little code and then run the tests. I’m impressed that they all run in ten seconds or so. But some of them fail, and the bar turns red. “That’s ok,” says my partner. “Some of the tests fail at random every so often.”
We hit Run again, and again we get a red bar. “Like I said. Must be another random failure.” He’s smiling and trying to be confident, but the expression on my face is making him a little unsure.
So we open up the test list, which has a little coloured dot alongside each test. Most of the dots are green, but three are red. My partner scans the list. “Yep, those are the ones that always fail.”
“Sorry – these are the three that randomly fail occasionally. Nothing to worry about.” He moves to get on with the next little refactoring step on our to-do list.
“How do you know they failed for the reason you expect? How do you know we didn’t break them in a new way? Worse still: how do you know whether those expected failures are masking something we just broke, but which isn’t being exercised when the random failure occurs?”
“We don’t. But everyone’s looked at those tests. They’re impossible to understand, and no-one knows how to fix them. We’d be here all day if we tried, and we’ve got this other stuff to do.”
Unless I’m very much mistaken, this means that those domain classes have a dependency on JUnit, which in turn means that JUnit must be on the classpath wherever his application is deployed. Now while that may be only a slight headache for deployments such as in-house or web-based projects, it would be a complete no-no for a shrink-wrapped or off-the-shelf product of any kind.
Looking at the examples on Alan’s blog, I do like the look and feel of Chris’s resulting code. But that dependency makes it hard to swallow, so I guess I’d vote for Alan’s own style as being a reasonable compromise.
On reflection, is this maybe a problem of the language? In Ruby, for example, the test suite can add shouldXXX() methods to class Object, and these methods won’t appear in deployed production code. The rSpec framework does exactly this. Yet another reason to switch to Ruby, if you haven’t already done so…
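To illustrate the idea (this is not rSpec’s actual implementation, and the method names here are invented): a file like this lives alongside the tests, so only the test suite ever loads it, and deployed code carries no trace of the testing framework.

```ruby
# Sketch of how a Ruby test suite can graft expectation methods
# onto every object. This file would be required only by the tests,
# so production deployments never see it. (The method names are
# invented for illustration; rSpec's real API differs.)
class Object
  def should_equal(expected)
    raise "expected #{expected.inspect}, got #{inspect}" unless self == expected
    true
  end

  def should_not_equal(unexpected)
    raise "did not expect #{unexpected.inspect}" if self == unexpected
    true
  end
end
```

A test can then read `(2 + 2).should_equal(4)` — the expectation vocabulary rides along with the tests, not with the domain classes, which is precisely the dependency problem the JUnit example suffers from.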
If your unit test coverage were close to 100% – for example, if your code had all been written test-first – you’d have very little problem tracking down bugs, because the unit tests would tell you most of the places the bug can’t be. But even if your coverage is much lower, unit tests can still be a very effective way to track down and isolate problems.
Bugs can often be difficult to track down. They may depend on the interaction between objects in certain states, or they may depend on certain parameters having specific values, or any of a host of other difficult-to-diagnose situations. It’s always tempting to triangulate the bug by adding print statements or by using the debugger, but to my mind that gives no payback. Time spent in the debugger is time that can never be regained. Whereas time spent writing unit tests pays for itself again and again with every build.
So instead of debugging by inspection, I now do it by writing unit tests. If I need to figure out whether an object behaves itself under certain circumstances, I write those circumstances down as examples in a test suite. I find the answer I need, and I also get a set of regression tests for free. So even the code that didn’t contain the bug is now more robust and more understandable. And by writing those tests I’ve also documented the fact that the bug wasn’t there. All this helps speed up the process next time, and improves my overall confidence in my product.
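Here’s a minimal sketch of that workflow. The price-rounding function and the “suspicious circumstance” are invented for illustration; the point is that the suspicion gets written down as an executable example rather than explored in a debugger.

```ruby
require 'minitest/autorun'

# Hypothetical suspect: does rounding behave as expected at a
# half-penny boundary? Instead of stepping through a debugger,
# the suspicious circumstance is written down as a test.
def round_price(pence)
  (pence * 2).round / 2.0 # round to the nearest half penny
end

class RoundPriceTest < Minitest::Test
  # Pin down the exact circumstance under suspicion...
  def test_half_penny_boundary
    assert_equal 10.5, round_price(10.4)
  end

  # ...and a neighbouring case, so the suite now documents where
  # the bug is *not*, and guards against regressions for free.
  def test_rounds_down_below_the_boundary
    assert_equal 10.0, round_price(10.2)
  end
end
```

Whether or not these examples expose a bug, they stay in the suite afterwards — which is the payback a debugging session never gives.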
(As usual, none of the above is new. But not everyone has read everything…)
I’m going to quote an entire email sent by Kent Beck to the [extremeprogramming] group on Yahoo, because it hits so many nails right on the head all at once:
In trying to understand what you’ve written I don’t find the traditional dichotomies of software engineering very helpful.
Unit vs. functional test — I test what I see. Sometimes I think big thoughts and I write big tests. Sometimes I think little thoughts and I write little tests. My first test of a system is a big test of a little system. Are these unit or functional tests? The question doesn’t help me.
Top-down vs. bottom-up — I try to have a “whole” system quickly, then expand the parts that need expanding. Is this top-down or bottom-up? The question doesn’t help me.
I could go on with other dichotomies that aren’t helpful to me — black box vs. white box, programming vs. QA, phases, customers vs. programmers. I wonder how our thinking about software development got divided into such tidy-yet-unhelpful little boxes.
In A New Look at Test Driven Development Dave Astels adds to the growing movement to get away from the name test-driven development. He plumps for behaviour-driven, whereas I had opted for example-driven. Either is fine by me, as long as we get away from “test”-driven. Keith Ray quotes from Dave’s article, and I liked this paragraph so much that I just had to repeat it here:
It’s about figuring out what you are trying to do before you run off half-cocked to try to do it. You write a specification that nails down a small aspect of behaviour in a concise, unambiguous, and executable form. It’s that simple. Does that mean you write tests? No. It means you write specifications of what your code will have to do. It means you specify the behaviour of your code ahead of time. But not far ahead of time. In fact, just before you write the code is best because that’s when you have as much information at hand as you will up to that point. Like well done TDD, you work in tiny increments… specifying one small aspect of behaviour at a time, then implementing it.
Dave then goes on to demand code frameworks that help with behaviour/example-driven design, and which don’t use a testing vocabulary. Surely it would be quite easy to change all the names in Ruby’s Test::Unit …?
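As a toy sketch of what that renaming might look like (these names are all invented — a real framework such as Test::Unit has far more machinery underneath): the “tests” become specifications, the test methods become statements of what the code should do, and the assertion becomes a verification.

```ruby
# Toy sketch of a testing framework re-cast in a behaviour
# vocabulary. All names here are invented for illustration.
module Behaviour
  class Specification
    # Collect the example methods defined directly on a subclass.
    def self.examples
      public_instance_methods(false).map(&:to_s).grep(/^should_/)
    end

    # A deliberately tiny expectation -- raises on any mismatch.
    def verify(expected, actual)
      raise "expected #{expected.inspect}, got #{actual.inspect}" unless expected == actual
      true
    end

    # Run every example and tally the outcomes.
    def self.check_all
      examples.each_with_object(Hash.new(0)) do |example, tally|
        begin
          new.send(example)
          tally[:passed] += 1
        rescue RuntimeError
          tally[:failed] += 1
        end
      end
    end
  end
end

# Specifications now read as statements of behaviour, not "tests":
class StackBehaviour < Behaviour::Specification
  def should_start_empty
    verify 0, [].size
  end

  def should_grow_when_pushed
    verify 1, [].push(:item).size
  end
end
```

Nothing here changes what the framework does — only what it is called — which is exactly Dave’s point: the vocabulary shapes how we think about the activity.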
I just found Michael Feathers’ rant about designing for testability. I couldn’t agree more with his sentiment!
A couple of years ago I worked with a team that had exactly the problem Michael describes. Many of the classes were nearly impossible to instantiate in a test harness, and many of their methods couldn’t be tested without dire consequences for the development environment. Refactoring in the subsystems that contained these classes was also impossible, because of the unpredictable side-effects incurred by doing almost any method call. In the end, after months of frustration, we decided to very carefully and incrementally throw away those classes and replace them with something testable. In fact, that large-scale refactoring was never finished (due to circumstances outside the team’s control), but the improving code definitely “felt” better.