So I’m sitting pairing for the first time on a project that’s been going for a couple of years. We tweak a little code and then run the tests. I’m impressed that they all run in ten seconds or so. But some of them fail, and the bar turns red. “That’s ok,” says my partner. “Some of the tests fail at random every so often.”
We hit Run again, and again we get a red bar. “Like I said. Must be another random failure.” He’s smiling and trying to be confident, but the expression on my face is making him a little unsure.
So we open up the test list, which has a little coloured dot alongside each test. Most of the dots are green, but three are red. My partner scans the list. “Yep, those are the ones that always fail.”
“Sorry – these are the three that randomly fail occasionally. Nothing to worry about.” He moves to get on with the next little refactoring step on our to-do list.
“How do you know they failed for the reason you expect? How do you know we didn’t break them in a new way? Worse still: how do you know whether those expected failures are masking something we just broke, but which isn’t being exercised when the randon failure occurs?”
“We don’t. But everyone’s looked at those tests. They’re impossible to understand, and no-one knows how to fix them. We’d be here all day if we tried, and we’ve got this other stuff to do.”
I’ve seen this so many times – and I’ve been in that same position on some of my own projects too. The effort of trying to fix the tests seems huge, so we just tolerate the failures and hope we stay lucky.But the cost is greater than just the few seconds it takes to check that the red bar came from those expected tests. Because we have an area of the codebase that’s essentially untested. We never check whether test failures there are “real”, so anything we change in that area is tested when the tests pass, but untested when they fail.
Further, the fact that the tests are a no-go area for the developers is in itself a huge smell. It’s likely that the system’s design in this area (at least) is monolithic or incoherent. It needs to be re-designed to be testable.
Tolerating the red bar is costing us time and throughput on a daily basis. And who know what deeper problems lie behind those “random” failures…?