insurance for software defects

The more I think about it, the more astonished I become. Maintenance contracts for (bespoke) software: buying insurance against the possibility that the software doesn’t work.

I know the consumer electronics industry does the same, and I always baulk at the idea of spending an extra fifty quid in case the manufacturer cocked up. I wonder what percentage of purchasers buy the insurance? And I wonder what percentage of goods are sent back for repairs? Perhaps the price could be increased by 10% and all defects fixed for free. Or perhaps the manufacturer could invest a little in defect prevention.

It seems to me that software maintenance contracts are an addiction. Software houses undercut each other to win bids, and then rely on the insurance deal to claw back some profits. So no-one is incentivised to improve their performance, and in fact the set-up works directly against software quality. Perhaps it’s time now to break that addiction…

If a software house were able to offer to fix all defects for free, would that give them enough of a market advantage to pay for the investment needed to prevent those defects? Is “zero defects or we fix it for free” a viable vision? (Does any software house offer that already?) And how many software companies would have to do it before the buyer’s expectations evolved to match?

As an industry, do we now know enough to enable a few software houses to compete on the basis of quality?


This week I’ve come across a few articles that clicked together as I read them, and in so doing they reinforced one of my deepest beliefs about software development – or any other profession, for that matter. The articles were:

  • Train Wreck Management by Mary Poppendieck, in which Mary chronicles the origins of “management” and “accountability”. It seems that hierarchical management structures were first applied in the mistaken belief that it was one person’s dereliction that caused a train crash, and that in future such things could be prevented by management and accountability. Lean thinking says the opposite – that the train crash was an inevitable consequence of the structure of the system, and that specific individuals should not be blamed.
  • Survey Blames Blame for Lean Struggles by Mark Graban, in which Mark notices that a recent survey seems to indicate that employees blame each other for their business failing to adopt lean thinking. Mark’s subsequent analysis shows, however, that the survey leads the participants by asking “who is responsible for…?” – so it should be no surprise that the top answers mostly involve job titles! Mark’s response is to design a new survey in “5-whys” style – an effort I applaud, even though I disliked the example embedded in the survey.
  • Risk Aversity by Henrik Mårtensson, in which Henrik uses the Theory of Constraints thinking tools to dig into why many organisations are immune to change. One of the root causes, according to Henrik, is that “mistakes” are punished in the average workplace – and so after a while everyone becomes afraid to innovate, or even to change the status quo. A truly lean organisation will reward even the person who makes a change that leads to lower throughput, because at least they contributed, and knowledge about the whole system has been improved as a result. But even the use of the word “mistake” shows how deep-seated is our culture’s desire to blame, and hence to discourage.
  • The Secret Sauce of Highly Productive Software Development by Amr Elssamadisy and Deb Hartmann, in which the authors propose that inability to learn is the bottleneck (in the Goldratt / TOC sense) in most software teams. I need to think about their “bottleneck” claim, but I do agree that learning is the key to agile success, and that a learning organisation will out-perform any other in the longer term.
  • The QnEK Horse has left the Barn by Hal Macomber, in which Hal opens the lid on a community for sharing QnEK (Quick-n-Easy Kaizen) ideas. QnEK can only succeed in an organisation where “mistakes” don’t exist, where blame is replaced by systemic learning.

For me, these articles all dovetail:

  • learning is the key to long-term success
  • the system can always be improved
  • systemic learning requires a constant flow of improvement suggestions
  • blame discourages innovation

It seems clear to me that blame itself lies at the root of many organisations’ problems. The system can always be improved, and is always the source of problems. People should be rewarded for discovering those problems and trying to fix them – especially when their changes “only” create new knowledge about what doesn’t work.

carnival of the agilists, 3-aug-07

John Brothers has posted the latest edition of the Carnival, entitled I’m running late. This is a perennial topic for all software development projects, and doubly so for those of us who take a lean or TOC view of productivity and change, so props to John for bringing that focus to the carnival this time around.

Speaking of TOC, one recent post that escaped John’s attention is Multi-Tasking: Why projects take so long and still go late by Kevin Fox. Every time I work in a large organisation I find myself writing about multi-tasking – and wondering why “common sense” is so rare…

a ruby sparkline showing variation

Back in May’s Carnival of the Agilists I referenced a post by Clarke Ching in which he suggests we can learn a lot about variation in a complex process simply by flipping coins. When I ran the simulation a few times in Excel I found, as expected, that heads and tails don’t always occur in equal measure. But doing that by hand was a pain, so I’ve made it easier: I’ve written a little Ruby program that simulates 1000 coin tosses; you can get it from github.

The simulation generates SVG images; if you can view those in your browser, you can see three consecutive runs here, here and here.

(Disclaimer: the images are SVG. They display fine in Firefox, but if you use IE you may need to install some kind of viewer.)
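The real program is on github; as a minimal sketch of the same idea (the method names and chart dimensions here are invented, not taken from that program), the tosses can be turned into an SVG sparkline like this: track the running difference between heads and tails, then plot it as a polyline around a centre line.

```ruby
# A sketch of a coin-toss sparkline: simulate tosses, record the running
# heads-minus-tails balance, and emit it as a simple SVG polyline.
TOSSES = 1000

# Returns an array of running balances, one entry per toss (+1 head, -1 tail).
def simulate(tosses = TOSSES)
  balance = 0
  (1..tosses).map { balance += (rand(2).zero? ? 1 : -1) }
end

# Renders the series as an SVG document with a grey centre line at zero.
def sparkline_svg(series, width: 1000, height: 200)
  mid = height / 2.0
  step = width.to_f / series.size
  points = series.each_with_index
                 .map { |v, i| format('%.1f,%.1f', i * step, mid - v) }
                 .join(' ')
  <<~SVG
    <svg xmlns="http://www.w3.org/2000/svg" width="#{width}" height="#{height}">
      <line x1="0" y1="#{mid}" x2="#{width}" y2="#{mid}" stroke="#ccc"/>
      <polyline points="#{points}" fill="none" stroke="#369" stroke-width="1"/>
    </svg>
  SVG
end

File.write('tosses.svg', sparkline_svg(simulate))
```

Run it a few times and open the resulting files side by side – the variation between runs is the whole point of the exercise.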

Kevin Fox on the theory of constraints

Kevin Fox has begun a new blog, on which he plans to share real-life success stories of applying the theory of constraints. With over 20 years working with Eli Goldratt on his CV, Kevin should have plenty of great case studies to share.

The first story tells of the development of a new drug by “Big Pharma”:

“As the drug was late in the development process, where most of the investment and effort is in the clinical trials to test the effectiveness of the drug, this was the area of most concern. When they used TOC to analyze their flow and understand where the bottlenecks were it became apparent that the real constraint was in an unlikely place, the clinical pharmacy.”

I won’t quote any further – read Kevin’s article to find out what happened. And after this taster, I’m looking forward to more of the same…

TOC mini-conference – UK and EU

If you’re interested in the Theory of Constraints and you’re based in the UK or EU, Clarke Ching’s proposal could be right up your street:

“If there is enough interest then I will kickoff a low-cost TOC mini-conference in the UK later this year. It will probably be held in Edinburgh during late-september or october and last one or two days. The goal will be to meet people and learn.”

I think this is a great idea, but it will only happen if enough people join in. If you’re interested, please hop over to Clarke’s blog and add your support.

the big DIP

Several times today my online reading has brought me to Henrik Mårtensson’s Kallokain blog. Henrik has a great knack of taking concepts from lean manufacturing or the Theory of Constraints and creating worked examples to show how they play out in day-to-day software development.

Take for example what Henrik calls the DIP – design-in-process – the software development equivalent of inventory. DIP is any work that has been started but not yet shipped to end users.

An easy, and highly visible, way to measure DIP, is to make a Project Control Board. Use a whiteboard, or a large mylar sheet on a wall. Make a table with one column for each hand-off in your development process. […] At the start of each iteration, write sticky notes for each story (use case, feature description, or whatever you use). […] Whenever a developer begins to work on a new story, the developer moves the story from the stack of sticky notes to the first column in the table. The developer is then responsible for moving the story to a new column each time the story goes into a new phase of development. […] The sticky notes in the table are your DIP. Track it closely. If you have a process problem, you will soon see how sticky notes pile up at one of your process stages.

This fits with everything I’ve written about task boards, but Henrik then goes on to show how to calculate the current value of your DIP – ie. the investment you have made in the unshipped work on the board. In his worked example, a requirements change causes 20 story points worth of DIP to be scrapped; and as each story point costs 1,000 EUR to develop, the requirements change means we must write off 20,000 EUR of development costs.

It is worth noting that some of the 1,000 EUR per story point is “carrying costs” for the work that was scrapped. Just as manufacturing inventory costs money to store and move around, unshipped software (not to mention uncoded designs) has to be regularly read, searched, built and tested. Because it was there all the time, we had to step around it and live with it; we would have tripped along more lightly if it had never existed. Looked at another way: with lower DIP our cost per story point would be lower; equivalently, we could achieve higher throughput for the same expense.

(Of course, there are two obvious reactions to this write-off cost: make changes really difficult, or reduce inventories and cycle times so that DIP is always lower. Each has its sphere of applicability, I guess…)
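Henrik’s arithmetic is easy to sketch. In the snippet below only the 1,000 EUR per story point and the 20 scrapped points come from his worked example; the board columns and story sizes are invented for illustration.

```ruby
# A sketch of DIP valuation from a Project Control Board: each column
# holds sticky notes, each note sized in story points. (Columns and
# sizes are made up; the EUR figures follow Henrik's worked example.)
COST_PER_POINT = 1_000 # EUR per story point

board = {
  'analysis'    => [3, 5],      # story points per sticky note
  'development' => [2, 8, 1],
  'testing'     => [5],
}

dip_points = board.values.flatten.sum
dip_value  = dip_points * COST_PER_POINT
puts "DIP: #{dip_points} points, #{dip_value} EUR of unshipped investment"

# A requirements change scraps 20 points of that unshipped work:
scrapped = 20
puts "Write-off: #{scrapped * COST_PER_POINT} EUR"
```

The sticky notes never leave the wall, but the money they represent is very real – which is exactly why the board makes DIP worth tracking.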

throughput accounting measurement units

Earlier today I was reading about the key business measures used in the Theory of Constraints, and running some examples in my head while I tried to make sure I understood exactly what was being said. Here’s what I learned…

To recap, the only business measures considered by TOC are

Throughput (T)
Cash receipts for actual sales, minus the cost of the raw materials used towards those sales.
Operating Expense (OE)
The expenditure required to produce what we sold, including heat/light, payroll etc.
Investment (I)
The money invested up front, or tied up in the operation – machinery, buildings, inventory, and other assets and liabilities.

These can be combined to give other measures; for example return on investment:

ROI = (T - OE)/I

(For more detail see throughput accounting on Wikipedia, for example.)

What struck me were the units used to express these measures. The only one I’ve seen explicitly stated was Throughput, which is expressed as “cash receipts per unit of time” – let’s call it dollars-per-day. So in order for the ROI calculation to make arithmetical sense, operating expense (OE) must also be measured in dollars-per-day. So far so good.

Now I had been expecting Investment (I) to simply be a sum of money. But in that case ROI would turn out to have units of “per-day”, which is clearly nonsense. So Investment must also be measured in dollars-per-day, and now I see that makes sense too. And in turn, this all implies that ROI is just a number, a scalar value. (Better perhaps to express it as a ratio or a percentage – so that an ROI of 1.50 is easier to understand as a 3:2 return or a 150% return.) Other traditional measures are calculated as follows:

Net profit (NP) = T - OE
Return on investment (ROI) = NP/I
Productivity (P) = T/OE
Investment turns (IT) = T/I

If my logic above is correct, Net Profit is expressed in dollars-per-day and all the others turn out to be scalar, unit-less numbers – and that alone helps me understand them much more.
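The units argument is easy to check with a quick sketch. All the figures below are invented, and expressing Investment in dollars-per-day follows my reasoning above rather than any official definition – but with that convention, Net Profit keeps its units while the other three drop out as plain ratios.

```ruby
# A quick check of the units: with T, OE and I all in dollars-per-day,
# NP stays in dollars-per-day while ROI, P and IT are scalar ratios.
# All figures are invented for illustration.
t  = 500.0   # throughput, $/day
oe = 300.0   # operating expense, $/day
i  = 400.0   # investment, $/day (per the convention argued above)

np           = t - oe   # net profit, $/day
roi          = np / i   # scalar ratio
productivity = t / oe   # scalar ratio
turns        = t / i    # scalar ratio (investment turns)

puts format('NP = $%.0f/day  ROI = %.0f%%  P = %.2f  IT = %.2f',
            np, roi * 100, productivity, turns)
```

So an ROI of 0.50 here reads naturally as a 50% return, with no stray “per-day” left dangling.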

Lack of visibility can be the bottleneck

Making targets and their associated metrics more visible can be the most important factor in achieving those targets. That’s the message behind the agile notion of Information Radiators or Big Visible Charts: the visibility itself is a critical component of their success. Today Shawn Callahan tells an anecdote in which it seems the increased visibility of a target and its metric served to dramatically change employee behaviour. He relates a story from Pfeffer and Sutton in Hard Facts, Dangerous Half-Truths, and Total Nonsense in which a shipping company wished to save money by bundling small orders into bigger boxes. The policy wasn’t being followed, so:

“So the company announced a new program that provided rewards such as praise—not financial rewards—for improvement. On the first day, the proportion of packages placed in the larger containers increased to 95 percent in about 70 percent of the company’s offices. The speed of this overwhelming improvement suggests that a change in performance derived not just from the rewards that were offered, but also from the information provided that the current performance level was poor and this action—consolidating shipments—was important to the company.”

As Jack Vinson suggests, the relative invisibility of the target behaviour and its measures could well have been the constraint on that business at that time. The anecdote is from the 1970s, too early for the business to have used the TOC thinking tools to diagnose the situation, or for the metric’s increased visibility to have been the result of applying the five focusing steps (identify, exploit, subordinate, elevate, repeat).

For me, the lesson from this story is that the constraint is as likely to be in knowledge management or communication as anywhere…