yet more on metrics for coaching

A month ago I began asking whether it was possible, meaningful or sensible to measure the performance of an agile coach. Herewith a couple of further thoughts on that topic.

Firstly, there’s been some discussion of metrics over on the scrumdevelopment Yahoo group. The original question posed to the group was:

“What metrics do I collect that tell me this agile stuff is actually doing my group any good”

Mary Poppendieck weighed in with three metrics that should be taken together as a package:

  1. cycle time;
  2. business case realisation;
  3. and net promoter score.

Her post goes on to describe these measures in detail, and why they are important. They do seem spot on to me – though I’ve only tried the first for real, and even then with only limited success. It seems to me that there’s a strong case for measuring the coach according to the results of the organisation being coached, and these look to be measures I’d be willing to adopt for myself. But then…

Secondly, I have now finished Value-based Fees: How to Charge and Get What You’re Worth by Alan Weiss, as recommended by Deb. While not about coaching per se, the book does make recommendations as to how a consultant should establish performance measures with his or her clients. Weiss divides measures into quantitative and qualitative. Regarding the former:

“[Y]ou cannot base your contribution on what I call a ‘magic number’. While a 6 percent sales increase might be highly desirable and even, in your opinion, very achievable, it’s never a good idea to peg your success to that magic number. The reason is that there are far too many variables that can affect that number adversely for you to take that risk. […] It is better to state that you will assist in maximising the sales increase.”

And regarding qualitative measures:

“Qualitative measures can be among the most powerful [but] “I’ll know it when I see it” […] is insufficient for a consulting project. In fact, here are the parameters for creating a successful, anecdotal series of measures for a project: the buyer himself or herself will judge the result; the effectiveness will rely on observed behaviour and factual evidence; there will be gradations of success, not success or failure; there must be reasonable time limits.”

Weiss goes on to give examples that are very much like the kind of anecdotal measures I’ve been using on my own projects. So now I’m back to where I started! On reflection, I think that’s just fine. Start with something like:

  1. assist in minimising cycle time;
  2. assist in maximising business case realisation;
  3. and assist in maximising net promoter score

and then augment those with any other anecdotal measures that are valuable to the team, manager or organisation I’m working with.

What measures do you use to assess your own performance at work?

designing metrics at SPA2006

The second session I attended at SPA2006 was Designing metrics for agile process improvement, run by Jason Gorman and Duncan Pierce. Their thesis is that metrics are difficult to design, and making a mistake can lead an organisation down the wrong path entirely.

To demonstrate this, they divided us into four groups and assigned us each a “goal”. My group’s goal was “more frequent releases”. We had to brainstorm metrics that might act as measures of an organisation’s progress towards that goal, and then pick one to carry forward to the remainder of the exercise. We spent some time thrashing around wondering why an organisation might have such a goal, and finally we plumped for the metric average time in days between releases.

At this point, one person in each group took his group’s metric to the next table, where that group (minus one) would attempt to “game” it. The objective of this part of the afternoon was to try to get the metric looking right, but such that the organisation was clearly doing something very undesirable behind the scenes. Our table was joined by Sid, bringing along a metric whose stated goal was “fewer bugs in the product” and whose metric was number of bugs per function point per release. During the next half-hour we had great fun trying to find ways to write more bugs, while still having fewer per function point. These included testing the previous release really hard to fill up the bugs database, then hardly testing future releases at all; and making it really difficult for users to report bugs. Meanwhile another group was gaming our own metric, by producing zero-change releases and by saving up releases into rapid-fire batches all at once.
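To make the gaming of our own metric concrete, here is a minimal sketch in Python. The function name, the dates and the release schedules are all invented for illustration; nothing here comes from the workshop itself. It computes the average time in days between releases for a team that ships every 30 days, and for a team that ships nothing for 90 days and then fires off a batch of zero-change releases.

```python
from datetime import date, timedelta


def average_days_between_releases(release_dates):
    """Mean gap, in days, between consecutive releases."""
    ordered = sorted(release_dates)
    if len(ordered) < 2:
        raise ValueError("need at least two releases to measure a gap")
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


start = date(2006, 1, 1)  # hypothetical start date

# Hypothetical honest team: one real release every 30 days.
honest = [start + timedelta(days=30 * i) for i in range(4)]

# Hypothetical gaming team: nothing for 90 days, then a rapid-fire
# batch of zero-change releases on five consecutive days.
gamed = [start, start + timedelta(days=90)]
gamed += [start + timedelta(days=90 + i) for i in range(1, 6)]

honest_score = average_days_between_releases(honest)
gamed_score = average_days_between_releases(gamed)
print(f"honest: {honest_score:.1f} days, gamed: {gamed_score:.1f} days")
```

Under these made-up schedules the gaming team scores roughly twice as well on the metric, even though its users waited three times as long for anything new.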

Next, if I recall, we had to re-design our original metrics in the light of the panning they had received in the hands of the gamers. We ended up measuring the goal of “more frequent releases” using the metric days per story point between customer feature request and customer acceptance. We tried to phrase it so that it averaged out over easy and difficult features, and so that the count forced us to release the feature to an end user. This neatly side-steps all attempts at gaming (that we could discover), and is roughly equivalent to throughput – or to Kent Beck’s software in process metric.
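As a rough sketch of how the redesigned metric might be computed, the Python below takes one possible reading of it: total elapsed days from customer request to customer acceptance, divided by total story points. The `Feature` record, its field names and the sample figures are assumptions made for illustration, and other readings (such as averaging the per-feature ratios) are equally defensible.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Feature:
    requested: date    # day the customer asked for the feature
    accepted: date     # day the customer accepted the released feature
    story_points: int  # team's size estimate for the feature


def days_per_story_point(features):
    """Total request-to-acceptance days divided by total story points."""
    total_days = sum((f.accepted - f.requested).days for f in features)
    total_points = sum(f.story_points for f in features)
    if total_points == 0:
        raise ValueError("no story points accepted, so the metric is undefined")
    return total_days / total_points


# Hypothetical data: one easy feature and one difficult feature.
features = [
    Feature(date(2006, 1, 1), date(2006, 1, 11), story_points=2),  # 10 days
    Feature(date(2006, 1, 1), date(2006, 1, 31), story_points=8),  # 30 days
]
score = days_per_story_point(features)
print(f"{score:.1f} days per story point")
```

Only accepted features enter the calculation, so a zero-change release adds nothing to either total, and a feature held back for a batch simply accumulates more days.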

For the workshop’s final segment, we recognised that a single metric by itself is rarely sufficient to direct process improvement. Change one thing for the better and something else may well slip out of line. Jason and Duncan’s thesis is that we need a system of metrics, each of which balances the others in some way. So as a final exercise, we began an attempt to take the metrics of all four groups and weave them into a system, showing which measures reinforced which others etc. For me this is the key observation from the whole afternoon, and I would like to have spent more time here. I had never really seen the attraction of Systems Thinking, but these models derived from the need for metrics do seem to be a good application of the technique. Food for thought indeed…

never tolerate a red bar

So I’m sitting pairing for the first time on a project that’s been going for a couple of years. We tweak a little code and then run the tests. I’m impressed that they all run in ten seconds or so. But some of them fail, and the bar turns red. “That’s ok,” says my partner. “Some of the tests fail at random every so often.”

We hit Run again, and again we get a red bar. “Like I said. Must be another random failure.” He’s smiling and trying to be confident, but the expression on my face is making him a little unsure.

So we open up the test list, which has a little coloured dot alongside each test. Most of the dots are green, but three are red. My partner scans the list. “Yep, those are the ones that always fail.”


“Sorry – these are the three that randomly fail occasionally. Nothing to worry about.” He moves to get on with the next little refactoring step on our to-do list.

“How do you know they failed for the reason you expect? How do you know we didn’t break them in a new way? Worse still: how do you know whether those expected failures are masking something we just broke, but which isn’t being exercised when the random failure occurs?”

“We don’t. But everyone’s looked at those tests. They’re impossible to understand, and no-one knows how to fix them. We’d be here all day if we tried, and we’ve got this other stuff to do.”

my eyes are fine

As it turns out, my eyes aren’t too high after all! My new passport was couriered to my home today (a Saturday), only four days after I submitted my borderline application. So no problems with throughput at the UK Passport Office – unless someone there reads my blog and decided they had better fast-track my forms…

running tested features are not enough

There’s been some discussion this week in the comments over on It’s a Team Game about a software team whose internal and external customers don’t seem interested in their product increments. No-one seems to care whether the team delivers anything at all. In response, members of the team are concluding that they need more of a focus on delivery, as recommended by Alistair Cockburn in Are Iterations Hazardous to your Project? They reason that frequent delivery (and demonstration) of running tested features will create belief, and then demand will follow.

This got me thinking about the RTF metric, and in general about the attributes of the product increments delivered at the end of each iteration. Certainly they should consist of running features – which I take to mean that they provide (possibly small) chunks of meaningful, cohesive, user-visible behaviour. They should also consist of tested features – which I take to mean that there are no defects in them. But which features?

agile, top down

In Agile, Top Down Ron Jeffries has written a compelling article describing how he would introduce agile methods into an organisation, starting from the top.

“There’s a recent thread on the Scrum list about how an executive or highly-placed manager could get Agile going. I’ve been one of those guys, and I know a bit about Agile, and here’s how I’d proceed. First, focus management attention on cyclic delivery of running tested software. Second, provide the resources to learn how to do that.”

Ron’s advice centres on measuring the development department by running tested software – that is, by measuring throughput. I very much like Ron’s advice and approach. But I do see some hurdles…

First, if you’re in a position to dictate that the only measure is throughput, that’s great. But if you’re not, the experience of the TOC and Lean guys is that you’re going to have a really hard time convincing the bean counters – and other senior management, for that matter – that throughput is the right measure. Common sense notwithstanding, this is a cost-accounting world. The only saving grace here is that most software development organisations are too chaotic to count anything, so they likely won’t be already counting the wrong things. (An obvious exception being those departments that are or have been in the grip of Accenture-like consulting practices. But then we’re not likely to try introducing agile top down in one of those places, I guess.)

And second, I would expect that the department’s throughput of software will initially (first three to four months) dip until the running and tested parts of the equation are working effectively. Because testing everything, and fixing all of the defects before they go out the door, costs. The investment will eventually be repaid many times over in terms of predictability, trust and speed, but it is an investment nonetheless.

I still think Ron has it right – just don’t expect an easy time of it for the first few months.

Update, 1 aug 05
Discussion with Jason (in the comments on this post) leads me to understand that the choice of metric is critical. In the article, Ron was explicitly and implicitly referring to his own metric, running tested features.

meetings, meetings

In Meetings, meetings, everywhere… David Allen (author of Getting Things Done) tells an alarming story and also quotes Dave Barry:

If you had to identify, in one word, the reason why the human race has not achieved, and never will achieve, its full potential, that word would be: “meetings.”

In a current project of mine, our metrics suggest that meetings are the biggest factor in the team’s low throughput of user stories. Meetings and other distractions have more impact than legacy code, lack of tests, poor backlog management, sloppy timeboxing, weak object design, …

business automation gone wrong

To my mind, software exists in order to automate (parts of) the business process. But that simple idea seems to have been forgotten by some of the designers here…

I’ve temporarily joined a project whose purpose is to remove a manual step in a data transfer process, thus ensuring that the relevant users always have correct and complete data to hand when they need it. The new design does indeed achieve that goal. But somewhere during the design phase (yes, I know) two groups of people somehow failed to communicate clearly. And the resulting design therefore includes a manual step that’s harder to perform than the one they’ve eliminated!

What’s really funny is that this design will be measured as a success by the business. Because although the staff costs associated with running the affected business process are unchanged, the new design allows an old database to be de-commissioned. So several tens of thousands of pounds have been spent, in order to save a few thousand in annual maintenance costs, in support of a delivered value that hasn’t increased.

Am I becoming a ‘throughput’ maven…?

local optimisation

For the last couple of days I’ve been studying the results of a local process improvement exercise. The exercise was run earlier this year, and had its own business case, complete with a financial justification. The plan was to reduce by 50% the number of defects that this department’s support team had to deal with, and the justification was that this would save around £30,000 per month in overheads. Now of course I believe that fixing defects is muda, so any reduction in this waste gets my full support. And indeed the improvement exercise was successful, as the bug statistics for recent months have shown.

So the company has saved a load of money, right? Well, certainly there are fewer bugs to fix in this one department. But no-one was laid off – instead, people now simply spend less time in support, and more time doing other work. This is a classic case of what TOC calls local optimisation: this department is now spending less of its own money, but as a result some others are probably spending more. And as I look around I find that the entire organisation – which is large – is incentivised in a similar way. Each department’s objective is to “save” costs by cross-charging its staff to other departments. But because they all ultimately work for the same company, this local accounting is obscuring the bigger picture. I’m convinced that end-to-end project costs are therefore significantly higher than they could be.

Could it be done differently? TOC says it can. The key measure of success in this company’s business is time to market. What if we could find a way for each department to be measured on throughput, and in such a way that fixing defects is seen to reduce throughput?

I guess I’ve found a mission…