more on performance metrics

Since I opened this particular can of worms a couple of weeks ago, the power of intention-manifestation has brought these related blog posts into view:

Do you know of any more discussion, research or case studies into measuring the performance of managers, coaches or businesses?

process improvement metrics – some questions

Software process improvement is all very well, but how do we know whether it is working?

The topic of measuring the effectiveness of process improvement activities has crossed my path several times in recent days. For example, my current client and I are trying to define ways to measure the benefits to their organisation of retaining me. And then there was the designing metrics session at SPA2006; in their handout notes, Jason and Duncan suggest that process improvement be run as an agile project, with its own stories, velocity and so on. Not to mention a conversation I had with Deb Hartmann on the same topic; it seems she wants to measure – or at least chart – her coaching activities.

I’ve found almost nothing written about this topic, so at present I’m afraid I have only questions. First though, some basic assumptions. I assume that software process improvement in itself is pointless. As I said earlier this week, the term “process” doesn’t capture enough of the behaviour of a system to be useful. And of course “improvement” is a relative term. Improvement compared to what? And towards which goal(s)? According to Goldratt there’s only one Goal: maximising profits in both the present and the future.

I will assume, then, that software development is always carried out with that goal in mind: the software either reduces our own organisation’s costs or increases our sales, or is designed to be sold to other organisations so that they can use it for one of those ends. Either way, the process of producing that software is only a fraction of the whole system, and may contribute a large or small amount to how well we’re working towards our organisation’s Goal. Furthermore, improving the software development department in isolation may prove beneficial initially, but could create a local optimum at the expense of another part of the system, or could simply not be worth doing, if other parts of the system are contributing more to poor overall performance.

All of that notwithstanding, there will be times when an organisation must turn its attention to improving the software development department. Software development will inevitably be the system’s primary bottleneck some of the time, and so its contribution to the goal will need to be addressed. But depending on context the “improvement” required may involve improving quality, reducing time to market, improving customer service, ensuring the right software is developed, training people, or any number of other things. That is, the day-to-day dynamics of the overall system’s push towards the Goal will create short- and medium-term demands on software development; and the response from software development must be measured in terms of those demands.

Henceforward, I’m going to use the woolly term effectiveness as a shorthand for the software department’s contribution to the overall system Goal. Improving the department’s effectiveness then equates to reducing costs and/or increasing sales in the overall system. And instead of talking about “software process improvement”, I’ll talk about “software department effectiveness”.

So, back to the top. How can we measure the performance of a software coach? As yet I have no answer; instead I’ll ask a few more related questions:

  1. If we hire a coach or process facilitator or scrum master, how will we know whether we’re getting value for money? Unless the coach works for free, the organisation’s Operating Expense has increased, so should we be looking for a corresponding increase in sales?
  2. Should a software coach charge a percentage of their client’s increase in profits? Now and in the future?
  3. How often is it possible to attribute improved profitability to the actions of a software coach?
  4. What if the service constitutes training – how can that be tied directly to profitability?
  5. Given that increased profitability may be measurable only after a period of time has elapsed, how should the risk be apportioned (between client and coach) during that interval?
  6. Conversely, how risky is it to define local measures based purely on the software department?
  7. Imagine an ineffective organisation whose biggest problem is not within the software development sub-system. But suppose nevertheless that the organisation blindly hires an agile coach to improve the software department’s effectiveness. Is it meaningful to talk about “improvements” when the real problem is elsewhere? Can the changes be measured in any meaningful way?
  8. Assume that one of the objectives of a coach is to transfer sufficient knowledge so that the client organisation can ultimately continue alone. How can we chart progress towards that goal?

I can find no answers in the lean manufacturing literature (assuming lean sensei charge for their services), or in the agile domain. I suspect that we as an industry just don’t know (yet). Deb suggests we may need a few OpenSpace discussions to work out the answers – or even to establish some direction – and I concur. If you know of any resources, or if you wish to participate in getting these discussions going, please drop me a line.

designing metrics at SPA2006

The second session I attended at SPA2006 was Designing metrics for agile process improvement, run by Jason Gorman and Duncan Pierce. Their thesis is that metrics are difficult to design, and making a mistake can lead an organisation down the wrong path entirely.

To demonstrate this, they divided us into four groups and assigned us each a “goal”. My group’s goal was “more frequent releases”. We had to brainstorm metrics that might act as measures of an organisation’s progress towards that goal, and then pick one to carry forward to the remainder of the exercise. We spent some time thrashing around wondering why an organisation might have such a goal, and finally we plumped for the metric average time in days between releases.

At this point, one person in each group took his group’s metric to the next table, where that group (minus one) would attempt to “game” it. The objective of this part of the afternoon was to try to get the metric looking right, but such that the organisation was clearly doing something very undesirable behind the scenes. Our table was joined by Sid, bringing along a goal of “fewer bugs in the product” and a metric of number of bugs per function point per release. During the next half-hour we had great fun trying to find ways to write more bugs while still having fewer per function point. These included testing the previous release really hard to fill up the bugs database, then hardly testing future releases at all; and making it really difficult for users to report bugs. Meanwhile another group was gaming our own metric, by producing zero-change releases and by saving up releases and then pushing them out in rapid-fire batches.
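
To make the gaming concrete, here is a back-of-envelope sketch (all numbers invented) of how bugs per function point per release can trend nicely downwards while the number of bugs actually written goes up, simply because later releases are tested less and bugs become harder to report:

```python
# Hypothetical numbers showing how "bugs per function point per release" can
# look better and better while more bugs are actually being written: later
# releases are simply tested less, and users find it harder to report bugs.
releases = [
    # (release, function_points, bugs_written, bugs_reported)
    ("1.0", 100, 60, 55),   # tested hard, so most bugs are found and reported
    ("1.1", 120, 90, 30),   # barely tested: more bugs written, fewer reported
    ("1.2", 150, 140, 20),  # bug reporting made awkward: the metric looks great
]

for release, fps, written, reported in releases:
    print(f"{release}: {reported / fps:.2f} reported bugs per FP "
          f"({written} bugs actually written)")
```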

Next, if I recall, we had to re-design our original metrics in the light of the panning they had received at the hands of the gamers. We ended up measuring the goal of “more frequent releases” using the metric days per story point between customer feature request and customer acceptance. We tried to phrase it so that it averaged out over easy and difficult features, and so that the count forced us to release the feature to an end user. This neatly side-steps every attempt at gaming we could discover, and is roughly equivalent to throughput – or to Kent Beck’s software in process metric.
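
For concreteness, here is a minimal sketch of how that number might be computed, assuming we record a request date, an acceptance date and a story-point estimate for each feature; the data and record layout are invented for illustration:

```python
from datetime import date

# Illustrative sketch of the re-designed metric: days per story point, measured
# from customer feature request to customer acceptance. Dates, estimates and
# the record layout are all invented for the example.
features = [
    # (requested, accepted, story_points)
    (date(2006, 3, 1), date(2006, 3, 15), 3),
    (date(2006, 3, 5), date(2006, 4, 2), 8),
    (date(2006, 3, 20), date(2006, 3, 27), 2),
]

total_days = sum((accepted - requested).days for requested, accepted, _ in features)
total_points = sum(points for _, _, points in features)

print(f"{total_days / total_points:.1f} days per story point")
```

Because elapsed calendar time is in the numerator and only accepted features count, the number can only improve by getting finished features into customers’ hands sooner.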

For the workshop’s final segment, we recognised that a single metric by itself is rarely sufficient to direct process improvement. Change one thing for the better and something else may well slip out of line. Jason and Duncan’s thesis is that we need a system of metrics, each of which balances the others in some way. So as a final exercise, we began an attempt to take the metrics of all four groups and weave them into a system, showing which measures reinforced which others etc. For me this is the key observation from the whole afternoon, and I would like to have spent more time here. I had never really seen the attraction of Systems Thinking, but these models derived from the need for metrics do seem to be a good application of the technique. Food for thought indeed…
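
I did not capture the groups’ full diagram, but the shape of the exercise was roughly this: treat each metric as a node and note whether improving one tends to reinforce or to balance another. The toy sketch below uses invented links purely to show the idea:

```python
# Toy sketch of a "system of metrics": each metric is a node, and each link
# records whether improving one tends to reinforce or to balance another.
# The metrics and links here are invented purely to show the shape of the idea.
metric_links = {
    "days per story point (request to acceptance)": [
        ("reported bugs per function point", "balances"),   # rushing releases hurts quality
        ("frequency of customer feedback", "reinforces"),   # shorter cycles, faster feedback
    ],
    "reported bugs per function point": [
        ("days per story point (request to acceptance)", "balances"),
    ],
}

for metric, links in metric_links.items():
    for other, relation in links:
        print(f"{metric} --{relation}--> {other}")
```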

personal goal-setting

Here’s an absolutely mad thought-experiment…

Over at KnowledgeJolt, Jack Vinson is on a “personal effectiveness kick”. And being interested in the Theory of Constraints, he’s naturally looking to define his Goal, and to define one or more effective measures of progress towards it. In response to a little goading from me, he says this in the comments on Too Busy Being Unproductive to Learn to Be Productive:

“What we need to know (in terms of measures) is whether we are reaching our goals or not. If we’ve taken action that moves us in the wrong direction, that’s probably the wrong action. (If we didn’t know that it would be wrong in advance, then it was a good learning experience, and we can do something different next time.)

“The difficulty with productivity, as it is typically used, is that it simply measures output without looking at direction. For example: I did ten things, but only one of them contributed to my long-term goals. Even worse, two of them distracted from my ability to meet my goals.

“I’m not sure what the final personal measure would look like, however. I’d rather have something that showed how much progress I’ve made, rather than an accounting of how many things I’ve done.”

What’s required, then, is a way to define our goals – and to quantify them.

delivered value beats delivered features

In The Productivity Metric, James Shore adds fresh fuel to the “running tested features” debate. The article is well worth reading, and James concludes:

“There is one way to define output for a programming team that does work. And that’s to look at the impact of the team’s software on the business. You can measure revenue, return on investment, or some other number that reflects business value.”

I whole-heartedly agree with this conclusion, although in my experience there are a couple of hurdles to overcome:

First, the figures may be hard to come by or difficult to compute. This is particularly true of infrastructure software, or tooling that’s used internally by the business. How do you compute the development team’s ROI from the impact they have on an admin clerk’s day-to-day work? There will always be the danger of monetising some intermediate measure, and thereby creating a local optimum. (If you have examples of this being solved successfully, please get in touch.)
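
For what it’s worth, the temptation usually looks like the back-of-envelope calculation below (every figure is invented); monetising minutes saved in this way is exactly the kind of intermediate measure that risks creating a local optimum:

```python
# Back-of-envelope ROI sketch for an internal tool. Every figure is invented,
# and monetising "minutes saved" like this is precisely the kind of
# intermediate measure that risks creating a local optimum.
minutes_saved_per_clerk_per_day = 20
clerks = 50
working_days_per_year = 220
loaded_cost_per_clerk_hour = 25.0   # salary plus overheads
development_cost = 80_000.0

annual_saving = (
    (minutes_saved_per_clerk_per_day / 60.0)
    * clerks
    * working_days_per_year
    * loaded_cost_per_clerk_hour
)
first_year_roi = (annual_saving - development_cost) / development_cost

print(f"annual saving: {annual_saving:,.0f}, first-year ROI: {first_year_roi:.0%}")
```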

And second, the development team may feel that the figures are too dependent on other business processes, such as sales or dispatch. Even where the software is the company’s product, the value stream is often not as short or highly tuned as one might wish; and so the developers may not wish to be measured against the whole stream’s effectiveness. In theory, rapid feature development and compelling usability ought to energise the sales team and the market to the point where demand dominates supply; in which case the value/time metric will work well. In practice, the necessary pull is too often absent. (Maybe in that case the metric is still valuable, telling us something about the whole value stream…)

looking for patterns in the churn

One of the projects I work with has very high churn in the backlog. There are repeated and frequent seismic shifts in the project’s priorities, and each time that happens a whole slew of twenty or more cards instantly appears at the top of the pile. And a couple of times the entire backlog has been junked – sorry, archived. Naturally this all has side-effects, some good and some not so good.

It’s important to note that this project is considered a roaring success. The team has achieved amazing things in a short time. And its ability to cope with the backlog churn – while maintaining velocity – is remarkable. And yet the churn is disorienting: the team’s retrospectives regularly discover that individuals feel the project is out of control, or directionless.

xpday5

I spent another enjoyable day at XPday5 yesterday. As always, it was great to meet up with old friends and colleagues, although I never seemed to have the time to get into real conversation with anyone!

I was there to run a new version of the jidoka session, this time including games and a working group. The new format was great fun, and I’ll be running it again at SPA2006 next spring.

My main highlight of the day was Tom Gilb’s introduction to ‘quantifying any quality’. This was a compressed introduction to the Evo method’s approach to requirements, in which projects set numeric goals for everything from usability to maintainability to innovation. I was stunned by this approach, and spent some time over lunch quizzing Tom further. Tom’s thesis is that most of what we call “stories” or “requirements” are actually solutions to deeper business and user needs. Tom ferrets out those underlying needs, quantifies them, and then leaves the development team to decide how best to meet them. This chimes well with some of my experiences, so I’ve vowed to study Evo in detail during the next few months.
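
For flavour, a quantified quality in roughly the style Tom described might look like the sketch below. The field names are my loose paraphrase of his Planguage keywords and every value is invented, so treat it as a gist rather than the real notation:

```python
# Rough gist of a quantified quality requirement in the spirit of Tom Gilb's
# Evo/Planguage. The field names are a loose paraphrase and every value is
# invented; it is the shape of the thing that matters, not the exact notation.
usability = {
    "ambition": "new users can place an order without training",
    "scale": "minutes for a first-time user to complete a standard order",
    "meter": "usability session with ten recruited first-time users",
    "past": 25,     # measured on the current release
    "goal": 10,     # committed target for the next release
    "stretch": 5,   # aspirational level
}

print(f"Improve '{usability['scale']}' from {usability['past']} "
      f"to {usability['goal']} (stretch: {usability['stretch']})")
```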

running tested features are not enough

There’s been some discussion this week in the comments over on It’s a Team Game about a software team whose internal and external customers don’t seem interested in their product increments. No-one seems to care whether the team delivers anything at all. In response, members of the team are concluding that they need more of a focus on delivery, as recommended by Alistair Cockburn in Are Iterations Hazardous to your Project? They reason that frequent delivery (and demonstration) of running tested features will create belief, and then demand will follow.

This got me thinking about the RTF metric, and in general about the attributes of the product increments delivered at the end of each iteration. Certainly they should consist of running features – which I take to mean that they provide (possibly small) chunks of meaningful, cohesive, user-visible behaviour. They should also consist of tested features – which I take to mean that there are no defects in them. But which features?
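
As a reminder of what the raw count measures, here is a minimal sketch (feature records invented): a feature contributes to RTF on a given day only if it is in the delivered build and all of its acceptance tests pass.

```python
# Minimal sketch of counting Running Tested Features on a given day: a feature
# counts only if it is in the delivered build ("running") and all of its
# acceptance tests pass ("tested"). The feature records are invented.
features = [
    {"name": "search by customer", "in_build": True,  "acceptance_tests_pass": True},
    {"name": "export to CSV",      "in_build": True,  "acceptance_tests_pass": False},
    {"name": "bulk re-pricing",    "in_build": False, "acceptance_tests_pass": True},
]

rtf = sum(1 for f in features if f["in_build"] and f["acceptance_tests_pass"])
print(f"running tested features today: {rtf}")
```

The count says nothing about which features those are, which is exactly the gap that worries me here.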

open quality

Today on the Yahoo XP list, Kent Beck posted this link to Agitar’s open quality initiative. I applaud their openness, and would definitely encourage all other development groups to follow suit. (There’s a small danger, of course, that the publication of such “dashboards” can be turned into an exercise in chest-thumping. I’m sure that isn’t the case with Agitar.)

It seems to me that the mere act of putting together the dashboard publication scheme would provide a group with important insights and impetus. And being able to “compare” numbers across the community offers both security (“Phew! Most teams are as bad at UI testing as us”) and challenges (“Blimey, most folks test over 95% of their classes”). Perhaps every group that publishes a dashboard page should make it easy to Google – maybe we could agree on standard phrases to include on the page? We could then link from the C2 wiki to the Google search, so that the standard is enshrined in a working implementation.

Update, 10 nov 05
Agitar’s Mark DeVisser has commented on this post.