“Gaming the system” is the kind of phenomenon that makes pedantic software development managers end their careers in mental asylums. A metric is introduced in order to achieve a certain outcome. To enhance the prospects of achieving the desired outcome, individuals and/or teams are compensated on the measured value of the metric. Over time they learn how to “game it”; that is, they skillfully improve the measured value whether or not such improvements still accord with the desired outcome. The means (i.e., the measured value of the metric) becomes the end.
“Gaming it” manifests itself as the failure, over time, of measured performance to fully represent actual performance. For example, a team is likely to develop the capacity to produce code with a low level of cyclomatic complexity per class1 if the team gets measured on this metric. But the team would not necessarily attain the desired outcome: delivering code with a low level of technical debt. For instance, it might not pay due attention to unit testing. Or, the team might duplicate blocks of code as if there were no tomorrow.2 The team excels in keeping complexity low, which is an important factor in keeping technical debt low, but it fails to keep the overall level of technical debt in check because it neglects the unmeasured components of technical debt.
Figure 1, inspired by the research of Cutter Fellow Dr. Robert Austin,3 illustrates the typical divergence between measured performance and actual performance. This divergence is the root cause for the feeling expressed by so many Cutter clients and prospects: “I don’t control the software; it controls me.”
Figure 1 — Measured performance versus actual performance.
A good way to view and address the “gaming it” phenomenon is to think of software metrics as if they were radioactive isotopes. Like isotopes, software metrics decay over time.4 Metrics, of course, do not lose mass, but they lose effectiveness. Beyond a certain period, satisfying a single metric per se does not necessarily help accomplish the desired outcome, as the metric in isolation loses much of its effectiveness.
Figure 2 describes a conceptual way to consistently drive a software team toward a desired outcome through half-life metrics. Metric M1 is introduced. Later on, when its effectiveness diminishes, it is augmented by metric M2. At a certain point in time, a third metric, M3, is added — and so on and so forth. The idea is that the combination of metrics “triangulates” team behavior toward the desired outcome. To reach a satisfactory level of compensation, the team has to strike a reasonable balance between the three metrics: M1, M2, and M3. It cannot optimize a single metric at the expense of the other two.
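One way to picture the balance described above is a combined score that rewards evenness across metrics. The following is a minimal sketch, not a prescribed compensation formula: it assumes each metric has already been normalized to a score between 0 and 1, and it uses a geometric mean (my choice for illustration) so that a very low score on any one metric pulls the combined score down. The function name and the sample values are hypothetical.

```python
from math import prod


def balanced_score(scores):
    """Combine normalized metric scores (each in [0, 1]) with a geometric mean.

    Assumption for illustration: a geometric mean is one simple way to make
    a weak score on any single metric drag the combined score down.
    """
    if not scores:
        raise ValueError("at least one metric score is required")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name} must be in [0, 1], got {value}")
    return prod(scores.values()) ** (1.0 / len(scores))


# Hypothetical teams with the same arithmetic average across M1, M2, and M3.
balanced_team = {"M1": 0.7, "M2": 0.7, "M3": 0.7}
gaming_team = {"M1": 1.0, "M2": 1.0, "M3": 0.1}

print(balanced_score(balanced_team))
print(balanced_score(gaming_team))
```

Both hypothetical teams average 0.7 arithmetically, yet the team that neglects M3 ends up with the lower combined score. Other combining rules (for example, taking the minimum of the three scores) would serve the same purpose.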
Figure 2 — Three successive metrics.
Figure 3 illustrates how the approach depicted in Figure 2 can easily be implemented to drive quality through technical debt metrics.5 The first metric, unit test coverage, is aimed at creating the “safety net” that will enable the team to refactor the code with confidence. The second metric, cyclomatic complexity, is focused on reducing the occurrence of error-prone modules.6 The third metric, duplication, strives to reduce unexpected behavior across features.7 The combined effect over time of the three metrics (and potentially others) is code of lower technical debt. Such code statistically correlates with the desired outcome: higher-quality code.8
Figure 3 — Three successive technical debt metrics.
Unlike radioactive isotopes, the half-life period of a metric depends on the context in which it is applied. One cannot determine the period in the abstract — it can easily vary from one team to another and from one company to another. A judgment call is required to determine when to introduce a supplementary metric to complement the one that has reached its half-life.
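The judgment call can be informed by a simple check for divergence between the metric and the outcome it is meant to serve. The sketch below is an illustration under stated assumptions, not a substitute for that judgment: it assumes you record, per period, the improvement in the measured metric and the improvement in some independent indicator of the desired outcome. The function name, the window of three periods, and the sample figures are all hypothetical.

```python
def metric_has_decayed(metric_gains, outcome_gains, window=3, tolerance=0.0):
    """Flag a metric whose recent gains no longer come with outcome gains.

    Returns True when, over the last `window` periods, the metric improved
    in every period while the summed outcome improvement stayed at or
    below `tolerance`. All thresholds here are illustrative assumptions.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(metric_gains) != len(outcome_gains):
        raise ValueError("both series must cover the same periods")
    if len(metric_gains) < window:
        return False  # not enough history to judge
    recent_metric = metric_gains[-window:]
    recent_outcome = outcome_gains[-window:]
    metric_still_improving = all(gain > 0 for gain in recent_metric)
    outcome_stalled = sum(recent_outcome) <= tolerance
    return metric_still_improving and outcome_stalled


# Hypothetical per-period improvements for one team.
metric_gains = [5, 4, 3, 2, 2]
outcome_gains = [4, 3, 0, -1, 0]

print(metric_has_decayed(metric_gains[:3], outcome_gains[:3]))
print(metric_has_decayed(metric_gains, outcome_gains))
```

With these sample figures the check does not fire on the first three periods but does fire once all five are considered, which would be the cue to consider introducing a supplementary metric. The right window and tolerance would differ from one team to another.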
The approach recommended here might not be appreciated in corporate environments in which consistency of measurements over a long period of time is required. You might actually face an uphill battle if you are a radical management type advocating this approach within a conservative environment. If this indeed is your situation, my simple recommendation to you comes from the Hebrew quip: Don’t worship the Gods you have created yourself.
In other words, metrics are relative and context-sensitive. Unless a metric is a compulsory regulatory metric, the choice of a metric is only as good as the effect it has toward a desired outcome in a specific context. Consider, for example, the choice of metrics in Figure 3. This specific choice — coverage, complexity, duplication — might be very appropriate for a certain company. Another company might not need to start with a metric for unit test coverage, as the desired outcome has already been accomplished through good technical practices. Instead, this other company might choose to introduce security violations as the first in its sequence of metrics to be used. The triplet of metrics in use in this case is security, complexity, and duplication.
Whether you choose the triplet “coverage, complexity, duplication”; the triplet “security, complexity, duplication”; or you opt to use an altogether different set or a different sequencing of metrics, the important thing is constructing a balance between the chosen metrics. Any single metric in itself probably would not get your team where you really want it to be. In contrast, the equipoise struck by a few thoughtfully chosen metrics will provide an operational envelope within which the team is likely to come realistically close.
— Israel Gat, Director, Agile Product & Project Management Practice
1 Gat, Israel. “Technical Debt Assessment: A Case of Simultaneous Improvement at Three Levels.” Cutter Consortium Agile Product & Project Management Executive Update, Vol. 11, No. 10, 2010.
2 Code duplication is the “practice” of cutting a piece of code from one place and pasting it (hurriedly) in another place, whether it actually fits there or not. For example, device drivers are often duplicated.
3 Austin, Robert, and Israel Gat. “State of the Art in Software Governance.” Cutter Consortium study for private client, March 2011.
4 It is interesting to point out that Capers Jones views decay as an inherent characteristic of software in general: “All known compound objects decay and become more complex with the passage of time unless effort is exerted to keep them repaired and updated. Software is no exception…. Indeed, the economic value of lagging applications is questionable after about three to five years. The degradation of initial structure and the increasing difficulty of making updates without ‘bad fixes’ tends towards negative returns on investment (ROI) within a few years.” (Jones, Capers. Estimating Software Costs. 2nd edition. McGraw-Hill, 2007.)
5 Please note that quality is used here as an instantiation of the performance to be measured.
6 Gat, Israel. “Bugs, Technical Debt and Error Proneness.” Cutter Consortium Agile Product & Project Management Advisor, 12 May 2011.
7 Sterling, Chris, and Israel Gat. “Delving into Technical Debt.” Cutter Consortium Agile Product & Project Management Executive Update, forthcoming 2011.
8 Gat. See 6.