Story Points Aren’t Dead–You’re Just Making the Wrong Optimisation!

I recently had a Eureka moment with Story Points that I want to distil and share. Understanding this subtle shift in framing explains much of the dissent towards Story Points as implemented in many Scrum teams.
I have used Story Points (to varying degrees) for my entire software career, and lately I have felt that they have been the cause of some tension in my team. I knew the theory behind them quite well, but some slight misconception was causing misalignment on “how to do the accounting without cooking the books”.

 
If you take one thing away from this article, it should be an understanding of the following:
Story Points (or “Complexity” Points) measure the degree to which the system’s components must be changed to achieve the smallest releasable increment; this is completely independent of the time taken to do so.
Or more simply: complexity != time.
This may seem straightforward, but before you stop reading, stick with me while I explore: if Complexity is not a measure of time, how should we assign Story Points?

Why People Subtly Estimate in Time

One of the great failings of “Corporate Agile” is that it took the language of Agile but nothing else. This creates a deceptive surface area where people–using the same vocabulary–believe they are having substantive discussions about the same thing.
[image: Where I stand on the root cause of conflicting opinions on Agile]
The discourse around Story Points (invented in XP and then included in Scrum) also succumbs to this phenomenon.
My Eureka moment was this:
Two irreconcilable systems are at play under the guise of a single system, driven by two intrinsically conflicting intents for Story Points. We have a single name for these different systems because the push for a common vocabulary (e.g. through Scrum Master qualifications) superseded the push for a common understanding.
To explore these two systems, let’s look at the intent–the result each persona wants to see by using the system–of each. The Story Points system services two main personas:
  • The Account/ Project Managers: the people who oversee the project at a high level. They are often responsible for managing timelines and budgets, and are more senior in their organisations than the second persona.
  • The Scrum Team/ Delivery Manager: the people responsible for delivering/ developing the actual features as quickly as possible.
The Story Point system's intent for Project Managers is primarily to estimate when features will be complete to plan releases and retrospectively track the cost to inform feature ROI analysis. It enables the Scrum team to report on capacity planning.
As the Scrum team members often hold less power in the organisation, they are incentivised by the Project Managers to think about Story Points in terms of time. When the Scrum team say “We have three features on the roadmap of 30, 50, and 70 Complexity Points”, the Project Manager equates this directly to time and may plan their releases accordingly. The Scrum team (Scrum Master) may also be responsible for producing reports showing projected timelines (e.g. Gantt charts).
While there is value in following a plan, the Agile Manifesto asserts that responding to change is the superior value. The Scrum team often appreciates the need for capacity planning but, ultimately, they are concerned with picking up the next highest priority, completing it as quickly as possible, and minimising iteration cycles. Tracking Complexity Points takes time that could otherwise be spent on development–actively detracting from delivering value.
I’ve experienced corporate project management in the past few years, and recently, I’ve seen how this requirement to follow a set plan trickles down to how the Developers think about Story Points–as a mechanism to report on progress and predict when we will be “done”. When we estimate tickets with the Epic Burndown Chart in mind, wanting the indicator to stay an accurate picture of our progress (measured in time, because what we constantly have in mind is meeting the arbitrary deadline we have been set), our estimations become reflections of the time we will spend on items (or are at the very least slightly normalised to keep the team speed stable).
This undertone of estimating and tracking time does not mix well with the intent of Story Points as I was originally taught it–that Story Points are not days, hours, weeks, or minutes, nor anything else indicating time spent. The rest of the Scrum team know this too, but the focus on the deadline meant we had forgotten it without realising.

A Different Measure: Value-Added Tasks

To figure out what was going wrong, in an act that was a long time coming, I decided to write down almost everything I know about Story Points and the adjacent systems they feed into (see the standard I wrote below). In doing so, I managed to articulate an intent of the system that delivers value to the Scrum team:
💡
Story Points are primarily a tool to highlight delivery and tech problems so the team can invest in solving them in an informed way to increase delivery speed.
This is quite a different purpose from the one the Project Manager sees, so how does this work?

 
It really matters what we measure. What we measure is what we optimise.
If we measure speed (Story Points/ day), Developers mistakenly attach self-worth to their ability to deliver fast and will try to increase speed however they can. This can result in some gaming of the system:
  • inflated estimations
  • post-estimating tickets that took longer
  • adding pointed tickets for meeting time
For Story Points to bring value to the Scrum team, we must identify the difference between “value-adding” tasks and “non-value-adding” tasks. A value-adding task is one that contributes to creating the final product’s value to the user. There are two kinds of non-value-adding tasks:
  • Required: the activities we accept we must do to enable the valuable transformations (or to spend more time on value-adding tasks). For example, our team writes a small technical strategy on each ticket outlining the changes we expect to make in the ticket. This is required, as having a set of instructions to follow while coding increases our productivity.
  • Non-required: things that don’t contribute at all to value for the end user.
The main value-adding developer task is writing the line of code that gets deployed to production and gets run by a user.
I assert that Complexity Points should only be a measure of value-adding tasks and Developers should defend this as a mechanism for making their work more fulfilling.
This way, when a drop in speed is shown on our burndown chart, it visually illustrates that the Developer encountered some problem that prevented them from delivering value. We can then react and prevent the problem from reoccurring; solving these small everyday problems and ensuring the team is constantly learning creates compounding positive effects on the environment in which developers deliver value.
In practice, this means that when making estimations, I think about the elements of the system that will need to change to do this ticket (i.e. roughly the git diff for the resulting PR) and assign Story Points in comparison to previous tickets of similar sizes (we call these reference tickets which are used to keep the sizing of estimations consistent over time). This trick, thinking about the git diff as a mental model for ticket estimations, keeps me from subconsciously estimating based on the time I think it will take me to complete.
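To make this concrete, here is a minimal sketch of how a team might codify reference tickets. All of the ticket names, point values and “components touched” figures are invented, and counting touched components is just one rough stand-in for the mental git diff:

```python
# Hypothetical reference tickets: past tickets whose complexity the team agreed
# on, with a rough sense of how much of the system each one changed.
REFERENCE_TICKETS = [
    {"name": "Add a field to an existing form", "points": 1, "components_touched": 1},
    {"name": "New API endpoint plus UI wiring", "points": 3, "components_touched": 3},
    {"name": "New user flow across app and backend", "points": 8, "components_touched": 7},
]

FIBONACCI = [1, 2, 3, 5, 8, 13]


def estimate(components_touched: int) -> int:
    """Compare the expected footprint of the new ticket (a mental 'git diff')
    against the closest reference ticket, then snap to the Fibonacci scale."""
    closest = min(
        REFERENCE_TICKETS,
        key=lambda ticket: abs(ticket["components_touched"] - components_touched),
    )
    return next(points for points in FIBONACCI if points >= closest["points"])


# Example: we expect this ticket to touch roughly four components.
print(estimate(4))  # -> 3, relative to these illustrative reference tickets
```

Whatever proxy the team uses, the comparison is always against reference tickets, never against how long the work is expected to take.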

A Small Example of How to Estimate

There are lots of problems on my current project. Our burndown chart is a tool to highlight these problems so that we can see and solve them.
Here’s an example of a problem I see on our project and how it unfolds in both systems.
In order to change the behaviour of some of our app’s flows, we have a visual code editor (a low-code tool that we can use to visually represent a user flow), from which we export and deploy an XML representation which then controls the app logic. At some point, to extend the functionality of this low-code tool, someone wrote a script which takes the XML output and does some pre-processing before we deploy it. Over time, this script got more complicated, and we started using features of the low-code tool that meant our exports were no longer compatible with the pre-processing script.
Our process to change functionality in these flows is now:
  1. go to the visual editor and make some changes
  2. save and export the user flows
  3. manually edit the exported output so it is compatible with our pre-processing script
  4. run the pre-processing script against the output
  5. add an SQL script (don’t ask) which deploys the new code on app start-up
This whole process takes time, but produces a PR with a fairly minimal diff.
This is a problem. The user doesn’t care about any of these steps, only the lines of deployed code that produce the desired behaviour–if I can reduce my time on these non-value-adding tasks, I can spend more of my time writing cool features for users.
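For illustration only (the real export format and pre-processing script aren’t shown in this article, so the attribute name, file names and script name below are invented), a small wrapper like this could apply the known compatibility fixes automatically before handing the export to the existing script:

```python
# Hypothetical sketch: automate step 3 (the manual compatibility edit) so the
# export can be fed straight into the existing pre-processing script.
# The attribute being stripped, the file names, and the script name are invented.
import subprocess
import xml.etree.ElementTree as ET

EXPORT_FILE = "flow_export.xml"            # what the visual editor produced
PATCHED_FILE = "flow_export.patched.xml"   # compatible version we generate

tree = ET.parse(EXPORT_FILE)
for node in tree.iter():
    # Example fix: remove an attribute the pre-processing script can't handle.
    node.attrib.pop("editorVersion", None)
tree.write(PATCHED_FILE, encoding="utf-8", xml_declaration=True)

# Hand the patched export to the existing (unchanged) pre-processing script.
subprocess.run(["python", "preprocess.py", PATCHED_FILE], check=True)
```

The specific fix doesn’t matter; the point is that removing the manual step shows up as recovered speed on later tickets.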

Estimating by Time

If we’re playing to the system that the Project Managers want us to use, we would include some extra Story Points to account for the fact that more time will be spent doing these extra steps. In this case, we give a more accurate indication of when we expect the work to be complete (we give a bigger estimation because it will take more time); however, our burndown chart now hides the problems that we know exist.
[image: burndown chart when estimating by time]
Further, increasing an estimation for a task that is more “complex” or more uncertain is just a proxy for saying, “we expect this to take more time”. Some people believe the tradition of using Fibonacci for Story Point values is because it accounts for uncertainty with larger items of work, but this shouldn’t be true–using complexity as the unit for estimations explains Fibonacci by pointing out that our projection of the required changes (the diff) gets less precise the bigger the piece of work (I still think we use Fibonacci because it’s nerdy).

Estimating by Complexity

As the diff is small (i.e. the value-adding part of the task is small), fewer Complexity Points are assigned to this ticket. When the Developer then takes longer to complete it, the drop in speed raises the problem so it can be investigated, triggering the Tech Lead to come along and fix the pre-processing script to help the devs move quicker.
[image: burndown chart when estimating by complexity]
It’s important to note that we shouldn’t plan to fail a Sprint. If we know that we have lots of these tickets coming up, we can use our intuition to reduce our capacity for the Sprint–planning to complete fewer tickets because we expect the ones we take in to take longer. A typical case for this might be that we know we are working on an older section of the codebase and that we might run into some nasty, brittle code.
Any time that someone speaks out in a Sprint planning to reduce our capacity, we should also consider their reason for doing so to be a cause for problem-solving:
“Hey, this section of the codebase is a mess, I’m not sure I’ll be able to finish all these this week”.
“OK, let's take out some of these tickets, but we should consider refactoring that section of the codebase so we can continue to work on it quickly in the future”.
Our new burndown chart would look like this–we still plan to complete the Sprint by reducing the capacity, we still see the problem occurring, and we still succeed the Sprint!
[image: burndown chart with reduced Sprint capacity]

Speed as a Negotiation Tool for Increasing Productivity

With this system, we can do some cool analysis on speed which visualises wasted time due to problems and where we can make the biggest investments for increasing productivity.
If our speed looks like this:
[image: team speed chart showing a drop across a group of tickets]
We can investigate why the speed dropped (and it really did drop in real terms of value produced) for the group of tickets–we will probably discover some nasty code-coupling that can be fixed with a little work!
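As a rough sketch of that analysis (the ticket data and the 50% threshold are made up), something like this flags the tickets whose realised speed fell well below the team average, i.e. the candidates for a root-cause look:

```python
# Illustrative only: flag tickets whose realised speed (points per day spent)
# fell well below the team average, as candidates for root-cause problem-solving.
tickets = [
    # (ticket id, story points, days actually spent) -- made-up data
    ("APP-101", 3, 1.0),
    ("APP-102", 5, 1.5),
    ("APP-103", 2, 2.5),   # small diff, but lots of manual steps -> slow
    ("APP-104", 3, 1.0),
]

team_speed = sum(p for _, p, _ in tickets) / sum(d for _, _, d in tickets)

for ticket_id, points, days in tickets:
    speed = points / days
    if speed < 0.5 * team_speed:  # arbitrary threshold, purely for illustration
        print(f"{ticket_id}: speed {speed:.1f} vs team average {team_speed:.1f}"
              " -> investigate what slowed us down")
```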

But Estimating Tickets by Time Gives Us a Roadmap!

It’s worth noting that this method of estimation I’m proposing does not give a 100% clear projection of when the features will be completed, only an indication. If we expect the speed to vary from feature to feature in order to visualise problems, then we cannot simply project a constant speed and calculate deadline = estimated complexity / team speed.
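To illustrate with invented numbers: projecting from a varying speed history gives a range at best, which is exactly the point:

```python
# Illustrative projection: Complexity Points give a best guess, not a promise.
remaining_points = 60
# Speed per sprint has varied (made-up history) precisely because some sprints
# hit problems -- which is exactly what we want the measure to show.
speed_history = [14, 22, 9, 18]   # points completed per sprint

average_speed = sum(speed_history) / len(speed_history)
best_guess = remaining_points / average_speed
optimistic = remaining_points / max(speed_history)
pessimistic = remaining_points / min(speed_history)

print(f"~{best_guess:.1f} sprints, range {optimistic:.1f}-{pessimistic:.1f}")
# -> ~3.8 sprints, range 2.7-6.7: wide enough that a single promised date
#    would be misleading without contingency.
```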
It’s also worth noting that humans are notably bad at estimating the time required to complete a task–especially on the macro level. We can make a prediction, but this often leads someone to promise it will be done by X day, and when our prediction turns out to be wrong, all hell can break loose.
I think using the system I propose, which doesn’t promise to make projections, exposes this uncertainty and hopefully highlights that a timeline projection using Complexity Points is a best guess, to which people can add the contingency that they wish.

Should We Just Measure the Rate of Git Contributions?

Personally, I don’t promote this. I don’t think it captures all value-adding activities and it also opens the system up to other types of gaming–for example committing extra whitespace or preferring solutions with more code (rather than a simpler solution).
Using git contributions as a heuristic for estimations doesn’t tell us two important things:
  1. If our code solves a real user problem: we could be producing code at a fast pace, but actually no one uses our app. It’s important to keep this in mind–I use the word “speed” and not “velocity” because this method doesn’t measure velocity (i.e. speed in a given direction). We could be rowing the boat very quickly in the wrong direction.
  2. The quality of our solution: we only see how easy the system is to change (speed). That is an extremely important thing to optimise, as it makes the business more agile, but it can’t tell us whether we’re over-engineering our solutions or only accepting the “essential complexity” required for a solution–we could write our solution in assembly code (probably a bad idea), producing thousands of lines instead of just using a few lines of React to create our GUI.
It’s important to keep these in mind.
I think measuring git contributions also creates a competitive/ toxic environment that I wouldn’t want to be a part of.

When to Break the System

Sometimes it’s more pragmatic to include some buffer for time in our estimations in edge cases. To know when this is OK, keep the following rule in mind:
Any time you treat Story Points as a proxy for time, you are saying, “There is a problem on my project that I accept we will not solve. To achieve greater delivery visibility, I am increasing the Story Points on this ticket to account for that”.

Conclusion

At the very least, I don’t think Story Points are leaving the world of software development anytime soon–whether they are a net positive or negative, they are here to stay. Hopefully understanding how they can be used to benefit those who have to use them brings a better implementation and less frustration.
 

 

Appendix: A Standard For Lean Story Points

 📖 Definitions

📖
Story Points (or Complexity Points): a measure of the degree to which the system’s components must be changed to achieve the smallest releasable increment (a ticket).
This measure excludes the contributions of “non-value-added activities” associated with changing the system.
📖
Non-Value-Added Tasks: steps in a process that don't contribute to the final product's value or align with customer expectations.
In Software Development, the only “value-added” touch time is the moment your finger presses the key that writes a character of code that changes the behaviour of the system and is deployed to production; all else is “non-value-added”.
📖
Developer Speed: the rate of completion of Story Points by a single developer.
📖
Team Speed: the rate of completion of Story Points by the entire team.

 💛 Intent

1️⃣
Increase the speed of delivery by solving the root causes of problems (in terms of the 4Ms) identified by measuring developer speed in completing value-added activities.
If understood correctly, speed can be used to negotiate an investment in tech quality.
2️⃣
Provide an indication for capacity planning and feature prioritisation by keeping our measure consistent over many tickets (through the law of averages).
Speed is not expected to be constant from ticket to ticket, as that would imply that no problems were faced on the project. For this reason, combined with the fact that we charge for time (not “complexity”), Story Points are not a 1:1 mapping to “cost”.

 ✅ Key Points

✅ Point
❓ Explanation
Complexity should (roughly) be proportional to the amount of git diff produced in the pull request for the ticket.
The diff is the “value added” transformation made to the codebase. There is some subtlety here: some of the code introduced will be essential complexity and some will be accidental complexity–we want to avoid the latter.
When a problem occurs, this is directly reflected by a drop in speed. The team reacts with problem solving.
This indicates that your indicator is working and the system is healthy.
Values go up in Fibonacci sequence (1️⃣2️⃣3️⃣5️⃣8️⃣).
• The bigger a piece of work, the worse we are at predicting its complexity. This is a mechanism to account for imprecision on more complex tickets. • Note the incorrect explanation of this in the Common Mistakes section.
Developers have a reference for what amount of “changes” constitute each number of points.
This creates consistency across multiple developers–both a Tech Lead and Developer can have the same interpretation, so even if the team changes, the measure remains useful.
Tickets where additional, unexpected complexity is discovered (e.g. “we expected a 10 line change, but discovered it was a 20 line change”) are post-estimated.
We can keep track of the total complexity of the Epic to use as a reference for other Epics in the future.
The sprint capacity uses the speed (informed by previous sprints) to plan which tickets can be completed in the sprint. The capacity can be changed in light of different circumstances at the discretion of the team.
• When ending a Sprint, we need to decide what to do with “in progress” tickets. Some methods split the ticket and “leave a point on” to get it through CR; instead, carry the tickets that are not done into the next sprint and adjust the capacity.
When we discover an increase in scope, after post-estimating that ticket, other tickets are removed from the sprint to react to this. The PO is consulted on the change to the sprint.
If scope changes (because the PO is trying to change the tickets in the Sprint), you must take out from the sprint ~2x the number of points that you put in
This accounts for the time taken in BR, TR, etc… → you should expect your speed to decrease (deliver less due to the interruption)

 ❌ Common Mistakes

❌ Mistake
❓ Explanation
The team often succeeds in sprints during weeks plagued by lots of problems.
This is a smell that Story Points are not doing what they are designed to do.
Using Story Points and team speed directly to report to the FS on the cost of features (using Story Points to audit how the team spends their time).
• Story Points are not designed to be a reporting tool to audit how long we spent on each feature. They can give an indication, but not an exact figure. • Not reaching the capacity indicates to the FS that the team are not working their X hours/ day; trying to force these two metrics to align reduces the effectiveness of Story Points as a visual indicator for problems.
When a problem occurs, this is accompanied by an increase in scope.
i.e. timeboxes are added to the sprint, or post-estimates are made (neither should affect the complexity). • Over many sprints, increasing the scope by a small amount for each ticket with a problem leads to an unclear picture of the remaining work (as we don’t know how much additional scope will be added before the Epic gets to “done”). • E.g. if you estimate an Epic at 50 Story Points with a speed of 5, one dev would expect to be done in 10 days (2 sprints). Suppose we complete 25 points in sprint 1, but in that time the scope has increased to 60 (2 points of scope increase per day due to problems): we see 25/60 Story Points complete (~42%). In reality, if the team continues, the Epic will end up at 70 Story Points (taking into account the scope increases that keep happening on average), so we are really at 25/70 ≈ 36% complete. It’s better to have kept the Epic at 50 Story Points and counted only ~18 Story Points as complete, as this shows the pessimistic/ realistic view by default and prevents setting false expectations (a short sketch after this section reproduces this arithmetic).
Values go up in Fibonacci sequence (1️⃣2️⃣3️⃣5️⃣8️⃣) because larger tasks have more uncertainty.
More uncertainty does not mean more complexity! Extra points are not given because this particular change is tricky–if it is tricky/ complicated, we should expect to see a slower speed–if we don’t, it means the system of Story Points (for identifying problems) is being undermined and we might as well stop using them.
The stakeholders are shown the weekly burndown chart (BDC) instead of having a qualitative understanding of what “success” of the sprint looks like (i.e. “if these features are in production by the end of the sprint, it was a success”).
• The team starts to optimise the measure they are judged against. When the client judges the sprint based on the BDC, the team is incentivised to “fake” the results so they can feel good at the end of the sprint–even though a failed sprint does not reflect poor work (often the opposite). • The team starts to use problem-solving to justify to the client why they are behind (”we’re behind, but don’t worry because…”).
Adding post-estimations because a ticket took longer than expected.
Story Points ≠ Time–only if you discovered extra code that needed to be produced should you post-estimate your ticket, to give a “true speed” for the sprint.
Doing a mental calculation of “I think this should take me X time, other tickets that took me X were Y points, therefore, this is a Y point ticket”.
Story Points ≠ Time–this creates the conditions where problems can be missed on the project. For example, let’s say tickets A and B take the same time: ticket A has a big git diff; ticket B has a smaller git diff, but there is a script to run to generate that small diff which is quite fiddly and needs some manual intervention, so it takes longer. If both A and B are given 5 points, we have missed an opportunity to identify that ticket B can be optimised and automated, which would otherwise be shown via decreased speed.
Developers attach their self-worth to their speed and get demotivated when they don’t meet it instead of considering other factors (the 4Ms) as a cause for improvements on the project.
This derives from a misunderstanding of the use of Story Points.
Internally measuring the success of the team by the outcome of the sprint rather than how they react to problems discovered.
• Goodhart’s law: when a metric becomes a target, it loses its value as a metric. • In this case, we are allowing/ incentivising the developers to give generous estimates and then pat themselves on the back for meeting the team speed (team speed is dependent on the accuracy of “estimates”).
Presenting to stakeholders countermeasures “to catch back up”.
• Lost time is lost - you likely won’t see the rewards of any problem solving you do within the same sprint (unless you’re doing the same thing over and over, which is unlikely). • Instead present to the client that you have solved the root cause of the problem that you faced - if it took 3h to do a task that you automate away, speak to the speed you will gain in the future (3h lost this week doesn’t seem so bad when we gain 3h every following week)
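The sketch below simply reproduces the arithmetic of the scope-increase example above (a 50-point Epic, a speed of 5 points per day, 2 points of scope added per day of problems); no new numbers are introduced:

```python
# Reproduces the arithmetic of the example above: inflating scope hides progress.
epic_estimate = 50        # original Epic estimate (Story Points)
speed = 5                 # points completed per day
scope_creep_per_day = 2   # points of scope added per day due to problems
sprint_days = 5

completed = speed * sprint_days                                            # 25 after sprint 1
scope_after_one_sprint = epic_estimate + scope_creep_per_day * sprint_days       # 60
scope_after_two_sprints = epic_estimate + scope_creep_per_day * 2 * sprint_days  # 70

print(f"Reported today: {completed}/{scope_after_one_sprint} "
      f"= {completed / scope_after_one_sprint:.0%}")              # ~42%
print(f"Reality by the end: {completed}/{scope_after_two_sprints} "
      f"= {completed / scope_after_two_sprints:.0%}")             # ~36%
# Keeping the Epic at 50 and counting only original-scope work shows the
# realistic picture up front: 25 * 50/70 ≈ 18 of 50 points (~36%).
```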

🧠 Mental Models

Essential vs Accidental Complexity
Source: Martin Fowler
📖
Essential Complexity: the minimum complexity cost of achieving a given change to a system.
📖
Accidental Complexity: total complexity - essential complexity. This is the cruft that could have been avoided in the implementation.
No ticket is implemented without some accidental complexity, as no solution is perfect. It’s the developer’s job to reduce accidental complexity, as it is time that could otherwise have been spent on something valuable.
Developer Speed Over Time

Typical Dev Daily Speed

Inattentiveness to speed (or misuse of Story Points) leads to a decrease in speed over time as complexity builds in the codebase. This is typical and can happen within weeks or months if code quality is neglected.
[image: typical dev daily speed, declining over time]

Ideal Dev Daily Speed

The team reacts to problems indicated by the speed and therefore continues to increase speed steadily rather than stagnating.
[image: ideal dev daily speed, increasing steadily]
The Cobra Effect for BDC (Perverse Incentives)
[image: the Cobra Effect]
This is when the reward for taking some action results in undesirable behaviour (or sometimes the inverse of what was intended). For example, snake catchers may receive incentives such as a commission for each snake they catch–this leads to the behaviour of breeding even more snakes to “catch” for the reward (while the existing snakes remain free in the wild).
In the context of showing the BDC, this can lead to teams accounting for their time with timeboxes, which makes the sprint appear to succeed, but at the cost of hiding problems from the team, which remain problems.
Value-Added Touch Time
Source: Jonathan Wagner
Consequences of Problems Increasing Scope

✅ BDC Using Consistent Complexity

The below BDC is unambiguous - it shows a problem occurred and that we shouldn’t expect to complete all the features this sprint.
[image: BDC using consistent complexity]

❌ BDC With Investigation Ticket or Incorrectly Post-Estimated Ticket with Descoping to “Stay on Track”

In this case, the team has run into a problem, however, it doesn’t show in the BDC because either:
  • The team created a timebox to account for time spent and removed tickets from the sprint
  • The team post-estimated as higher and removed tickets from the sprint
This creates ambiguity as to what happened - did a problem occur or not?
[image: BDC with descoping to “stay on track”]
Story Points in Brownfield v Greenfield

Brownfield: high variance in speed due to existing codebase complexity

This may affect how you do capacity planning for the sprint
[image: brownfield BDC with high variance in speed]

Greenfield: lower variance in speed as everything is new ✨

[image: greenfield BDC with lower variance in speed]
Complexity ≠ Time
There are two mental models of the utility of Story Points:
1️⃣
Lean Story Points are primarily a tool to measure and raise tech quality problems so the team can invest in solving them in an informed way to increase delivery speed.
2️⃣
Corporate Story Points are primarily used to estimate when features will be complete to plan releases and retrospectively track the cost to inform feature ROI analysis.
Both can be valid interpretations; however, the effectiveness of using Story Points diminishes when the two are mistakenly merged.
If management wants the output of “System 2️⃣”, it makes sense to:
  1. Use a timer to track the time being spent on each feature
  2. Estimate in time rather than overcomplicating the conversion from Story Points
      • This does not mean refining any less than we already do, just measuring in “hours” and “days”
Therefore, the only sensible implementation of Story Points is “System 1️⃣”, where we measure Story Points relative to “Complexity”, but not time.
Complexity ≠ Value
🧠
If you think that the more Story Points you push to production, the more value you are delivering, you are mistaken. Value requires speed and direction. If you are pumping out features that no one uses, you are delivering no value.
This is why we use the word “speed” instead of “velocity”:
  • Speed = distance / time
  • Velocity = speed in a given direction
How many Story Points you complete says nothing about the value of your product (the “direction”), only the rate of change–a fast rate of change is good, as it allows you to be agile, but it’s not everything.
 
De-Stocking End of Sprint
For convenience, we schedule the sprint to end at the same time every week to have a review with stakeholders. This means that not every ticket might be finished yet.
What do we do with the remaining tickets still in “doing”?
Nothing is in “done”, i.e. in production, so the ticket must be carried over to the next sprint if we still want it.
This interferes with capacity planning: if speed is 5 but some work has already been done on a given ticket (let’s say 4 of its 5 points), we can expect the dev to get that ticket to done sooner. In this case, just alter the capacity of the sprint–take in extra tickets to account for the fact that we anticipate this 5-point ticket will be in “done” sooner (so we expect to be able to complete more tickets).
Speed is just used to inform capacity; the team can make an informed decision to take in more tickets.
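One possible way to quantify that adjustment, using the example’s own numbers (this is an interpretation of the guideline above, not a prescribed formula):

```python
# Illustrative capacity adjustment for a ticket carried over from last sprint.
team_speed = 5                 # points per sprint in the example
carried_ticket_points = 5      # the carried-over ticket keeps its full estimate
points_already_done = 4        # work already completed on it last sprint

remaining_effort = carried_ticket_points - points_already_done   # ~1 point left
# The team can plan for its usual speed plus the work already banked on the
# carried ticket, since only `remaining_effort` of it still needs doing.
sprint_capacity = team_speed + points_already_done
print(sprint_capacity, remaining_effort)  # plan ~9 points; ~1 point finishes the carry-over
```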
 
When to not follow the standard
Pre-requisite: I understand the intent of Story Points
⚠️
Sometimes the intent of a standard does not match what you want to achieve on a project. You may want to include non-value added tasks as part of the estimation for Story Points.
Any time you treat Story Points as a proxy for time, you are saying, “There is a problem on my project that I accept we will not solve. To achieve greater delivery visibility, I am increasing the Story Points on this ticket to account for that”.
 
