Any team that is responsible for systems already in use, while also iterating on those systems to enhance and improve them, typically carries the burden of interrupts from the existing system. Forward progress, however, is best made with concentrated effort that is focused and uninterrupted. How does one balance these two conflicting jobs of a team?
The first step is to consider an interrupt rotation where only one person at a time on a team fields the interrupts, leaving the others to focus. That person may also work on improvements, but is then the only one who should be getting interrupted. A downside to this approach is that if the interrupt falls outside the interrupt person’s area of expertise, they may spend too much time trying to solve the issue, whereas the “expert” might have quickly found the root cause and solution. While there is value in digging and learning, there is also a big cost in time and frustration from fruitlessly going down a rat hole.
How does one balance the two conflicting objectives of efficiently addressing issues that crop up in existing systems while making focused forward progress on iterating on improvements?
Consider the following, and tweak as appropriate for your team, as your mileage may vary:
- Establish an interrupt rotation where one person fields and addresses all interrupts.
  - If this isn’t a full-time job, devote the first x hours of each day to interrupt office hours.
- Have the interrupt person handle each new interrupt as follows: Is the “house on fire”?
  - If yes: This interrupt must be addressed as quickly as possible. Go distract the person or people best suited to address the issue, and follow the normal communication and escalation processes.
  - If not:
    - Do your best to dig into the top-priority interrupt on your plate and learn as you go about areas previously unfamiliar to you.
    - At the next day’s stand-up, give your update on which issue(s) you’re working on (if the expert is now on another team, attend their stand-up as a guest).
    - As a follow-up to stand-up, if an expert on the team could help guide you to the root cause or solution, they should speak up.
    - The manager, having heard everyone’s update, can make a priority call on whether the expert should leave it at a few words of advice or invest time in helping with the issue that day rather than continue with their previously scheduled program.
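The triage steps above can be sketched as a small decision function. This is only an illustration of the flow, not a tool the process depends on: the names `Interrupt`, `Action`, `triage`, and `standup_followup` are hypothetical, and the `house_on_fire` flag stands in for the interrupt person’s own judgment.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ESCALATE_NOW = "pull in the best-suited people and follow escalation processes"
    DIG_IN = "investigate yourself and report at the next stand-up"


@dataclass
class Interrupt:
    summary: str
    house_on_fire: bool  # hypothetical flag: the interrupt person's judgment call


def triage(interrupt: Interrupt) -> Action:
    """First question for every new interrupt: is the house on fire?"""
    if interrupt.house_on_fire:
        return Action.ESCALATE_NOW
    return Action.DIG_IN


def standup_followup(expert_can_help: bool, manager_approves_time: bool) -> str:
    """Outcome of the follow-up to stand-up for a non-urgent interrupt."""
    if not expert_can_help:
        return "continue digging solo"
    if manager_approves_time:
        return "expert invests time today"
    return "expert gives a few words of advice"
```

For example, `triage(Interrupt("checkout is down", house_on_fire=True))` returns `Action.ESCALATE_NOW`, while a non-urgent issue goes down the dig-in path and is revisited at the next stand-up.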
Once the issue is understood along either path, do either a formal or a mental 5 Whys post-mortem to get to the core of the right measured and proportional response:
- Understanding the root cause(s)
- Fixing the immediate problem
- Addressing the areas that would make the existing system more resilient:
  - Adding/improving monitoring to catch things like this (not just this specific issue) when they start to go wrong.
  - Adding/improving test coverage to prevent issues like this (not just this specific issue) from being introduced or reintroduced by another engineer modifying the code, or by changes in underlying systems or environmental factors.
  - Considering self-healing logic such as retries.
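As one illustration of the self-healing logic mentioned above, here is a minimal sketch of a retry wrapper with exponential backoff. The function name `call_with_retries`, its parameters, and the choice of which exceptions count as transient are all assumptions for the example; what is worth retrying depends on the system in question.

```python
import time


def call_with_retries(
    operation,
    max_attempts=3,
    base_delay=0.5,
    retry_on=(ConnectionError, TimeoutError),  # assumed "transient" failures
    sleep=time.sleep,  # injectable so the wait can be skipped in tests
):
    """Call operation(); on a transient failure, wait and try again.

    The wait doubles after each failed attempt. The last failure is
    re-raised so that monitoring still sees a persistent problem.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Retries only paper over transient failures, so they work best alongside the monitoring and test coverage items in the list rather than as a replacement for them.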
This approach has several advantages:
- Issues with existing systems do get triaged and get some measured and proportional attention with guidance from the manager.
- The majority of the team can work focused/uninterrupted on making improvements to the system.
- The interrupt person spends some amount of time each day broadening their knowledge to learn new areas and reducing the dependency on a single “expert.”
- There is a mechanism to prevent the interrupt person from rat-holing into an unknown area for more than a day without other knowledgeable people learning about it and course-correcting them.
- The interrupts of the focused members come as a follow-up to stand-up, at which point they are already interrupted.
- The manager is involved in the priority decision of whether the ROI of investing more resources in resolving the interrupt outweighs the investment in focused forward improvements of the system as a whole.
- Lastly, note that the principles here could and should also be applied to team-internal interrupts. If one team member is seeking help from another, assessing whether the house is on fire or whether it can wait until tomorrow’s stand-up is healthy in either case, as is doing some archaeology first to learn a bit, see how far you can get on your own, and offer up good context when you do bring it up.
Interrupts for existing systems not under active development
Sometimes interrupts show up for systems that are no longer under active development. This can lead to interrupts for people who worked on that system in the past but have since moved on to other projects. At IMVU, we introduced the notion of “Domain Experts” for various areas of the product. Backups were also designated who either had knowledge of the area or were tasked with helping field issues, so that no Domain Expert remained a single point of failure. For interrupts of this kind, the process is not all that different.
Triage the issue to decide whether it’s a “house is on fire” / drop-everything issue, then essentially proceed as above: either jump on the fire together, or wait until the next day’s stand-up to communicate that a decision is needed on whether to pause your current work to address this issue.
I later introduced similar notions of dealing with interrupts to Twitch and then Pure.
Hand-offs between interrupt cycles
One thing to be aware of is that some interrupts require more depth and context than others. For simple interrupts, or ones where the in-depth effort hasn’t begun, handing off from one person on interrupt to the next makes sense. For more in-depth ones, on their last day on interrupt, the interrupt engineer should mention such issues on their plate at stand-up and then have a follow-up with the manager and the next person on interrupt to decide whether to:
- Continue with that issue themselves,
- Hand it off to the next interrupt person, or
- Pause it and put it on the backlog for the team to revisit down the road.
Motivation for resolving interrupts
Often, people will sigh, moan, whine … when their turn for interrupt duty shows up. This can greatly impact their effectiveness. They often feel they are rooting around in code someone else wrote to uncover what that person failed to account for.
There is a certain joy that can be gained from solving a mystery. What greater mystery than rooting around in someone else’s code? There are rewards in being the Sherlock Holmes who solves the mystery. As a manager, it’s important to show recognition and appreciation for this, and not just for progress on the feature currently under development.
One source of such code is the early code of a start-up. It helps to explain that a successful company might not be where it is today, paying your paycheck, had the early engineers not whipped something together to get it there. This train of thought should be less about guilt and more about appreciation.
Interrupts always ending up with most senior person
As startups grow into larger and larger companies, it’s not uncommon to see the more experienced engineers get buried in interrupts from people coming to them to do this or that or help debug something.
Part of this can be reduced by having a good spin-up and mentoring process.
Part of this can be addressed by ensuring engineers appreciate that well-written code with good test coverage greatly reduces how often others will come and ask them about it, and hence creates freedom for them to learn new things and/or experience upward mobility.
However, there are also times when you’ll hear “Why should I give it to someone else to do? I can do it in an hour and it’ll take them two days, and since I know the system, I’ll do it better. Besides, it seems selfish to ask someone else to burn two days of their time for what would take me 30-60 minutes. Right?” At first blush, that might seem right, but it’s actually the wrong approach. By doing it yourself, you’re depriving the other engineer(s) of the chance to learn and get better at it. The organization won’t scale and neither will you if you take that approach, and the new engineers will soon get demotivated if they are never trusted to do anything. Helping experienced engineers see things from this angle has repeatedly increased their ability to work on new things and, with it, their motivation and engagement.
Every now and then, you’ll get a wise old engineer who counters: “Well, what if it’s a hot customer issue, what if the ‘house is on fire’?” They kind of have a point here given the above; however, even here, things can improve. Grabbing that new engineer to observe while the senior engineer fights the fire (and verbalizes what they do and why) can help bring new folks up to speed on fire fighting. The next step is to have the new engineer be the one with hands on the keyboard while the senior engineer instructs and potentially dictates. Hands-on-keyboard helps with learning more than simply observing does. The next progression is to stop dictating and start asking, “OK, what’s your next step?” and only helping out when they don’t know the answer right away. This process helps bring new engineers up to speed and keeps existing engineers from getting bogged down and demotivated.
Another thing that has proven helpful is to create a routine for the person on interrupt duty to check through various sources of interrupts at the start of each day with some sense of priority to plan their day and the sequence of attack.
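A start-of-day routine like this can be as simple as collecting items from each source and sorting them by how urgent that source tends to be. The sketch below is illustrative only: the source names and their ranking in `SOURCE_RANK`, and the names `Item` and `plan_day`, are hypothetical and would need to reflect wherever your team’s interrupts actually arrive.

```python
from dataclasses import dataclass

# Hypothetical interrupt sources; a lower rank is looked at first.
SOURCE_RANK = {
    "pager": 0,
    "customer_support": 1,
    "team_chat": 2,
    "bug_tracker": 3,
}


@dataclass(frozen=True)
class Item:
    source: str
    summary: str


def plan_day(items):
    """Order the morning's interrupts by source priority.

    Items from unlisted sources go last. The sort is stable, so items
    from the same source keep the order in which they were collected.
    """
    unknown_rank = len(SOURCE_RANK)
    return sorted(items, key=lambda item: SOURCE_RANK.get(item.source, unknown_rank))
```

The resulting list is only a starting sequence of attack; a “house on fire” issue arriving mid-day still jumps the queue.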