How to Run a Great Incident Post-Mortem

Post-mortem meetings are a way to analyze failures and prevent them from happening again. In this article, Yoni Farin, co-founder and CTO of Coralogix, discusses what post-mortem meetings need to address to be as effective as possible.

Software failures happen in production, and no business can avoid them completely. Finding ways to prevent failures from happening again and, ideally, limiting the number and duration of failures will separate successful businesses from the rest.

What is an incident post-mortem?

An incident post-mortem is a meeting held after a software failure. A small group of the people directly involved meets to describe the failure and its impact. During the meeting, the team should discuss process changes that reduce the risk of a repeat failure. The meeting should identify changes that can be implemented and then measured for their effectiveness.

The outcome of a post-mortem meeting should be:

    • A detailed, template-based incident report
    • A full understanding of all contributing root causes
    • Preventive actions that can be taken to reduce the likelihood of recurrence

How to run an effective post-mortem

When to hold the post-mortem

The post-mortem meeting should take place as soon as possible after the incident is over. If too much time passes, team members can forget the details needed to dissect the failure. Ideally, the meeting takes place within 48 hours of the failure being resolved, but it should still be held even if that deadline cannot be met.
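As a rough illustration, the 48-hour window can be tracked with a small helper. This is only a sketch: the function names and the example resolution time are hypothetical, and the window is measured from the moment the incident is resolved.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 48-hour window from the text, counted from incident resolution.
POSTMORTEM_WINDOW = timedelta(hours=48)


def postmortem_deadline(resolved_at: datetime) -> datetime:
    """Return the latest time by which the meeting should be held."""
    return resolved_at + POSTMORTEM_WINDOW


def is_overdue(resolved_at: datetime, now: datetime) -> bool:
    """True once the window has passed. The meeting should still be held."""
    return now > postmortem_deadline(resolved_at)


# Hypothetical incident resolved on 4 March 2024 at 15:30 UTC.
resolved = datetime(2024, 3, 4, 15, 30, tzinfo=timezone.utc)
print(postmortem_deadline(resolved).isoformat())  # 2024-03-06T15:30:00+00:00
```

An overdue flag is a prompt to schedule the meeting, not a reason to skip it.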

Who should attend the post-mortem

Limit the post-mortem discussion to a small group of team members. Although every stakeholder should review the documentation, larger groups can hamper the productivity of the discussion. Attendees should be the people who responded to the incident and the critical stakeholders affected by the outage.

Thoroughly document events

Notes taken at a post-mortem meeting should be as detailed as possible. The intent is that team members can later review the meeting and incident notes and carry out the suggested actions with a full understanding of the context of the failure. Following a template can help keep the meeting on track and ensure that no stage of the failure and recovery is overlooked.

Keep it blameless

A post-mortem analyzes why an incident occurred in order to change policy and prevent it from happening again. A blameless post-mortem does this without blaming any individual or team. This requires assuming that all parties acted with good intentions. The circumstances that led to the failure are what need to change to improve overall performance.

A blameless post-mortem removes the fear of reprimand or insult for all team members. As a result, communication stays honest and objective, incidents are less likely to go unreported out of fear, a healthier work culture is encouraged, and teams are free to do their best.

Discussion points during the meeting

Since this meeting takes place after the issue has been resolved, those present at the meeting should together be able to give a full account of the failure and analyze why it happened. The post-mortem meeting should consolidate this information and communicate it to other stakeholders.

Describe the incident and its resolution

The first part of the post-mortem should dissect the failure. Start by summarizing the incident in a few sentences: what happened and why, how severe it was, and how long it lasted.

The meeting should then break the incident down into separate sections, each focusing on a different aspect of the failure. Include each of these sections in the post-mortem template so that none is skipped.

1. Lead-up

Define the events that led to the failure. Has there been a new feature rollout? Did an external supplier have an outage? Was there a previously undetected bug?

2. Fault

Describe how the change that was implemented was supposed to work, then compare that with how it actually behaved.

3. Impact

Describe how internal and external users were affected by the failure. If support tickets were created during the incident, they can be referenced here.

4. Detection

When and how did the team detect the incident? Were they alerted by an external observability tool, or were the customers the first to alert the team of the outage? Teams could discuss ways to improve detection if there was a significant delay between the failure and when the team was notified.

5. Response

Who responded to the failure? How long after detection did the response begin, and were there any barriers to responding? What action was taken in response?

6. Recovery

Describe how the failure was corrected and how the incident was resolved. How did stakeholders know what steps to take to resolve the issue?

7. Timeline

Detail the timeline of the events described above, including the time of any lead-up events, when the problem was first detected versus the known start of the failure, and when the incident was considered over.
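The seven sections above can be captured in a simple template structure. The following is a minimal sketch rather than a standard format: the class names, field names, and example values are all hypothetical. It records the timeline as timestamps so that detection and response delays can be computed, and it flags any section left empty.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    started_at: datetime    # known start of the failure
    detected_at: datetime   # when the team first detected the problem
    responded_at: datetime  # when the first response action was taken
    resolved_at: datetime   # when the incident was considered over

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    def time_to_respond(self) -> timedelta:
        return self.responded_at - self.detected_at

    def duration(self) -> timedelta:
        return self.resolved_at - self.started_at


@dataclass
class PostMortemReport:
    summary: str    # what happened and why, severity, duration
    lead_up: str    # section 1: events that led to the failure
    fault: str      # section 2: expected versus actual behavior
    impact: str     # section 3: effect on internal and external users
    detection: str  # section 4: when and how the team found out
    response: str   # section 5: who responded and what they did
    recovery: str   # section 6: how the failure was corrected
    timeline: IncidentTimeline  # section 7

    def missing_sections(self) -> list[str]:
        """Names of text sections that were left empty."""
        names = ["summary", "lead_up", "fault", "impact",
                 "detection", "response", "recovery"]
        return [n for n in names if not getattr(self, n).strip()]


# Hypothetical incident used only to exercise the structure.
timeline = IncidentTimeline(
    started_at=datetime(2024, 3, 4, 14, 0),
    detected_at=datetime(2024, 3, 4, 14, 25),
    responded_at=datetime(2024, 3, 4, 14, 32),
    resolved_at=datetime(2024, 3, 4, 15, 30),
)
report = PostMortemReport(
    summary="Checkout requests failed for 90 minutes after a rollout.",
    lead_up="A new feature was rolled out at 13:55.",
    fault="The new code path timed out instead of falling back.",
    impact="Customers could not complete purchases.",
    detection="A customer reported the outage to support.",
    response="The on-call engineer rolled back the release.",
    recovery="",  # left empty on purpose to show the check
    timeline=timeline,
)

print(timeline.time_to_detect())   # 0:25:00
print(timeline.time_to_respond())  # 0:07:00
print(report.missing_sections())   # ['recovery']
```

A long time to detect, as in this made-up example, is the kind of gap the Detection section asks the team to discuss.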

Define the root cause of the incident

Defining the root cause of the failure is key to improving business processes or systems to prevent it from happening again. Unfortunately, sometimes there can be multiple contributing causes for a failure. To get to the root cause, it helps to ask why the decisions were made, again assuming they were made in good faith.

Root cause analysis can be complex when the failure is deep in the software architecture or due to an edge case in user action. To ensure that the root cause of a software failure can be found, observability tools should be in place to help teams quickly identify failures.

Discuss corrective actions to prevent the problem from recurring

After determining the processes causing the error, a corrective action can be established. It could be a new training program, a change in testing processes, or a change to automate a process so that human error is less likely. The corrective action must be directly related to the root cause of the incident to prevent it from happening again in the future.
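One way to check that every corrective action ties back to a root cause, and that no root cause is left without an action, is sketched below. The data model, names, and sample entries are hypothetical and are not taken from any particular incident tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectiveAction:
    description: str
    root_cause: str  # should name one of the identified root causes
    owner: str


def review_actions(root_causes, actions):
    """Return (root causes with no action, actions tied to no known cause)."""
    known = set(root_causes)
    addressed = {a.root_cause for a in actions}
    unaddressed = [c for c in root_causes if c not in addressed]
    unrelated = [a for a in actions if a.root_cause not in known]
    return unaddressed, unrelated


# Hypothetical findings from a post-mortem meeting.
causes = ["manual deploy step skipped", "no alert on error rate"]
actions = [
    CorrectiveAction("Automate the deploy step", "manual deploy step skipped", "ops"),
    CorrectiveAction("Redesign the status page", "status page unclear", "web"),
]

unaddressed, unrelated = review_actions(causes, actions)
print(unaddressed)                          # ['no alert on error rate']
print([a.description for a in unrelated])   # ['Redesign the status page']
```

Actions that map to no identified root cause may still be worthwhile, but they should not be counted as preventing this incident from recurring.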

Prevention of future failures

A successful post-mortem meeting will identify processes and policies to prevent failures from recurring and will not assign blame to an individual’s actions. Identify the root cause(s) of failure from observability data and understand customer issues. Take corrective action by updating processes to prevent similar failures from occurring.
