Critical Incidents Checklist

A checklist for handling critical incidents and data recovery. Based on the talk "Saving Great Projects", given at Python Brasil 2017.

1. Manage the client's expectations.

  • Make sure the client knows from the start that critical incidents are natural.
  • Disclose your backup plan and recovery process.

2. Assess the severity of the situation:

  • What happened, who was affected and what's the impact of the issue?

3. Declare an incident:

  • Do the impact and complexity of the issue require a team effort?

4. Assign clear responsibilities for the team:

  • Who will communicate with the client?
  • Who will fix the issue?
  • Who will work on the restoration?

5. Have a transparency policy:

  • Notify the client and take responsibility as a team.

6. Define a recovery and data restoration plan:

  • Identify the bug causing the incident and issue a hotfix.
  • Identify the latest backup with valid data.
  • Define the time-frame not covered by the backup.
  • Retrace the state of the system during the time-frame not covered by the backup.
  • Write data restoration scripts.
  • Specify all commands and steps required for the restoration.
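
As a rough illustration of the backup-related items above, the sketch below picks the latest backup that passed validation and computes the time frame it does not cover. The `Backup` record, its `is_valid` flag, and the sample timestamps are all hypothetical; a real project would take them from its own backup tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Backup:
    # Hypothetical record describing one backup; not tied to any specific tool.
    taken_at: datetime
    is_valid: bool  # assumed to be set by whatever validation the team runs


def latest_valid_backup(backups: list[Backup], incident_at: datetime) -> Optional[Backup]:
    """Return the most recent valid backup taken before the incident, or None."""
    candidates = [b for b in backups if b.is_valid and b.taken_at < incident_at]
    return max(candidates, key=lambda b: b.taken_at, default=None)


def uncovered_window(backup: Backup, incident_at: datetime) -> timedelta:
    """Time between the backup and the incident; changes made here must be retraced."""
    return incident_at - backup.taken_at


# Made-up sample data for illustration only.
incident_at = datetime(2017, 10, 6, 9, 0)
backups = [
    Backup(datetime(2017, 10, 5, 2, 0), is_valid=True),
    Backup(datetime(2017, 10, 6, 2, 0), is_valid=True),
    Backup(datetime(2017, 10, 7, 2, 0), is_valid=False),  # taken after the incident
]

chosen = latest_valid_backup(backups, incident_at)
window = uncovered_window(chosen, incident_at) if chosen else None
print(chosen, window)
```

Writing the window down explicitly gives the team a concrete range to search in logs, emails, or third-party systems when retracing the state of the system.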

7. Execute the plan while providing rapid status updates:

  • Test the restoration locally.
  • Back up the data.
  • Restore the lost data.
  • Identify the data that could not be recovered.
  • Disclose the lost data to the client.
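
One possible way to identify the data that could not be recovered is to compare the record IDs the team expected to restore with the IDs actually present after the restoration. This is only a sketch: the function name and the sample IDs are hypothetical, and it assumes records can be matched by a stable identifier.

```python
def unrecovered_ids(expected_ids, restored_ids):
    """Return the sorted IDs that were expected but are missing after restoration."""
    return sorted(set(expected_ids) - set(restored_ids))


# Made-up sample data for illustration only.
expected = [101, 102, 103, 104, 105]  # e.g. retraced from logs for the uncovered time frame
restored = [101, 102, 104]            # e.g. read back from the database after the restore

missing = unrecovered_ids(expected, restored)
print(missing)
```

The resulting list can serve as the basis for what is disclosed to the client.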

8. Write an incident postmortem with the team:

  • What happened?
  • Why did the incident occur?
  • What was the resolution, and how effective was it?
  • What would the team do differently?
  • What problems did the team encounter?
  • What actions will be taken to make sure the incident doesn’t happen again?

9. Update existing practices:

  • Take time after the incident to read the postmortem and update whatever is necessary.

Yay! You completed the checklist top to bottom!
Now spread the ❤︎ by thanking the author, making improvements or creating your own checklist!