My Ramblings

If you are reading this you must be pretty bored…

Resilience Engineering

Here are some notes from a couple of interesting articles by John Allspaw on Resilience Engineering.

  • Resilience is about being able to function, rather than being impervious to failure
  • Looking at the things that go right is a better strategy for improving resilience
  • Failures in complex systems don't have a singular root cause
  • Identifying human error as a root cause should prompt you to figure out what led to that error
  • Politically safe environments are required if you truly want to figure out what happened, and how and why a human error occurred
  • In addition to learning why things go wrong we ought to learn just as much from why things go right
  • Safety is not the absence of incidents and failures but rather the presence of actions, behaviors, and culture that cause an organization to be safe
  • Anyone, at any time, no matter their seniority, can make a mistake or act under faulty assumptions
  • Making a mistake should be acceptable and admitting fault should be encouraged
  • Near-miss events are excellent learning opportunities: they are a small dose of failure that doesn't really hurt, they happen more frequently than real failures, and they are a powerful reminder that maintains the "constant sense of unease" required for resilience in a system
  • The goal of a post-mortem should be to gather as much information as possible about an incident, mistake, etc. and to spread the observations within the organization in order to prevent similar events from happening in the future
  • Components in complex systems come together to behave in ways that they never would have on their own in isolation
  • The Four Cornerstones of Resilience
    • Anticipation - Knowing what to expect in the future
      • Architectural reviews
      • Operability reviews
      • Game day exercises

    • Monitoring - Knowing what to look for
      • System metrics
      • Business metrics
      • Metrics on operations and activities of both infrastructure and staff
    • Response - Knowing what to do
    • Learning - Knowing what has happened
      • Post-mortem
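
To make the Monitoring cornerstone and the near-miss idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration and not something taken from the articles: the latency metric, the WARN_MS and FAIL_MS thresholds, and the classify and summarize helpers are all hypothetical. The point is only that a metric can be watched for samples that came close to failing, not just for the ones that actually failed.

```python
from collections import Counter

# Hypothetical thresholds for an illustrative latency metric (milliseconds).
WARN_MS = 800    # above this, the request came close to failing: a near miss
FAIL_MS = 1000   # at or above this, the request counts as a failure


def classify(latency_ms: float) -> str:
    """Label one sample as 'ok', 'near_miss', or 'failure'."""
    if latency_ms >= FAIL_MS:
        return "failure"
    if latency_ms > WARN_MS:
        return "near_miss"
    return "ok"


def summarize(samples):
    """Count how many samples fall into each label."""
    return Counter(classify(s) for s in samples)


# Made-up sample data, purely for demonstration.
samples = [120, 850, 400, 990, 1200, 300]
summary = summarize(samples)
print(summary["ok"], summary["near_miss"], summary["failure"])
```

Counting near misses separately from failures gives a team something to review and learn from before anything actually breaks.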

Links