Datadog + VictorOps Webinar

Posted on 16-Feb-2017


TRANSCRIPT

Do’s & Don’ts of post-incident analysis

Jason Hand
DevOps Evangelist
DevOps, Dogs, Horses, and Mountain Living
Twitter: @jasonhand

VictorOps
Incident management & notifications
Makes on-call suck less!
Twitter: @victorops

Jason Yee
Technical writer/evangelist
Travel hacker & Chef
Twitter: @gitbisect

Datadog
SaaS-based full stack monitoring
Over a trillion data points per day
Twitter: @datadoghq

Agenda

Service Disruptions

Detection

Diagnosis

Post-incident analysis

Framework

Follow & Share on Twitter

#VOWebinar

@gitbisect @jasonhand

@datadogHQ @VictorOps

Service Disruptions

There is no such thing as being soooo good that you’ll never fail

Service disruptions are a reality in ALL complex systems

Complex Systems

● Diversity
● Interdependent
● Adaptive
● Connectedness (i.e., we can be connected but not dependent on each other)

Cynefin Framework

● Obvious - cause & effect is obvious to all

● Complicated - cause & effect requires analysis or expert knowledge

● Complex - cause & effect can only be perceived in retrospect

● Chaotic - no relationship between cause & effect

Cynefin diagram by Dave Snowden CC BY-SA 3.0

Contributing Factors

Systems Thinking: an understanding of a system by examining the linkages and interactions between the components that comprise the entirety of that defined system

MTTR vs MTBF

Mean Time To Repair

vs

Mean Time Between Failure
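To make the contrast concrete, here is a minimal sketch in Python using hypothetical numbers (three outages in a 30-day window); none of the figures come from the webinar.

```python
# Hypothetical example: three outages in a 30-day window, with the minutes
# spent restoring service for each one.
repair_minutes = [20, 45, 10]
window_minutes = 30 * 24 * 60

uptime_minutes = window_minutes - sum(repair_minutes)

mttr = sum(repair_minutes) / len(repair_minutes)  # Mean Time To Repair (minutes)
mtbf = uptime_minutes / len(repair_minutes)       # Mean Time Between Failures (minutes)

print(f"MTTR: {mttr:.1f} minutes")
print(f"MTBF: {mtbf / 60:.1f} hours")
```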

Detection

Collecting data is cheap. Not having it when you need it can be expensive.

4 qualities of good metrics

Not all metrics are created equal

1. Well understood

2. Granular

3. Tagged & filterable

4. Long-lived
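To illustrate qualities 2 and 3 (granular, tagged & filterable), here is a minimal sketch using the DogStatsD client from the open source `datadog` Python package; the metric name, value, and tags are hypothetical, not taken from the webinar.

```python
from datadog import initialize, statsd

# Point the client at a local DogStatsD agent (default port shown).
initialize(statsd_host="localhost", statsd_port=8125)

# A granular gauge, submitted with tags so it can later be filtered and
# aggregated by service, environment, or host. Names and values are illustrative.
statsd.gauge(
    "checkout.queue.depth",    # well understood: named for what it measures
    42,                        # hypothetical current value
    tags=["service:checkout", "env:prod", "host:web-03"],
)
```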

Diagnosis

Real-time Notification

Getting “the right” Humans Involved

Paging has evolved to: Smart & Actionable alerts ...

Routed to the right teams and people …

With valuable context

Graphs, Logs, Runbooks
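As a hedged sketch of what a smart, actionable, routed alert can look like, the `datadog` Python client can create a monitor whose message carries context (a runbook link) and a routing mention; the query, threshold, runbook URL, and `@victorops-...` routing key below are illustrative assumptions, and the mention syntax presumes the VictorOps integration is enabled.

```python
from datadog import initialize, api

# Hypothetical keys, query, threshold, runbook URL, and routing key.
initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:checkout.queue.depth{env:prod} > 100",
    name="Checkout queue is backing up",
    message=(
        "Queue depth is {{value}} on {{host.name}}.\n"
        "Runbook: https://wiki.example.com/runbooks/checkout-queue\n"
        "@victorops-checkout-team"  # assumed routing key; pages the on-call team
    ),
    tags=["service:checkout", "team:checkout"],
)
```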

Automation

ChatOps

jhand.co/chatopsbook

The Full Incident Lifecycle

What we are really here to learn about...

Post-incident Analysis (a.k.a. learning review, postmortem)

Do: Establish that we are here to learn

The primary objective of these exercises is to learn

Do: Establish timeline of events

Identify when the anomaly was first detected, who the first responders were, which SMEs were pulled in to assist, the conversations and commands that followed, etc.
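As a rough illustration of the kind of timeline worth reconstructing, here is a minimal sketch; the field names, actors, and timestamps are hypothetical, not from the webinar.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEntry:
    timestamp: datetime
    actor: str   # monitor, first responder, SME, ...
    event: str   # detection, conversation, command run, escalation, ...

# Hypothetical entries showing detection, response, and escalation.
timeline = [
    TimelineEntry(datetime(2017, 2, 16, 3, 12), "monitor", "Anomaly first detected: checkout latency alert fired"),
    TimelineEntry(datetime(2017, 2, 16, 3, 15), "on-call engineer", "Acknowledged the page and began triage"),
    TimelineEntry(datetime(2017, 2, 16, 3, 30), "database SME", "Pulled in to assist; identified a saturated replica"),
]

for entry in sorted(timeline, key=lambda e: e.timestamp):
    print(f"{entry.timestamp:%H:%M} [{entry.actor}] {entry.event}")
```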

Don’t: Hijack the Discussion

Having an objective moderator run the exercise can help prevent one person (or a small group) from steamrolling the conversation and helps avoid groupthink.

Do: Describe What Happened

Gather a detailed account of what happened from team members. What services, components, etc. were affected? Include how customers were impacted, i.e., accountability.

Don’t: Explain What Happened

Explaining often leads to a less-than-objective understanding of what took place, as well as finger-pointing and blame.

Do: Ask “How” Things Happened

Understand in great detail “how” things happened including multiple contributing factors

Don’t: Ask “Why” Things Happened

Asking “why” often contains bias and leads to blame

“Why” ... brings us to the very mysterious incentives we have in the workplace. “How” brings us to the conditions that allowed the event to take place to begin with. - John Allspaw (CTO, Etsy)

Do: Understand Contributing Factors

Use Systems Thinking to see more holistically

“Cause is not something found in the rubble. Cause is created in the minds of the investigators.” - Sidney Dekker

Don’t: Focus on a ‘Root Cause’

Rather than focusing on the ‘Root Cause’ of a service disruption, understand all of the contributing factors.

Newtonian thinking … Why some still seek a root cause

We’ve created the idea that a single cause has an equal and opposite effect. In complex systems, it doesn’t.

● Humans adapt to the work they have
● Root Cause analysis ONLY works in SIMPLE systems
● Root Cause Analysis = Retrospective Cover of Ass

Do: Watch For Bias

We are easily susceptible to cognitive biases such as confirmation, hindsight, anchoring, outcome, and availability bias.

Don’t: Blame Humans

Humans are only a part of the problem and the response, never the sole contributing factor in an issue.

Do: Include What Went Well

Much can be learned from what worked during the response to a service disruption. Capture and discuss what efforts actually went well.

Don’t: Hide What Happened

Customers and end-users are savvy. Being transparent about what took place and what was done will help build trust

Do: Conduct Analysis Soon

Gather the team and conduct the post-incident analysis as soon as everyone is rested

Don’t: Wait longer than 48 hours

The more time passes, the less accurate accounts of what took place will be.

Do: Assign Action Items

Look for small, incremental improvements to take action on. Each improvement item should be assigned an owner and tracked for follow-up.

Don’t: Debate Without Action

Don’t allow for extended debate on action items. Place ideas into a “parking lot” for later, but come up with at least one action item to be implemented immediately.

Do: Hear from everyone

To fully understand the disruption and response, you want to hear from all parties involved. Everyone’s experience was different. The more voices you hear from, the more accurate the story and timeline become.

Do: Encourage Many Possible Improvements

We are looking for many possible areas for incremental improvements to our systems, processes, tools, incident response, and team members. Encourage people to build on top of existing ideas in addition to posing alternatives.

Don’t: Overpromise or Overcommit

We are looking for ideas, not binding commitments. This helps ensure you get suggestions from a wide group.

Do: Archive Your Postmortem

Save and store your postmortem where it is available to everyone internally for future review or as assistance during future similar incidents.

Do: Rinse & Repeat

Be disciplined in your post-incident analysis exercises and perform them for all incidents, regardless of severity. Practice makes perfect, and these exercises will become more efficient and useful over time.

Framework

Post-incident analysis framework

1. Summary: what happened?
2. How was the incident detected?
3. How did we respond?
4. How did it happen?
5. How can we improve?

Summary: what happened?

● Impact on customers
● Severity of the incident
● Components affected
● What ultimately resolved the incident?
● Externally shared information

How was the incident detected?

● Did we have a metric that showed the incident?
● Was there a monitor alerting on that metric?
● How long did it take to declare an incident?

How did we respond?

● Who was involved?
● ChatOps archive links
● Timeline of events
● What went well?
● What didn’t go so well?

How did it happen?

● Technical deep-dive
● Include context
● Identify contributing factors
● Ask “How,” not “Why”

How can we improve?

● Now (immediate actions)
● Next (in the current or following sprint)
● Later (after the next sprint)
● Follow-up notes
● Ensure all items are actionable and tracked
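The five questions above translate directly into a reusable template. Here is a minimal sketch that writes a blank postmortem file with those sections; the section wording, file name, and helper are assumptions for illustration and are not the template linked in the resources below.

```python
from datetime import date

# Section headings taken from the five framework questions above.
SECTIONS = [
    "Summary: what happened?",
    "How was the incident detected?",
    "How did we respond?",
    "How did it happen?",
    "How can we improve? (Now / Next / Later; every item owned and tracked)",
]

def new_postmortem(incident_name: str) -> str:
    """Return a blank postmortem document for the given incident."""
    lines = [f"Postmortem: {incident_name} ({date.today().isoformat()})", ""]
    for section in SECTIONS:
        lines += [section, "", "TODO", ""]
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical incident name and output path.
    with open("postmortem-checkout-outage.txt", "w") as f:
        f.write(new_postmortem("Checkout outage"))
```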


Resources

● Post-incident analysis framework/template
○ http://bit.ly/2dxDIT3

● Blameless postmortems & a just culture - John Allspaw
○ https://codeascraft.com/2012/05/22/blameless-postmortems/

● The infinite hows - John Allspaw
○ http://www.kitchensoap.com/2014/11/14/the-infinite-hows-or-the-dangers-of-the-five-whys/

● The human side of postmortems - Dave Zwieback
○ http://www.oreilly.com/webops-perf/free/the-human-side-of-postmortems.csp

● Writing your first postmortem - Mathias Lafeldt
○ https://medium.com/production-ready/writing-your-first-postmortem-8053c678b90f

Q&A

Do: Start a free trial
https://app.datadoghq.com/signup
https://victorops.com/start-free-trial
