datadog + victorops webinar

Do’s & Don’ts of post-incident analysis

Upload: datadog

Post on 16-Feb-2017

Category:

Technology


TRANSCRIPT

Page 1: Datadog + VictorOps Webinar

Do’s & Don’ts of post-incident analysis

Page 2: Datadog + VictorOps Webinar

Jason Hand
DevOps Evangelist
DevOps, Dogs, Horses, and Mountain Living
Twitter: @jasonhand

VictorOps
Incident management & notifications
Makes on-call suck less!
Twitter: @victorops

Page 3: Datadog + VictorOps Webinar

Jason Yee
Technical writer/evangelist
Travel hacker & Chef
Twitter: @gitbisect

Datadog
SaaS-based full stack monitoring
Over a trillion data points per day
Twitter: @datadoghq

Page 4: Datadog + VictorOps Webinar

Agenda

Service Disruptions

Detection

Diagnosis

Post-incident analysis

Framework

Page 5: Datadog + VictorOps Webinar

Follow & Share on Twitter

#VOWebinar

@gitbisect
@jasonhand

@datadogHQ
@VictorOps

Page 6: Datadog + VictorOps Webinar

Service Disruptions

Are a reality in ALL complex systems

There is no such thing as being soooo good that you’ll never fail

Page 7: Datadog + VictorOps Webinar

Complex Systems

● Diversity
● Interdependent
● Adaptive
● Connectedness (i.e. we can be connected but not dependent on each other)

Page 8: Datadog + VictorOps Webinar

Cynefin Framework

● Obvious - cause & effect is obvious to all

● Complicated - cause & effect requires analysis or expert knowledge

● Complex - cause & effect can only be perceived in retrospect

● Chaotic - no relationship between cause & effect

Cynefin diagram by Dave Snowden CC BY-SA 3.0

Page 9: Datadog + VictorOps Webinar

Contributing Factors

Systems Thinking: an understanding of a system by examining the linkages and interactions between the components that comprise the entirety of that defined system

Page 10: Datadog + VictorOps Webinar

MTTR vs MTBF

Mean Time To Repair

vs

Mean Time Between Failures
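
A minimal sketch of how the two numbers can be computed from incident start and resolution times. The incident records are made up for illustration, and MTBF is taken here as the average uptime between one incident's resolution and the next incident's start, which is one common convention and an assumption rather than something stated on the slides.

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (start, resolved) pairs, ordered by start time.
incidents = [
    (datetime(2017, 1, 1, 10, 0), datetime(2017, 1, 1, 10, 30)),
    (datetime(2017, 1, 11, 10, 30), datetime(2017, 1, 11, 12, 0)),
    (datetime(2017, 1, 21, 12, 0), datetime(2017, 1, 21, 13, 0)),
]


def mean_time_to_repair(incidents):
    """Average time from the start of an incident to its resolution."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def mean_time_between_failures(incidents):
    """Average uptime between one incident's resolution and the next start."""
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(incidents, incidents[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)
```

With the sample data, MTTR comes out at one hour and MTBF at ten days. Shortening the first number is usually more within a team's control than lengthening the second.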

Page 11: Datadog + VictorOps Webinar

Detection

Page 12: Datadog + VictorOps Webinar

Collecting data is cheap
Not having it when you need it can be expensive

Page 13: Datadog + VictorOps Webinar
Page 14: Datadog + VictorOps Webinar
Page 15: Datadog + VictorOps Webinar
Page 16: Datadog + VictorOps Webinar
Page 17: Datadog + VictorOps Webinar

4 qualities of good metrics
Not all metrics are created equal

Page 18: Datadog + VictorOps Webinar

1. Well understood

Page 19: Datadog + VictorOps Webinar

2. Granular

Page 20: Datadog + VictorOps Webinar

3. Tagged & filterable

Page 21: Datadog + VictorOps Webinar

4. Long-lived
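
Of the four qualities, "tagged & filterable" is the easiest to show in code. The sketch below uses plain Python tuples, not a real Datadog payload or API; the metric name, tag keys, and values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical data points: (metric name, value, tags).
points = [
    ("http.latency_ms", 120, {"service": "web", "region": "us-east"}),
    ("http.latency_ms", 480, {"service": "web", "region": "eu-west"}),
    ("http.latency_ms", 95, {"service": "api", "region": "us-east"}),
    ("http.latency_ms", 510, {"service": "api", "region": "eu-west"}),
]


def filter_points(points, **wanted):
    """Keep only points whose tags match every requested key/value."""
    return [p for p in points if all(p[2].get(k) == v for k, v in wanted.items())]


def average_by(points, tag):
    """Group points by one tag and average the values in each group."""
    groups = defaultdict(list)
    for _, value, tags in points:
        groups[tags[tag]].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}
```

Averaging the sample points by `region` shows that the latency problem follows the region rather than the service, which an untagged average would hide.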

Page 22: Datadog + VictorOps Webinar
Page 23: Datadog + VictorOps Webinar

Diagnosis

Page 24: Datadog + VictorOps Webinar

Real-time Notification

Page 25: Datadog + VictorOps Webinar

Getting “the right” Humans Involved

Paging has evolved to: Smart & Actionable alerts ...

Routed to the right teams and people …

With valuable context
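
A rough sketch of what "routed to the right teams, with valuable context" could look like. This is not the VictorOps API; the `Alert` class, the routing table, and the team names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    message: str
    context: dict = field(default_factory=dict)  # graphs, logs, runbook links


# Hypothetical routing table: which team owns which service.
ROUTES = {"checkout": "payments-oncall", "search": "search-oncall"}
DEFAULT_TEAM = "ops-oncall"


def route(alert):
    """Return (team, notification text) for an alert, with its context attached."""
    team = ROUTES.get(alert.service, DEFAULT_TEAM)
    lines = [f"[{alert.service}] {alert.message}"]
    lines += [f"{name}: {value}" for name, value in sorted(alert.context.items())]
    return team, "\n".join(lines)
```

The point of the design is that the responder receives the runbook or graph link in the page itself instead of hunting for it after being woken up.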

Page 26: Datadog + VictorOps Webinar

Graphs, Logs, Runbooks

Page 27: Datadog + VictorOps Webinar

Automation

Page 28: Datadog + VictorOps Webinar

ChatOps

jhand.co/chatopsbook

Page 29: Datadog + VictorOps Webinar

The Full Incident Lifecycle

Page 30: Datadog + VictorOps Webinar

What we are really here to learn about...

Page 31: Datadog + VictorOps Webinar

Post-incident Analysis (a.k.a. learning review, postmortem)

Page 32: Datadog + VictorOps Webinar

Do: Establish that we are here to learn

The primary objective of these exercises is to learn

Page 33: Datadog + VictorOps Webinar

Do: Establish timeline of events

Identify when anomaly was first detected, first responders, SMEs pulled in to assist, conversations, commands, etc.
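
One possible way to assemble such a timeline is to merge events from monitoring, paging, and chat archives and sort them by time. The events and source labels below are invented for illustration.

```python
from datetime import datetime

# Hypothetical raw events gathered from monitoring, paging, and chat archives.
events = [
    (datetime(2017, 2, 1, 14, 12), "chat", "SME from database team joins the channel"),
    (datetime(2017, 2, 1, 14, 2), "monitor", "Anomaly detected: latency alert triggers"),
    (datetime(2017, 2, 1, 14, 5), "paging", "First responder acknowledges the page"),
    (datetime(2017, 2, 1, 14, 30), "chat", "Command run: restart of worker pool"),
]


def build_timeline(events):
    """Sort events chronologically and label each with minutes since the first one."""
    ordered = sorted(events, key=lambda e: e[0])
    start = ordered[0][0]
    return [
        f"T+{int((ts - start).total_seconds() // 60):>3}m  [{source}] {text}"
        for ts, source, text in ordered
    ]
```

Printing `"\n".join(build_timeline(events))` gives a single ordered account that the group can correct together during the review.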

Page 34: Datadog + VictorOps Webinar

Don’t: Hijack the Discussion

Having an objective moderator run the exercise can help prevent one person (or a small group) from steamrolling the conversation and helps avoid “groupthink”

Page 35: Datadog + VictorOps Webinar

Do: Describe What Happened

Gather a detailed account of what happened from team members. What services, components, etc. were affected? Include how customers were impacted.

i.e. Accountability

Page 36: Datadog + VictorOps Webinar

Don’t: Explain What Happened

Explaining often leads to a less-than-objective understanding of what took place, as well as finger-pointing and blame

Page 37: Datadog + VictorOps Webinar

Do: Ask “How” Things Happened

Understand in great detail “how” things happened including multiple contributing factors

Page 38: Datadog + VictorOps Webinar

Don’t: Ask “Why” Things Happened

Asking “why” often contains bias and leads to blame

“Why” .. brings us to the very mysterious incentives we have in the workplace.

“How” brings us to the conditions that allowed the event to take place to begin with. - John Allspaw (CTO, Etsy)

Page 39: Datadog + VictorOps Webinar

Do: Understand Contributing Factors

Use Systems Thinking to see more holistically

Page 40: Datadog + VictorOps Webinar

“Cause is not something found in the rubble. Cause is created in the minds of the investigators” - Sidney Dekker

Page 41: Datadog + VictorOps Webinar

Don’t: Focus on a ‘Root Cause’

Rather than focusing on the ‘Root Cause’ of service disruption, understand all of the contributing factors.

Page 42: Datadog + VictorOps Webinar

Newtonian thinking … Why some still seek a root cause

We’ve created the idea that a single cause has an equal and opposite effect

In complex systems .. it doesn’t

● Humans adapt to the work they have
● Root Cause Analysis ONLY works in SIMPLE systems
● Root Cause Analysis = Retrospective Cover of Ass

Page 43: Datadog + VictorOps Webinar

Do: Watch For Bias

We are easily susceptible to cognitive biases such as confirmation, hindsight, anchoring, outcome, and availability bias

Page 44: Datadog + VictorOps Webinar

Don’t: Blame Humans

Humans are only a part of the problem and response, never a contributing factor in issues

Page 45: Datadog + VictorOps Webinar

Do: Include What Went Well

Much can be learned from what worked during the response to a service disruption. Capture and discuss what efforts actually went well.

Page 46: Datadog + VictorOps Webinar

Don’t: Hide What Happened

Customers and end-users are savvy. Being transparent about what took place and what was done will help build trust

Page 47: Datadog + VictorOps Webinar

Do: Conduct Analysis Soon

Gather the team and conduct the post-incident analysis as soon as everyone is rested

Page 48: Datadog + VictorOps Webinar

Don’t: Wait longer than 48 hours

The more time passes, the less accurate the accounts of what took place will be

Page 49: Datadog + VictorOps Webinar

Do: Assign Action Items

Look for small incremental improvements to take action on. Each improvement item should be assigned an owner and tracked for follow-up

Page 50: Datadog + VictorOps Webinar

Don’t: Debate Without Action

Don’t allow for extended debate on action items. Place ideas into a “parking lot” for later action, but come up with at least one action item to be implemented immediately

Page 51: Datadog + VictorOps Webinar

Do: Hear from everyone

To fully understand the disruption and response, you want to hear from all parties involved. Everyone’s experience was different. The more voices you hear from, the more accurate the story and timeline become.

Page 52: Datadog + VictorOps Webinar

Do: Encourage Many Possible Improvements

We are looking for many possible areas for incremental improvements to our systems, processes, tools, incident response, and team members. Encourage people to build on top of existing ideas in addition to posing alternatives.

Page 53: Datadog + VictorOps Webinar

Don’t: Overpromise or Overcommit

We are looking for ideas, not binding commitments. This helps to make sure you get suggestions from a wide group

Page 54: Datadog + VictorOps Webinar

Do: Archive Your Postmortem

Save and store your postmortem where it is available to everyone internally for future review or as assistance during similar future incidents

Page 55: Datadog + VictorOps Webinar

Do: Rinse & Repeat

Be disciplined in your post-incident analysis exercises and perform them for all incidents regardless of severity. Practice makes perfect, and these will become more efficient and useful over time

Page 56: Datadog + VictorOps Webinar

Framework

Page 57: Datadog + VictorOps Webinar

Post-incident analysis framework

1. Summary: what happened?
2. How was the incident detected?
3. How did we respond?
4. How did it happen?
5. How can we improve?
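
The five questions can double as a document skeleton. The sketch below is one way to stamp out a plain-text postmortem from them; the function name and the `TODO` placeholder are assumptions, not part of the linked template.

```python
SECTIONS = [
    "Summary: what happened?",
    "How was the incident detected?",
    "How did we respond?",
    "How did it happen?",
    "How can we improve?",
]


def render_template(title, notes=None):
    """Render a plain-text postmortem skeleton; missing sections get a TODO."""
    notes = notes or {}
    lines = [title, ""]
    for number, heading in enumerate(SECTIONS, start=1):
        lines.append(f"{number}. {heading}")
        lines.append(f"   {notes.get(heading, 'TODO')}")
        lines.append("")
    return "\n".join(lines)
```

Calling `render_template("Checkout outage")` right after an incident gives everyone the same five headings to fill in, so no section is forgotten.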

Page 58: Datadog + VictorOps Webinar

Summary: what happened?

● Impact on customers
● Severity of the incident
● Components affected
● What ultimately resolved the incident?
● Externally shared information

Page 59: Datadog + VictorOps Webinar

How was the incident detected?

● Did we have a metric that showed the incident?
● Was there a monitor/alerting on that metric?
● How long did it take to declare an incident?
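
The detection questions can be answered with a few timestamps. A small sketch, assuming the team records when impact started, when an alert fired (if one did), and when the incident was declared; the field names are hypothetical.

```python
from datetime import datetime

# Hypothetical timestamps for one incident.
incident = {
    "impact_started": datetime(2017, 2, 1, 14, 0),
    "alert_fired": datetime(2017, 2, 1, 14, 2),
    "incident_declared": datetime(2017, 2, 1, 14, 9),
}


def _minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


def detection_report(incident):
    """Answer the detection questions with elapsed minutes; None if no alert fired."""
    started = incident["impact_started"]
    alert = incident.get("alert_fired")
    declared = incident["incident_declared"]
    return {
        "alerted": alert is not None,
        "minutes_to_alert": _minutes(alert - started) if alert else None,
        "minutes_to_declare": _minutes(declared - started),
    }
```

An incident with no `alert_fired` entry reports `alerted: False`, which is itself a finding: the incident was noticed some other way and a monitor is probably missing.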

Page 60: Datadog + VictorOps Webinar

How did we respond?

● Who was involved?
● ChatOps archive links
● Timeline of events
● What went well?
● What didn’t go so well?

Page 61: Datadog + VictorOps Webinar

How did it happen?

● Technical deep-dive
● Include context
● Identify contributing factors
● Ask “How,” not “Why”

Page 62: Datadog + VictorOps Webinar

How can we improve?

● Now (immediate actions)
● Next (in current or following sprint)
● Later (after the next sprint)
● Follow up notes
● Ensure all items are actionable and tracked
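
A sketch of how the Now / Next / Later split and the "owned and tracked" rule might be enforced in a small script. The `ActionItem` class and the horizon labels are assumptions made for this example.

```python
from dataclasses import dataclass

HORIZONS = ("now", "next", "later")


@dataclass
class ActionItem:
    description: str
    owner: str
    horizon: str  # one of HORIZONS
    done: bool = False


def group_open_items(items):
    """Group unfinished items by horizon, rejecting items with no owner."""
    grouped = {h: [] for h in HORIZONS}
    for item in items:
        if not item.owner:
            raise ValueError(f"action item has no owner: {item.description}")
        if item.horizon not in grouped:
            raise ValueError(f"unknown horizon: {item.horizon}")
        if not item.done:
            grouped[item.horizon].append(item.description)
    return grouped
```

Refusing items without an owner is a deliberate choice here: an unowned item is the kind most likely to be discussed and never done.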

Page 63: Datadog + VictorOps Webinar

Summary:

Page 64: Datadog + VictorOps Webinar

Resources

● Post-incident analysis framework/template
○ http://bit.ly/2dxDIT3

● Blameless postmortems & a just culture - John Allspaw
○ https://codeascraft.com/2012/05/22/blameless-postmortems/

● The infinite hows - John Allspaw
○ http://www.kitchensoap.com/2014/11/14/the-infinite-hows-or-the-dangers-of-the-five-whys/

● The human side of postmortems - Dave Zwieback
○ http://www.oreilly.com/webops-perf/free/the-human-side-of-postmortems.csp

● Writing your first postmortem - Mathias Lafeldt
○ https://medium.com/production-ready/writing-your-first-postmortem-8053c678b90f

Page 65: Datadog + VictorOps Webinar

Q&A

Do: Start a free trial
https://app.datadoghq.com/signup
https://victorops.com/start-free-trial