2019/06/03: Monitorama: Day 1b

by Gene Kim on

#Monitorama

2019/06/03: Monitorama: Day 1

  • @RealGeneKim: #monitorama The schedule! https://t.co/pqzllFKyIV
  • @honeycombio: And you can watch us online live at Monitorama at https://t.co/aL0F7mp5PD!
  • Hilariously, @obfuscurity has picture with @allspaw: shows up in Google search, but original no longer accessible: was on Librato blog, now long redirected away. Anyone have the original?

Up: John Allspaw (@allspaw): (was first speaker at the first Monitorama!)

  • @neurot1cal: @allspaw kicking off @Monitorama PDX 19 this year with talking about human performance https://t.co/JQ206ycOt1
  • @RealGeneKim: @allspaw at #monitorama! https://t.co/UAQUJUSiDR
  • @allspaw: asking the question, "how many of you run systems where it would still be running w/no changes/touching keyboard in 1 day, 3 days, 1 week..." (shows how critical the work done in this room is); and how dynamic the world it runs in is. Cc @markimbriaco
  • @allspaw: "beliefs about safety (1940s-1970s): safety could be encoded in tech; accidents could be avoided w/automation; procedures could be specified & be comprehensive; operators just need to follow them; 'humans vs machines are better at' (HABA/MABA)
  • @allspaw: "1979: Three Mile Island was a landmark event; new beliefs: automation introduces new risk; rules/procedures are always underspecified, so can't guarantee safety; methods/models for risk that rely on 'human error' categories are fraught"
  • @lizthegrey: --- But then Three Mile Island happened. We realized that automation introduce new forms of challenge and risk.

Procedures are always underspecified, and people interpret them using judgment. #monitorama
- @lizthegrey: --- Should operators follow objective, comprehensive procedures and playbooks?

Are there things that humans or machines are better at // should be kept separate from? #monitorama
- @crayzeigh: --- Beliefs about Safety 40s-70s:
-Safety can be encoded in design
-Accidents can be avoided with more automation
-Procedures can be specified, obj, comprehensive
-Operators just have to follow procedure to get work done
-“Humans are better” v. “Machines are”

Monitorama @allspaw

  • @lizthegrey: --- (c.f. @breckcs above)

It used to be in the 1940s-1970s that safety sat within the designs of tech, and that more automation would save us. #monitorama
- @crayzeigh: --- First a thought exercise: “Pretend at your org, all eng are going to do nothing, no hands on keyboards, no alert response, nothing. How long will it stay working if no one touches anything?” That means YOU are the reason things are working. @allspaw #monitorama
- @lizthegrey: --- If a human takes their hands off the controls... do you believe that your systems will be running the same way a day later? a week later?

If not, then it means you're doing something to affect the system. What are you doing? #monitorama
- @lizthegrey: --- @allspaw Let's support people in the work we're doing, given how high the stakes are.

And it's important also as scientists and engineers to be able to revise our assumptions and change our beliefs. #monitorama
- @allspaw: "Three Mile Island: forced us to take human performance seriously; can't just put operators in those situations
- @lizthegrey: --- What we thought we knew about human contribution to successful work in complex systems was wrong.

You can't just put operators into these scenarios and expect that the engineering will take care of the rest safely. #monitorama
- @crayzeigh: --- Persists until 3mi Island. March ‘79, now:
-Automation both necessary and adds new challenge/risk
-Rules are always underspecified
-Ops required to take action that cannot be prespecified
- Methods & models for “risk” that rely on “human error” are fraught

Monitorama @allspaw

  • @allspaw: "by 'human performance', we mean 'cognitive performance'; people are avoiding problems/disasters all the time
  • @allspaw: "we study incidents b/c: time pressure, high consequences, uncertainty, ambiguity"
  • @julian_dunn: --- Good morning from @Monitorama where @allspaw is talking about one of my favorite topics, safety in nuclear operations. (I can’t help it. My dad was a nuclear engineer.) https://t.co/Z0sHnJal2m
  • @lizthegrey: --- Peoples' cognitive performance helps us avoid accidents. We think about it as "normal work" until it breaks down.

Peoples' instincts are hidden from us and they know more than they can tell you though. #monitorama
- @allspaw: "incidents are the non-routine, challenging events, b/c these tough cases have the greatest potential for uncovering elements of expertise and related cognitive phenomenon (Klein, Crandall, Hoffman, 2006)
- Holy cow, I'm enjoying @lizthegrey's high-fidelity note-taking!
- @lizthegrey: --- Incidents, because they're high-pressure and under time pressure, peel away everything that isn't relevant.

(@allspaw encourages all the vendors in the room to hire researchers!) [ed: and ours is @FisherDanyel who is one of the best!] #monitorama

  • @duckalini: --- Loving this introduction to how we keep systems working through our everyday actions, as opposed to during our incident response. Even when we’re doing nothing, if our systems are up, we’re doing something #Monitorama
  • @lizthegrey: --- People don't just keep the state of the tools, they keep state on what they've already tried, what they've been told/shared, hypotheses/observations, context on holidays/events, resources both human and technical...

This is outside what we think of as "monitoring" #monitorama
- @duckalini: --- Our entire team brings a different context and a different hypothesis but they will all influence one another. One teammate’s thoughts on what can be done to stem fallout or fix an incident, who has authority, what can’t be done... these all impact one another #Monitorama
- @lizthegrey: --- People bring different experiences to the table and that affects how they interpret what they see and how they interact with other people.

Peoples' ability to effectively work together depends upon their ability to synchronize and compare mental models/hypotheses #monitorama
- @crayzeigh: --- “People don’t show up blank” we contribute our experience to an incident and only part of it is what we traditionally call “monitoring”.

Monitorama @allspaw

[Really an interesting thought about how each person contributes something different to an incident response] https://t.co/qItEwruRLm
- @duckalini: --- The human context matters! This influences what we do and how we act, each individuals knowledge, inferences and ideas has an impact on their passive and active choices. Not all of these are traditionally understood as monitoring #monitorama https://t.co/13tn96sWU1
- @LibbyMClark: --- Given how complex our systems are, they should break all the time. But they don’t. Why? Operators are doing the work. #monitorama @allspaw https://t.co/Zo4BwQRtl5
- @lizthegrey: --- Cues wind up being interpreted by people; they're elements of a technosocial system rather than being abstract, objective events. #monitorama
- @lizthegrey: --- Incidents are multidimensional and messy; @allspaw is showing a diagram of planning, diagnosis, and anomaly recognition and the interactions between them. [ed: trying to understand how this relates to my diagram on the debugging cycle...]

John says it's different #monitorama

  • @allspaw: "many/most incidents don't follow 'mystery' pattern seen in movies: many deal w/getting necessary permission, fully understanding impact..."
- @duckalini: --- People will do what they think is productive - they’ll ssh in to a machine if they think it’ll help them find the answer. They won’t stand on principle and say “I shouldn’t ssh in because it’s 2019 darn it!” #monitorama
  • @asachs01: --- @allspaw "People will use tools that they trust, that they feel will be helpful. They won't use tools that they don't feel are helpful." #monitorama
  • @lizthegrey: --- This isn't just what we might dismiss as "debugging"; people are using judgment to decide what tools and hypotheses to pursue, rather than automatically following principles.

We can have all the rules in the world, but people will disregard them, do what's productive #monitorama
- @lizthegrey: --- This diagram that @allspaw just showed reflects "Cognitive Process Tracing" of incidents.

Go and read "Approaching Overload" by Grayson -- it's a master's thesis. #monitorama
- @allspaw: "interesting questions for teams: what tricks do people use to understand how otherwise opaque 3rd party services are behaving? Are there logs/data that people are suspicious of and therefore dismiss?"
- @duckalini: --- Start making use of in-depth incident analysis - NOT template-following post-mortems. Ask deeper questions and utilise your internal team members who are good at this skill. Select some incidents for closer and deeper analysis #Monitorama https://t.co/k0DX8mAHup
- @lizthegrey: --- Are there tools/data sources people find to be useFUL or useLESS?

Don't just pick the most severe ones; high-severity incidents are not good for incident analysis because they get management/exec scrutiny and people are less honest. #monitorama
- @allspaw: "Disagreements are one of the most fruitful sources of insights when looking at incidents; avoid 'filing the rough edges' off of incident reports
- @allspaw: "find lines of activities and hypotheses that didn't pan out:"
- @allspaw: "Make company-wide post-mortems sessions regular events; incidents are unplanned investments (made without your consent); make the best use of them
- @lizthegrey: "Incidents are unplanned investments; it's up to you to get a return on them." --@allspaw #monitorama
- @allspaw imploring vendors to hire researchers w/advanced degrees in this field. "Merely dogfooding your product isn't enough; you're leaving money and competitive advantage on the table"
- @allspaw: "Dr. Lisanne Bainbridge, 1983: 'Ironies of Automation'"
- @wiredferret: #monitorama @nora_js: Chaos Engineering Traps: Using Chaos Engineering as a means to distill expertise.

Up: Nora Jones: Chaos Engineering Traps: Using Chaos Engineering as a means to distill expertise. @nora_js

  • Once again, ratio of paper citations to speaker is very high here. :) cc @nora_js

- @wiredferret: --- #monitorama @nora_js: Observability by Woods and Hollnagel; Observability is feedback that provides insight into a process and refers to the work needed to extract meaning from available data.

  • @lizthegrey: --- Let's talk chaos engineering gone wrong: not Chernobyl [which also is fruitful], but instead Apollo 1.

The launch rehearsal test killed all 3 crew members when a fire broke out. The "safety" door latch mechanism wouldn't let them out. #Monitorama
- @nora_js: "I currently am at Slack; was previously at Netflix; wrote book: 'Chaos Engineering'"
- @nora_js: "I'm actively working on connecting two worlds of human factors and chaos engineering; we can use chaos engineering as a way to build adaptive decision-making capabilities in engineers; we can build people up"
- @nora_js: "8 traps of chaos engineering: Trap 1: you can measure the success w/Chaos Engineering by counting # of vulnerabilities you find; Robert L. Wears: 'you don't know much about safety simply by counting errors'; you're mostly measuring ignorance, not risk"
- @nora_js: "1954: 'How to Lie with Statistics' by Darrell Huff"
- @RealGeneKim: #monitorama: @nora_js: this is not learning: haha. https://t.co/gJFyOeTRXk
- @SarahVoegeli: "All graphs, all timelines, are actually opinions" - @nora_js on chaos engineering at #monitorama
- @nora_js: "Goal of Chaos Engineering: not to simply find vulns thru tooling; the goal is to push forward on a journey of resilience THROUGH the vulnerabilities we find"; capture how the team (not just experts) responds to found vulns
- @nora_js: "Dr. Sidney Dekker: 'look at what went right. The pursuit of success. What did we learn we are really good at?'"
- @nora_js: "Trap 2: not all people that work on the service were brought early enough into the experiment; e.g., 2/11 engineers were invited to the prep; game day starts, but only 2 were fully aware, fully studied, etc..."
- @crayzeigh: --- Trap 2: not inviting everyone into the game day process.
Excluding people is excluding perspectives from the incident and reducing the surface area of your learning.

@nora_js #monitorama
- @nora_js: "Trap 4: Chaos Engineers should be rule enforcers; the goal is to build relationships; assigning action items to systems you don't own/don't have expertise in can be infantilizing"
- @nora_js: "Trap 5: it's not 'real' Chaos Engineering unless you move beyond gamedays and experimenting in sandbox environments; I've seen a lot of good come out of these activities, and it's critical preparation"
- @nora_js: "At Netflix, we had the ChAP (Chaos Automation Platform) to help teams automate the performing of their experiments; data needed: timeouts, retries, fallbacks, % of traffic serviced"
- @lizthegrey: Even just gathering data and aggregating it/analyzing patterns is useful, without running a single experiment. #Monitorama
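
The resilience data Nora describes gathering (timeouts, retries, fallbacks, traffic share) can be audited before running a single experiment. A minimal sketch, with entirely hypothetical service names and settings:

```python
# Hypothetical sketch: aggregate per-service resilience settings and flag gaps,
# before any chaos experiment runs. All names and values are made up.

SERVICES = {
    "recommendations": {"timeout_ms": 250, "retries": 2, "fallback": True},
    "billing":         {"timeout_ms": None, "retries": 0, "fallback": False},
    "search":          {"timeout_ms": 800, "retries": 1, "fallback": None},
}

def audit(services):
    """Return {service: [missing settings]} for services with resilience gaps."""
    gaps = {}
    for name, cfg in services.items():
        missing = [key for key, value in cfg.items() if value in (None, False, 0)]
        if missing:
            gaps[name] = missing
    return gaps

if __name__ == "__main__":
    for svc, missing in sorted(audit(SERVICES).items()):
        print(f"{svc}: no {', '.join(missing)}")
```

Which services lack a timeout or fallback is itself a finding: the aggregation alone surfaces risk, no experiment needed.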

  • @duckalini: Trap five - chaos engineering has to happen in production.

It’s not necessary.

monitorama https://t.co/QLm3Rv3k2e

  • @nora_js: "At Netflix, we discovered one of the most powerful things we could do is assembling info on what services are safe to fail, as well as retries, timeouts."
  • @nora_js: "George Box: 'all mental models are wrong, but some are useful"
  • @wiredferret: --- #monitorama @nora_js: Prep session questions that may be valuable - Which behaviors and interactions are you concerned about? How do you decide what to experiment on? How long does it take to onboard a new engineer?
  • @nora_js: "Fascinating question: ' how long before an engineer knows enough to be put on-call?'
  • @devblackops: --- Going through the PROCESS of building chaos experiments is often MORE valuable and can be a better source of learning than the experiments themselves. @nora_js

monitorama

  • @nora_js: "Trap 7: you can worry about safety later. The important part is creating the chaos." OMG. Hahaha
  • @LibbyMClark: --- Disagreements or differences in how people view the system is good data in a chaos experiment because it teaches you how you work. #monitorama @nora_js
  • @lizthegrey: are our defenses, even if they're created, working correctly? how will we know? will they function for this test? #Monitorama
  • @RealGeneKim: #monitorama: @nora_js: Gus Grissom on astronaut mindset, months before the Apollo 1 accident https://t.co/1TzKzMQNvo
  • @nora_js: "Seeing the Invisible: the new employee can see what goes wrong, but hasn't seen what went right" (the reason why Chaos Engineering and oncall duty shouldn't be an intern or new-employee endeavor)
  • @lizthegrey: --- Don't have your intern or brand new hire do all of your chaos engineering. Minimize the blast radius and put safety first. People don't necessarily respond to emails!

What are your lower stakes learning opportunities?
(8.5) be strict about starting from steady state #Monitorama

Up: David Calavera: "Observability and performance analysis with BPF" @calavera

  • "ask arbitrary new questions and explore where the cookie crumbs take you" --Charity Majors @mipsytipsy
  • @lizthegrey: Citing @mipsytipsy's o11y definition of being able to ask new questions and follow the cookie crumbs! #Monitorama

- @calavera: http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html

  • @lizthegrey: --- BPF is trying to solve this problem for Linux systems by hooking into the kernel. [ed: I... hesitate to recommend BPF as a solution for distributed systems rather than single machines, but maybe one of the many BPF talks I attend will change my mind one day] #Monitorama

- Reading about BPF right now, as described by @calavera: http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html; wow, injecting strings into the kernel that get turned into executable code...

  • @crayzeigh: --- Even just using this on a local machine running Bash there’s a lot of learning in what’s happening in the Kernel. Just the ‘ls’ command is dumping tons of kernel instructions:

@calavera #monitorama https://t.co/vE9uyCgy0q
- @lizthegrey: --- And here we go, flame graphs! [ed: I love flame graphs! but they tend to be single-machine or process; have to think beyond that and look at full interprocess traces too!] #monitorama
- @RealGeneKim: The sometimes shocking things you find that are run when you type “ls” or “echo”. @calavera Haha. #monitorama https://t.co/A8xnaUpEWR
- @nora_js: --- Thanks for having me @Monitorama !! Here’s a version of the talk I presented this morning (in article form): https://t.co/GBZsSlHz4K #monitorama
- @calavera downloading docker images on conference WiFi. Haha. "This is taking longer than it took 10 minutes ago."
- @crayzeigh: --- Using the https://t.co/qaVE4diYSc tool written for BPF we can generate happy flame graphs for our running processes in docker. Then we can dig down and see what’s happening throughout the docker space and the Linux kernel. @calavera #monitorama https://t.co/pnWUO0B6Os
- @acedrew: --- "By instrumenting in kernel space using #BPF, we can see every syscall, even those coming from inside a container" @calavera #monitorama #docker #containers #perf https://t.co/ZTa5SUZpqn
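
The flame graphs @calavera is generating are built from "folded" stack samples, the text format consumed by Brendan Gregg's flamegraph.pl: one line per unique stack, frames joined by semicolons, followed by a sample count. A toy sketch of that aggregation step (the stack frames here are made up):

```python
from collections import Counter

def fold(samples):
    """Collapse raw stack samples into flamegraph.pl's folded format:
    one line per unique stack, frames joined by ';', then a sample count."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Toy samples, as if captured by a BPF profiler (frame names hypothetical):
samples = [
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "write_response"],
]
for line in fold(samples):
    print(line)
```

The real collection happens in kernel space via BPF; this only illustrates why identical stacks collapse into one widening box in the final graph.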

Sponsored pitches

  • resilience driven development

Up: Nida Farrukh: The Power and Creativity of Postmortems and Repair Items

  • @lizthegrey: --- Next up: speaking on postmortem repair items, my dear friend Nida Farrukh, former Google SRE, now at Microsoft, who came all the way from Zurich! #Monitorama
  • Story time: "I love outages! How Nida broke [REDACTED] and netted a promotion." (So great!!)
  • @lizthegrey: Do outages "suck" or "rule?"

It depends upon your point of view, and Nida is more in camp 2 than camp 1! #Monitorama
- Farrukh: "Left for the weekend on Friday... so I didn't learn about the terrible time the team had all weekend until Monday; traffic swamped the canary data center, overwhelming it. Burned thru entire SLO failure budget in that first weekend. Spent five months fixing it; first thing I cited was the post-mortem, which demonstrated the cost of getting it wrong"
- @lizthegrey: --- It tickled a weird bug, sent all the traffic to the canary datacenter...

They burned 1/3 of their quarterly error budget in the first week of the quarter.

But then she found every place that it broke, spent 5 months fixing it... and cited it in her promo packet #Monitorama
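
The budget math behind "1/3 of the quarterly error budget in the first week" is worth making concrete. A quick worked example, assuming a 99.9% availability SLO (the actual SLO wasn't stated in the talk):

```python
# Error-budget arithmetic behind "burned 1/3 of the quarterly budget in week one".
# The 99.9% SLO is an assumption for illustration only.

SLO = 0.999
QUARTER_MINUTES = 91 * 24 * 60          # ~91-day quarter

budget_minutes = QUARTER_MINUTES * (1 - SLO)   # total allowed downtime
print(f"quarterly budget: {budget_minutes:.0f} minutes")

downtime_week1 = budget_minutes / 3            # the outage's share
burn = downtime_week1 / budget_minutes
print(f"week-one burn: {burn:.0%} of budget in {1/13:.0%} of the quarter")
```

At 99.9%, a quarter's budget is only about two hours of downtime, so a single bad weekend can plausibly consume a third of it.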
- Farrukh: "That post-mortem frames the business case of why the problem needed fixing"
- @duckalini: --- “Oops!” when a single outage takes up 1/3 of your quarters failure budget in the first week of the quarter 😵

A good post mortem which details the issue, the learnings, the changes... this information and the data in a PM can give you a strong argument to make change #monitorama
- @lizthegrey: --- "Postmortems are a tidy little argument for change. There's no more powerful argument for change than 'shit was just on fire'."

Think of them as an opportunity. #Monitorama
- @acedrew: --- "I thought it was fine, my team thought it was fine, the other teams we talked to thought it was fine. It was not fine" Nida Farrukh on using the learnings from a catastrophic outage to justify a promotion and improve her team's understanding of their system #monitorama https://t.co/9kPo9IdfrC
- @lizthegrey: --- There are a lot of prescriptive postmortem template bits to start with: impact assessment, timeline, contributing factor analysis, and repair/action items. #Monitorama
- Farrukh: "Outages direct attention to where it is needed; post-mortems pinpoint where attention is needed"
- @lizthegrey: Club #afternoontea represent. Every. single. xoogler SRE woman is an incredible force of nature. <3 @ckymnstr @whereistanya, Nida, and so many other of my amazing colleagues I'm proud to call friends.
- @devblackops: Outages capture how your systems ACTUALLY work. Like...in real life.

monitorama https://t.co/iGc2D4E3Wg

  • Farrukh: "Post-mortems create data-driven way of effecting change

  • @duckalini: --- Post mortem is a data driven argument for specific change.

Your impact assessment, timeline, and analysis of contributing factors give you that data, and help you identify those specific changes that you need to implement.

monitorama https://t.co/7XUbd0QQPg

  • Farrukh: "Usually the person who writes the PIR is the person who was on call; they were there; but people often need help coming up with good repair items
  • @lizthegrey: --- How do we synthesize the past collected data into the repair items for what we want to do in the future?

Usually the person oncall writes the postmortem, because they were there. They're great at data collection.

But at seeing what to do? Not so much. Fog of war. #Monitorama
- Farrukh: "Good targets for repair items (where PIR reports tend to be weakest, least creative): Detection, Mitigation, Fixing; SRE tends to focus on Detect/Mitigate; Devs tend to focus on Fix"
- @acedrew: --- One benefit of outages is that you don't have to pontificate about how much a particular problem might cost, you KNOW ~ Nida Farrukh #monitorama https://t.co/qCEgnlNpXu
- @lizthegrey: --- Think about the categories of {detect, mitigate, fix}.

Depending upon your background, you may be biased towards detect/mitigate or towards fix. Think about the other half too. #Monitorama
- Farrukh: "Focus on TTX: detect, engage, mitigate; here, you want as much detail as possible, because it will help you focus and identify opportunities on what needs improving"
- Farrukh: "Did we find out about issue from automated alert? Or by customer? Did we get the right alert? Too many alerts? Is the tech underlying communications systems slow/unreliable? Did we need to engage specific people, and they weren't available?"
- Farrukh: "High time to mitigate? Too much manual work? Too error-prone? No rollback mechanism? Too time-consuming to diagnose? Is diagnostic info not available? Not enough documentation? Are certain people a SPOF of action/information?"
- Farrukh: "To drive down X: why do we have X? Why do we even wait for a response from Y? Is the outage scenario an entirely missing use case during design? Can user flows be changed so the same failure has less impact? Can we degrade? Are our dependencies behaving as expected?"
- @lizthegrey: --- Do our systems downstream behave according to their expectations/SLOs? Do we need to either ask them to behave according to SLO or change how we rely upon them? #monitorama
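
Farrukh's TTX framing (detect, engage, mitigate) can be made mechanical: given an incident timeline, each interval points at a different class of repair item. A small sketch with hypothetical timestamps:

```python
from datetime import datetime

def ttx(start, detected, engaged, mitigated):
    """Split an incident's duration into time-to-detect/engage/mitigate (minutes)."""
    minutes = lambda a, b: (b - a).total_seconds() / 60
    return {
        "ttd": minutes(start, detected),     # impact began -> someone noticed
        "tte": minutes(detected, engaged),   # noticed -> right responder on it
        "ttm": minutes(engaged, mitigated),  # responder on it -> impact stopped
    }

# Hypothetical incident timeline:
t = ttx(
    start=datetime(2019, 6, 3, 9, 0),
    detected=datetime(2019, 6, 3, 9, 12),
    engaged=datetime(2019, 6, 3, 9, 30),
    mitigated=datetime(2019, 6, 3, 10, 15),
)
print(t)
```

A large ttd suggests alerting repair items; a large tte, paging and ownership ones; a large ttm, rollback and diagnostics.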

TODO

  • bring into notes should bring focus back into editor
  • RT button should bring focus back into editor
  • to book: battling mental models: compare/contrast explanations of reality; what things shouldn't we try? Who has authority to make decisions? Time pressure and escalating consequences
  • @acedrew: --- "Outages direct your attention to areas of your system that need your attention" ~ Nida Farrukh "an outage is the result of real-world conditions that you need to be prepared for" #monitorama https://t.co/e9PKUaRhV7
  • why do some RT get truncated?
    ```
    RT @lizthegrey: Do our systems downstream behave according to their expectations/SLOs? Do we need to either ask them to behave acco… https://t.co/SG6UxfBPOv
    ```