by Thierry de Pauw on
#lascot
People are the adaptable element of complex systems, John Allspaw @allspaw
The goal of the talk is that you walk away with new and better questions.
The topic is people and their adaptability to scenarios that are new.
Bottom Line, Up Front
the work people do is always context dependent
people adapt because they have to.
conditions and circumstances can both help and hinder this adaptation
finding, understanding and exploring how people adapt is difficult to do, but can be done
doing this work (finding the messy ways in which people adapt) has to be done in naturalistic settings, not in variable-controlled lab conditions
@EverydayKanban: RT @somesheep: @allspaw To understand adaptation we have to observe it, but that’s hard to do. It has to happen in naturalistic setting. It cannot be done in lab-type/variable-controlled environments. @allspaw #lascot
what does this have to do with resilience and Resilience Engineering @allspaw
I have to stand on the shoulders of Abeba Birhane's keynote talk "In Defence of Uncertainty"
To understand how people adapt:
The line of representation: nobody has the same mental representation of the running system. @allspaw
You never change the system. You interact with the representation of the system. The only information you have of the system comes from this representation. @allspaw
There is a world below and above the line. And we always speak about the world below the line as the system. Although we never see the system below the line. @allspaw
What are they actually doing? cognitive work
- ...
@SteveSmithCD: "People are the adaptable element of complex systems" @allspaw #lascot
@somesheep: @allspaw So what do people actually do? We need to acknowledge and then explore this aspect, starting with understanding how systems behave. We know this for some time and progress has not made a difference. @allspaw #lascot https://t.co/9yG8jwOsoa
Paper: Ironies of Automation, Lisanne Bainbridge (1982)
"... irony that the more advanced a control system is, ..." (I missed that quote)
- @AnnaArmstrong: RT @JitGo: Ironies of #automation @allspaw #lascot recommended reading
https://t.co/joBXbyTtRU https://t.co/dZMbZIsESu
Paper: Ironies of Automation: Still Unresolved after all these years, Barry Strauch (2017)
Typical industry understanding of Incident Data
@cacorriere: RT @sebrose: “Shallow, aggregate, poor strategies for coping with uncertainty” @allspaw #lascot https://t.co/IW96C0Mosa
TTR
Time to Detect (TTD)
@SteveSmithCD: I'm here for @allspaw throwing shade at SRE quantifications of incident facets #lascot
other bullshit acronyms that say nothing about cognitive work
research validates these opportunities
what makes incidents special:
- tend to wipe away things that don't immediately matter
- tend to generate energy and attention to a relatively contained area of tech and understandings of that tech
- they are a Trojan horse that allows us to look more closely at how things normally work and how people typically adapt in that normal work
how we imagine incidents happen:
- detect the issue
- develop hypothesis(es) for contributors
- invalidate hypothesis(es)
- fix issue
how incidents actually happen (see picture)
we need to be able to ask ..
- how does the flow of
- ...
How does a new person in a team learn the nuances of system behavior that aren't documented?
What do veteran engineers know about their systems that others don't?
How do people who are in different teams or have different domain expertise explain to each other what is familiar and important to watch?
How do people improvise new tools ...
- @EverydayKanban: RT @CatSwetel: OH: and by tool I mean even medium priced consultants like myself 😂 #lascot
...
@somesheep: @allspaw How are we considering what people are doing given the breadth of work and points of view? How do we foster and retain that knowledge? Software being a team sport, how are we cooperating? What can we learn from other industries? @allspaw #lascot https://t.co/W7JNSHODGd
@SteveSmithCD: Thinking about @allspaw talking about the focus an incident creates, and #CostOfDelay. An incident creates urgency, people are aware money is being lost... in feature development, that sense of urgency is often missing... #lascot
@thirstybear: 💯 this. An “incident” makes waste visible, painfully so. Day to day development generally doesn’t (until it causes said incident...?)
there is no single person:
- tenure
- domain expertise
- past experience with details
cooperative coordinated
Looking deeper we find:
- detection is different from identification or recognition
- diagnosis is not always the most challenging aspect
- confusion about how things become fixed is common
indications of surprise and novelty
- wtf happened here
- I have no idea what is going on
- well that's terrifying
all incidents can be worse
- @somesheep: @allspaw People will do things and often we miss observing them and how they keep things from getting worse. Even when we try to observe, representing this work, while it may provide some explanations, we might still miss a crucial element. @allspaw #lascot https://t.co/pBHcQgEpfL
@somesheep: @allspaw The Stella research found overlapping characteristics in 7 very different environments. So we can improve by learning from each other, but as complexity around us is increasing and so there’s urgency. @allspaw #lascot https://t.co/2fBEmVEPQa
@SteveSmithCD: Woods' Theorem: As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly https://t.co/iPvqE3Bb1Z #lascot
"... your app systems tend to become ..."
What is surprising is not that there are so many incidents ...
... it is that there are so few!
"... you know this to be true: the thing that amawes you is not that your system go down sometimes ...
... it's that its up at all."
Resilience Engineering is a Community.
is largely made up of practitioners and researchers from ...
it is not math, it is empirical
it is based upon lots of other industries:
Resilience Engineering is not
- SRE
- DevOps
- automation
- redundancy
- robustness
- high availability
...
@SteveSmithCD: "Resilience is nothing to do with software or hardware" @allspaw #lascot 💥
resilience is:
- proactively aiming to prepare to be unprepared
- sustaining the potential for future adaptive action
- ...
@chrisvmcd: Resilience is not adapting, its holding together what is so that it can continue to adapt @allspaw #lascot
@somesheep: @allspaw Do what is resilience? It’s not adapting, but maintaining that integrity of a system that allows it to exhibit its desired characteristics. @allspaw #lascot https://t.co/xurq0gIpSC
@SteveSmithCD: "Robustness is an envelope of competence" @allspaw #lascot
Paper: The theory of graceful extensibility ...
Challenges to making progress:
- current and typical approaches to learning from incidents have very little to do with actual learning
- most post-incident review documents are written to be filed, not written to be read
- efforts made toward creating blameless conditions and psychological safety for people to give their account of an incident are table stakes; these are necessary but not sufficient
to make real progress the industry needs to first ...
@somesheep: @allspaw Without being able to have open debate and a healthy culture allowing admitting failure without fear of punishment, you won’t find resilience, but that alone is not enough. @allspaw #lascot https://t.co/qq3A6vnYpN
inertia towards the status quo, oversimplifications
chronic inability to learn from other domains
...
The status quo beliefs:
- tyranny of metrics and shallow data
- under-investment in real incident analysis expertise
- oversimplified methods such as one-size-fits-all postmortem templates and unquestioned practices such as treating all incidents consistently
Inconvenient realities of shallow data
- mean-time-to-X numbers are negotiated, not objective
- all incident data is reactive and scoped to unwanted events
- trending these numbers tells us nothing about learning, prevention, expertise, proactiveness or adaptive ...
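A small sketch of why an aggregate like mean time to resolve can be shallow. This is my own illustration, not something shown in the talk: the durations below are made-up numbers, and `summarize`, `quarter_a` and `quarter_b` are hypothetical names. It only shows that two very different sets of incidents can produce the same mean.

```python
from statistics import mean, median

# Hypothetical time-to-resolve values in minutes (invented for illustration).
quarter_a = [55, 60, 65, 60, 60]   # five incidents of similar length
quarter_b = [5, 5, 10, 10, 270]    # four short incidents and one very long one


def summarize(durations):
    """Return (mean, median, max) for a list of incident durations."""
    return mean(durations), median(durations), max(durations)


summary_a = summarize(quarter_a)
summary_b = summarize(quarter_b)

# Both quarters report the same mean, although the incidents look nothing alike.
print("quarter A (mean, median, max):", summary_a)
print("quarter B (mean, median, max):", summary_b)
```

Even the median and the maximum only describe durations. None of these numbers says anything about what people noticed, tried, or learned during the incidents.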
Change is afoot:
- Nora Jones
- Casey Rosenthal
- Jessica DeVita
- Paul Reed
- John Allspaw
REdeploy conference
@SteveSmithCD: RT @cyetain: Velocity NY 2013: Richard Cook, "Resilience In Complex Adaptive Systems" -@ri_cook https://t.co/z83YNjf7zn
@somesheep: @allspaw With resilience, we’re now maybe where we were with DevOps in 2008.
Go to @REdeployConf !!! To close, and since this is an Agile conf.. (see the pic) @allspaw #lascot https://t.co/9FCdx81PO4