20191010 - LASCOT2019 - People are the adaptable element of complex systems, John Allspaw @allspaw

by Thierry de Pauw on

#lascot

People are the adaptable element of complex systems, John Allspaw @allspaw

The goal of the talk is that you walk away with new and better questions.

The topic is people and their adaptability to scenarios that are new.

Bottom Line, Up Front

  • people can adapt to unforeseen circumstances and machines cannot @allspaw

the work people do is always context dependent

  • people adapt because they have to.

  • conditions and circumstances can both help and hinder this adaptation

  • finding, understanding and exploring how people adapt is difficult to do, but can be done

  • doing this work (finding the messy ways in which people adapt) has to be done in naturalistic settings, not in variable-controlled lab conditions

  • @EverydayKanban: RT @somesheep: @allspaw To understand adaptation we have to observe it, but that’s hard to do. It has to happen in naturalistic setting. It cannot be done in lab-type/variable-controlled environments. @allspaw #lascot

  • what does this have to do with resilience and Resilience Engineering? @allspaw

I have to stand on the shoulders of Abeba Birhane's keynote talk "In Defence of Uncertainty"

  • @somesheep: @allspaw Imagine tomorrow at your work place, everybody would step away from all the computers and not touch then for 24h. Would your system still running as expected? If you say “yes”, how about a week? Do you know what each person is doing that keeps the system going? @allspaw #lascot

To understand how people adapt:

  • What are they doing?
  • What are they actually doing?

The line of representation: nobody has the same mental representation of the running system. @allspaw

  • @somesheep: @allspaw When we model systems, we routinely do not include the people who know and maintain these systems. Once we do, things become very different. What changes is that we now have a line of representation. @allspaw #lascot https://t.co/yLK6AwodGa

You never change the system. You interact with the representation of the system. The only information you have of the system comes from this representation. @allspaw

There is a world below and a world above the line. We always speak about the world below the line as the system, although we never see the system below the line. @allspaw

  • @somesheep: @allspaw In reality, we never see or touch the things below that line of representation. We interact with those parts via tools or other systems. You really work and change things above the line and therefore we have to take these things seriously. @allspaw #lascot https://t.co/kOhYHPLb9U

What are they actually doing? cognitive work
- ...

  • @SteveSmithCD: "People are the adaptable element of complex systems" @allspaw #lascot

  • @somesheep: @allspaw So what do people actually do? We need to acknowledge and then explore this aspect, starting with understanding how systems behave. We know this for some time and progress has not made a difference. @allspaw #lascot https://t.co/9yG8jwOsoa

Paper: Ironies of Automation, Lisanne Bainbridge (1982)

"... irony that the more advanced a control system is, ..." (I missed that quote)
- @AnnaArmstrong: RT @JitGo: Ironies of #automation @allspaw #lascot recommended reading
https://t.co/joBXbyTtRU https://t.co/dZMbZIsESu

Paper: Ironies of Automation: Still Unresolved after all these years, Barry Strauch (2017)

Typical industry understanding of Incident Data

  • @cacorriere: RT @sebrose: “Shallow, aggregate, poor strategies for coping with uncertainty” @allspaw #lascot https://t.co/IW96C0Mosa

  • TTR

  • Time to Detect TTD

  • @SteveSmithCD: I'm here for @allspaw throwing shade at SRE quantifications of incident facets #lascot

  • other bullshit acronyms that say nothing about cognitive work

research validates these opportunities

what makes incidents special:
- tend to wipe away things that don't immediately matter
- tend to generate energy and attention to a relatively contained area of tech and understandings of that tech
- they are a Trojan horse that allows us to look more closely at how things normally work and how people typically adapt in that normal work

how we imagine incidents happen:
- detect the issue
- develop hypothesis(es) for contributors
- invalidate hypothesis(es)
- fix issue

  • @somesheep: @allspaw We think of incidents of regular, well definable things (nice rectangles and lines), but incidents are messy with intertwined and hard to map co-dependencies. There is a whole catalogue of observations we have to make. What do people do, use tools, interactions! @allspaw #lascot https://t.co/tOvEGIshFC

how incidents actually happen (see picture)

we need to be able to ask ..
- how does the flow of
- ...

How does a new person in a team learn the nuances of system behavior that isn't documented?
What do veteran engineers know about their system that others don't?
How do people who are in different teams or have different domain expertise explain to each other what is familiar and important to watch?
How do people improvise new tools ...
- @EverydayKanban: RT @CatSwetel: OH: and by tool I mean even medium priced consultants like myself 😂 #lascot

...

  • @somesheep: @allspaw How are we considering what people are doing given the breadth of work and points of view? How do we foster and retain that knowledge? Software being a team sport, how are we cooperating? What can we learn from other industries? @allspaw #lascot https://t.co/W7JNSHODGd

  • @SteveSmithCD: Thinking about @allspaw talking about the focus an incident creates, and #CostOfDelay. An incident creates urgency, people are aware money is being lost... in feature development, that sense of urgency is often missing... #lascot

  • @thirstybear: 💯 this. An “incident” makes waste visible, painfully so. Day to day development generally doesn’t (until it causes said incident...?)

there is no single person:
- tenure
- domain expertise
- past experience with details

cooperative coordinated

Looking deeper we find:
- detection is different from identification or recognition
- diagnosis is not always the most challenging aspect
- confusion about how things become fixed is common

indications of surprise and novelty
- wtf happened here
- I have no idea what is going on
- well that's terrifying

  • @somesheep: @allspaw This is cognitive work. We detect far quicker than we can identify or recognise. It’s not always the mysteries or stories of detective work that are most relevant. In this whole ranges of vectors and angles, we’re familiar with this dialogue. @allspaw #lascot https://t.co/zGucvBhqad

all incidents can be worse
- @somesheep: @allspaw People will do things and often we miss observing them and how they keep things from getting worse. Even when we try to observe, representing this work, while it may provide some explanations, we might still miss a crucial element. @allspaw #lascot https://t.co/pBHcQgEpfL

  • @somesheep: @allspaw The Stella research found overlapping characteristics in 7 very different environments. So we can improve by learning from each other, but as complexity around us is increasing and so there’s urgency. @allspaw #lascot https://t.co/2fBEmVEPQa

  • @SteveSmithCD: Woods' Theorem: As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly https://t.co/iPvqE3Bb1Z #lascot

"... your app systems tend to become ..."

  • @wouterla: Murphy's law is wrong: what could go down, almost never does. The reason that's true is people. #lascot @allspaw https://t.co/7ywSvDemik

What is surprising is not that there are so many incidents ...
... it is that there are so few!
"... you know this to be true: the thing that amazes you is not that your systems go down sometimes ...
... it's that they are up at all."

  • @somesheep: @allspaw Once we grasp how much can go wrong, we realise that very little does. Murphy does not apply here? Because people step in when they see gaps and mitigate consequences, all the time. That’s what we need more data on. @allspaw #lascot https://t.co/KJKxm77sgp

Resilience Engineering is a Community.
it is largely made up of practitioners and researchers from ...
it is not math, it is empirical

it is based upon lots of other industries:

  • @somesheep: @allspaw And there is an entire field working on these problems. A mix of domains and backgrounds, done empirically, recently also from software, related to chaos engineering too. @allspaw #lascot https://t.co/snZLgTjjLj

Resilience Engineering is not
- SRE
- DevOps

- Invented by any company

  • automation

  • redundancy

  • robustness

  • high availability

  • ...

  • @SteveSmithCD: "Resilience is nothing to do with software or hardware" @allspaw #lascot 💥

resilience is:
- proactively aiming to prepare to be unprepared
- sustaining the potential for future adaptive action
- ...

  • @chrisvmcd: Resilience is not adapting, its holding together what is so that it can continue to adapt @allspaw #lascot

  • @somesheep: @allspaw Do what is resilience? It’s not adapting, but maintaining that integrity of a system that allows it to exhibit its desired characteristics. @allspaw #lascot https://t.co/xurq0gIpSC

  • @SteveSmithCD: "Robustness is an envelope of competence" @allspaw #lascot

  • @malk_zameth: RT @SteveSmithCD: Woods' Theorem: As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly https://t.co/iPvqE3Bb1Z #lascot

Paper: The theory of graceful extensibility ...

Challenges to making progress:
- current and typical approaches to learning from incidents have very little to do with actual learning
- most post-incident review documents are written to be filed, not written to be read
- efforts made toward creating blameless conditions and psychological safety for people to give their account of an incident are table stakes: these are necessary but not sufficient
to make real progress the industry needs to first ...

  • @somesheep: @allspaw Without being able to have open debate and a healthy culture allowing admitting failure without fear of punishment, you won’t find resilience, but that alone is not enough. @allspaw #lascot https://t.co/qq3A6vnYpN

  • inertia towards the status quo, oversimplifications

  • chronic inability to learn from other domains

  • ...

The status quo beliefs:
- tyranny of metrics and shallow data
- under-investment in real incident analysis expertise
- oversimplified methods such as one-size-fits-all postmortem templates, and unquestioned practices such as treating all incidents consistently

Inconvenient realities of shallow data
- mean-time-to-X numbers are negotiated, not objective
- all incident data is reactive and scoped to unwanted events
- trending these numbers tells us nothing about learning, prevention, expertise, proactiveness or adaptive capacity

  • @somesheep: @allspaw There are plenty of other obstacles you need to avoid before insights can be discovered. Also mind the difference between reporting and exploring. (Your existing report won’t show you what you didn’t know already to look for.) @allspaw #lascot https://t.co/oKJyUtgEHc

Change is afoot:
- Nora Jones
- Casey Rosenthal
- Jessica DeVita
- Paul Reed
- John Allspaw

REdeploy conference

  • @SteveSmithCD: RT @cyetain: Velocity NY 2013: Richard Cook, "Resilience In Complex Adaptive Systems" -@ri_cook https://t.co/z83YNjf7zn

  • @somesheep: @allspaw With resilience, we’re now maybe where we were with DevOps in 2008.
    Go to @REdeployConf !!! To close, and since this is an Agile conf.. (see the pic) @allspaw #lascot https://t.co/9FCdx81PO4