20200203 - CfgMgmtCamp 2020 - Untitled Config Game, Ryn Daniels

by Thierry de Pauw

#cfgmgmtcamp

Untitled Config Game, Ryn Daniels, @rynchantress

Untitled goose game

It's a lovely morning in the data centre, and you are a horrible goose devop.

my purpose is to cause problems ... to improve things a lot

tales of a horrible goose devop

Let's talk about state:

  • application state
  • system state

diff to compare state
cvs/svn to version state

application state will always be a thing; luckily, we have application developers to take care of that

=> let's talk about system state

  • what is the system doing?
  • what is the system supposed to be doing?
  • what versions of things exist in the system?
  • what is and is not installed?
  • what things changed and when and by whom?

Bash scripts don't manage system state well. @rynchantress

enter Puppet ...

known system state + known config change = new known system state @rynchantress
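
A minimal sketch of that equation in Python, assuming both the state and the change can be modelled as flat dictionaries. The keys and values are hypothetical and not from the talk. It only illustrates the idea: the new state is exactly as trustworthy as the "known" state you started from.

```python
def apply_change(known_state: dict, change: dict) -> dict:
    """Return the new state: the old state with the change layered on top.

    The input dictionaries are not modified.
    """
    return {**known_state, **change}


# Hypothetical example values, for illustration only.
known_state = {"postfix": "stopped", "ntp": "running"}
change = {"postfix": "running"}

new_state = apply_change(known_state, change)
print(new_state)
```

If `known_state` does not match what is really on the machine, the result is a guess, which is where the tale below starts.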

tale:
postfix should be running, it wasn't
let's turn postfix on across 200 servers
turns out postfix was off for 8 years => 8 years of mail were lying around on 200 servers

  • known state wasn't actually known
  • configuring system state didn't account for unknown application state
  • undo in configuration state couldn't undo changes to application state
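
One way to act on these three points is a pre-flight check that compares what you assume with what you actually observe, and refuses to proceed when they differ. The sketch below is only an illustration: the fact names (`postfix_running`, `mail_queue_empty`) are hypothetical, and how the observed facts are collected is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Precheck:
    safe: bool
    reasons: list = field(default_factory=list)


def precheck(assumed: dict, observed: dict) -> Precheck:
    """Compare assumed facts with observed facts before applying a change."""
    reasons = [
        f"{key}: assumed {assumed[key]!r}, observed {observed.get(key)!r}"
        for key in sorted(assumed)
        if observed.get(key) != assumed[key]
    ]
    return Precheck(safe=not reasons, reasons=reasons)


# Hypothetical facts: the second one is application state, not system state.
assumed = {"postfix_running": True, "mail_queue_empty": True}
observed = {"postfix_running": False, "mail_queue_empty": False}

result = precheck(assumed, observed)
if not result.safe:
    print("refusing to apply change:")
    for reason in result.reasons:
        print("  " + reason)
```

The check cannot undo anything, but it turns an unknown into a known unknown before the change goes out.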

what is application state?

  • what is the application doing?
  • what does the application care about?
  • what things changed, when and by whom?
  • it's always a caching problem

horrible lessons learned

  • don't forget about application state
  • knowns aren't always known (known unknowns and unknown unknowns)

our systems are always more complex than we think

  • @felis_rex: „Chef was failing before it could break things on our seven PHP7 test servers. We accidentally discovered that day that seven servers with PHP7 could keep Etsy dot com running.“ - @rynchantress #cfgmgmtcamp

  • system state interacted with configuration change to cause new system state

  • unknown and inconsistent configuration state across infrastructure

  • uncontrolled source state changes combined with inconsistent configuration state

=> let's talk about state (again):

  • application state
  • system state
  • configuration state: changes in the configuration manager (Chef, Puppet, ...)
  • source state: configuration definitions, whatever goes into source control

what is configuration state?

...

what is source state?

...

horrible lessons learned (continued)

  • monitor your configuration state
  • don't leave your states unsupervised
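
A small sketch of what monitoring configuration state could look like, assuming you can get the timestamp of the last successful configuration run per node from somewhere. The node names, the one-hour threshold and the data source are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_nodes(last_success: dict, now: datetime, max_age: timedelta) -> list:
    """Return the nodes whose last successful config run is older than max_age."""
    return sorted(
        node for node, seen in last_success.items() if now - seen > max_age
    )


# Hypothetical data: in practice this would come from your config manager's reports.
now = datetime(2020, 2, 3, 12, 0, tzinfo=timezone.utc)
last_success = {
    "web01": now - timedelta(minutes=20),
    "php7-test01": now - timedelta(days=3),
}

stale = stale_nodes(last_success, now, max_age=timedelta(hours=1))
print(stale)
```

A node that has been failing its runs for days shows up here instead of being discovered by accident.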

Wat.

  • wildly and unknowably different configuration states: someone applied a Terraform change from a source that wasn't in source control
  • current configuration state not fully captured in source state
  • what even was in the Terraform plan output again?!

=> let's talk about state (again)

  • application state
  • system state
  • configuration state
  • source state
  • -terraform state- (in an S3 bucket): is really a configuration state

  • @zhenech: "all three S in S3 stand for sadness - sadness sadness sadness" @rynchantress at @cfgmgmtcamp

  • @binford2k: When current state doesn’t actually represent the source state you expect... you’re just a “minor” deployment away from disaster. @rynchantress @cfgmgmtcamp https://t.co/ToQLbR0KBt

horrible lessons learned (continued)

  • enforce source state completeness and config state consistency
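
As a rough illustration of such a check, the sketch below compares source state with configuration state and reports what is not captured in source, what was never applied, and what differs. It assumes both states can be exported as flat dictionaries. The resource names are hypothetical, and this is not how Terraform itself compares state.

```python
def drift_report(source: dict, applied: dict) -> dict:
    """Compare declared (source) state with applied (configuration) state."""
    return {
        # Applied somewhere, but never committed to source control.
        "not_in_source": sorted(applied.keys() - source.keys()),
        # Declared in source control, but not present in what is applied.
        "not_applied": sorted(source.keys() - applied.keys()),
        # Present on both sides with different values.
        "changed": sorted(
            key for key in source.keys() & applied.keys()
            if source[key] != applied[key]
        ),
    }


# Hypothetical resources, for illustration only.
source = {"web_sg": "ports 80,443", "bucket": "private"}
applied = {
    "web_sg": "ports 80,443,8080",
    "bucket": "private",
    "debug_vm": "t3.large",
}

report = drift_report(source, applied)
for kind, names in report.items():
    print(f"{kind}: {names}")
```

Anything listed under `not_in_source` is the "someone applied a change that wasn't in source control" case.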

=> let's talk about state (again)

  • application state
  • system state
  • configuration state
  • source state
  • -terraform state-: is really configuration state
  • human state

=> human state

fewer sources of state

We have Chef and Puppet running, and we'll put a bit of Ansible and Terraform on top of that. Maybe don't do that.

  • @f3ew: Everyone uses ssh to debug, because our tooling isn't good enough. #cfgmgmtcamp

  • @EdVoncken: RT @felis_rex: „They gradually took all sudo commands away from ops after the ‚bad commands broke things‘. Eventually we could only run Puppet. So we puppet exec‘d everything.“ - @rynchantress #cfgmgmtcamp

surface incorrect state

make things human-readable

=> design tools/output that make sense to humans, not only to computers

  • @felis_rex: „The trick is to make metrics visible and actionable without making it overwhelming.“ - paraphrased from @rynchantress #cfgmgmtcamp
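
A tiny sketch of output designed for humans, assuming changes arrive as (name, old value, new value) tuples. The format and the default limit of three lines are arbitrary choices for illustration: show the first few changes in plain words and summarise the rest instead of dumping everything.

```python
def summarize(changes: list, limit: int = 3) -> str:
    """Render at most `limit` changes as readable lines, then a count of the rest."""
    lines = [f"{name}: {old} -> {new}" for name, old, new in changes[:limit]]
    hidden = len(changes) - limit
    if hidden > 0:
        lines.append(f"... and {hidden} more change(s)")
    return "\n".join(lines)


# Hypothetical changes, for illustration only.
changes = [
    ("postfix", "stopped", "running"),
    ("php", "5.6", "7.0"),
    ("ntp", "absent", "installed"),
    ("sudoers", "v12", "v13"),
    ("motd", "v1", "v2"),
]

summary = summarize(changes)
print(summary)
```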

horrible lessons learned (continued)

  • get rid of unreliable tools
  • make sure your tooling is usable by humans