by Thierry de Pauw on
#cfgmgmtcamp
Untitled goose game
It's a lovely morning in the data centre, and you are a horrible goose devop.
my purpose is to cause problems ... to improve things a lot
tales of horrible goose devop
Let's talk about state:
diff to compare state
cvs/svn to version state
application state will always be a thing; luckily we have application developers to take care of that
=> let's talk about system state
Bash scripts don't manage system state well. @rynchantress
enter Puppet ...
known system state + known config change = new known system state @rynchantress
tale:
postfix should be running, it wasn't
let's turn postfix on, on 200 servers
turns out postfix was off for 8 years => 8 years of mail were lying around on 200 servers
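A minimal sketch of the equation above and why the postfix tale broke it. The dictionary keys, the `apply_change` helper and the `mail_queue` entry are all hypothetical, made up for illustration; this is not how Puppet models state.

```python
from typing import Dict

State = Dict[str, str]


def apply_change(state: State, change: State) -> State:
    """Return a new state: a copy of the old state with the change laid on top."""
    new_state = dict(state)
    new_state.update(change)
    return new_state


# What we believed the servers looked like (hypothetical keys and values).
known_state = {"postfix": "stopped"}
change = {"postfix": "running"}

# known system state + known config change = new known system state
predicted = apply_change(known_state, change)

# What the servers really looked like: there was state nobody knew about.
actual_state = {"postfix": "stopped", "mail_queue": "8 years of undelivered mail"}
result = apply_change(actual_state, change)

# Anything in the result that the prediction did not contain is a surprise.
surprises = {k: v for k, v in result.items() if k not in predicted}
print(surprises)
```

The function itself is deterministic; the surprise comes entirely from the gap between `known_state` and `actual_state`, which is the point of the tale.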
what is application state?
horrible lessons learned
our systems are always more complex than we think
@felis_rex: „Chef was failing before it could break things on our seven PHP7 test servers. We accidentally discovered that day that seven servers with PHP7 could keep Etsy dot com running.“ - @rynchantress #cfgmgmtcamp
system state interacted with configuration change to cause new system state
unknown and inconsistent configuration state across infrastructure
uncontrolled source state changes combined with inconsistent configuration state
=> let's talk about state (again):
what is configuration state?
...
what is source state?
...
horrible lessons learned (continued)
Wat.
=> let's talk about state (again)
-terraform state- (in an S3 bucket): this is really configuration state
@zhenech: "all three S in S3 stand for sadness - sadness sadness sadness" @rynchantress at @cfgmgmtcamp
@binford2k: When current state doesn’t actually represent the source state you expect... you’re just a “minor” deployment away from disaster. @rynchantress @cfgmgmtcamp https://t.co/ToQLbR0KBt
horrible lessons learned (continued)
=> let's talk about state (again)
=> human state
fewer sources of state
We have Chef and Puppet running, and we'll put a bit of Ansible and Terraform on top of that. Maybe don't do that.
@f3ew: Everyone uses ssh to debug, because our tooling isn't good enough. #cfgmgmtcamp
@EdVoncken: RT @felis_rex: „They gradually took all sudo commands away from ops after the ‚bad commands broke things‘. Eventually we could only run Puppet. So we puppet exec‘d everything.“ - @rynchantress #cfgmgmtcamp
surface incorrect state
make things human-readable
=> design tools and output that make sense to humans, not only to computers
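One way to read "surface incorrect state" and "make things human-readable" together is a drift report that prints plain sentences instead of raw data structures. The sketch below assumes state can be flattened into simple name/value pairs; `describe_drift` and the example values are hypothetical.

```python
from typing import Dict, List


def describe_drift(expected: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """List every difference between expected and actual state as a readable line."""
    lines = []
    for key in sorted(set(expected) | set(actual)):
        want = expected.get(key, "<not managed>")
        have = actual.get(key, "<absent>")
        if want != have:
            lines.append(f"{key}: expected {want}, found {have}")
    return lines


# Hypothetical example: one service is in the wrong state, one is unmanaged.
report = describe_drift(
    {"postfix": "running"},
    {"postfix": "stopped", "sshd": "running"},
)
for line in report:
    print(line)
```

Unmanaged items are reported as well, since state that no tool knows about is exactly what tends to bite later.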
horrible lessons learned (continued)