20200203 - CfgMgmtCamp 2020 - The Road to Reliability: Infrastructure Testing explained, Constantin Weisser

by Thierry de Pauw on

#cfgmgmtcamp

The Road to Reliability: Infrastructure Testing explained, Constantin Weisser, @iSibnZe

Infrastructure Testing doesn't get the attention it deserves, @iSibnZe

working on infrastructure has changed ... a lot!

we are now at a point where development teams can own their infrastructure (in the past you had dedicated teams for that)

=> shift responsibility

but our engineering lags behind
- version control: most teams agree on this
- pipelines: few teams run IaC from pipelines
- transparency
- steady environment
- Testing

Why should I test my infrastructure?

  • why not!
    • why do you test your application code?
    • why should infrastructure code be different? The usual objection: "but I use declarative code, nothing can go wrong ..."
  • logic: if the infrastructure code contains any sort of logic
  • verify: the infrastructure behaves as (we think) we specified it in code => doing the right thing
  • validate:
    • codifying expected behaviour triggers thinking about it
    • are we implementing the right thing?
  • invariants: make sure hard requirements always hold, e.g. port 80 must be open
  • robustness against changes

and Testing is important for automation!
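
The invariant mentioned above (port 80 must be open) can be written down as an executable check. This is only a minimal sketch, not something shown in the talk: the function names are made up for illustration, and it only checks TCP reachability from wherever the test runs.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def check_invariants(host, required_ports):
    """Return the required ports that are NOT reachable (empty list = invariant holds)."""
    return [port for port in required_ports if not port_is_open(host, port)]
```

A pipeline step could then fail whenever check_invariants(host, [80]) returns a non-empty list; the host value is whatever address your provisioning step reports.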

Workflows

The development cycle:
- develop change
- verify the change locally
- iterate

increment the infrastructure with changes -> updated infrastructure -> test the resulting behaviour

the Build Pipeline

code repo -> build artefact -> dev -> prod

between code repo and build artefact:
1. build and test infrastructure code: bare minimum is to check the validity of the definition files
2. if the tests fail, no artefact is produced
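
A rough sketch of that gate between code repo and build artefact. It assumes, purely for illustration, that the definition files are JSON documents in one directory and that the artefact is a zip of that directory; a real pipeline would call the validation command of its own tool instead. All names are hypothetical.

```python
import json
import shutil
import sys
from pathlib import Path


def validate_definitions(definitions_dir):
    """Return (file name, error) pairs for every *.json file that does not parse."""
    errors = []
    for path in sorted(Path(definitions_dir).glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            errors.append((path.name, str(exc)))
    return errors


def build_artefact(definitions_dir, artefact_base):
    """Zip the definitions into <artefact_base>.zip, but only if all of them are valid.

    Returns the archive path, or None when validation failed (no artefact is written).
    artefact_base should live outside definitions_dir.
    """
    errors = validate_definitions(definitions_dir)
    if errors:
        for name, message in errors:
            print(f"invalid definition {name}: {message}", file=sys.stderr)
        return None
    return shutil.make_archive(str(artefact_base), "zip", root_dir=str(definitions_dir))
```

Parsing is the bare minimum named above; unit/model tests would slot in next to validate_definitions before the archive is created.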

the Deployment Pipeline

dev -> pre-prod -> prod

on each transition:
1. provision infrastructure
2. run tests
3. failing tests fail the pipeline
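
The three steps per transition can be sketched as a small loop. This is an assumption-laden illustration, not the speaker's implementation: provision and run_tests are placeholder callables you would wire to your own tooling, and PipelineFailed is a made-up exception name.

```python
class PipelineFailed(Exception):
    """Raised when the tests of one environment fail; later environments are not touched."""

    def __init__(self, environment):
        super().__init__(f"tests failed in {environment}")
        self.environment = environment


def run_deployment_pipeline(environments, provision, run_tests):
    """Provision and test each environment in order, stopping at the first failure.

    provision(env) applies the infrastructure change to env.
    run_tests(env) returns True when the tests against env pass.
    Returns the list of environments that passed.
    """
    promoted = []
    for environment in environments:
        provision(environment)
        if not run_tests(environment):
            raise PipelineFailed(environment)
        promoted.append(environment)
    return promoted
```

Called with ["dev", "pre-prod", "prod"], a failure in pre-prod means prod is never provisioned.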

Abstraction Level

Cloud -> API -> tool using the API

test on all these abstraction levels:
- Unit/Model tests against the tool
- API tests: check via the API if the provisioning did what was expected

Model tests: I don't trust my code
API tests: I don't trust my tool (nor my code)
Real infrastructure tests: I have trust issues :)

Real infrastructure tests are expensive, but the only way to know for sure ...

  • @mavimo: OMG, a simple slide can't be more clear than this one... (IMO we match that levels with unit/integration/e2e in test pyramid) #cfgmgmtcamp cc/ @iSibnZe https://t.co/cuIZgqosKo

Model Tests

  • fast, cheap (no network needed)
  • detect faulty logic
  • detect illegal inputs

example: Pulumi model tests

what about Terraform?

you will have to apply some hacks to do model testing:
run terraform plan and perform checks against the resulting plan.


terraform plan -out my.plan
terraform show my.plan
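
One way to "perform checks against the plan" is to export it in machine-readable form with terraform show -json my.plan > plan.json and inspect the resource_changes list. The sketch below is my own illustration, not from the talk; the helper names are hypothetical, and it relies only on the address, type and change.actions / change.after fields of Terraform's JSON plan output.

```python
import json


def load_plan(path):
    """Load a plan exported with: terraform show -json my.plan > plan.json"""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def destructive_changes(plan):
    """Addresses of resources the plan would delete or replace."""
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change["change"]["actions"]
    ]


def planned_resources(plan, resource_type):
    """Map address -> planned attributes for resources of one type that exist after apply."""
    return {
        change["address"]: change["change"]["after"]
        for change in plan.get("resource_changes", [])
        if change["type"] == resource_type and change["change"]["after"] is not None
    }
```

A model test can then assert, for example, that destructive_changes(plan) is empty, or that every planned resource of a given type carries the attributes you expect. Values only known after apply do not appear in after, so such checks are limited to what the plan can already tell you.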

Classification of IuT (Infrastructure under Test) Pieces

  • dedicated vs reused
  • ephemeral vs persistent
  • immutable vs mutable
  • testing vs productive

Example 1: Dedicated, Ephemeral, Immutable, Testing
- test an abstraction in the infrastructure codebase
- for every test execution: Create, Test, Destroy

tests run off-site and assert that the module works as expected.

time: 20 - 40min

pros:
- isolated
- no interference with prod infra
- cheap compared to persistent IuT

cons:
- high execution times
- glue code might be extensive
- modules are not always a good starting point

example: JUnit with a Before step running terraform init; terraform apply -auto-approve and a TearDown step running terraform destroy
similar to Terratest
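
The Create, Test, Destroy cycle can be sketched in Python as a context manager around the terraform CLI. This is a hedged illustration rather than the JUnit setup from the talk: the function names are made up, the run parameter exists only so the commands can be swapped out, and destroy is given -auto-approve here because nobody is around to confirm it in a test run.

```python
import subprocess
from contextlib import contextmanager


def run_terraform(args, workdir):
    """Run one terraform command in workdir and raise if it exits non-zero."""
    subprocess.run(["terraform", *args], cwd=workdir, check=True)


@contextmanager
def ephemeral_infrastructure(workdir, run=run_terraform):
    """Create the infrastructure, hand control to the tests, then always destroy it."""
    run(["init", "-input=false"], workdir)
    try:
        run(["apply", "-auto-approve", "-input=false"], workdir)
        yield workdir
    finally:
        # Runs even when apply or a test fails, so no test infrastructure is left behind.
        run(["destroy", "-auto-approve"], workdir)
```

Tests then go inside a with ephemeral_infrastructure("path/to/module"): block, where the path is a placeholder for the module under test. Expect the 20 - 40min noted above to be spent mostly in apply and destroy.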

Example 1b: Dedicated, Persistent, Mutable, Testing
- keep test infrastructure around
- provision updates => terraform apply performs in-place updates
- faster
- higher quality because you always update existing infra as you do in production
- more expensive ($$)

Example 2: Reused, Persistent, (Im)mutable, Productive
- test productive infrastructure as it changes
- can enhance a pipeline stage => smoke tests
- mutable: provision infra and then check
- immutable: provision infra, check and then switch

But: may interfere with your production workload!
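
The immutable variant (provision infra, check and then switch) can be sketched as follows. Everything here is a placeholder for illustration: the four callables stand in for whatever your tooling does to create a candidate environment, smoke test it, route traffic to it, and tear it down.

```python
def deploy_immutable(provision_new, smoke_test, switch_traffic, destroy):
    """Provision a fresh environment and switch to it only if the smoke tests pass.

    Returns True when traffic was switched, False when the candidate was discarded.
    """
    candidate = provision_new()
    if not smoke_test(candidate):
        # The running environment is left untouched; only the candidate is removed.
        destroy(candidate)
        return False
    switch_traffic(candidate)
    return True
```

In the mutable variant there is no candidate to discard: the check runs after the productive infrastructure has already changed, which is why it works as a smoke test rather than a gate.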

How to find good test cases?

  • test things your team has broken before!
  • test updates! - Kubernetes clusters break a lot when you update them
  • test critical paths for availability
  • test contracts! (when multiple teams manage infrastructure)