2019/06/04: Monitorama Day 2

by Gene Kim on

#Monitorama

  • @acedrew: "If #kubernetes is all greek to you, just know that Grafeas is the Scribe, documenting the metadata, and Kritis is… https://t.co/fc2ihCUUrE
  • @topopal @botchagalupe
  • @lizthegrey: --- 0.1.0 is coming out soon for both Kritis and Grafeas, including hybrid cloud support.

[ed: I strongly strongly recommend people adopt this shit because... yeah, it is downright negligent to not use this. spoken as a former user of [redacted] at Google] #monitorama
- @tlockney: Really excited to check out @Grafeasio! Thanks to @aysylu22 for sharing this with us at @Monitorama. https://t.co/T3EoIMBQPa
- @crayzeigh: --- @aysylu22 This is tracking all the unique specific data about your deploys. You want to know everything about who is deploying the software, why isn’t the fix coming, when is the fix scheduled. All that helps us gain observability into our supply chain. @aysylu22 #monitorama
- @lizthegrey: --- [ed: it is negligent if you cannot verify that code deployed to production was verifiably built from peer-reviewed source, or if you don't know whether you're depending upon code with open CVEs. this is a huge step forward for DevSecOps. But it's also not o11y per se] #monitorama
- @RealGeneKim: #monitorama @aysylu22 Wonderful! https://t.co/LHxyRurhjH

  • @yebyen: --- @geek_dave @PrometheusIO @karenhchu @technosophos @CloudNativeFdn @Monitorama This tweet is what I needed to push me over the edge, if Zee and Phippy can handle Prometheus monitoring, then I certainly have enough time and ability to take LFS241 class from @linuxfoundation on monitoring :D

Up: Evolution of Observability at Coinbase: Luke Demi: @lukedemi

  • @lukedemi: "we allow buying/selling bitcoins; the business problem means that we're storing people's private keys"
  • @lukedemi: "What it looks like when a bitcoin exchange goes down for a day"
  • @asachs01: #monitorama Dat new stack tho... https://t.co/CxSEW4szWI
  • @blatanterror: When at #monitorama , make monitoring and logging jokes. Preferably with fancy acronyms. Well done @lukedemi https://t.co/w8zTmAaZXQ
  • @duckalini: --- Observability at coinbase is a funny story.

In May 2017 they hit their doom threshold and it was a bad time. Reddit advice was brutal and so they used new relic. But the dashboard meant nothing to them #Monitorama https://t.co/mwmtKLEUmq
- @lukedemi: "What you want are structured data, a killer ops team, and a sht ton of cash"
- @petecheslock: Watching @lukedemi talk about scaling during the crazy bitcoin scale events a few years ago. #monitorama https://t.co/HZJLm33YgP
- @lizthegrey: --- They wanted to know when, and *why*, things were down. Not just when.

Their wishlist: low latency, long retention, short granularity, high cardinality, security, fast aggregation, and intuitive interface... #monitorama
- @lukedemi: "During an outage/crisis, don't necessarily rely on advice from Reddit"
- @lizthegrey: --- That takes structured events, a killer ops team, and a shit ton of cash! Which Facebook had when they built SCUBA, but... SCUBA isn't available to everyone. [ed: but @honeycombio is!] #monitorama
- @amandanriso: New way of how @lukedemi used to monitor #monitorama this talk is hilarious and informative! https://t.co/6RAIz8XJME
- @lizthegrey: --- so they had JSON logs, which is a pretty decent start... because they had high cardinality values they were already storing but needed to be able to analyze.

btw, Kibana doesn't do long retention, fast aggregations, and intuitive interface #monitorama
- @duckalini: --- They started setting up structured logs.
This was hugely valuable, and just adding structured logs met most of their needs!

#monitorama https://t.co/IQP4lnRVwp

  • @lizthegrey: --- But it broke down under the load of having more engineers... who added more logs. The elasticsearch cluster exploded. They spent more time debugging Kibana than actually investigating the problems themselves. #monitorama
  • @Catchpoint: --- .@lukedemi on the value of observability ‘when you are down for days, it’s time to ask what do I need to see to know when and why I’m down’ #monitorama https://t.co/zIhSRoO3hY
  • @duckalini: --- Then when things grew EVEN MORE, they learnt what they needed and scaled as needed. They passed the doom line.

Logs and structured events are critical to good investigation.

They keep adding and adding and adding.... it was okay, til they got more engineers #Monitorama https://t.co/Y6vURY7G1D
- @lukedemi: "what went wrong with our logging? 1; build/buy gone wrong: pricing model made what we needed impractical.."
- @lizthegrey: --- It was build vs buy gone wrong -- they convinced themselves that they needed to build in-house for the made-up "security" use case.

But they shouldn't have assumed "internal" = "safe" #monitorama
- @lizthegrey: --- and the UX was hard, it was missing critical features like downsampling, preventing expensive user queries, and aggregation. [ed: almost like you should PAY SOMEONE TO DO THIS PROFESSIONALLY who does it FOR THEIR FULL-TIME JOB like us!] #monitorama
- @duckalini: --- But this strategy doesn’t scale with a growing engineering team.

They believed that this strategy was more secure because their logs were managed internally - a private key in the logs? Who cares! (Actually they cared, they realised they cared) #monitorama https://t.co/U2MMc4lnUS
- @petecheslock: “Logs and Structured events are critical to investigations when you are trying to solve problems” - @lukedemi #monitorama
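
[ed: For readers who have not worked with structured logs, here is a rough sketch of the idea the tweets above keep coming back to: one JSON object per event, with high-cardinality fields kept as queryable keys. It uses only Python's standard `logging` and `json` modules. The field names (`user_id`, `latency_ms`, `private_key`), the logger name, and the redaction list are made up for illustration; nothing here is taken from Coinbase's actual setup.]

```python
import io
import json
import logging

# Hypothetical list of keys that must never reach the log store.
REDACTED_KEYS = {"private_key"}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        event = {"level": record.levelname, "message": record.getMessage()}
        # Fields passed as extra={"fields": {...}} become top-level keys,
        # so a log backend can filter and aggregate on them.
        for key, value in getattr(record, "fields", {}).items():
            event[key] = "[REDACTED]" if key in REDACTED_KEYS else value
        return json.dumps(event, sort_keys=True)


def make_logger(stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("structured-example")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = make_logger(stream)
log.info(
    "order placed",
    extra={"fields": {"user_id": "u-12345", "latency_ms": 87, "private_key": "abc"}},
)
event = json.loads(stream.getvalue().strip())
```

Because every event carries the same keys, questions like "which users saw slow orders?" become a filter on `latency_ms` and `user_id` instead of a text search. The redaction step is a nod to the "private key in the logs" point above; a real deployment would need something more thorough.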
