2019/06/04: Monitorama Day 2B

by Gene Kim on

#Monitorama

2019/06/04: Monitorama Day 2

  • @acedrew: "If #kubernetes is all greek to you, just know that Grafeas is the Scribe, documenting the metadata, and Kritis is… https://t.co/fc2ihCUUrE
  • @topopal @botchagalupe
  • @lizthegrey: --- 0.1.0 is coming out soon for both Kritis and Grafeas, including hybrid cloud support.

[ed: I strongly strongly recommend people adopt this shit because... yeah, it is downright negligent to not use this. spoken as a former user of [redacted] at Google] #monitorama
- @tlockney: Really excited to check out @Grafeasio! Thanks to @aysylu22 for sharing this with us at @Monitorama. https://t.co/T3EoIMBQPa
- @crayzeigh: --- @aysylu22 This is tracking all the unique specific data about your deploys. You want to know everything about who is deploying the software, why isn’t the fix coming, when is the fix scheduled. All that helps us gain observability into our supply chain. @aysylu22 #monitorama
- @lizthegrey: --- [ed: it is negligent if you cannot verify that code deployed to production was verifiably built from peer-reviewed source, or if you don't know whether you're depending upon code with open CVEs. this is a huge step forward for DevSecOps. But it's also not o11y per se] #monitorama
- @RealGeneKim: #monitorama @aysylu22 Wonderful! https://t.co/LHxyRurhjH

  • @yebyen: --- @geek_dave @PrometheusIO @karenhchu @technosophos @CloudNativeFdn @Monitorama This tweet is what I needed to push me over the edge, if Zee and Phippy can handle Prometheus monitoring, then I certainly have enough time and ability to take LFS241 class from @linuxfoundation on monitoring :D

Up: Evolution of Observability at Coinbase: Luke Demi: @lukedemi

  • @lukedemi: "we allow buying/selling bitcoins; the business problem means that we're storing people's private keys"
  • @lukedemi: "What it looks like when a bitcoin exchange goes down for a day"
  • @asachs01: #monitorama Dat new stack tho... https://t.co/CxSEW4szWI
  • @blatanterror: When at #monitorama , make monitoring and logging jokes. Preferably with fancy acronyms. Well done @lukedemi https://t.co/w8zTmAaZXQ
  • @duckalini: --- Observability at coinbase is a funny story.

In May 2017 they hit their doom threshold and it was a bad time. Reddit advice was brutal and so they used new relic. But the dashboard meant nothing to them #Monitorama https://t.co/mwmtKLEUmq
- @lukedemi: "What you want are structured data, a killer ops team, and a sht ton of cash"
- @petecheslock: Watching @lukedemi talk about scaling during the crazy bitcoin scale events a few years ago. #monitorama https://t.co/HZJLm33YgP
- @lizthegrey: --- They wanted to know when, and *why* things were down. Not just when.

Their wishlist: low latency, long retention, short granularity, high cardinality, security, fast aggregation, and intuitive interface... #monitorama
- @lukedemi: "During an outage/crisis, don't necessarily rely on advice from Reddit"
- @lizthegrey: --- That takes structured events, a killer ops team, and a shit ton of cash! Which Facebook had when they built SCUBA, but... SCUBA isn't available to everyone. [ed: but @honeycombio is!] #monitorama
- @amandanriso: New way of how @lukedemi used to monitor #monitorama this talk is hilarious and informative! https://t.co/6RAIz8XJME
- @lizthegrey: --- so they had JSON logs, which is a pretty decent start... because they had high cardinality values they were already storing but needed to be able to analyze.

btw, Kibana doesn't do long retention, fast aggregations, or an intuitive interface #monitorama
- @duckalini: --- They started setting up structured logs.
This was hugely valuable, and just adding structured logs met most of their needs!

#monitorama https://t.co/IQP4lnRVwp
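
To make "structured logs" concrete, here is a minimal sketch of emitting one JSON object per log line with Python's standard `logging` module. It is not Coinbase's actual setup; the logger name, the `fields` convention, and the field names (`user_id`, `endpoint`, `duration_ms`) are all assumptions for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        event = {"level": record.levelname, "message": record.getMessage()}
        # Hypothetical convention: structured fields arrive via
        # extra={"fields": {...}} and become top-level keys.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


def make_logger(stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("structured-demo")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.info(
    "order placed",
    extra={"fields": {"user_id": "u-123", "endpoint": "/buy", "duration_ms": 87}},
)

# Each line parses back into a dict, so high-cardinality values such as
# user_id stay queryable instead of being buried in free text.
event = json.loads(buffer.getvalue())
print(event)
```

Because every line is machine-parseable, the same events can later be filtered or grouped by any field, which is what the wishlist above (high cardinality, fast aggregation) depends on.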

  • @lizthegrey: --- But it broke down under the load of having more engineers... who added more logs. The elasticsearch cluster exploded. They spent more time debugging Kibana than actually investigating the problems themselves. #monitorama
  • @Catchpoint: --- .@lukedemi on the value of observability ‘when you are down for days, it’s time to ask what do I need to see to know when and why I’m down’ #monitorama https://t.co/zIhSRoO3hY
  • @duckalini: --- Then when things grew EVEN MORE, they learnt what they needed and scaled as needed. They passed the doom line.

Logs and structured events are critical to good investigation.

They keep adding and adding and adding.... it was okay, til they got more engineers #Monitorama https://t.co/Y6vURY7G1D
- @lukedemi: "what went wrong with our logging? 1; build/buy gone wrong: pricing model made what we needed impractical.."
- @lizthegrey: --- It was build vs buy gone wrong -- they convinced themselves that they needed to build in-house for the made-up "security" use case.

But they shouldn't have assumed "internal" = "safe" #monitorama
- @lizthegrey: --- and the UX was hard, it was missing critical features like downsampling, preventing expensive user queries, and aggregation. [ed: almost like you should PAY SOMEONE TO DO THIS PROFESSIONALLY who does it FOR THEIR FULL-TIME JOB like us!] #monitorama
- @duckalini: --- But this strategy doesn’t scale with a growing engineering team.

They believed that this strategy was more secure because their logs were managed internally - a private key in the logs? Who cares! (Actually they cared, they realised they cared) #monitorama https://t.co/U2MMc4lnUS
- @petecheslock: “Logs and Structured events are critical to investigations when you are trying to solve problems” - @lukedemi #monitorama
- @duckalini: Omg! Datadog! Look at all these needs they meet! Everything Kibana missed! #monitorama https://t.co/BGsf4F8V3w
- @petecheslock: No vendor is safe in @lukedemi's #monitorama talk. https://t.co/47EvwwvUNk
- @lizthegrey: --- It was paradise compared to the pain and suffering of rolling their own Kibana with structured logs gone wrong. They could just plug information into the Datadog service...

But there's trouble in paradise #monitorama
- @RealGeneKim: A hilarious and self-effacing (and irreverent) talk, for sure! @lukedemi #monitorama
- @lizthegrey: and then you're taking an average of your percentiles NOOOOOOOOOOOOOOO #monitorama
- @duckalini: --- Datadog eased a lot of pain. It took structured logs gone wild and crashing Elasticsearch clusters. But uhhh it didn’t solve everything. And neither does Prometheus.
(Monorails = monolithic Rails applications)
Datadog's monitoring period defaults to 10sec #monitorama https://t.co/lwT5v4b0TV
- @TimeSeriesArt: Self-portrait of an outage: great #TimeSeriesArt from @lukedemi at #monitorama https://t.co/bHk5xZMD3x
- @lizthegrey: --- It forces you to have a metrics collection/aggregation window, which... is unhelpful for granularity.

Percentile aggregation, GRRRRRRR. [ed: GRRRRRR. THIS IS WHY WE USE HISTOGRAM TYPES!] #monitorama
- @lizthegrey: --- So nobody understands cardinality, because people shouldn't have to worry about the cost of a tag. Userid sounds like a great tag until you have 30 million users and your bill is a million dollars. #monitorama
- @duckalini: --- This is to help with percentiles. But then the p99 you get across everything turns out to be useless (Datadog has addressed this now).
Cardinality - is what’s happening in this small square? You need the wider context. Metrics don’t give you context.

#Monitorama https://t.co/6n3qdKeXg7
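
The complaint about averaging percentiles can be checked with a small worked example. This is a sketch with made-up numbers, not data from the talk: two hypothetical hosts (`busy` and `quiet`), a nearest-rank percentile, and assumed bucket boundaries in `BOUNDS`. It compares the average of per-host p99s with the p99 of all requests, and with the p99 read from merged histograms.

```python
import bisect
import math

BOUNDS = [10, 50, 100, 500, 1000]  # assumed bucket upper bounds, in ms


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def to_histogram(values):
    """Count values per bucket; the last slot is the overflow bucket."""
    counts = [0] * (len(BOUNDS) + 1)
    for v in values:
        counts[bisect.bisect_left(BOUNDS, v)] += 1
    return counts


def histogram_percentile(counts, q):
    """Upper bound of the bucket that contains the q-th percentile."""
    target = max(1, math.ceil(sum(counts) * q / 100))
    seen = 0
    for i, count in enumerate(counts):
        seen += count
        if seen >= target:
            return BOUNDS[i] if i < len(BOUNDS) else float("inf")


busy = [10] * 990    # hypothetical host: 990 fast requests
quiet = [1000] * 10  # hypothetical host: 10 slow requests

averaged_p99 = (percentile(busy, 99) + percentile(quiet, 99)) / 2
true_p99 = percentile(busy + quiet, 99)

# Histograms with shared bucket bounds can be summed without losing meaning.
merged = [a + b for a, b in zip(to_histogram(busy), to_histogram(quiet))]
merged_p99 = histogram_percentile(merged, 99)

print(averaged_p99, true_p99, merged_p99)
```

In this constructed case the averaged figure is far from the p99 of the combined traffic, while the merged histogram lands in the right bucket. Real histogram types trade some precision for that mergeability, depending on how the buckets are chosen.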

  • @cyen: --- trying to zoom in on pre-aggregated metrics... is like low-res art, having 🙌 lost all their context 🙌 @lukedemi ftw at #monitorama https://t.co/X4ZJf0f9CL
  • @vllry: RT @lizthegrey: Pre-aggregations LIMIT THE QUESTIONS YOU CAN ASK IN THE FUTURE.

We have to stop using gauges and counters and metrics. [ed: OMG YES] #monitorama
- @lizthegrey: RT @cyen: "there's this myth that we need three products for everything. and it's a 🔥 terrible experience for developers 🔥" @lukedemi at #monitorama https://t.co/GPkMQkhxSo

  • His proposal is ET: events & traces; not MELT (metrics/events/logs/traces)
  • @duckalini: --- Let’s all work towards an answer for this. Double down on looking into events and traces. Thanks for this great and funny talk @lukedemi #monitorama https://t.co/Aha1VsdlMa
  • @proudboffin: Is this too much to ask for? Observability nirvana for developers @lukedemi #Monitorama https://t.co/cAuUmYW8Jo
  • @Catchpoint: --- ‘metrics alone DO NOT give you context and you NEED context’ 👏👏👏 @lukedemi showing how looking at the granular details of The Starry Night doesn’t tell you what’s really going on... the lights are out in the church 😉 #monitorama https://t.co/5o6pUBj2eT
  • @RealGeneKim: --- I didn't know this: the central point of Starry Night is that there are no lights in the church — lights on in all the homes, but no longer in the church. Huh!

#monitorama

  • @jasongrimes: RT @RealGeneKim: I didn't know this: the central point of Starry Night is that there are no lights in the church — lights on in all the homes, but no longer in the church. Huh!

#monitorama https://t.co/Zy31ApiAIv
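
The claim from this talk that pre-aggregation limits the questions you can ask later can be illustrated with a toy comparison. The sketch below uses invented events and field names (`endpoint`, `status`, `user_id`, `duration_ms`); `group_count` is a hypothetical helper, not an API from any product mentioned in the talk.

```python
from collections import Counter

# Hypothetical raw events, one per request.
events = [
    {"endpoint": "/buy", "status": 200, "user_id": "u1", "duration_ms": 40},
    {"endpoint": "/buy", "status": 500, "user_id": "u2", "duration_ms": 900},
    {"endpoint": "/sell", "status": 200, "user_id": "u1", "duration_ms": 35},
    {"endpoint": "/buy", "status": 500, "user_id": "u2", "duration_ms": 950},
]

# A counter pre-aggregated by (endpoint, status): cheap to store, but the
# user_id dimension is gone for good once the raw events are discarded.
preaggregated = Counter((e["endpoint"], e["status"]) for e in events)


def group_count(rows, key, where=lambda row: True):
    """Count rows per value of `key`, decided at query time."""
    return Counter(row[key] for row in rows if where(row))


# With raw events, a question nobody planned for is still answerable:
# which users are seeing server errors?
errors_by_user = group_count(
    events, "user_id", where=lambda e: e["status"] >= 500
)

print(preaggregated[("/buy", 500)])  # the counter knows how many errors
print(dict(errors_by_user))          # only the events know who hit them
```

The cost side of the trade-off is the cardinality point made earlier: keeping `user_id` on every event is only practical if the storage backend does not bill or index per distinct tag value.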

Up: Sergey Fedorov, Netflix @Serginio89 What can your network teach you about your service?

  • @wiredferret: --- #monitorama @sfedorov: If we want to build a burger and scale it, we’re going to have to experiment, test, iterate, and analyze the results, as quickly as possible, as cheaply as we can.
  • @crayzeigh: [Ah! Found the twitter: @sfedov appears to be the right person] #Monitorama

  • Was fun comparing note-taking experiences with @lizthegrey!

  • @RealGeneKim: #monitorama Thanks Sal!!!

  • @lizthegrey: Next up is @Serginio89 on network monitoring! #monitorama

TODO

  • to book: people add tag to user-id: there's 3MM user-ids; blows up the box.
  • make RT and Pull Into Notes button bigger
  • @aysylu22: RT @acedrew: "If #kubernetes is all greek to you, just know that Grafeas is the Scribe, documenting the metadata, and Kritis is the Judge determining if the deployment target meets policy requirements" ~ @aysylu22 #monitorama https://t.co/vj1tBbzdIc
  • @lizthegrey: --- Networks are not magical black boxes. You can improve and iterate upon them. [ed: I miss [redacted] from Google which put network data right into dapper traces as annotations] #monitorama
  • @stevenk11011: RT @acedrew: "Engineering is not writing software, Engineering is the art of solving problems" ~ @lizthegrey #monitorama https://t.co/CFrth
  • @crayzeigh: --- @sfedov Networking data is tricky because it runs across infrastructure that you can’t necessarily control and is affected by lots of variables, but we know how it works, we can measure and improve our traffic. @sfedov #monitorama
  • @wiredferret: #monitorama @sfedorov: Synthetic monitoring has full control, flexibility, relatively low cost - but it doesn’t represent all clients.
  • @lizthegrey: --- There are a couple of approaches -- synthetic monitoring, real user monitoring, etc.

but you can combine into client probes which run from your real clients! #monitorama
- @crayzeigh: --- @sfedov We can use synthetic monitoring with full control and flexibility. It’s low cost, but it does not represent all clients.

Netflix wanted better data, so they used RUM alongside it to run probes from real clients.
@sfedov #Monitorama
- @crayzeigh: --- @sfedov Experiment example: Fastest server location. Create a probe that gets a similar payload off multiple server targets. and build a viz.

Probes have to run in parallel because network conditions change over time. Serial tests will contain dirty data. @sfedov #Monitorama
- @lizthegrey: --- To conduct multiple fair experiments you have to avoid letting them contaminate each other, or compensate for it by running them in random order. #monitorama
- @lizthegrey: --- and you have to avoid survivorship bias -- because if a correlated failure means a client can't fetch the experiment to run or can't report the result... then you'll only see the successful probes. #monitorama
- @thenewyogurt: RT @acedrew: "Engineering is not writing software, Engineering is the art of solving problems" ~ @lizthegrey #monitorama https://t.co/CFrth
- @lizthegrey: Remember, latency data has a shape; it isn't a single number. [ed: use both heatmaps and histograms, not just histograms] #monitorama
- @crayzeigh: --- @sfedov It’s easy to dive down the data rabbit hole. Remember ultimately we’re trying to improve the user experience. Anything not aligned with that is a waste of time. @sfedov #Monitorama
- @thenewstack: RT @LibbyMClark: You can use the same observability tools for optimizing cost at scale says Ryan Lopopolo @stripe #Monitorama Don’t trust AWS pricing tools! Measure the real cost. https://t.co/u5ristk2GP
- @thenewstack: RT @LibbyMClark: The cost and usage report is the detailed billing of interactions with AWS. You can and should suck it into ETL and data warehouse infrastructure to make use of it says Ryan Lopopolo @stripe #Monitorama

  • @lizthegrey: Sometimes network outages aren't actionable if an ISP really is hard down, but not all failures are like that. #monitorama
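
Two of the probe-design points from the network talk (randomize the order of targets so experiments do not contaminate each other, and record failures so the data is not survivor-only) can be sketched as follows. This is not Netflix's probe implementation: `run_probe`, `fake_fetch`, the target names, and the latencies are all hypothetical, and no real network calls are made. It shows the random-order variant rather than parallel execution.

```python
import random


def run_probe(targets, fetch, rng):
    """Measure every target once, in random order, keeping failures."""
    order = list(targets)
    rng.shuffle(order)  # no target systematically runs first or last
    results = []
    for target in order:
        try:
            latency_ms = fetch(target)
            results.append(
                {"target": target, "ok": True, "latency_ms": latency_ms}
            )
        except OSError as exc:
            # Keep the failed measurement; dropping it would leave only
            # the successful probes in the dataset.
            results.append({"target": target, "ok": False, "error": str(exc)})
    return results


def fake_fetch(target):
    """Stand-in for a real download; returns a made-up latency in ms."""
    latencies = {"us-east": 42.0, "eu-west": 95.0}
    if target not in latencies:
        raise OSError("unreachable")
    return latencies[target]


results = run_probe(
    ["us-east", "eu-west", "ap-south"], fake_fetch, random.Random(7)
)
failed = [r["target"] for r in results if not r["ok"]]
print(results)
print(failed)
```

A real client probe would also need a way to report results when the primary reporting path is itself affected by the failure, which is the harder half of the survivorship problem described above.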