6/27 LSPE meetup

by Gene Kim

#lspe

  • Etsy averages 50 deploys/day: "Here's our deploy button; we love pushing it."
    https://p.twimg.com/AwcO7q0CEAEKwdc.jpg
  • @bradjohnsonsv: Etsy focuses on getting people shipping code on first day. Always Be Shipping culture...@mikebrittain @ #LSPE
  • @bradjohnsonsv: #LSPE meetup has record attendance! About to see @mikebrittain from #Etsy, then @dbartow on "Actionable Metrics". Woot!

  • .@mikebrittain: "If you're pushing a change, everyone should know about it"

  • .@mikebrittain: "All our config flags live in a big PHP file. Allows enabling/disabling features rapidly. Much faster than a full deploy"

  • .@mikebrittain: "Failure is inevitable, a learning opportunity, and DETECTABLE"

  • Everyone can see our operational metrics: people love looking at graphs

  • Devs have graphs up, IRC channel on separate monitor

  • .@mikebrittain: "We detect failures quickly; we optimize not to prevent, but to quickly detect/recover"

  • .@mikebrittain: "Q: sounds like a lot of work; who does it? A: Engineers; they build the app: shared resp for logging/graphing/trending/alerting"

  • .@mikebrittain: "Metrics are part of every feature & so are config flags; we never roll something out where we can't see succ/fail"

  • .@mikebrittain: "Apache/PHP are boring to many people. We like boring. They work well."

  • .@mikebrittain: "we have field for etsyabtest to facilitate easy split A/B testing in campaigns, test forms, etc." (Awesome)

  • .@mikebrittain: "Our StatsD library: in one line, devs have no excuse not to log; love to see whether features are working; flushes to Graphite"

  • .@mikebrittain: "Awesome. Time to recover was 8m" (check out the graph)

    https://p.twimg.com/AwcTiRjCIAAbI1l.jpg

  • .@mikebrittain: "250K+ metrics tracked at Etsy"

  • .@mikebrittain: "We don't have rollback; we like to fix forward"

  • @bradjohnsonsv: ah, they look for outliers, use "confidence bands", to detect issues in the 250k metrics. whew @mikebrittain #ETSY #LSPE

  • Wow. I'm still trying to get my head around "no rollback." Heard this before, but I'm just realizing the implications. @mikebrittain
  • @mikebrittain: Previous (more robust) versions of Metrics-Driven Engineering slides at http://t.co/YnGHldXG #lspe /cc @wickett
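
A rough sketch of the "confidence bands" idea from the tweets above, for anyone who wants to try it. The talk notes don't say how Etsy computes its bands, so this assumes the simplest version: a band of mean plus or minus k standard deviations over a trailing window, with anything outside the band flagged as an outlier. The function names, window size, k, and sample data are all made up for illustration.

```python
from statistics import mean, stdev


def confidence_band(history, k=3.0):
    """Return (low, high): mean of history +/- k sample standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def find_outliers(series, window=5, k=3.0):
    """Return indices of points that fall outside the band built
    from the `window` points immediately before them."""
    outliers = []
    for i in range(window, len(series)):
        low, high = confidence_band(series[i - window:i], k)
        if not (low <= series[i] <= high):
            outliers.append(i)
    return outliers


# Hypothetical per-minute checkout counts; the drop to 40 is an injected failure.
checkouts = [100, 102, 98, 101, 99, 100, 40, 101]
print(find_outliers(checkouts))  # [6]
```

With 250K metrics nobody can eyeball every graph, so a check like this would run per metric and only surface the ones that leave their band. A real system would likely need seasonality-aware bands rather than a flat trailing window.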

Actionable Metrics at Full Production Scale
Dan Bartow @perfdan

  • .@perfdan: "Formerly managed Intuit engr team for TurboTax; people who owed money filed 4/15; people govt owed money to filed 2/15"
  • .@perfdan: "During prod stress testing for major retailers, we'll often have 100 ppl on conference bridge" (hope no hold music :)
  • .@perfdan: "In 2008, people were not receptive to production testing: like 98% freaked out; now widely accepted"
  • .@perfdan: "For high traffic site, until peak traffic hits, no way to know whether site can handle it. Ex: 2012 Olympic site"
  • .@perfdan: "Since 2007: biggest offender is hardware load balancers: why? SSL concurrent sessions, often licensing issue"
  • .@perfdan: "Page response time: it's the one critical metric: dictates revenue b/c it's cust experience"
  • .@perfdan: "Amazon: 100ms latency costs 1% of sales; Google lost 20% traffic b/c of .5s latency"
  • .@perfdan: "1m aggregation is too long; when outage cost is high, 2nd blip is already 2nd minute" (commenting on Etsy slide)
  • @wickett: real time metrics on the web is defined as within a few seconds
  • (Wow. Remember when 1 min was ok? :) RT @wickett: real time metrics on the web is defined as within a few seconds
  • .@perfdan: "Story: 8MB jpg accid put on home page; data ctr ran out of bandwidth; EVERYTHING BROKE! WebEx calls, email.." (haha)
  • .@perfdan: "We have a free product; love stories like when dev breaks alpha versions of their tool w/100 virtual users" (Nice)
  • .@perfdan: Watching him throwing 100 virtual users at Yahoo.com; "Hope NOC doesn't blacklist my IP addr" :)
  • @wickett: soasta free tool cloud test looks neat
  • .@perfdan: "We've loaded up 1500 Amazon EC2 instances in 7 minutes. Other cloud providers oft take longer..."

Netflix, Roy Rappaport

  • @kennethcgardner: RT @bradjohnsonsv: Lots of pictures being taken of @perfdan's "The List" of Actionable Metrics at #LSPE meetup.
  • Here it is. RT @bradjohnsonsv: Lots of pictures being taken of @perfdan's "The List" of Actionable Metrics at #LSPE meetup.

  • .@royrapaport: "Been at Netflix 1094 days; job: make things better (security monkey, Python platform, central alert gateway)"

  • .@royrapaport: "Netflix culture: freedom and responsibility; hire very smart people and let them loose"

  • .@royrapaport: "Netflix open sourced Asgard, their cloud orchestration: note that there's no authentication or login"

  • ...when @royrapaport told me this last night, I almost fell over. Laughing, shocked, aghast..

  • .@royrapaport: "Distributed Operations; Get out of the way of Developers"

  • I can hear infosec peeps fainting on absence of authentication in Netflix open source Asgard. :)

  • .@royrapaport: "Systems: flexible, scalable, self service"

  • @marcog: Netflix in the range of 10-100m metrics. Woah!

  • Really enjoying Ian Flint, Yahoo Architect on business metrics (anyone know his twitter handle?)
  • Flint: "we knew when disks were down, but not when we were losing money; saw trees, not forest"
  • Flint: "Quick tutorial on ad biz lingo: Supply are ads; demand are the displays"
  • Flint: "Core biz metrics: serves, impressions, clicks, revenue"
  • Flint: "Low granularity to reduce volatility"
  • Flint: "All delivered to central data store"
  • Flint: "Guaranteed display ads (guarantee 1MM views); non-guar ads (target to demographic); search ads (embedded into search/content)"
  • Flint: "Olympic scoring: toss out max and min, then average remaining"
  • Flint: "Group properties by behavior: Mail/Messenger: highly regular; News/Sports: event-driven traffic"
  • Flint: "Lifestyles/Movies/Autos: traffic driven by the Front Page fire hose; others: driven by organic traffic, which is random"
  • Flint: "Automated detection is hard; 'eyes on glass' is most reliable, but expensive; week over week/Olympic scoring helps refine"
  • Flint: awesome: showing live stats of Yahoo front page: showing impressions
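
A quick sketch of Flint's "Olympic scoring" combined with the week-over-week comparison, as I understood it from the talk: toss out the max and min of the historical values, average the rest, and compare the current value against that baseline. How Yahoo wires this up was not shown, so the function names, the choice of five prior weeks, and the sample numbers below are assumptions for illustration only.

```python
def olympic_average(values):
    """Drop one maximum and one minimum, then average what remains."""
    if len(values) < 3:
        raise ValueError("need at least three values to trim max and min")
    trimmed = sorted(values)[1:-1]
    return sum(trimmed) / len(trimmed)


def week_over_week_ratio(current, same_slot_previous_weeks):
    """Ratio of the current value to the Olympic-scored baseline
    for the same time slot in previous weeks."""
    baseline = olympic_average(same_slot_previous_weeks)
    return current / baseline


# Hypothetical impressions for the same hour over five prior weeks;
# 2500 is a one-off spike that the trimming discards.
history = [1000, 1100, 900, 2500, 900]
print(olympic_average(history))            # 1000.0
print(week_over_week_ratio(750, history))  # 0.75
```

Trimming the extremes keeps a single freak week (a big news event, an outage) from skewing the baseline, which fits the "reduce volatility" theme in the tweets above.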