2012/6/26: Velocity Conference

by Gene Kim on

#velocityconf

RT @velocityconf: FYI - @souders and @allspaw are planning something #awesome to kick off #velocityconf keynotes. Catch it at 8:30a PT; http://t.co/KyZJ0onG.

@souders, @allspaw

  • Hey, @kartar: just saw that you're talking rollbacks today at #velocityconf! Looking fw to it, buddy! 230pm today.#velocityconf
  • Nice! RT @skullboxx/@velocityconf #velocityconf is streaming keynotes and selected sessions Tue + Wed - http://t.co/eq50lhoZ

  • Jay Parikh, Facebook: "Building For A Billion Users"

  • Parikh: "500M ppl visit site/day; in next 30m: 10TB logs -> Hadoop; 105TB data into Hive; 6MM photos; 160MM newsfeeds"

  • Parikh: "in next 30m: 108B MySQL queries; 2.8T cache ops"

  • .@jayparikh: "every empl goes thru bootcamp: 1st day: fixed prob in local Facebook instance, depl to 100M users next day"

  • .@jayparikh: "engrs can pick what team to join after 6 wks of bootcamp; no hiring committees"

  • RT @mikeodea: Facebook: 6 billion mobile messages (!!) every 30 minutes

  • .@jayparikh: "2nd Facebook principle: Move Fast: hundreds of code pushes daily: graph of commits/day"

    https://p.twimg.com/AwU7NjsCAAA3d_V.jpg

  • .@jayparikh: "Normally, # of code commits goes down over time: ours has gone up"

  • .@jayparikh: "Gatekeeper: like A/B testing, we phase-in features to users: 500M+ Gatekeeper chks/sec: fast flow releases"

  • .@jayparikh: "Lots of

  • RT @nphase: FB Perflab: Performance tests every commit against real traffic before exposing to prod

    https://p.twimg.com/AwU7fb-CAAEfYF3.jpg

  • .@jayparikh: "Claspin: high density heatmap viewers for large svcs: quickly pattern match vs. random SSH'ing into servers"
  • .@jayparikh: "Problem when doing rollouts: caches that take days/wks to replenish on process restarts"
  • .@jayparikh: "Soln: in 1 hackathon, moved to shared mem, so cache replenish takes days, sometimes hours: sped up flow"

  • .@jayparikh: "constant focus on automation makes life more fun: repetitive work taken away by tools: equiv to 350 ops engrs"

  • RT @xthestreams: Scaling to 1B users at facebook is not down to a single person, it's a function of a large # of small tweaks

  • .@jayparikh: "Facebook 3rd Principle: Be Bold: iterate rapidly, then launch to 1B users overnight"

  • Wow. RT @xthestreams: "People, tools and way way down the list, process." Facebook on automation

  • .@jayparikh: "Forest City, NC data center didn't just copy Prineville, OR; chged everything: network, hw, sw" (Wow.)

  • .@jayparikh: "chgs included moving HD from back to front; 2x CPU; 40% more throughput"

  • Wow. RT @mark_barger: RT @nphase: FB Operates 15,000 servers per datacenter technician. 15,000. #velocityconf

  • .@jayparikh: "5TB/sec peak capacity?!?"
  • .@jayparikh: "12x improvement in bandwidth per rack; everything chged, including cables"
  • .@jayparikh: "To get mega-project like 3rd data ctr build completed in Sweden: need lots of disciplines"
  • .@jayparikh: "Capacity planning & perf engineering is the same thing at Facebook"
  • .@jayparikh: "'We need 8K servers by next week around globe!!!'; good solution next week: '16 new servers' "
  • .@jayparikh: "Describing 2010 outage: strangest incident in my career: accid launched every secret feature at 1 time"
  • (Amazing) .@jayparikh: "To recover, we pulled DNS to take entire site down; couldn't turn it off!"
  • RT @ligolnik: RT @xthestreams: A bad day for change management at Facebook #velocityconf http://t.co/p3qsZGCI
  • RT @davenolan: FB accid launched all secret features to everyone at once "We tried to turn off site & we couldn't" #skynet
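
The Gatekeeper idea in the notes above (phasing features in to a slice of users) can be sketched roughly as a percentage rollout check. This is only an illustration, not Facebook's implementation: the function name `in_rollout`, the SHA-256 bucketing, and the 10,000-bucket granularity are all assumptions made for the example.

```python
import hashlib


def in_rollout(feature: str, user_id: int, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` of buckets.

    Hypothetical helper: hashing (feature, user_id) gives each user a stable
    bucket per feature, so raising `percent` only ever adds users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100          # percent is on a 0..100 scale


if __name__ == "__main__":
    enabled = sum(in_rollout("new_feed", uid, 10) for uid in range(10_000))
    print(f"{enabled} of 10000 users see the feature at 10%")
```

Because the bucket is derived from the user and feature rather than stored, a check like this needs no lookup, which is one plausible way to keep it cheap at very high call rates.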

John Rauser, Amazon

  • RT @patrickdebois: fun DevOps Cookbook meetup tonight! #velocityconf w/@botchagalupe/@realgenekim: http://t.co/RJB0yegT
  • .@jrauser: "London cholera backdrop: 50% mortality rate; 1831 1st seen; 20K, 50K dead; no germ theory; instead, miasma theory"
  • .@jrauser: "'Correlation != causation' is the defense of tiniest minds" (referring to rebuttals to Dr. John Snow)
  • .@jrauser: "Describing 3rd wave of cholera outbreak"
  • .@jrauser: P--- H-----!!! P--- H------!!! Pick me! Pick me!!
  • RT @TurboDad: "Correlation is not causation" - the defense of tiny minds in the face of compelling evidence.
  • .@jrauser: Awesome... comparing Dr. Snow's geographic plots to plotting latency histograms. Awesome.
  • .@jrauser: "Dense cluster? it's a school; absent cluster? brewery: own well & drank beer, not water"
  • .@jrauser: "Solution: remove Pump Handle, stopping the cholera outbreaks"
  • .@jrauser is now examining
  • .@jrauser: "Digging out the fastest and slowest reqs from your logs will lead you to astounding errors"
  • .@jrauser: "Slowest reqs: n^3 algorithm, cold cache upon restart <-- these are availability risks"
  • .@jrauser: "Fastest reqs: due to runaway robots, broken links, etc..."
  • .@jrauser: "When your 99th percentile changes, you have a new failure mode; very sensitive to chgs"
  • (Brilliant, as always. Wow.) .@jrauser: "Explaining anomalies makes your system bulletproof"
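
The advice above (dig out the fastest and slowest requests, and watch the 99th percentile) can be tried on any list of latencies. A minimal sketch, assuming requests are already parsed into `(latency_ms, path)` tuples; the function names and the nearest-rank percentile definition are choices made for this example, not something from the talk.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * p / 100))
    return ordered[rank - 1]


def extremes(requests, n=3):
    """Return the n fastest and n slowest (latency_ms, path) tuples."""
    ordered = sorted(requests)
    return ordered[:n], ordered[-n:]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [(120, "/a"), (2, "/broken"), (95, "/b"), (9000, "/report"), (110, "/c")]
    fastest, slowest = extremes(sample, n=1)
    print("fastest:", fastest, "slowest:", slowest)
    print("p99:", percentile([ms for ms, _ in sample], 99))
```

Tracking the p99 value over time, rather than the mean, is what makes a new failure mode visible as a change.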

Mike Brittain, Etsy

  • Mike Brittain, Etsy: "Resilient User Experiences"
  • @mikebrittain: examining pages that show a stack trace due to a thrown MySQL exception; custom error page
  • @mikebrittain: "1st principle: upon failures, merely don't render the page component, don't serve the "
  • @mikebrittain: "Etsy sponsored ads: 400ms timeout threshold: we favor fast page response over completeness"
  • @mikebrittain: "We use non-blocking Ajax elements, taking load off client"
  • @mikebrittain: "Google Apps very good at resilient arch & UI; shows retry interval"
  • RT @kerrick/@xthestreams: "wrapping a stack trace with a logo is barely better than showing the stack trace" @mikebrittain
  • @adrianco: Mike from Etsy talking about resilient user interfaces at #velocityconf - similar to circuit breaker pattern http://t.co/xp7jmGSj
  • @jheady: .@mikebrittain Covering user experience, graceful failure of features. Highlight critical and non-critical path features.
  • RT @TurboDad: Hiding failed features that are not CRITICAL to user exp is key pillar in building a resilient user experience.
  • .@mikebrittain: "Takeaway: build this graph: # of failure page views, where customer was shown the exit"
  • @jelder: RT @pingdom: Shoppers shouldn’t be forced to leave the store because a sink is leaking. @mikebrittain on graceful degradation. #VelocityConf
  • @xthestreams: Design for operations = product managers + devs + ops
  • .@allspaw telling story of how news site, on election day, had to turn off logging, in order to save the site. Wow.
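
The two Etsy ideas above (skip a page component when it fails, and favor a fast response over completeness with a 400ms threshold) can be sketched together. This is a rough illustration and not Etsy's code: `render_optional`, the thread pool, and the empty-string fallback are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for non-critical page components (size is arbitrary here).
_pool = ThreadPoolExecutor(max_workers=4)


def render_optional(fetch, timeout_s=0.4):
    """Run fetch() in a worker; return its HTML, or "" if it fails or is too slow."""
    future = _pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already running call is not interrupted
        return ""
    except Exception:
        return ""  # a failed non-critical component is simply not rendered


if __name__ == "__main__":
    page = "<h1>Listing</h1>" + render_optional(lambda: "<div>sponsored ads</div>")
    print(page)
```

The critical path (the listing itself) is rendered directly; only the optional component goes through the timeout wrapper.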

Google talk

  • Google: "Humans are slow: 3s to type URL (even if URL is autocompleted); 15s to select search result <-- can Chrome help?"
  • RT @beinwal: RT @adrianco: My R code for response time histogram plots http://t.co/sgYOlLYd /cc @jrauser
  • Google talk reminds me of @xthestreams: "why not just have Electric Monk do browsing for us?" (Douglas Adams reference)
  • Google: "when user types 'c': very likely cnn.com, so go pre-fetch and pre-render; dramatically faster"
  • Google: "we go ahead and pre-fetch/pre-render top search results, called Instant Pages: again, much faster"
  • Google: "by doing this, across all users, we save 20 years of waiting per day (!!!)"

Bryan McQuade, Google

  • Bryan McQuade, Google: Talking about PageSpeed, which shows how to make your page load faster
  • RT @apistilli: Google demoing pagespeed insights #velocityconf http://t.co/3TkooyiY
  • @chrismunns: Sad to have missed #velocityconf this week, but YOU NEED to see Paul Hammond's slides on startup infrastructure http://t.co/rltrFkwS

  • .@souders: "use of Flash down: 50%->44%; size of av]

Dr Richard Cook: Royal Institute of Technology, Sweden

  • .@allspaw breathlessly introducing Dr Richard Cook: Royal Inst Sweden: on cognitive systems engr, human factors, etc. Yes!
  • .@allspaw recommending famous seminal work "How Complex Systems Fail" to everyone: 25 yrs of systems research
  • Cook: "systems as imagined vs. as found; resilience & resilience engineering; he's a practicing anesthesiologist"
  • Cook: "the surprise is not that there are so many accidents. the surprise is that there are so few"
  • Cook: "why are we having so few? we should be having more than we have had? systems should be crashing all the time"
  • Cook: "surgical operating; emergency rooms; commercial airlines, military cockpits, semiconductor fabs"
  • @dlutzy: No one rings at 3am to say how well the system is running. Richard Cook, MD
  • Cook: "Story: consolidating infusion pump vendors: replaced all in one day; 1 year later... 1220am: 'my pump isn't working'"
  • Cook: "...sudd, 20% of infusion pumps failed, jeopardizing patients' lives; root cause? annual mandatory software upgrade!"
  • Cook: "...1 year after deployment, infusion pumps needed upgrade, and locked up. " (Holy cow. They failed closed.)
  • Cook: "What would have happened if there were no ops people [biomed techs]?" (Holy cow. Answer: all pumps wd fail in 24h)
  • Cook: "Lessons: real world oft surprises; existential threats occur; demand/op-tempo varies"
  • Cook: "Stuff don't work as advertised (you only find out when it fails); novel conditions are common"
  • Cook: "Lots of adapting/tailoring (i.e., voodoo: turning things off/on, trial/error, rebooting); unanticipated variety"
  • Cook: "Coping w/shallow and deep conflicts"
  • Cook: "If you feel like normal world is well-behaved, you're living in a dream world... or not operational people" (Haha)
  • Cook: "...and I want to know what company you work for, because I want to sell all the stock."
  • Cook: "systems as found are dynamic/stochastic; not deterministic or static"
  • Cook: "design/dev/review vs. implementation/operations/maintenance/recovery"
  • Cook: "what goes on in resilient, as-found systems? monitor, respond, adapt, and LEARN"
  • Cook: "stiff boundaries, layers, formalisms, defence in depth, redundant, interference protection, assurance, accountability"
  • Cook: "withstand transients, recover swiftly/smoothly, prioritize to serve high level goals, recognize/respond.."
  • Cook: "Resilience in my field is life/death question; in yours, it's the $64million." (No, try $6.4 trillion.)
  • Cook: "How to be resilient? support continuous maintenance; reveal controls to operators (only ppl who can help at 3am: trust them w/controls)"
  • Cook: "People
  • Cook: "How to be resilient? show the lift points (eg, showing how heavy equip should be picked up and moved; plan for maint)"
  • Cook: "How to be resilient? Support mental simulation (ops must be able to plan, rehearse, mentally grasp)"
  • RT @SubmittedDenied: When resilience is the goal, we need to trust users/operators and reveal the controls for the system.
  • Cook: "How to be resilient? Open objects/methods (you must know what's inside the black box); Deep-six the "don't touch me"
  • Cook: "Resilience agenda: 1) ops is competent to keys to kingdom & systems we build; 2) resilience engr must be 1st priority
  • Cook: "3)
  • Cook: "Goto http://ctlab.org for more info." (Great talk! Wish I had seen this talk 1 year ago! @allspaw)

Twitter

  • Twitter: wanted to overcome monolithic apps, give dev data to make improvements
  • RT @bossydk: Monorail = the monolithic rails application used at twitter
  • RT @thewebvy: "if your developer is sitting there swearing a lot, their job satisfaction level is probably low" aim to minimize wtf moments
  • Twitter: finagle: network substrate:
  • Twitter: new open source tool: zipkin to generate chrome-style call graphs
  • Twitter: new open source tool: iago to generate production load tests
  • me too! RT @jbarciauskas: Just learned twitters new load tool is iago not Lago
  • RT @strangeloopnet: liked Dr. Richard Cook's #velocityconf keynote? great interview w/him in the livestream right now: http://t.co/PG1dP9Jj
  • Twitter: starting 18 months ago, we started to get rid of monolithic monorail by decomposing into svcs
  • Twitter: "we do dark, rolling releases: only way we can gain confidence in releases, despite using iago in pre-prod"
  • Twitter: "outcomes: we can launch massive features in parallel: team org now matches sw stack"
  • Twitter: "switching to JVMs results in significantly better performance; 45% of Twitter on new JVM stack"
  • Twitter: "we're also more reliable; mean time to fix bugs went way down: 12 deploys yesterday" (!!)
  • RT @mtnygard: Twitter: team org matches software stack. Me: Amazing things happen when you apply Conway's Law.
  • (Wish I heard more abt Twitter "chg mgmt" process, to signal/sequence releases that could conflict w/each other. Manual?)
  • RT @aarongreenlee/ @jbarciauskas: Twitter term: "observability stack". How can we make our svcs more consistently observable?
  • @BurtonSays: RT @adrianco: Great lesson from @jrauser at #velocityconf - look at distributions and outliers. I use @AppDynamics to find and drill into our outliers...
  • RT @TurboDad: Sweet - Twitter's Layer 7 Router (TFE) is basically just their open-sourced Finagle tool.
  • Interesting. Not very Lean Startup: Twitter: "Write code for world u want to be in, so u don't have to do 2 passes like us"

James Turnbull: Myths of Rollback

  • Next up: the DevOps philosopher @kartar, talking about rollback: if rollback fails and no one notices, did it really fail?
  • .@kartar: "I'm a former sysadmin; at @puppetlabs for 6 yrs: was first doc guy, release manager, etc. And Australian."
  • I'm always amazed at @kartar's English. It's quite good. :)
  • .@kartar: "Q: who relies on rollback?" (Only a few people raise hands, but incl illustrious @adrianco)
  • .@kartar: "Rollback is total bullsh!t. If you rely on rollback, you're delusional."
  • .@kartar: informal poll: 25% of audience is Dev; 70% is Ops; 5% is QA. (Packed room)
  • .@kartar: "old assumptions: hw was expensive; apps were idempotent; cascade-less failure; resources were expensive"
  • .@kartar: I had to look this up: "idempotent" = "operation can be applied multiple times, and get same result"
  • .@kartar: majored in philosophy and cognitive science
  • .@kartar: "very few of our 'systems' are truly deterministic; Mark Burgess and Alva Couch proved it"
  • RT @adrianco: separate biz logic from state, & concentrate on common case rollback of logic. State rollback is rarely the issue. @kartar
  • .@kartar: despite rollback fallacy, it is used in chg mgmt to justify risky chgs: "I've got a rollback plan"
  • .@kartar: "b/c complex systems are..complex.., rolling back often results in system in worse, unknowable state"
  • .@kartar: "rollbacks are usually done at 4am, when biz is breathing down your neck: human error risk is huge"
  • RT @adrianco: #velocityconf Netflix dev teams push code 10s of times a day and rollback is needed a few times a week.
  • .@kartar: "Wrong tool? Rollback originally designed to rollback erroneous transactions; we're trying to solve avail probs"
  • .@kartar: "Preventive Design: even world best ops (Etsy engrs on speed, simulated of course) can't keep crappy code up"
  • .@kartar: "Increase Resilience: use config mgmt like @puppetlabs to enforce dev = prod"
  • .@kartar: "Operational Intelligence: a little bit of devops in every byte: communication/coordination is key"
  • .@kartar: "In my enterprise IT past, I was like a mushroom, told to deploy random things; opposite of sitting next to Dev"
  • .@kartar: "Getting Dev to see pain & suffering their chgs cause reduces incidents"
  • .@kartar: "Make small, iterative changes: when change is infrequent, strange behaviors emerge: lots of voodoo"
  • .@kartar: More Australian presenters next year. Please.
  • .@kartar: "Find mythologies & scar tissues & scrutinize: 'we can't {run | upgrade | deploy} that because XXX happened'"
  • .@kartar: "Take those mythologies out back and shoot them; measure and get data; they're erroneous rules"
  • Haha. Australians are cute! RT @xthestreams: Poor princess! @kartar we'll fix it in post, maybe they can rollback recording
  • .@kartar: "Rollback is possible, but
  • Rumor: Livestreams of future @kartar talks will be on 5 sec delay. I'll pay extra $$ for uncut version. :)
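
The "idempotent" definition noted above (an operation can be applied multiple times and give the same result) is easy to check in a toy example. The functions below are hypothetical and only illustrate the definition: one deploy-style step that is safe to re-run and one that is not.

```python
def set_replicas(state, count):
    """Idempotent: declares the desired value, so repeating it changes nothing."""
    return {**state, "replicas": count}


def add_replica(state):
    """Not idempotent: every application changes the state again."""
    return {**state, "replicas": state.get("replicas", 0) + 1}


def is_idempotent(op, state):
    """True if applying op twice gives the same state as applying it once."""
    once = op(state)
    return op(once) == once


if __name__ == "__main__":
    start = {"replicas": 1}
    print(is_idempotent(lambda s: set_replicas(s, 3), start))  # declarative step
    print(is_idempotent(add_replica, start))                   # incremental step
```

A re-run or rollback of the incremental step leaves the system somewhere new each time, which is one small-scale version of the "worse, unknowable state" problem described in the talk.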

Yahoo:

  • Evolution of CAPTCHA: "now at limits of difficulty for actual users"
    https://p.twimg.com/AwWiFuRCMAEEXcX.jpg
  • Yahoo: Amazing: emerging CAPTCHA business: 1 guy decoded 1540 entries in 1 hr
  • Yahoo: "Market price for CAPTCHA decode: $0.75/day: even for college grads: best work avail in many countries" (Oof)
  • Yahoo: "Email accts created by Russia; bought in Africa;
  • Love @mtnygard! Here in B1 already. RT @skastel: Going to line up early for Nygard's talk on Stability Patterns here at #velocityconf.
  • Yahoo: wow. showing next gen CAPTCHA, designed to slow down ppl w/CAPTCHA money making businesses; & use of SMS
  • @souders: Awesome! Apple has a booth at #VelocityConf exhibit hall. Would love to see them have more attendees and maybe eventually be a speaker.
  • BTW, I saw 4chan founder talk once. Awesome. Funniest slide was 4chan CAPTCHA spoof of line noise. Wish I could find it

Michael Nygard: Stability Patterns

  • .@mtnygard "has a checkered past: I've been an app dev, architect, web ops; I've seen how code survives over years: not well"

  • .@mtnygard: "We run tightly coupled systems w/non-linear feedback loops; not good"

  • RT @jbarciauskas: economic boundary: making money, safety boundary: not harming, production boundary: producing at maximum capacity

  • .@mtnygard: "Antipattern #1: integrations are #1 risk to stability: every process call can/will kill you; even db calls"

  • .@mtnygard: "Ex: random 5am db hangs: because firewall expiring LIFO db connection" (ok, this story defies 140c limit)

  • RT @jschauma: "Firewalls exist to break TCP/IP in every way possible." @mtnygard

  • .@mtnygard: "Patterns: Circuit Breaker, Use Timeouts, Use Decoupling Middleware, Handshaking, Test Harness"

  • .@mtnygard: "Failure Pattern: Chain reaction: where all front end servers drop in seq; common in app servers, srch"

    • Bulkhead
  • .@mtnygard: "Failure Pattern: Cascading failure: failure moves vertically across tiers: common in enterprises, SOA"

    • Timeout, Circuit Breakers
  • .@mtnygard: "Failure Pattern: Blocked Threads: all req threads blocked; most comm pattern: oft due to homegrown thread libs"

  • RT @iandelahorne: .@mtnygard "any websphere users out there? Yeah, you have hung threads right now"

  • Thank you! RT @xthestreams/@mrembetsy: @RealGeneKim rocking the twitters live from #velocityconf. It's good.

  • Self inflicted Denial of Service: Best Buy XBox 360 promotion: "went down so fast, we thought it was power failure"

  • .@mtnygard: "Defenses:

  • PS: Someone asked how I take notes & tweet at same time: @_flynn wrote http://tweetscriber.com for iPad. It's free!

  • RT @xthestreams: Scaling ratios betw tiers should be tested asymmetrically. Overload back end w/front end requests. #velocityconf @mtnygard

  • .@mtnygard: "Stable Patt #1: Circuit Breaker: most faults req hours to correct (cable knocked loose, app server down)"

  • .@mtnygard: "Circuit Breaker goal: cut off features that aren't working correctly; "cool off" demand"

  • .@mtnygard: "Citing Netflix Nov blog entry: showing state of all circuit breakers -- is valuable dashboard"

  • RT @thewebvy: Failures often happen on a human time scale, Retry timeouts should reflect that. Use seconds, not ms or ns.

  • .@mtnygard: "Stable Patt #2: Bulkheads: save part of the ship, if another part floods: ex: dedicated backends for diff svcs"

  • Fascinating. RT @xthestreams: Queuing theory comes up again and again at

  • .@mtnygard wrote "Release It!" -- a great book: http://www.amazon.com/Release-It-Production-Ready-Pragmatic-Programmers/dp/0978739213

  • @jschauma: When I'm all grown up, I want to give talks like @adamhjk and look sharp in pictures like @markburgess_osl.

  • @secolive: RT @xthestreams: I'm starting to think that technical debt is code people are frightened to touch

  • @DuncNisbet: RT @jrauser: Video of my #VelocityConf keynote http://t.co/e8IloGTB. Thanks folks for indulging my passion for the history of science.

  • @teresadg: RT @laraswanson: "I love outages in a way that's probably not quite right" - @teresadg during her #velocityconf presentation

  • @xthestreams: @realgenekim the photos from your #velocityconf party are on Flickr http://t.co/viHpgQUm also emailed Han
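
The Circuit Breaker pattern from the @mtnygard notes above (cut off a feature that isn't working and let demand "cool off") can be sketched in a few lines. This is a minimal illustration, not code from the talk or from "Release It!": the class name, thresholds, and injectable clock are assumptions, and the cool-off is measured in seconds, in line with the retry-timeout advice in the notes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock        # injectable so the cool-off can be tested
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-off has elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would typically catch `CircuitOpenError` and degrade gracefully, for example by hiding the affected feature, rather than retrying immediately.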