2013/04/03 Mountain West Ruby Conference

by Gene Kim

#mwrc

http://mtnwestrubyconf.org/2013/schedule

Hell Hath Frozen Over: Infosec and DevOps

James Turnbull, @kartar

  • .@kartar: "I hated working in infosec in large enterprise: I wasn't liked, I wasn't effective"
  • .@kartar: "Why infosec into DevOps? They're people, too! Many are former dev, ops, network folks; firewall admin is network ops"
  • .@kartar: "You're going to find resistance to DevOps, and infosec can help"
  • .@kartar: "when servers take 6-8 wks to get b/c forms need to be signed, lots of handoffs, firewall rules; infosec needs to be involved"
  • .@kartar: "Story: dev throws war files over wall on Fri, go out & get drunk, leaving rest of us to pick up pieces"
  • .@kartar: "What's worse than that Monday morning post-mortem? When infosec is involved: 'we discovered sev 1 vuln and we could lose $1MM an hour!'"
  • .@kartar: "Infosec fails when project always gets 4 more wks on Gantt chart due to found defects; destroy blame culture"
  • @aamax: Destroy the blame culture. Blame-free post mortems.
  • @courtneynash: At #mwrc, @kartar making "it's about people" argument re: security and devops. Which gal w/ a psych degree & 2 thumbs agrees? This one.
  • .@kartar: "Controls: provision & deploy: efficiency; Config mgmt: inconsistency is enemy of infosec; Incident mgmt: info is king"
  • .@kartar: "Controls: Audit: magic away auditors: worst question: 'tell me what systems are vuln?' 'damn, there goes my wkend'"
  • .@kartar: "Google manages 30K desktops with Puppet: lots of bad incidents result in audits of laptop upgrades: 6-8 wk exercise"
  • .@kartar: "Google: now generate reports on the fly, can roll out patch upgrades on-demand: this solves huge prob for infosec"
  • .@kartar: "DevOps & Infosec: get roles & resp right: 'build lifecycle right, incl storage, infosec, addressing their pains'"
  • .@kartar: "Get DevOps into the risk and controls register"
  • Haha. Listening to @kartar channel decades-long, industry-wide #infosec angst & fear & insecurity in his DevOps & Infosec talk :)
  • .@kartar: "Put infosec into dev; designed for security == sane and responsible and smooth deployments"
  • .@kartar: "Put Infosec into Ops -- will prevent teams from working issues simultaneously (haha); invite infosec to post-mortems"
  • .@kartar: "Expose Infosec to your metrics and data"
  • @justinbkay: What I'm hearing is that devops consists of tight collaboration of the stake holders and automation of environments
  • .@kartar: "Infosec magic lines: 'I get fired if things go wrong, too' 'I'm here to help'"

Keep Ops Happy, Designing for Production

Jason Roelofs

  • Next up: Keep Ops Happy, Designing for Production, Jason Roelofs
  • "Ops has a hard enough time w/o worrying about Dev; Internet is dangerous place; how do we design for production?"
  • "Documentation is important, mainly so other ppl can learn: what is the app (game, cms, ??), what does it need (db, load balancer)"
  • "Docs: how do I get it running on my machine (if u can't get it running on your dev box, how do we get it running in Prod?)" Hahah
  • "Deploy all the time; deploy at least once per day; it minimizes deploy risks, w/Heroku first dyno free, there's no excuse!"
  • "Configurability of the app: I love configs in post-J2EE era (not huge XML files); db cfg in 1 place (where it is, how to connect)"
  • "Worst place for cfg is in code: like class RedisService; now you need if/then/else for Dev/Test/Prod; now dev might need prod acc
  • "In Ruby, YAML makes configs so easy; no excuse to have configurations in code"
  • "I recommend Figaro to enable doing staging on Heroku, but production on physical server someplace"
  • "I'm not a big fan of Rails.env; site shouldn't care what domain it's running in; use Figaro instead"
  • "Thread safety will become more important"
  • "Every dev needs to understand security risks" (showing rails attr_accessible being deprecated in favor of strong parameters) !!
  • "On passwords: never store even MD5/SHA512 hashes; GPU clusters make these hashes untenable: 20 GPUs make 19h breaks feasible"
  • .@jasonroelofs: "Use bcrypt or scrypt instead of storing plain MD5/SHA password hashes; reduces risk of brute force attacks"
  • @tehviking: Fun talk on dev antipatterns from a devops perspective by @jasonroelofs. Lots of dumb stuff I do in there.
  • Most #infosec people would get tons of value here at #mwrc #devops conference: awesome program!
  • .@jasonroelofs: "To counter majority of infosec threats for ruby apps, use Rack::Protection" (Nice!)
  • .@jasonroelofs describing story of being comment spammed; "now we use rack-attack"
  • .@jasonroelofs: "To Dev: don't store stuff in local file system; no excuse not to use S3; avoid schedule job errors around DST"
  • .@jasonroelofs: "We use Pingdom, @newrelic, instrumental, loggly, papertrail" "fave: Dead Man's Snitch: tell me when jobs didn't run"
  • .@jasonroelofs: "Dead Man's Snitch saved me so many times (when backup jobs didn't run, when scheduled jobs didn't complete)"
  • Wow. The kung fu of all the GitHub folks I've met is amazing....
  • @lusis: Cool - https://t.co/U0vN54xad7 (h/t @jasonroelofs #mwrc)
  • @stefano_zanella: .@jasonroelofs "Don't schedule at 2:00am because it either doesn't exist or it happens twice (DST on/off)"
  • Q: "why is bcrypt more secure than MD5?" A: "bcrypt is a 2^n algorithm (n = cost factor), re-hashing each round: very computationally expensive"
  • "scrypt takes it one step further, being very memory intensive as well, making bruteforce very expensive"
  • @tehviking: Ok, @jasonroelofs’s #mwrc talk was the most pragmatic, immediately applicable talk I can remember attending. Well done sir.
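
The "no configurations in code" advice above can be sketched with nothing but Ruby's standard YAML library. This is an illustration under assumptions, not the speaker's code and not Figaro's API: the `CONFIG_YAML` contents, the `load_settings` helper, the `APP_ENV` variable name, and the host names are all hypothetical, and the `RedisService` class only echoes the name used in the talk.

```ruby
require "yaml"

# Hypothetical per-environment settings. In a real app this text would live
# in a file such as config/redis.yml (an assumed path), not in the source.
CONFIG_YAML = <<~YAML
  development:
    redis_url: redis://localhost:6379/0
  production:
    redis_url: redis://redis.internal.example:6379/0
YAML

# Returns the settings hash for one environment, raising on unknown names
# so a typo fails loudly at boot instead of silently picking a default.
def load_settings(yaml_text, env)
  all = YAML.safe_load(yaml_text)
  all.fetch(env) { raise KeyError, "no settings for environment #{env.inspect}" }
end

# The environment name comes from outside the code (APP_ENV is an assumed
# variable name), so the class below needs no if/else per environment.
settings = load_settings(CONFIG_YAML, ENV.fetch("APP_ENV", "development"))

class RedisService
  attr_reader :url

  def initialize(url)
    @url = url
  end
end

service = RedisService.new(settings.fetch("redis_url"))
puts service.url
```

The service only ever sees a URL, so developers never need production credentials in their checkout; swapping environments is a matter of changing what is outside the code.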

Why Everyone Needs DevOps Now

Gene Kim, @realgenekim

  • @RealGeneKim: Hello #mwrc! Thx for having me here! My book “The Phoenix Project: A DevOps Novel” is free on Kindle until midnight! http://t.co/27cEe9PYpx
  • @mr_ndrsn: Hot diggity! Just hopped on the #mwrc IRC and livestream!
  • @courtneynash: So @RealGeneKim just asked for mic and cameras to be turned off in response to Deming question so @botchagalupe doesn't kill him.
  • @jason_madsen: #mwrc is off to a great start with @RealGeneKim. Great talk.
  • @djdarkbeat: It's better to practice 15 minutes a day than 3 hours a week
  • @mwrc: RT @MotoWilliams: Watching #MWRC live - http://t.co/3gOuiqZ55r
  • @mkocher: "create a single repo for code and environment" - @RealGeneKim
  • @newrelic: RT @darinrs: Jealous of the happenings at @mwrc but still in Portland? Come on over to @newrelic for live streaming of the conference!
  • @gibsonaaron: Gene Kim, this dude is legit! #mwrc http://t.co/RV7ekfjsvM

  • Haha! I love Deming! :) @courtneynash: So @RealGeneKim just asked for mic & cameras to be turned off in response to Deming question so @botchagalupe doesn't kill him.

Boxen: How to Manage an Army of Laptops and Live to Talk About It

Will Farrington

  • @gibsonaaron: @gibsonaaron oh Boxen, how I have needed you...
  • @aamax: #mwrc boxen looks great... <3
  • @dbrady: Decided to "Think Different" and brought my linux laptop to #mwrc. And now I can't get on the wireless. Reduced to tweeting from my phone.
  • @blowmage: 5 minutes in and I <3 @wfarr’s #mwrc session on boxen SO MUCH. Srsly, *major* bro-crush in process.
  • @auxesis: Psyching myself up for my talk in an hour on AF447 and building operable systems usable in high stress incident response.
  • @gibsonaaron: Bring the Boxen! #mwrc http://t.co/njRyH8GArv
  • @digicazter: Eating some great food while in SLC attending #mwrc http://t.co/ROdrILTSOr
  • @mkocher: Heading back to #mwrc to learn about boxen. Config management for workstations? That could never work. https://t.co/zcJLMGvtOH
  • @PDeffendol: How much IT debt have we accumulated? Interesting #mwrc topic to take back home.
  • @jessereynolds: Going to try Cher's Delicatessen @auxesis @kartar anyone else interested? Walking there shortly
  • @stefano_zanella: This talk by @jasonroelofs was SO intense! So many tips in so little time! Need to recheck the recording off-line soon ;)

Zero Down Time Operations: How to Get At Least 3 9's of Uptime

Taylor Weibley

  • @davidrueck: Watching (well, mostly listening to) the live stream of Mountain West Ruby Conf while I work today.
  • @qrush: Watching @themcgruff kill it at #mwrc! http://t.co/JTDvp1Gc1h
  • @dpnl87: Watching the live stream of #mwrc.. Some really good talks!
  • @jtimberman: “Does anyone know how spanning tree works, at the low level?” “No. No one knows how spanning tree works.”
  • @MotoWilliams: In case you forgot what uptime is » #MWRC http://t.co/pxskSo9Ox7
  • @padwasabimasala: don't miss the devops talk by @RealGeneKim at
  • @jtimberman: So @wfarr mentioned why not chef in his boxen talk at #mwrc. If you’re interested in such a thing with chef find me, i’m speaking today too.
  • @DexterTheDragon: RT @mwrc: “Ignore the haters; build cool software” - Wow, great talk by @
  • @MattRogish: RT @blowmage: Another call for blameless postmortems at #mwrc, this time by @themcgruff. This is important you guys.
  • @MotoWilliams: "Would you spend 30k on hardware for 10% more uptime? Do that same on hires" {paraphrased} #MWRC
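
For reference, the "3 9's" in the talk title translates into a concrete downtime budget. The arithmetic below is a generic sketch, not something shown in the talk; the helper name `allowed_downtime_seconds` and the choice of a 365-day year and 30-day month are assumptions. Three nines (99.9%) leaves roughly 8.76 hours of downtime per year.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# availability_pct: target such as 99.9 for "three nines"
# days: length of the period being budgeted (365 for a non-leap year)
def allowed_downtime_seconds(availability_pct, days)
  total = days * SECONDS_PER_DAY
  total * (100.0 - availability_pct) / 100.0
end

# Print the yearly and 30-day downtime budget for two, three and four nines.
[99.0, 99.9, 99.99].each do |target|
  per_year = allowed_downtime_seconds(target, 365)
  per_month = allowed_downtime_seconds(target, 30)
  puts format("%.2f%% -> %.2f h/year, %.1f min per 30 days",
              target, per_year / 3600.0, per_month / 60.0)
end
```

Each extra nine cuts the budget by a factor of ten, which is one way to frame the hardware-versus-hires trade-off quoted above.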

Escalating complexity: DevOps learnings from Air France 447

Lindsay Holmwood, @auxesis

  • Next up: Escalating complexity: DevOps learnings from Air France 447 by Lindsay Holmwood
  • "I'm an engineering manager; contributed to cucumber-nagios, visage, flapjack"
  • Whoa. We just read what I think is the last 10 min of cockpit transcript of Air France 447. Chilling. @brandenwilliams
  • Thesis: was this airline crash really due to human error and poor training?
  • Poorly trained? Capt Dubois: 10,988 flying hours, 1.7K as captain; co-pilot Robert: 5K hours; co-pilot Bonin: 3K hours
  • French FAA: "a different crew should have performed the same actions"
  • "actors in a complex system", local rationality, limited data
  • @jasonrclark: RT @DexterTheDragon: What you call "root cause" is simply the place you stop looking any further -- Sidney Dekker
  • "modes of operations: flight controls; normal law (ground, flight, flare); alternate law (alternate law 1 and 2)"
  • "Law of reconfiguration feedback": what is feedback that pilots receive to indicate flight computer has switched modes of ops?
  • @katzmandu: They made a very basic error, kept pulling back instead of letting the nose down slightly to gain speed.
  • Oh crap. "when Bonin says, 'we've been nose up for awhile', only then does captain realize they've been stalling the entire time"
  • @katzmandu: @RealGeneKim You really need @KarlenePetitt in on #mwrc as she flies Airbusses for a living.
  • "Comic Sans is one of the most readable typefaces" (everyone groans)
  • @aamax: #mwrc redesign all your sites with comic sans because it's easy to interpret. #justkidding
  • "Startle effect: in crisis situations, people tend to shut down, defeating even best training (to verbalize their thinking during acting)"
  • "How does the business know what is happening?" (Citing Tumblr 18 hour outage w/little or no info from business to customers)
  • "Simple mitigations: pair with someone, vocalize your internal monologue, have coordinator (enables retrospectives)"
  • @mwrc: RT @lusis: I love that #mwrc is getting a taste of @auxesis amazing presentation skills. So much my hero.
  • "human pattern recognition is actually pretty good if we have enough adaptive capacity (enough spare brain cycles)"
  • "pilots were overwhelmed by feedback: phenomenon of alert fatigue degrades cognition"
  • "countermeasure: manual silence: silence all alerts to free up cognitive cycles, regain situational awareness"
  • (!!!) "failure is just another mode of operation; we just need to design systems to expect it"
  • "operable systems = man + machine"
  • Sully: "if you're looking at human factors alone, you're missing half to two-thirds of the equation"
  • @sarahmei: Mitigating a high-stress situation: pair, and vocalize your intentions. That works pretty well in lower-stress situations too.
  • .@auxesis: Book recommendations: "Drift Into Failure," "Just Culture," "The Field Guide to Understanding Human Error"
  • @littleidea: @courtneynash @RealGeneKim @miah_ build systems that have failure modes that are straight forward to detect and remediate

Migrating a live site across the country without downtime

Drew Blas

  • @TriangleDevOps: RT @moklett: .@chargify's own @drewblas up next at #mwrc - Migrating A Live Site Across the Country Without Downtime - watch live! http://t.co/BaG5TqwMq1
  • .@drewblas: made a very compelling pitch for EC2 and how they can be reliable partners now
  • .@drewblas: "our testing; complete copy of every system in the live site; exercise everything; generate fake traffic"
  • .@drewblas: "when we test entire stack to ensure that our site is doing what it's supposed to be doing, we can test our site copies"
  • .@drewblas: "we call these process tests: more than integration tests: test user creation, email generation, etc. cron jobs running"
  • .@drewblas: "exercising email, queues, hitting all the pieces; then generate fake traffic, even testing with 3x load"
  • .@drewblas: "The Plan: introduce Chef into data center to automate, enable reproducibility; it was hard: managed by hand before"
  • .@drewblas: "chef project was challenging: like having app w/no testing and trying to get to 100% coverage; 1 pkg/cookbook at time"
  • .@drewblas: "helps to think from top down"
  • .@drewblas: "in theory, everything is redundant, so you run both data centers at once, then just turn off the old stuff"
  • .@drewblas: "we love Amazon Virtual Private Cloud"
  • .@drewblas: "we replaced old web servers w/HAProxy: transparent forwarding of packets"
  • .@drewblas: "migrating redis/resque: redis failover is hard; potential to write to both servers at once"
  • .@drewblas: "solution: embrace the beauty of the queue: don't slave across two DC: two masters instead: put messages into both queues"
  • .@drewblas: "Everyone wonders how we did db migration, reroute traffic: didn't go master/master; we bought Continuent Tungsten"
  • .@drewblas: "Continuent Tungsten: it took over replication, cluster mgr, understands SQL, holds queries during switchover"
  • @blowmage: And @yukihiro_matz is in the house! #mwrc http://t.co/yetuMtZVX8
  • @mwrc: RT @jeremywrowe: if you haven’t been you should be watching @drewblas live at #mwrc http://t.co/F9B0sHKLa4
  • .@drewblas: "Before the big switchover day: lower DNS TTL to 5 min, and test, test, test"
  • .@drewblas: "How do we test? We flip over from site to site, run test data, and each time, there's something that went wrong"
  • .@drewblas: "Finally, one day, we flipped sites, and nothing went wrong; now we have some assurance that we're ready"
  • .@drewblas: "The Big Day: stop all non-essential processes, switch master DB to new DC, switch on HAProxy in old DC, switch DNS"
  • .@drewblas: "During switchover when traffic tunneled to new DC, added 300ms: then turned on HAProxy; mitigated db switchover risk"
  • Wow. Am I only one blown away by all the great thinking and planning in @drewblas's talk?
  • .@drewblas: "What's next? So many payoffs showing that infrastructure was 'provably functional'. Now we maintain hot copy"
  • .@drewblas: "What would we do diff? more elaborate load testing; sync configs w/zookeeper or doozer (to help w/redis)"
  • .@drewblas: "advice: think thru every step; doc everything; test everything"
  • .@drewblas: "We switched over on Sunday morning. And on 1st of month, which was lower transaction period"
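
One possible reading of "two masters instead: put messages into both queues" is sketched below. It is an assumption-laden illustration, not Chargify's implementation: `MemoryQueue` stands in for a Redis/Resque queue in each data center, and `DualWriter` and `IdempotentWorker` are hypothetical names. The talk does not say how duplicate messages were handled; de-duplicating by message id is an assumption made here so the example is complete.

```ruby
# Stand-in for a queue living in one data center (a real system would use
# Redis/Resque; this in-memory class is purely illustrative).
class MemoryQueue
  attr_reader :name

  def initialize(name)
    @name = name
    @items = []
  end

  def push(message)
    @items << message
  end

  def pop
    @items.shift
  end

  def size
    @items.size
  end
end

# Writes every message to all queues instead of replicating one into the other.
class DualWriter
  def initialize(*queues)
    @queues = queues
  end

  def enqueue(id, payload)
    message = { id: id, payload: payload }
    @queues.each { |queue| queue.push(message) }
    message
  end
end

# Both data centers see every message, so processing must be idempotent:
# this worker remembers processed ids and skips repeats. (Sharing one
# in-memory set across data centers is a simplification.)
class IdempotentWorker
  attr_reader :handled

  def initialize
    @seen = {}
    @handled = []
  end

  def drain(queue)
    while (message = queue.pop)
      next if @seen[message[:id]]

      @seen[message[:id]] = true
      @handled << message[:payload]
    end
  end
end

old_dc = MemoryQueue.new("old-dc")
new_dc = MemoryQueue.new("new-dc")
writer = DualWriter.new(old_dc, new_dc)
writer.enqueue(1, "send invoice email")
writer.enqueue(2, "charge subscription")

worker = IdempotentWorker.new
worker.drain(old_dc)
worker.drain(new_dc) # duplicates from the second queue are skipped
puts worker.handled.inspect
```

The appeal of this shape is that neither queue depends on the other being reachable, so either data center can be switched off without a failover step.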

The Many Ways Of Continuous Deployment

Paul Biggar

  • @SuperSerch: about to talk about Continuous deployment on http://t.co/jWVDBVJo0J
  • .@PaulBiggar: co-founded @circleci which does hosted continuous integration for ruby apps: 42% are doing continuous deployment
  • @blowmage: Excited to hear @paulbiggar talk about CI in general and @circleci in particular.
  • .@PaulBiggar: "most of this talk about deployments: if you only deploy monthly, very few probs; probs emerge at 10-40 deploys/day"
  • .@PaulBiggar: "The Old Days: PHP deploys via FTP to single svr: it was ok for user requests to come in during the file copy" haha
  • .@PaulBiggar: "Every deployment had this problem: the race: no maintenance windows or planned downtime allowed anymore"
  • .@PaulBiggar: "Old Days: the solution was using symlinks: shut service down and switch symlinks and bring svc back up again"
  • .@PaulBiggar: "Modern World: we run on Heroku: in many ways, same pattern: copy new code to new svr, switchover after copy complete"
  • .@PaulBiggar: "Modern World: PostgreSQL: during schema change, need app that understands old and new versions"
  • .@PaulBiggar: "What's really hard? Data migrations; schema chgs easy compared to data" (citing IMVU & continuous delivery)
  • .@PaulBiggar: "IMVU: database chgs often resulted in locked table; decided to never do schema chgs; instead, created new tables"
  • .@PaulBiggar: "They'd create UserTable38, and then code would be responsible for searching Tbl38, then Tbl37, etc. Migrate rows"
  • .@PaulBiggar: "Facebook does it the same way: background processes responsible for migrating rows between tables, takes months"
  • .@PaulBiggar: "To mitigate long migrations, Facebook has lots of if/then stmts to search multiple tbls and data formats"
  • .@PaulBiggar: "Patterns of deploys: 1) ship an image (Heroku)" (uhh, what were the others?)
  • .@PaulBiggar: "We have continuous integration builds:
  • .@PaulBiggar: "Peak developer push time: 4pm, near end of day" (Hah! Nice.)
  • .@PaulBiggar: "We push new builds about every half day, b/c we have customers around the globe doing pushes"
  • .@PaulBiggar: "Twitter has system called 'murder' (pack of crows? spelled correctly?): 'things got ugly after we got past 100s svrs'"
  • .@PaulBiggar: "Twitter deploys took 40 minutes: using BitTorrent, now down to 12 seconds"
  • .@PaulBiggar: "How do we handle testing now? GitHub ensures all code is tested among small set of users before all users"
  • .@PaulBiggar: "Facebook deploys to entire cluster, but feature disabled except to certain user demographics" (by feature flags)
  • .@PaulBiggar: "IMVU had automated rollback, based on biz metrics: example: white buy button on white bkground"
  • @jasonroelofs: Twitter uses BitTorrent to deploy. That’s cool. @paulbiggar
  • @RealGeneKim: RT @DanHushon: “@RealGeneKim: .@SuperSerch: "long migrations, … lots of if/then stmts to search multiple tbls and data formats #mwrc” < CACHE MISS
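
The versioned-table pattern described above (look in Tbl38, then Tbl37, and migrate rows as you go) can be sketched in memory as follows. This is a rough illustration, not IMVU's or Facebook's code: `VersionedStore`, the row shapes, and the `upgrade` rule are hypothetical, and rows are migrated on read here for brevity, whereas the talk describes background processes doing the migration.

```ruby
# In-memory stand-in for a set of versioned tables (UserTable37, UserTable38).
class VersionedStore
  def initialize(versions)
    @order = versions.sort.reverse # newest version first, e.g. [38, 37]
    @tables = {}
    @order.each { |version| @tables[version] = {} }
  end

  def insert(version, id, row)
    @tables.fetch(version)[id] = row
  end

  # Looks in the newest table first, then falls back to older ones.
  # A row found in an old table is upgraded by the given block and
  # moved into the newest table, so the next read hits immediately.
  def find(id)
    newest = @order.first
    @order.each do |version|
      row = @tables[version][id]
      next unless row
      return row if version == newest

      upgraded = yield(row, version)
      @tables[newest][id] = upgraded
      @tables[version].delete(id)
      return upgraded
    end
    nil
  end

  def count(version)
    @tables.fetch(version).size
  end
end

store = VersionedStore.new([37, 38])
store.insert(37, 1, { name: "Ada Lovelace" })            # old shape: one name field
store.insert(38, 2, { first: "Grace", last: "Hopper" })  # new shape: split name

# Hypothetical upgrade rule from the v37 row shape to the v38 row shape.
upgrade = lambda do |row, _from_version|
  first, last = row[:name].split(" ", 2)
  { first: first, last: last }
end

puts store.find(1, &upgrade).inspect
puts store.find(2, &upgrade).inspect
puts store.count(37)
```

No table is ever altered in place, which is the point of the pattern: the cost moves from a locking schema change to extra lookup logic in the application, the "lots of if/then stmts" mentioned above.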

TDDing tmux

Seth Vargo

  • Next up: TDDing tmux: Seth Vargo (teaching Ops how to Test Like Devs) (haha! Yes!)
  • "Like any good project, it starts on github" (haha) "here's a chef repo"
  • "You ops guys have a lot to learn from all the rubyist devs in the crowd, who test daily"
  • "TDD components to test infrastructure as code: ChefSpec, Fauxhai (like ohai, haha) to enable lots of platform testing..."
  • "other TDD elements: foodcritic, strainer, guard; steps: 1) create cookbook"
  • "tmux is cool; enables sharing terminal sessions, coding together." (OMG. Like Google Docs for coding. Was looking for !!!)
  • @courtneynash: Hah, shout out at #mwrc to @sascha_d on the live feed. Ohhai!
  • @ginnyhendry: OH spec_helper.rb introduces a single point of failure but in a good way
  • @auxesis: Really enjoying @sethvargo's talk at #mwrc on doing TDD with chef. The chef community have some awesome tools for testing
  • @jasonroelofs: Foodcritic is a linter for chef cookbooks. Definitely need this! @sethvargo
  • Am totally fascinated by the full integration of environment create/test/deploy into CI/CD flow. Nice work, @sethvargo!

You Should Be On Call, Too

Joshua Timberman

  • Next up: You Should Be On Call, too: Joshua Timberman
  • .@jtimberman: "Who am I? I work for Opscode; community manager; but I was sysadmin for most of my career; & I'm on call for cookbooks"
  • .@jtimberman: "Who's on call for production outages? (30%, even though predominantly devs)"
  • .@jtimberman: "Ops needs your help: Ops needs operable code"
  • .@jtimberman: "in old role, I did only sysadmin in support for IT services: we were siloed by design b/c of [something retarded]" haha
  • .@jtimberman: "Working for Arch Wireless, I had pager, meaning primary point of contact: escalations to cust manager, mgrs, etc"
  • .@jtimberman: "Sysadmins were responsible for OS, host-based firewalls, infosec: not schemas (dbas), configs (devs)"
  • .@jtimberman: "Typical prob: help desk gets 'disk full'; wakes up sysadmin at 2am; but doesn't know what huge log file safe to rm!"
  • .@jtimberman: "So then we page app support, sometimes the customers: this happened all the time, b/c we didn't know how to resolve"
  • .@jtimberman: "Repeat for CPU, memory utilization: alert fatigue: app down? 'I don't know what I'm doing'" (haha. yes)
  • @mwrc: RT @phinze: “cap deploy in a for loop - that’s continuous deployment, right?” - @jtimberman at
  • .@jtimberman: "You build it, you run it." -- Werner Vogels, Amazon
  • .@jtimberman: "interviewed: opscode, heroku, etsy, sourcefire, customink: all have devs on call"
  • .@jtimberman: "Most companies have more developers than sysadmins: devs have app domain knowledge; ppl like accountability"
  • .@jtimberman: "Nagios + pagerduty are popular; ops/sysadmin pair teams work well; encourages collaboration; team bldg + know xfer"
  • @gibsonaaron: Hugops!
  • .@jtimberman: "Advice: instrument your app: metrics are awesome; check out coda hale's metrics library; #monitorama"
  • (I love the Etsy philosophy; make it so easy to generate metrics/logs that if anyone writes code but doesn't log, they're shamed :)
  • .@jtimberman: "Hosted Chef is an SOA; private-chef-ctl written by devs to help operate hosted chef; allows dev to self service"
  • .@jtimberman: "hosted chef is easier for dev & ops to manage; private chef is easier for our customers & ops to manage"
  • .@jtimberman: "make apps operable to help you, future you, 3am you" (haha. nice)
  • @darkhelmetlive: If you’re using @newrelic custom metrics are stupid easy too.
  • @dbrady: At #mwrc watching @jtimberman make a really good case for devs being on call which is a shame because I am so very very lazy.
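
The "instrument your app" advice above can be made concrete with a very small in-process registry. This is a sketch under assumptions, not Coda Hale's metrics library or any vendor's API: the `Metrics` class, its method names, the `handle_signup` function, and the metric names are all hypothetical.

```ruby
# A tiny in-process metrics registry: counters and timers only.
class Metrics
  def initialize
    @counters = Hash.new(0)
    @timings = Hash.new { |hash, key| hash[key] = [] }
  end

  # Adds `by` to the named counter.
  def increment(name, by = 1)
    @counters[name] += by
  end

  # Runs the block, records its wall-clock duration in seconds,
  # and returns the block's own result.
  def time(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @timings[name] << Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  def count(name)
    @counters[name]
  end

  # Summary an operator could log or ship to a dashboard.
  def summary(name)
    samples = @timings[name]
    return { calls: 0 } if samples.empty?

    { calls: samples.size, max: samples.max, mean: samples.sum / samples.size }
  end
end

METRICS = Metrics.new

# Hypothetical request handler instrumented with one timer and two counters.
def handle_signup(email)
  METRICS.time("signup.duration") do
    METRICS.increment("signup.attempts")
    valid = email.include?("@")
    METRICS.increment("signup.failures") unless valid
    valid
  end
end

handle_signup("dev@example.com")
handle_signup("not-an-email")
puts METRICS.count("signup.attempts")
puts METRICS.count("signup.failures")
puts METRICS.summary("signup.duration")[:calls]
```

Keeping the call sites this short is the design goal: if emitting a metric costs one line, the person on call at 3am is far more likely to find the number they need.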

ChatOps at GitHub

Jesse Newland

  • Next up: Jesse Newland: ChatOps at GitHub! (A really astonishing, jaw dropping talk by @jnewland)
  • .@jnewland: "chatops: hubot started as campfire; but now is primary control interface for our infrastructure"
  • .@jnewland: "Ex: when we make change to github; we do checkout, commit and pull request; then 'hubot deploy puppet/git-gh13 to prod'"
  • .@jnewland: "hubot then does noop, pushes to 'fs1'; then graph metrics to confirm correct functioning; then merge pull request"
  • .@jnewland: "ci runs; ensures master is green; the master branch is deployed into production; then pushes to all production"
  • @nathenharvey: "chatops: when we give root to campfire" #mwrc (hat tip to @jschauma)
  • .@jnewland: "that's #chatops; all of ops is driven in a Campfire chat room; drives shell (heaps of bash scripts), heaven, janky"
  • .@jnewland: "why is chat bot so important to ops? everyone sees what's happening; everyone sees this even on first day on job"
  • .@jnewland: "hubot does smoke test of any branch; get graphs of performance; check resque queue; connections to frontend svrs"
  • .@jnewland: "...take svrs out of load balancers, find who's on call (to apologize to them when we break things) (hahaha)"
  • .@jnewland: "...put svr back into rotation, update status site, check out suspicious IP addr of possible spammer.. all in chat room"
  • .@jnewland: "ChatOps put tools in the middle of the conversation: everyone is pairing, all of the time" (wow)
  • .@jnewland: "Enables teaching by doing"
  • .@jnewland: "ChatOps created by @rtomayko"
  • .@jnewland: "Our main way of training is just by doing" (like the Facebook onboarding process: everyone does a code push)
  • .@jnewland: "I haven't asked recently: how's deploy going? who should deploy? who's responding to nagios alert? did deploy finish?"
  • .@jnewland: "Github doesn't have cubicles: on given day, 56% of people not in office; ops team is entirely remote"
  • .@jnewland: "Means we can hire anywhere: we can hire the best of the best"
  • .@jnewland: "ChatOps is critical during tactical response: fixes problem of someone saying 'hang on, I'll fix it' & goes dark"
  • @courtneynash: Every time @jnewland gets to the "teaching by doing" part of his preso, I always get the same happy feeling. I <3 Hubot.
  • .@jnewland: "I'd never ask a friend to use the Nagios web interface; hubot is so much better"
  • .@jnewland: "best feature: take on-duty pager duty for 60 min, enables helping a teammate take a shower; culture of sharing" (wow)
  • @phinze: “asking someone to use the nagios web interface is not something i’d ask of my friends” - @jnewland at
  • .@jnewland: "That's ChatOps at its finest: making it easier to be on-call"
  • Q: "do you have any concerns about security?" A: "Yes" (laughter) "u must be ACLed into chatbot"
  • Q: "what chat systems does hubot run on?" A: "XMPP, irc, hipchat, maybe more"
  • Q: "do you use hubot to deploy hubot? what if hubot goes down?" A: "Yes, we do. If hubot goes down, we rely on logs"
  • Q: @courtneynash asks: "does it really take you 1h to take shower?" haha. A: "Yes. It's my awesome hair"
  • @RealGeneKim: HAHAH! “@robynbergeron: @RealGeneKim @jnewland It was slightly awkward when I skimmed it. "help my teammate take shower? yarrrrrgh"”
  • This is 2nd time I've heard @jnewland's talk. Still blows my mind. It's like he bestows appropriate honor/gravity to 'pager duty'
  • @mkocher: ChatOps at github has me itching to go look at @cloudfoundry's Hubot, love the idea of teaching via observation.
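
The core ChatOps idea above, commands typed into a shared room, gated by an ACL, with everything visible to everyone, can be sketched in a few lines. This is explicitly not Hubot and not GitHub's tooling: the `ChatBot` class, its `hear`/`receive` methods, the user names, and the deploy reply are hypothetical, and a real handler would call deploy tooling instead of returning a string.

```ruby
# Minimal chat-command router, for illustration only.
class ChatBot
  Command = Struct.new(:pattern, :handler)

  attr_reader :log

  def initialize(allowed_users)
    @allowed = allowed_users
    @commands = []
    @log = [] # every request and reply is visible to the whole room
  end

  # Registers a handler for messages matching `pattern`.
  def hear(pattern, &handler)
    @commands << Command.new(pattern, handler)
  end

  # Handles one chat line; returns the bot's reply (also appended to the log).
  def receive(user, text)
    @log << "#{user}: #{text}"
    reply = respond(user, text)
    @log << "bot: #{reply}"
    reply
  end

  private

  def respond(user, text)
    return "sorry #{user}, you are not on the ACL" unless @allowed.include?(user)

    @commands.each do |command|
      match = command.pattern.match(text)
      return command.handler.call(match) if match
    end
    "unknown command"
  end
end

bot = ChatBot.new(["alice"])

# Hypothetical deploy command modeled on the example quoted in the talk.
bot.hear(/\Adeploy (?<app>\S+) to (?<env>\S+)\z/) do |m|
  "deploying #{m[:app]} to #{m[:env]}"
end

puts bot.receive("alice", "deploy puppet/git-gh13 to prod")
puts bot.receive("mallory", "deploy puppet/git-gh13 to prod")
puts bot.log
```

Because requests and replies land in the same log, a new hire reading the room sees both what was asked and what happened, which is the "teaching by doing" effect described in the talk.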

Miscellaneous

  • Grt talk!!! RT @MattRogish/@blowmage: Another call for blameless postmortems at #mwrc, this time by @themcgruff. This is important you guys.

  • @RealGeneKim: Awesome! The Github crew signing my book at #mwrc. :) @wfarr @jnewland @rodjek http://t.co/UnorGMTW2S
  • Next up! @chargify's own @drewblas up - Migrating A Live Site Across the Country Without Downtime - watch live! http://t.co/BaG5TqwMq1

  • Jeez. This is one cool conference. RT @blowmage: And @yukihiro_matz is in the house! #mwrc http://t.co/yetuMtZVX8

  • I understand Facebook does too, now. 1GB HipHop tarball RT @jasonroelofs: Twitter uses BitTorrent to deploy. That’s cool. @paulbiggar

  • Hi, Sascha, wherever you are. :) RT @courtneynash: Hah, shout out at #mwrc to @sascha_d on the live feed. Ohhai!

  • Haha. RT @ginnyhendry: OH spec_helper.rb introduces a single point of failure but in a good way

  • @RealGeneKim: Wow! Thanks! You rock! cc @hertling

  • @DexterTheDragon: @RealGeneKim check out https://t.co/44cAoQr4y3 for easy pair programming

  • HugOps indeed “@robynbergeron: @RealGeneKim @jnewland