- Alissa Irei, Senior Writer
If you've so much as dabbled in programming, you've probably spent some time on Stack Overflow, a question-and-answer community forum for professional and amateur coders. Network managers may also be familiar with Stack Overflow's sister site, Server Fault, or any of the other 160-plus Stack Exchange web properties.
With so many sites to maintain and a technically savvy user base with little patience for downtime, the members of Stack Overflow's Site Reliability Engineering Team have plenty to keep them busy. That's why -- in the interest of minimizing time spent babysitting virtual machine (VM) and hardware installations -- they are developing a custom application to automate VM deployment. In this edition of The Subnet, site reliability engineer George Beech explains why his team decided to take a DIY approach to automation, and also shares how he helped keep a Manhattan data center running during Hurricane Sandy.
NOTE: This interview has been lightly edited for length and clarity.
Tell me about your current role at Stack Overflow.
George Beech: We're a team of six to eight people -- some are Linux, networking or Windows specialists, and a couple are generalists. We're responsible for making sure that the Stack Overflow websites are available, reliable and fast because we're helping millions of developers a day. We do have outages every now and then, and Twitter almost lets us know faster than our monitoring systems. We know very quickly if we're having issues -- which is nice because it means people really like our service and want us to fix it. But it's bad because you're in the spotlight. We're like, 'OK, we've been down for 20 minutes. Why aren't we up yet?'
What's it like serving as a resource for other IT professionals?
Beech: It's really gratifying to be able to see your work help other people. You see all of these great applications that are coming out, and you know at some point those teams have come on our site and gotten help for something. Because that's just the way it is: It's very rare to talk to a developer that doesn't know Stack Overflow.
Years ago, when we first started, not everybody knew the name of the website. If you showed them the logo, they'd say, 'I've been on that website. It's helped me immensely. I didn't remember the name of it.' But it's pretty much a household name in the development community these days, and it is very gratifying to be working on a product like that.
Tell me about your efforts to automate the VM deployment process.
Beech: At Stack Overflow, we don't host in the cloud -- we own our own data centers; it's really colocation. We need to improve how we deploy our new hardware and VMs, so we're writing a custom automation application. We looked at a lot of stuff that worked really well for a Linux VM deployment or really well for a Windows VM deployment, but didn't do both very well. The tool we're writing kind of ties together a bunch of existing automation in an easy way for our engineers, so that they don't have to go and spell things out -- we're kind of automating the automation.
On the Linux side we use CentOS, which uses Kickstart files, and on the Windows side we use Microsoft Deployment Toolkit to script our Windows installations. So we're building a tool where our engineers can go in and say, 'I want a VM of X size and Y CPU,' specify all that, say, 'I want this OS on it,' click a button, and it installs. On the physical side, it's more, 'Hey, this is a new machine. Here's its serial number; go and install Windows.' So we kind of reduce the high-touch installations in our environment.
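To make the idea of "automating the automation" concrete, here is a minimal sketch of how a single request might be routed to the right existing installer. It is not Stack Overflow's actual tool: the `VMRequest` class, the `plan_deployment` function, the `INSTALLERS` table and all field names are hypothetical, and the sketch only produces a plan rather than calling Kickstart or Microsoft Deployment Toolkit.

```python
from dataclasses import dataclass

# Hypothetical mapping from OS family to the existing installer automation.
# The interview names Kickstart (CentOS) and Microsoft Deployment Toolkit
# (Windows); the keys and values here are illustrative labels only.
INSTALLERS = {
    "centos": "kickstart",
    "windows": "mdt",
}


@dataclass(frozen=True)
class VMRequest:
    """What an engineer would specify before clicking the button (assumed fields)."""
    name: str
    cpus: int
    memory_gb: int
    os_family: str


def plan_deployment(request: VMRequest) -> dict:
    """Validate a request and pick the installer that should handle it."""
    os_family = request.os_family.lower()
    if os_family not in INSTALLERS:
        raise ValueError(f"unsupported OS family: {request.os_family}")
    if request.cpus < 1 or request.memory_gb < 1:
        raise ValueError("cpus and memory_gb must be positive")
    return {
        "name": request.name,
        "cpus": request.cpus,
        "memory_gb": request.memory_gb,
        "installer": INSTALLERS[os_family],
    }


if __name__ == "__main__":
    print(plan_deployment(VMRequest("web-01", 4, 16, "CentOS")))
```

The point of the single entry point is that an engineer states the size and OS once, and the tool decides which underlying automation to hand the job to, which is the gap Beech describes in tools that handle only Linux or only Windows well.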
What's the driving motivation behind automating VM deployment?
Beech: It's a highly skilled team, and nobody likes sitting there and babysitting installs. So if we can kind of remove that roadblock for people in our team, they can go and do more productive work than just clicking 'next.' It kind of frees up our resources to work on things that aren't very easily automated.
And what's a past project of which you're particularly proud?
Beech: Some of the more complex things we've done have been around data center moves and upgrades. We did a live upgrade on our New York data center last February. We didn't fail anything over. We just were very careful and upgraded things as we went. And the most gratifying thing is the director of internal IT said, 'I didn't even know you guys were doing an upgrade this weekend. I didn't even notice.' That's always great to hear, especially when you're doing it live in a data center that's still serving traffic.
The second one that comes to mind is less to do with Stack Overflow, and more with just helping out in general. I was living in New York during Hurricane Sandy, and we were in a data center with Squarespace and Fog Creek, which is a sister company. We were part of the crew who kept those companies running throughout Hurricane Sandy. The power was out, and it was a really long couple of weeks, but it was very rewarding. I would never want to live through it again, but you can accomplish these things. I remember walking past the entrance of the New York Stock Exchange, and all of the lights were out except for the lights lighting up the flags outside of the Stock Exchange, and it was the only time I had been in lower Manhattan and looked up and saw the stars. That's a memory I'll probably carry forever.
What were you doing on a technical level in that situation?
Beech: We were doing a bunch of things. Very early in that storm, Stack Overflow failed over to our Oregon data center, so we were spinning up servers every now and then to alleviate some of the caching and backlog pressures from having disconnected servers. But in reality, a lot of the time, we were helping the other companies whose secondary data centers had failed or weren't ready yet, which included helping Fog Creek get some of their systems back up.
Each company had a guy at the data center 24/7, because we couldn't get fuel [to run the generators] into Manhattan. It just wasn't coming. So each of the companies had an emergency crash plan to take down as many servers as possible to save [power]. We had to sit, and we'd hang out, and there were large pots of coffee all around, and basically our only warning that the generator had failed and we were on batteries was if the lights in the service room went out, because they weren't on batteries. It was interesting, and there was a lot of camaraderie. I run into the guys that were there every now and then, and we reminisce: 'Hey, you remember this really hard time we went through together?'
So how did you get started in IT?
Beech: I've always been good with computers. I went to a very small public school, and they were like, 'We don't have a ton of options for you, but you should really look at the tech program at the vocational technical school in the area.' So I looked -- I was 16, 17 -- and they had a computer network administration course, and I really excelled there. From there I got an internship at the tech school, then went on to be a consultant while I was in college. I started working for a multinational contact center company after college. And then I ended up here about six years ago.
Now for our rotating pop culture question -- what's your favorite sports team?
Beech: I may live in New York now, but I grew up in Philadelphia. So it's a very tight race between the Philadelphia Eagles and the Philadelphia Flyers.