I touched a little bit on continuous deployment in talking about Craftsmanship, but got sidetracked into a discussion of why downtime matters even if it "doesn't matter." There are two big technical hurdles to continuous deployment, in my opinion. The first is having sufficient automated test coverage that you're confident, every time you publish, that the bugs are in your design and not in the rest of your code. Let's ignore that one for a moment, because the second hurdle - constantly bringing servers down to update them, disrupting your service - seems to have further caught the fancy of one Bryant Durrell, former director of operations at Turbine.
He doesn't like the idea of continuous deployment, but he loves the idea of minimizing downtime to nothing (who doesn't?). On the other hand, I think the process he describes could use some work. There are a few things that really jump out at me, ways to minimize stress and downtime without even pushing that hard. The first is to get rid of the deployment checklist as quickly as you can. Replace that checklist with a button to push. If you miss doing some things when the button is pushed, if you discover corner cases the automated publish doesn't handle, you'll notice - and you'll notice your mistakes faster when your expectation is "push button, see publish" than when you are constantly having to re-evaluate in your head where you are in the process, what's left, and at what point you see results.
The other issue I see in Bryant's posts is the discussion of rollbacks. I'm not sure there's a single thing more heinous, from a player's perspective, than a rollback. It's a terrible error when a bug causes a crash before players' data can be saved; I don't think we should be planning, as developers, to instigate extra rollbacks. I don't have any specific advice that comes to mind here, but I think that if you're resolute in avoiding rollbacks, the way to get around having to roll back after a bad publish will open. Sorry that's not very helpful. :)
On to the meat of minimizing downtime... I think the biggest issue is really just a question of load balancing. All kinds of other networked systems are designed to cluster in such a way that individual machines can go down without the entirety of the service being affected.
Game servers have more state than most, but on the other hand we're talking about planned downtimes here, not disaster recovery (although that's worth thinking about too). If you design the entire system around short-lived processes on a cluster, for example, then it should be possible to label an individual machine as being no longer in the pool of available machines while the process updates. Of course, there you probably run into an issue of data storage, but you can design around that as well - make it easy for an older version of the process to ignore new information in the serialized version of the player character, for example.
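To make the "older version ignores new information" idea concrete, here is a minimal sketch. It assumes the character is serialized as JSON and that the older build knows only three fields; the PlayerCharacter class, its fields, and load_character are all hypothetical names made up for illustration, not anything from a real game.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PlayerCharacter:
    # Hypothetical fields known to this (older) server build.
    name: str
    level: int = 1
    gold: int = 0


def load_character(serialized: str) -> PlayerCharacter:
    """Deserialize a character, silently skipping fields this build does not know."""
    raw = json.loads(serialized)
    known = {f.name for f in fields(PlayerCharacter)}
    return PlayerCharacter(**{k: v for k, v in raw.items() if k in known})


# A record written by a newer build that added a "mount" field.
newer_record = json.dumps(
    {"name": "Ayla", "level": 12, "gold": 40, "mount": "gryphon"}
)
character = load_character(newer_record)
```

Note that this only covers reading. If the older process also writes the character back, it would need to carry the unknown fields along somehow, or the newer data would be lost on save.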
If you have longer-lived processes, a stronger hand-off procedure would be needed. However, given gigabit interconnects, it seems reasonable that a server about to be updated could contact a hot spare, convey its current state, and transfer data ownership in a very brief span of time. Once a server has relinquished control of any player state, it could be safely updated and restarted, ready to serve updated clients.
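A rough sketch of that hand-off, with everything held in memory so it can run on its own. The GameServer class and its methods are hypothetical, and a real version would send the state over the network and confirm receipt before giving up ownership.

```python
from dataclasses import dataclass, field


@dataclass
class GameServer:
    name: str
    accepting: bool = True
    players: dict = field(default_factory=dict)  # player id -> state

    def hand_off_to(self, spare: "GameServer") -> None:
        """Convey all player state to the spare, then relinquish ownership."""
        self.accepting = False          # stop taking new players first
        snapshot = dict(self.players)   # the state to convey
        spare.players.update(snapshot)  # stand-in for a network transfer
        spare.accepting = True          # the spare now owns these players
        self.players.clear()            # this server owns nothing any more

    def safe_to_update(self) -> bool:
        """True once the server neither accepts nor owns any players."""
        return not self.accepting and not self.players
```

Once safe_to_update() returns True, the old server holds no player state and can be patched and restarted without anyone noticing.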
Another option seems to be faking "zero downtime" - go ahead and force everyone to restart their client and reconnect when the server they're on is updated, but do it as a rolling update (so the majority of your playerbase can continue to be online at any given moment), and make it as easy as "I'm disconnected, now I reconnect and the game is updated."
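The rolling part is simple enough to sketch. This assumes a flat list of server names and a caller-supplied update function that does the real work (kick players, patch, restart); both are placeholders, and the batch size is whatever fraction of the playerbase you can stand to disconnect at once.

```python
def rolling_update(servers, batch_size, update):
    """Update servers one batch at a time, so only one batch is offline at once."""
    batches = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            update(server)  # hypothetical: kick players, patch, restart
        batches.append(batch)
    return batches
```

In a real deployment you would also want to check that each batch came back healthy before moving on to the next one.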


I hear what you're saying about rollbacks. Having worked in situations where rollbacks were super-difficult, I'm forced to agree that you get better at avoiding them when they're impossible to manage, as it were. I'd still prefer to have the safety net; better to be down for half an hour for a rollback than to be down for four hours for a bug fix.
The faked zero downtime you describe is a lot like Second Life's process; SL just kicks people out of the specific region since servers are tied to the coordinates they handle. It works OK; immediate reconnects would be better.
I wonder if the server handoff would work as a two stage process... hm. OK, I know it's plausible to do interserver communication to handle two clients interacting across a server boundary. It'd be nice if you could hand players off between servers granularly, which would enable you to throttle down a server over time.
One-button-deploy is a well-solved problem on the web: for every framework it might be different, and it might need to be solved again, but it's a common thing to address in developing a new framework. We didn't have it at NCsoft, but that was mostly a matter of not devoting resources to the problem: individual parts were automated, but the pieces didn't talk to each other.
Perforce checkouts, file copies to the server, file copies to the patcher, and database updates on DR were as simple as running four automated scripts. The database update in particular was nice, because the whole thing was versioned in the DB and completely idempotent - you could pull the latest version out of P4 and run it every time without problems. Game shutdown was a single command that managed in-game announcements, server lock-out, pre-shutdown player kicks, and exiting the process when everyone was gone (using the telnet interface I mentioned on your blog, which could easily be automated with Perl); and bringing the servers back up (executing one program on each server) involved simply invoking a batch script that could be done in parallel by an automated process. Database backups during downtime were another single button press, too.
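For anyone who hasn't seen a versioned, idempotent database update, here is a minimal sketch of the general idea using SQLite from the Python standard library. This is not the actual script described above; the table names and migrations are invented, and the schema_version table is just one common way to record what has already been applied.

```python
import sqlite3

# Hypothetical migrations, in order; position in the list (from 1) is the version.
MIGRATIONS = [
    "CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE characters ADD COLUMN level INTEGER DEFAULT 1",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply only migrations newer than the recorded version; return how many ran."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0  # MAX() over an empty table yields NULL
    applied = 0
    for version, statement in enumerate(MIGRATIONS, start=1):
        if version > current:
            conn.execute(statement)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
            applied += 1
    conn.commit()
    return applied
```

Running migrate() a second time against the same database applies nothing, which is what makes it safe to run on every publish.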
When you break it down to "push these eleven buttons in this order," I think it's clear that writing one big red button that simply pushes those buttons in that order is not hard.
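As a sketch of how little the big red button needs to be: assume each existing button is wrapped as a Python callable (the step names here are made up), and the button just runs them in order and stops loudly at the first failure.

```python
def deploy(steps):
    """Run each (name, action) step in order; stop at the first failure."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(
                f"deploy stopped at '{name}' after completing {completed}"
            ) from exc
        completed.append(name)
    return completed
```

Stopping at the first failure, and saying which step failed, is the part that replaces the human keeping track of where they are in the checklist.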
Incidentally, one big impetus for working on this long before beta is that it makes playtesting throughout the dev cycle a snap. If updating the playtest server is a pain, developers won't do it; there may not yet be an operations team to do it, and then... who is doing it? And how often do they get around to doing it? Make it easy, and then everybody does it every time they have something to show off; make the client update process easy, and then you eliminate an excuse for not participating, and get more feedback on a faster cycle: win.
Depending on how instanced the game is, server hand-off could be VERY easy - "throttle down a server over time" is sort of what I described for short-lived individual processes: you stop creating new processes when it's that machine's turn to be updated, let all the existing processes end normally, and then update after that.
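Here is roughly what I mean, as a toy in-memory model. The Machine class and its method names are hypothetical; the point is only the order of operations: refuse new instances, wait for the existing ones to end, then update.

```python
class Machine:
    """Toy model of one machine hosting short-lived game instances."""

    def __init__(self, name):
        self.name = name
        self.draining = False
        self.instances = set()
        self.version = 1

    def start_instance(self, instance_id):
        """Refuse new instances once the machine is draining."""
        if self.draining:
            return False
        self.instances.add(instance_id)
        return True

    def end_instance(self, instance_id):
        self.instances.discard(instance_id)

    def try_update(self, new_version):
        """Update only when draining and every instance has ended normally."""
        if self.draining and not self.instances:
            self.version = new_version
            self.draining = False  # back in the pool
            return True
        return False
```

Whatever hands out new instances would call start_instance() on each candidate machine and simply move on to the next one when it gets False back.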
We had a lot of the same buttons. The game shutdown wasn't quite as clean throughout, which resulted in some necessary handholding. This is one example of a place where earlier requirements definition would have helped a ton.
Other parts were handled purely by Operations, and I didn't have much view into it or ability to help them automate it. I think the fact that no one on either the DR side of the fence or the Operations side of the fence had total responsibility was what really kept us from having a one-button process. Everyone was responsible for a single thing, and did enough automation that they had one button to push.
...And then you had to coordinate a lot of different people pushing different buttons, and human error creeps in. :)