The Upside of Downtime? Zendesk's Lessons Learned on Datacenter Migrations
Last updated September 21, 2021
A couple of weeks ago, Foursquare had a major outage. We felt their pain; site outages of any sort are hugely frustrating for both engineers and customers. Foursquare isn't alone; there are plenty of other famous and infamous stories of services that have suffered from operational issues. Back in 1999, eBay suffered from day-long outages. Salesforce had serious issues in 2005. And, of course, the fail whale is perhaps the most well-known mammal on the internet today.
Most services will never have Twitter's particular problems (accommodating World Cup chatter or breaking Cassandra), but high uptime is an almost universal requirement. At Zendesk, our customers depend on our site being available so they can support their customers. Of course, we're certainly not perfect, and we track our outages on Twitter.
Over the past three years, Zendesk has had hockey stick growth, which is terrific, and a problem we're happy to have! We've recently had to plan around our growth to prevent downtime, and moved our service to a new datacenter last month. The migration moved a database containing over 5,000 customers with 500 million rows and more than a terabyte of stored data to a new cluster.
We had a two-hour scheduled maintenance window, which we'd announced months in advance and again in the weeks and days leading up to the move. It took us only 1 hour and 20 minutes, less time than it takes to run a half-marathon!
Apart from the scheduled window, none of our customers really noticed. Our takeaway comes down to five important recommendations.
Predict & Plan Months in Advance
Your app has had successful adoption and your customer base continues to flourish and grow. There comes a time when you will have to migrate to a more scalable solution.
Migrations give you a chance to rethink your architecture and your growth. Start planning at least 3-6 months before you need to make a move, ensuring that the existing infrastructure will hold up until then.
Planning once you already have capacity issues is much harder, more stressful, and likely more expensive.
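One way to decide when the 3-6 month clock should start is to estimate how much runway the existing infrastructure has left. The sketch below is only an illustration, not something from our own tooling: it assumes usage grows by a constant percentage each month, and the function names, the 6-month default lead time, and the example numbers are all hypothetical.

```python
import math


def months_until_full(current_usage: float, capacity: float, monthly_growth: float) -> float:
    """Estimate months until usage reaches capacity under compound monthly growth.

    monthly_growth is a fraction, e.g. 0.10 for 10% per month.
    """
    if current_usage <= 0 or capacity <= 0:
        raise ValueError("current_usage and capacity must be positive")
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    if current_usage >= capacity:
        return 0.0
    # Solve current_usage * (1 + g)^m = capacity for m.
    return math.log(capacity / current_usage) / math.log(1.0 + monthly_growth)


def should_start_planning(runway_months: float, lead_time_months: float = 6.0) -> bool:
    """Return True when the remaining runway is within the planning lead time."""
    return runway_months <= lead_time_months


if __name__ == "__main__":
    # Illustrative numbers only: 1.0 TB stored today, 2.0 TB usable, 10% monthly growth.
    runway = months_until_full(1.0, 2.0, 0.10)
    print(f"Estimated runway: {runway:.1f} months")
    print("Start planning now" if should_start_planning(runway) else "Keep monitoring")
```

Real growth is rarely this smooth, so treat the result as a rough early warning and re-run it with fresh measurements every month.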
Just One More Round!?
Cut Yourself Off, Control the Degree of Change
A migration is a great opportunity to abandon old and undesirable practices, upgrade software, and fix infrastructure issues. If you love tackling these challenges and creating solutions, you can easily get sucked into doing too much, and doing it all at once.
Don't pull at a thread and unravel the entire sweater!
At Zendesk, we used our migration as an opportunity to:
- Double our hardware footprint
- Upgrade our Linux distribution (Ubuntu 9.10) & MySQL version (Percona 5.1.47)
- Move from GFS to NFS (we're planning on replacing NFS as a separate project)
- Consolidate our mail-handling infrastructure, removing a dependency on POP servers.
We wanted to do more (it was hard to resist making changes to the schema and application), but that list was where we stopped. In the new datacenter, we deployed the same code we'd been running in the old one for the previous two weeks.
By containing what we wanted to do and focusing on what we needed to do, we simplified the migration and greatly reduced risk. We knew that if there were problems, they would be isolated to external software rather than our code.
Identify Risks & Edge Cases
The risks you understand do not typically turn into the problems you encounter during a migration (heck, during life!). The things you aren't thinking about are the things that will bite you.
This happened to us at Zendesk with our attachment store. Our attachment store is a place where users upload images and files to go alongside their support tickets. We put this data on a shared filesystem. It's not a great architecture, but it worked well enough, so we didn't think about it a lot. We ended up having to copy the most recent files to the new cluster by hand during the migration; that took up a good percentage of the scheduled time.
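In hindsight, a small script could have told us ahead of time how many files had changed since the last bulk copy. The following is a minimal sketch of that idea, not what we actually ran: it assumes the attachments live in an ordinary directory tree and that file modification times are trustworthy. The function name, file names, and timestamps are made up for the demonstration.

```python
import os
from pathlib import Path
from typing import Iterator


def files_modified_since(root: Path, cutoff_epoch: float) -> Iterator[Path]:
    """Yield files under root whose modification time is at or after cutoff_epoch."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.stat().st_mtime >= cutoff_epoch:
                    yield path
            except FileNotFoundError:
                # The file disappeared between listing and stat; skip it.
                continue


if __name__ == "__main__":
    import tempfile

    # Self-contained demo in a throwaway directory with made-up file names.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        old = root / "old.png"
        new = root / "new.png"
        old.write_bytes(b"old")
        new.write_bytes(b"new")
        # Backdate one file so it falls before the cutoff.
        os.utime(old, (1_000_000_000, 1_000_000_000))
        cutoff = 1_500_000_000
        for path in files_modified_since(root, cutoff):
            print(path.relative_to(root))
```

Running something like this against the real store during rehearsal would show the size of the final delta before the maintenance window opens.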
Include Slack & Sleep in Your Schedule
There will almost always be something that your plan doesn't include, so beat it to the punch and plan for slack in the schedule.
At Zendesk, we scheduled no work for the week prior to the migration. What we ended up doing was:
- Monday – An intensive round of QA
- Tuesday – Rebalanced CPUs on the VMs
- Wednesday – Changed the search engine indexing process
- Thursday – Dealt with a database replication failure
- Friday – Headed home early for a good night's sleep before the migration on Saturday morning
Since we had nothing scheduled, we had time to attend to the unexpected, and everything was in place and ready to go for a smooth day on Saturday. We were all sitting down for dim sum on Saturday by 1pm.
Rinse & Repeat (or Rehearse, Document & QA Repeatedly)
Prior to Zendesk's migration, every step done during the migration was documented in advance, and we kept that list as short as possible.
Your customers don't notice the mistakes you make in beta environments and the extra hours they cost you. They do notice if your scheduled maintenance runs over or goes badly. Keep your customers happy by doing as much work as you can prior to the migration; an hour spent then to save several minutes is well worth the investment.
Every step you perform is a potential source of error; document them in detail and do a dress rehearsal so what needs to happen on the day becomes automatic. QA the new site aggressively; we did perhaps 3 full and 10 partial QA passes over the new configuration in the final 2 weeks.
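A documented step list lends itself to being encoded so it can be rehearsed without touching anything. Here is a minimal sketch of that idea, assuming each step can be wrapped in a function. The `Step` class, the `run_runbook` helper, and the three example steps are hypothetical placeholders, not our actual migration plan.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], None]


def run_runbook(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Run steps in order; in dry-run mode only record what would happen.

    Stops at the first failing step so the operator can decide how to proceed.
    """
    log: List[str] = []
    for number, step in enumerate(steps, start=1):
        if dry_run:
            log.append(f"[dry-run] {number}. {step.description}")
            continue
        try:
            step.action()
        except Exception as exc:
            log.append(f"[failed] {number}. {step.description}: {exc}")
            break
        log.append(f"[done] {number}. {step.description}")
    return log


if __name__ == "__main__":
    # Placeholder steps; the real actions would be whatever the documented plan lists.
    steps = [
        Step("Put the site into maintenance mode", lambda: None),
        Step("Stop writes to the old database", lambda: None),
        Step("Switch traffic to the new cluster", lambda: None),
    ]
    for line in run_runbook(steps, dry_run=True):
        print(line)
```

Defaulting to dry-run means a rehearsal is the easy path and the real run has to be requested explicitly, which fits the goal of making migration day feel automatic.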
Growth without Pain: A Success Story
There's a lot of user-driven publicity around outages, and in the interest of transparency, it's helpful to explain how you rectified a problem and how you will prevent that particular issue from happening again. Pulling back the curtain and sharing problems and solutions helps the larger community.
Much less has been written about the success stories. You want your startup to have the challenge of needing to scale for massive growth and record traffic. To get there, it's important to stay ahead of the growth curve rather than fall behind it. Migration is a complex operation, but it can be done well if time is taken to understand what is involved. Mistakes are inevitable, but you don't have to let missteps define your operational and response strategy. Take the time to plan in advance and the results will be smoother, cheaper, and faster.
Want to know more? Talk to us anytime!