Over the July Fourth weekend this past summer, six thousand of the world’s most talented developers lost access to their bugs, their planning data, and, most importantly, the source code for their software products. Microsoft’s own TFS deployment went down. Over twelve terabytes of data were suddenly inaccessible. Within five days all systems were back online and fully operational. The culprit was a piece of hardware: a failed SAN storage unit, recently upgraded to satisfy the ever-increasing need for more storage.
This is the story of those five days and the lessons learned.
It started early Sunday morning, July 3, just past midnight, when one of the SAN storage units started to go bad after routine maintenance. The resulting corruption caused the TFS deployment to go offline.
This TFS deployment is a souped-up cluster of TFS application tiers behind a load balancer, connected to another cluster of SQL Servers that carry the backend. If you’re familiar with TFS, you know the data is stored in Team Project Collections (TPCs). The Developer Division at Microsoft has many TPCs, but two of them matter most: one for TFS 2010 and one for the current project, TFS vNext. Both are huge and stored on an attached SAN.
The OPS team at Microsoft went to work on the failed server immediately on Sunday morning. The current project (vNext) was the highest priority. They restored the TPC from backup but soon discovered corruption in the restored database. On Monday, the OPS team decided to restore the vNext TPC again and made an important discovery: the transaction log backups that were supposed to occur every fifteen minutes weren’t happening. That meant going back to the last full backup, taken the previous Friday (7/1), and losing everything checked in over the weekend. Fortunately it was a paid US holiday, so hopefully many developers were out celebrating America’s independence. While the OPS team restored data, the rest of the team prepared for the data loss by clearing all version control caches on all instances of TFS Proxy and TFS.
Meanwhile, the OPS team began copying the 2010 TPC to the SAN. This meant copying ten TB of data, so it took a considerable amount of time. By Tuesday the ten TB backup had still not finished copying to the SAN, and it was causing timeouts for people using the current project, which was already back up and running. The OPS team stopped the copy to the shared SAN and secured an isolated SQL Server where they could house the 2010 project. Then they started the ten TB copy again. The additional time required to bring the 2010 project online without harming the health of the TFS server for the rest of the team was worth it.
Just when everyone was beginning to breathe easier, on Tuesday, July 5, the vNext TPC went down again.
This time a whole host of network engineers, developers, and architects from across the corporate structure were called upon: IT hardware teams, SQL IT teams, and SQL dev teams. The SAN was the main suspect. By Wednesday, the newly formed emergency response team of network engineers had isolated and resolved the problem with the SAN. By Thursday, mail went out that OPS would attempt to restore from the 7/1 backup again. By Friday everything was back to normal.
Three days of data were irrecoverably lost. The two TPCs most valuable to the division were unavailable for a few days. And once the recovery was finally in place, it took a few more days for all the mirrors to sync up and for the cube to refresh itself with accurate data. What mitigated our losses was that the lost period fell over a US holiday weekend. And, of course, we discovered the problem with our transaction log backups, which were supposed to occur every fifteen minutes. Now the current TPC has its own dedicated SAN storage. With the transaction logs getting backed up, we should be able to recover from a similar disaster to within fifteen minutes instead of three days.
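The lesson about the silent log backups lends itself to a small automated check. The following is a minimal sketch in Python, not the tooling our OPS team actually uses: it assumes you can already obtain the finish time of the newest transaction log backup (on SQL Server, the backup history in `msdb.dbo.backupset` is one possible source), and the function names and the five-minute grace period are hypothetical. Only the fifteen-minute interval comes from the story above.

```python
from datetime import datetime, timedelta
from typing import Optional

# The 15-minute interval is the schedule described in the article.
# The grace period is an assumption made for this illustration.
LOG_BACKUP_INTERVAL = timedelta(minutes=15)
GRACE_PERIOD = timedelta(minutes=5)


def log_backup_is_stale(
    last_backup: Optional[datetime],
    now: datetime,
    interval: timedelta = LOG_BACKUP_INTERVAL,
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Return True when the newest log backup is missing or too old."""
    if last_backup is None:
        # No log backup on record at all is the worst case.
        return True
    return now - last_backup > interval + grace


def potential_data_loss(last_backup: datetime, failure_time: datetime) -> timedelta:
    """Return how much work would be lost if storage failed at failure_time."""
    return max(failure_time - last_backup, timedelta(0))


if __name__ == "__main__":
    # Hypothetical timestamps: a Friday evening backup and a failure
    # shortly after midnight on Sunday. The year and times are made up.
    last_good_backup = datetime(2011, 7, 1, 18, 0)
    failure = datetime(2011, 7, 3, 0, 30)

    if log_backup_is_stale(last_good_backup, failure):
        lost = potential_data_loss(last_good_backup, failure)
        print(f"ALERT: log backups are stale; exposure is {lost}")
```

Run on a schedule, a check like this would raise an alert within minutes of the log backups stopping, rather than leaving the gap to be discovered during a restore.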