Tim Elhajj

1 IT PRO writer vs. hipster gadget ennui



What Does Microsoft Do When the TFS Server Goes Down?

Over the July Fourth weekend this past summer, six thousand of the world’s most talented developers at Microsoft lost access to their bugs, planning data, and, most importantly, the source code for their software products. Microsoft’s own TFS deployment went down. Over twelve terabytes of data was suddenly inaccessible. Within five days all systems were back online and fully operational. The culprit was a piece of hardware: a failed SAN storage unit, recently upgraded to satisfy the ever-increasing need for more storage.

This is the story of those five days and the lessons learned.

It started early Sunday morning, July 3, just past midnight, when one of the SAN storage units started to go bad after routine maintenance. The corruption caused the TFS deployment to go offline.

This TFS deployment is a souped-up cluster of TFS application tiers behind a load balancer, connected to another cluster of SQL Servers that carry the backend. If you’re familiar with TFS, you know the data is stored in Team Project Collections (TPCs). Developer division at Microsoft has many TPCs, but two important ones (one for TFS 2010, the other for the current project, TFS vNext) are huge and stored on an attached SAN.


The OPS team at Microsoft went to work on the failed server immediately on Sunday morning. The current project (vNext) was the highest priority. They restored the TPC from backup but soon discovered corruption in the restored database. On Monday, the OPS team decided to restore the vNext TPC and made an important discovery: the transaction log backups that were supposed to occur every fifteen minutes weren’t happening. This meant going back to the last full backup, which was taken the previous Friday (7/1), and losing the data from over the weekend. Fortunately it was a paid US holiday, so hopefully many developers were out celebrating America’s independence. While the OPS team restored data, the rest of the team prepared for the loss of data by clearing all version control caches on all instances of TFS Proxy and TFS.

Meanwhile, the OPS team began copying the 2010 TPC to the SAN. This involved a copy operation with ten TB of data, so it took a considerable amount of time. By Tuesday the ten TB backup had still not finished copying to the SAN, and it was causing timeouts for people using the current project, which was already back up and running. The OPS team stopped the backup to the shared SAN and secured an isolated SQL Server where they could house the 2010 project. Then they started the ten-terabyte copy again. The additional time required to bring the 2010 project online without harming the health of the TFS server for the rest of the team was worth it.


Just when everyone was beginning to breathe, on Tuesday July 5, the vNext TPC went down again.

This time a whole host of network engineers, developers and architects from across the corporate structure were called upon—IT hardware teams, SQL IT teams, and SQL dev teams. The SAN was the main suspect. By Wednesday, the newly formed emergency response team of network engineers isolated and resolved the problem with the SAN. (Erin, what was it?) By Thursday mail went out that OPS would attempt to restore from the 7/1 backup again. By Friday everything was back to normal.

Three days of data was irrecoverably lost. The two most valuable TPCs to the division were unavailable for a few days. And it took a few days after the recovery was finally in place for all the mirrors to sync up and for the cube to refresh itself with accurate data. On the mitigating side, the data loss fell over a US holiday weekend. And, of course, we discovered the problem with our transaction log backups, which were supposed to occur every fifteen minutes. Now the current TPC has its own dedicated SAN storage. With the transaction logs getting backed up, we should be able to recover from a similar disaster to within fifteen minutes instead of three days.
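
For readers who want to catch the same silent failure in their own deployments, here is a minimal sketch of a freshness check for log backups. It is not what our OPS team runs. The function name, the five-minute grace period, and the timestamps in the usage note are all assumptions made up for illustration, and how you obtain the time of the last completed backup is left to your own tooling.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical schedule: the post says log backups were meant to run every
# fifteen minutes. The grace period is an assumption added for illustration.
EXPECTED_INTERVAL = timedelta(minutes=15)
GRACE = timedelta(minutes=5)


def overdue_by(now: datetime, last_backup: datetime) -> Optional[timedelta]:
    """Return how far past the allowed window the last backup is, or None.

    `last_backup` is the completion time of the most recent transaction log
    backup, supplied by the caller from whatever records the jobs.
    """
    age = now - last_backup
    allowed = EXPECTED_INTERVAL + GRACE
    if age > allowed:
        return age - allowed
    return None


if __name__ == "__main__":
    # Illustrative timestamps only; the year and clock times are assumptions.
    now = datetime(2011, 7, 3, 0, 30)
    last = datetime(2011, 7, 1, 18, 0)
    gap = overdue_by(now, last)
    if gap is not None:
        print(f"Log backup is overdue by {gap}; data since {last} is at risk.")
```

Run on a schedule and wired to an alert, a check like this would flag a missing log backup within minutes rather than at restore time.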



What Does Miley Cyrus Have to Do With Harry Potter?

Microsoft.

At least, this past weekend these three unlikely partners came together for a few hours.

My division had a morale event for the premiere of the new Harry Potter movie. They rented an entire theater in downtown Bellevue for shows throughout the day Saturday. I love the Potter series and had been eagerly awaiting this movie. Developer division generously provided for me and three guests. And I just so happen to have twelve-year-old twins and a wife who also love Harry Potter.

Meanwhile, Microsoft also opened one of its new stores in downtown Bellevue this weekend, with a Miley Cyrus concert to promote it. My daughter loves Miley Cyrus. I was tempted to stand in the long lines to get us tickets, but she had swim team practice and I had work. And we had tickets to Harry Potter!

So I decided not to tell her about it.

I would have gotten away with it, too, but I didn’t realize the Cyrus concert was 100 yards from the theater. My wife dropped us off and went in search of parking. As we got out, she said to our daughter, Kennedy, “Go listen to Miley!”

Kennedy looked around: The streets were crowded and you could hear the band playing loud.

“This is Miley Cyrus?” she asked.

“Wait,” she said. “How did Mom know Miley Cyrus was here?”

I felt guilty.

“Microsoft,” I said. “Microsoft opened a new store and asked Miley to sing.”

Kennedy looked at me with such disappointment. Microsoft invites Miley Cyrus to Bellevue and you get us Harry Potter tickets? I felt bad. I can see I am going to have to take her to a Miley Cyrus concert soon. 

Deathly Hallows was fabulous.

We were too cold and hungry to make it to the new store after the movie, but I am sure I’ll check it out before Christmas. I’m certainly a PC, but I have been known to go into the Mac store from time to time. I’m looking forward to seeing how the new Microsoft stores stack up.
