Tim Elhajj

Off the Microsoft stack!



What Does Microsoft Do When the TFS Server Goes Down?

Over the July Fourth weekend this past summer, six thousand of the world’s most talented developers at Microsoft lost access to their bugs, planning data, and most importantly, the source code for their software products. Microsoft’s own TFS deployment went down. Over twelve terabytes of data were suddenly inaccessible. Within five days all systems were back online and fully operational. The culprit was a piece of hardware: a failed SAN storage unit, recently upgraded to satisfy the ever-increasing need for more storage.

This is the story of those five days and the lessons learned.

It started early Sunday morning, July 3, just past midnight, when one of the SAN storage units started to go bad after routine maintenance. The corruption caused the TFS deployment to go offline.

This TFS deployment is a souped-up cluster of TFS application tiers behind a load balancer, connected to another cluster of SQL Servers that carry the backend. If you’re familiar with TFS, you know the data is stored in Team Project Collections (TPCs). The Developer Division at Microsoft has many TPCs, but there are two important ones (one for TFS 2010, the other for the current project, TFS vNext), which are huge and stored on an attached SAN.


The OPS team at Microsoft went to work immediately on the failed server Sunday morning. The current project (vNext) was the highest priority. They restored the TPC from backup but soon discovered corruption in the restored database. On Monday, the OPS team decided to restore the vNext TPC and made an important discovery: the transaction log backups that were supposed to occur every fifteen minutes weren’t happening. This meant going back to the last full backup, which was taken the previous Friday (7/1), and losing the data from over the weekend. Fortunately, it was a paid US holiday weekend, so hopefully many developers were out celebrating America’s independence. While the OPS team restored data, the rest of the team prepared for the loss of data by clearing all version control caches on all instances of TFS Proxy and TFS.

Meanwhile, the OPS team began copying the 2010 TPC to the SAN. This involved a copy operation with ten TB of data, so it took a considerable amount of time. By Tuesday the ten TB backup had still not finished copying to the SAN, and it was causing timeouts for people using the current project, which was already back up and running. The OPS team stopped the copy to the shared SAN and secured an isolated SQL Server where they could house the 2010 project. Then they started the ten TB copy again. The additional time required to bring the 2010 project online without harming the health of the TFS server for the rest of the team was worth it.


Just when everyone was beginning to breathe, on Tuesday July 5, the vNext TPC went down again.

This time a whole host of network engineers, developers and architects from across the corporate structure were called upon—IT hardware teams, SQL IT teams, and SQL dev teams. The SAN was the main suspect. By Wednesday, the newly formed emergency response team of network engineers isolated and resolved the problem with the SAN. (Erin, what was it?) By Thursday mail went out that OPS would attempt to restore from the 7/1 backup again. By Friday everything was back to normal.

Three days of data were irrecoverably lost. The two most valuable TPCs to the division were unavailable for a few days. And it took a few days after the recovery was finally in place for all the mirrors to sync up and for the cube to refresh itself with accurate data. Mitigating our losses, the period of time for which data was lost was over a US holiday weekend. And, of course, we discovered the problem with our transaction log backups, which were supposed to occur every fifteen minutes. Now the current TPC has its own dedicated SAN storage. With the transaction logs getting backed up, we should be able to recover from a similar disaster to within fifteen minutes instead of three days.
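The lesson about silent backup failures can be turned into a small check. What follows is only a sketch in Python, not the tooling the OPS team uses: the function names, the five-minute tolerance, and the sample timestamps are all hypothetical. It assumes you can get a list of backup completion times from somewhere (for example, your backup history), and it reports any stretch longer than the expected fifteen minutes with no backup, plus how much data a failure at a given moment would cost.

```python
from datetime import datetime, timedelta


def find_backup_gaps(backup_times, expected_interval, now,
                     tolerance=timedelta(minutes=5)):
    """Return (start, end) pairs where no backup ran for too long.

    backup_times: completion times of backups (hypothetical input).
    expected_interval: how often backups are supposed to run.
    now: the moment the check runs; the stretch since the last backup counts.
    """
    if not backup_times:
        # No backups at all: report one open-ended gap.
        return [(None, now)]
    limit = expected_interval + tolerance
    points = sorted(backup_times) + [now]
    gaps = []
    for earlier, later in zip(points, points[1:]):
        if later - earlier > limit:
            gaps.append((earlier, later))
    return gaps


def worst_case_data_loss(backup_times, failure_time):
    """Return the time between the last usable backup and the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        return None
    return failure_time - max(usable)


if __name__ == "__main__":
    # Hypothetical times: log backups stop after 18:30, check runs at 19:30.
    backups = [
        datetime(2011, 7, 1, 18, 0),
        datetime(2011, 7, 1, 18, 15),
        datetime(2011, 7, 1, 18, 30),
    ]
    check_time = datetime(2011, 7, 1, 19, 30)
    for start, end in find_backup_gaps(backups, timedelta(minutes=15), check_time):
        print("No backup between", start, "and", end)
```

Run on a schedule and wired to an alert, a check like this would flag missing log backups within the hour instead of during a restore.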



Netflix in the Living Room: Upgrading to Wireless N

I recently upgraded the wireless network at my house. My thirteen-year-old son was complaining about network lag on his Xbox. I felt his pain. We got him the Xbox the Christmas before last, and I had been excited because I wanted to use it to watch Netflix movies in the living room. But watching a movie by piping it over the old wireless network from my office to the living room was terrible. I investigated and realized that my trusty G wireless network was running in compatibility mode to accommodate my old Tivo2.

What to do?

The old Tivo2s don’t come with a wired network jack, so it’s either run the wireless connection in compatibility mode or string 50 feet of phone cord across the living room. But the problem with running legacy wireless devices on a fast network is that the network is only as fast as the slowest device that’s connected to it. What’s the answer? I went with a dual band wireless N router. I picked the Netgear N600 Wireless Dual Band Router WNDR3400 (pictured).

Netgear N600 Wireless Dual Band Router WNDR3400

Dual band routers are great. They come with two separate radios: one broadcasts at 5 GHz and the other at 2.4 GHz. This means you can offer a fast N signal for all your N devices on the less crowded 5 GHz frequency, and you can still offer wireless N in compatibility mode on the other radio at 2.4 GHz. This way legacy devices (Tivo2, iPhone 3GS) can still connect to a wireless network without slowing down the fast wireless signal on the 5 GHz radio.

Bottom line: My slow Tivo2 with the lifetime contract continues to download listings for me, and I don’t have to sacrifice picture quality or the ability to fast forward and rewind my Netflix movies. As an added bonus, I also get a little wireless speed bump for my laptop and the wife’s iPad. I think I spent about $80 US. The only annoying thing about the WNDR3400 is the big blue LED, but you can easily turn it off by pressing it.

I did this upgrade last October. My son is happy, and my house is free of the 50-foot ghetto cord.



Goodbye iPhone, Hello Windows Phone

Windows Phone 7, that is.

I picked up my new phone earlier this week, compliments of Microsoft. I have had it less than a week, but so far I’m impressed. This might be love.

I got my first smart phone last year. Two years prior to that, I picked up my first iPod Touch. As it turned out, having the Touch prior to getting my first smart phone ruined me for Windows Phone 6.5. I kept it for less than a day. I had to give back some of my employee discount and pay AT&T a $35 restocking fee to get an iPhone 3GS, but it was worth it. As I explored Windows Phone 6.5, it would freeze. I had to remove the battery to get it back online. Not that I never had to do that with my iPhone (hello occasional hard reset), but everything seemed so hard compared to the Touch. Not to mention the sexy factor of the iPhone.

But those days seem gone. I got the Samsung Focus (pictured). I love how it feels in my hand, its gorgeous screen. I touch it every day.

If my kids touch it with their sticky fingers, I scowl at them. Shine it up on my shirt.

“Wash your hands,” I say.

But seriously: it’s a sexy phone.

I am experiencing some iPhone-related transition pain. For example, I don’t know how many times this week I have touched the Start button to wake it up. And I haven’t gotten all my music or podcasts moved over. But where I’ve found feature discrepancies between the two (no bus routes in maps), I’ve found apps that have more than filled the gap (OneBusAway is an awesome app). Once I get a few more hours with it, I’ll post more about my experience.

For now, I must say I feel proud it’s a Microsoft product.



Network Service: Threat or Menace?

In the Spider-Man comic, J. Jonah Jameson, the Editor-in-Chief of the mythical Daily Bugle, launches a smear campaign against our hero with the headline, “Spider-Man: Threat or Menace?”*

Like Jameson’s ongoing vendetta against Spider-Man, some people believe that using one of the built-in user accounts—like Network Service or Local Service—for a service account is always a bad idea. A service account is an identity that you assign to software so that it can interact with other services or computers. In the last iteration of our product, we gave customers the option to use one of the built-in accounts, instead of creating a user account, and I discovered just how strong this knee-jerk response is for some people.

Don’t get me wrong: sometimes using one of the built-in accounts is a bad idea.

For example, if your service account requires more than the most basic privilege for your software to run, you definitely don’t want to use a built-in account. If you ever find yourself adding a built-in account to the local Administrators group, you know you’re in trouble. Why? Well, for one, the built-in accounts have no passwords. Moreover, you can’t even assign a password. You’ll get an error.

Holy cow. What is this business of assigning software an identity with no password, you say? That doesn’t sound too secure!

Well, there’s the rub. In fact, these types of accounts came out of Microsoft’s Trustworthy Computing initiative right after the turn of the century. The idea was that moving forward, developers would build applications so that tasks that require high privilege go to a local identity other than the service account. This identity can be more easily locked down. Think about it. Before all our computers were networked, a computer was much easier to secure. Software that needs to use a service account to access the network or interact with other services needs to do so with a low privilege account, so that if it is compromised, the attacker gets nothing of value. The built-in accounts have permissions equivalent to an account in the local Users group and no password.

I’d argue that if you’re spending time changing passwords on user accounts for software built in the last five years, you’re wasting valuable security cycles better spent on some other security-related task. Network Service: If you use it appropriately, it can increase security.

*See noted American linguist Mark Liberman’s post on Language Log for the origins of this popular false accusation meme.