Daniel Eran Dilger in San Francisco

Sun, Oracle save Microsoft’s Pink after Danger data disaster

Daniel Eran Dilger

Microsoft has announced the restoration of Sidekick users’ contacts as the first milestone in recovering data it lost in the cloud computing disaster affecting its Danger subsidiary, while a new source explains why the restoration was possible without a backup and why it is taking so long.

A source familiar with Sun SAN hardware used in the Danger datacenter has provided AppleInsider with additional insight explaining why it is taking Microsoft weeks to recover its users’ lost data after initial reports stated that the data was completely lost and that no suitable backup existed.

Microsoft’s problems began at the beginning of the month, when the cloud servers it operates under contract to T-Mobile began falling offline. It was initially announced that large amounts of T-Mobile’s Sidekick subscribers’ data had been lost and that no backup existed for the user data, which was stored entirely on Microsoft’s servers. (Sidekick devices are not designed to be backed up locally in the same way the iPhone backs itself up to iTunes on the user’s desktop computer.)

On October 6, T-Mobile issued a statement saying, “Regrettably, based on Microsoft/Danger’s latest recovery assessment of their systems, we must now inform you that personal information stored on your device – such as contacts, calendar entries, to-do lists or photos – that is no longer on your Sidekick almost certainly has been lost as a result of a server failure at Microsoft/Danger.”

On October 15, two weeks after the problems began, Roz Ho, Microsoft’s vice president of Premium Mobile Experiences, apologized for the outage and announced that, contrary to initial reports that all the data was permanently lost, the company now expected to recover most of it, though the recovery effort would take some time.

This week, on October 20, T-Mobile announced the availability of users’ restored contact data dating back to October 1. It provided subscribers with instructions on how to restore their data, merge the restored contacts with the information currently on their devices, or simply ignore the restored data and continue using their current contact information.

In addition to contact data, the announcement also said, “We’re making solid progress on the next phase in this restoration process, including your photographs, notes, to-do lists, marketplace data and high scores. We appreciate your ongoing patience.”

Why the data recovery takes so long

Speaking as a “Sun technical guru who has been pulled into similar scenarios in the past,” a source has explained that in circumstances like this one, where Microsoft and its IT services contractor Hitachi were facing lost data and did not have a conventional backup to restore, “[SAN storage vendor] Sun and [database vendor] Oracle have sent in their best people, and they are stitching the database back together for Microsoft, and they have a good estimate of what data is recoverable.”

“It will take several days to actually get the database back up,” the source noted, echoing earlier reports that it took six days just to create a normal full backup of the data. The time and storage resources needed to back up such a tremendous amount of data were cited as the reason why Microsoft’s Roz Ho reportedly instructed Danger employees, over their objections, to proceed with work without a full backup in place; sources say Hitachi had assured her that a full backup was not necessary.

Salvaging the damaged data storage without a real backup in place takes even longer, the Sun storage expert explained. “The first thing to do is wheel in a big pile of new disk space, and copy the individual disks so there is a raw backup. This is like making a copy of a jigsaw puzzle one piece at a time. Then they would assemble the puzzle using the copied pieces, in case any pieces need to be re-made from the original.

“This is very hard, requires detailed inside knowledge of how SAN addresses and volume manager layouts fit together with Oracle tables. Finally, they need to start up the database on top of the assembled puzzle, and Oracle will do its own clean up to get into a consistent state.

“The next thing you do is a fresh backup (several days), before you allow any users access to it. So it’s not surprising that this would take over a week, even after it was possible to say that the data is recoverable.”

Cause of the datacenter problems still secret

While the recovery effort is being delayed and complicated by the lack of an external backup, Microsoft remains quiet about the cause of the incident. After AppleInsider first reported that insiders were blaming the failure on either an aggressive and poorly orchestrated upgrade or possibly even deliberate sabotage by a disgruntled employee, CNET “Eye on Microsoft” reporter Mary Jo Foley stated, “I’ve also heard that foul play has not been ruled out because the failure was so catastrophic and seemingly deliberate. Microsoft is supposedly continuing to do a full investigation.”

Whether the incident was the result of an accident or a malicious attack, Microsoft has learned an important lesson that eventually hits everyone in the world of computing: never work without a backup. In last week’s apology to Sidekick users, Ho wrote, “we have made changes to improve the overall stability of the Sidekick service and initiated a more resilient backup process to ensure that the integrity of our database backups is maintained.”

The company has also (understandably) worked to distance itself from the high profile datacenter disaster by describing its Danger operations as running non-Microsoft technology, specifically associating Sun and Oracle with the incident. Somewhat ironically, Microsoft’s capacity to recover most of its Sidekick users’ data is entirely due to the availability of Sun and Oracle experts and the inherent resilience of those companies’ products to disasters of any kind, even in cases where customers do not maintain proper backups of their data.

Recovery begins amid lawsuits

While at least some of the lost data has now been recovered, T-Mobile continues to list all Sidekick products as “temporarily out of stock” on its website, and multiple lawsuits have already been filed by users. A report by CNET “Beyond Binary” columnist Ina Fried cited one attorney’s complaint as stating, “T-Mobile and its service providers ought to have been more careful the use of backup technology and policies to prevent such data loss.”

A second attorney pursuing a case against both parties wrote, “T-Mobile and Microsoft promised to safeguard the most important data their customers possess and then apparently failed to follow even the most basic data protection principles. What they did is unthinkable in this day and age.”

The suit added, “Further complicating the data loss is the fact that Sidekicks, unlike iPhones, BlackBerrys and other smartphones, are not designed to sync locally with a user’s personal computer without additional software and hardware. This means that most users were not able to backup their data locally, but were encouraged and required to rely on Microsoft/Danger.”

Last week, T-Mobile volunteered a peace offering to its affected users in the form of a $100 gift card and a month of free data service. However, given the deep pockets behind the event, the company’s million Sidekick subscribers are likely to be looking for a bigger settlement. And while subscribers don’t have an iron-clad contract stipulating specific damages in compensation for service outages on T-Mobile’s part, the mobile provider does have an explicit contract with Microsoft’s Danger subsidiary.

Writing for Mobile Crunch, Greg Kumparak reported that T-Mobile’s SLA (service level agreement) with Danger is believed to specify penalties of around 87 cents per day, per user, any time availability dips below 99.5% (that allows less than two days of unscheduled downtime per year, 3.6 hours within a month, or 50 minutes in a week).

For T-Mobile’s million Sidekick users, that could add up to $870,000 per day over weeks of service outage. That figure does not include any additional penalties for falling through multiple availability thresholds during Microsoft’s extended problems, the business T-Mobile lost after suspending its Sidekick sales, or the damage to its reputation as a service provider.
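
To make that arithmetic concrete, here is a rough sketch of the penalty math using only the reported figures (about 87 cents per user per day, one million subscribers). The function name and the sample outage lengths are illustrative assumptions; the actual SLA terms have not been published.

```python
# Reported figures from the article; the real contract terms are not public.
PENALTY_CENTS_PER_USER_DAY = 87
SUBSCRIBERS = 1_000_000


def sla_penalty_dollars(subscribers: int, cents_per_user_day: int, outage_days: int) -> float:
    """Return the total penalty in dollars for an outage of the given length.

    The total is accumulated in whole cents so the result is not affected
    by floating-point rounding of the per-user rate.
    """
    total_cents = subscribers * cents_per_user_day * outage_days
    return total_cents / 100


if __name__ == "__main__":
    # Outage lengths below are hypothetical examples, not reported durations.
    for days in (1, 7, 14):
        penalty = sla_penalty_dollars(SUBSCRIBERS, PENALTY_CENTS_PER_USER_DAY, days)
        print(f"{days:>2} day(s): ${penalty:,.0f}")
```

At the reported rate, a single day works out to the $870,000 cited above, and the total scales linearly with each additional day below the availability threshold.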

Microsoft has advertised “five nines” availability for its own servers, which means 99.999% uptime, a standard that only allows for 5.26 minutes of unscheduled downtime within a year. Providing such “high availability” requires multiple redundant servers and highly resilient shared storage systems, and of course, appropriate backups.
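
The downtime budgets quoted for 99.5% and 99.999% availability follow from the same simple calculation. The sketch below assumes a 365-day year and a 30-day month; the helper name is made up for illustration.

```python
HOURS_PER_DAY = 24
MINUTES_PER_HOUR = 60


def allowed_downtime_hours(availability_pct: float, period_days: float) -> float:
    """Hours of downtime permitted in a period at a given availability percentage."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * HOURS_PER_DAY


if __name__ == "__main__":
    # The 99.5% threshold reported for the T-Mobile/Danger SLA.
    print("99.5% per year (days):   ", allowed_downtime_hours(99.5, 365) / HOURS_PER_DAY)
    print("99.5% per month (hours): ", allowed_downtime_hours(99.5, 30))
    print("99.5% per week (minutes):", allowed_downtime_hours(99.5, 7) * MINUTES_PER_HOUR)
    # "Five nines" availability.
    print("99.999% per year (min):  ", allowed_downtime_hours(99.999, 365) * MINUTES_PER_HOUR)
```

Each added nine cuts the allowance by a factor of ten, which is why five nines leaves only minutes per year where 99.5% leaves nearly two days.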

12 comments

1 deemery { 10.21.09 at 4:25 pm }

Five 9′s normally means you have to have a hot stand-by and automatic failover switching. For the “normal computer”, availability of Five 9′s means one, maybe two, reboots/year; you get about 5 1/2 minutes of downtime to meet this requirement…

2 Sun, Oracle save Microsoft’s Pink after Danger data disaster — RoughlyDrafted Magazine « Firesaw { 10.21.09 at 5:43 pm }

[...] Microsoft has announced the restoration of Sidekick users’ contacts as the first milestone in recovering data it lost in the cloud computing disaster affecting its Danger subsidiary, while a new source explains why the restoration was possible without a backup and why it is taking so long. via roughlydrafted.com [...]

3 Berend Schotanus { 10.22.09 at 2:05 am }

Apart from the obvious management blunders at Microsoft and the obligate call for making back-ups I think incident is also reason for reconsideration of the centralized “one-truth” data organization model.

When data quantities are becoming so vast that it takes “days” to make a back-up this should raise the question whether the working model is still feasible. Even when a healthy back-up had been in place it would have taken longer than five minutes to get the data back in place. And that’s a problem that existed before Microsoft took over.

4 The Mad Hatter { 10.22.09 at 4:56 am }

Microsoft has advertised “five nines” availability for its own servers, which means 99.999% uptime, a standard that only allows for 5.26 minutes of unscheduled downtime within a year.

Now I could see you getting that level of uptime with OS2, OSX, BSD, Solaris, or Linux. Getting that much up time with Windows would seem totally impossible to me. Unless the server wasn’t running anything (For those who haven’t suffered the joys of administering a Windows Server with Exchange installed, you don’t know what you are missing, and you don’t want to know).

5 db5 { 10.22.09 at 7:41 am }

Normally I’d gloat with you, but I had the same thing happen to me a few days ago- during a newly initiated time machine backup, OS X froze and I was presented with a disk03s error. I was able to save the data off the disk, but with all of the orphaned folders, it takes FOREVER to reconstruct a drive.

For me, it was just an unfortunate coincidence. For MS, however, it was playing with fire to the tune of million$.

6 Sun, Oracle, and Microsoft Roles | Boycott Novell { 10.22.09 at 2:22 pm }

[...] Roughly Drafted, which is another independent thinker like Groklaw, argues that Sun and Oracle actually saved Microsoft amid the Danger disaster, not caused it any trouble. From the analysis: [...]

7 uberVU - social comments { 10.22.09 at 4:54 pm }

Social comments and analytics for this post…

This post was mentioned on Twitter by DanielEran: New: Sun, Oracle save Microsoft’s Pink after Danger data disaster – http://tinyurl.com/yj7kgz7

8 SunnyGuy53 { 10.23.09 at 2:32 am }

I think Microsoft is closer to “nine fives” on their server availability.

Sunny Guy

9 The Mad Hatter { 10.23.09 at 4:41 pm }

Heh. Andrew Thomas says that A round of applause is in order for Microsoft support folks for recovering the Sidekick data, and he’s blaming everyone except Microsoft for the problem. Andrew is consistent – he should work for Gartner.

10 TimmyDax { 10.23.09 at 7:19 pm }

“93% of companies that lost their data center for 10 days or longer due to a disaster filed for bankruptcy within one year of the disaster. 50% of business that found themselves without data management for this same time period filed for bankruptcy immediately.” (Source: National Archives & Records Administration in Washington)

11 John Dvorak reverses entire career, says Microsoft should copy Apple — RoughlyDrafted Magazine { 10.26.09 at 9:48 am }

[...] Exclusive: Pink Danger leaks from Microsoft’s Windows Phone Microsoft’s Pink/Danger backup problem blamed on Roz Ho Sun, Oracle save Microsoft’s Pink after Danger data disaster [...]

12 Why Apple’s iPhone is still not coming to Verizon — RoughlyDrafted Magazine { 10.30.09 at 9:26 pm }

[...] Microsoft’s Danger SideKick data loss casts dark on cloud computing Microsoft’s Pink/Danger backup problem blamed on Roz Ho Sun, Oracle save Microsoft’s Pink after Danger data disaster [...]
