Sun, Oracle save Microsoft’s Pink after Danger data disaster
October 21st, 2009
Daniel Eran Dilger
Microsoft has announced the restoration of Sidekick users’ contacts as the first milestone in recovering data it lost in the cloud computing disaster affecting its Danger subsidiary, while a new source explains why the restoration was possible without a backup and why it is taking so long.
A source familiar with Sun SAN hardware used in the Danger datacenter has provided AppleInsider with additional insight explaining why it is taking Microsoft weeks to recover its users’ lost data after initial reports stated that the data was completely lost and that no suitable backup existed.
Microsoft’s problems began at the beginning of the month, when the cloud servers it operates under contract to T-Mobile began falling offline. It was initially announced that large amounts of T-Mobile’s Sidekick subscribers’ data had been lost and that no backup existed for the user data, which was stored entirely on Microsoft’s servers. (Sidekick devices are not designed to be backed up locally in the same way the iPhone backs itself up to iTunes on the user’s desktop computer.)
On October 6, T-Mobile issued a statement saying, “Regrettably, based on Microsoft/Danger’s latest recovery assessment of their systems, we must now inform you that personal information stored on your device – such as contacts, calendar entries, to-do lists or photos – that is no longer on your Sidekick almost certainly has been lost as a result of a server failure at Microsoft/Danger.”
On October 15, two weeks after the problems began, Roz Ho, Microsoft’s vice president of Premium Mobile Experiences, issued an apology for the outage. She announced that, contrary to initial reports that all the data was permanently lost, the company now expected to recover most of it, although the recovery effort would take some time.
This week, on October 20, T-Mobile announced the availability of users’ restored contact data dating back to October 1. It provided subscribers with instructions on how to restore their data, merge the restored contacts with the information they currently had, or just ignore the restored data and continue using the contact information currently on their devices.
In addition to contact data, the announcement also said, “We’re making solid progress on the next phase in this restoration process, including your photographs, notes, to-do lists, marketplace data and high scores. We appreciate your ongoing patience.”
Why the data recovery takes so long
Speaking as a “Sun technical guru who has been pulled into similar scenarios in the past,” a source has explained that in circumstances like this one, where Microsoft and its IT services contractor Hitachi were facing lost data and did not have a conventional backup to restore, “[SAN storage vendor] Sun and [database vendor] Oracle have sent in their best people, and they are stitching the database back together for Microsoft, and they have a good estimate of what data is recoverable.”
“It will take several days to actually get the database back up,” the source noted, echoing earlier reports that it took six days just to create a normal full backup of the data. The time and storage resources involved in backing up such a tremendous amount of data were cited as the reason why Microsoft’s Roz Ho reportedly instructed Danger employees, over their objections, to proceed with work without a full backup in place. Sources say she had been assured by Hitachi that a full backup was not necessary.
Salvaging the damaged data storage without a real backup in place takes even longer, the Sun storage expert explained. “The first thing to do is wheel in a big pile of new disk space, and copy the individual disks so there is a raw backup. This is like making a copy of a jigsaw puzzle one piece at a time. Then they would assemble the puzzle using the copied pieces, in case any pieces need to be re-made from the original.
“This is very hard, requires detailed inside knowledge of how SAN addresses and volume manager layouts fit together with Oracle tables. Finally, they need to start up the database on top of the assembled puzzle, and Oracle will do its own clean up to get into a consistent state.
“The next thing you do is a fresh backup (several days), before you allow any users access to it. So it’s not surprising that this would take over a week, even after it was possible to say that the data is recoverable.”
Cause of the datacenter problems still secret
While the recovery effort is being delayed and complicated by the lack of an external backup, Microsoft remains quiet about the cause of the incident. After AppleInsider first reported that insiders were blaming the failure on either an aggressive and poorly orchestrated upgrade or possibly even deliberate sabotage by a disgruntled employee, CNET “Eye on Microsoft” reporter Mary Jo Foley stated, “I’ve also heard that foul play has not been ruled out because the failure was so catastrophic and seemingly deliberate. Microsoft is supposedly continuing to do a full investigation.”
Whether the incident was the result of an accident or a malicious attack, Microsoft has learned an important lesson that eventually hits everyone in the world of computing: never work without a backup. In last week’s apology to Sidekick users, Ho wrote, “we have made changes to improve the overall stability of the Sidekick service and initiated a more resilient backup process to ensure that the integrity of our database backups is maintained.”
The company has also (understandably) worked to distance itself from the high-profile datacenter disaster by describing its Danger operations as running non-Microsoft technology, specifically associating Sun and Oracle with the incident. Somewhat ironically, Microsoft’s capacity to recover most of its Sidekick users’ data is entirely due to the availability of Sun and Oracle experts and the inherent resilience of those companies’ products to disasters of any kind, even in cases where customers do not maintain proper backups of their data.
Recovery begins amid lawsuits
While at least some of the lost data has now been recovered, T-Mobile continues to list all Sidekick products as “temporarily out of stock” on its website, and multiple lawsuits have already been filed by users. A report by CNET “Beyond Binary” columnist Ina Fried cited one attorney’s complaint as stating, “T-Mobile and its service providers ought to have been more careful [in] the use of backup technology and policies to prevent such data loss.”
A second attorney pursuing a case against both parties wrote, “T-Mobile and Microsoft promised to safeguard the most important data their customers possess and then apparently failed to follow even the most basic data protection principles. What they did is unthinkable in this day and age.”
The suit added, “Further complicating the data loss is the fact that Sidekicks, unlike iPhones, BlackBerrys and other smartphones, are not designed to sync locally with a user’s personal computer without additional software and hardware. This means that most users were not able to backup their data locally, but were encouraged and required to rely on Microsoft/Danger.”
Last week, T-Mobile volunteered a peace offering to its affected users in the form of a $100 gift card and a month of free data service. However, given the deep pockets behind the event, the company’s million Sidekick subscribers are likely to be looking for a bigger settlement. And while subscribers don’t have an ironclad contract specifying damages in compensation for service outages on T-Mobile’s part, the mobile provider does have an explicit contract with Microsoft’s Danger subsidiary.
Writing for Mobile Crunch, Greg Kumparak reported that T-Mobile’s SLA (service level agreement) with Danger is believed to specify penalties of around 87 cents per day, per user, any time availability dips below 99.5% (that allows less than two days of unscheduled downtime per year, 3.6 hours within a month, or 50 minutes in a week).
For T-Mobile’s million Sidekick users, that could add up to $870,000 per day over weeks of service outage. That figure does not include any substantial additional penalties for falling through multiple availability thresholds during Microsoft’s extended problems, the business T-Mobile lost after suspending its Sidekick sales, or the damage to its reputation as a service provider.
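The arithmetic behind those SLA figures can be checked with a short sketch. It uses only the numbers reported above (87 cents per user per day, a million subscribers, a 99.5% threshold); the actual contract terms are not public, and the function names, the 30-day month and the 14-day outage length are assumptions made purely for illustration.

```python
# Figures as reported in the article; the real SLA terms are unconfirmed.
PENALTY_PER_USER_PER_DAY = 0.87   # dollars, per the Mobile Crunch report
SUBSCRIBERS = 1_000_000           # approximate Sidekick user base
SLA_AVAILABILITY = 0.995          # reported 99.5% threshold

# Period lengths in hours (a 30-day month is an assumption).
PERIOD_HOURS = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24}


def allowed_downtime_hours(availability: float, period_hours: float) -> float:
    """Hours of downtime a period can absorb before availability is breached."""
    return (1.0 - availability) * period_hours


def daily_penalty(users: int, rate_per_user: float) -> float:
    """Total penalty in dollars for one day below the threshold."""
    return users * rate_per_user


def outage_penalty(users: int, rate_per_user: float, days: int) -> float:
    """Total penalty in dollars for an outage lasting the given number of days."""
    return daily_penalty(users, rate_per_user) * days


if __name__ == "__main__":
    for name, hours in PERIOD_HOURS.items():
        budget = allowed_downtime_hours(SLA_AVAILABILITY, hours)
        print(f"Downtime allowed per {name}: {budget:.2f} hours")

    print(f"Penalty per day: ${daily_penalty(SUBSCRIBERS, PENALTY_PER_USER_PER_DAY):,.0f}")

    # Hypothetical two-week outage, chosen only to illustrate the scale.
    total = outage_penalty(SUBSCRIBERS, PENALTY_PER_USER_PER_DAY, 14)
    print(f"Penalty over 14 days: ${total:,.0f}")
```

The sketch treats the penalty as a flat daily rate; any tiered penalties for deeper availability failures would come on top of it.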
Microsoft has advertised “five nines” availability for its own servers, which means 99.999% uptime, a standard that only allows for 5.26 minutes of unscheduled downtime within a year. Providing such “high availability” requires multiple redundant servers and highly resilient shared storage systems, and of course, appropriate backups.
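To put the “five nines” figure in context, the following sketch computes the yearly downtime budget for several availability levels. It assumes a 365-day year, and the function name is illustrative rather than anything taken from Microsoft’s own material.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (5 means 99.999%)."""
    # N nines leaves a downtime fraction of 1 / 10**N.
    return MINUTES_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in range(2, 6):
        availability = 100 * (1 - 10 ** -n)
        budget = downtime_minutes_per_year(n)
        print(f"{n} nines ({availability:.3f}%): {budget:,.2f} minutes per year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why the reported 99.5% threshold in the Danger SLA is far more forgiving than a five nines target.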