French Firm Apologizes After Electrical Problem and Software Bug Disrupt Service
French cloud computing and hosting giant OVH has apologized to customers after it suffered an outage that left many individuals unable to access websites, email accounts, online databases and other infrastructure.
OVH, which says it’s the world’s third largest infrastructure provider based on the number of physical servers it runs, on Thursday blamed the two-hour, 33-minute outage earlier that day on two separate causes: a software bug that cut off its optical fiber connectivity, as well as an electrical problem.
The cloud computing provider says the electrical problem occurred at its site in Strasbourg in eastern France, where it operates three data centers. It says that the recovery time for impacted services ranged from five minutes to three or four hours, and that its management system was tracking which customers were still experiencing outages so that it could restore them as quickly as possible.
Meanwhile, OVH says its optical network problem occurred at its Roubaix site in northern France, where it has seven data centers.
“We are sincerely sorry,” the company says in a customer support announcement. “We have just experienced 2 simultaneous and independent events that impacted all RBX customers between 8:15 a.m. [Central European Time] and 10:37 a.m. and all SBG customers between 7:15 a.m. and 11:15 a.m.”
The company says that it will be sending emails in the coming days to customers, detailing their exact outage time in light of service-level agreements.
In the case of the loss of fiber connectivity, OVH says it had to physically access the Roubaix site, disconnect cables, restart the system and then “conduct diagnostics with the equipment manufacturer” to identify the underlying problem.
The Roubaix site connects with six of OVH’s 33 points of presence – Frankfurt, Amsterdam, London and Brussels, plus two sites in Paris – meaning the outage had a knock-on effect across the OVH network.
“Attempts to reboot the system took a long time because each chassis needs 10 to 12 minutes to boot,” OVH says. “This is the main reason for the duration of the incident.”
The hosting provider says its IT team found that all of its optical transponder cards had lost their configuration. “This is clearly a software bug in the optical equipment,” it says. “The database with the configuration is saved three times and copied to two supervision cards. Despite all these security [measures], the database disappeared. We will work with the OEM to find the source of the problem and help fix the bug.”
Unfortunately for OVH, its last tweet before the outage occurred was a pitch that began “Tired of managing, updating & monitoring your databases?” and offered to do it for customers. Thanks to the outage, however, no one’s databases or hosting were available, as numerous customers made clear in response to the tweet.
Tired of managing, updating & monitoring your #databases? Our #DBaaS solutions will do it for you https://t.co/WGh05PLq5c pic.twitter.com/yt7jj1KOBz
— OVH (@OVH) November 8, 2017
But failures happen. Indeed, for many of its services, OVH promises 99.9 percent uptime, which on an annualized basis works out to nearly nine hours of permitted downtime.
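The downtime implied by an uptime promise is straightforward arithmetic: the allowed downtime is the complement of the uptime percentage applied to the period. A minimal sketch in Python (the 99.9 percent figure is OVH’s; the helper function is illustrative, not anything OVH publishes):

```python
def allowed_downtime_hours(uptime_pct, period_hours=365.25 * 24):
    """Downtime permitted over a period (default: one year) by an uptime SLA."""
    return (1 - uptime_pct / 100) * period_hours

# A 99.9 percent annual SLA permits roughly 8.8 hours of downtime per year.
print(round(allowed_downtime_hours(99.9), 1))  # → 8.8
```

By the same arithmetic, the roughly four-hour window some Strasbourg customers experienced would by itself consume about half of a year’s 99.9 percent budget.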
OVH has also received plaudits for the speed, completeness and transparency of its response to the outage from some technology experts, such as Kauto Huopio, a specialist at the Finnish Communications Regulatory Authority.
Very good report from #OVH on their DWDM infrastructure issue this morning. Quote: “In the business of providing cloud infrastructures, only those that are paranoid last.” https://t.co/C8p1S5JGYA (surprisingly good #GoogleTranslate work btw) #ovhdown
— Kauto Huopio (@kautoh) November 9, 2017
Greater Paranoia Promised
OVH says that any downtime at all is evidence that it failed to be paranoid enough about how it might fail.
“In the business of providing cloud infrastructure, only those that are paranoid last,” OVH says in its customer alert. “The quality of service is a consequence of two elements. All incidents must be anticipated ‘by design’ … and we must learn from our mistakes. This incident leads us to raise the bar even higher to approach zero risk.”
In an effort to be more paranoid, OVH says it’s accelerated plans to create two optical node systems instead of the one that it currently has. By having two, it says there will be two separate databases, so if there’s a repeat of the database-configuration loss, only one system will crash. “This is one of the projects we started one month ago; the chassis have been ordered and we will receive them in the coming days,” it says. “We can start the configuration and migration work in two weeks. Given today’s incident, this project is becoming a priority for all of our infrastructure.”
“It’s much nicer to write these kinds of reports when you can say that you have already ordered new [hardware] to prevent these issues from happening again,” says Finnish security professional Aleksi Manninen via Twitter.