Ah Melbourne, you’re quite the town. After spending a weekend visiting you and soaking myself deep in your culture I’ve come to miss your delicious cuisine and exquisite coffee now that I’m back at my Canberran cubicle, but the memories of the trip still burn vividly in my mind. From the various pubs I frequented with my closest friends to perusing the wares of the Queen Victoria markets I just can’t get enough of your charm and, university admissions willing, I’ll be making you my home sometime next year. The trip was not without its dramas, however, and none was more far-reaching than my attempt to depart Melbourne via my airline of choice: Virgin Blue.
Whilst indulging in a few good pizzas and countless pints of Bimbo Blonde we discovered that Virgin Blue was having problems checking people in, forcing them to resort to manual check-ins. At the time I didn’t think it was such a big deal since initial reports hadn’t yet mentioned any flights actually being cancelled and my flight wasn’t scheduled to leave until 9:30PM that night. So we continued to indulge ourselves in the Melbourne life as was our wont, cheerfully throwing our cares to the wind and ordering another round.
Things started to go all pear-shaped when I thought I’d better check up on the situation and put a call in to the customer care hotline to see what the deal was. My first attempt was stonewalled by an automated response stating that they weren’t taking any calls due to the large volume of people trying to get through. I managed to get into a queue about 30 minutes later and even then I was on the phone for almost an hour before getting through. My attempts to get solid information out of them were met with the same response: “You have to go to the airport and then work it out from there”. Luckily for me and my travelling compatriots it was a public holiday on Monday, so a delay, whilst annoying, wouldn’t be too devastating. We decided to proceed to the airport, and what I saw there was chaos on a new level.
The Virgin check-in terminals were swamped with hundreds of passengers, all of them in varying states of disarray and anger. Attempts to get information out of the staff wandering around were usually met with reassurance and directions to keep checking the information board whilst listening for announcements. On the way over I’d managed to work out that our flight wasn’t on the cancelled list, so we were in with a chance, but seeing the sea of people hovering around the terminal didn’t give us much hope. After grabbing a quick dinner and sitting around for a while our flight number was called for manual check-in and we lined up to get ourselves on the flight. You could see why so many flights had to be cancelled: boarding that one flight manually took well over an hour, and that wasn’t even a full flight of passengers. Four hours after arriving at the airport we were safe and sound in Canberra, which I unfortunately can’t say for the majority of people who chose Virgin as their carrier that day.
Throughout the whole experience all the blame was being squarely aimed at a failure in the IT system that took out their client facing check-in and online booking systems. Knowing a bit about mission critical infrastructure I wondered how a single failure could take out a system like this, one that costs them millions in lost business and compensation whenever it goes down. Going through it logically I came to the conclusion that it had to be some kind of human failure that managed to wipe some critical shared infrastructure, probably a SAN that was live replicating to its disaster recovery site. I mean, anything that has the potential to cause that much drama must have a recovery time of less than a couple of hours, and it had been almost 12 hours since we first heard the reports of it being down.
As it turns out I was pretty far off the mark. Virgin just recently released an initial report of what happened and, although it’s scant on the details, what we’ve got to go on is quite interesting:
At 0800 (AEST) yesterday the solid state disk server infrastructure used to host Virgin Blue failed resulting in the outage of our guest facing service technology systems.
We are advised by Navitaire that while they were able to isolate the point of failure to the device in question relatively quickly, an initial decision to seek to repair the device proved less than fruitful and also contributed to the delay in initiating a cutover to a contingency hardware platform.
The service agreement Virgin Blue has with Navitaire requires any mission critical system outages to be remedied within a short period of time. This did not happen in this instance. We did get our check-in and online booking systems operational again by just after 0500 (AEST) today.
Navitaire are a subsidiary of Accenture, one of the largest suppliers of IT outsourcing in the world with over 177,000 employees worldwide and almost $22 billion in revenue. Having worked for one of their competitors (Unisys) for a while I know no large contract like this goes through without some kind of Service Level Agreement (SLA) in place which dictates certain metrics and their penalties should they not be met. Virgin has said that they will be seeking compensation for the blunder but to their credit they were more focused on getting their passengers sorted first before playing the blame game with Navitaire.
Still as a veteran IT administrator I can’t help but look at this disaster and wonder how it could have been avoided. A disk failure in a server is common enough that your servers are usually built around the idea of at least one of them failing. Additionally if this was based on shared storage there would have been several spare disks ready to take over in the event that one or more failed. Taking this all into consideration it appears that Navitaire had a single point of failure in the client facing parts of the system they had for Virgin and a disaster recovery process that hadn’t been tested prior to this event. All of these coalesced into an outage that lasted 21 hours when most mission critical systems like that wouldn’t tolerate anything more than 4.
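Just to put some numbers on that, here’s a rough back-of-the-envelope sketch comparing the outage against an assumed recovery time objective. The 21 hour figure comes from Virgin’s own statement (down at 0800 AEST, back just after 0500 the next day); the 4 hour RTO is purely my assumption about what a mission critical SLA would normally demand, not a published Navitaire number.

```python
# Rough outage-versus-RTO arithmetic. The ~21 hour outage is taken from
# Virgin's statement; the 4 hour recovery time objective (RTO) is my own
# assumption about a typical mission critical SLA, not a Navitaire figure.
from datetime import timedelta

outage = timedelta(hours=21)          # 0800 AEST failure to ~0500 the next day
assumed_rto = timedelta(hours=4)      # assumed SLA recovery time objective

print(f"Outage:      {outage}")
print(f"Assumed RTO: {assumed_rto}")
print(f"Over by:     {outage - assumed_rto}")  # roughly 17 hours past the assumed RTO
```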
Originally I had thought that Virgin had all their IT systems in-house, and this kind of outage seemed like pure incompetence. However upon learning about their outsourced arrangement I know exactly why this happened: profit. In an outsourced arrangement you’re always pressured to deliver exactly to the client’s SLAs whilst keeping your costs to a minimum, thereby maximising profit. Navitaire is no different, and their cost saving measures meant that a failure in one place and a lack of verification testing in another led to a massive outage for one of their big clients. Their other clients weren’t affected because they likely have independent systems for each client, but I’d hazard a guess that all of them are at least partially vulnerable to the same kind of outage that affected Virgin on the weekend.
In the end Virgin did handle the situation well all things considered, opting to take care of their customers first rather than pointing fingers right from the start. To their credit all the airport staff and plane crew stayed calm and collected throughout the ordeal, and apart from the delayed check-in there was little difference between my flight down and the one back up. Hopefully this will trigger a review of their disaster recovery processes and end up with a more robust system not only for Virgin but for all of Navitaire’s customers. It won’t mean much to us as customers, since if that does happen we won’t notice anything, but it does mean that future outages shouldn’t have as big an impact as the one over the weekend that just went by.
“…also contributed to the delay in initiating a cutover to a contingency hardware platform” – that’s the money quote right there.
1) they underestimated the time to restore a critical component of the primary platform
2) due to resourcing or pig-headedness they did not have (enough of the right) people bringing up the secondary platform as soon as the primary failed
3) their secondary platform took more than 4 hours to bring up, otherwise they would have had at most an 8 hour outage (see the sketch after this list)
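To illustrate points 2 and 3, here’s a toy comparison of the two recovery strategies. Every duration in it is a number I’ve plugged in for illustration; the only things we actually know are the rough 21 hour total and that the cutover wasn’t started straight away.

```python
# Toy comparison of "repair first, then cut over" versus starting the
# secondary platform the moment the primary fails. Every duration here
# is an assumed, illustrative figure, not anything Navitaire has released.

def repair_then_cutover(repair_attempt_h: float, cutover_h: float) -> float:
    """Spend time trying to fix the primary, give up, then cut over."""
    return repair_attempt_h + cutover_h

def cutover_in_parallel(repair_attempt_h: float, cutover_h: float) -> float:
    """Bring the secondary up alongside the repair; whichever finishes first wins."""
    return min(repair_attempt_h, cutover_h)

repair_attempt_h = 16.0   # assumed: time sunk into repairing the failed device
cutover_h = 5.0           # assumed: time to bring the contingency platform up

print(f"Repair first: ~{repair_then_cutover(repair_attempt_h, cutover_h):.0f} h outage")
print(f"In parallel:  ~{cutover_in_parallel(repair_attempt_h, cutover_h):.0f} h outage")
```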
On the face of it, it looks like Navitaire dropped the ball, but I’ve done enough systems admin to know that it’s not that clear cut. In such complex systems, simple failures often ripple their way out to places far, far away from the source. It’s very rarely as simple as replacing the busted part and bringing everything back on in the right order; it often requires many subsystems to be checked, fixed and rechecked, all of which eats into your precious restore time.
That’s where the secondary comes in. Once a catastrophic failure has hit a system, evaluating the time to fix is critical and is incredibly difficult for a sysadmin to judge. Sysadmins often get blinkers on and expect that the error at the top of their stack is the last one they need to fix to restore the service; it takes a business continuity expert to look at the failures and the design, see how many errors may be lurking underneath, and make the call to bring the secondary online. I guess Navitaire’s BC expert was on holiday at the time.
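A crude version of that call might look like the sketch below. The padding factor stands in for the errors still lurking underneath the one at the top of the stack; it and the example numbers are inventions of mine, not anything Navitaire actually uses.

```python
# A minimal sketch of the failover decision rule described above: inflate
# the sysadmin's optimistic repair estimate to account for the failures
# still hidden underneath, and cut over once that pessimistic estimate
# exceeds the time to bring the secondary online. The 2.5x padding factor
# and the example numbers are assumptions for illustration only.

def should_fail_over(estimated_repair_h: float,
                     secondary_bringup_h: float,
                     uncertainty_factor: float = 2.5) -> bool:
    """Cut over when a padded repair estimate beats the secondary bring-up time."""
    pessimistic_repair_h = estimated_repair_h * uncertainty_factor
    return pessimistic_repair_h > secondary_bringup_h

# The blinkered view of "two more hours and we're back" still argues for
# cutting over once you allow for what else might be broken.
print(should_fail_over(estimated_repair_h=2.0, secondary_bringup_h=4.0))  # True
```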
It’s quite possible that the cutover to the secondary platform was delayed by Virgin due to stale data, reduced features or reduced performance, any or all of which would have been cost cutting measures written into the SLA. The other possibility of course is that the secondary platform was broken to start with.
Indeed there are precious few details that have been (and probably ever will be) released to the public, so speculation will be rife as to exactly what caused the failure and the subsequent delays in righting themselves. Still, it looks like Navitaire weren’t prepared for this particular kind of failure, mostly because I don’t believe they’ve actually had a comparable experience. A quick search of news articles mentioning the company appears to confirm this, save for a couple of articles about fare prices being out of whack.
You hit on some good points too about how fixing a failed system isn’t always about just replacing the failed component. With any large system it becomes quite a task to ensure that each and every part is functioning correctly. It’s also unusual for the sysadmin to be able to make the call to fail over to DR, as that traditionally falls to the business. There’s also the possibility that the failure managed to take out your disaster recovery as well, depending on your setup.
It still looks like, whilst they had a secondary system, either the failover process or the facility itself hadn’t been functionally tested. I’ve worked on many systems where, whilst I could confirm that all the data was safe, there had been no tests to see whether the backup systems were capable of handling the production environment. Additionally I’ve seen many great, well tested DR systems fail just because the process was unclear and the failover was done incorrectly.
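The kind of functional test I mean doesn’t have to be elaborate either. Here’s a minimal sketch of a scheduled DR drill; every check function and the standby hostname in it are hypothetical placeholders rather than anything from Navitaire’s or Virgin’s actual environments.

```python
# A minimal sketch of a scheduled DR drill: don't just confirm the data
# replicated, actually exercise the standby end to end. Every check here
# is a hypothetical placeholder to be filled in with real tests.
import sys

def data_is_consistent(standby: str) -> bool:
    """Placeholder: compare checksums/row counts between production and standby."""
    return True

def standby_serves_transactions(standby: str) -> bool:
    """Placeholder: run a synthetic booking or check-in against the standby."""
    return True

def standby_handles_production_load(standby: str) -> bool:
    """Placeholder: replay a slice of production traffic at the standby."""
    return True

def run_dr_drill(standby: str = "dr-standby.example.internal") -> int:
    checks = [data_is_consistent, standby_serves_transactions,
              standby_handles_production_load]
    failures = [check.__name__ for check in checks if not check(standby)]
    if failures:
        print("DR drill FAILED: " + ", ".join(failures))
        return 1
    print("DR drill passed: standby verified against a production-like workload")
    return 0

if __name__ == "__main__":
    sys.exit(run_dr_drill())
```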
This lesson in preparedness has cost Virgin and Navitaire a huge amount of money and reputation. I really hope the PR teams of Navitaire and Virgin get together and go the full disclosure route of explaining the outage. A mud slinging match isn’t going to do anyone any good. Explaining what went wrong, what they’ve learned and reassuring both companies’ customers that procedures have changed is a much better approach and works great for the tech companies that do it.
User errors are great at finding their way into backup systems, which is why backups often lag a few hours or days behind production. The cost of losing a few hours’ worth of data can often outweigh the cost of a sustained outage. That’s possibly what’s happened here, and if that is the case then either the SLA is broken or the design of the redundancy is broken.
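That trade-off is simple enough to sketch out. All the dollar figures below are invented purely to show the shape of the comparison; nobody outside Virgin and Navitaire knows the real numbers.

```python
# A rough sketch of the trade-off described above: the cost of losing the
# transactions made during the replication lag versus the cost of staying
# down. Every figure here is an invented assumption for illustration.

def cost_of_stale_data(lag_hours: float, cost_per_lost_hour: float) -> float:
    """Cost of re-entering or losing the bookings made during the lag."""
    return lag_hours * cost_per_lost_hour

def cost_of_outage(outage_hours: float, cost_per_down_hour: float) -> float:
    """Lost sales, compensation and goodwill while the system is down."""
    return outage_hours * cost_per_down_hour

# Assumed figures: a 4 hour replication lag versus the ~21 hour outage.
print(f"Accepting stale data:  ${cost_of_stale_data(4, 50_000):,.0f}")
print(f"Riding out the outage: ${cost_of_outage(21, 200_000):,.0f}")
```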
Every sysadmin, at some point in their career, will face the problem of trying to convince the business that not testing their DRP will cost them far more than testing the DRP. Live tests of DRPs just aren’t sexy so they don’t get done. It’s wishful thinking, but maybe Branson is in the position to change that.