Fremont outage
Cheers.
55 Replies
Well done, Linode team.
I have to say- every other host I have ever used would have had, at minimum, 48 hours of downtime from this. Having everything back up and running in 6 hours after a catastrophic failure is impressive.
@JasonTokoph:
No phone call
Where's Miss Hathaway when you need her.
I think it's pretty naive to think that a low cost VPS service is going to call you personally with every hiccup.
Look into monitoring apps (locally run or outsourced).
@JasonTokoph:
While it's nice that the pseudo-redundancy was there, I'm still disappointed that I had to learn of the outage by searching Twitter for "linode". No email, no phone call, not even an @linode tweet. Unless I'm blind, there isn't a link to the status page on the main linode.com site.
Yeah, with something like this where the affected Linodes should be pretty clearly identified, having an email go out to their owners seems like it would be a good feature, providing it was internally simple enough so it didn't distract staff from actually addressing the problem.
With that said, if you need to be aware of outages for your node, I'd strongly suggest doing some monitoring (either yourself or with any of the free and/or paid online services). There are a number of problems - especially network related - that can arise for which Linode can't necessarily know for sure which nodes are affected, not to mention more localized issues. I know for my Linodes that I need to stay up, I monitor all the services I care about myself, and it's always my own monitoring that triggers me to go look at status.linode.com to see if it's a more general problem.
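Something along these lines is all it takes for a basic external check - just a sketch, with the URL and details as placeholders to adapt:

```python
#!/usr/bin/env python3
"""Minimal availability check, meant to run from cron on a machine that is
NOT the Linode being watched. The URL is a placeholder - point it at
something your Linode serves."""

import sys
import urllib.request

URL = "http://www.example.com/"   # placeholder
TIMEOUT = 10                      # seconds before the check counts as failed

def main() -> int:
    try:
        # urlopen raises URLError/HTTPError (both OSError subclasses) on DNS
        # failures, refused connections, timeouts and HTTP error statuses.
        with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
            resp.read(64)         # make sure we actually got some bytes back
    except OSError as exc:
        print(f"ALERT: {URL} appears down: {exc}")
        return 1
    return 0                      # silent on success, so cron only mails failures

if __name__ == "__main__":
    sys.exit(main())
```

Run it every few minutes from cron on a box outside the datacenter; cron mails any output to the crontab's MAILTO (or its owner), so you only hear about failures.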
I'd also suggest throwing in a boot-time cron entry on your Linode to spit out a message to you any time it boots, just as a heads-up.
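A sketch of what that can look like (the address is a placeholder, and it assumes the node can send outbound mail via mail(1) from mailx/mailutils):

```
# In the Linode's own crontab (crontab -e). @reboot runs once each time the
# system comes up; the pipe assumes a working local MTA or smarthost relay.
@reboot echo "$(hostname) booted at $(date)" | mail -s "$(hostname) rebooted" you@example.com
```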
– David
Glad to hear they had hot spares ready to go for servers and got things fixed up for clients.
From my perspective, the next few days/weeks are the interesting part… figuring out the causes, the contributing factors, and all that stuff. I live for the RFO.
@db3l:
With that said, if you need to be aware of outages for your node, I'd strongly suggest doing some monitoring (either yourself or with any of the free and/or paid online services).
Don't get me wrong. I have monitors in place for all of my systems, Linodes or otherwise. These monitors kicked off alerts that I needed. The issue is that I had no idea why the alerts were being generated.
Also, while I didn't expect a phone call, I did list it along with other forms of communication as I didn't receive anything.
It's even worse that I have yet to receive any form of communication from Linode regarding the outage. Fine, use 100% of your resources to fix the issue, but once it's fixed, affected customers should receive some sort of notification.
I'll be happy when a post mortem is in my inbox along with an apology for lack of communication.
@JasonTokoph:
I'll be happy when a post mortem is in my inbox along with an apology for lack of communication.
You're kidding, right?
They have an SLA, which they didn't break.
They have a Status page - that pretty much tells you what's going on.
They sell an UNMANAGED service.
Unless you have a few DOZEN HIGH-END VPSs that were affected, I think you're vastly overreacting.
They were down, they fixed it, they reported it, it's a rare event, get over it already.
@JasonTokoph:
Don't get me wrong. I have monitors in place for all of my systems, Linodes or otherwise. These monitors kicked off alerts that I needed. The issue is that I had no idea why the alerts were being generated.
Oh, sorry - I got the wrong impression from your earlier phrasing, which made it sound like you didn't have anything yourself to let you notice the outage even on your own Linodes.
For what it's worth, when my monitors fire, and after the obvious stuff like issues on the monitor's own connection or the network, first thing I do is check status.linode.com, and then if there is nothing there, get on IRC and lurk a bit or ask about it. I'm not a tweeter, so I rarely look there. Typically one of the two will give me the low down fairly quickly (IRC in particular gets busier during the larger outages, and at a minimum you can probably find out if you aren't alone), but I'm willing to accept some latency also since I may have actually noticed things myself first or the issue may be small enough not to prompt a status update.
Also, the good news is that now you know about status.linode.com for the future. I agree some additional links to it couldn't hurt, though it's a fairly common naming convention among providers in the last few years, so not a bad name to try by default. Also, depending on the outage, the main Linode site may not itself respond - e.g., if the recent outage had been in Texas - so depending on a link from the main site won't always work (status.linode.com is hosted elsewhere).
> I'll be happy when a post mortem is in my inbox along with an apology for lack of communication.
Wouldn't necessarily wait for the inbox, but I'm sure some summary information will be posted or made available when possible. For example
-- David
@Jay3ld:
I hope they really look into why the UPS system failed. I was shocked to hear a power outage would take down a hosting company.
You must be new to the internet
@bryantrv:
I use the rss feed from
http://status.linode.com/
+1. This is hosted at typepad.com, in a different datacenter to all Linodes.
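If you'd rather not eyeball the feed by hand, a small poll-and-report script run from cron does the job. A rough sketch using the feedparser library - the feed URL below is a placeholder for whatever RSS/Atom URL the status page actually advertises, and the state-file path is arbitrary:

```python
#!/usr/bin/env python3
"""Print status-feed entries we haven't seen before. Sketch only: FEED_URL
and SEEN_FILE are placeholders to adjust."""

import feedparser   # third-party: pip install feedparser

FEED_URL = "http://status.linode.com/"     # placeholder - use the real feed URL
SEEN_FILE = "/var/tmp/linode-status.seen"  # remembers entries already reported

def load_seen() -> set:
    try:
        with open(SEEN_FILE) as f:
            return {line.strip() for line in f}
    except FileNotFoundError:
        return set()

def main() -> None:
    seen = load_seen()
    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        key = entry.get("id") or entry.get("link", "")
        if key and key not in seen:
            print(f"{entry.get('title', '(no title)')} - {entry.get('link', '')}")
    with open(SEEN_FILE, "w") as f:
        f.writelines(f"{entry.get('id') or entry.get('link', '')}\n"
                     for entry in feed.entries)

if __name__ == "__main__":
    main()
```

Cron mails whatever it prints, so new status entries land in your inbox without Linode having to send anything.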
Can this be related to the Fremont failure?
@Azathoth:
Anyone else in Newark seeing graphs that stopped updating until circa 17:30 UTC yesterday, on all their Linodes? The nodes were up and fine; I know through external monitoring.
Yes.
@Azathoth:
Can this be related to the Fremont failure?
I'm guessing that this was the case.
@vonskippy:
@JasonTokoph: I'll be happy when a post mortem is in my inbox along with an apology for lack of communication.
…They were down, they fixed it, they reported it, it's a rare event, get over it already.
It's completely reasonable to ask why something that shouldn't happen happened. It sounds like it wasn't their fault, and they recovered beautifully. But we have a fragmentary bit of info on the status blog, and I'd like to know how (or if) they will prevent this from happening in the future, or if this is the expected response and everything went as well as could be.
@JasonTokoph:
While it's nice that the pseudo-redundancy was there, I'm still disappointed that I had to learn of the outage by searching Twitter for "linode". No email, no phone call, not even an @linode tweet. Unless I'm blind, there isn't a link to the status page on the main linode.com site.
A phone call? Seriously?
@Jay3ld:
I hope they really look into why the UPS system failed. I was shocked to hear a power outage would take down a hosting company. I can understand now that I see the UPS failed. Still, the fact that the UPS systems failed says to me that something went wrong along the lines of making sure they were tested to ensure they worked.
It's entirely possible for this stuff to fail even with regular testing.
The reason it's so important for a company like Linode to be able to recover is because it simply isn't possible to completely protect yourself from failures like this. It happens.
Up and running before start of business the next day is an amazing effort, made possible by hard work from competent staff and planning ahead to have the spares available.
@Internat:
Email is useless as a status message medium… especially given that my frontend client access server for mail was one of the nodes that went down. Such is life though.
:)
I don't know - given how many people have smart devices on their person nowadays that can notify them of new email in real time, it still seems a perfectly good communications channel to support. Certainly likely to be the most ubiquitous.
Besides, I think it's safe to assume that if email status updates were offered, you'd want to put in an email address that was independent from the host for which you were getting status…
-- David
@db3l:
Besides, I think it's safe to assume that if email status updates were offered, you'd want to put in an email address that was independent from the host for which you were getting status…
-- David
I would never use a "mydomain" email address in dealing with my host- really for support, billing or what not.
Though Google handles "mydomain" email, and I'm using third party dns, so….
A dedicated Twitter account might be a good idea - like @linodealerts. I guess RSS is getting pretty old school.
@ArbitraryConstant:
@Jay3ld: I hope they really look into why the UPS system failed. I was shocked to hear a power outage would take down a hosting company. I can understand now that I see the UPS failed. Still, the fact that the UPS systems failed says to me that something went wrong along the lines of making sure they were tested to ensure they worked.
It's entirely possible for this stuff to fail even with regular testing.
I don't know that I'd be quite that forgiving or understanding, depending on what the actual underlying outage was. Modern data centers should be at least N+1 redundant, and no, I don't think it's unreasonable to expect failure scenarios to be regularly tested and to function correctly in the very scenarios they are designed to cope with. More often than not, there's an error of some sort involved.
With that said, the status mentioned a serious lightning storm, but not whether the data center "outage" came from a direct strike or just from losing utility power as an indirect result. The latter should be much easier to ride through without interruption, while a direct strike could well fry enough front-end components while dissipating the energy involved to be a problem, even with an otherwise robust design.
In the end this is all verbal bit twiddling until we get further analysis and a post-mortem, but I certainly wouldn't fault (nor discourage) anyone at this point from wondering why the redundancies and UPSes failed to function as intended. To me, it's especially interesting that equipment within the protected zone behind the UPSes got damaged, since that seems to imply a potential issue in design (such as an unprotected electrical path somewhere). But even that could in theory be explained by induced currents if the strike was close enough.
– David
Would be good to get a report of the investigation into what happened, but I wouldn't be pointing fingers until we actually know.
@jords:
I can remember something being said about a surge? If it was something like that it could have knocked out the redundant power systems. Power seems to be very, very hard to get complete redundancy in, since you are dealing with huge amounts of power in a data center.
The status post said lightning strike, which knocked out power and the "redundant" UPSes. I was in co-lo for nearly a decade, and spent a lot of time talking to them about power. Their setup, which is standard for Tier 1 data centers (where I presume from Linode's description they operate), requires that capacity is in place to provide continuous power through UPSes to all servers - that is, the power isn't straight from the mains, but conditioned through the UPS units, which are kept constantly charged. If the mains drop, the UPSes continue their function without interruption. UPSes have to be scaled to handle this, of course, and regular testing and battery swapouts.
At Tier 1 facilities, sufficient generator capacity is in place to operate indefinitely with fuel deliveries, and for some number of days with fuel on hand. Generators can sometimes kick on in a matter of seconds.
The fact that Linode's facility had primary (mains) and secondary (UPS) failures can be perfectly reasonable if a lightning strike overwhelmed systems. It sounds like the power conditioning prevented hardware from melting down, which is awesome.
Linode doesn't provide battery-backed RAID 10's, from their specs. Some VPS services do. It should be double-overkill, but this demonstrates that the unlikely happens.
@pundit:
Linode doesn't provide battery-backed RAID 10's, from their specs. Some VPS services do. It should be double-overkill, but this demonstrates that the unlikely happens.
I'm pretty sure the RAID cards in the Linode hosts do have BBUs*, but of course that really only ensures that data won't get lost in the cache; it doesn't provide additional runtime in the absence of power, which will shut down the host processor in any event.
– David
* A quick forum search only showed a status report for a host, but it did mention the card's BBU. I'm pretty sure I recall it being discussed in more general threads.
@pundit:
UPSes have to be scaled to handle this, of course, and regular testing and battery swapouts.
It might also be worth mentioning that the scaling isn't necessarily as bad as it may sound, since the job of the UPS is typically only to provide power long enough for the generator to spin up (done automatically on power loss) and its power output to stabilize. So the UPS needs enough capability for maybe 10-20 seconds plus whatever additional latency the DC establishes for the generator. Non-battery, flywheel UPS solutions are also becoming more popular I think in recent years with materials advances cutting down on the size/weight of the flywheel and definite advantages in terms of maintenance (up to 10-20 year cycle, and no batteries to replace).
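To put rough numbers on that bridging requirement (purely illustrative figures, not anything from Linode's facility):

```python
# Back-of-the-envelope UPS bridging energy. All numbers are invented for
# illustration; they are not figures from Linode or its datacenter.
load_kw = 500        # hypothetical critical IT load
bridge_s = 30        # ride-through target: generator start + output stabilizing

energy_kj = load_kw * bridge_s      # kW x s = kJ
energy_kwh = energy_kj / 3600.0     # 3600 kJ per kWh

print(f"{energy_kj:.0f} kJ (~{energy_kwh:.1f} kWh) bridges {bridge_s} s at {load_kw} kW")
# -> 15000 kJ (~4.2 kWh), versus ~83 kWh to carry the same load for ten
#    minutes - which is why a short-bridge store like a flywheel is feasible.
```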
– David
@db3l:
@pundit: UPSes have to be scaled to handle this, of course, and regular testing and battery swapouts.
It might also be worth mentioning that the scaling isn't necessarily as bad as it may sound, since the job of the UPS is typically only to provide power long enough for the generator to spin up (done automatically on power loss) and its power output to stabilize. So the UPS needs enough capability for maybe 10-20 seconds plus whatever additional latency the DC establishes for the generator. Non-battery, flywheel UPS solutions are also becoming more popular I think in recent years with materials advances cutting down on the size/weight of the flywheel and definite advantages in terms of maintenance (up to 10-20 year cycle, and no batteries to replace).– David
The problem is that if a genset fails to come fully online within the few seconds that the flywheel can provide power, then you're boned. From what I've seen, UPS installations can usually provide a few minutes of power, but I could be talking out of my ass here.
@db3l:
@pundit: Linode doesn't provide battery-backed RAID 10's, from their specs. Some VPS services do. It should be double-overkill, but this demonstrates that the unlikely happens.
I'm pretty sure the RAID cards in the Linode hosts do have BBUs*, but of course that really only ensures that data won't get lost in the cache; it doesn't provide additional runtime in the absence of power, which will shut down the host processor in any event.
They had drive failures, which seems unlikely (but possible) if they had battery backups on the RAIDs. The battery would prevent a head crash and other hardware problems.
@Guspaz:
The problem is that if a genset fails to come fully online within the few seconds that the flywheel can provide power, then you're boned. From what I've seen, UPS installations can usually provide a few minutes of power, but I could be talking out of my ass here.
Well, it's certainly something to take into consideration, but my understanding is that current flywheel systems, which have been able to use newer materials (lighter and stronger) to spin faster (rpm scales energy more than mass), can bridge standard generator startup times. If a generator fails to start quickly, the odds of it getting fixed within the slightly longer window of a battery UPS aren't really that high, so the benefit of that window is marginal.
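For the curious, the "rpm scales energy more than mass" point is just the textbook flywheel energy relation, nothing specific to any particular product:

```latex
% Kinetic energy stored in a flywheel with moment of inertia I spinning at
% angular velocity \omega; for a given geometry, I \propto m r^2.
E = \tfrac{1}{2} I \omega^{2} \;\propto\; m\, r^{2} \omega^{2}
% Doubling the mass doubles E, but doubling the spin rate quadruples it,
% hence stronger (faster-spinning) materials beat simply heavier wheels.
```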
I'm definitely not an expert in either system, but I think I've seen growing references to data centers choosing flywheel systems in recent years. Certainly from an operational (and maintenance) perspective they have a lot of attractive qualities.
– David
@caker:
A RAID BBU doesn't provide power to the drives - it provides power to the RAID card's cache.
Huh. So…how does that help in the event of a power outage?
http://i.imgur.com/2VoyJ.jpg (Bay Bridge)
It was one of the worst lightning storms in the Bay Area in recent memory.
@reaktor:
http://i.imgur.com/2VoyJ.jpg (Bay Bridge)
It was one of the worst lightning storms in the Bay Area in recent memory.
Wow- at some point, lightning will win.
> Wow- at some point, lightning will win.
Well, considering the breakdown voltage of air is in the neighborhood of a few megavolts per meter, I would say lightning - by definition - always wins.
@pundit:
@db3l:
@pundit: Linode doesn't provide battery-backed RAID 10's, from their specs. Some VPS services do. It should be double-overkill, but this demonstrates that the unlikely happens.
I'm pretty sure the RAID cards in the Linode hosts do have BBUs*, but of course that really only ensures that data won't get lost in the cache; it doesn't provide additional runtime in the absence of power, which will shut down the host processor in any event.
They had drive failures, which seems unlikely (but possible) if they had battery backups on the RAIDs. The battery would prevent a head crash and other hardware problems.
Maybe I'm wrong, but I thought battery backup on the RAID card just kept the data in the RAID's cache until power was restored and it could be written out properly? That battery isn't providing power to the hard drive(s), so when they lose power the head will still crash.
@Internat:
That battery isn't providing power to the hard drive(s), so when they lose power the head will still crash.
Actually, I doubt any drives nowadays have a head crash - at least not in the traditional meaning of the head just dropping to the surface - on power loss. Instead, the drive automatically parks the head first in a dedicated landing zone (CSS) or off-media mechanism (ramp). A spring and/or platter inertia is used to provide the energy to reposition the head during power loss so it never hits the data surface.
– David
linode is very secretive and hardly ever owns up publicly when an outage is their fault.
Obviously the power failure and the damaged hosts weren't Linode's fault, but the backend issues that stopped Linodes from booting were.
No complaints about the outage or the staff, just a complaint about the backend issues not being mentioned in the status updates.
@chesty:
linode is very secretive and hardly ever owns up publicly when an outage is their fault.
Between the status pages, the forums here and the conversations that take place in the IRC channel which are publicly accessible to even non-customers, I think you'd find that Linode is one of the most open companies in the business.
@hoopycat:
In Linode's shoes, I would also have waited a little bit before bringing servers back up. The datacenter had just lost power, which is not supposed to happen; clearly, Something Is Wrong, and the last thing you want is to lose power again with thousands of fscks running at once. So, indeed, I can't really blame them for waiting ~15-30 minutes before turning things back on.
(Especially given that there was no power protection at that point, and there was a thunderstorm in progress.)
I've been pleased with the speed of response, and especially with quick responses on trouble tickets. They've now posted a full explanation of the data center problems on the status blog, too.
@chesty:
linode staff went above and beyond expectations in fixing all the downed linodes. but they didn't come clean about why a power outage took hours to recover from.
linode is very secretive and hardly ever owns up publicly when an outage is their fault.
So, the RFO they posted wasn't enough for you?
In my experience, Linode is very open about outage causes, and owns up pretty clearly when it's their fault (exhibit A: the shared library update debacle last year).
@JshWright:
So, the RFO they posted wasn't enough for you?
In my experience, Linode is very open about outage causes, and owns up pretty clearly when it's their fault (exhibit A: the shared library update debacle last year).
Re exhibit A: all the servers were rebooted in less time than it took Fremont to come back online.
My host came up straight away, but my Linode took 6 hours to come up. There were lots of people on IRC in the same boat. Linodes were being fixed one host at a time, manually.
The dashboard slowed and then stopped working, and the graphs in all data centres stopped. Caker showed a graph with a load average of 150, which I think was a database server. Linode did have problems with their own system; it wasn't mentioned in the status updates, that's all I'm saying.
When I got home and saw my Linode was down, I checked the status page and saw the power outage. I logged into the dashboard and it said my Linode was running. I logged into Lish and it definitely wasn't running. So I thought that's why my Linode was still down: Lassie isn't going to boot a dead Linode when it thinks it's running.
So I opened a ticket to say the dashboard was out of sync with reality; at that time I didn't know there were still major issues. I was letting them know of a possible bug. I got a boilerplate reply about a power outage at Fremont. That didn't address my issue at all: my host was up, my Linode was down, and the dashboard said it was running.
So I went ahead and rebooted from the dashboard, and replied to the ticket explaining that my host was up but jobs weren't running and my Linode wouldn't boot. I got a reply saying there was a large queue in the backend and jobs would take a while to run. 6 hours.
Hat tip to the staff for working their bums off getting everyone back online, answering tickets, etc. -1 for not being completely honest in their status updates.
It's good they recovered without loss, had redundant hardware, etc., but if there were software issues (false positives) or genuine delays between the host coming online and all the guests running, then this points to a Linode issue rather than a DC issue, unless I'm missing something?
I guess it would be great to hear something like "whilst the cause of this incident was obviously the storm and much of the downtime was out of our hands, we have to admit that this incident did also highlight (albeit in a small way) some issues with our software that have since been improved with regard to status reporting and/or queue running."
-Chris