Management Redundancy
Even if my server were located in Atlanta or Chicago, I'd have no way to manage it, because the Linode website, management console, and even the status tracker are in Dallas and were affected. It should be simple enough for the tech staff to do. And status shouldn't even be on your server, but a third party altogether, just in case something like this ever happens again.
I am willing to accept that things happen, but the issue could have been mitigated much better yesterday.
13 Replies
(2) http://linode.typepad.com/
However, of course, the Linode Manager would still be unavailable if the Dallas datacenter went down.
@wiley14:
> Even if my server was located in Atlanta or Chicago, I'd have no way to manage it, because the Linode website, management console, even status tracker is in Dallas and was affected.
That's perhaps a little broad. You still have direct access to your server, as well as its console via LISH (only the web console is affected by the management site - ssh will work fine), so most management activities should be unaffected. My Dallas Linode was offline during the outage, but those I have in Newark were fine, and I wouldn't have had any problem managing them. Also, I don't think the status site is in Dallas, but believe it's hosted in San Francisco.
> It should be simple enough for the tech staff to do.
I'd like to see HA for the Linode Manager too, but let's be fair here - any application involving changing state (which includes the Linode Manager) is going to be non-trivial to replicate reliably, and you then have to implement synchronization, failover and recovery procedures. Not impossible by any means, but also unlikely to be a trivial amount of work (though much of the data that already feeds the manager is remote from the individual hosts, so that might help a little architecturally). See the various discussions on the forums about doing HA for your own Linode. Would it be nice to have? Absolutely. Would I claim it's simple to do? No.
> And status shouldn't even be on your server, but a third party altogether, just in case something like this ever happens again.
This was already done earlier this year with status.linode.com. In past outages, I've also had good luck getting status via IRC (also hosted elsewhere).
Oh, and it's a pretty safe bet something like this will happen again in the future. Very few (perhaps no) infrastructures can guarantee 100% uptime.
– David
Expecting 100% uptime (or automatic failover or console redundancy) at these prices is very naive.
> Expecting 100% uptime (or automatic failover or console redundancy) at these prices is very naive.
I'm new to VPS hosting and linode.com in general. Should we expect better uptime from VPS hosting than shared hosting? Obviously the speed and amount of freedom are not comparable, but if your site isn't up, you don't need a bandwidth test to see how fast it is.
I only ask because I just moved from an awesome shared host to linode.com and I was hoping for similar uptime.
Hopefully they can get it under control.
Understand this: Linode themselves are a "hurt customer" in these cases, just as you are (actually, BECAUSE you are).
And well, this is the Internet. Stuff Happens. The only way to REDUCE downtime (there's no 100% guarantee) is an HA setup spread across a few DCs in different locations.
(And yes, I had serious trouble because of the Newark network problem a few weeks ago myself. But I don't whine about it.)
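As a rough illustration of why spreading an HA setup across several DCs reduces downtime without ever reaching 100%, here is a minimal sketch. It assumes each site fails independently of the others, which is optimistic (shared providers, DNS, or the failover mechanism itself can fail in a correlated way), and the function name and figures are made up for the example.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Chance that at least one of `sites` locations is up.

    Assumes every site has the same availability `per_site` (0..1)
    and that failures are independent, which is an idealization.
    """
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # The service is down only when every site is down at once.
    return 1 - (1 - per_site) ** sites


if __name__ == "__main__":
    # Hypothetical per-site availability of 99.9%.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.999, n):.9f}")
```

Each extra site multiplies the remaining downtime by the per-site failure rate, so the result approaches but never equals 1.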
users blame you >> we blame our web host >> our webhost deals with their providers
hakuna matata
More seriously, datacenter issues don't explain why ns1-ns4 were all down at the same time.
They weren't down, but a remote nameserver wasn't doing its job. This has since been corrected (by correcting that issue, and adding five more, globally diverse, nameservers into the mix).
-Chris
@rsk:
> I don't understand why so many people blame Linode for DATACENTERS' problems. Recent failures were about five failures in the datacenter's own network (NOT controlled by Linode), and once or twice a DDoS attack aimed at someone in the same DC (NOT Linode's fault).
I certainly appreciate the varying sources of failures, but at the same time, Linode selected the data center providers that they are using. Just as when we select a VPS provider, there is competition and differences at the data center level, so Linode isn't completely off the hook in my book just because the actual outage was within the data center. It's no different than if I were selling a service via my Linode: my customers would hold me responsible even if my service was down due to a Linode hardware failure.
Yes, Linode has been dependent on the data center providers to resolve most (if not all) recent interruptions, but the data center is a component in the solution Linode is selling - no different than the hardware they chose - and just as relevant to the service reliability. I was bummed to see the status update implying that a single router could take down connectivity in Dallas for so long, since I'd have expected a mesh switched fabric below the level of a router. The other outage with the failed switch card was a little more understandable, since a partial component failure - particularly at the switch level - can be very hard to diagnose, but even then it was disappointing how long it took to identify/resolve.
I'm also mildly disappointed that the data center can't transparently do whatever maintenance in Dallas they need to tomorrow morning (since I'd expect a major data center to have a way to shunt traffic around hardware they were working on), but I also understand sometimes it's necessary, and a warning about a 10 minute outage may also just be a CYA notice in case there is disruption during such a cutover. [Edit: Looks like no outages actually occurred]
With that said, I try to periodically watch status boards and customer forums for other providers I had considered when originally selecting Linode, and in general, have found Linode comparing quite favorably in terms of interruptions. I do think they suffered in comparison this past month (Newark and multiple Dallas significant outages) but this is the first month that's happened, and also the first and only month so far where some of my Linodes fell below 99.9% availability.
– David
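For a sense of scale on the 99.9% availability figure mentioned above, here is a small sketch of how much downtime that allows. The 30-day month is an assumption, and the function name is just illustrative.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed over `days` at a given availability.

    `availability_pct` is a percentage such as 99.9; the 30-day
    default is an assumed month length, not a billing definition.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% -> {downtime_budget_minutes(pct):.1f} minutes per 30 days")
```

Under these assumptions, falling below 99.9% in a month means more than roughly 43 minutes of total downtime.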