Fremont reliablilty

general

Creating a new thread for this to get a bit more visibility… We need a response from linode on what they are going to do about the HE datacenter.

I know that it's not linode's equipment that's failing, but in the end it doesn't matter, It's linode's reputation and service level which is going down the tubes at this DC.

At this point, The usual response of "We're talking to them about what they are doing to prevent this" isn't good enough - it clearly didn't work to solve the issues after the last 3 outages. There needs to be some real steps taken to improve the service from fmt1. I want to see a full explanation of how a single breaker can take the whole DC down, and exactly what is being done to fix what is clearly a systemic issue, not isolated incidents.

I don't really want to move DC, because (as seems to be the case with most of the people who haven't moved), my clients are in Australia/NZ so the lower RTT is advantageous. So this could probably be resolved best by Linode getting space in another DC with good latency to these areas.

Jordan

7 Replies

forum:Guspaz 13 years, 4 months ago

At this point it almost seems like Linode's only solution if they want to stay in the same facility is to invest in their own in-rack UPS systems, but boy can that get insanely expensive (and bulky) when you need at least two or more hours of runtime to survive all the HE outages.

It's a bit silly, it seems that the HE facility loses power every single time there's a power outage in fremont. It's like their UPS/genset is useless.

EDIT: Which is particularly worrying what with this:

http://hardware.slashdot.org/story/11/0 … lar-Storms">http://hardware.slashdot.org/story/11/08/08/0321231/Power-Companies-Brace-For-Solar-Storms

Before you laugh at the idea of solar storms causing issues, in 1989, solar storms caused HydroQuebec's network (which, like Texas, has its own interconnect) to lose power for 9 hours.

Basically, the storms caused a surge in the region where most of the generation capacity was, taking out 9.6 gigawatts of capacity, which was about half the capacity at the time. The sudden loss of so much capacity caused the network to begin load-shedding by shutting off parts of the network, which caused voltage swings that took out the rest of the generating capacity.

The network has since been hardened against this sort of thing, but not everywhere in North America is…

EDIT2: Of course, as was pointed out in another thread, in-rack UPS won't keep HE's network equipment up, but if the Linodes never went down, the downtime would be reduced to just the actual length of the power outage and not the power outage plus the 2+ hours it takes to bring all linodes back up (not to mention hardware damage and data loss from sudden shutdowns)

forum:bryantrv 13 years, 4 months ago

An aside- http://www.zdnet.com/blog/saas/lightnin … ag=nl.e539">http://www.zdnet.com/blog/saas/lightning-strike-zaps-ec2-ireland/1382?tag=nl.e539
> Summary: A lightning strike last night knocked out servers at Amazon’s only European data center and the provider has warned some of those affected face delays of up to two days before they get back online.

While nothing guards 100% against a lightning strike, it seems that hardware vendors would be making more servers like Google's, which have on board battery backup.

forum:hoopycat 13 years, 4 months ago

~~@bryantrv:~~

While nothing guards 100% against a lightning strike, it seems that hardware vendors would be making more servers like Google's, which have on board battery backup.

Batteries are big and tend to require replacement every few years, so in-server batteries are probably not too feasible in this situation. With the hardware Linode uses, there's barely room for a RAID BBU.

I think the best compromise nowadays is probably Facebook's Open Compute Project, but this is probably unrealistic for general-purpose multi-tenant datacenters.