Host Reboots - October 27th, 2009
Why are you doing shared library upgrades in the middle of the afternoon when they can affect ALL of your sites like this? I like the fact that I can have geographical diversity with your services but what's the point if you do things that take out all the sites…
Do you have any updates or an ETA on when this will be resolved?
102 Replies
My linode is up again (downtime less than 2 hours), so thanks for fixing it
I'm in Atlanta and my host is still up and running fine (knock on wood).
I wish I could say the same about the linode.com page; it's slow as hell (including the forums).
2 hours and counting…
Argh!!
I'm in Fremont.
linode.com/ is slow as molasses for me as well. Except for accessing the support page. That loaded and sent my support request nice and speedy.
Support referred me to this thread.
This isn't OK. 3 hours. Deafening silence. No progress updates. I've loved Linode up till now, but I can't accept this type of lack of communication.
We run a public safety avalanche forecast site on one of these linodes and it's the first day of wet, unstable snow. We only saw 8 minutes of downtime, luckily. 2 hours like others are reporting would've definitely been unacceptable.
You guys hosted on a Linode 360?!
There are still a few problem hosts we're working through, although we expect all issues to be resolved shortly. Please stand by for additional updates.
-Chris
I understand these things happen, but it sounds to me like this could have been avoided, or at least delayed.
We could have at least gotten some warning to let our clients know.
I love linode and have recommended it to so many people over the years, but when things like this happen, it not only makes you look bad, but it makes me look bad to everyone I have referred here.
That is not even taking into account the users of my own site…
This is not how production systems are run. Billing credit needs to be extended to all affected customers, and this needs to never happen again.
Maintenance that has any chance of causing breakage needs to be done in off-peak hours with at least 24 hours' notice to affected customers, and in case of emergencies, emails need to be sent out at least 30 minutes in advance of taking actions that you KNOW will cause problems for customers not currently affected by the problem you're trying to solve.
http://mibbit.com/down.png
Unfortunately my nodes were rebooted one by one over the course of 2 hours, which meant far more downtime than if they had all been restarted together.
Looking at my stats, there's a huge gap from 21:00 to 0:00 - I'm guessing that's the outage, and I just wasn't here to notice.
My linode says "Up since Oct 28, 12:00 AM".
I think my linode is in Dallas.
-=-
It would be better to do these kinds of things 4 AM eastern (1 AM Pacific).
Tue Oct 27 20:01:09 EDT 2009 ERROR: 66.220.1.164 http is down!
Tue Oct 27 20:02:19 EDT 2009 ERROR: 66.220.1.164 http is down!
Tue Oct 27 20:03:31 EDT 2009 ERROR: 66.220.1.164 http is down!
Tue Oct 27 20:05:03 EDT 2009 ERROR: 66.220.1.164 http is down!
Tue Oct 27 20:06:14 EDT 2009 ERROR: 66.220.1.164 http is down!
Tue Oct 27 20:07:28 EDT 2009 ERROR: 66.220.1.164 http is down!
Tue Oct 27 20:08:48 EDT 2009 ERROR: 66.220.1.164 http is down!
Tue Oct 27 20:09:46 EDT 2009 ERROR: 66.220.1.164 http is down!
Stuff Happens. Still, it would have been better to have this particular Stuff Happen at 3 or 4 AM when nobody would notice …
RE: status updates: I guess they are too busy fixing the problem. (I've been there myself. In an office environment everyone rings you to find out what's going on, so much so that you don't get much time to fix the problem until you go into the server room and lock the door!)
Here's hoping everything is up soon!
JobID: 1498922 - Host initiated restart
Job Entered: 01/03/1974 11:00:00 PM    Status: In Queue
Host Start Date: (blank)    Host Finish Date: (blank)
Host Duration: waiting on host    Host Message: (blank)
Totally killed my linode, way after you guys already knew there was an issue, did you not stop?
@razza:
My server is still down as well. Unfortunately, guys, virtual servers are still shared servers, and no matter what time they do maintenance I guess there would be complaints, as they have customers all over the world.
Colo companies have the exact same problem with power and network maintenance, but the difference is they tell their customers in advance and they schedule it for the least possible impact. You can't get it perfect, but there are far better times to reboot a bunch of US-based servers than late afternoon/early evening US time.
All the boot requests are just sitting in the queue.
Guess my plans for tonight are canceled…
What is your SLA?
We're not going to lie to you: server maintenance, upgrades, hardware and network issues, all affect a Linode in the same ways as any other provider. What we can boast about is our commitment to resolving these issues in the quickest fashion possible. Most customers will tell you the last time they rebooted was to take advantage of a plan upgrade. 99.99% uptime, or your lost time is refunded back to your account.
@Infinito:
Oh newbies... take it easy. For me this is the first time that there has been a problem (at all, as relatively minor as it is) since I signed up in 2007 (my Linode is in Fremont, btw). And this isn't even a real problem, so it seems, just some upgrade that went awry. Over two years up; I believe that fucking beats Amazon services.
:)
Who are you to tell us to take it easy? Not all of us use our Linodes for expensive shell accounts. I have clients who are pissed and had ZERO warning. You take it easy when this is costing you money.
That being said, I am a bit irked as I haven't gotten any explanation on my current situation: my servers have come back up and I can SSH in but I cannot get a shell prompt and I am unable to reboot from the dashboard.
@nknight:
Colo companies have the exact same problem with power and network maintenance, but the difference is they tell their customers in advance and they schedule it for the least possible impact. You can't get it perfect, but there are far better times to reboot a bunch of US-based servers than late afternoon/early evening US time.
Colo customers also pay a premium for that level of redundancy/reliability.
I'm not saying that the linode folks don't deserve some ribbing for this, or that customers don't deserve some credit on their accounts. Rather, just trying to provide some perspective on the issue. If you need umpteen nines of reliability, then linode probably isn't the service for you. For the rest of us, it's pretty dang good. I've been a customer since 2004, and since that time, the number of issues like this can be counted on, say, 2 fingers. Pretty damn good if you ask me. While any amount of downtime sucks, this afternoon's incident surely isn't going to turn me away from Linode. Once the dust settles, I'm sure that we'll get an explanation from caker, and some credit on our accounts is likely as well.
@anderiv:
Colo customers also pay a premium for that level of redundancy/reliability.
I'm not saying that the linode folks don't deserve some ribbing for this, or that customers don't deserve some credit on their accounts. Rather, just trying to provide some perspective on the issue. If you need umpteen nines of reliability, then linode probably isn't the service for you. For the rest of us, it's pretty dang good. I've been a customer since 2004, and since that time, the number of issues like this can be counted on, say, 2 fingers. Pretty damn good if you ask me. While any amount of downtime sucks, this afternoon's incident surely isn't going to turn me away from Linode. Once the dust settles, I'm sure that we'll get an explanation from caker, and some credit on our accounts is likely as well.
totally agree. i came from a shared hosting environment (dreamhost) that had one nine worth of uptime. since moving here to linode in april, outside of today's 30m of downtime, the rest of the downtime was caused by my own stupidity. i won't be leaving linode anytime soon. that said, it was kind of irksome to suddenly without warning lose my vps.
@anderiv:
Colo customers also pay a premium for that level of redundancy/reliability.
This has nothing to do with redundancy or reliability, this is purely a procedural and judgment problem. Linode made a bad choice – a series of them, actually -- resulting in wholly avoidable downtime and hard resets for many customers.
Saying we should have to pay a premium to avoid a rather horrific error in judgment on the part of Linode is like saying we should pay a premium to get a car that has a steering wheel. It's nonsense.
Any hosting service that holds itself out as being suitable for commercial purposes should not be making these kinds of mistakes, regardless of their level of redundancy or hardware reliability.
@ultramookie:
99.99% uptime, or your lost time is refunded back to your account.
According to that, a 4 hour outage on a Linode 2880 (the most expensive) gets you 88 cents!
–John
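The 88-cent figure is simple pro-rating. A minimal sketch of that arithmetic, assuming a 30-day month and a monthly price of about $160 for the plan; the price is not stated anywhere in this thread and is used here only because it roughly reproduces the quoted number, and `prorated_credit` is an illustrative name:

```python
def prorated_credit(monthly_price: float, outage_hours: float,
                    hours_in_month: float = 30 * 24) -> float:
    """Return the credit owed when lost time is refunded pro rata.

    monthly_price  -- plan price per month (assumed value, see above)
    outage_hours   -- length of the outage in hours
    hours_in_month -- billing period length; 30 days is an assumption
    """
    if monthly_price < 0 or outage_hours < 0:
        raise ValueError("price and outage length must be non-negative")
    # An outage cannot be credited for more than the whole billing period.
    outage_hours = min(outage_hours, hours_in_month)
    return monthly_price * outage_hours / hours_in_month


if __name__ == "__main__":
    # Hypothetical price for the largest plan; 4-hour outage as in the post.
    print(f"credit: ${prorated_credit(159.95, 4):.2f}")
```

Whatever the exact price, the point stands: a time-based refund for a few hours is a tiny fraction of the monthly bill.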
The big difference there is that even with their lame CEO's emails and speeches, there was customer contact. Even though inside my soul I cried "BS" with every word they said about the incident, at least they appeared to try. What I see here is chaos and pain.
I'm not sure how the company is built personnel-wise, but I'm sure we'll hear something useful when it's all said and done. Customers are what makes any company, and I'm sure upsetting us is likely not on their todo list.
@jpw:
@ultramookie:
99.99% uptime, or your lost time is refunded back to your account.
According to that, a 4 hour outage on a Linode 2880 (the most expensive) gets you 88 cents!
:) –John
Exactly! I can hardly wait for my refund on my 720 for the past 4.5 hours. I'm sure my customers won't mind either.
@Orrin:
My linode has been down over 4 hours now with no warning and no information other than a canned response from support referring me to this thread. Not impressed.
A) Unplanned outage, how do they warn against those?
B) Do you want them to fix your box or post here?
@nknight:
This has nothing to do with redundancy or reliability, this is purely a procedural and judgment problem. Linode made a bad choice – a series of them, actually -- resulting in wholly avoidable downtime and hard resets for many customers.
See above.
C) You get what you pay for.
D) Run it yourself if you think you can do better.
There's no way I'd be up at 4am performing upgrades, so I wouldn't expect anyone to do the same for me.
Some of us do appreciate you're working your balls off to sort the issue out. We've all been there.
Keep up the good work, Mr Linode!
@techman224:
It seems I've been unaffected by this problem, so I wonder if it only affects older hosts and linodes (as I got mine a few days ago). Plus my host is always idle.
I don't think so. I am also unaffected (fingers crossed) and I got my Linode in 2006. It's on Dallas 5.
@OverlordQ:
B) Do you want them to fix your box or post here?
Both.
j/k
Actually, I'm not, I totally agree with people in this thread that complain about a lack of communication.
Posting an update takes very little time so I don't think it's too much to ask to keep people in the loop about the scope of the problem and what's being done to fix it.
@DharmaTech:
Our linodes were shutdown ungracefully, which of course isn't good for our databases.
This is the biggest issue as far as I'm concerned. No hosting company would consider walking into the Data Center and literally pulling the power cords out of the servers in a rack, and the same care should be taken with physical servers and virtual servers.
So far 2 of my Linodes have been down, and fortunately come back up again (atlanta54 and atlanta57), but my database server (atlanta20) is still down at the moment, and will hopefully not have data issues after having its power plug yanked out!
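How hard is it to set up a mailing list of some sort that we can subscribe to to get announcements of upcoming maintenance? A former ISP of mine did just that, and it worked out great. They would email out a description of maintenance that was to be performed, who would be impacted, and a time window.
Not only would it save a lot of customer frustration, but it would make me look good to the other members of the non-profit I host when I warn them of upcoming maintenance. Right now it has the opposite effect, making it look to them like I dropped the ball. Not cool.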
@OverlordQ:
A) Unplanned outage, how do they warn against those?
B) Do you want them to fix your box or post here?
Usually "unplanned outage" would imply that something unexpected happened outside of the control of the administrators. Chris initially said it was due to "a shared library update distributed to our hosts". Based on the thoroughness I've observed in the past from Linode, I would expect that that sort of update is (1) scheduled by Linode staff and (2) tested on a staging host before pushing to production hosts. If (1) is true then the update was planned, even if the outage was not. I think the point of many posters here is that such maintenance should be announced, even if no outage is expected. If they aren't doing (2), they should be, although that doesn't always catch the problem.
–John
@OverlordQ:
A) Unplanned outage, how do they warn against those?
The original outage was unplanned, the maintenance that caused it was not.
@OverlordQ:
D) Run it yourself if you think you can do better.
I do. My linode use is for my personal business use.
I've been responsible for one particular corporate production service with thousands of customers since early 2006.
You know what each customer is paying us? The equivalent of a few US dollars per month. You know what our contractual uptime obligations are to them? Nothing. You know how much impact 24 hours of downtime would have on our customers? 95% probably wouldn't even notice.
You know how many people are involved with this service? At peak, it was 4. Now it's just me.
And yet, in all that time, all maintenance has taken place during off-peak hours, as has all planned downtime (which was communicated to customers well in advance). We have had approximately 30 minutes worth of unplanned downtime in that period, and about two hours of "partial" downtime due to one of our upstream ISP's flapping BGP (causing approximately 50% of customers to have intermittent difficulty connecting to the service).
I can do better, and I have done better. I do better every day. I'm confident Linode can and will, too, but they have seriously dropped the ball today, and need to be held accountable for it.
@dmuth:
How hard is it to set up a mailing list of some sort that we can subscribe to to get announcements of upcoming maintenance? A former ISP of mine did just that, and it worked out great. They would email out a description of maintenance that was to be performed, who would be impacted, and a time window.
Not only would it save a lot of customer frustration, but it would make me look good to the other members of the non-profit I host when I warn them of upcoming maintenance. Right now it has the opposite effect, making it look to them like I dropped the ball. Not cool.
I feel your pain, as I've had four angry customers call me and I'm left holding the bag, but remember that it's always hard to warn someone of unexpected downtime
@Infinito:
Oh newbies... take it easy. For me this is the first time that there has been a problem (at all, as relatively minor as it is) since I signed up in 2007 (my Linode is in Fremont, btw). And this isn't even a real problem, so it seems, just some upgrade that went awry. Over two years up; I believe that fucking beats Amazon services.
:)
+1 I've been with linode since Oct 2003 (also in Fremont) and this is the first real outage I'm aware of. The only other issues I've ever experienced are short outages due to DDoS etc.
Fortunately my box is up and running again
@OverlordQ:
A) Unplanned outage, how do they warn against those?
Did you read the OP?
> To recover from this we may be issuing host reboots to upgrade their software to our latest stack, and then bringing the Linodes to their last state. We're working on this now and expect to have additional updates shortly. We'll also be notifying those affected via our support ticket system.
So no, I didn't get notified via the support system.
@OverlordQ:
B) Do you want them to fix your box or post here?
Oh yeah, because that's an either/or thing right?
I'm not bashing Linode and overall I've been very happy with the service, but today they fell short of my expectations.
Job Entered: 01/04/1974 12:00:00 AM    Status: In Queue
Host Start Date: (blank)    Host Finish Date: (blank)
Host Duration: waiting on host    Host Message: (blank)
Thought that was kinda funny! My Newark node is all sorts of borked now.
@spearson:
Job Entered: 01/04/1974 12:00:00 AM    Status: In Queue
Host Start Date: (blank)    Host Finish Date: (blank)
Host Duration: waiting on host    Host Message: (blank)
Thought that was kinda funny! My Newark node is all sorts of borked now.
:shock:
Nah, they do that to force the boot job to the front of the queue.
INIT: /etc/inittab[33]: rlevel field too long (max 11 characters)
INIT: /etc/inittab[34]: rlevel field too long (max 11 characters)
INIT: /etc/inittab[35]: rlevel field too long (max 11 characters)
INIT: /etc/inittab[36]: missing action field
Enter runlevel:
If I enter runlevel 3 it gives me this error:
INIT: Entering runlevel 3
INIT: no more processes left in this runlevel
What can I do at this point? I need help bad here. All my websites are down. I am dead in the water…
@hiscom:
Still, it would have been better to have this particular Stuff Happen at 3 or 4 AM when nobody would notice …
Which 3 or 4 AM?
From Linode's 'Interesting stats': "131 countries customer diversity".
I myself have to answer to my customers whom I referred to Linode and who pay me to make sure their services stay online. However, they understand that outages are a reality of the business.
Even the big guys like rackspace, amazon, the planet, google, and facebook have outages.
It's not a matter of if… it's a matter of when. It could be worse.
Two of my customers who I emailed to pass along information about this incident replied back and told me this is nothing in comparison to the outages they endured at media temple. Not uncommon for servers to remain offline for an entire day or more at a time without resolution.
My linode appears to be up again, according to the dashboard. I got connection refused when ssh'ing, so I was fairly confident it was indeed running (otherwise, it would have timed out). I used the AJAX console to inspect why the ssh service was down, and lo and behold, it was stuck at the initramfs prompt complaining about a failed fsck. Ran fsck -y, and then … lost connection to the console, which has been down since then. Arrrgh!!!11
Rebooting now. Hopefully it will come up just fine. Hopefully.
It was stated that those affected would receive notification if any hosts require a reboot. Thus far, 2 of my Linodes have been restarted without any notification - after I'd read that I would be notified.
If you say you are going to notify customers, you should do just that. I was expecting notification if downtime was going to occur - not just the downtime!
@andrewz:
What can I do at this point? I need help bad here. All my websites are down. I am dead in the water…
Open a new thread, or support ticket…
@Rogi:
@hiscom:
Still, it would have been better to have this particular Stuff Happen at 3 or 4 AM when nobody would notice …
Which 3 or 4 AM?
From Linode's 'Interesting stats': "131 countries customer diversity".
Countries exist outside the USA?
@OverlordQ:
A) Unplanned outage, how do they warn against those?
By creating maint@linode, which lists all maintenance, without exception. That way I can look at that mail archive and see today's date and 'upgrading libraries on xen hosts' and go… ahhhh!
> B) Do you want them to fix your box or post here?
Actually I want them to post here first, then fix the problem
I've just had to tell a customer "your servers seem ok but may be rebooted at some point" because I don't know if they are going to reboot all hosts. I had to tell them that because there is NO official information that I can find on the status of the problem (or even much on the cause, fix ETA, etc.)
Take 5 mins to post all the info you have. Say what is going to happen to rebooted hosts (are they now ok) and what about un-rebooted hosts (will they be rebooted later or are they fine).
Part of taking credit and fanboy love we have for this wonderful thing (and I think linode is great) is also taking the responsibility for the fsck-ups that happen along the way.
I'm a professional sysadmin, so I understand things 'go wrong' and that is fine, people have to live with that. But what you can do is be open and honest and fully informative about the problem. It takes 5 mins to do, and often stopping and thinking about the problem enough to lay it out clearly can actually help.
For example, xen instances can be saved to disk. Is there some reason that admins can't do the following:
save all xen instances on a host
reboot the host
restore the xen instance
If that was workable, then maybe you could have a 2 minute 'hang' for each host and not need a reboot of the Linodes on it. (shrugs) Maybe linode should look at trying that (xm save <domain> <savefile>, then xm restore <savefile>) and see if it could be used to reduce the impact next time.
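The save / reboot / restore idea above can be sketched as a command plan. This is only an illustration under stated assumptions: it relies on the `xm save <domain> <file>` and `xm restore <file>` commands from Xen's xm toolstack, the domain names and the checkpoint directory are hypothetical, and nothing here reflects how Linode actually manages its hosts. The code only builds the commands; it does not run them.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Sequence


def plan_save_restore(domains: Sequence[str],
                      save_dir: str = "/var/lib/xen/save") -> Dict[str, List[List[str]]]:
    """Build the xm commands to checkpoint guests before a host reboot
    and bring them back afterwards.

    domains  -- guest (domU) names on the host; hypothetical in this sketch
    save_dir -- where checkpoint files would go; an assumed path
    """
    before: List[List[str]] = []
    after: List[List[str]] = []
    for name in domains:
        checkpoint = str(PurePosixPath(save_dir) / f"{name}.chk")
        # "xm save" suspends the guest and writes its memory image to disk.
        before.append(["xm", "save", name, checkpoint])
        # "xm restore" resumes the guest from that image after the reboot.
        after.append(["xm", "restore", checkpoint])
    return {"before_reboot": before, "after_reboot": after}


if __name__ == "__main__":
    plan = plan_save_restore(["guest-a", "guest-b"])
    for phase, commands in plan.items():
        print(phase)
        for cmd in commands:
            print("  " + " ".join(cmd))
```

Whether this would have helped depends on the checkpoints being restorable under the upgraded host software, which is exactly the kind of thing that would need testing on a staging host first.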
> C) You get what you pay for.
D) Run it yourself if you think you can do better.
I pay for a service and part of that service involves updating the:
forums
outage announcements
blogs
twitter
None of which have any useful information.
I have taken 5 mins to email my clients and say "linode hosted servers are to be taken as 'unreliable' until further notice".
So, I have tried to see if I can just access or bring up a file and still not able to get anything. Anyone else seeing this type of problem?
@EtienneG:
Earlier this week, I was bragging about my linode uptime on a mailing list. Oh well, that will teach me! Karma has its way …
Aha! So you're the one to blame for all this. ;-)
a) This is not a frequent occurrence.
b) I'm sure they like it even less than you do. Reputation is important to small business like this.
c) I'm sure they will take steps to make sure it doesn't happen again. However, the nature of a shared virtual hosting environment with geographic (e.g., timezone) diversity means that unscheduled problems (aren't they all?) will always happen at a bad time for somebody. And they have to be able to do scheduled non-reboot service to hosts without coordinating windows with 40 people. As for notification of a patch to a host, sure, they could do that. But it still wouldn't have stopped this. Would you rather them patch the running host or bring down 40 VMs unnecessarily? Clearly, they've done this kind of maintenance before and it worked fine except for this time. If it were isolated to one host no more than 40 people would notice, but this was widespread.
d) They are busy fixing the problem; posting "we're working on it" messages to the forums is counter-productive. They've said just that on IRC.
e) Linode.com clearly states four 9s of uptime (99.99%). In my experience, it exceeds this SLA. But if you want five 9s, you're going to have to pay more for it. If you're running a revenue generating operation that's that important (e.g., amount of money at stake per minute downtime) then you have either a backup/business continuity plan involving another provider or are paying to colocate your own racks, and even then you still have a backup plan with another provider. In other words, you're spending more money somewhere.
f) Other providers (EC2?) have unscheduled downtime / outages / capacity problems all the time.
All this doesn't make it OK. It's frustrating to have a host go down. But I think it's hard to find as reliable a host for the money.
One other observation, my original linode on a UML host has been rock solid. Anecdotally, I think UML is more stable than Xen. But Xen is faster … go figure.
@pparadis:
Once again, we sincerely apologize for these issues. All system administrators are continuing to work to restore full service for all affected customers. Support tickets are being processed as rapidly as possible. Thank you for your patience.
What would you guess your ETA is, for solving this problem for all hosts? How many hosts have you fixed and how many are left? It's been five hours downtime for one of my linodes now, and it's still down.
The guy who did this (or the boss who ordered him, if that was the case) should get punished by being forced to work 4 weekend-days without pay, and linodes with more than one hour downtime should get one month of service credited to their account.
That way the technician/boss would think twice before 1. not notifying customers of the potentially risky upgrade well in advance, and 2. not testing the update on a few hosts before pushing the update onto all (?) of your hosts at once.
And the company as a whole would get a clear financial incentive not to repeat such foolishness in the future.
That said, I will not move to a different provider just because of today's downtime. I just won't recommend Linode as enthusiastically any more. You're still better than any other alternative I've heard of.
For the future:
Let's say you have 1000 hosts you wish to update. You can never be sure nothing will break. You should therefore test your update on a test-server. If that works you should try it on a live production server. If that works, you should try it on two additional production servers at once. If that works you should try it on 4 more servers at once. If that works, try it on 8, 16, 32 and so on. After 11 tests you would have upgraded 1024 servers. Testing to see if everything is ok before proceeding to update even more servers, eleven times, is a reasonable "waste of time" considering the impact one undiscovered error would have for so many people. If you do upgrades on just a few servers at a time (as suggested above in this message), any problems you miss, we customers will catch because we are so many people.
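The doubling scheme described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual deployment tooling: `upgrade` and `healthy` are hypothetical callbacks standing in for whatever pushes the update and checks a host afterwards. Note that the count of eleven tests includes the initial test server; ten production batches of 1, 2, 4, ... 512 cover 1,023 hosts.

```python
from typing import Callable, Iterator, List, Sequence


def doubling_batches(hosts: Sequence[str]) -> Iterator[List[str]]:
    """Yield batches of size 1, 2, 4, 8, ... until every host is covered."""
    size, start = 1, 0
    while start < len(hosts):
        yield list(hosts[start:start + size])
        start += size
        size *= 2


def staged_rollout(hosts: Sequence[str],
                   upgrade: Callable[[str], None],
                   healthy: Callable[[str], bool]) -> List[str]:
    """Upgrade hosts batch by batch, stopping at the first unhealthy batch.

    Returns the hosts upgraded in batches that passed their health check.
    """
    done: List[str] = []
    for batch in doubling_batches(hosts):
        for host in batch:
            upgrade(host)
        # Check the whole batch before touching any more hosts.
        if not all(healthy(host) for host in batch):
            raise RuntimeError(
                f"halting rollout: batch starting at {batch[0]} failed its health check")
        done.extend(batch)
    return done


if __name__ == "__main__":
    fleet = [f"host{i}" for i in range(1000)]
    print([len(b) for b in doubling_batches(fleet)])
```

With this shape, a bad update is caught after it has touched at most one batch rather than the whole fleet, at the cost of roughly log2(N) pause-and-check steps.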
And please create a mailing list anyone could join if we want updates on planned upgrades and potential problems, progress reports and so on. Maybe one list per datacenter? I only want to know about any updates you do in the Atlanta DC for example.
Anyway, good luck fixing today's problems. It's 03:26 in Sweden now and I'm hitting the sack.
I have logged in via LISH and I see that my kernel is Ubuntu. Why the hell is that? My server is running Gentoo.
Well I came to this thread annoyed and looking for answers.
I didn't necessarily find any answers, but I'm now more annoyed at the customers posting in this thread than I am at Linode.
Thank God I am not the underling or child of any of the perfectionists posting in here. May He have mercy on anyone who has to live up to your standards.
To Linode: I've been with a lot of companies since 2002 and your reliability has been the best I've dealt with to date. Going forward, I would appreciate a little more emphasis on communication as we have people to communicate with as well. It is difficult to do that when we have little or no information to go on. Please consider that - but otherwise good work.
@randrp:
Well I came to this thread annoyed and looking for answers.
I didn't necessarily find any answers, but I'm now more annoyed at the customers posting in this thread than I am at Linode.
Thank God I am not the underling or child of any of the perfectionists posting in here. May He have mercy on anyone who has to live up to your standards.
To Linode: I've been with a lot of companies since 2002 and your reliability has been the best I've dealt with to date. Going forward, I would appreciate a little more emphasis on communication as we have people to communicate with as well. It is difficult to do that when we have little or no information to go on. Please consider that - but otherwise good work.
Another tool who clearly only uses his Linode server as an expensive shell account.
@nsajeff:
Another tool who clearly only uses his Linode server as an expensive shell account.
@randrp:
@nsajeff:
Another tool who clearly only uses his Linode server as an expensive shell account.
:D I love that the internet gives people like you a place to be "tough"
And you the place to preach to others about what is acceptable to everyone else.
I was down for a good part of the evening too, but to be fair, this is maybe the second real outage (other than brief network issues, etc) I've had in what… 5 years? That's a pretty good track record to me.
I don't like it either, but these things happen. That's simply the reality of it.
Linode guarantees 99.99% uptime, which means that there's about 53 minutes a year when you can expect it to go offline (about nine hours a year would correspond to 99.9%). That's fine. What is not okay is having a major outage (all three of my servers went down) in the middle of the day and, more importantly, not being able to provide detailed information about the outage that I can pass on to my clients.
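The downtime allowed by an uptime guarantee is easy to check. A minimal sketch, assuming a 365-day year (the function name is only illustrative): 99.99% works out to a little under an hour a year, while about nine hours a year corresponds to 99.9%.

```python
def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = 365 * 24) -> float:
    """Return the hours of downtime permitted by an uptime percentage.

    uptime_percent -- e.g. 99.99 for "four nines"
    period_hours   -- length of the period; one 365-day year by default
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100 - uptime_percent) / 100


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% uptime allows {hours:.2f} h/year ({hours * 60:.1f} min)")
```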
This whole situation could have been a lot smoother if some administrator had posted perhaps every half an hour with something along the lines of,
These nodes are down. We expect to bring these nodes up in the next half hour. We expect to bring these nodes up in the next hour. And we expect to bring these nodes up in the next two hours. We expect to have the entire system back online within eight hours.
Not only would I have an ETA to give to the people who depend on this server, but I also wouldn't have to check back every five minutes to ascertain whether or not my servers were operable.
@andrewz:
CAN SOMEONE TELL US WHAT IS GOING ON!?!?! I AM COMPLETELY DOWN AND THERE IS NO HELP AT ALL FROM SUPPORT?!!!!!
My suggestion is for you to go to IRC, #linode or #linode-is-broken at irc.oftc.net, the staff is there 24/7, you might be able to get help faster.
@nsajeff:
@Infinito:
Oh newbies... take it easy. For me this is the first time that there has been a problem (at all, as relatively minor as it is) since I signed up in 2007 (my Linode is in Fremont, btw). And this isn't even a real problem, so it seems, just some upgrade that went awry. Over two years up; I believe that fucking beats Amazon services.
:)
Who are you to tell us to take it easy? Not all of us use our Linodes for expensive shell accounts. I have clients who are pissed and had ZERO warning. You take it easy when this is costing you money.
I think rather than poking at other members and making assumptions about their usage, it might be better to take a step back and evaluate your own configuration. The reality is, whether you are a sole proprietor or Google, things can and will happen, and if you want to isolate yourself and your customers from those events, you need to plan, develop and deploy your solution in a manner consistent with your level of concern.
That's not to say things couldn't have been done better today and I would like to think that they will in the future.
I logged into the forum here and learned of the problems. I thought I had lucked out until I tried to access my site. Only the front static page was visible. I could not log in via ssh. I was able to log in via the Linode console. Mysql is up but not responding through the site.
I am lucky, the site I have here is still in development. Thank goodness.
I have had service since February. This is the first problem I have had. It is just another lesson in server administration. Constant vigilance and backups!
Jeff
Don't think I don't know what being under pressure and annoyed by customers and bosses is like: next week there will be a real-life 'test drive' of the red5/flash video conferencing/video broadcast/chat tool I (along with two other developers) have been developing for the last 8 months, where 2700 people (that's right) will be using it at the same time, and I haven't been able to work around all the performance glitches yet.
@randrp:
Thank God I am not the underling or child of any of the perfectionists posting in here. May He have mercy on anyone who has to live up to your standards.
Production systems operations is a tough, uncompromising, and precise business. If you can't take the heat, get out of the fire. This is no more a reflection on how we conduct our personal lives than a surgeon's precision in an operating room is on his personal life.
Mistakes happen, but this was not a "mistake". They didn't mistype a hostname or accidentally bounce the wrong box. This was negligence.
If I had failed so spectacularly to adhere to basic industry standards of care at any of the companies I've worked for, there is a very good chance I would have simply and instantly been shown the door. At the very least, I would not have been allowed near production boxes again for months.
What I hope is that once this is fixed (and the admins get a chance to sleep), this "event" becomes a "lesson learned" from a communications standpoint. Since they are much bigger now than they've ever been, they need to find a way to communicate with a large number of customers at once on occasion. (Maybe have someone post the list of affected hosts every hour.) Other than this big incident, though, the communication from the staff has been great!
I'm still a satisfied customer, though, and I won't hesitate to recommend Linode to someone else!
If you really want a better SLA, then you should probably get one in writing.
Thanks for getting it back up and running, linode staffers.
@kbrantley:
Hey guys! Crap breaks.
If a RAID controller had started puking all over itself or a power supply popped a capacitor, I assure you I would not be here loudly voicing my displeasure. Based on their past performance in such situations, I'd probably be thanking them for prompt action.
The present situation is not "crap breaks", it is "Linode broke it".
@nknight:
@kbrantley: Hey guys! Crap breaks.
If a RAID controller had started puking all over itself or a power supply popped a capacitor, I assure you I would not be here loudly voicing my displeasure. Based on their past performance in such situations, I'd probably be thanking them for prompt action.
The present situation is not "crap breaks", it is "Linode broke it".
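It's a good thing that their uptime guarantee (section 5) covers both hardware error and human error then, huh? The part where it is 99.99% over a month as opposed to a year is nice too.
Second, not to be the obvious counter-whiner, but if you're really that pissed then take it to their contact page or ultimately, your wallet - not the boards here. It won't change a thing.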
I absolutely call BS on Linode for the unacceptable uncommunicated maintenance. And the continuing silence after the event. And even now, 8+ hours after the event started, they haven't chimed in with much information.
And let's focus on those issues. Who cares if your Linode is a personal shell (mine is not) or a production system? We all expect to know when something we pay for is broken, and when it'll be fixed. Oh, and a warning before it's going to be broken, if you know it.
My point? Focus. Let's focus on one thing.
Let's all demand more and better communication from Linode. I agree that we don't want to pull people away from resolving the issues. And we don't want to be kept in the dark, either.
So let's all demand clear and timely communication from Linode. Whether it's in forums, blogs, tweets, or email, what we want to know is "What's going on?".
We don't care about the medium, we care about the content.
We don't care about the framing, we care about the content.
Please, Linode, communicate with us. I haven't seen a single post from someone say they're going to leave Linode (yet), but why wait till it comes to that point? Why wait till you're in the position of begging for customers before talking to the ones you have?
/ramble
@kbrantley:
It's a good thing that their uptime guarantee (section 5) covers both hardware error and human error then, huh? The part where it is 99.99% over a month as opposed to a year is nice too.
The existence of an SLA and the paltry refunds do not excuse the careless behavior in evidence here.
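For a sense of scale, here is a rough sketch of what a 99.99% figure allows over a month versus a year. The function name is made up for illustration, and the 30-day month and 365-day year are assumptions; the actual terms of the guarantee are not quoted in this thread.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted by an uptime percentage over a period.

    Hypothetical helper for illustration; not taken from any provider's SLA.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# Assumed period lengths: a 30-day month and a 365-day year.
per_month = allowed_downtime_minutes(99.99, 30)   # roughly 4.3 minutes
per_year = allowed_downtime_minutes(99.99, 365)   # roughly 52.6 minutes

print(f"99.99% over 30 days:  {per_month:.2f} minutes")
print(f"99.99% over 365 days: {per_year:.2f} minutes")
```

Measured per month, a single long outage blows the budget for that month no matter how clean the rest of the year was, which is why the measurement period matters.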
@kbrantley:
Second, not to be the obvious counter-whiner, but if you're really that pissed then take it to their contact page or ultimately, your wallet - not the boards here. It won't change a thing.
Really? I'd always been under the impression that Linode was one of those rare companies that actually encourages open dialogue with their community, including in their official forums.
If that is not the case, then that is indeed a reason to consider taking my business elsewhere.
To be sure, there are local datacenters where I could put one of my spare 1Us for little more than I pay Linode, and get more resources in the process. My time is valuable enough that I've elected to stay with Linode rather than administer my own hardware, but perhaps I should reevaluate that stance if Linode truly does not care what is posted in their own forums.
@nknight:
My time is valuable enough that I've elected to stay with Linode rather than administer my own hardware, but perhaps I should reevaluate that stance if Linode truly does not care what is posted in their own forums.
Don't care? You clearly don't use IRC where they are active even right now … I'm pretty sure the standard operating procedure is ticket -> IRC -> forum for problem resolution (i.e., forum is not for realtime support).
@kg1866:
Don't care? You clearly don't use IRC where they are active even right now …
I was responding specifically to kbrantley's frankly irrelevant assertions.
And no, I generally don't use IRC these days. It is an inefficient and rushed form of communication for which I generally have little use.
@kg1866:
I'm pretty sure the standard operating procedure is ticket -> IRC -> forum for problem resolution (i.e., forum is not for realtime support).
Which is neither here nor there. If you'd care to look at what I've actually posted in this thread, you'll see that not once have I sought any form of support or assistance. You and kbrantley seem to be addressing a strawman.
service status announcement
1. There was zero communication about the maintenance being done. I understand Linode has international customers, so they can't schedule downtime that will make everyone happy. On the other hand, it would have been nice to know this was coming so we wouldn't have all been caught unaware.
2. That every server in every datacenter was upgraded at once. Even if this upgrade was tested prior to pushing it out to production, there was no way of knowing for sure that there would be zero problems. Even the best tested upgrade can go awry. My business isn't big enough yet to have taken a serious hit from what looks like 4 hours of downtime. When it did get that big, I was planning on buying a second Linode in another datacenter, but if upgrading everything at once is going to be the policy going forward, I'll probably end up buying my secondary server from a different VPS provider.
@craversp:
2. That every server in every datacenter was upgraded at once. Even if this upgrade was tested prior to pushing it out to production, there was no way of knowing for sure that there would be zero problems.
This is what bugs me as well. No matter how well it's been tested, at the very least, upgrades should be broken up by datacenter and performed with some gap between them, i.e. more than 5 minutes, so that the upgrade can bake in. In this case only one DC would have been affected.
To address someone else's comment (too lazy to go back and find it): would I rather someone post a communication or work on the problem? To a sysadmin, the answer isn't intuitive, but it's post a communication. Working for a very large company, I've been angered when my boss has pulled me off of fixing things to communicate about the outage, but in the end, it's the right move. Taking 2 minutes to dash off a communication calms people and gets them off your back, and that is worth far more than what those 2 minutes would have bought in getting the last server up.
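To make the staggered approach concrete, here is a minimal sketch of a per-datacenter rollout with a bake period. None of this is Linode's actual tooling: `staged_rollout`, `upgrade`, `healthy` and `bake_seconds` are hypothetical names, and the one-hour default is just an assumed value.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple


def staged_rollout(
    datacenters: Sequence[str],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
    bake_seconds: float = 3600.0,  # assumed bake time, not a recommendation
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], Optional[str]]:
    """Upgrade one datacenter at a time and stop at the first unhealthy one.

    Returns (datacenters that completed cleanly, datacenter that failed or None).
    """
    completed: List[str] = []
    for dc in datacenters:
        upgrade(dc)           # push the change to this datacenter only
        sleep(bake_seconds)   # let the upgrade bake before judging it
        if not healthy(dc):
            # Halt here: the remaining datacenters are never touched.
            return completed, dc
        completed.append(dc)
    return completed, None
```

With a rollout shaped like this, a bad upgrade stops at the first datacenter that fails its health check, and every datacenter later in the list is left alone.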
@glg:
This is what bugs me as well. No matter how well it's been tested, at the very least, upgrades should be broken up by datacenter and performed with some gap between them, i.e. more than 5 minutes, so that the upgrade can bake in. In this case only one DC would have been affected.
It could be that Linode's structure is such that pushing it to all datacentres at once was vital. But we won't know until they tell us
@gnummep-martin:
It could be that Linode's structure is such that pushing it to all datacentres at once was vital. But we won't know until they tell us
:)
I sure as hell hope not. If this turns out to be the case, then it adds serious weight to the "your hot standby machine should be with another provider" argument. If this is the case, Linode should (and most likely will) address the problem by changing the structure that required simultaneous upgrades.
@pclissold:
@gnummep-martin: It could be that Linode's structure is such that pushing it to all datacentres at once was vital. But we won't know until they tell us
:)
I sure as hell hope not. If this turns out to be the case, then it adds serious weight to the "your hot standby machine should be with another provider" argument. If this is the case, Linode should (and most likely will) address the problem by changing the structure that required simultaneous upgrades.
Well, perhaps it was the nature of the upgrade, I don't know. And we won't know unless they publish a better explanation of some sort.
I get notified of every post to "System and Network Status", and I suspect that I'm not the only one.
I want to be notified about status changes, but in the last 12 hours my inbox has been overflowing with notifications about postings that offer no additional information.
@cirric:
I want to be notified about status changes, but in the last 12 hours my inbox has been overflowing with notifications about postings that offer no additional information.
There is a link in the lower left corner of the page saying "Stop watching this topic" - maybe that's what you're looking for ?
Good luck everyone!
EDIT:
Btw: This is the response I got from my ticket:
> Hello,
I repaired /etc/inittab and /etc/fstab and issued a boot job – and the Linode appears to have booted correctly. We apologize for this inconvenience.
Please let us know if there's anything else we can assist you with.
Regards,
-Chris
So if you can mount your disk image with the Finnix rescue disk image (create a profile for booting the Finnix disk image and mount your real disk image as the second disk image), you could maybe fiddle around with /etc/inittab and /etc/fstab and perhaps fix the problem yourselves. But that is probably not possible, because if it were, then Linode staff would probably have posted instructions on how we could fix our own Linodes. But someone who knows more than me and still has a nonfunctioning Linode could at least try while waiting.
EDIT2:
You can look at a copy of my inittab and fstab files on the below urls if you want to see a functioning version. I run my Linode on Debian and use ext3 as my filesystem.
EDIT3:
Can someone post a non-working fstab and inittab file? It would be interesting to see what the differences are. Maybe we could make a short howto for those who would want to fix the problem themselves instead of waiting for the admins.
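If someone does post a broken file, comparing it against a working one could look something like the sketch below, using Python's standard difflib. The two sample fstab contents are invented purely for illustration (nobody in this thread has posted what the broken files actually contained), and `config_diff` is a made-up helper name.

```python
import difflib


def config_diff(working: str, broken: str, name: str = "fstab") -> str:
    """Return a unified diff between a working and a broken config file."""
    diff = difflib.unified_diff(
        working.splitlines(keepends=True),
        broken.splitlines(keepends=True),
        fromfile=f"{name}.working",
        tofile=f"{name}.broken",
    )
    return "".join(diff)


# Hypothetical sample contents, NOT taken from any affected Linode.
WORKING_FSTAB = (
    "/dev/xvda  /      ext3  defaults  0 1\n"
    "/dev/xvdb  none   swap  sw        0 0\n"
    "proc       /proc  proc  defaults  0 0\n"
)
BROKEN_FSTAB = (
    "/dev/xvda  /      ext3  defaults  0 1\n"
    "/dev/xvdb  none   swap  sw        0 0\n"
)

print(config_diff(WORKING_FSTAB, BROKEN_FSTAB))
```

Lines starting with "-" exist only in the working file and lines starting with "+" exist only in the broken one, so the diff points straight at what would need restoring from the rescue environment.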
Those who are complaining about communication problems were obviously looking in the wrong places. Their Twitter account got several updates during the downtime. The staff was always on IRC giving us updates.
From what I gathered, Linode was doing a shared library update of the kind that happens every once in a while with no downtime. I assume that this type of thing has been done in the past without error and this was just another routine host update. Something about the libraries or the way they were installed caused issues on the hosts. They then decided to fix every host one at a time. With over 500 hosts and a staff as small as Linode's, that means some downtime.
@oliver:
There is a link in the lower left corner of the page saying "Stop watching this topic" - maybe that's what you're looking for ?
That option doesn't appear until you reply to a particular thread. But even before I replied, I was subscribed to the forum. So, I got notification of any replies posted to the forum.
The notification email includes a link to stop watching this forum, but I don't want to disable that, because I'll miss the original postings of announcements by staff.
Yes, it's a limitation of the forum software. I'm just asking that folks use "General Discussion" for general discussion, and reserve "System and Network Status" for actual status updates.
Cheers,
Antonio
It's not a question of if the ball will get dropped… Ohhh, it will, believe you me. As with everything involving humans, it's a question of when. So be prepared.