Over the twelve days between December 25th and January 5th, Linode saw more than a hundred denial-of-service attacks against every major part of our infrastructure, some severely disrupting service for hundreds of thousands of Linode customers. I’d like to follow up on my earlier update by providing some more insight into how we were attacked and what we’re doing to stop it from happening again.
Essentially, the attacker moved up our stack in roughly this order:
- Layer 7 (“400 Bad Request”) attacks toward our public-facing websites
- Volumetric attacks toward our websites, authoritative nameservers, and other public services
- Volumetric attacks toward Linode network infrastructure
- Volumetric attacks toward our colocation provider’s network infrastructure
Most of the attacks were simple volumetric attacks. A volumetric attack is the most common type of distributed denial-of-service (DDoS) attack, in which a cannon of garbage traffic is directed toward an IP address, wiping the intended victim off the Internet. It’s the virtual equivalent of intentionally causing a traffic jam using a fleet of rental cars, and the pervasiveness of these types of attacks has caused hundreds of billions of dollars in economic loss globally.
Typically, Linode sees several dozen volumetric attacks aimed toward our customers each day. However, these attacks almost never affect the wider Linode network because of a tool we use to protect ourselves called remote-triggered blackholing. When an IP address is “blackholed,” the Internet collectively agrees to drop all traffic destined to that IP address, preventing both good and bad traffic from reaching it. For content networks like Linode, which have hundreds of thousands of IPs, blackholing is a blunt but crucial weapon in our arsenal, giving us the ability to ‘cut off a finger to save the hand’ – that is, to sacrifice the customer who is being attacked in order to keep the others online.
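To make the "cut off a finger" idea concrete, here is a minimal sketch of how a blackhole route behaves in a toy longest-prefix-match table. It is an illustration only, not Linode's tooling: the `RoutingTable` class, the `"discard"` and `"customer-vlan"` labels, and the 198.51.100.0/24 documentation range are all assumptions made for the example.

```python
import ipaddress


class RoutingTable:
    """Toy longest-prefix-match table; not a model of any real router."""

    def __init__(self):
        self.routes = {}  # network -> next hop ("discard" means null route)

    def add(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def blackhole(self, address):
        # A host route (/32) is more specific than the covering subnet,
        # so it wins the longest-prefix match for that one address only.
        self.add(f"{address}/32", "discard")

    def lookup(self, address):
        ip = ipaddress.ip_address(address)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.add("198.51.100.0/24", "customer-vlan")  # hypothetical customer subnet
table.blackhole("198.51.100.7")                # hypothetical attacked customer

print(table.lookup("198.51.100.7"))  # the attacked address is dropped
print(table.lookup("198.51.100.8"))  # its neighbours are still routed
```

Only the attacked address loses connectivity, and every other address in the subnet keeps its normal route. That is why the technique works well for a single customer and poorly for an address that other services depend on.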
Blackholing fails as an effective mitigator under one obvious but important circumstance: when the IP that’s being targeted – say, some critical piece of infrastructure – can’t go offline without taking others down with it. Examples that usually come to mind are “servers of servers,” like API endpoints or DNS servers, that make up the foundation of other infrastructure. While many of the attacks were against our “servers of servers,” the hardest ones for us to mitigate turned out to be the attacks pointed directly toward our own and our colocation providers’ network infrastructure.
Secondary Addresses
The attacks leveled against our network infrastructure were relatively straightforward, but mitigating them was not. As an artifact of history, we segment customers into individual /24 subnets, meaning that our routers must have a “secondary” IP address inside each of these subnets for customers to use as their network gateways.
As time has gone by, our routers have amassed hundreds of these secondary addresses, each a potential target for attack. Of course, this was not the first time that our routers have been attacked directly. Typically, special measures are taken to send blackhole advertisements to our upstreams without blackholing in our core, stopping the attack while allowing customer traffic to pass as usual. However, we were unprepared for the scenario where someone rapidly and unpredictably attacked many dozens of different secondary IPs on our routers. This was for a couple of reasons. First, mitigating attacks on network gear required manual intervention by network engineers, which was slow and error-prone. Second, our upstream providers were only able to accept a limited number of blackhole advertisements, a restriction meant to limit the potential for damage in case of error.
After several days of playing cat-and-mouse games with the attacker, we were able to work with our colocation providers to either blackhole all of our secondary addresses, or to instead drop the traffic at the edges of their transit providers’ networks where blackholing wasn’t possible.
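The scale of the problem described in this section can be sketched in a few lines. The numbers below are hypothetical, not Linode's: the 10.20.0.0/16 range, the choice of the first address in each /24 as the gateway, the "every fifth gateway" attack pattern, and the cap of 10 advertisements per upstream are all assumptions chosen only to show why a per-upstream limit becomes a bottleneck.

```python
import ipaddress


def gateway_addresses(subnets):
    """Return one router-side gateway address per customer subnet.

    Assumption: the gateway is the first address after the network
    address. The post does not say which address is really used.
    """
    return [net.network_address + 1 for net in subnets]


def split_by_upstream_limit(attacked, upstream_limit):
    """Separate addresses the upstream will blackhole from the overflow."""
    return attacked[:upstream_limit], attacked[upstream_limit:]


# Hypothetical data: 256 customer /24s carved out of a private /16.
subnets = list(ipaddress.ip_network("10.20.0.0/16").subnets(new_prefix=24))
gateways = gateway_addresses(subnets)

# Hypothetical attack pattern: every fifth gateway is targeted.
attacked = gateways[::5]

# Hypothetical cap on blackhole advertisements accepted by one upstream.
accepted, overflow = split_by_upstream_limit(attacked, upstream_limit=10)

print(f"{len(gateways)} gateway addresses are potential targets")
print(f"{len(attacked)} attacked, {len(accepted)} blackholed upstream, "
      f"{len(overflow)} still reachable by attack traffic")
```

With any fixed cap, an attacker who rotates through more targets than the cap allows always leaves some addresses exposed. This is consistent with the eventual fix of blackholing, or filtering at the transit edge, all of the secondary addresses at once.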
Cross-Connects
The attacks targeting our colocation providers were just as straightforward, but even harder to mitigate. Once our routers were no longer able to be attacked directly, our colocation partners and their transit providers became the next logical target – specifically, their cross-connects. A cross-connect can generally be thought of as the physical link between any two routers on the Internet. Each side of this physical link needs an IP address so that the two routers can communicate with each other, and it was those IP addresses that were targeted.
As was the case with our own infrastructure, this method of attack was not novel in and of itself. What made this method so effective was the rapidity and unpredictability of the attacks. In many of our data centers, dozens of different IPs within the upstream networks were attacked, requiring a level of focus and coordination between our colocation partners and their transit providers that was difficult to maintain. Our longest outage by far – over 30 hours in Atlanta – can be directly attributed to frequent breakdowns in communication between Linode staff and people who were sometimes four degrees removed from us. We were eventually able to completely close this attack vector after some stubborn transit providers finally acknowledged that their infrastructure was under attack and successfully put measures in place to stop the attacks.
Lessons Learned
On a personal level, we’re embarrassed that something like this could have happened, and we’ve learned some hard lessons from the experience.
Lesson one: don’t depend on middlemen
In hindsight, we believe the longer outages could have been avoided if we had not been relying on our colocation partners for IP transit. There are two specific reasons for this. First, in several instances we were led to believe that our colocation providers simply had more IP transit capacity than they actually did. Several times, the amount of attack traffic directed toward Linode was so large that our colocation providers had no choice but to temporarily de-peer with the Linode network until the attacks ended. Second, successfully mitigating some of the more nuanced attacks required the direct involvement of senior network engineers from different Tier 1 providers. At 4am on a holiday weekend, our colocation partners became an extra, unnecessary barrier between ourselves and the people who could fix our problems.
Lesson two: absorb larger attacks
Linode’s capacity management strategy for IP transit has been simple: when our peak daily utilization starts approaching 50% of our overall capacity, then it’s time to get more links. This strategy is standard for carrier networks, but we now understand that it is inadequate for content networks like ours. To put some real numbers on this, our smaller datacenter networks have a total IP transit capacity of 40Gbps. This may seem like a lot of capacity to many of you, but in the context of an 80Gbps DDoS that can’t be blackholed, having only 20Gbps worth of headroom leaves us with crippling packet loss for the duration of the attack.
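The arithmetic behind "crippling packet loss" can be sketched as follows. This is a back-of-the-envelope estimate, not a measurement: it assumes legitimate traffic sits at the 50% upgrade threshold and that a saturated link drops good and bad traffic in equal proportion. The function name `loss_fraction` is made up for the example.

```python
def loss_fraction(capacity_gbps, legit_gbps, attack_gbps):
    """Fraction of offered traffic dropped once the links saturate.

    Assumption: drops fall evenly on legitimate and attack traffic.
    """
    offered = legit_gbps + attack_gbps
    if offered <= capacity_gbps:
        return 0.0
    return 1 - capacity_gbps / offered


capacity = 40.0              # Gbps, smaller datacenter (from the post)
peak_legit = 0.5 * capacity  # assumed: peak usage at the 50% threshold
attack = 80.0                # Gbps, un-blackholeable DDoS (from the post)

headroom = capacity - peak_legit
loss = loss_fraction(capacity, peak_legit, attack)

print(f"headroom: {headroom:.0f} Gbps")
print(f"estimated loss during the attack: {loss:.0%}")
```

Under these assumptions, 100Gbps is offered to 40Gbps of links, so roughly 60% of all packets are dropped, legitimate ones included.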
Lesson three: let customers know what’s happening
It’s important that we acknowledge when we fail, and our lack of detailed communication during the early days of the attack was a big failure. Providing detailed technical updates during a time of crisis can only be done by those with detailed knowledge of the current state of affairs. Usually, those people are also the ones who are firefighting. After things settled down and we reviewed our public communications, we came to the conclusion that our fear of wording something poorly and causing undue panic led us to speak more ambiguously than we should have in our status updates. This was wrong, and going forward, a designated technical point-person will be responsible for communicating in detail during major events like this. Additionally, our status page now allows customers to be alerted about service issues by email and SMS text messaging via the “Subscribe to Updates” link.
Our Future is Brighter Than our Past
With these lessons in mind, we’d like you to know how we are putting them into practice. First, the easy part: we’ve mitigated the threat of attacks against our public-facing servers by implementing DDoS mitigation. Our nameservers are now protected by Cloudflare, and our websites are now protected by powerful commercial traffic scrubbing appliances. Additionally, we’ve made sure that the emergency mitigation techniques put in place during these holiday attacks have been made permanent.
By themselves, these measures put us in a place where we’re confident that the types of attacks that happened over the holidays can’t happen again. Still, we need to do more. So today I’m excited to announce that Linode will be overhauling our entire datacenter connectivity strategy, backhauling 200 gigabits of transit and peering capacity from major regional points of presence into each of our locations.
Here is an overview of forthcoming infrastructure improvements to our Newark datacenter, which will be the first to receive these capacity upgrades.
The headliner of this architecture is the optical transport networks that we have already begun building out. These networks will provide fully diverse paths to some of the most important PoPs in the region, giving Linode access to hundreds of different carrier options and thousands of direct peering partners. Compared to our existing architecture, the benefits of this upgrade are obvious. We will be taking control of our entire infrastructure, right up to the very edge of the Internet. This means that, rather than depending on middlemen for IP transit, we will be in direct partnership with the carriers who we depend on for service.
Additionally, Linode will quintuple the amount of bandwidth currently available to us, allowing us to absorb extremely large DDoS attacks until they are properly mitigated. As attack sizes grow in the future, this architecture will quickly scale to meet their demands without any major new capital investment.
Final Words
Lastly, sincere apologies are in order. As a company that hosts critical infrastructure for our customers, we are trusted with the responsibility of keeping that infrastructure online. We hope the transparency and forward-thinking in this post can regain some of that trust. We would also like to thank you for your kind words of understanding and support. Many of us had our holidays ruined by these relentless attacks, and it’s a difficult thing to try and explain to our loved ones. Support from the community has really helped. We encourage you to post your questions or comments below.
Comments (67)
Thanks for your great work. My VPS was running well during these days.
Good postmortem analysis – thanks for being candid.
Thanks for being honest and forthcoming about this and the issues you addressed, both on the technical and PR sides, as well as the steps you are taking to better your company.
Kimo.
You people are awesome and have great stamina. We are a satisfied customer from Pakistan.
I’ll never stop buying linodes!!
You guys are rock stars in my book, and I appreciate the transparency. More tech companies need to live and breathe that these days, or else find themselves losing the game to cheaper competitors.
While I haven’t been a fan of how some past incidents were handled, I still give Linode a 5-star rating. Good job!
Things happen. Those of us who network or sysadmin know that when you’re fighting fires and figuring out what is going on and fielding calls from angry clients, the last thing you have time for is updating everyone. Hell…you may not even know what all is going on for a couple days or more with huge attacks.
This is a good postmortem and your ability to learn and adapt and invest in your own infrastructure is why I love and continue to be a Linode fanatic.
Keep it up you guys. Sorry Christmas was such a bummer.
May the Network be with you!
Can’t thank the Linode team enough for your dedication. The livelihood of thousands rests in your hands. I feel like this whole event further proves how well qualified you guys are to be doing what you’re doing.
The only part of this that really bothers me is the idea that if I get a DDOS, Linode is just going to blackhole me, and me alone. Doesn’t that mean that I have to give in to ransom demands from attackers?
I really appreciate this. We were waiting for this to take the decision if we will stay in linode or move away, and we are staying.
I strongly agree that being more transparent would have helped a LOT.
I’d like to know, though, when the above change is scheduled for the rest of the datacenters. I’m not using Newark right now and would like to know when my datacenter will have it : )
Thanks a lot,
Rodrigo
@Mogden – for people who are attacked regularly, we suggest Cloudflare or others in the DDoS protection market. I’m not sure what the future holds on this subject, but rest assured that it really bothers us too.
Thanks for the update. Any time frame for other datacenters to be updated? My linodes are in Atlanta and we suffered almost three days of downtime.
Cheers
We had 2 linodes, one of them in the Atlanta datacenter. We did not experience any issues during the holidays, but I was worried though. Thanks for the explanation and amazing work. Honestly hope your family can understand the situation.
Amazing company!
Like Rodrigo, this is a huge thing to us. I was honestly feeling that it was going the usual corporate way with silence and deniability, just waiting for the furore to die down. It really makes a difference to hear not only the details of the response/mitigation activities, which we appreciate, but also acknowledgement of the position we were put into when communication was sparse.
It goes a long way.
Thanks again.
Mark.
Great to hear we could help you get protected.
swiner@cloudflare.com
@mogden – if you’re the one being DDoS’d then you deserve to be blackholed. I don’t pay for my linodes for you to be targeted with a DDoS and my linodes taken down!!
Thank you for the analysis and a break down of what took place, and most importantly, thank you for being honest with customers!
Cheers!
I’m obviously a huge fan of Linode, but I wonder if this attack will force them to re-evaluate their “3 strikes” policy towards hosted sites which come under DDoS attack. As this attack should have taught them, it’s indiscriminate, and there’s not a whole lot a small website owner can do to mitigate it. We rely on Linode to be able to deal with this, and punishing the victim is hardly a fair solution.
And attacks started minutes after posting updates. http://status.linode.com/incidents/mkcgnmjmnnln
I’ve a message for Linode especially Chris, please invest more and more on infrastructure if you want to stay in the game otherwise, you’ll be overtaken by heavily funded startups in this domain. We know you have innovative mind and excellent technology but this alone is not sufficient for you to win in this domain. I like performance and flexibility of Linode but moved to DO just because I needed to setup my stuff at Japan and Singapore data-centers and Japan DC is sold out. 3 out of 6 locations are sold out and you are not yet expanding? How will you compete?
Come out of your box and look at your neighbors. It was painful to move to Digital Ocean for me but I had to take this decision. I am still using Linode for some of my stuff will continue using it until I need redundancy or you expand.
There’s nothing that I love more than the amount of technical detail that you provide to us on these cases, and even with some minor updates.
I love being a Linode customer, no DDoS will get that away from me 😀
Thanks for this post, Alex. This was a rough period for everyone involved and affected but I am extremely impressed by Linode making the effort to hopefully prevent the same scenario from happening again.
There were many lessons to be learned from this – both for Linode and for customers.
Linode appears to have realized what they needed to do and that is fantastic. Instead of saying sh*t happens and going about business as usual you are actively working to make sure it doesn’t happen again. Well done.
We (customers) need to cover our own bases too. For anything critical or even slightly important you need to have a plan in place in the event of a Linode outage (regardless of the reason).
I have now split some of my services and am far better placed to recover quickly in the event something like this were to happen again. Linode had always been so reliable that I got complacent. Lesson definitely learned.
In my case my costs have now increased as I am now paying other providers in addition to what I have and will continue to pay Linode, but the ability to keep some important services online is worth it.
Thank you to everyone at Linode for your hard work and for looking out for your customers.
Some of our big clients suffered with the downtime on those days but, with several VPS and more online each day, we never accepted any offer from other players. This kind of behaviour makes us confident in the team and gives us peace of mind that we’re in good hands.
Thank you for the update and respect with your customers.
Hostcare Internet
Thank you for being open, good luck with your new defences and I hope that you catch up on your family time!
Linode user here. Thanks for the transparency. I wasn’t directly affected but I appreciate the openness on the issue. It’s a welcome change to most companies now. I plan to keep using Linode just because of how cool you all handled the situation. Keep up the good work!
Cloudflare will probably help with your DDoS, but like any other vendor they aren’t infallible. What happens when they get hit really hard themselves? I’d recommend getting a second DNS provider.
See Also: https://blog.thousandeyes.com/ultradns-ddos-affects-major-web-services/
https://blog.thousandeyes.com/ultradns-outage-october-2015/
I was beginning to wonder if such a note would arrive. The explanation is useful and I’m feeling as though things are safer than before.
Thank you for being transparent about what happened. That was a truly hellish attack. Getting slammed with a sophisticated and highly targeted 80 Gbit DDoS is stressful for any network admin and I’m glad that Linode succeeded in weathering the storm.
I am really impressed with way you have handled this whole situation, your company’s honesty and explanation is more than anyone could have expected. I’m sure there were many hours invested, not only in locating and fixing the problem on top of adding the double protection; but even in your letter to your customers. I hope all your customers are as loyal to your company as you have been with them. Way to step up your game, keep up the good work. Wishes for much more success……
Thank you very much for the detailed breakdown of what went wrong and what you plan to do to prevent this in the future. I have to say though, technical reasons and justifications aside, Linode has a lot to learn in regards to communication. I know you acknowledge that in your blog post but for many people (myself included) it’s too little way too late. It’s taken you 30 days to write a blog post that could’ve been written in hours. For 30 days people have been sitting on the fence wondering exactly what you guys are doing and whether or not they should jump ship. For many people (myself included), the absence of this response and the overall feeling that it has been so long since you said you were going to provide an update, that honestly you were just going to push this to the side and hope it went away, has directly contributed to Linode losing a significant amount of business from us.
I don’t want my response to turn into some Linode bashing post, but I want you to be aware that your failure to provide sufficient information and responses is the biggest problem here – for me, at least. It hit your reputation hard and caused us to lose a significant amount of trust in your company and services. DDoS attacks happen, and we know you guys were working extremely hard to deal with those. You reminded us often enough in your status updates. What we really wanted to know was that the worst was over and that you identified your weaknesses and were addressing those. The longer we had to wait for this information, the less trust we had/have in you.
I’d like to end this on a more positive note. All of the above said, your services are fantastic overall and I’d love to come back to Linode in the future, once you’ve performed all of the changes you have mentioned here. Just please, improve on your communications!
Long-time Linode customer…I wasn’t affected by the outage, but I’m really glad you’ve taken the time to write up what happened. Thanks for being transparent and generally awesome.
Alex, this caught my attention: “… requiring a level of focus and coordination between our colocation partners and their transit providers which was difficult to maintain.”
How did you structure this communication? What tools / technologies did you use or try to use?
This is a nicely put article. I only have amazing things to say about Linode and its staff. Awesome post!
As a long time customer and a fellow network administrator I just wanted to say that I do really appreciate all your hard work. Respect.
Sounds an exciting project Alex, good luck!
Any news on continued security farces at Linode? and ‘The Best Practices not invented here’ approach.. For example to reset 2FA
—
Should you need us to disable your Two-Factor Authentication, the following information is required:
An image of the front and back of the payment card on file, which clearly shows both the last 6 digits and owner of the card.
An image of the front and back of the matching government-issued photo ID.
—
A) Photoshop CC in 2 mins, you have no idea what my CC should look like.
B) You can’t verify government ID so say 5 minute photoshop.
Woohoo for 2FA, known as 2 f… alls
Thanks for the update, and letting us know that things will be better handled in the future. Both technically and on the communication front.
Any idea who attacked and why?
Linode – you are the best. Thanks for your service.
Thanks for the update. As a long time linode customer, it is appreciated.
For you guys complaining about being kicked out in case of a DDoS, I recommend getting DDoS protection for your linodes. There are a lot of cheap options right there that can be integrated easily.
Some one recommended CloudFlare and they are great. You can also look at Sucuri:
http://sucuri.net/website-firewall/
Or Incapsula:
https://incapsula.com
Both great products and solutions. Stay safe!
200g? this years ddos was 800gbps…
good postmortem. now can you explain what happened with the “leaked” credentials and the fact that we had to reset the passwords.
thank you
These attacks could happen to anyone and any provider. Keep up the good work!
Great article and the right way to handle these kinds of problems. Transparency and constructive retros are the way to go.
I think you did a great job considering the size of the attack. That’s why I continue to use Linode for my virtual machines. Thank you for your support and keep up the good work.
Thank you for the clear and concise explanation. I look forward to you rolling out your upgrades and continue to be a happy customer with Linode.
Cisco routers, seriously?
Juniper high end routers take a gigantic steaming dump all over Cisco.
@Jake that’s essentially what ASRs are 😉
If you want to do it on the cheap side and be safe, get some cheaper / best equipment from huawei (give them a call). You might think the Chinese cannot be better than Cisco, but Cisco is now also made in China. Also I’m sorry, but you need some DDoS protection (expensive). You cannot just nullroute your customers… you have to protect them. If the cheap OVH company can do it, why can’t you…
Looks like you guys need to hire someone with real experience in network engineering (worked at ISP level), not just some cheap undergraduate out of university.
You need to rely more on anycast, have reserved capacity, etc.
After reading this, I would not host my sites on linode. You guys look amateur (sorry).
I appreciate this honest insight, but I’ve moved back to a local server since these attacks made access to my Linode difficult or impossible, and always-on, always-accessible was my main reason for moving to Linode in the first place. Sorry, and better luck in the future.
I like the transparency, even delayed. I like that you’re taking steps. I DON’T like that your “security appliances” block ALL ICMP packets, including the “Packet Too Big” messages required for path MTU discovery, breaking my ability to access the Manager over my VPN.
Buying blended internet direct from your colo provider is a bad idea (as it seems you have learned the hard way)
You should be getting your transit direct from diverse carriers… this is networking 101
Love the armchair quarterbacks giving their input. Now, for you QBs, where is your massive company you are running and making decisions and learning lessons from? Oh you don’t have one and you don’t work for one? Sit back and let Linode do their job, they are by far the best provider out there. The cost of this type of infrastructure is gigantic and you wanna-be QBs have no idea what it takes to run a business.
Great job Linode. I know I’ve made the right choice by using you.
Excellent. I knew you guys were “on it”. I really appreciate the detail you provided.
Thank you for releasing this honest and detailed report
Regarding CloudFlare, did you shop around for any other DNS DDOS protection services? The reason I ask is because CloudFlare happily caches too many dodgy websites. Some sources that may be of interest:
http://news.netcraft.com/archives/2015/10/12/certificate-authorities-issue-hundreds-of-deceptive-ssl-certificates-to-fraudsters.html (large number of phishing certificates issued by CloudFlare)
http://www.crimeflare.com (non-profit that investigates CloudFlare and its customers)
I appreciate the update, but I find it a bit late too.
Also, I don’t really get why Mr. Forster is signing this post.
And don’t get me wrong, I have nothing against him; I don’t doubt his intentions or knowledge.
But I expected a statement from someone at the top of the food chain. This was also one of my main problems when the events happened: it’s like nobody from top management cared, until one of the engineers realized that they couldn’t be silent anymore.
I still have that feeling, and it is pretty alarming.
It’s time to move to IPv6-only internet. Attacking a single address will become impractical if a host can have millions of them changed automatically in an unpredictable way.
Appreciate the info.
It is a minor point, I know, but status.linode.com should either be un-available over https, or have its own cert.
try this in chrome…
https://status.linode.com
Thanks Linode Team for acknowledging your challenges, and courageously taking adaptive actions 🙂
Great job! Didn’t know such a story was ongoing since my site was up all the time. Really appreciate all the hard work of the LINODE support team!
Thank you for the very interesting update. Best of luck for the future.
I’m also quite curious on who could benefit from such attacks in the first place.
I am using Cloud Flare to protect the blog from DDOS attack, is there any other best application available to replace cloudflare? Is there a way to stop the DDOS or brute force attack for wordpress sites?
Great write up & good to see such honesty and transparency. I think it is important for readers of this to understand that DDoS attacks can affect anyone at any time on any host. Obviously when you are on the receiving end of a nullroute it is not nice, but it’s important to note that providers do not want you to have downtime; if a DDoS directed at you is affecting other customers and you don’t have some form of mitigation, there is seldom any other option than to take this action. As they said, ‘cut off a finger to save the hand’. I’m quite sure that if someone else is being DDoS’d that you would prefer to see them nullrouted than have your own service impacted, so that has to work both ways in my eyes.
It’s important to look at the issue objectively – DDoS attacks are not going to go away and really if you have concerns around protection then this does mean paying for a mitigation service, especially if outages will be more costly than the monthly sub.
@Srinivas – You’ll need a CloudFlare business plan for DDoS attack mitigation. Simply being behind CloudFlare on a free plan won’t give you this protection, and there isn’t another service that I am aware of that provides free DDoS protection without at least having some other paid service. Keep in mind that CloudFlare isn’t an application, but rather a service which is totally separate from your Wordpress sites. If you want to run something locally to stop a brute force attack then have a look at a plugin such as Wordfence, which is very effective. Another good plugin is iQ Block Country which uses GeoLocation – you can lock down your back end to whitelisted countries only. Plugins are not infallible, but they definitely add extra security. Another good way to stop brute force attacks is by not using obvious account names for the administration area of your site…lots of tools will try to brute force on usernames like ‘admin’ – as with any security approach, it’s all about the layers!
As a final note, I do always find it interesting when posts like this attract the critics who dish out ‘advice’ about how X and Y should have already been done, or that they are amateur, etc. I would like to know which fairytale jobs they have at companies that have everything 100% perfect with 100% uptime and 0% chance of outages or attacks…
Fair play Linode, tip of the cap.
Thank you for your honesty and transparency. Very very good post. Thank you for your hard work during the attacks even on holidays. Keep pushing Linode Team!
Yeah, thank you also for your transparency. I remember what happened; everything brought tears, and I think, like many people, we planned to move to another company. Even some days ago, I compared with AWS, reading their docs for RDS, EC2, ELB, S3, etc., but even with far fewer options and possibilities than Amazon, Linode remains for us a better company, with great, responsive support, providing faster and cheaper solutions.
I started with Linode 4 years ago, I loved the service and I am not going to go away from you guys. I know how painful firefighting could be, thanks to your team for working so hard. And please do everything that could prevent this from repeating.
Hello,
In the article you said the following:
“our nameservers are now protected by Cloudflare, and our websites are now protected by powerful commercial traffic scrubbing appliances.”
But it seems that is no longer the case. Did you move away from Cloudflare protection? If yes, then why? Many hosting giants now rely on Cloudflare protection.
Thank you for this update and the recent additional high memory and $5 options.