Use TTL and/or load balancer to implement failover?
One possible remedy (I think) would be to set a short TTL (e.g., 5 minutes) on my website's DNS A records (bare domain and www) and, if anything goes wrong, change their IP addresses to point to the Django server. Hopefully, if I had to do this, my site would only be unavailable for about 15 minutes, i.e., until the next time Linode's name servers are updated. Is changing the TTL to such a short value a bad idea with respect to performance? Is this a bad approach for handling this type of problem?
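For what it's worth, I imagine the "switch the A records to the Django server" step could be scripted against the Linode API v4 domain-records endpoint, roughly like the untested sketch below; the token, domain and record IDs, and fallback IP are all placeholders for my own values.

```python
# Untested sketch: repoint the bare-domain and www A records at the fallback
# (Django) server via the Linode API v4. The token, IDs, and IP are placeholders.
import requests

API = "https://api.linode.com/v4"
TOKEN = "your-personal-access-token"   # placeholder: a token with Domains read/write scope
DOMAIN_ID = 12345                      # placeholder domain ID
RECORD_IDS = [111, 222]                # placeholder IDs of the bare and www A records
FALLBACK_IP = "203.0.113.20"           # placeholder: the Django server's address

headers = {"Authorization": f"Bearer {TOKEN}"}

for record_id in RECORD_IDS:
    # Point each A record at the fallback server; ttl_sec 300 keeps the 5-minute TTL.
    resp = requests.put(
        f"{API}/domains/{DOMAIN_ID}/records/{record_id}",
        headers=headers,
        json={"target": FALLBACK_IP, "ttl_sec": 300},
    )
    resp.raise_for_status()
    print(f"record {record_id} now points to {FALLBACK_IP}")
```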
Also, are there other, better ways to handle this situation? For example, I've never used a load balancer before, but I'm wondering if I could set up a load balancer and tell it to direct all traffic to the WP server. Then, if something happened to the WP server, could I take it down and tell the load balancer to re-route all traffic to the Django server until I could get the WP server operational again? Would something like this work?
What's the best way to implement fault tolerance in a situation like this? Thanks!
3 Replies
WP is relatively easy to manage via the GUI; it is its spaghetti internals that make it difficult to work with as a developer.
WRT high availability, I think your approach is misguided. The probability of WP breaking on its own, if it is properly updated, is no higher than for any other component. If you want more reliability/performance out of the entire setup, you can always run several instances of your frontend application behind a load balancer. However, depending on the load balancer, you might need additional changes to your application, especially around sessions, so that if the LB sends different requests to different nodes, they can all process the session cookie.
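For example, since your second app is Django, one common way to make sessions node-agnostic is to back them with a shared cache instead of node-local storage. This is only a sketch; it assumes the django-redis package and a Redis instance reachable from every node, and the address is a placeholder.

```python
# Django settings sketch: store sessions in a shared Redis cache so any
# node behind the load balancer can handle any session cookie.
# Assumes django-redis is installed; host/port are placeholders.
SESSION_ENGINE = "django.contrib.sessions.backends.cache"
SESSION_CACHE_ALIAS = "default"

CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://10.0.0.5:6379/1",  # placeholder private address
    }
}
```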
Changing the DNS TTL to a lower value has no practical implications for your performance; a longer TTL is more about being nice to the DNS infrastructure, since it allows resolvers to save on recursion by caching your records for longer.
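If you want to sanity-check what TTL resolvers are actually seeing for your records, a quick query with dnspython looks roughly like this (the domain name is a placeholder):

```python
# Check the TTL and targets that resolvers see for an A record.
# Requires the dnspython package; the domain name is a placeholder.
import dns.resolver

answer = dns.resolver.resolve("example.com", "A")
print("TTL (seconds):", answer.rrset.ttl)
for record in answer:
    print("A record:", record.address)
```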
I am not familiar with your situation, but at a guess I would do something like this:
1) Script the provisioning of the entire setup with e.g. Ansible; this would also give you an easily recreated staging environment. Alternatively, use StackScripts
2) Enable Linode backups
3) Measure the requests-per-second performance and actual request volume, and run load tests to discover how much your current setup can take before you need to upgrade it. It is good to know this threshold so that you have early warning and time to prepare if your traffic grows (see the load-test sketch after this list).
4) Make database snapshots (dumps off a replica, or a live hot backup with e.g. xtrabackup) and ship them off the database box (see the backup sketch after this list)
5) Keep a copy of everything in a separate pool - source code, provisioning scripts, database, DNS zone, generated configuration files. I use B2 for this kind of thing.
6) Write down the recovery procedure and test it often
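For point 3, a crude requests-per-second probe could look like the sketch below; the URL, concurrency, and request count are placeholders, and a dedicated tool such as ab, wrk, or locust will give you more reliable numbers.

```python
# Crude requests-per-second probe. URL, WORKERS and TOTAL are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://www.example.com/"   # placeholder: a representative page on your site
WORKERS = 20
TOTAL = 500

def hit(_):
    return requests.get(URL, timeout=10).status_code

start = time.time()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    statuses = list(pool.map(hit, range(TOTAL)))
elapsed = time.time() - start

ok = sum(1 for s in statuses if s == 200)
print(f"{TOTAL} requests in {elapsed:.1f}s -> {TOTAL / elapsed:.1f} req/s, {ok} OK")
```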
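For point 4, the nightly "dump it and ship it off the box" step might look roughly like this. The mysqldump flags are standard, but the database name, paths, and the rclone remote are assumptions about your setup (credentials in ~/.my.cnf and an rclone remote named b2 already configured).

```python
# Sketch: dump the database and copy the compressed dump off the box.
# Database name, paths and the rclone remote are placeholders.
import gzip
import subprocess
from datetime import datetime

dump_path = f"/var/backups/site-{datetime.now():%Y%m%d-%H%M}.sql.gz"

# --single-transaction gives a consistent InnoDB snapshot without locking tables.
dump = subprocess.run(
    ["mysqldump", "--single-transaction", "--databases", "wordpress"],
    check=True, capture_output=True,
)
with gzip.open(dump_path, "wb") as f:
    f.write(dump.stdout)

# Ship the dump to off-box storage (here B2 via a preconfigured rclone remote).
subprocess.run(["rclone", "copy", dump_path, "b2:my-backup-bucket/db/"], check=True)
print("backup uploaded:", dump_path)
```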
The above will give you a near-foolproof recovery toolkit in case of total failure and an estimate of how close you are to the red line performance-wise. From there you can plan a strategy for upgrading the setup.