Questions about DNS - Best Practices
I am using DNSMadeEasy, which is $15 per year for a global anycast network of nameservers, with apparently no downtime in their 8+ year history. Six nameserver IPs are assigned, but as I understand anycast, each nameserver IP has multiple geographically distributed hosts behind it and DNS requests get routed to the closest one. This seems to be a very reasonable price to pay for that kind of reliability and low latency.
If you want to use it, please use my affiliate link.
I configured the Linode zones as slaves to the DNSMadeEasy master using AXFR transfers, so that the DNS records would always be present on the local nameservers (better latency). However, I did not include the Linode nameservers in my registrar record, so public DNS queries are always directed to the ultra-redundant, low-latency DNSMadeEasy nameservers.
I listed 67.18.186.57 plus ns1-ns4.linode.com as the authorized IPs for AXFR requests. Do I need to include any others?
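As a quick sanity check, from a host whose IP is on that authorized list, a manual zone transfer should return the full zone (the nameserver hostname and domain below are placeholders, not my real ones):

# run from a host that is on the AXFR whitelist
dig @ns0.dnsmadeeasy.com example.com AXFR
# a host that is not authorized typically gets "Transfer failed." / REFUSED instead of the records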
Do all four nameservers synchronize as slaves directly with the DNSMadeEasy master, or does one of them sync and then propagate to the others within Linode? If it's the former, that seems like a waste of requests, especially since only the local Fremont nameserver would be used for my Fremont-based Linode. (I assume my resolver would check the local DNS only and should always get a hit, since it is authoritative, and even if it did not, I assume it would go out to the internet rather than check the other Linode nameservers.)
If all four are synchronizing directly with DNSMadeEasy, is there any way to limit it to the Fremont one without causing more trouble than it solves? For example, if I only authorized the Fremont IP for AXFR, would the other nameservers just continue to poll the DNSMadeEasy master and get errors, such that it would not be much less efficient than letting all four nameservers do the synchronization?
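One way to answer this empirically would be to compare the SOA serial each Linode nameserver is serving with the serial on the DNSMadeEasy master right after making a change (hostnames and the domain below are placeholders):

for ns in ns1 ns2 ns3 ns4; do
  echo -n "$ns.linode.com: "
  dig +short @$ns.linode.com example.com SOA | awk '{print $3}'   # third field is the serial
done
dig +short @ns0.dnsmadeeasy.com example.com SOA | awk '{print $3}'  # master serial for comparison

If all four serials jump at roughly the same time, each slave is pulling from the master independently.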
I only have two IP addresses. One is used as a mail server, and as I understand it, reverse DNS is important in some cases for spam filtering, etc. I set up rDNS at Linode so that the mail server IP resolves to the right domain name, but I am curious whether there is any way to take advantage of the DNSMadeEasy network for rDNS to improve the performance and reliability of rDNS queries.
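In the meantime I can at least confirm what PTR record the rest of the world sees with a reverse lookup (the address and hostname below are placeholders from the documentation ranges):

dig +short -x 203.0.113.10       # should print mail.example.com. if the PTR is set correctly
dig +short mail.example.com A    # forward record should match the IP (forward-confirmed rDNS)

My understanding is that the reverse (in-addr.arpa) zone follows whoever controls the IP block, so the PTR generally has to live wherever Linode serves it unless they were willing to delegate the reverse zone elsewhere.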
I am curious why people run their own DNS (besides just the fun and learning experience of DIY). I am new to all this, so any other feedback and suggestions are always welcome.
9 Replies
A higher TTL has the benefit that DNS records are cached for longer and are therefore more likely to be in a caching nameserver when there is a DNS request (lower latency, since a local copy is used). As a result, cache-miss requests are less frequent, reducing load on the authoritative nameservers.
A lower TTL has the advantage that a change to a DNS record propagates through the web more quickly, since the maximum time an old copy should remain in a remote cache is TTL seconds. So the low end gives more update flexibility, and the high end gives lower DNS request latency and reduced load on the authoritative nameservers.
That said, people have pointed out that you can have your cake and eat it too with a little planning: reduce the TTL in advance of changes (if you had a high TTL of one day, you would have to lower it one day in advance), get quick propagation for the changes, and then raise the TTL again afterwards to maximize caching.
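As a concrete (hypothetical) illustration of that schedule, in zone-file notation with placeholder names and addresses:

; normal state: one-day TTL
www.example.com.  86400  IN  A  203.0.113.10
; at least 24 hours before the planned change, drop the TTL so old cached copies expire quickly
www.example.com.  300    IN  A  203.0.113.10
; make the change; within roughly 5 minutes essentially all caches should pick it up
www.example.com.  300    IN  A  198.51.100.20
; once the new address is confirmed working, raise the TTL back up
www.example.com.  86400  IN  A  198.51.100.20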
The surprising fact is that studies suggest that in many cases a TTL of only 1800 seconds (30 minutes) results in cache hit rates of more than 80%. See figure 5 in the following paper (three different sample trace files of connections): the returns from increasing the TTL diminish quickly, and the cache hit rate approaches the mid-90 percent range with a TTL in the 2-6 hour range.
This suggests that there is not much additional latency reduction or reduction in nameserver load from increasing the TTL beyond that range. Given that the most common TTL is probably 24 hours, there is likely room for reduction.
On the other hand, if your DNS records are quite stable, keeping the TTL low buys you little flexibility in return. You may well decide that even though relatively few cache misses are avoided by increasing the TTL, you might as well do it and reduce the latency for those requests. One point the paper makes is that the hit rate can be dramatically affected by the distribution of accesses, so it's possible that even though the modelling worked well for their real-world traces, real-life experience may be inconsistent with these findings.
Any observations, suggestions, comments on setting TTL?
Set the TTL to the default, lower it in advance of any changes to existing records, and then get on with your life.
There are more important things to worry about than insignificant DNS settings.
For example, as I suggested in the postings above, I thought that by synchronizing the Linode nameservers as slaves to the DNSMadeEasy network, local name resolution by the Linode resolvers would hit a local authoritative copy of the DNS records on a nearby nameserver before going out to the internet to do a lookup. But I am told that the resolvers do not consult the Linode nameservers unless they are among the nameservers listed for the domain at the registrar. So my conjured configuration would not work the way I had expected, and I have removed the slave configuration from the Linode servers.
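That matches how resolution actually works: recursive resolvers only learn about the nameservers delegated in the parent zone (i.e. what the registrar has on file), which you can watch with a trace (the domain below is a placeholder):

dig +trace example.com A
# the referral chain runs root -> .com -> whatever NS set is delegated at the registrar;
# nameservers outside that delegation are simply never consulted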
I've always kept my DNS separate from my hosting/server provider, or run it myself. I agree with H3LR4ZR that DNS should be a set-it-and-forget-it service, but too many providers manage to get it wrong or have outages. The last time I trusted a host (a respectable one, too) with a tiny part of my DNS, they had a DNS outage soon afterwards, causing me a lot of disruption. Needless to say, I immediately took it back under my direct control and haven't had an issue since.
TTL wise, unless you make regular updates it doesn't really matter. I generally use 1 hour, but some providers ignore the TTLs in their caches anyway.
The other advantage of separating your DNS from your hosts is that when you want to move for whatever reason, it's one less thing to migrate.
If you've only got a few records in your zones, a note of what they are is probably a good enough backup should anything really bad happen to your DNS setup - just point your domains at new nameservers fast (and upload your records into the new zone)!
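If your provider allows zone transfers from your own IP, dumping the zone to a file is an easy way to keep that backup current (nameserver and domain below are placeholders; many providers refuse AXFR by default):

dig @ns0.dnsmadeeasy.com example.com AXFR > example.com.zone.backup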
Finally, on the performance front: unless your nameservers have slow, unreliable or poor connectivity, you should be fine. If super-fast resolution is of paramount importance (I've had requirements for 1 millisecond resolution in the past), you need to own the whole DNS and possibly the network infrastructure and contain it within your own (or rented) datacenters. Speed should be a "don't worry about it" if your users are out on the internet.
Given that the hit rate is >95% for a relatively small TTL, as per the paper above, most lookups are answered from the local caching nameserver and so have nothing to do with the DNS provider. The DNS provider is only hit for the <5% of queries that do not result in a cache hit.
I guess the latency I am hoping to avoid is at least some of the long tail in web page download times. I'm not sure what percentage of that is attributable to DNS, and even where it is, it may be the part of the resolution delay from the root servers on down that has nothing to do with which authoritative nameservers I use. Maybe the latency benefit of DNSMadeEasy is largely imagined on my part…
Another interesting tidbit that further reduces dependence on DNS provider response times is DNS prefetching. Apparently Google Chrome, at least, immediately initiates DNS queries for all the domains linked from a page before the user clicks them, so the DNS information is already locally cached if and when the user follows a link. It also preloads DNS information for the domains suggested in the omnibox while the user is typing. This prefetching no doubt further reduces the frequency of DNS lookups where the user actually experiences the latency. I'm not sure whether other browsers do any prefetching.
@000:
but too many providers manage to get it wrong
Something like this? Domain obscured to protect the guilty.
fukawi2 ~ $ dig <snip>.com.au ns
; <<>> DiG 9.6.1 <<>> <snip>.com.au ns
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19243
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;<snip>.com.au. IN NS
;; ANSWER SECTION:
<snip>.com.au. 3567 IN NS pqlddc-csg01.
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Oct 8 12:44:57 2009
;; MSG SIZE rcvd: 66
fukawi2 ~ $ jwhois <snip>.com.au
<snip>Name Server: ns1.<snip>.com.au
Name Server IP: 203.27.142.53
Name Server: ns2.<snip>.com.au
Name Server IP: 203.27.142.53
I can spot at least 3 problems with the above.
More about TTL: the study I pointed to above made the point that for the distributions they studied, a TTL of 15 minutes led to an 80% cache hit rate. Eyeballing the graphs, the hit rate climbs quickly to the mid-90s for TTLs in the 2-6 hour range. But they also make the point that with different usage distributions the hit rate can be significantly different. For example, they point out that for an exponential distribution (Poisson process) with a mean of 2000 seconds, the hit rate at a TTL of 15 minutes would be only 31%.
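That 31% figure is easy to reproduce with a back-of-the-envelope model (my own sketch, not lifted from the paper): assume queries seen by a given caching nameserver arrive as a Poisson process with rate λ; each miss caches the record for TTL seconds, during which an average of λ × TTL further queries arrive and are hits, so

hit rate ≈ (λ × TTL) / (λ × TTL + 1)

With a mean inter-arrival time of 2000 seconds and TTL = 900 seconds, λ × TTL = 0.45 and the hit rate is 0.45/1.45 ≈ 31%, matching the paper's example. Under the same assumption, a 6-hour TTL gives 10.8/11.8 ≈ 92%, and doubling to 12 hours only nudges that to about 96% - the same diminishing-returns shape as in their graphs.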
I think the distinction is more significant than they make it out to be, and it really comes down to the type of site you run. For high-volume sites like Google, Amazon, etc., the request stream is heavy and steady, and the paper's main analysis applies. But website traffic has a long tail: most sites have much smaller traffic, and when you further break that down to the traffic a particular caching nameserver sends to a particular website, that arrival pattern is most likely going to be Poisson. In many cases it's probably just the usage pattern of a single user.
For one of my sites, where I send out emails to notify subscribers of events and they click through to the website to see more details and register if interested, there is a spike and then probably an exponential decay in visits as people read the email and check the website. Some register immediately. Many will think about the event, consider competing plans, talk to friends who might attend, and may check back at the site to review details and/or register hours, or a day or two, later. Then there might be little traffic for the rest of the month until the next email.
I think even when it is not an email that triggers the visit, there is something on the user's mind (searching for a product, information, etc.) that causes them to visit a site. They may visit many pages in the 0-30 minutes after the initial DNS request, but it's also likely they take breaks and come back hours, or a day or two, later. Of course they may not come back for weeks, but I think there is a sort of temporal locality of reference in user browsing behavior that is not limited to a single visit (visiting multiple pages) but extends across visits: when you visit a site, you are more likely to come back to it in the near future…
It seems most people try to set the TTL as low as reasonably possible to maximize update flexibility. But given that in the vast majority of cases the DNS information is "set it and forget it" for long periods, and that you can anticipate when you will be entering a period of change (even if you don't yet know exactly what the change will be), this strategy seems to have little value.
I think many people feel caching is free, so they might as well have resolvers update frequently. But I have come around to thinking that one should set the TTL to a reasonable maximum unless you are in a period of updating the DNS information. Caching nameservers around the world are willing to store your records for as long as you like, for free, and this reduces latency and the susceptibility to the long tail of DNS responses (as mentioned in the Google article above), so why not take advantage of it?
When I combine that with thinking about the usage pattern of a typical user responding to an event notice, I am thinking of setting the TTL to 3-7 days. That would capture the vast majority of visits and revisits during the event-contemplation stage, and I think it's extremely unlikely that I would want to make a DNS change where I couldn't lower the TTL a week in advance without inconvenience. I recognize this is a largely philosophical discussion. I can only speculate that this would reduce the long tail of DNS latency mentioned in the Google article (and thereby improve load times and user satisfaction?), but for all I know some or all of that long tail may be due to glitches at the caching nameservers!