Bandwidth caps on internal network
My application runs on several nodes with different roles, and there is sometimes around 1MB of data transferred between nodes for a single web request. With a 50Mbps cap, that transfer would take around 160ms - that's quite a while.
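(Rough arithmetic: 1MB is about 8Mbit, and 8Mbit / 50Mbit/s = 0.16s, i.e. roughly 160ms per transfer.)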
After rebooting to change to a 250Mbps cap, I saw a measurable improvement in overall request time for many of my customers…on the order of a 30% improvement for certain requests. Purely because of internal network speed.
So anyway, if you heavily use the internal network, just be aware the bandwidth caps apply to that!
@gregr:
My application runs on several nodes with different roles, and there is sometimes around 1MB of data transferred between nodes for a single web request. With a 50Mbps cap, that transfer would take around 160ms - that's quite a while.
I don't know the details but 1MB internal traffic per web request sounds obscenely high.
HOWEVER. The shaper Linode is using on its host seems to be suffering from a, uh, somewhat not very nice case of bufferbloat, which in some cases results in very bad intra-DC performance or even complete TCP failures between, for example, frontends and backends. I will be writing something up for the Feature Request/Bug Report subforum on this topic shortly - along with our current workaround, I just need to double-check using a pristine Linode kernel first.
In the meantime these might be interesting reads on the topic:
http://en.wikipedia.org/wiki/Bufferbloat and
http://www.bufferbloat.net/
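If you want to see the effect yourself, one simple way (assuming iperf is installed on both ends) is to saturate the private link while watching ping latency to the same node:

iperf -s                     # on the receiving Linode
iperf -c 192.168.x.x -t 30   # on the sending Linode
ping 192.168.x.x             # in another terminal on the sender

If the round-trip time climbs sharply while the transfer is running, that's the bloated buffer at work.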
@sednet:
@gregr:
My application runs on several nodes with different roles, and there is sometimes around 1MB of data transferred between nodes for a single web request. With a 50Mbps cap, that transfer would take around 160ms - that's quite a while.
I don't know the details but 1MB internal traffic per web request sounds obscenely high.
It's a lot, but I definitely wouldn't call it "obscenely high". There are lots of applications that require a lot of data to be flowing around.
This particular case doesn't happen for every request - just some of them. It's retrieving up to 2 months of 1-minute interval financial pricing data from an internal server that stores all of this data, in order to generate a chart for the user. The results of this are cached - but every now and then it has to be generated from scratch, and it takes a fair amount of data to do it. That 1MB is highly compressed as well - it's quite a bit larger in its natural form.
And if you imagine 5 or 10 of these all happening in parallel, you can see how bandwidth matters a lot.
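(Roughly: ten of those 1MB transfers is about 80Mbit on the wire, so around 1.6s of raw transfer time at 50Mbit/s versus about 0.3s at 250Mbit/s, before any queueing effects.)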
@gregr:
So anyway, if you heavily use the internal network, just be aware the bandwidth caps apply to that!
Bandwidth caps definitely do not apply on the internal network. I do lots of transfers between a database server and 5 web server nodes, and if that were the case I'd be in deep trouble.
If it shows up as Priv. In / Priv. Out on the traffic graphs on the Linode Manager, it is not being counted towards your cap. I have a graph currently showing an average of 58 Mbit/sec private outbound traffic on the internal network (192.168.x.x address) and no other traffic generated from that host, and it says my combined traffic for the day is a whopping 3.5 MB.
So definitely internal traffic does not count towards your cap, if you use the internal network IPs.
@dataiv:
@gregr:
So anyway, if you heavily use the internal network, just be aware the bandwidth caps apply to that!
Bandwidth caps definitely do not apply on the internal network. I do lots of transfers between a database server and 5 web server nodes, and if that were the case I'd be in deep trouble.
If it shows up as Priv. In / Priv. Out on the traffic graphs on the Linode Manager, it is not being counted towards your cap. I have a graph currently showing an average of 58 Mbit/sec private outbound traffic on the internal network (192.168.x.x address) and no other traffic generated from that host, and it says my combined traffic for the day is a whopping 3.5 MB.
So definitely internal traffic does not count towards your cap, if you use the internal network IPs.
He's not talking about transfer quota. He's talking about the port speed cap on Linodes, i.e. 250 Mbps.
@dataiv:
my combined traffic for the day is a whopping 3.5 MB.
excuse me? :roll:
@trippeh:
Note that staff would set that limiter higher upon request if it caused you any issues. It was mostly considered a protective measure, I suppose, to keep a single VM from monopolizing the host's port or inadvertently causing damage to other networks.
HOWEVER. The shaper Linode is using on its host seems to be suffering from a, uh, somewhat not very nice case of bufferbloat, which in some cases results in very bad intra-DC performance or even complete TCP failures between, for example, frontends and backends. I will be writing something up for the Feature Request/Bug Report subforum on this topic shortly - along with our current workaround, I just need to double-check using a pristine Linode kernel first.
:) In the meantime these might be interesting reads on the topic:
http://en.wikipedia.org/wiki/Bufferbloat and
http://www.bufferbloat.net/
As a linode user, I would love it if I could help them try out an fq_codel-enabled shaper on their servers (if that is what they are using), as well as on the underlying hardware under the VM with no shaper. The results we get all the way up to 10GigE have been remarkable.
The results at 4-100Mbit are even more remarkable.
See, for example, cablelabs' results on cable modems as discussed in last week's ietf iccrg meeting:
Plenty more data like that floating around. Some caveats apply particularly at lower speeds:
please have someone at linode contact me offline if you would like to try this stuff out.
dave taht
It seems it is a little less of an issue now at 250Mbit/s, but it is still quite helpful for the responsiveness of the sites we run when under stress (whether from one huge backup job or just a user with fast pipes).
Yes, I'm still meaning to do those measurements.
(Edit) PS! I do not know if the vanilla Linode kernel has all the required bits compiled in. I'm using pv-grub to load our own kernel. Kernels 3.4 and newer ship with the needed parts, but they might not be enabled.
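A quick way to check for the needed bits, assuming your kernel exposes its config (via /proc/config.gz or a /boot/config-* file):

zcat /proc/config.gz 2>/dev/null | grep -E 'NET_SCH_FQ_CODEL|NET_SCH_HFSC|CONFIG_IFB'
grep -E 'NET_SCH_FQ_CODEL|NET_SCH_HFSC|CONFIG_IFB' /boot/config-$(uname -r) 2>/dev/null

You want those set to y or m.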
You can test it directly by running
IFACE=eth0 IF_EGRESS_RATE=240Mbit ./shaper
#!/bin/sh
#
# Add fq_codel to all networking devices without any config.
#
# Save as /etc/network/if-up.d/shaper
# chmod 755 /etc/network/if-up.d/shaper
#
# Needs:
#   - a fairly recent iproute package (from sometime in 2012 IIRC)
#   - fq_codel and Byte Queue Limits kernel support (mainline since 3.4 I think)
#   - HFSC shaper kernel support if using egress-rate
#   - IFB device kernel support if using ingress-redir
#
# Support setting a custom egress-rate to avoid excessive buffering when
# it happens upstream of us (say, shaper on a vm host). Otherwise we
# just add as root qdisc.
#
# Avoids wireless devices because they are currently incompatible.
#
# In /etc/network/interfaces
# if no excessive buffering upstream, no config needed. fq_codel is
# attached directly to device.
#
# If excessive buffering upstream - limit our egress by setting a egress-rate:
#
# iface eth0 inet dhcp
# egress-rate 240Mbit
#
# To shape ingress, redirect to a queueing device and set the egress
# limit on that (works only because TCP tries to be nice)
#
# iface eth0 inet dhcp
# ingress-redir ifb0
#
# iface ifb0 inet manual
# egress-rate 500Mbit
#
# Note that the bandwidths must be set low enough to avoid packets
# getting queued in the upstream buffer, typically a little under what
# you're sold.
#
# Other available parameters are:
# egress-target try to keep buffering below this value, default 5ms
# egress-flows fq_codel flow "buckets", default 10240
# egress-ecn do ECN marking when saturating, "yes" to turn on - default off
#
# It can be useful to also turn off TSO, GSO and GRO using ethtool -K
# to improve accuracy (at the cost of some more CPU usage).
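# (e.g., assuming eth0: ethtool -K eth0 tso off gso off gro off)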
#
# Debug (show commands)
set -x
IP=/sbin/ip
TC=/sbin/tc
[ ! -x "$TC" ] || [ ! -x "$IP" ] && exit 0
[ "$IFACE" = "lo" ] && exit 0
# Buggy logic to detect physical device and exclude wireless
[ ! -e "/sys/class/net/$IFACE/device" ] && [ -z "$IF_EGRESS_RATE" ] && exit 0
[ -e "/sys/class/net/$IFACE/wireless" ] && [ -z "$IF_EGRESS_RATE" ] && exit 0
[ -z "$IF_EGRESS_TARGET" ] && IF_EGRESS_TARGET=5ms
[ -z "$IF_EGRESS_FLOWS" ] && IF_EGRESS_FLOWS=10240
# Is ECN useful on egress when we are the source (not forwarding)?
# leave off by default for now.
ecn="noecn"
if [ "$IF_EGRESS_ECN" = "on" ] || [ "$IF_EGRESS_ECN" = "yes" ]; then
ecn="ecn"
fi
# Reset
$TC qdisc del dev $IFACE root 2>/dev/null || true
$TC qdisc del dev $IFACE ingress 2>/dev/null || true
if [ ! -z "$IF_EGRESS_RATE" ]; then
# Bandwidth limiting mode
echo "Setting egress rate of $IFACE to $IF_EGRESS_RATE, target $IF_EGRESS_TARGET, flows $IF_EGRESS_FLOWS"
$TC qdisc add dev $IFACE root handle 1 hfsc default 1
$TC class add dev $IFACE parent 1: classid 1:1 hfsc sc rate $IF_EGRESS_RATE ul rate $IF_EGRESS_RATE
$TC qdisc add dev $IFACE parent 1:1 handle 11: fq_codel target $IF_EGRESS_TARGET flows $IF_EGRESS_FLOWS $ecn
else
# Link-limited mode (default; don't fail interface bring-up if it doesn't work)
echo "Setting scheduler of $IFACE to fq_codel, target $IF_EGRESS_TARGET, flows $IF_EGRESS_FLOWS"
$TC qdisc add dev $IFACE root fq_codel target $IF_EGRESS_TARGET flows $IF_EGRESS_FLOWS $ecn || true
fi
# Redir to queueing device
if [ ! -z "$IF_INGRESS_REDIR" ]; then
$IP link set dev $IF_INGRESS_REDIR up
$TC qdisc add dev $IFACE ingress
# Redirect both IPv4 and IPv6..
$TC filter add dev $IFACE parent ffff: protocol ip prio 1 u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev $IF_INGRESS_REDIR
$TC filter add dev $IFACE parent ffff: protocol ipv6 prio 2 u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev $IF_INGRESS_REDIR
fi
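After running it you can check that the qdiscs actually got attached with plain iproute2 commands (assuming eth0 is the shaped interface):

tc -s qdisc show dev eth0
tc -s class show dev eth0

and undo everything again with:

tc qdisc del dev eth0 root
tc qdisc del dev eth0 ingress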