[b]Linux 2.6 on the hosts[/b][/size]
Six of the 20 host servers are now running the Linux 2.6 kernel with the CFQ fair-queuing disk scheduler.
Now that we've been running it on a few machines for a while, I have a good idea of how it performs. I've noticed that 2.6 is better for some workloads, and slightly worse for others, compared to 2.4 (determined by comparing mrtg and vmstat output before and after the move to 2.6). I'm optimistic that there are additional gains to be had from some of the VM tuning options (/proc/sys/vm/*).
Overall, I think 2.6 is a "good thing", and we will eventually migrate the rest of the hosts to 2.6.
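As a rough illustration of where those tuning options live, the sketch below lists the tunables under /proc/sys/vm with their current values. The function name read_vm_tunables is made up for this example, the set of files depends on the kernel version, and nothing here reflects the settings actually used on the hosts.

```python
from pathlib import Path


def read_vm_tunables(root="/proc/sys/vm"):
    """Return {name: value} for every readable regular file directly under root."""
    tunables = {}
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories and anything that is not a plain file
        try:
            tunables[entry.name] = entry.read_text().strip()
        except OSError:
            # Some entries are write-only or require root; skip them.
            continue
    return tunables
```

Calling read_vm_tunables() before and after a change gives a simple record of which values were adjusted.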
[b]Disk I/O Thrashing is no more![/b][/size]
The main reason I wanted to move to 2.6 was improved I/O performance over 2.4.
Linux is susceptible to what I'd call a "hard drive denial-of-service attack" when a high rate of random read/write requests fills up the queue(s). This causes latency problems for other requests and essentially slows everything down.
This is exactly the type of workload that occurs when a Linode is continuously thrashing its swap devices (reading and writing rapidly) and the host is under pressure to write out those dirty pages (which will always be the case, after a while). Unfortunately, the CFQ patch in 2.6 did not solve this problem. (Neither did the default anticipatory or deadline schedulers.)
CFQ helps a little when many threads are doing random I/O (such as during parts of the cron run), but it does not eliminate the possibility of a single Linode bogging down the entire host. Read on for the solution...
[b]UML I/O request token-limiter patch[/b][/size]
I've implemented a simple token bucket filter/limiter around the async UBD driver in UML. The token-bucket method is pretty neat. Here's how it works: every second, x tokens are added to the bucket. Every I/O request requires a token, so a request has to wait until the bucket has tokens available before the I/O can be performed.
This method allows unlimited throughput until the bucket is empty, and then it starts to throttle. Perfect!
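The description above can be sketched in a few lines of Python. This is only an illustration of the token-bucket idea, not the code from the patch (which lives in the UML kernel's UBD driver). The class and parameter names are hypothetical, and tokens are refilled continuously in proportion to elapsed time rather than in one batch per second.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, holds at most `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is not throttled
        self.clock = clock
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_take(self):
        """Consume one token if available; return True when the I/O may proceed."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self):
        """Seconds until one token is available (0.0 if one is available now)."""
        self._refill()
        if self.tokens >= 1:
            return 0.0
        return (1 - self.tokens) / self.rate
```

A caller would issue the request when try_take() returns True and otherwise sleep for wait_time() seconds before retrying. A large capacity with a modest rate gives the behaviour described: bursts run at full speed until the bucket drains, after which throughput settles at the refill rate.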
Links:
[url=http://www.theshore.net/~caker/patches/token-limiter-v1.patch]token-limiter-v1.patch[/url]
[url=http://www.theshore.net/~caker/patches/token-limiter-v1.README]token-limiter-v1.README[/url]
[b][color=darkred]With this patch, a single Linode can no longer bog down the host![/color][/b]
This is a big deal, because the only way to fix this problem when it occurs has been for me to step in and shut down the offending Linode.
The limiter patch is in the 2.4.25-linode24-1um kernel (2.6 will follow shortly).
The defaults are set very high, and I doubt any of you will feel their effect under normal use. I can change the refill and bucket-size values at runtime, so I'll be able to build a monitor for each host that dynamically changes the profiles depending on host load. This is very important 🙂
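One way such a monitor might choose a profile is sketched below. Everything in it is an assumption made for illustration: the thresholds, the (refill rate, bucket size) numbers, and the function name are invented, and the post does not describe how values are actually pushed into a running UML, so that step is left out.

```python
def pick_profile(load, profiles):
    """Return the profile whose threshold is the highest one not exceeding `load`.

    `profiles` is a list of (load_threshold, (refill_rate, bucket_size)) pairs
    sorted by ascending threshold; the first entry acts as the default.
    """
    chosen = profiles[0][1]
    for threshold, profile in profiles:
        if load >= threshold:
            chosen = profile
    return chosen


# Hypothetical profiles: the busier the host, the tighter the limits.
EXAMPLE_PROFILES = [
    (0.0, (512, 40000)),
    (4.0, (256, 20000)),
    (8.0, (64, 5000)),
]
```

On a Unix host the load figure could come from os.getloadavg()[0], sampled periodically, with the chosen pair then applied to each guest by whatever runtime interface the patch provides.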
[b]Linux 2.6 for the Linodes[/b][/size]
I haven't officially announced the 2.6-um kernel yet. There are still a few bugs and performance issues to work out. I don't recommend running the 2.6-um kernel in production yet, but a few adventurous users have been testing it and have reported some of the quirks involved in getting it to work under each distro. I'll try to compile a guide on migrating to 2.6 and publish it once the kernel is more stable.
[b]What's new in the UML world?[/b][/size]
We've been in need of new UML patches for a long time. I expect we'll have a new UML release (for both 2.4 and 2.6) within the next two weeks or so.
In addition to the usual bug fixes, I know Jeff has been working on AIO support for the I/O driver in UML. AIO is a new feature implemented in 2.6 (on the hosts). Here are some of the benefits:
[list]
[*]The ability to submit multiple I/O requests with a single system call.
[*]The ability to submit an I/O request without waiting for it to complete, and to overlap it with other processing.
[*]Optimization of disk activity by the kernel, which can combine or reorder the individual requests of a batched I/O.
[*]Better CPU utilization and system throughput by eliminating extra threads and reducing context switches.
[/list]
More information on AIO:
http://lse.sourceforge.net/io/aio.html
http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf
-
That's all!
-Chris
Comments (12)
This may be naive, but wouldn’t it help tremendously to have all the swap partitions for a given linode on a different drive?
[quote:9a75d3e3be=”diN0″]This may be naive, but wouldn’t it help tremendously to have all the swap partitions for a given linode on a different drive?[/quote]
It might, but that’s not the point, really. Before this patch, a single UML could consume all of the I/O (say, for a given device, like you suggested). It would still cause the same problem when other Linodes tried to access the device. The same effect can be had with “swap files” that exist on your filesystem (rather than actual ubd images) or heavy I/O on any filesystem.
With this patch, I am able to guarantee a minimum level of service. Previously that wasn’t possible.
-Chris
Great work chris, I genuinely can’t think of anything else you can improve upon! 😉
Chris,
I tried the 2.6 kernel of Redhat 9 (large) a few days ago. It failed to boot & I had to switch back to 2.4.
Another forum thread had the same problem.
/dev/ubd/disc0: unknown partition table
/dev/ubd/disc1: unknown partition table
I am really excited about this. As you know I have been one of the most vocal proponents of some system of throttling disk I/O so that an overzealous Linode cannot DOS the host.
It sounds like this solution will require everyone to upgrade to a 2.6 kernel, which means that it cannot be applied until everyone is ready to go to 2.6 (and it will only be effective when *everyone* has upgraded to this fixed kernel). So I guess the solution is months away. But at least there is a plan in the works to solve this problem for good.
Great job man! Keep up the good work!
Just curious – why not solve this problem in the host kernel instead? Can the host kernel be patched to limit any one of its processes using the I/O token system that you have devised? Then the Linodes themselves can run any kernel they want to and the host system will prevent any one from thrashing the disk.
Ideally this would be some kind of rlimit option, so that it could be applied just to the Linode processes themselves and not to the other processes of the host system.
I don’t know if the I/O layer that’s deeper in the kernel than the UML ubd driver is harder to work with though … perhaps it would be more complex to modify the fundamental Linux I/O code than it is to modify the ubd driver?
caker, thanks for all the hard work you’ve put in to keep the linode hosts in top shape.
It’s rather surprising that CFQ didn’t solve the I/O scheduling problem, though. The algorithm is supposed to be [i]completely fair[/i] towards each thread requesting I/O. 😛
[quote:52760ef410=”Quik”]Great work chris, I genuinely can’t think of anything else you can improve upon! :wink:[/quote]
Thanks Quik 🙂
[quote:52760ef410=”gmt”]Chris,
I tried the 2.6 kernel of Redhat 9 (large) a few days ago. It failed to boot & I had to switch back to 2.4.
Another forum thread had the same problem.
/dev/ubd/disc0: unknown partition table
/dev/ubd/disc1: unknown partition table[/quote]
You can always ignore this warning message — it’s just telling you that the ubd devices are not partitioned. You’re using the entire block device as one giant ‘partition’.
To get 2.6 to work under RedHat, first rename /lib/tls to something else (since 2.6-um and NPTL don’t mix yet).
-Chris
[quote:2eaacf3890=”bji”]I am really excited about this. As you know I have been one of the most vocal proponents of some system of throttling disk I/O so that an overzealous Linode cannot DOS the host.
It sounds like this solution will require everyone to upgrade to a 2.6 kernel, which means that it cannot be applied until everyone is ready to go to 2.6[/quote]
Not sure where you read that from my post. I’ve already patched the 2.4.25-linode24-1um kernel with the token-limiter patch, and 2.6-um to follow shortly.
[quote:2eaacf3890=”bji”](and it will only be effective when *everyone* has upgraded to this fixed kernel). So I guess the solution is months away. But at least there is a plan in the works to solve this problem for good.[/quote]
Most/all of the repeat offenders have already been rebooted into the “linode24″ kernel (with the limiter patch). So the solution is in effect right now. But, you are correct — there are still many Linodes running un-limited.
[quote:2eaacf3890=”bji”]Great job man! Keep up the good work![/quote]
Thanks!
-Chris
[quote:f066e66db0=”bji”]Just curious – why not solve this problem in the host kernel instead? Can the host kernel be patched to limit any one of its processes using the I/O token system that you have devised? Then the Linodes themselves can run any kernel they want to and the host system will prevent any one from thrashing the disk.
Ideally this would be some kind of rlimit option, so that it could be applied just to the Linode processes themselves and not to the other processes of the host system.
I don’t know if the I/O layer that’s deeper in the kernel than the UML ubd driver is harder to work with though … perhaps it would be more complex to modify the fundamental Linux I/O code than it is to modify the ubd driver?[/quote]
I agree — the correct solution is to get Linux fixed, or perhaps to get UML to use the host more efficiently. Some of the UML I/O rework is already under way (the AIO stuff), but that kind of thing *is* months away…
One interesting “feature” of the CFQ scheduler is an ionice priority level. But, I wasn’t able to get the syscalls working to test it.
-Chris
[quote:01c9cda963=”griffinn”]caker, thanks for all the hard work you’ve put in to keep the linode hosts in top shape.
It’s rather surprising that CFQ didn’t solve the I/O scheduling problem, though. The algorithm is supposed to be [i]completely fair[/i] towards each thread requesting I/O. :P[/quote]
I’m not sure where the bottleneck is — but as far as I can tell, CFQ and the standard scheduler in 2.4 appear equally (non)responsive in the worst-case scenario. Go figure…
One interesting thing is that UML uses the no-op elevator. Jeff and I got into a discussion about this, and he says there’s no point to UML doing any request merging, but I disagree. I’d rather have UML do some of its own request merging and reordering than force the host to do it all. Plus, it makes UML appear to the host as more of a streaming type load than a random load…
Think back to the last set of tiobench benchmark results you’ve seen: look how poor the random I/O results are compared to “streaming-read” and “streaming-write”…
So .. another hack to the UML code (one-liner) to test…
-Chris
Thanks, Caker. I have a tiny linode and I make almost no demands on the system, so far at least. However, fairness is part of what you sell. It sounds like the leaky bucket in the UM kernel solves most of the problem with a minimum of effort. I’ve been implementing fairness algorithms for at least 30 years, so I have a few theoretical observations and questions:
You appear to be issuing tokens independently to each process at an absolute rate, independent of the actual resource availability. This means that a UML may get limited even if nobody else wants the resource, yes? It might be better for the host kernel to issue tokens at an overall rate to the UMLs. That way a particular UML can use the whole resource if nobody else wants it. Since everybody’s buckets are full, the instant anyone else wants to use the resource the original user is throttled to 50% as the tokens are returned equally to the two users, and so on as more users are added. That is, the main kernel returns tokens equally to each UML with a non-full bucket, but does not add tokens to a bucket that is already full. The host kernel should dynamically adjust its token generation rate to just keep the resource occupied. I’ve successfully done this in the past by watching the resource: if the resource goes idle when there are any empty buckets, slightly increase the token rate. If the resource never goes idle, slightly decrease the token rate.
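The scheme proposed in this comment might look roughly like the sketch below. It is one reading of the paragraph above, not anything implemented on the hosts; the function names, the 5% adjustment step, and the choice to re-share tokens that would overflow a bucket are all assumptions.

```python
def distribute(buckets, capacity, new_tokens):
    """Share new_tokens equally among the buckets that are not yet full.

    `buckets` maps a guest name to its current token count and is updated in
    place. Tokens that would overflow a bucket are re-shared among the others.
    """
    remaining = new_tokens
    while remaining > 0:
        open_names = [name for name, count in buckets.items() if count < capacity]
        if not open_names:
            break  # every bucket is full; surplus tokens are discarded
        share = remaining / len(open_names)
        remaining = 0
        for name in open_names:
            room = capacity - buckets[name]
            if share >= room:
                buckets[name] = capacity   # fill it and pass the surplus on
                remaining += share - room
            else:
                buckets[name] += share
    return buckets


def adjust_rate(rate, resource_was_idle, any_bucket_empty, step=0.05):
    """Nudge the global token rate so the resource stays just busy."""
    if resource_was_idle and any_bucket_empty:
        return rate * (1 + step)  # guests were starved while the disk sat idle
    if not resource_was_idle:
        return rate * (1 - step)  # disk never went idle: back off slightly
    return rate
```

With two active guests and everyone else's bucket full, each call to distribute() hands the two of them half of the new tokens, which is the 50% split described above.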
Next issue: Do you “oversubscribe” the host memory? That is, does the sum of the UML memory sizes exceed the size of the host’s real application space? If so, the host swapspace is used, causing disk activity at this level. This is independent of the swap activity within each UML as the user exceeds its “real” space and begins to use its swap partition. I’m guessing that host-level swapping does not count against any UML’s bucket, but that UML-level swapping does. This would be the fair way to do this. However, host-level swapping will reduce the overall amount of IO resource that is available to the users. The algorithm above will account for this.
Next issue: Do we have fairness issues with network bandwidth? Do you intend to add a token system to address this?
Again: I’m a happy camper. These are purely theoretical questions for me.