Time sliced CFQ IO scheduler
Here's a copy of Jens's posting to LKML:
List: linux-kernel
Subject: Time sliced CFQ io scheduler
From: Jens Axboe <axboe () suse ! de>
Date: 2004-12-02 13:04:57
Message-ID: <20041202130457.GC10458 () suse ! de>
Hi,
Some time ago I pondered modifying CFQ to do fairness based on slices of
disk time. It appeared to be a minor modification, but with some nice
bonus points:
- It scales nicely with CPU scheduler slices, making io priorities a
cinch to implement.
- It has the possibility to equal the anticipatory scheduler for
multiple processes competing for disk bandwidth
So I implemented it and ran some tests; the results are pretty
astonishing. A note on the test cases, read_files and write_files: they
either read or write a number of files, sequentially or randomly, with
each file driven by its own process. IO bypasses the page cache by using
O_DIRECT. Runtime is capped at 30 seconds for each test.
Each test case was run on deadline, as, and new cfq. Drive used was an
IDE drive (results similar for SCSI), fs used was ext2.
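The harness described above can be approximated with a short script. This is only a sketch, not Jens's actual read_files/write_files tool (which is not included in the post): one reader per file, reading in bs-sized chunks either sequentially or at random offsets, with the run time capped and throughput reported in KiB/sec. The real tool bypasses the page cache with O_DIRECT; that is omitted here for portability, so absolute numbers will not match.

```python
import os
import random
import tempfile
import time

def read_client(path, bs=4096, runtime=1.0, sequential=True):
    """Read one file in bs-sized chunks for up to `runtime` seconds.

    Returns throughput in KiB/sec. The original tool opens the file
    with O_DIRECT to bypass the page cache; that is omitted in this
    sketch so it runs on any filesystem.
    """
    size = os.path.getsize(path)
    nblocks = max(size // bs, 1)
    done = 0
    deadline = time.monotonic() + runtime
    with open(path, "rb") as f:
        block = 0
        while time.monotonic() < deadline:
            if sequential:
                off = (block % nblocks) * bs   # wrap around the file
                block += 1
            else:
                off = random.randrange(nblocks) * bs
            f.seek(off)
            done += len(f.read(bs))
    return done / 1024 / runtime

if __name__ == "__main__":
    # Tiny demo: two "clients", each with its own 1 MiB file.
    results = []
    for _ in range(2):
        with tempfile.NamedTemporaryFile(delete=False) as tf:
            tf.write(os.urandom(1 << 20))
            name = tf.name
        results.append(read_client(name, bs=4096, runtime=0.2))
        os.unlink(name)
    print("min=%.0f max=%.0f agg=%.0f KiB/sec"
          % (min(results), max(results), sum(results)))
```

In the real benchmark each client is a separate process so the io scheduler sees competing queues; a single-process loop like this only shows the measurement side.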
Scroll past results for the executive summary.
Case 1: read_files, sequential, bs=4k
-------------------------------------
Scheduler: deadline
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 19837 19837 19837 22msec
2 2116 2114 4230 22msec
4 361 360 1444 41msec
8 150 149 1201 111msec
Note: bandwidth quickly becomes seek bound as clients are added.
Scheduler: as
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 19480 19480 19480 30msec
2 9250 9189 18434 261msec
4 4513 4469 17970 488msec
8 2238 2157 17581 934msec
Note: as maintains good aggregate bandwidth as clients are added, while
still being fair between clients. Latency rises quickly.
Scheduler: cfq
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 19433 19433 19433 9msec
2 8686 8628 17312 90msec
4 4507 4471 17963 254msec
8 2181 2104 17134 578msec
Note: cfq performs close to as. Aggregate bandwidth doesn't suffer as
clients are added, inter-client latency and throughput are excellent,
and latency is half that of as.
Case 2: read_files, random, bs=64k
-------------------------------------
Scheduler: deadline
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 7042 7042 7042 20msec
2 3052 3051 6103 28msec
4 1560 1498 6124 101msec
8 802 581 5487 231msec
Scheduler: as
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 7041 7041 7041 18msec
2 4616 2298 6912 270msec
4 3190 928 6901 360msec
8 1524 645 6765 636msec
Note: Aggregate bandwidth remains good, but as has big problems with
inter-client fairness.
Scheduler: cfq
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 7027 7027 7027 19msec
2 3429 3413 6841 107msec
4 1718 1700 6844 282msec
8 875 827 6795 627msec
Note: Aggregate bandwidth remains good and basically identical to as;
ditto for the latencies, where cfq is a little better. Inter-client
fairness is very good.
Case 3: write_files, sequential, bs=4k
-------------------------------------
Scheduler: deadline
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 13406 13406 13406 21msec
2 1553 1551 3104 171msec
4 690 689 2759 116msec
8 329 318 2604 106msec
Note: Aggregate bandwidth quickly drops with number of clients. Latency
is good.
Scheduler: as
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 13330 13330 13330 21msec
2 2694 2694 5388 77msec
4 1754 17 4988 762msec
8 638 342 3866 848msec
Note: Aggregate bandwidth is better than deadline, but still not very
good. Latency is not good, and inter-client fairness is horrible.
Scheduler: cfq
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 13267 13267 13267 30msec
2 6352 6150 12459 239msec
4 3230 2945 12524 289msec
8 1640 1640 12564 599msec
Note: Aggregate bandwidth remains high with added clients; ditto for
the latencies, where cfq is again better than as. Inter-client
fairness is very good.
Case 4: write_files, random, bs=4k
-------------------------------------
Scheduler: deadline
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 6749 6749 6749 112msec
2 1299 1277 2574 813msec
4 432 418 1715 227msec
8 291 247 2147 1723msec
Note: Same again for deadline - aggregate bandwidth really drops as
clients are added, but at least client fairness is good.
Scheduler: as
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 4110 4110 4110 114msec
2 815 809 1623 631msec
4 482 349 1760 606msec
8 476 111 2863 752msec
Note: Does generally worse than deadline and has fairness issues.
Scheduler: cfq
Clients Max bwidth Min bwidth Agg bwidth Max latency
1 4493 4493 4493 129msec
2 1710 1513 3216 321msec
4 521 482 2002 476msec
8 938 877 7210 927msec
Good results for such a quick hack, I'm generally surprised how well it
does without any tuning. The results above use the default settings for
cfq slices: 83ms slice time with allowed 4ms idle period (queues are
preempted if they exceed this idle time). With the disk time slices,
aggregate bandwidth stays close to real disk performance
even with many clients.
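The slice mechanism can be illustrated with a toy simulation. This is a sketch only: the 83 ms slice and 4 ms idle period come from the post, but the round-robin loop and cost model here are invented for illustration and are not the kernel's cfq code. Each queue owns the disk for its slice; if a queue goes idle longer than the allowed idle period, its slice is expired early and the disk moves on.

```python
from collections import deque

SLICE_MS = 83   # default slice time quoted in the post
IDLE_MS = 4     # allowed idle period before preemption

def run_slices(queues, total_ms, cost_ms=1):
    """Round-robin disk-time slices across queues.

    `queues` maps name -> think time in ms between requests (0 means
    always busy). Returns disk time consumed per queue. A queue that
    takes longer than IDLE_MS to submit its next request loses the
    rest of its slice. Purely illustrative.
    """
    service = {q: 0 for q in queues}
    order = deque(queues)
    clock = 0
    while clock < total_ms:
        q = order[0]
        order.rotate(-1)                  # next slice goes to the next queue
        slice_end = min(clock + SLICE_MS, total_ms)
        while clock < slice_end:
            think = queues[q]
            if think > IDLE_MS:           # queue idles too long: preempt
                clock += IDLE_MS
                break
            clock += think + cost_ms      # wait for request, then serve it
            service[q] += cost_ms
    return service

print(run_slices({"a": 0, "b": 0}, total_ms=1000))
```

With two always-busy queues the disk time splits almost evenly (within one slice), which is the fairness property the slices are meant to buy; a queue that dawdles past the idle limit gets preempted and receives essentially nothing.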
List: linux-kernel
Subject: Re: Time sliced CFQ io scheduler
From: Jens Axboe <axboe () suse ! de>
Date: 2004-12-02 13:48:02
Message-ID: <20041202134801.GE10458 () suse ! de>
Hi,
One more test case, while the box is booted... This just demonstrates a
process doing a file write (bs=64k) with a competing process doing a
file read (bs=64k) at the same time, again capped at 30sec.
deadline:
Reader: 2520KiB/sec (max_lat=45msec)
Writer: 1258KiB/sec (max_lat=85msec)
as:
Reader: 27985KiB/sec (max_lat=34msec)
Writer: 64KiB/sec (max_lat=1042msec)
cfq:
Reader: 12703KiB/sec (max_lat=108msec)
Writer: 9743KiB/sec (max_lat=89msec)
If you look at vmstat while running these tests, cfq and deadline give
equal bandwidth for the reader and writer all the time, while as
basically doesn't give anything to the writer (a single block per second
only). Nick, is the write batching broken or something?
--
Jens Axboe
It also appears that Jens has discovered a bug in the Anticipatory Scheduler (AS) in recent 2.6 kernels which basically provides zero throughput to writes when reads are happening:
On Thu, Dec 02 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > as:
> > Reader: 27985KiB/sec (max_lat=34msec)
> > Writer: 64KiB/sec (max_lat=1042msec)
> >
> > cfq:
> > Reader: 12703KiB/sec (max_lat=108msec)
> > Writer: 9743KiB/sec (max_lat=89msec)
> >
> > If you look at vmstat while running these tests, cfq and deadline give
> > equal bandwidth for the reader and writer all the time, while as
> > basically doesn't give anything to the writer (a single block per second
> > only). Nick, is the write batching broken or something?
>
> Looks like it. We used to do 2/3rds-read, 1/3rd-write in that testcase.
But 'as' has had no real changes in about 9 months time, it's really
strange. Twiddling with write expire and write batch expire settings
make no real difference. Upping the ante to 4 clients, two readers and
two writers work about the same: 27MiB/sec aggregate read bandwidth,
~100KiB/sec write.
At least something needs to be done about it. I don't know what kernel
this is a regression against, but at least it means that current 2.6
with its default io scheduler has basically zero write performance in
the presence of reads.
List: linux-kernel
Subject: [PATCH] Time sliced CFQ #2
From: Jens Axboe <axboe () suse ! de>
Date: 2004-12-04 10:49:21
Message-ID: <20041204104921.GC10449 () suse ! de>
Hi,
Second version of the time sliced CFQ. Changes:
- Sync io has a fixed time slice like before; async io has both a
time-based and a request-based slice limit. The queue slice is expired
when either of these limits is reached.
- Fix a bug in invoking the request handler on a plugged queue.
- Drop the ->alloc_limit wakeup stuff, I'm not so sure it's a good idea
and there are probably wakeup races buried there.
With the async rq slice limit, it behaves perfectly here for me with
readers competing with async writers. The main slice settings for a
queue are:
- slice_sync: How many msec a sync disk slice lasts
- slice_idle: How long a sync slice is allowed to idle
- slice_async: How many msec an async disk slice lasts
- slice_async_rq: How many requests an async disk slice lasts
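The expiry rule the four tunables describe can be sketched in a few lines. The tunable names are from the post; the function and the default values below are illustrative guesses, not the kernel's actual cfq code (the 83 ms sync default echoes the earlier post, the async defaults are invented). A sync queue's slice expires on time alone, while an async queue's slice expires when either its time budget or its request budget runs out.

```python
def slice_expired(is_sync, elapsed_ms, dispatched,
                  slice_sync=83, slice_async=25, slice_async_rq=16):
    """Decide whether the current queue's slice should be expired.

    Sync slices are bounded by time only; async slices by time *or*
    request count, whichever limit is hit first. Default values are
    illustrative, not the kernel's.
    """
    if is_sync:
        return elapsed_ms >= slice_sync
    return elapsed_ms >= slice_async or dispatched >= slice_async_rq

# A sync queue keeps its slice until its time budget is gone...
assert not slice_expired(True, elapsed_ms=50, dispatched=100)
# ...while an async queue can be expired by request count alone.
assert slice_expired(False, elapsed_ms=5, dispatched=16)
```

The request-count cap is what tames async writers: a writer can queue many cheap requests in little wall time, so bounding its slice by requests as well as time keeps it from starving sync readers.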
Interestingly, cfq is now about 10% faster on an fsck than deadline and
as:
AS:
bart:~ # time fsck.ext2 -fy /dev/hdc1
e2fsck 1.34 (25-Jul-2003)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/hdc1: 36/3753600 files (8.3% non-contiguous), 644713/7504552 blocks
real 0m30.594s
user 0m1.862s
sys 0m5.214s
DEADLINE:
bart:~ # time fsck.ext2 -fy /dev/hdc1
e2fsck 1.34 (25-Jul-2003)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/hdc1: 36/3753600 files (8.3% non-contiguous), 644713/7504552 blocks
real 0m30.475s
user 0m1.855s
sys 0m5.280s
CFQ:
bart:~ # time fsck.ext2 -fy /dev/hdc1
e2fsck 1.34 (25-Jul-2003)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/hdc1: 36/3753600 files (8.3% non-contiguous), 644713/7504552 blocks
real 0m27.921s
user 0m1.846s
sys 0m5.648s
So, I messed around with time-slice-CFQ today a bit, and I must say I'm very impressed.
Here's the current CFQ vmstat output while doing a streaming read and a streaming write:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 2 0 0 20700 14612 4006148 0 0 15232 7360 1716 0 0 2 98
0 2 0 0 20700 14624 4006136 0 0 18176 9024 1878 0 0 3 97
0 2 0 0 20732 14644 4006184 0 0 18816 9524 1911 0 0 3 97
1 2 0 0 20732 14652 4006176 0 0 13120 6488 1623 0 0 1 99
0 2 0 0 20732 14656 4006172 0 0 13056 6440 1635 0 0 2 98
And with time-slice-CFQ:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 1 0 0 46140 10960 4017224 0 0 55360 5576 1996 1930 0 3 97
0 2 0 0 46140 10960 4017224 0 0 51200 4480 1883 1748 0 2 98
0 2 0 0 46140 10960 4017224 0 0 55104 4864 1980 1886 0 2 98
0 2 0 0 46140 10960 4017224 0 0 58240 4992 2028 1990 0 2 98
Read performance was about 4x what the old CFQ could provide. I'm hoping this will help when a lot of Linodes are running updatedb via cron (you could help by removing it – hint hint!).
I'll continue testing, but this looks great!
-Chris
Second, I believe this is exactly what we were looking for back in this thread. And actually, skimming through the thread again, I completely forgot that you mentioned something about this earlier.