Time sliced CFQ IO scheduler

Linode Staff

Jens Axboe, kernel hacker extraordinaire and author of the CFQ disk I/O Scheduler, recently posted a patch to LKML about his new "time-slice" version of CFQ. It gives each process requesting IO a time slice on the disk, rather than just pulling a request out of each queue.

Here's a copy of Jens's posting to LKML:

List:       linux-kernel
Subject:    Time sliced CFQ io scheduler
From:       Jens Axboe <axboe () suse ! de>
Date:       2004-12-02 13:04:57
Message-ID: <20041202130457.GC10458 () suse ! de>

Hi,

Some time ago I pondered modifying CFQ to do fairness based on slices of
disk time. It appeared to be a minor modification, but with some nice
bonus points:

- It scales nicely with CPU scheduler slices, making io priorities a
  cinch to implement.

- It has the possibility to equal the anticipatory scheduler for
  multiple processes competing for disk bandwidth

So I implemented it and ran some tests, the results are pretty
astonishing. A note on the testcases - read_files and write_files. They
either read or write a number of files sequentially or randomly, each
file has io being done to it by a specific process. IO bypasses the page
cache by using O_DIRECT. Runtime is capped at 30 seconds for each test.
Each test case was run on deadline, as, and new cfq. Drive used was an
IDE drive (results similar for SCSI), fs used was ext2.

Scroll past results for the executive summary.

Case 1: read_files, sequential, bs=4k
-------------------------------------

Scheduler: deadline

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1        19837         19837         19837        22msec
   2         2116          2114          4230        22msec
   4          361           360          1444        41msec
   8          150           149          1201       111msec

Note: bandwidth quickly becomes seek bound as clients are added.

Scheduler: as

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1        19480         19480         19480        30msec
   2         9250          9189         18434       261msec
   4         4513          4469         17970       488msec
   8         2238          2157         17581       934msec

Note: as maintains good aggregate bandwidth as clients are added, while
still being fair between clients. Latency rises quickly.

Scheduler: cfq

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1        19433         19433         19433         9msec
   2         8686          8628         17312        90msec
   4         4507          4471         17963       254msec
   8         2181          2104         17134       578msec

Note: cfq performs close to as. Aggregate bandwidth doesn't suffer with
added clients, inter-client latency and throughput excellent. Latency
half that of as.

Case 2: read_files, random, bs=64k
-------------------------------------

Scheduler: deadline

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1         7042          7042          7042        20msec
   2         3052          3051          6103        28msec
   4         1560          1498          6124       101msec
   8          802           581          5487       231msec

Scheduler: as

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1         7041          7041          7041        18msec
   2         4616          2298          6912       270msec
   4         3190           928          6901       360msec
   8         1524           645          6765       636msec

Note: Aggregate bandwidth remains good, but as has big problems with
inter-client fairness.

Scheduler: cfq

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1         7027          7027          7027        19msec
   2         3429          3413          6841       107msec
   4         1718          1700          6844       282msec
   8          875           827          6795       627msec

Note: Aggregate bandwidth remains good and basically identical to as;
latencies are similar, with cfq a little better. Inter-client fairness
is very good.

Case 3: write_files, sequential, bs=4k
-------------------------------------

Scheduler: deadline

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1        13406         13406         13406        21msec
   2         1553          1551          3104       171msec
   4          690           689          2759       116msec
   8          329           318          2604       106msec

Note: Aggregate bandwidth quickly drops with number of clients. Latency
is good.

Scheduler: as

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1        13330         13330         13330        21msec
   2         2694          2694          5388        77msec
   4         1754            17          4988       762msec
   8          638           342          3866       848msec

Note: Aggregate bandwidth better than deadline, but still not very good.
Latency is not good, and inter-client fairness is horrible.

Scheduler: cfq

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1        13267         13267         13267        30msec
   2         6352          6150         12459       239msec
   4         3230          2945         12524       289msec
   8         1640          1640         12564       599msec

Note: Aggregate bandwidth remains high with added clients, though
latency rises. Inter-client fairness is very good.

Case 4: write_files, random, bs=4k
-------------------------------------

Scheduler: deadline

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1         6749          6749          6749       112msec
   2         1299          1277          2574       813msec
   4          432           418          1715       227msec
   8          291           247          2147      1723msec

Note: Same again for deadline - aggregate bandwidth drops sharply as
clients are added, but at least client fairness is good.

Scheduler: as

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1         4110          4110          4110       114msec
   2          815           809          1623       631msec
   4          482           349          1760       606msec
   8          476           111          2863       752msec

Note: Does generally worse than deadline and has fairness issues.

Scheduler: cfq

Clients   Max bwidth   Min bwidth   Agg bwidth   Max latency
   1         4493          4493          4493       129msec
   2         1710          1513          3216       321msec
   4          521           482          2002       476msec
   8          938           877          7210       927msec

Good results for such a quick hack, I'm generally surprised how well it
does without any tuning. The results above use the default settings for
cfq slices: 83ms slice time with allowed 4ms idle period (queues are
preempted if they exceed this idle time). With the disk time slices,
aggregate performance bandwidth stays close to real disk performance
even with many clients.
List:       linux-kernel
Subject:    Re: Time sliced CFQ io scheduler
From:       Jens Axboe <axboe () suse ! de>
Date:       2004-12-02 13:48:02
Message-ID: <20041202134801.GE10458 () suse ! de>

Hi,

One more test case, while the box is booted... This just demonstrates a
process doing a file write (bs=64k) with a competing process doing a
file read (bs=64k) at the same time, again capped at 30sec.

deadline:
Reader:  2520KiB/sec (max_lat=45msec)
Writer:  1258KiB/sec (max_lat=85msec)

as:
Reader: 27985KiB/sec (max_lat=34msec)
Writer:    64KiB/sec (max_lat=1042msec)

cfq:
Reader: 12703KiB/sec (max_lat=108msec)
Writer:  9743KiB/sec (max_lat=89msec)

If you look at vmstat while running these tests, cfq and deadline give
equal bandwidth for the reader and writer all the time, while as
basically doesn't give anything to the writer (a single block per second
only). Nick, is the write batching broken or something?

-- 
Jens Axboe

It also appears that Jens has discovered a bug in the Anticipatory Scheduler (AS) in recent 2.6 kernels which basically provides zero throughput to writes when reads are happening:

On Thu, Dec 02 2004, Andrew Morton wrote:
> Jens Axboe <axboe@suse.de> wrote:
> >
> > as:
> > Reader: 27985KiB/sec (max_lat=34msec)
> > Writer:    64KiB/sec (max_lat=1042msec)
> > 
> > cfq:
> > Reader: 12703KiB/sec (max_lat=108msec)
> > Writer:  9743KiB/sec (max_lat=89msec)
> > 
> > If you look at vmstat while running these tests, cfq and deadline give
> > equal bandwidth for the reader and writer all the time, while as
> > basically doesn't give anything to the writer (a single block per second
> > only). Nick, is the write batching broken or something?
> 
> Looks like it.  We used to do 2/3rds-read, 1/3rd-write in that testcase.

But 'as' has had no real changes in about 9 months time, it's really
strange. Twiddling with write expire and write batch expire settings
make no real difference. Upping the ante to 4 clients, two readers and
two writers work about the same: 27MiB/sec aggregate read bandwidth,
~100KiB/sec write.

At least something needs to be done about it. I don't know what kernel
this is a regression against, but at least it means that current 2.6
with its default io scheduler has basically zero write performance in
the presence of reads.
List:       linux-kernel
Subject:    [PATCH] Time sliced CFQ #2
From:       Jens Axboe <axboe () suse ! de>
Date:       2004-12-04 10:49:21
Message-ID: <20041204104921.GC10449 () suse ! de>

Hi,

Second version of the time sliced CFQ. Changes:

- Sync io has a fixed time slice like before, async io has both a time
  based and a request based slice limit. The queue slice is expired when
  either of these limits is reached.

- Fix a bug in invoking the request handler on a plugged queue.

- Drop the ->alloc_limit wakeup stuff, I'm not so sure it's a good idea
  and there are probably wakeup races buried there.

With the async rq slice limit, it behaves perfectly here for me with
readers competing with async writers. The main slice settings for a
queue are:

- slice_sync: How many msec a sync disk slice lasts
- slice_idle: How long a sync slice is allowed to idle
- slice_async: How many msec an async disk slice lasts
- slice_async_rq: How many requests an async disk slice lasts

Interestingly, cfq is now about 10% faster on an fsck than deadline and
as:

AS:

bart:~ # time fsck.ext2 -fy /dev/hdc1
e2fsck 1.34 (25-Jul-2003)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/hdc1: 36/3753600 files (8.3% non-contiguous), 644713/7504552 blocks

real    0m30.594s
user    0m1.862s
sys     0m5.214s

DEADLINE:

bart:~ # time fsck.ext2 -fy /dev/hdc1
e2fsck 1.34 (25-Jul-2003)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/hdc1: 36/3753600 files (8.3% non-contiguous), 644713/7504552 blocks

real    0m30.475s
user    0m1.855s
sys     0m5.280s

CFQ:

bart:~ # time fsck.ext2 -fy /dev/hdc1
e2fsck 1.34 (25-Jul-2003)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/hdc1: 36/3753600 files (8.3% non-contiguous), 644713/7504552 blocks

real    0m27.921s
user    0m1.846s
sys     0m5.648s

So, I messed around with time-slice-CFQ today a bit, and I must say I'm very impressed.

Here's the current CFQ vmstat output while doing a streaming read and a streaming write:

   procs                      memory      swap          io     system      cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 0  2  0      0  20700  14612 4006148    0    0 15232  7360 1716     0  0  2 98
 0  2  0      0  20700  14624 4006136    0    0 18176  9024 1878     0  0  3 97
 0  2  0      0  20732  14644 4006184    0    0 18816  9524 1911     0  0  3 97
 1  2  0      0  20732  14652 4006176    0    0 13120  6488 1623     0  0  1 99
 0  2  0      0  20732  14656 4006172    0    0 13056  6440 1635     0  0  2 98

And with time-slice-CFQ:

   procs                      memory      swap          io     system      cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 1  1  0      0  46140  10960 4017224    0    0 55360  5576 1996  1930  0  3 97
 0  2  0      0  46140  10960 4017224    0    0 51200  4480 1883  1748  0  2 98
 0  2  0      0  46140  10960 4017224    0    0 55104  4864 1980  1886  0  2 98
 0  2  0      0  46140  10960 4017224    0    0 58240  4992 2028  1990  0  2 98

Read performance was about 4x what the old CFQ could provide. I'm hoping this will help when a lot of Linodes are running updatedb via cron (you could help by removing it – hint hint!).

I'll continue testing, but this looks great!

-Chris


First off, nice idea with the new forum.

Second, I believe this is exactly what we were looking for back in this thread:

http://www.linode.com/forums/viewtopic.php?t=1147

And actually, skimming through the thread again, I completely forgot you mentioned something about this earlier.
