ext3 journal modes

I've been reading about ext3's various journal modes this fine evening, and have some questions. Well, really only one question: Have I gone insane and misunderstood everything, or have Linode and the Linux fs folks gone completely insane?

What I'm hoping I'm misunderstanding is writeback mode, which my Linodes (Ubuntu Hardy and Lucid) apparently use by default. Aside from the fact that it seems rather likely to corrupt files in the face of a crash – which is no small concern itself -- it seems to have another interesting property in how it corrupts files: If a file grows right before a crash, its new size may be preserved, but the new contents may be lost. In which case when the fs is recovered, the new part of the file would now contain whatever those blocks happened to contain before. For example, a copy of /etc/shadow that was deleted a month ago. Now, is it just me, or is this a stunningly bad idea security-wise on a system that gives untrusted users access to any of its files? Such as a Linode that gets Fremonted 30 seconds after its owner installs a new WordPress theme*.
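To make the scenario concrete, here is a toy model in Python. It is a sketch of the failure described above, not of the real ext3 code. It assumes the worst case for writeback, where the crash lands after the metadata commit but before the data reaches the disk. The function name `append_then_crash`, the dict used as a "disk", and the block numbers are all made up for illustration.

```python
def append_then_crash(disk, file_blocks, new_block, new_data, mode):
    """Toy model: append one block to a file, then crash right after the
    metadata (block list / size) has been committed to the journal.

    disk        -- dict of block number -> bytes; freed blocks keep old contents
    file_blocks -- list of block numbers the file owned before the append
    Returns the file's contents as seen after journal recovery.
    """
    disk = dict(disk)  # work on a copy so the caller's "disk" is untouched
    if mode == "ordered":
        # ordered: the data block is forced to disk before the commit
        disk[new_block] = new_data
    elif mode != "writeback":
        raise ValueError(f"unknown mode: {mode}")
    # writeback (worst case assumed here): the data is still sitting in the
    # write buffers when the crash hits, so the block is never written
    recovered_blocks = file_blocks + [new_block]  # the metadata survived
    return b"".join(disk[b] for b in recovered_blocks)


# Block 7 is free but still holds data from a file deleted long ago.
disk = {3: b"hello ", 7: b"root:$6$stale-hash"}
after_writeback = append_then_crash(disk, [3], 7, b"world", "writeback")
after_ordered = append_then_crash(disk, [3], 7, b"world", "ordered")
```

In this model the writeback result ends with the stale bytes from block 7, while the ordered result ends with the data that was actually appended.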

So, am I misunderstanding, or is that completely insane?

Now, I have also learned that ordered mode sucks horribly in its own way: if a lot of data is sitting around in the write buffers, an fsync() can freeze I/O for a painfully long time – a dozen or two seconds. Still, that seems like a limited price, when the only safe way to run writeback is to check every file modified in the last $maximumtimelinuxwilleverbufferawriteever or, more simply, wipe the fs every time the system crashes…

Recommendations? Risk writeback? Suffer ordered? Seek psychiatric help, because I totally misread things?

* Linode uses battery-backed RAID, of course, but that would only prevent the power outage from seriously scribbling on the disk; it would not magically save data that was sitting in the kernel's write buffers but that it had not bothered to write out to the disk (meaning the BBU) yet, right?

Edit: Fix editing error

13 Replies

Well, I'd try to cut back a bit on the paranoia, for one thing ;)

I can't tell you if it works like this, but if I were designing it, I'd only update the block pointers after writing the new data, ensuring that the extra space effectively points nowhere until there's data there to point to.

Of course, this presupposes a concept of "after", and I don't believe there's any guarantee that disk writes occur in a linear, time-increasing fashion without intentionally forcing the writes to occur :-)

Anyway, from http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html, which quotes the CHANGES file for ext3:

> "mount -o data=writeback"
>
> Only journals metadata changes, and data updates are entirely left to the normal "sync" process. After a crash, files will may contain stale data blocks from old files: this mode is exactly equivalent to running ext2 with a very fast fsck on reboot.

So it sounds like it is no worse than ext2, but is no better (safety-wise) than it, either.

"man mount" states that "ordered" is the default, but from a check of /proc/mounts, I think you're on to something:

Ubuntu 10.04 desktop:
/dev/mapper/witte-root / ext4 rw,relatime,errors=remount-ro,barrier=1,data=ordered 0 0

Ubuntu 10.04 server, upgraded from 8.10 incrementally:
/dev/mapper/hennepin-root / ext3 rw,noatime,errors=remount-ro,data=ordered 0 0

Ubuntu 11.04 netbook:
/dev/disk/by-uuid/9f1f6d4f-ecf1-47f0-ac14-da83dcfbfe0d / ext4 rw,relatime,errors=remount-ro,barrier=1,data=ordered 0 0

Ubuntu 8.04 Linode:
/dev/root / ext3 rw,noatime,errors=remount-ro,barrier=0,data=writeback 0 0

Ubuntu 10.04 Linode (upgraded from 8.04):
/dev/root / ext3 rw,relatime,errors=remount-ro,barrier=0,data=writeback 0 0

Ubuntu 10.04 Rackspace Cloud Server:
/dev/sda1 / ext3 rw,noatime,errors=remount-ro,barrier=0,data=writeback 0 0
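
For anyone who wants to run the same check on their own boxes, a minimal Python sketch along these lines should pull the `data=` mode out of /proc/mounts-style lines. The function names are mine, and it assumes the usual six-field layout shown above. A result of `None` only means the line lists no `data=` option.

```python
def journal_mode(mounts_line):
    """Return the data= journal mode from one /proc/mounts line, or None."""
    fields = mounts_line.split()
    if len(fields) < 4:
        return None
    # the fourth field is the comma-separated mount option list
    for option in fields[3].split(","):
        if option.startswith("data="):
            return option[len("data="):]
    return None


def ext_journal_modes(mounts_text):
    """Map mount point -> journal mode for every ext3/ext4 entry."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] in ("ext3", "ext4"):
            modes[fields[1]] = journal_mode(line)
    return modes


# Two of the lines quoted above, plus a non-ext entry that should be skipped.
sample = (
    "/dev/root / ext3 rw,noatime,errors=remount-ro,barrier=0,data=writeback 0 0\n"
    "/dev/mapper/witte-root /home ext4 rw,relatime,barrier=1,data=ordered 0 0\n"
    "proc /proc proc rw,nosuid,nodev,noexec 0 0\n"
)
modes = ext_journal_modes(sample)
```

On a real system you would feed it `open("/proc/mounts").read()` instead of the sample string.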

HoopyCat, from my reading, the kernel's default was changed to writeback around 2.6.30. Your "man mount" probably predates that.

(This ignores the default that can be set on the fs by tune2fs, and, of course, you can override it with /etc/fstab or mount. [Or the kernel command line, at least for the root fs.])

(There's a .config option to change the default back to ordered, but Linode's kernels do not use it. [And neither do Rackspace's.])
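
Roughly, the layering described above can be sketched like this in Python. This is only my reading of the precedence (explicit mount option, then the tune2fs default stored on the fs, then the kernel's built-in default); the function and parameter names are invented for illustration and are not kernel code.

```python
def effective_journal_mode(mount_option=None, superblock_default=None,
                           kernel_defaults_to_ordered=False):
    """Pick the journal mode a mount would end up with, under the assumed
    precedence: mount option > tune2fs default > kernel default.

    mount_option               -- data= value from fstab, mount, or rootflags
    superblock_default         -- mode stored on the fs with tune2fs -o
    kernel_defaults_to_ordered -- whether CONFIG_EXT3_DEFAULTS_TO_ORDERED is set
    """
    valid = ("journal", "ordered", "writeback")
    for choice in (mount_option, superblock_default):
        if choice is not None:
            if choice not in valid:
                raise ValueError(f"unknown journal mode: {choice}")
            return choice
    return "ordered" if kernel_defaults_to_ordered else "writeback"


stock_kernel = effective_journal_mode()
after_tune2fs = effective_journal_mode(superblock_default="ordered")
explicit_mount = effective_journal_mode(mount_option="journal",
                                        superblock_default="ordered")
```

With nothing set anywhere and the config option off, this lands on writeback, which matches what the Linode and Rackspace mounts show.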

So I whipped out the kernel source. The entire "situation" spans three commits in 2009 and 2010. Here's what we have:

* April 2009 (torvalds): Configuration option CONFIG_EXT3_DEFAULTS_TO_ORDERED added, with no default set. Help text describes EXT3_MOUNT_ORDERED_DATA as an "unfortunate choice" and a "legacy default", and advises that the option not be set (i.e. the default should be writeback) and that users who "really want" ordered mode set it with tune2fs (bbae8bcc49)
* August 2009 (tytso): "(legacy option)" removed from the CONFIG_EXT3_DEFAULTS_TO_ORDERED prompt; help text rewritten to be more neutral, due to concerns about the strong bias in favor of writeback (6d41807614)
* July 2010 (Dave Chinner): default for CONFIG_EXT3_DEFAULTS_TO_ORDERED changes to "y", for data safety reasons, with a rather stern commit message. Interestingly, Chinner claims that "all major distros" are ensuring ext3 filesystems are using ordered mode, which seems to rely on a somewhat nonstandard definition of "major distro" (aa32a79638)

So, the option first appeared in 2.6.30 with a strong recommendation for, and an implicit default to, writeback. The help text was copyedited to be more neutral and mention the tradeoffs in 2.6.31, but it still implicitly defaulted to writeback until the config option was changed to default to "y" in 2.6.36.

Linode's paravirt kernel configuration likely traces its pedigree to ~2.6.31 or so. At that time, the default would have been to not set CONFIG_EXT3_DEFAULTS_TO_ORDERED, and this has probably carried forward to today, unbeknownst to anyone.

There does not appear to be a similar configuration option for ext4; its documentation states that ordered is the default.

I would strongly support setting CONFIG_EXT3_DEFAULTS_TO_ORDERED.

I just set up a node to test switching to ordered mode…

1.) As the docs warned, trying to use /etc/fstab to change the root fs's journal mode results in unhappiness:

EXT3-fs (xvda): error: cannot change data mode on remount. The filesystem is mounted in data=writeback mode and you try to remount it in data=ordered mode.
mount: / not mounted already, or bad option
mountall: mount / [1454] terminated with status 32
mountall: Filesystem could not be mounted: /
mountall: Skipping mounting / since Plymouth is not available
rm: cannot remove `/var/lib/urandom/random-seed': Read-only file system

:mrgreen:

2.) A quick 'tune2fs -o journal_data_ordered /dev/xvda' (typed from memory; could be wrong) worked perfectly. (I did it in Finnix while repairing /etc/fstab; I don't know if you can do it on a live, writable fs.) Edit: Yes, doing it on a live, writable fs works. I don't know if it's supposed to, though. I would hope tune2fs would be smart enough to bail if it was dangerous.

Edit: The other options are, of course, CONFIG_EXT3_DEFAULTS_TO_ORDERED and changing the kernel command line (rootflags=data=ordered). But that requires pv-grub or cooperation by Linode.

@mnordhoff:

> a Linode that gets Fremonted

Fremont is a verb now? :P

I'm more or less a newbie when it comes to filesystems, but the above discussion seems to suggest that Linode should update its kernels to default to ordered mode, if they haven't already done so.

@hybinet:

> Fremont is a verb now? :P

I'm trying to coin it.

@hybinet:

> I'm more or less a newbie when it comes to filesystems, but the above discussion seems to suggest that Linode should update its kernels to default to ordered mode, if they haven't already done so.

Well, that's certainly my opinion, at least. We'll see if they agree.

They have not already done so, by the way. (I know because I just did the tune2fs thing and rebooted half an hour ago.)

Edit: I filed a ticket about it. If you never see me again, that's why.

This thread piqued my curiosity, so I did some digging and tested the write performance of the various journal modes.

These were all done on the same Linode 512 in London, dumping 500 MB to disk 10 times for each test. Here are the results:

ext3 writeback ~5s

ext3 ordered ~5s

ext4 ordered ~4.4s

ext4 writeback ~4.5s

This should be taken with a pinch of salt, as these results are in no way scientific, but they do seem to indicate no performance degradation from using ordered. It also appears ext4 may be quicker. I asked support if they had any plans for supporting it; they will forward the suggestion to the developers, but cannot guarantee that they will support it, or when.

So I'd agree that setting the default mode to ordered would be a good idea.

@mnordhoff:

> 2.) A quick 'tune2fs -o journal_data_ordered /dev/xvda' (typed from memory; could be wrong) worked perfectly. (I did it in Finnix while repairing /etc/fstab; I don't know if you can do it on a live, writable fs.) Edit: Yes, doing it on a live, writable fs works. I don't know if it's supposed to, though. I would hope tune2fs would be smart enough to bail if it was dangerous.

I think in this case, tune2fs is simply setting the default filesystem options in the filesystem metadata, and likely not influencing the currently mounted behavior. That is, the ext3 driver reads and applies those filesystem options at mount-time. So you'd probably have to arrange to re-mount the live filesystem after changing the value with tune2fs to have it take effect.
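
As a rough way to see whether the stored default and the live mount disagree, something like the following Python sketch could compare the "Default mount options" field of `tune2fs -l` output against the mode currently shown in /proc/mounts. The helper names and the sample listing are hypothetical; only the field name and the three `journal_data*` keywords come from tune2fs itself.

```python
# tune2fs -o keyword -> the mode name used in "data=" mount options
MODE_BY_KEYWORD = {
    "journal_data": "journal",
    "journal_data_ordered": "ordered",
    "journal_data_writeback": "writeback",
}


def superblock_journal_mode(tune2fs_listing):
    """Find the journal mode stored in the 'Default mount options' field of
    `tune2fs -l` output. Returns None if the field names no journal mode."""
    for line in tune2fs_listing.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Default mount options":
            for word in value.split():
                if word in MODE_BY_KEYWORD:
                    return MODE_BY_KEYWORD[word]
    return None


def remount_needed(tune2fs_listing, mounted_mode):
    """True if the stored default differs from the mode the fs is mounted in."""
    wanted = superblock_journal_mode(tune2fs_listing)
    return wanted is not None and wanted != mounted_mode


# Hypothetical excerpt of `tune2fs -l /dev/xvda` output after the change.
listing = """Filesystem OS type:       Linux
Default mount options:    journal_data_ordered
Filesystem state:         clean
"""
stored = superblock_journal_mode(listing)
still_on_writeback = remount_needed(listing, "writeback")
```

If `remount_needed` comes back true, the new default is sitting in the metadata but has not been applied to the running mount yet.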

– David

obs,

That's not the sort of situation where you run into trouble with ext3's ordered mode. Where it gets ugly is when you're doing a lot of I/O and then something starts fsync()ing, because that blocks all(?) I/O until it finishes writing out all of the buffers.
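
A test closer to that case might look like the Python sketch below: pile up dirty data in one file without syncing it, then time an fsync() on a tiny, unrelated file. The function name and the sizes are arbitrary choices of mine, and the directory has to live on the filesystem you actually want to measure (a tmpfs /tmp will tell you nothing).

```python
import os
import tempfile
import time


def time_fsync_after_dirty_writes(directory, dirty_bytes, chunk=1 << 20):
    """Buffer dirty_bytes into one file without syncing, then time an
    fsync() of a tiny second file. Returns the fsync duration in seconds."""
    big_path = os.path.join(directory, "big.tmp")
    small_path = os.path.join(directory, "small.tmp")
    try:
        with open(big_path, "wb") as big, open(small_path, "wb") as small:
            remaining = dirty_bytes
            while remaining > 0:
                n = min(chunk, remaining)
                big.write(b"\0" * n)
                remaining -= n
            big.flush()  # hand the data to the kernel, but do not fsync it
            small.write(b"x")
            small.flush()
            start = time.monotonic()
            os.fsync(small.fileno())  # the call whose latency we care about
            return time.monotonic() - start
    finally:
        # clean up both scratch files whether or not the timing succeeded
        for path in (big_path, small_path):
            if os.path.exists(path):
                os.remove(path)


# Small demo run; use a much larger dirty_bytes for a meaningful number.
with tempfile.TemporaryDirectory() as tmp:
    elapsed = time_fsync_after_dirty_writes(tmp, dirty_bytes=4 << 20)
```

Comparing the result across data=ordered and data=writeback mounts, with a few hundred MB of dirty data, would be the interesting measurement.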

@db3l:

> I think in this case, tune2fs is simply setting the default filesystem options in the filesystem metadata, and likely not influencing the currently mounted behavior. That is, the ext3 driver reads and applies those filesystem options at mount-time. So you'd probably have to arrange to re-mount the live filesystem after changing the value with tune2fs to have it take effect.

Oh, certainly. What I was wondering was if tune2fs would let me modify the default mount options while the fs is mounted writable, and whether things would get horribly corrupted if it did.

tune2fs docs all have scary warnings about doing stuff to a writable fs, but doing this is very simple, and wasn't explicitly covered. All I can say is that I tried it and it worked, but that was on a test node doing zero I/O. I did not risk it on anything important.

I just did it on a couple busy nodes doing significantly non-zero IO, and then updated the maximum mount counts and intervals while I was in there.

No problems noted. ^A^@^C

^@^@^F^B^@^@^@^@^@^@^@^@^@^@^@F'^@^@^@Microsoft Office Word 97-2003 Document^@

^@^@^@MSWordDoc^@^P^@^ @^@Word.Document.8^@9q^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

@hoopycat:

> ^@F'^@^@^@Microsoft Office Word 97-2003 Document

RIP hoopycat, he seems to have been assassinated by a team of Microsoft ninjas while testing his new Linux box.

Also, something on this page is missing a "word-wrap: break-word" CSS directive.
