In the past few installments, we've taken a bit of a detour by looking at non-traditional filesystems
such as tmpfs and devfs. Now, it's time to get back to disk-based filesystems, and we do this by
taking a look at ext3. The ext3 filesystem, designed by Dr. Stephen Tweedie, is built on the
framework of the existing ext2 filesystem; in fact, ext3 is very similar to ext2 except for one small
(but important) difference -- it supports journaling. Yet even with this small addition, I think you'll
find that ext3 has several surprising and intriguing capabilities. In this article, I'll give you a
good understanding of how ext3 compares to the other journaling filesystems currently available. In
my next article, we'll get ext3 up and running.
Understanding Ext3
So, how does ext3 compare to ReiserFS? In previous articles, I explained how ReiserFS is well
suited to handling small files (under 4K), and in certain situations, ReiserFS' small file performance
is ten to fifteen times greater than that of ext2 and ext3. However, while ReiserFS has many
strengths, it also has weaknesses. In the current implementation of ReiserFS (version 3.6), certain
file access patterns can actually result in significantly worse performance than ext2 and ext3,
particularly when reading large mail directories. Also, ReiserFS doesn't have a good track record of
NFS compatibility and has poor sparse file performance. In contrast, ext3 is a very well-rounded
filesystem. It's a lot like ext2; it's not going to give you the blazingly fast small-file performance
that ReiserFS gives you, but it's not going to give you any unexpected performance or functionality
hiccups either.
One of the nice things about ext3 is that because it is based on the ext2 code, ext2 and ext3's on-disk
format is identical; this means that a cleanly unmounted ext3 filesystem can be remounted as an
ext2 filesystem with absolutely no problems. And that's not all. Thanks to the fact that ext2 and ext3
use identical metadata, it's possible to perform in-place ext2 to ext3 filesystem upgrades. Yes, you
read that right. By upgrading a few key system utilities, installing a modern 2.4 kernel and typing in
a single tune2fs command per filesystem, you can convert your existing ext2 servers into
journaling ext3 systems. You can even do this while your ext2 filesystems are mounted. The
transition is safe, reversible, and incredibly easy, and unlike a conversion to XFS, JFS, or ReiserFS,
you don't need to back up and recreate your filesystems from scratch. Now, for a moment, consider
the thousands of production ext2 servers in existence that are just minutes away from an ext3
upgrade; then, you'll have a good grasp of ext3's importance to the Linux community.
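As a rough sketch of the in-place upgrade described above, the helper below only prints the commands involved rather than running them, so it is safe to try without root. The -j option to tune2fs is the standard e2fsprogs switch for adding a journal; the function name, the device /dev/hda3, and the mount point /home are hypothetical placeholders, and your own setup will differ.

```shell
#!/bin/sh
# Print (do not run) the commands for an in-place ext2 -> ext3 upgrade.
# $1 = block device, $2 = mount point; both are placeholders here.
ext3_upgrade_plan() {
    dev=$1
    mnt=$2
    # Add a journal to the existing ext2 filesystem.
    echo "tune2fs -j $dev"
    # Remount so the kernel starts using the ext3 driver for it.
    echo "umount $mnt"
    echo "mount -t ext3 $dev $mnt"
}

ext3_upgrade_plan /dev/hda3 /home
```

To make the change permanent, you would also change the filesystem type for that entry in /etc/fstab from ext2 to ext3.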
If I had to describe ext3 in one word, I'd call it "comfortable". It's incredibly easy to ext3-enable an
existing ext2 system, and after you do, you're not going to run into any unexpected performance
quirks. And there's yet another way that ext3 excels in the comfort department; ext3 happens to be
one of the most reliable journaled filesystems available for Linux, as I explain below.
Ext3 reliability
In addition to being ext2-compatible, ext3 inherits other benefits by sharing ext2's metadata format.
For one, ext3 users gain access to a rock-solid fsck tool. You'll recall that one of the points of using
a journaling filesystem is to avoid the need for an exhaustive fsck in the first place; however, if you
do end up getting corrupt metadata, either from a flaky kernel, bad hard drive, or something else,
you'll greatly appreciate the fact that ext3 inherits ext2's fsck. In contrast, ReiserFS' fsck is in its
infancy, and fixing flaky metadata when it does show up can be a difficult and dangerous process.
Metadata-only journaling
Interestingly, ext3 handles journaling very differently than ReiserFS and other journaling
filesystems do. With ReiserFS, XFS, and JFS, the filesystem driver journals metadata, but makes
no provisions for journaling data. With metadata-only journaling, your filesystem metadata is going
to be rock solid, and you will probably never need to perform an exhaustive fsck. However,
unexpected reboots and system lock-ups can result in significant corruption of recently-modified
data. Ext3 uses a couple of innovative solutions to avoid these problems, which we'll look at in a
bit.
But first, it's important to understand exactly how metadata-only journaling could end up biting
you. As an example, let's say that you were modifying a file called /tmp/myfile.txt when the
machine unexpectedly locked up, forcing a reboot. If you were using a metadata-only journaling
filesystem such as ReiserFS, XFS or JFS, your filesystem metadata would be easily repaired, thanks
to the metadata journal, and you wouldn't need to sit through a laborious fsck.
However, there's the distinct possibility that when you load /tmp/myfile.txt into a text editor, your
file will not simply be missing recent changes, but will contain a good amount of garbage and may
even be completely unreadable. This isn't something that will always happen, but it could happen
and often does.
Here's why. Typical journaled filesystems like ReiserFS, XFS, and JFS take extra special care of
metadata, but don't pay too much attention to data. In our above example, the filesystem driver was
in the process of modifying several filesystem blocks. The filesystem driver updated the appropriate
metadata, but didn't have time to flush the data from its caches to the new blocks on disk. Thus,
when you loaded up /tmp/myfile.txt into a text editor, part or all of the file contained garbage --
blocks of data that didn't get initialized in time before the system locked up.
Conclusion
These days, a lot of people are trying to determine which Linux journaling filesystem is "best". In
truth, there is no one "right" filesystem for every application; each one has its own strengths. This is
one of the benefits from having so many next-generation Linux filesystems from which to choose.
So, instead of picking an arbitrary "best" filesystem and using it for every conceivable application,
it's far preferable to understand each filesystem's strengths and weaknesses so that you can make an
educated decision as to which one to use.
Ext3 has a number of strengths. It has been designed to be extremely easy to deploy. It's based on
the solid ext2 filesystem code and it inherits a great fsck tool. And ext3's journaling capabilities
have been specially designed to ensure the integrity of both metadata and data. All in all, ext3 is a
truly great filesystem, and a worthy successor to the now-venerable ext2 filesystem. Join me in my
next article, when we get ext3 up and running. Until then, you may want to check out the following
resources.
Resources
• Read Daniel's other articles in this series, where he describes:
• the benefits of journaling and ReiserFS (Part 1)
• setting up a ReiserFS system (Part 2)
• using the tmpfs virtual memory filesystem and bind mounts (Part 3)
• the benefits of devfs, the device management filesystem (Part 4)
• beginning the conversion to devfs (Part 5)
• completing the conversion to devfs using an init wrapper (Part 6)
• Find out more about using ext3 with 2.4 kernels at Andrew Morton's ext3 for 2.4 page.
Andrew Morton is the man responsible for porting ext3 to the 2.4 kernel, and provided
invaluable assistance in writing this article. If you can't wait until my next article, Andrew
has a very nice ext3 and 2.4 usage page that will show you how to get ext3 up and running
on your system in no time.
• To keep abreast of the latest ext3 developments, be sure to visit the ext3-users mailing list
archive. Of course, you can also subscribe.
I'm going to be honest. For this article, I was planning to show you how to get ext3 up and running
on your system. Although that's what I said I'd do, I'm not going to do it. Andrew Morton's
excellent "Using the ext3 filesystem in 2.4 kernels" page (see Resources later in this article) already
does a great job of explaining how to ext3-enable your system, so there's no need for me to repeat
all the basics here. Instead, I'm going to delve into some meatier ext3 topics, ones that I think you'll
find very useful. After you read this article, when you're ready to get ext3 up and running, head over
to Andrew's page.
2.4 kernel update
First, let's start with a 2.4 kernel update. I last discussed 2.4 kernel stability when I was covering
ReiserFS. Way back then, finding a stable 2.4 kernel was a challenge, and I recommended sticking
with the known and at that time bleeding-edge 2.4.4-ac9 kernel -- especially for anyone planning to
use the ReiserFS filesystem in a production environment. As you might guess, a lot has happened
since 2.4.4-ac9, and it's definitely time to start looking at newer kernels.
With kernel 2.4.10, the 2.4 series reached a new level of performance and scalability (something
that we've been anticipating for a long time). So, what happened to allow Linux 2.4 to finally grow
up? In an acronym, VM. Linus, recognizing that the 2.4 series wasn't performing spectacularly,
ripped out Linux's problematic VM code and replaced it with a lean and mean VM implementation
from Andrea Arcangeli. Andrea's new VM implementation (which first appeared in 2.4.10) sped up
the kernel noticeably and made the entire system more responsive. 2.4.10 was
definitely a major turning point in 2.4 Linux kernel development; up until then, things weren't
looking very good, and many of us were wondering why we weren't FreeBSD developers. We all
should thank Linus for his heroism in making such a major (but sorely needed) change in the 2.4
stable kernel series.
Since Andrea's new VM code needed a bit of time to be integrated seamlessly with the rest of the
kernel, use 2.4.13+. Even better, use 2.4.16+, since the rock-solid ext3 filesystem code was finally
integrated into the official Linus kernel starting with the 2.4.15-pre2 release. There's no reason to
avoid using a 2.4.16+ kernel, and it'll make your job of getting ext3 up and running that much easier.
If you do use a 2.4.16+ kernel, just remember that it's no longer necessary to apply the ext3 patch as
described on Andrew's page (see Resources). Linus already added it for you :)
You'll notice that I recommend using 2.4.16+ rather than 2.4.15+, and with good reason. With the
release of kernel 2.4.15-pre9, a really ugly filesystem corruption bug was introduced to the kernel.
It took until 2.4.16-pre1 for the problem to be identified and fixed, resulting in a span of kernels
(including 2.4.15) that should be avoided at all costs. Choosing a 2.4.16+ kernel allows you to avoid
this bad batch entirely.
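If you want to check a machine against this recommendation, something like the following sketch could help. The function name is made up for illustration, and it deliberately ignores suffixes such as -pre9 or -ac9. That means any 2.4.15-preN release is treated as 2.4.15 and rejected, which errs on the side of caution.

```shell
#!/bin/sh
# Succeed (exit status 0) if a kernel release string is 2.4.16 or newer.
# Suffixes such as -pre9 or -ac9 are ignored, which is a simplification.
kernel_at_least_2_4_16() {
    release=${1%%-*}          # "2.4.15-pre9" becomes "2.4.15"
    old_ifs=$IFS
    IFS=.
    set -- $release           # split on dots into $1 $2 $3
    IFS=$old_ifs
    major=${1:-0}
    minor=${2:-0}
    patch=${3:-0}
    if [ "$major" -gt 2 ]; then return 0; fi
    if [ "$major" -lt 2 ]; then return 1; fi
    if [ "$minor" -gt 4 ]; then return 0; fi
    if [ "$minor" -lt 4 ]; then return 1; fi
    [ "$patch" -ge 16 ]
}

if kernel_at_least_2_4_16 "$(uname -r)"; then
    echo "running kernel is 2.4.16 or newer"
else
    echo "running kernel predates 2.4.16"
fi
```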
Laptops...beware?
Ext3 has a stellar reputation for being a rock-solid filesystem, so I was surprised to learn that quite a
few laptop users were having filesystem corruption problems when they switched to ext3. In
general, it's tempting to react to these kinds of reports by avoiding ext3 entirely; however, after
asking around, I discovered that the disk corruption problems that people were experiencing had
nothing to do with ext3 itself, but were being caused by certain laptop hard drives.
The write cache
You may not know this, but most modern hard drives have something called a "write cache", used
by the hard drive to collect pending write operations. By putting pending writes into a cache, the
hard drive firmware can then reorder and group them so that they're written to disk in the fastest
possible way. The write cache is generally considered to be a very good thing (read Linus'
explanation and opinion of write caching in Resources).
Unfortunately, certain laptop hard drives now on the market have the dubious feature of ignoring
any official ATA request to flush their write cache to disk. This isn't a wonderful design feature,
although it has been allowed by the ATA spec up until recently. With these types of drives, there's
no way for the kernel to guarantee that a particular block has actually been recorded to the disk
platters. Although this sounds like a thorny problem, this particular issue by itself is probably not
the cause of the data corruption problems that people have been experiencing.
However, it gets worse. Some modern laptop hard drives have an even nastier habit of throwing
away their write cache whenever the system is rebooted or suspended. Obviously, if a hard drive
has both of these problems, it's going to regularly corrupt data, and there's nothing that Linux can
do to prevent it from doing so.
So, what's the solution? If you have a laptop, tread carefully. Back up all your important files before
making any major change to your filesystems. If you experience data corruption problems that seem
to fit the pattern of what I described above, particularly with ext3, then remember that it may be
your laptop hard drive that's at fault. In that case, you may want to contact your laptop manufacturer
and inquire about getting a replacement drive. Hopefully, in a few months' time, these flaky hard
drives will be pulled from the market and we'll never need to worry about this issue again.
Now that I've scared you out of your minds, let's take a look at ext3's various data journaling
options.
Rapid writing
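# Write to the test filesystem continuously. Each pass writes
# 16384 * 131072 bytes = 2 GiB of zeros to largefile.
while true
do
    dd if=/dev/zero of=largefile bs=16384 count=131072
done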
While data was being written to the test filesystem, Andrew attempted to read 16 MB of data from
another ext2 filesystem on the same disk, timing the results.
The results were astounding. data=journal mode allowed the 16 MB file to be read from 9 to
over 13 times faster than other ext3 modes, ReiserFS, and even ext2 (which has no journaling
overhead):
Written-to filesystem    16 MB read time (seconds)
ext2                     78
ReiserFS                 67
ext3 data=ordered        93
ext3 data=writeback      74
ext3 data=journal        7
Andrew repeated this test, but tried to read a 16 MB file from the test filesystem (rather than a
different filesystem), and he got identical results. So, what does this mean? Somehow, ext3's
data=journal mode is incredibly well-suited to situations where data needs to be read from and
written to disk at the same time. Therefore, ext3's data=journal mode, which was assumed to
be the slowest of all ext3 modes in nearly all conditions, actually turns out to have a major
performance advantage in busy environments where interactive IO performance needs to be
maximized. Maybe data=journal mode isn't so sluggish after all!
Andrew is still trying to figure out exactly why data=journal mode is doing so much better
than everything else. When he does, he may be able to add the necessary tweaks to the rest of ext3
so that data=writeback and data=ordered modes see some benefit as well.
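For reference, the journaling mode is chosen at mount time through ext3's data= mount option. The sketch below only prints a mount command rather than running it. The function name, the device /dev/hda3, and the mount point /home are hypothetical placeholders, and the three accepted modes are the ones named in this article.

```shell
#!/bin/sh
# Print (do not run) a mount command that selects an ext3 journaling mode.
# $1 = mode (ordered, writeback, or journal), $2 = device, $3 = mount point.
ext3_mount_cmd() {
    mode=$1
    dev=$2
    mnt=$3
    case "$mode" in
        ordered|writeback|journal)
            echo "mount -t ext3 -o data=$mode $dev $mnt"
            ;;
        *)
            echo "unknown data mode: $mode" >&2
            return 1
            ;;
    esac
}

ext3_mount_cmd journal /dev/hda3 /home
```

The same data=journal setting can go in the options field of the filesystem's /etc/fstab entry if you want it applied at every boot.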
data=journal tweaks
Some people have had a particular performance problem when using ext3's data=journal mode on
busy servers -- busy NFS servers, in particular. Every thirty seconds, the server experiences a huge
storm of disk-writing activity, causing the system to nearly grind to a halt. If you experience this
problem, it's easy to fix. Simply type the following command as root to tweak Linux's dirty buffer-
flushing algorithm:
Tweaking bdflush
These new bdflush settings will cause kupdate to run every 0.6 seconds rather than every 5 seconds.
In addition, they tell the kernel to flush a dirty buffer after 3 seconds rather than 30, the default. By
flushing recently-modified data to disk more regularly, these write storms can be avoided. It's
slightly less efficient to do things this way, since the kernel will have fewer opportunities to
combine writes. But for a busy server, writes will happen more consistently, and interactive
performance will be greatly improved.
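To relate the intervals above to kernel tunables, here is a small worked conversion. It assumes that the relevant bdflush fields are expressed in jiffies (clock ticks) and that HZ is 100, as on typical x86 2.4 kernels; both are assumptions you should check against your own kernel's documentation. The function name is made up for illustration.

```shell
#!/bin/sh
# Convert tenths of a second to jiffies.
# HZ=100 is an assumption (typical for x86 2.4 kernels), not a given.
HZ=100

tenths_to_jiffies() {
    echo $(( $1 * HZ / 10 ))
}

echo "kupdate interval (0.6 s): $(tenths_to_jiffies 6) jiffies"
echo "dirty buffer age (3 s):   $(tenths_to_jiffies 30) jiffies"
```

Under these assumptions, 0.6 seconds works out to 60 jiffies and 3 seconds to 300 jiffies, compared with 500 and 3000 for the 5-second and 30-second defaults.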
Conclusion
We've now concluded our coverage of ext3. Join me in my next article as we explore the many
wonders of...XFS!