Ubuntu copying directory strange slowness
The disk is 20GB, ext3, running on Ubuntu 10.04. I have a folder called
/srv/www/server.com/html/shared
The total size of the /shared directory is around 8GB, most of it in
/srv/www/server.com/html/shared/data
I mention the data directory because it holds 8GB on this server but is not present on the servers that don't have this problem.
The 50MB folder I'm trying to copy is also in shared:
/srv/www/server.com/html/shared/cached-copy
The following command takes about 55 sec:
cp -RPp /srv/www/server.com/html/shared/cached-copy ~/testing1
That's the exact same command that finishes in a second on other servers; the only real difference is that on the other servers the 8GB data directory is not there.
But then, oddly, the following will complete in a second or so:
cp -RPp ~/testing1 ~/testing2
This is reproducible, in any order, so I don't think it's related to file caching.
Any ideas what might be happening, or how I could debug this?
Thanks,
Greg
8 Replies
@Vance:
Possibly /srv and /home are on different partitions on the slow system? Also, there might be different filesystem options. What does mount tell you for each system?
I think it's all one partition.
On slow system:
~$ mount
/dev/xvda on / type ext3 (rw,noatime,errors=remount-ro)
proc on /proc type proc (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,noexec,nosuid,nodev)
none on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
devtmpfs on /dev type devtmpfs (rw,mode=0755)
none on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
none on /dev/shm type tmpfs (rw,nosuid,nodev)
none on /var/run type tmpfs (rw,nosuid,mode=0755)
none on /var/lock type tmpfs (rw,noexec,nosuid,nodev)
none on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
The output from 'mount' is identical on the other systems…
CONFIG_JBD_DEBUG
But re-reading your original post, 55 seconds to copy 50 MB is a really long time. It might be worth putting in a support ticket; this could be a hardware problem.
Or perhaps one of the parent directories?
ls -ld /srv/www/server.com/html/shared /srv/www/server.com/html/shared/cached-copy
would show the size of the directories themselves. Edit: but I may be totally out to lunch anyway.
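For anyone who wants to try the directory-size check suggested above, here is a minimal sketch. It assumes GNU coreutils `stat` (as shipped on Ubuntu), and the function name `dir_entry_sizes` is made up for illustration. On ext3 a directory's own size grows as entries are added and, as far as I know, is not shrunk again when they are deleted, so a directory that reports megabytes while holding few files is a hint that it once held many.

```shell
# Print the size in bytes of each directory itself, not of its contents.
# dir_entry_sizes is a hypothetical helper name; stat -c is GNU coreutils.
dir_entry_sizes() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            # %s = size of the directory file, %n = its name
            stat -c '%s %n' "$d"
        else
            echo "not a directory: $d" >&2
        fi
    done
}
```

It could be called as `dir_entry_sizes /srv/www/server.com/html/shared /srv/www/server.com/html/shared/cached-copy` and the numbers compared between the slow and fast servers.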
@Vance:
But re-reading your original post, 55 seconds to copy 50 MB is a really long time. It might be worth putting in a support ticket; this could be a hardware problem.
That's what I originally thought…but the Linode folks say all is well on the host.
@mnordhoff:
Do you folks think it could be something like… The directory used to have an obscene number of files in it, so its inode is really big, and it takes an excessively long time to work with? I don't think so -- it shouldn't take that long, and a modern ext3 fs should have dir_index enabled, which I think would handle it reasonably well -- but I'm not sure, and nobody else has any ideas…
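Whether dir_index is actually enabled can be checked from the "Filesystem features" line that `tune2fs -l` prints. The sketch below only parses that output; `has_dir_index` is a made-up helper name, and running `tune2fs` itself needs root and the right device, which is assumed here to be the /dev/xvda shown by mount.

```shell
# Succeed if dir_index appears in the "Filesystem features" line of
# `tune2fs -l` output read from stdin. has_dir_index is a hypothetical helper.
has_dir_index() {
    grep '^Filesystem features:' | grep -qw 'dir_index'
}

# Example (needs root; the device name is an assumption taken from mount):
#   tune2fs -l /dev/xvda | has_dir_index && echo "dir_index enabled"
```

If the feature is on but a directory is still slow, `e2fsck -fD` on an unmounted filesystem can rebuild and compact directories, though that means taking the node offline.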
Well, there was certainly some activity with a lot of files. For example, the data directory I was talking about in the OP had about 150K files in one directory; I later split this into 27 separate directories (e.g. data/A, data/B, etc.) and moved the files into their appropriate places.
I'm experimenting with a few things now - restoring a new node from this node's backup, for example, to see if it suffers from the same problem.
I'm secretly hoping rebooting this node might fix it…waiting for a slow time to take a maintenance window of a few minutes.
~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/xvda 4580736 417217 4163519 10% /
devtmpfs 63626 2036 61590 4% /dev
none 63678 1 63677 1% /dev/shm
none 63678 26 63652 1% /var/run
none 63678 2 63676 1% /var/lock
none 63678 1 63677 1% /lib/init/rw
and on the new node:
~$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/xvda 1286752 408301 878451 32% /
devtmpfs 63626 2036 61590 4% /dev
none 63678 1 63677 1% /dev/shm
none 63678 20 63658 1% /var/run
none 63678 2 63676 1% /var/lock
none 63678 1 63677 1% /lib/init/rw
Note the different number of inodes. Not sure why that would make a difference, but it's something I noticed.
Next step is to reboot the slow node and see if that changes anything. If not, I might just rsync the incremental changes over to this new node and use that…
Ran a few experiments, and I ended up doing the following:
1. make a copy of shared/cached-copy directory as shared/copy1
2. rm -rf shared/cached-copy
3. mv shared/copy1 shared/cached-copy
After running a few random copies of other folders to make sure the disk cache was cleared, I find that the new cached-copy directory behaves much better than the old one, copying in seconds.
Not sure why this helped, but at least for the moment it looks to be working better. Hopefully the problem won't come back…
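The three steps above could be wrapped up roughly like this. It is only a sketch: `rebuild_dir` is a made-up name, it assumes nothing is writing to the directory while it runs, and it reuses the `cp -RPp` flags from earlier in the thread. One possible explanation for the improvement (a guess, not confirmed in this thread) is that the copy gets freshly allocated directory blocks rather than ones left oversized or fragmented by earlier activity.

```shell
# Recreate a directory by copying it, removing the original, and renaming the
# copy into place. rebuild_dir is a hypothetical helper, not a standard tool.
rebuild_dir() {
    src=${1%/}               # strip one trailing slash, if any
    tmp="${src}.rebuild.$$"  # temporary sibling on the same filesystem

    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    [ ! -e "$tmp" ] || { echo "temp path exists: $tmp" >&2; return 1; }

    # -R recursive, -P do not follow symlinks, -p preserve mode/owner/times
    cp -RPp "$src" "$tmp" || return 1
    rm -rf "$src" || return 1
    mv "$tmp" "$src"
}
```

For the case in this thread that would be `rebuild_dir /srv/www/server.com/html/shared/cached-copy`. The original is only removed after the copy succeeds, so a failed `cp` leaves it untouched.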