CentOS4 installations: rpm database corruption
Please refer back to this post (third post in thread) for when I first came across the problem:
http://www.linode.com/forums/viewtopic.php?t=1878
Summary: Linode installations of CentOS4 that are subsequently updated seem to end up with an rpm database containing duplicate "ghost" packages that aren't really installed. I tested with yum, but others have run into the same problem with up2date (which is pretty much deprecated in CentOS).
Symptoms of problems during first updates after installation (if you chose a 2.4 kernel): after your first yum update but before a reboot, you'll see shell warnings like:
-bash: child setpgid (1176 to 1176): No such process
If you subsequently run yum, you may see many repeats of:
sem_post: Invalid argument
Both these problems seem to relate to the glibc update and kernel 2.4. They disappear after a reboot, but the underlying problem doesn't and is also present in my testing for 2.6 kernels: duplicate packages in the rpm database after updating. Another unpredictable consequence of the problem is that yum update sometimes hangs during the same first reboot after glibc has been upgraded: neither ^C nor ^\ will stop it and it requires a kill -9. rpm --rebuilddb and rm /var/lib/rpm/__db* don't fix the underlying problem.
Implementing the nptl fix:
mv /lib/tls /lib/tls-disabled
is necessary for other reasons but doesn't fix this problem either: even if implemented before the yum update, the bash "setpgid" problem occurs for 2.4 kernels, at least for me. In any case, it seems to be irrelevant to the underlying duplicate rpm problem.
The key test to see if you have ghost entries in the rpm database is:
rpm -qa | sed 's/\([A-Za-z0-9]\)-[0-9].*$/\1/' | sort | uniq -d
- if this comes out with any more than a few packages (like kernel* if you didn't remove them, gpg-pubkey and kbd which are all understandable and ok) then you have the same problem. In my case I have 77 duplicate packages - examination with rpm -V shows that the older packages aren't really installed.
Having done a fair bit of testing, the solution for new installations seems to be:
1. Install CentOS4
2. Boot with the latest 2.6 kernel
3. mv /lib/tls /lib/tls-disabled
4. yum update glibc
5. Reboot!
6. yum update
This seems to fix the problem on my test linode, and for a willing victim who tried the same.
The questions that remain:
A. What caused this? I suspect it's a feature of the initial CentOS4 image being produced via a yum(?) update from CentOS3, which isn't strictly supported, so caker will be able to address this.
B. Most importantly, how do we patch up our existing CentOS4 installations to remove these ghost entries in the rpm db? I'm not comfortable with them there at all: no current problems but their very presence indicates an inconsistent database which can only be bad news.
I've successfully used rpm -e --justdb to remove an inconsequential ghost duplicate package as described in the above post, and testing the remaining package with rpm -V indicates it's fine: but I'm reluctant to do this with packages like glibc, and automating the removal of many duplicate packages may be tricky. I've thought of sorting the output from the rpm command above and running an rpm -e --justdb on alternate (lower versioned) packages, but sorting (with or without -n) is unpredictable due to rpm naming conventions. The only solution, if rpm -e --justdb is reliable, would seem to be to go through the output of the above rpm command manually, keep only the older packages in the list, and turn it into a script to remove them with --justdb. Any other ideas?
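One possible way around the sorting problem, as a rough sketch only: rather than comparing version strings, ask rpm for each entry's install time and treat everything except the most recently installed entry of each name as a removal candidate. This assumes the ghost entry is always the one with the older install time, which I haven't verified. The function name older_dupes is made up for illustration, and it only prints candidates, it removes nothing. On a multilib system the same name can legitimately appear once per architecture, so this would need adjusting there.

```shell
# Reads lines of "NAME INSTALLTIME NAME-VERSION-RELEASE" on stdin and prints
# the NAME-VERSION-RELEASE of every entry except the most recently installed
# one for each NAME. ASSUMPTION: the older install time marks the ghost entry.
older_dupes() {
    # Group by name, then order each group by install time, oldest first.
    LC_ALL=C sort -k1,1 -k2,2n | awk '
        $1 == prev_name { print prev_nvr }   # an older entry of the same name
        { prev_name = $1; prev_nvr = $3 }
    '
}

# Usage on a real system (requires rpm); kernel and gpg-pubkey are skipped
# because multiple copies of those are expected:
#   rpm -qa --queryformat '%{NAME} %{INSTALLTIME} %{NAME}-%{VERSION}-%{RELEASE}\n' \
#     | grep -v -e '^kernel' -e '^gpg-pubkey' | older_dupes > bads
```

The resulting list would still need checking by hand with rpm -V before anything is passed to rpm -e --justdb.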
I've done a lot of work on my linode so can only test this if I could have a duplicate image of my / and /home on another linode to stick around for some time after cleanup in case - possible, caker? If so, I'd like to time when the snapshot's taken carefully before doing the rpm removals…
I'd be very interested if other CentOS4 users could share their experiences of this: and in particular run the above rpm command and report if they get many duplicates.
2 Replies
I'll log what I've found out and done since the post above. I've found the problem is often reproducible on a spare node with kernel 2.4, 2.6 and with or without the /lib/tls nptl fix. But not always. So there's some hidden variable at the linode level - caker's suggested yum may be running out of iotokens - although it didn't seem to do so (watch 'cat /proc/iostatus' while yum update-ing). I'm also going to try a fresh CentOS4 image he's preparing direct from ISOs.
Using the rpm command above I've analyzed the duplicated rpms, and they're always subsets of each other on different builds, which leads me to suspect this is due to a hung/dead yum process happening at a different time for each. Unusual though: I've not heard or read of any recent reports of bad yum/rpm processes causing ghost rpm entries in the database.
So anyway, I've done exhaustive checking that I'm not removing any rpm that really exists, by sorting, using comm on sorted lists of "good" and "bad" rpms from the above command, and then checking every rpm that's "good" and "bad" semi-automatically with rpm -V, e.g.:
for rpm in `cat bads | grep -v kbd`; do rpm -V $rpm > /dev/null; echo $rpm: $?; done
- they should all come out as status 1. Of course acknowledging some rpm components will have changed since installation and so some "good" ones will also have changed, but worth checking you have the right "bad" dupe manually for those.
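For the comm step, a minimal sketch of how the "good" list can be derived from the full list of duplicated entries and the "bad" list. The function name good_entries and the file names all-dupes and goods are placeholders for illustration; bads is the file used in the loop above. comm only works on sorted input, so the function sorts copies of both files first.

```shell
# Print entries present in the full list ($1) but not in the "bad" list ($2).
# comm needs both inputs sorted the same way, so sorted copies are made first.
good_entries() {
    all_sorted=$(mktemp) || return 1
    bad_sorted=$(mktemp) || return 1
    LC_ALL=C sort "$1" > "$all_sorted"
    LC_ALL=C sort "$2" > "$bad_sorted"
    # -2 drops lines unique to the bad list, -3 drops lines common to both.
    LC_ALL=C comm -23 "$all_sorted" "$bad_sorted"
    rm -f "$all_sorted" "$bad_sorted"
}

# Usage (hypothetical file names):
#   good_entries all-dupes bads > goods
```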
Having done all that three times(!), I decided to rpm -e --justdb the older versions. Rebooted and so far so good.
Please note that the rpm expression above misses libstdc++ - I couldn't be bothered to work out a sed expression that coped with both:
compat-libstdc++-
libstdc++-
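A possible alternative that sidesteps the sed expression altogether, offered as a sketch rather than something I ran on the affected node: have rpm print only the package name via --queryformat, then look for repeated names. Since no version stripping is involved, names containing "+" such as libstdc++ should be handled like any other. The helper name dupe_names is made up for illustration.

```shell
# Print each line that occurs more than once on stdin
# (expects one package name per line).
dupe_names() {
    LC_ALL=C sort | uniq -d
}

# Usage on a real system (requires rpm):
#   rpm -qa --queryformat '%{NAME}\n' | dupe_names
```

As with the original test, kernel and gpg-pubkey may legitimately show up in the output.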
It seems that interrupting yum is the cause of the problem.
-chris