Upgrade problems

I am noticing some strange behavior since upgrading from host9 to host20. I went from a linode64 to a linode128. Everything ran pretty well on my old machine, but I have had a couple of strange things happen since. I have not been able to pin down the problem, but I thought I would post a question to see if you guys have any ideas. Here are some things I have noticed.

  • My MySQL server processes keep disappearing

  • I have had emerge segfault a couple of times when I tried to run it (other times it runs fine).

  • While compiling a program I noticed that a mv command failed. The error said out of memory.

I have not changed anything on the software side, so I can only assume it is because of the move (but I could be wrong). My only thought is perhaps a change in CPU. I don't know much about UML, but I have the following compile flags in my Gentoo make.conf settings:

CFLAGS="-march=pentium3 -O3 -pipe"

I had set my arch as pentium3 because I thought I had read somewhere that the CPU on these machines was an Intel. It seemed to work well so I went with it. So my only guess is that the arch has changed. Any other ideas on why changing machines will cause the instability?

20 Replies

@eman:

  • My MySQL server processes keep disappearing
    Try adding "set-variable=thread_cache_size=40" under the [mysqld] section of your mysql config file. This was reported in this thread:

http://www.linode.com/forums/viewtopic.php?p=4810#4810

I had to do this on a (real) FC2 box. MySQL kept creating threads. Might be worth a shot.
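As a rough sketch (not something taken from the linked thread), a few lines of Python can confirm that the setting actually landed under [mysqld] and not under some other section. MySQL spells the variable thread_cache_size. The function name and the sample config text are made up for illustration, and the parser only understands the simple "name=value" and "set-variable=name=value" forms.

```python
def find_thread_cache_size(cnf_text):
    """Return the thread_cache_size set under [mysqld], or None if absent."""
    section = None
    value = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section != "mysqld":
            continue
        # Older configs wrap the setting as "set-variable = name=value".
        if line.startswith("set-variable"):
            _, _, line = line.partition("=")
            line = line.strip()
        name, sep, val = line.partition("=")
        # Dashes and underscores are interchangeable in MySQL option names.
        if sep and name.strip().replace("-", "_") == "thread_cache_size":
            value = int(val.strip())
    return value


# Hypothetical config text, only here to exercise the function.
SAMPLE_CNF = """\
[client]
port=3306

[mysqld]
set-variable=thread_cache_size=40
"""

print(find_thread_cache_size(SAMPLE_CNF))  # prints 40
```

As far as I know, newer MySQL releases also accept the plain thread_cache_size=40 form, which is why the sketch checks for both spellings.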

-Chris

@caker:

@eman:

  • My MySQL server processes keep disappearing
    Try adding "set-variable=thread_cache_size=40" under the [mysqld]

I have made the suggested update and it seems to be running much better. Time will tell if it fixes my problem, but I am still getting odd problems. They seem to be memory related. For example, I was trying to compile tetex on my machine. I did the standard emerge -v tetex. It got most of the way through and gave me this output:

/bin/install -c -m 644 tex.pool /var/tmp/portage/tetex-2.0.2-r3/image//usr/share/texmf/web2c/tex.pool
/bin/install: memory exhausted
make[2]: *** [install-data] Error 1
make[2]: Leaving directory `/var/tmp/portage/tetex-2.0.2-r3/work/tetex-src-2.0.2/texk/web2c'
make[1]: *** [install] Error 1
make[1]: Leaving directory `/var/tmp/portage/tetex-2.0.2-r3/work/tetex-src-2.0.2/texk'
make: *** [install] Error 1

!!! ERROR: app-text/tetex-2.0.2-r3 failed.
!!! Function einstall, Line 388, Exitcode 2
!!! einstall failed

You can see where the install command gets a memory exhausted error. But I should not be running out of memory. I have just doubled it and I have a 256MB swap partition. Here is the top part of the output of top:

top - 13:26:38 up  3:45,  1 user,  load average: 0.22, 0.25, 0.48
Tasks: 130 total,   1 running, 129 sleeping,   0 stopped,   0 zombie
Cpu(s):   0.0% user,   1.6% system,   0.0% nice,  98.4% idle
Mem:    123692k total,   114792k used,     8900k free,    25304k buffers
Swap:   263160k total,    74712k used,   188448k free,    18636k cached

Any ideas?

Eric

@eman:

I have made the suggested update and it seems to be running much better. Time will tell if it fixes my problem, but I am still getting odd problems. They seem to be memory related. For example, I was trying to compile tetex on my machine. I did the standard emerge -v tetex.

Ran the emerge a couple more times and am still getting the same problems. Here is the output of the latest run towards the end:

gcc -DHAVE_CONFIG_H  -I. -I. -I.. -I./.. -DKPATHSEA  -march=pentium3 -O3 -pipe -fno-stack-protector  -c -DLJ4 ./dvi2xx.c
mv dvi2xx.o dvilj4.o
mv: memory exhausted
make[2]: *** [dvilj4.o] Error 1
make[2]: Leaving directory `/var/tmp/portage/tetex-2.0.2-r3/work/tetex-src-2.0.2/texk/dviljk'
make[1]: *** [all] Error 1
make[1]: Leaving directory `/var/tmp/portage/tetex-2.0.2-r3/work/tetex-src-2.0.2/texk'
make: *** [all] Error 1

!!! ERROR: app-text/tetex-2.0.2-r3 failed.
!!! Function tetex_src_compile, Line 133, Exitcode 2
!!! make teTeX failed

This is top at the time the error occurred:

top - 14:19:45 up  4:39,  2 users,  load average: 0.91, 1.19, 1.07
Tasks: 132 total,   1 running, 131 sleeping,   0 stopped,   0 zombie
Cpu(s):   0.0% user,   1.6% system,   0.0% nice,  98.4% idle
Mem:    123692k total,    81660k used,    42032k free,     3192k buffers
Swap:   263160k total,    87672k used,   175488k free,    43396k cached

As you can see, I have plenty of memory and plenty of swap free, so there has to be something else going on. Anyone with any ideas? This all started when I switched host machines as part of the upgrade. Before that everything worked just fine (only slower :)).
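To put numbers on that, here is a small sketch that adds up the figures from the top output above. Counting buffers and cache as reclaimable is only a rough approximation, and the function names are just illustrative.

```python
import re

# The Mem and Swap lines from the top output taken when mv failed.
TOP_SUMMARY = """\
Mem:    123692k total,    81660k used,    42032k free,     3192k buffers
Swap:   263160k total,    87672k used,   175488k free,    43396k cached
"""


def parse_top_summary(text):
    """Return {'Mem': {...}, 'Swap': {...}} with all values in kB."""
    result = {}
    for line in text.splitlines():
        prefix, _, rest = line.partition(":")
        if prefix not in ("Mem", "Swap"):
            continue
        result[prefix] = {
            label: int(value)
            for value, label in re.findall(r"(\d+)k\s+(\w+)", rest)
        }
    return result


def available_kb(summary):
    """Return (reclaimable RAM, reclaimable RAM plus free swap) in kB."""
    mem, swap = summary["Mem"], summary["Swap"]
    # This version of top prints the page cache figure on the Swap line.
    reclaimable = mem["free"] + mem["buffers"] + swap["cached"]
    return reclaimable, reclaimable + swap["free"]


print(available_kb(parse_top_summary(TOP_SUMMARY)))  # prints (88620, 264108)
```

So by this estimate roughly 88 MB of RAM and about 264 MB in total were still available when mv reported "memory exhausted", which is hard to square with a genuine shortage.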

Thanks,

Eric

I have been working on my problem some more. I cannot emerge most packages. When attempting to emerge something that uses Java, it sometimes locks up. I got the latest binaries from Sun, so I am sure my JDK is fine. I copied and pasted one command from the emerge of subversion, the one where it locked up. Sometimes the command ran fine. Other times it just locked up. Hoping to solve the problem, I ran it with strace to see what resource it might be accessing. My command is:

/opt/sun-jdk-1.4.2.06/bin/javah -force -verbose -classpath ../cls org.tigris.subversion.javahl.NodeKind

If the command runs fine I get the following output from strace:

bigsky native # strace /opt/sun-jdk-1.4.2.06/bin/javah -force -verbose -classpath ../cls org.tigris.subversion.javahl.NodeKind
execve("/opt/sun-jdk-1.4.2.06/bin/javah", ["/opt/sun-jdk-1.4.2.06/bin/javah", "-force", "-verbose", "-classpath", "../cls", "org.tigris.subversion.javahl.NodeKind"], ) = 0
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
[Search path = /opt/sun-jdk-1.4.2.06/jre/lib/rt.jar:/opt/sun-jdk-1.4.2.06/jre/lib/i18n.jar:/opt/sun-jdk-1.4.2.06/jre/lib/sunrsasign.jar:/opt/sun-jdk-1.4.2.06/jre/lib/jsse.jar:/opt/sun-jdk-1.4.2.06/jre/lib/jce.jar:/opt/sun-jdk-1.4.2.06/jre/lib/charsets.jar:/opt/sun-jdk-1.4.2.06/jre/classes:../cls]
[Loaded ../cls/org/tigris/subversion/javahl/NodeKind.class]
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
[Loaded /opt/sun-jdk-1.4.2.06/jre/lib/rt.jar(java/lang/Object.class)]
[Forcefully writing file org_tigris_subversion_javahl_NodeKind.h]
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRT_1 (Unknown signal 33) @ 0 (0) ---

The number of times SIGRTMIN is sent varies each time I run the command. On the other hand when it locks up I get the following:

bigsky native # strace /opt/sun-jdk-1.4.2.06/bin/javah -force -verbose -classpath ../cls org.tigris.subversion.javahl.NodeKind
execve("/opt/sun-jdk-1.4.2.06/bin/javah", ["/opt/sun-jdk-1.4.2.06/bin/javah", "-force", "-verbose", "-classpath", "../cls", "org.tigris.subversion.javahl.NodeKind"], ) = 0
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---

None of this tells me much, but at least I can reproduce one of the problems I am having. So I have two questions for you guys. First, what is the SIGRTMIN signal and why is it being sent so often? Second, what other tests can I run to get more information about my problems? And of course I still keep coming back to the question of why these problems started occurring when I upgraded and changed hosts.
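A hedged note on the first question: the Linux kernel numbers its real-time signals starting at 32, and glibc's thread libraries keep the first few of them for their own internal use (see signal(7)). strace labels signals by kernel number, so signal 32 shows up as "SIGRTMIN". A multithreaded program such as the JVM passing signal 32 between its threads is therefore probably normal rather than a fault in itself. To compare a good run against a hung one without counting lines by eye, something like this tallies the signals in saved strace output. The name count_signals is made up for illustration, and the excerpt is the tail of the successful trace above.

```python
import re
from collections import Counter

# Matches strace signal lines such as:
# --- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
SIGNAL_LINE = re.compile(r"^--- (\w+) \(")


def count_signals(strace_text):
    """Count how often each signal name appears in strace output."""
    counts = Counter()
    for line in strace_text.splitlines():
        match = SIGNAL_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


EXCERPT = """\
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
[Loaded /opt/sun-jdk-1.4.2.06/jre/lib/rt.jar(java/lang/Object.class)]
[Forcefully writing file org_tigris_subversion_javahl_NodeKind.h]
--- SIGRTMIN (Unknown signal 32) @ 0 (0) ---
--- SIGRT_1 (Unknown signal 33) @ 0 (0) ---
"""

print(count_signals(EXCERPT))  # prints Counter({'SIGRTMIN': 2, 'SIGRT_1': 1})
```

Saving each run with strace -o and feeding the files through this would show whether the hung runs always stop after a similar number of signals.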

Any wild guesses are welcome. I am banging my head against the wall at this point. Thanks,

Eric

Eric,

This host will be getting a kernel upgrade sometime early next week (schedule TBD). I suspect the version of the SKAS patch used on the existing kernel might be at fault – it's had a number of bug fixes since then.

-Chris

@eman:

… but I have the following compile flags in my gentoo make.conf settings:

CFLAGS="-march=pentium3 -O3 -pipe"

I don't think gcc's pentium3 is the right one for Xeon/P4s. This is what is in my make.conf:

CHOST="i686-pc-linux-gnu"
CFLAGS="-O2 -march=i686 -fomit-frame-pointer"

No problems with this on my Linode or local box (AMD/Duron.) And Chris has probably used stage3-i686 for the Gentoo images, so stick with that -march.

See make.conf(5) and Gentoo forums if you want to play with CFLAGS.

HTH,

Cliff

@c1i77:

@eman:

… but I have the following compile flags in my gentoo make.conf settings:

CFLAGS="-march=pentium3 -O3 -pipe"

I don't think gcc's pentium3 is the right one for Xeon/P4s. This is what is in my make.conf:

If the Pentium 3 predates the Xeon/P4s, then they should be backwards compatible with it, I would think. Plus it was working fine on the old machine, which had the same processor.
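One way to check that assumption instead of guessing: as far as I know, gcc's -march=pentium3 lets the compiler use MMX and SSE instructions (plus FXSR), so any CPU that advertises those flags should be able to run the resulting binaries. The sketch below checks /proc/cpuinfo-style text for them. The function name and the sample text are made up for illustration, and a UML guest may not report the host CPU's flags in /proc/cpuinfo at all, so this would have to be run somewhere the flags line is actually visible.

```python
# Instruction-set flags assumed to be needed by -march=pentium3 code.
REQUIRED_FOR_PENTIUM3 = {"mmx", "fxsr", "sse"}


def missing_pentium3_flags(cpuinfo_text):
    """Return the required flags that the first 'flags' line does not list."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return REQUIRED_FOR_PENTIUM3 - set(value.split())
    # No flags line at all, so nothing can be confirmed.
    return set(REQUIRED_FOR_PENTIUM3)


# Invented sample, not taken from any real host.
SAMPLE = "model name : example CPU\nflags : fpu tsc mmx fxsr sse sse2\n"

print(missing_pentium3_flags(SAMPLE))  # prints set()
```

An empty result means every flag in the assumed list is present. On a real box the input would come from reading /proc/cpuinfo.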

@c1i77:

CHOST="i686-pc-linux-gnu"
CFLAGS="-O2 -march=i686 -fomit-frame-pointer"

The problem is that I cannot simply rebuild the software with different settings because most compiles will not make it all the way through. The services I offer on my server seem to be stable (web/email) so I think I am going to wait until the kernel upgrade and see if that fixes my problems. If it doesn't then I will either have to rebuild the entire box or downgrade back to my old linode host (if that is possible). So I am hoping this kernel upgrade works for me because I really don't want to do either of those two options.

It's generally a good idea not to optimize those settings on any server. It's great for a desktop that you want running in top shape, but it's not always guaranteed you'll be running on a Xeon processor, and you typically want your server software to be able to handle anything thrown at it, such as upgrades in a couple of years. It's just going to cost you more hassle than it's worth, as it apparently already has. The per-processor optimizations are aimed more at desktop users in the first place.

I don't run Gentoo on my Linode for the optimizations, I use it because it's easier to maintain, keep up to date, and manage services.

@tierra:

It's generally a good idea not to optimize those settings on any server. It's great for a desktop that you want running in top shape, but it's not always guaranteed you'll be running on a Xeon processor, and you typically want your server software to be able to handle anything thrown at it, such as upgrades in a couple of years. It's just going to cost you more hassle than it's worth, as it apparently already has. The per-processor optimizations are aimed more at desktop users in the first place.

I have found that on actual hardware I rarely upgrade the box. By the time I am ready to upgrade a server, it is usually just as cheap to buy a whole new box as to upgrade all the parts. I may slap some more memory in the system, but that just requires a reboot. If I have a whole new box, I would rather rebuild the OS and install my programs than copy the entire system over, because all sorts of hardware has changed. So I would be rebuilding the software with optimizations anyway.

In the case of UML this is not really the case since it is much easier to move the machine a virtual server is on. But even with that I don't think the optimizations are the cause of the problem. There should be no reason why pentium 3 optimizations should not work on the Xeon processor. Also don't forget these optimizations were working fine on the old machine I was on (host9) which should have been a Xeon also. Right now I am leaning towards the host kernel as Caker suggested. At least I am hoping that is it. :D

@tierra:

I don't run Gentoo on my Linode for the optimizations, I use it because it's easier to maintain, keep up to date, and manage services.

I agree 100%. The reason I like Gentoo is that I get to choose what is in the build. The optimizations are just a bonus. If I have to choose an optimization setting, I might as well choose a faster one. I didn't try anything fancy. I figured -O3 aimed at pentium3 would be safe and fast. Perhaps I was a bit optimistic. Or perhaps it is a host kernel problem. Only time will tell. Thanks everybody for the advice and suggestions. Hopefully I won't have to rebuild my entire machine. :)

@eman:

I figured -O3 aimed at pentium3 would be safe and fast.

Yeah, that's not all out crazy like some I've seen.

@caker:

Eric,

This host will be getting a kernel upgrade sometime early next week (schedule TBD). I suspect the version of the SKAS patch used on the existing kernel might be at fault – it's had a number of bug fixes since then.

Caker, did you get a chance to upgrade the host kernel on this machine? My MySQL processes are still disappearing and I still cannot compile software.

Thanks,

Eric

@eman:

Caker, did you get a chance to upgrade the host kernel on this machine? My MySQL processes are still disappearing and I still cannot compile software.
Glad you asked. Look for the reboot schedule announcement later this afternoon. I'm shooting for Thursday evening for host20. Basically I was waiting for more bug fixes from the UML dev guys…

-Chris

@caker:

Glad you asked. Look for the reboot schedule announcement later this afternoon. I'm shooting for Thursday evening for host20. Basically I was waiting for more bug fixes from the UML dev guys…

Chris,

I hate to sound impatient, but I was wondering about the status of this kernel upgrade. This host has really been causing some stability problems for me and I need to get them resolved. If it is not the kernel, then I will need to really dig in to track down the problem, but I would like to eliminate the kernel from the list of suspects.

Things seemed to be quite stable on my old host and I really hate the fact that I am paying more now and getting a worse system. It runs fast but with processes dying at random it really makes the whole point of the server useless. Like I said I don't mean to sound impatient but if I can't get these stability problems resolved then I may have to switch to a dedicated server to get rid of UML problems. UML may be cheaper but if I cannot compile programs and cannot run services what good is it?

Eric,

I'm still working on the kernel upgrade, but in the meantime I've configured a migration of your Linode (I think I got the correct username for you) to a host running the 2.6.8-4 kernel. Your current host is 2.6.7-1. Would you mind migrating and letting me know if the problems you are seeing go away?

Thanks,

-Chris

The other option is to temporarily move to my test box at HE (with no one else on it) where I have one of my numerous 2.6.9 host kernels running. I'd need you to move back to another host shortly before the reboots occur, though. Your call.

-Chris

@caker:

The other option is to temporarily move to my test box at HE (with no one else on it) where I have one of my numerous 2.6.9 host kernels running. I'd need you to move back to another host shortly before the reboots occur, though. Your call.

I don't really keep up with the UML stuff so I don't know which would be best. My main criteria are:

  • Something that is different from what I have now

  • Something that is as stable as possible (while still being different)

  • Something that doesn't involve me changing IP addresses.

Also, what kernel should I be running on my machine? I have been running 2.4 for the most part. I tried 2.6 a while back but started getting some problems and didn't feel like debugging, so I just switched back to 2.4.

I really appreciate your help in getting this resolved.

It would help me determine if the new kernel/skas/sysemu patches solve the problem you and a few others have been reporting. Give me an hour or so, and I'll get the test environment ready for you, and I'll configure a new migration (don't do the one I've already configured). These migrations don't take very long to complete (10 minutes or so), and nothing will change on your end…

I'd suggest you stay with the 2.4 based kernel for now.

Shoot me a support ticket and we'll communicate that way.

Thanks,

-Chris

Eric,

I've configured a migration to the test box. Please log in to your account, shut down, and press the migrate button. I'd estimate it will take about 10 minutes to make the transfer.

How soon do you think you'll know if there's a difference with this host kernel?

Thanks,

-Chris

Linode Staff

Eric,

How are things working out on the new host kernel?

-Chris

@caker:

How are things working out on the new host kernel?

It is working great! I haven't had a problem, and everything has been performing great. I really appreciate your help in getting it resolved.
