VPS blew up and had to reboot, what happened?
What I know is that CPU usage dropped, traffic dropped, IO and swap rocketed, and I got an email from a cronjob saying "psycopg2.Error: psycopg went psycotic without error set", which seems to be a generic exception for when the PostgreSQL server goes away mid-connection, so it's probably a symptom rather than the cause.
Where should I look to try and figure out what went wrong and what could I do for the future so that I have more information in situations like this?
12 Replies
How are you running the app, with mod_wsgi? In daemon mode? Which framework? Transactions properly closed or aborted? Anything in postgres logs?
Unless you've since rebooted a second time, you can use Lish to see the console output (logview) from before the crash.
The most common cause of OOM situations around here is a default Apache configuration, which is designed for servers with gigs of RAM. How's your MaxClients setting?
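A quick way to check (the config paths depend on the distro, so adjust as needed):

grep -ri maxclients /etc/apache2 /etc/httpd 2>/dev/null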
Out of memory: Kill process 32604 (httpd) score 10 or sacrifice child
Killed process 32604 (httpd) total-vm:28256kB, anon-rss:7976kB, file-rss:224kB
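If syslog survived the reboot, something like this should pull out any other OOM-killer activity (log paths differ by distro, so these are guesses):

grep -i 'out of memory\|killed process' /var/log/syslog     # Debian/Ubuntu
grep -i 'out of memory\|killed process' /var/log/messages   # CentOS/RHEL
dmesg | grep -i 'killed process'                            # only covers the current boot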
Is there a way to redirect logview to a file or something?
I'm currently using Django with MPM + non-daemoned mod_wsgi, because it's actually running fine 99% of the time and I use ReportLab, which has infamous threading problems, but I could put some locks in.
What realistic difference do worker and daemon mode actually make?
mpm_prefork MPM
You can tell which you're using with this command. It'll be the one on the line starting with "ii":
dpkg -l apache2-mpm-*
For capturing the output from logview, do it like you'd do a remote command over ssh:
ssh linode12345@cityname666.linode.com logview
Either redirect to a file, or scrollback/copy/paste. (and replace it with your actual lish info)
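For example, to save it straight to a local file (using the placeholder Lish details above):

ssh linode12345@cityname666.linode.com logview > logview.txt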
@hoopycat:
You can tell which you're using with this command. It'll be the one on the line starting with "ii":
dpkg -l apache2-mpm-*
Since the process name is httpd, I think he's probably using CentOS; Debian/Ubuntu name the process apache2. So he'd need a yum command (which I unfortunately don't know). However, another way is to run
httpd -V
then look at the APACHE_MPM_DIR line.
@KyleMac:
What realistic difference do worker and daemon mode actually make?
I prefer daemon mode because I can use mpm_worker for Apache, which then acts only as an application server (or optionally also serves static files, for which mpm_worker is more than fine), and I can set a fixed number of processes and threads per process, assign a user and group, and point it at a dedicated virtualenv python path.
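For concreteness, a minimal sketch of that kind of setup; the names, user/group, counts, and paths here are placeholders loosely based on the paths elsewhere in this thread, not a recommendation for this particular box:

# hypothetical vhost fragment: Apache only proxies to a fixed pool of daemon workers
WSGIDaemonProcess example.com user=kyle group=kyle processes=2 threads=15 python-path=/home/kyle/example.com/eggs
WSGIProcessGroup example.com
WSGIScriptAlias / /home/kyle/example.com/bin/django.wsgi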
What is weird in all of this is that you have 10+ hours of reduced bandwidth, no visibly significant growth in CPU consumption (albeit there is a visible growth trend toward the end) and then suddenly collapse with lots of disk IO and swapping. I could be wrong, but somehow I doubt Apache OOMed, it would start swapping immediately, not after 10 hours, unless I'm gravely mistaken…
@Azathoth:
What is weird in all of this is that you have 10+ hours of reduced bandwidth, no visibly significant growth in CPU consumption (albeit there is a visible growth trend toward the end) and then suddenly collapse with lots of disk IO and swapping. I could be wrong, but somehow I doubt Apache OOMed, it would start swapping immediately, not after 10 hours, unless I'm gravely mistaken…
prefork does not immediately start up 150 processes with MaxClients 150; that's just the limit. It will only run as many as are needed, governed by StartServers, MinSpareServers, and MaxSpareServers. IINM, the defaults on those are 5/5/10. So if you're usually getting no more than, say, 10 hits at a time, you'd have at most 20 processes running. However, if you suddenly get crushed by a bunch of traffic, that MaxClients of 150 allows Apache to crash the system by spinning up 150 processes, which your amount of RAM can't handle.
This is a pretty common thread here: people install Apache, don't configure it, and it's fine for a while because they're not getting much traffic at any given time. But then something happens that sends their site a bunch of traffic and they OOM.
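Purely as an illustration, a prefork section sized for a small VPS might look something like this; the right MaxClients depends entirely on how much RAM each httpd/mod_wsgi process uses, so measure with top or ps before copying anything:

<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    # keep this well under (free RAM) / (resident size of one httpd process)
    MaxClients           20
    MaxRequestsPerChild 500
</IfModule>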
@Azathoth:
What is weird in all of this is that you have 10+ hours of reduced bandwidth, no visibly significant growth in CPU consumption (albeit there is a visible growth trend toward the end) and then suddenly collapse with lots of disk IO and swapping. I could be wrong, but somehow I doubt Apache OOMed, it would start swapping immediately, not after 10 hours, unless I'm gravely mistaken…
Syslog (which had better info than logview) shows what must be 100+ httpd processes, whereas I rarely see more than 20 in top. So unless I'm interpreting it wrong, I think something really did make Apache fire up everything.
In Apache's error_log there are a lot of:
[Mon Nov 28 17:50:11 2011] [error] [client 0.0.0.0] mod_wsgi (pid=7268): Exception occurred processing WSGI script '/home/kyle/example.com/bin/django.wsgi'.
[Mon Nov 28 17:50:11 2011] [error] [client 0.0.0.0] IOError: failed to write data
But I'm pretty sure that just happens when a client closes their connection before receiving the response. If it happens in certain places, Django will actually email you the exception.
Five minutes before things go wrong, a lot of these start appearing in access_log, sometimes in bundles of 5-10:
::1 - - [30/Nov/2011:21:02:28 +0000] "OPTIONS * HTTP/1.0" 200 - "-" "Apache (internal dummy connection)"
A couple seconds later these start appearing (32 in total) in the error_log:
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] mod_wsgi (pid=32437): Exception occurred processing WSGI script '/home/kyle/example.com/bin/django.wsgi'., referer: http://blog.blogspot.com/
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] Traceback (most recent call last):, referer: http://blog.blogspot.com/
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] File "/home/kyle/example.com/eggs/Django-1.3.1-py2.7.egg/django/core/handlers/wsgi.py", line 274, in __call__, referer: http://blog.blogspot.com/
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] signals.request_finished.send(sender=self.__class__), referer: http://blog.blogspot.com/
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] File "/home/kyle/example.com/eggs/Django-1.3.1-py2.7.egg/django/dispatch/dispatcher.py", line 172, in send, referer: http://blog.blogspot.com/
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] response = receiver(signal=self, sender=sender, **named), referer: http://blog.blogspot.com/
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] File "/home/kyle/example.com/eggs/Django-1.3.1-py2.7.egg/django/db/__init__.py", line 85, in close_connection, referer: http://blog.blogspot.com/
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] conn.close(), referer: http://blog.blogspot.com/
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] File "/home/kyle/example.com/eggs/Django-1.3.1-py2.7.egg/django/db/backends/__init__.py", line 244, in close, referer: http://blog.blogspot.com/
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] self.connection.close(), referer: http://blog.blogspot.com/
[Wed Nov 30 21:02:42 2011] [error] [client 0.0.0.0] InterfaceError: connection already closed, referer: http://blog.blogspot.com/
This seems to be a Django signal sent when a request finishes, to close the database connection. Since it's sent right at the end, it's presumably outside Django's error handling, so I received no emails and nothing was logged in the app itself.
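Roughly what the traceback corresponds to (a paraphrase of the code path, not copied from Django's source): at the end of every request the WSGI handler fires request_finished, and django.db's receiver closes a connection the server has already dropped, outside any view-level error handling.

# rough sketch of what Django 1.3 wires up, per the traceback above
from django.core import signals
from django.db import connections

def close_connection(**kwargs):
    for conn in connections.all():
        conn.close()   # raises InterfaceError if the connection was already closed underneath us

signals.request_finished.connect(close_connection)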
Then finally it happens:
[Wed Nov 30 21:07:25 2011] [error] server reached MaxClients setting, consider raising the MaxClients setting
So I'm thinking PostgreSQL fell over, which also hung all the Apache processes connected to it, and so Apache launched new processes and quickly ran out of memory. In this case lowering MaxClients would only be half the fix.
PostgreSQL doesn't seem to log to a file by default.
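For next time, these are the stock postgresql.conf knobs for sending server messages to files (exact defaults and the file's location vary by distro and PostgreSQL version, so treat this as a sketch):

# requires a server restart for logging_collector to take effect
logging_collector = on
log_directory = 'pg_log'                     # relative to the data directory
log_filename = 'postgresql-%Y-%m-%d.log'
log_min_messages = warning
log_line_prefix = '%t [%p] '                 # timestamp and PID on every line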
So the system was low on memory and the oom-killer killed an instance of postmaster, which made PostgreSQL terminate all connections and roll back. This seems to have caused most of the Apache processes to hang, so Apache started new ones, which of course couldn't reconnect to the database yet and hung as well. Eventually Apache hit MaxClients and the oom-killer was killing off httpd processes until it all locked up and died.
@KyleMac:
So I'm thinking PostgreSQL fell over, which also hung all the Apache processes connected to it, and so Apache launched new processes and quickly ran out of memory. In this case lowering MaxClients would only be half the fix.
Psycopg recovers from lost connections so I doubt that alone is the cause of Apache hanging. I mean, you can test it. Fire up the app, restart the database (effectively killing current connections), and then see what happens.
What you should see is an error (Django probably shows 500 or 503), but if you repeat the requests it should come back online. So if you had 5 processes active and Apache round robins among them, then it should come back online after 5 requests.
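A rough way to run that test from the shell (the init script name and the URL are placeholders):

sudo /etc/init.d/postgresql restart        # drops every open client connection
# hit the site a few times; expect one error per worker process, then recovery
for i in $(seq 1 10); do
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost/
done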
IOError: failed to write data
You should really investigate this and be 100% sure, not just pretty sure, that the exception is raised when the client closes the connection before the WSGI handler returns naturally. For starters, you could match the timestamps against the access logs to see the status code, the length returned, and the resource queried.
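For example, something along these lines would pull the matching access_log lines for one of the error timestamps (the log path is a guess):

# error_log says [Mon Nov 28 17:50:11 2011]; access_log uses 28/Nov/2011:17:50:11
grep '28/Nov/2011:17:50' /var/log/httpd/access_log
# then check the status code and bytes sent on the matching lines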
BTW, you show log entries around 9pm which is when the swapping and collapse occurred. But what happened around 10am when the bandwidth suddenly went south with NO visible swapping in the graphs?
@Azathoth:
Psycopg recovers from lost connections so I doubt that alone is the cause of Apache hanging. I mean, you can test it. Fire up the app, restart the database (effectively killing current connections), and then see what happens.
What you should see is an error (Django probably shows 500 or 503), but if you repeat the requests it should come back online. So if you had 5 processes active and Apache round robins among them, then it should come back online after 5 requests.
I tried to recreate what happened by killing postmaster but it just goes down completely like you said, whereas what happened when it blew up was the following.
In syslog:
Nov 30 21:02:36 hostname kernel: Killed process 1812 (postmaster) total-vm:48660kB, anon-rss:204kB, file-rss:27768kB
And then in the PostgreSQL logs:
LOG: background writer process (PID 1812) was terminated by signal 9: Killed
LOG: terminating any other active server processes
Then the following repeated a lot:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
And then after the reboot a lot of:
FATAL: the database system is in recovery mode
None of this happens if I just bluntly "kill -9 postmaster", so I think I have to time it so postmaster is killed in the middle of an insert; otherwise it just dies quietly rather than rolling back and protecting the data.
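One thing worth noting from the kernel log above: the process that got killed (PID 1812) was the background writer, a child of the postmaster, and it's a child backend dying abnormally that triggers the "terminating any other active server processes" crash-recovery cascade. Something like this should reproduce it without any timing tricks (the process title varies by PostgreSQL version):

pkill -9 -f 'postgres: writer process'    # kill the background writer child, not the parent
# then watch the PostgreSQL log for the crash-recovery messages shown above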
@Azathoth:
IOError: failed to write data
You should really investigate this and be 100% sure, not just pretty sure, that the exception is raised when the client closes the connection before the WSGI handler returns naturally. For starters, you could match the timestamps against the access logs to see the status code, the length returned, and the resource queried.
This is a pretty common error if you google for it plus "mod_wsgi". If the connection is closed before Django has finished reading the request, it raises "IOError: request data read error", which gets logged by Django (this is common during large file uploads that fail). If the connection is closed while the response is being written, it raises "IOError: failed to write data", which gets logged by Apache.
In this case all the errors seem to be caused by IE8 and IE9 pinging for /favicon.ico, which is returning 404 (my real favicon is on S3, so maybe I should add a redirect to it, something like the sketch below).
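A one-line redirect in the urlconf would do it (Django 1.3 style; the S3 URL is made up):

# urls.py -- hypothetical bucket URL; stop favicon requests from 404ing
from django.conf.urls.defaults import patterns, url
from django.views.generic.simple import redirect_to

urlpatterns = patterns('',
    url(r'^favicon\.ico$', redirect_to,
        {'url': 'https://s3.amazonaws.com/my-bucket/favicon.ico', 'permanent': True}),
)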
I think anyone who has run a Django app should have run into this error a couple of times. It may be made worse by Django writing to the response in a bad way:
@Azathoth:
BTW, you show log entries around 9pm which is when the swapping and collapse occurred. But what happened around 10am when the bandwidth suddenly went south with NO visible swapping in the graphs?
The traffic before 10am was due to a crawler called Sistrix (http://crawler.sistrix.net). Since all my media is on S3, the Linode traffic graph is more sensitive to pageviews.