[Postgresql] strange pattern in UPDATE/SELECT on one table
> => \d tokens
Table "public.tokens"
Column | Type | Modifiers
–----------+--------------------------+-----------
session_id | character varying | not null
username | character varying | not null
expires | timestamp with time zone | not null
id | character varying | not null
created | timestamp with time zone | not null
Indexes:
"tokens_pk" PRIMARY KEY, btree (id)
I measured the performance of UPDATEs/SELECTs on this table using the primary key in the where clause, and while most of the time the query completes in 2-3 ms, sometimes it can take 500-1000ms (0.3% of the time, but still).
My first two reactions were to REINDEX and DELETE/VACUUM/ANALYZE since old records are useless to me. That didn't change anything.
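For reference, the maintenance was roughly this (assuming "old" simply means expired):
DELETE FROM tokens WHERE expires < now();
VACUUM ANALYZE tokens;
REINDEX TABLE tokens;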
According to my measurements over the last 250k SELECTs/UPDATEs, the average query execution time was 3 ms, with a standard deviation of
Now something worth knowing is that my id is quite large and the query plan is the following:
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Index Scan using tokens_pk on tokens (cost=0.00..8.27 rows=1 width=128)
Index Cond: ((id)::text = '3aaa6faaab3aaadbaaaaaae181faaabc6aaaaa7c7aaaacce1da3aaa1222a5aaa'::text)
(2 rows)
I should add that the load on the server is quite low, there's CPU/RAM to spare and I/O is low also.
Would anyone know what's wrong with that table?
Your indexes are also odd, or rather your complete lack of them on anything but your primary key. You only ever do selects/updates with the id, never with the session_id, username, expires, created, or anything else in the where (or order by) clause?
but I agree it's quite large (64 characters) and could/should very much be reduced by 3/4
the where clause is on (id, expires) 95% of the time and 5% of the time on (id, username, expires) (the latter will change so that it's always on (id, expires))
keep in mind this works well for 99.7% of queries (<20 ms), and the table doesn't have a huge row count (it should oscillate between 5k and 20k)
for indexes, I assumed (maybe wrongly) that having one on id would be sufficient since id is unique and it's only a matter of checking if the token has expired or not
thanks for taking the time to answer
That will very likely grease the weasel all the way to market, but if not, an index on expires or (id, expires) might help too.
Edit: on second thought, the btree on that huge string is most likely sub-optimal, and if not balanced properly could lead to high lookup times for certain values (which could cause the issue), but again the table is small, so even if the tree is the worst possible I don't expect the performance to be that bad even in the worst case. Something to investigate anyway.
Is there anything that differentiates the slow queries from the fast queries? Are they only UPDATEs? Only SELECTs?
the only thing that differs between a fast and a slow query is the value of id… the rest of the clause being "expires > now()"
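So a slow query looks something like this (the column list and the UPDATE's SET clause are just illustrative; the WHERE clause is the point):
SELECT session_id, username
FROM tokens
WHERE id = $1 AND expires > now();

UPDATE tokens
SET expires = now() + interval '30 minutes'  -- interval is illustrative, not my real value
WHERE id = $1 AND expires > now();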
I'll try to see if some values of id are recurrently the source of issues
I didn't specify any caching of the table or index in memory, I'll look into it
@sob:
Edit: on second thought, the btree on that huge string is most likely sub-optimal, and if not balanced properly could lead to high lookup times for certain values (which could cause the issue), but again the table is small, so even if the tree is the worst possible I don't expect the performance to be that bad even in the worst case. Something to investigate anyway.
The ASCII representations for 0..9 and a..f don't lend themselves to a good random distribution, that's for sure.
Caching is another good thing to check for. Unfortunately, I'm not too familiar with PostgreSQL's tuning knobs, but disk will take forever (comparatively speaking).
The more I think about it, the more I like the idea of sticking an index on (id, expires)… if you're just looking for the existence or not of a row where id=bleh,expires>now(), I have reason to believe a btree can just crap that out straight from the index. 'course, if you're doing SELECT session_id,username,etc,etc, you probably won't buy too much except on a miss.
Another question: indexes are btree by default but can be hash if specified.
It seems btree is good at comparisons (<, >) and hash at equality tests (=).
My where clause involving expires will obviously always be a comparison, whereas my id is always an equality test.
Should I have a hash index on id and a btree on expires? Or a btree on (id, expires)?
It's not obvious to me what would work best…
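Concretely, the two shapes I'm hesitating between would be something like this (index names are just placeholders):
-- option 1: split by operator type - hash for the equality on id, btree for the range on expires
CREATE INDEX tokens_id_hash_idx ON tokens USING hash (id);
CREATE INDEX tokens_expires_idx ON tokens (expires);
-- option 2: one composite btree covering the whole where clause (id = ... AND expires > ...)
CREATE INDEX tokens_id_expires_idx ON tokens (id, expires);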
As I don't want to run the risk of having two identical integers, my UNIQUE constraint is on id and expires.
pilotest=> \d tokens
Table "public.tokens"
Column | Type | Modifiers
------------+--------------------------+-----------
session_id | character varying | not null
username | character varying | not null
expires | timestamp with time zone | not null
created | timestamp with time zone | not null
id | integer | not null
Indexes:
"tokens_id_key" UNIQUE, btree (id, expires)
And I still have queries that are extremely slow (some reach 1.5s); things actually seem to have gotten worse.
Most SELECT/UPDATE on that table get executed within 3ms. And some (identical in structure) go beyond 1s. I don't get it.
Still sounds weird, whatever's going on here… is it reproducible for particular queries, or does it happen on a periodic basis (every x seconds/minutes)?
Might be worthwhile watching "vmstat 1" and seeing if there's a spike in I/O around the time of a slow query. Sounds like there's also a fair amount of writes to go along with the reads, so maybe it is picking that moment to sync to disk…
(I really oughta read up on PostgreSQL more. I have a fairly active database running on it, sure, but it hasn't needed a lot of tuning, and most situations like this get covered up by considerable memcacheding.)
Then I removed the index altogether and I still have slow queries, but they're in the 100-200ms range instead of 1s.
I am obviously monitoring I/O, CPU load and others… the server is really underused.
SQL Antipatterns
How closely are you watching IO? This is one of those cases where you almost need as much resolution as you can get; five minutes won't cut it. Watching vmstat on a one-second interval should help cut through the noise and point towards where a problem might be, without capturing too much data and creating its own IO storm.
With that said, I suspect that focusing on the table structure, or indices, or certainly the data types is not going to make any difference. From what I can see you're talking quite small amounts of data, and even full scans probably shouldn't be subject to that much variation. Certainly worrying about an integer vs. 64 character strings as an index, or btree vs. hash indices is way premature optimization at this point (and unlikely to be a space-I/O-performance issue until you have a much larger data set). Yes, a large and frequent update cycle will create some bloat due to dead rows but if that was your problem I'd expect a gradual and overall slowdown in transaction rate, not spikes.
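(If you do want to sanity-check the dead-row angle, the standard statistics view will show it - something along these lines:)
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'tokens';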
My first inclination is that given the sporadic nature and the significant deviations in time, that you're just running into host contention for disk I/O. That is, it's not really your schema structure at all.
However, it's also possible that there are some things you can check for which are increasing the odds of your Linode having a spike in I/O itself, which could then make any host contention worse. A few thoughts (roughly ordered in decreasing priority):
* If you haven't, enable long duration query logging to ensure that the delay is internal to the database. Plus you'll then have timestamps to correlate with other database logging. (There are example settings in the snippet after this list.)
* If you haven't, enable checkpoint logging and see if it correlates at all to the increased timings. If your data patterns are creating large and brief checkpoints that can delay other I/O, you could increase checkpoint frequency and lengthen the checkpoint duration to more evenly distribute the load.
* Do you have autovacuum enabled and is it running? Enable logging for it if not sure. Depending on your access patterns its default configuration may not be aggressive enough for a table that is frequently updated. Or conversely, it may be kicking in too aggressively and interfering a little, so you might wish to take more explicit control and only do the vacuums when you're not doing heavy updates. (Though again, the total quantity of your data would seem to make this not a big deal either way.)
* When checking your query, be sure to use "explain analyze" and not just "explain" so it can include actual results vs. estimates. This is a pretty trivial planner result, but if you have a complicated one, http://explain.depesz.com/ can help highlight bottlenecks.
* Does your Linode have enough memory allocated to buffer/cache, and have you configured PostgreSQL reasonably? Its default configuration is extremely conservative. At a minimum check effective_cache_size so it has a reasonable estimate of the system cache available to it. (Although given your data size and the query plan you posted, I doubt this is an issue specifically in this case.)
* Check what I/O scheduler your kernel is using. If it's defaulting to cfq, change to deadline or even noop. There's little benefit from cfq in a Linode environment (and with a shared RAID array with BBU) and it can delay PostgreSQL I/O in favor of other processes.
* Try testing with fsync=off. Part of the bottleneck may be PostgreSQL waiting on WAL or table syncs, but the Linode RAID+BBU environment makes the need to wait on a sync much less of an issue. I don't think I'd run with fsync=off, but it is a useful test to see if it changes behavior. If you're using 8.3+, synchronous_commit=off should help with performance just as much but be safe from a consistency perspective (and data loss perspective given the BBU in the host RAID).
* Do you have any long lived transactions (I suspect not, but…)? If so, they can keep additional dead rows alive until they exit, since they technically need visibility as of when they started.
* Is anything else on your Linode generating sporadic I/O traffic? You might try something like iotop or collectl to watch per-process I/O at a high frequency while you try to generate a delayed query. If the increase in I/O at that point in time isn't huge, then it's good odds that it's something beyond your control.
* If you can watch your I/O wait % during such a problem and it jumps very high, that's also a good indication that you're purely waiting on I/O - which can be normal in some circumstances - but only if you are generating a lot.
* And finally, in terms of testing the basic setup, another choice is to spin up another Linode, clone your current one over there, and try the same tests. If you get different behavior, it again may point at heavier than usual host load. Another similar alternative would be to set up a VM at home with the same configuration - where you know there is no additional contention for disk I/O - and see how the same configuration and data performs there.
Assuming you find a discrepancy compared to your current Linode under the same configuration, at that point you could open a ticket and Linode will probably offer to migrate you to another host. Or you could just do it yourself by switching to your second Linode as primary and deleting the first if feasible.
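To make a few of those concrete, here's roughly what I'd check or set. The parameter names are standard PostgreSQL settings; the thresholds are just starting points, and the logging ones go in postgresql.conf followed by a reload rather than a per-session SET:
-- in postgresql.conf: log queries slower than 20 ms, plus checkpoint/autovacuum activity
-- log_min_duration_statement = 20
-- log_checkpoints = on
-- log_autovacuum_min_duration = 0

-- what the planner currently thinks it has to work with
SHOW effective_cache_size;
SHOW shared_buffers;

-- the safer alternative to fsync=off on 8.3+, settable per session
SET synchronous_commit = off;

-- and compare estimates vs. reality for one of the slow ids ('...' is whichever value you caught)
EXPLAIN ANALYZE SELECT 1 FROM tokens WHERE id = '...' AND expires > now();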
– David
Thank you for all these suggestions. I enabled query logging for long queries as a first step, I'll see tomorrow what the results are when users come back to the site.
Then I'll try to play with the linux I/O scheduler and fsync as you advised.
Thanks!
@sob:
I switched to a 32 bit integer for id, generated by rand().
why rand instead of using auto_increment?
though if you go auto_increment, do not use another value in the primary key
@sob:
I'm measuring I/O with iostat every 5 min indeed. Tried to run "vmstat 1" but there's really nothing going on, a few blocks out here and there but not much.
5 minute cycles are not going to be anywhere near granular enough for something like this. You'll want at least every second, or more frequently if the tool permits and you're able to reproduce the issue in a relatively tight time frame.
One thing I do sometimes is use a tool like vmstat, iostat, etc.. in a small shell loop where I let it run once a second for maybe 60 seconds, then generate a timestamp and then go back and loop again. To avoid polluting the I/O with the stat logging itself, just leave an ssh connection running the script and log the information on a remote host.
That way if you have a timestamp (in this case from the database long duration logging) you can find the nearest timestamp in the stat output logging and see what was happening in the vicinity.
Oh, and to your prior response, if you're enabling any logging, you may as well ensure that all of long duration queries, checkpoints and autovacuums are all logged - no reason to wait to at least turn on the logging for diagnostic purposes.
– David
rand() so that they're not predictable
db3l
of course 5 min is not granular enough, but I did run vmstat 1 on screen after your comment, trying to find spikes in I/O, and didn't see anything crazy
I enabled logging for long queries and reduced other I/O from other sources on the guest (such as debug logging of a couple of custom apps), that way I hope to verify where the delays come from (postgres or outside)
autovacuum is disabled (I prefer to run it myself at night, and there's not much to vacuum anyway as I move rows only at night in my archiving batch, and VACUUM ANALYZE after)
@sob:
of course 5 min is not granular enough, but I did run vmstat 1 on screen after your comment, trying to find spikes in I/O, and didn't see anything crazy
Did you get one of the longer queries occurring while you were running the vmstat? If so, and if there wasn't any noticeable increase in I/O load at the time, that's probably a reasonably good indicator that you're running into contention on the host external to your own guest. Though in that case I would expect you to see a noticeable spike in I/O wait percentage (final column in vmstat).
Of course, if you didn't have a long query simultaneous with the vmstat logging, then that particular run isn't all that useful.
– David
I'm logging queries > 20 ms in Postgresql, and the queries that php logs as >200ms (for instance) don't even show up as >20ms in the postgresql logs. That goes in the direction of contention for host IO right?
Checkpoints and long queries don't coincide, although I must admit I'm not sure what to make of such debug:
2011-05-09 16:30:00 EDT LOG: checkpoint complete: wrote 194 buffers (6.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=38.671 s, sync=0.002 s, total=38.675 s
I have 3 segments per checkpoint, of 16 MB each. So does this mean the 2*16MB write was spread over 38s?
I set up "vmstat 1 60 && date" in a loop to see if I get huge I/O spikes around the time the long queries take place.
Thanks again.
@sob:
I'm logging queries > 20 ms in Postgresql, and the queries that php logs as >200ms (for instance) don't even show up as >20ms in the postgresql logs. That goes in the direction of contention for host IO right?
Actually, no, that argues for no contention, unless your PHP script is also doing a lot of I/O. In other words, the database is able to do all of its work (including I/O) for all the queries made while that logging threshold was in place, without exceeding the threshold, which means that out of those 200ms, less than 10% is the database and disk I/O.
I suppose if I were you I might very briefly drop the threshold a bit further - something that has to be exceeded - just to make sure the logging was in fact working.
What sort of processing is the PHP code doing - are there any potential bottlenecks or ways it could get delayed?
> I'm not sure what to make of such debug:
2011-05-09 16:30:00 EDT LOG: checkpoint complete: wrote 194 buffers (6.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=38.671 s, sync=0.002 s, total=38.675 s
I have 3 segments per checkpoint, of 16 MB each. So does this mean the 2*16MB write was spread over 38s?
Not quite. You get a checkpoint when either checkpoint_segments WAL segments have been written or the checkpoint timeout is exceeded (whichever comes first). But how much data needs to be forced out at checkpoint time depends on how well the background writer has been keeping up with the actual changes.
See
At the point of a checkpoint, any remaining data that hasn't made it to the actual table/index files is flushed, at which point the WAL records are no longer needed for recovery (a checkpoint record is written to the WAL at that point). It is also at that point that earlier WAL segment files are permitted to be re-used.
In terms of the checkpoint, the log indicates how many remaining buffers were dirty and needed to be written, and the overall time to do so. The latter aims to spread your checkpoint over a time equal to the configured completion target as a percentage of the overall checkpoint timeout. This helps spread the I/O overhead.
So in your case, at the checkpoint time you had 194 remaining buffers to flush. I think a buffer is 8K, so that's about 1.5MB of data. The I/O was spread over 38 seconds. And the final fsync to ensure it was on the disk took almost no time. That latter point in particular is a good indication that there is little contention, since otherwise you'd block there waiting for the disk to fully acknowledge the sync.
If you wanted to minimize the checkpoint spike even further you could increase the checkpoint frequency (either by dropping the timeout or segment limit) which would then most likely have less dirty data to flush each time. But this is a pretty tiny amount of data.
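For reference, the relevant knobs would look something like this (the SHOW lines just display current values; the example settings are purely illustrative, go in postgresql.conf, and assume an 8.x-era config where checkpoint_segments still exists):
SHOW checkpoint_segments;          -- default is 3 (16 MB WAL segments)
SHOW checkpoint_timeout;           -- default is 5min
SHOW checkpoint_completion_target; -- default is 0.5

-- e.g. to checkpoint more often and spread the write-out further:
-- checkpoint_timeout = 2min
-- checkpoint_completion_target = 0.7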
To be honest, if the database isn't logging any queries beyond 20ms I'm not sure your issue is at the database level at all. You might need to work your way up the stack, and perhaps add some intermediate logging in your PHP code to try to determine where the remaining 180+ms is coming from in the longer queries. To minimize intrusion of the monitoring you could just store a timestamp at various points in the code path and then have a single top-level handler dump all the timestamps/deltas as part of some final output/log.
This assumes that you've definitely caught one of those 1000ms+ queries that started the thread. If not, then I suppose those extreme cases might still be contention within the database layer.
-- David
The following portion of code is what all my queries go through
// wall-clock start in seconds (microtime() returns "usec sec")
$starttime = explode(' ', microtime());
$starttime = $starttime[1] + $starttime[0];

$res = pg_query_params($dbconn, $query, $arguments);
if (!$res) {
    // ...
}

// elapsed time around the query only; log anything over 20 ms
$mtime = explode(' ', microtime());
$totaltime = $mtime[0] + $mtime[1] - $starttime;
if (1000*$totaltime > 20) {
    mylog(LOG_DEBUG, "dbfunc.inc", "query execution time", round(1000*$totaltime, 0)." ms ($query)");
}
mylog() formats the string and sends it to syslog… that's how I see which queries take more than 20ms
I'm using pgpool2 as a connection pooling service, which is supposed to reduce latency (and apparently it does the job well… if pgpool2 was the culprit I would see delays on all tables, not just "tokens")
now monitoring with vmstat, will see what the results are after tomorrow…
php is making a few queries per page viewed, writing a few <10k png's occasionally, really not much… and it's not like the 20ms I measure are from beginning to end of the page, it's really just around the execution of the query!
@sob:
I am certain I am logging queries above 20ms in postgresql as I see my VACUUM and others.
The only conclusion I think I can take from this is that if you have a query your application code is logging as taking 200ms, but there's no long duration log in the PostgreSQL logs, that over 80% of that time was above the level of the database engine. About the only thing that such logging can't account for is the network I/O between processes on the local host for a query and response but I have a really hard time believing that's anywhere but in the noise. Well, that or the connection setup itself if that's happening per connection.
Whether that's in the PHP code, in the libraries you are using to access the database, or in an intermediate layer like pgpool, I can't say, but it doesn't seem to be in the database, and by extrapolation, anything below that like the host disk subsystem.
This is different than how I read your original query which seemed to indicate logging showing it was the database itself periodically spiking and taking longer to process queries.
> php is making a few queries per page viewed, writing a few <10k png's occasionally, really not much… and it's not like the 20ms I measure are from beginning to end of the page, it's really just around the execution of the query!
Of course, execution of the query involves a lot more than just the database, given that you have the PHP db libraries, and then pgpool.
I suppose if I were in your shoes, I'd jettison pgpool, if only because it's a variable you can do without (and it's not like your scale likely needs it), and at least testing without it would be interesting. Connection setup/teardown with PostgreSQL is definitely high overhead comparatively speaking, so sharing longer lived connections is definitely a good thing in general though.
Otherwise, I guess I'd have to tear down the application code path and try to figure out just where the latency is coming from if it isn't the database. Probably with a bunch of logging since the problem only occurs sporadically. Maybe it'll turn out to be something like occasional garbage collection latency in the PHP interpreter, though that seems extreme to push up into the 1s range like some of your extreme cases.
It would be interesting to get timing from the actual database connection setup (pre-query). Do you use host names in pg_hba.conf and IP connections to the database even for local connections? I think that may involve a reverse DNS lookup, and maybe that's varying? But I'd think all of that would be very quick for localhost connections anyway.
Definitely odd, but also more and more looking like higher level than the basic database engine and schema.
– David
Postgresql didn't log it as above 20ms.
And vmstat shows this at the exact execution time of the query. There's I/O but it's not outrageous (I/O is zero around these seconds):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 6376 44120 25540 216888 0 0 0 112 124 147 0 0 90 10
0 0 6376 42632 25540 216892 0 0 0 112 586 701 5 0 83 12
I'll bypass pgpool2 tonight when users are gone, and try to look at a higher level as advised.
Only local connections so far.
Thanks
@sob:
And vmstat shows this at the exact execution time of the query. There's I/O but it's not outrageous (I/O is zero around these seconds):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 6376 44120 25540 216888 0 0 0 112 124 147 0 0 90 10
0 0 6376 42632 25540 216892 0 0 0 112 586 701 5 0 83 12
I will at least say that I/O wait values of 10% and 12% for those two seconds do show you are waiting on I/O a bit, and that's a bit higher than I'd expect for that quantity of data. Assuming this is 1s granularity, your 800ms logging is below the level of these stats, so those delays could account for it.
The wait isn't outrageous, particularly if it goes away quickly, but I was just watching one of my systems for a bit (reasonably idle) and see cases where it is able to push 2-4x that number of blocks out while still showing 100% idle and 0% wait. (Then again, I just caught one single entry with an I/O wait of 10 for 220 blocks, so…)
So host contention isn't completely off the table (isn't debugging fun). What is interesting is that whatever delay is occurring doesn't seem to be happening at or below the database engine, or it would delay the actual query execution, which is still intriguing, in terms of just where your overall request path is being delayed.
-- David
@sob:
I guess as a side question, if someone knows a tool that may be more suitable than jmeter to simulate load with authenticated users, it would help (I would need jmeter to be able to alternately select POST values for user/pwd from a list, for example, which doesn't seem possible now, at least to the best of my knowledge).
Since the problem is below the client request layer, how about just tweaking the top level PHP code to imagine that it's been receiving the requests? E.g., just write a top level loop (with any values/delays you want) and then execute the request handling path code just as it would be with a real web request.
You could also write some code to actually behave like a client, but that seems like it would be more work with very little diagnostic benefit.
Ideally you'd be able to generate a few delayed cases over a very short period (like no more than minutes). Having a controlled way to replicate the issue without depending on user traffic would be a big win when it comes to narrowing this down further.
Then you can re-run the same process under different scenarios and on different machines.
– David