reading postgresql explain output?

I am trying to use PostgreSQL's EXPLAIN for the first time. My intent is to perhaps create some indexes in order to speed up some queries. I have already run ANALYZE.

Here is the query (table names abstracted using search and replace):

SELECT *
 FROM "a"
INNER JOIN "b" ON ("a"."b_id" = "b"."id")
INNER JOIN "c" ON ("a"."c_id" = "c"."id")
INNER JOIN "e" ON ("b"."e_id" = "e"."id")
INNER JOIN "d" ON ("b"."d_id" = "d"."id")
INNER JOIN "d" T6 ON ("c"."d_id" = T6."id")
WHERE (
"b"."count" = true 
AND "c"."pulldate" >= '1994-04-03 00:00:00-06:00'
AND "b"."e_id" = 10 
AND "a"."state" IN (0,1)
)
ORDER BY "c"."pulldate" DESC, "c"."pubdate" DESC
LIMIT 10

and here is the output:

 Limit  (cost=4142.16..4142.19 rows=10 width=1331)
   ->  Sort  (cost=4142.16..4144.22 rows=821 width=1331)
         Sort Key: c.pulldate, c.pubdate
         ->  Hash Join  (cost=22.29..4124.42 rows=821 width=1331)
               Hash Cond: (c.d_id = t6.id)
               ->  Nested Loop  (cost=0.00..4090.84 rows=821 width=1153)
                     ->  Nested Loop  (cost=0.00..2122.80 rows=821 width=391)
                           ->  Nested Loop  (cost=0.00..37.22 rows=3 width=374)
                                 ->  Nested Loop  (cost=0.00..12.38 rows=3 width=196)
                                       ->  Seq Scan on e  (cost=0.00..1.19 rows=1 width=157)
                                             Filter: (id = 10)
                                       ->  Index Scan using b_e_id on b  (cost=0.00..11.16 rows=3 width=39)
                                             Index Cond: (e_id = 10)
                                             Filter: count
                                 ->  Index Scan using d_id_pkey on d  (cost=0.00..8.27 rows=1 width=178)
                                       Index Cond: (id = b.d_id)
                           ->  Index Scan using a_b_id on a  (cost=0.00..691.61 rows=287 width=17)
                                 Index Cond: (b_id = b.id)
                                 Filter: (state = ANY ('{0,1}'::integer[]))
                     ->  Index Scan using c_id_pkey on c  (cost=0.00..2.38 rows=1 width=762)
                           Index Cond: (id = a.c_id)
                           Filter: (pulldate >= '1994-04-03 06:00:00+00'::timestamp with time zone)
               ->  Hash  (cost=17.13..17.13 rows=413 width=178)
                     ->  Seq Scan on d t6  (cost=0.00..17.13 rows=413 width=178)

I'm looking to understand how to determine the best candidates for indexes. I don't want to create a multi-column index because I run several variations of this query on my site.

In your answers, please identify which columns to create indexes for, and how you arrived at that conclusion so that I can learn how to use this output better myself. Any help is appreciated!

For comparison's sake, here is the same query, but now I'm using state = 0 instead of state in (0,1):

SELECT *
 FROM "a"
INNER JOIN "b" ON ("a"."b_id" = "b"."id")
INNER JOIN "c" ON ("a"."c_id" = "c"."id")
INNER JOIN "e" ON ("b"."e_id" = "e"."id")
INNER JOIN "d" ON ("b"."d_id" = "d"."id")
INNER JOIN "d" T6 ON ("c"."d_id" = T6."id")
WHERE (
"b"."count" = true 
AND "c"."pulldate" >= '1994-04-03 00:00:00-06:00'
AND "b"."e_id" = 10 
AND "a"."state" = 0
)
ORDER BY "c"."pulldate" DESC, "c"."pubdate" DESC
LIMIT 10

And the output:

Limit  (cost=2969.43..2969.45 rows=10 width=1331)
   ->  Sort  (cost=2969.43..2970.12 rows=279 width=1331)
         Sort Key: c.pulldate, c.pubdate
         ->  Hash Join  (cost=22.29..2963.40 rows=279 width=1331)
               Hash Cond: (c.d_id = t6.id)
               ->  Nested Loop  (cost=0.00..2937.27 rows=279 width=1153)
                     ->  Nested Loop  (cost=0.00..2115.67 rows=279 width=391)
                           ->  Nested Loop  (cost=0.00..37.22 rows=3 width=374)
                                 ->  Nested Loop  (cost=0.00..12.38 rows=3 width=196)
                                       ->  Seq Scan on e  (cost=0.00..1.19 rows=1 width=157)
                                             Filter: (id = 10)
                                       ->  Index Scan using b_e_id on b  (cost=0.00..11.16 rows=3 width=39)
                                             Index Cond: (e_id = 10)
                                             Filter: count
                                 ->  Index Scan using d_id_pkey on d  (cost=0.00..8.27 rows=1 width=178)
                                       Index Cond: (id = b.d_id)
                           ->  Index Scan using a_b_id on a  (cost=0.00..691.61 rows=97 width=17)
                                 Index Cond: (b_id = b.id)
                                 Filter: (state = 0)
                     ->  Index Scan using c_id_pkey on c  (cost=0.00..2.93 rows=1 width=762)
                           Index Cond: (id = a.c_id)
                           Filter: (pulldate >= '1994-04-03 06:00:00+00'::timestamp with time zone)
               ->  Hash  (cost=17.13..17.13 rows=413 width=178)
                     ->  Seq Scan on d t6  (cost=0.00..17.13 rows=413 width=178)

8 Replies

…or do I need to start with smaller queries and analyze one join at a time?

A few thoughts off the top of my head:
* First, you'll probably get more useful feedback from a PostgreSQL-specific mailing list (http://www.postgresql.org/community/lists/). pgsql-performance is specifically for tuning queries; also see http://wiki.postgresql.org/wiki/SlowQueryQuestions for how best to post your question. It's reasonably rare that a schema (as opposed to the data it contains) needs to be secret, so obfuscating your schema isn't really all that helpful. At least for me, it's harder to mentally review a query full of "a", "b", "c" everywhere than one with "family", "child", "class", etc.

* You may find http://explain.depesz.com/ helpful in breaking down EXPLAIN output and accounting for where the time goes, but for actual timing be sure to use EXPLAIN ANALYZE (a minimal example follows this list).

* Alternatively, running the EXPLAIN through pgAdmin gives a graphical display of the output, which can also be helpful.

* It's possible you're jumping the gun in assuming you need further indices. In general, when trying to improve performance, I'd start with EXPLAIN ANALYZE so you get actual timing information and not just planner estimates. That can help indicate which particular portion of the operation is the biggest bottleneck to focus on.
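For reference, a minimal example of that first step - just prefix the query from the original post (note that EXPLAIN ANALYZE actually executes the statement, so take care with anything that writes):

    EXPLAIN ANALYZE
    SELECT * FROM "a"
    INNER JOIN "b" ON ("a"."b_id" = "b"."id")
    INNER JOIN "c" ON ("a"."c_id" = "c"."id")
    INNER JOIN "e" ON ("b"."e_id" = "e"."id")
    INNER JOIN "d" ON ("b"."d_id" = "d"."id")
    INNER JOIN "d" T6 ON ("c"."d_id" = T6."id")
    WHERE "b"."count" = true
      AND "c"."pulldate" >= '1994-04-03 00:00:00-06:00'
      AND "b"."e_id" = 10
      AND "a"."state" IN (0, 1)
    ORDER BY "c"."pulldate" DESC, "c"."pubdate" DESC
    LIMIT 10;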

Towards the last point, while sometimes it may be obvious that an additional index could help, that's not always the case. I usually find much more bang for the buck in trying to restructure the query logic itself, though obviously in some cases there's only so much you can do.

How long is the current query taking? And what version of PostgreSQL are you running?

In this specific case, PostgreSQL is already primarily using index operations (probably your primary key indices). It's true that most of the joins are being implemented as nested loops, but I suspect that is due to the estimated record counts. It's also pushing your conditions down into the inner index scans (see the Filter: lines), so it is already filtering the result rows as much as possible at each step, minimizing the output of each nested loop.

If the planner estimates are right, you've got so few rows involved that it may actually be hard to get more specialized indexes used without significantly changing the planner's cost weighting. That's also probably why it did a flat-out sequential scan of T6 with a hash join rather than another nested-loop index scan. And that's probably right - in such low-volume cases it's often better to just scan the data (or index). Of course, if the estimates are way off (you can see that with EXPLAIN ANALYZE), then it may be a horrible choice and a hash or bitmap join would have been better.

It's almost impossible to guess further without schema specifics and actual timing information, but clearly you're potentially creating a very large and wide set of rows from the sequence of join operations. Each join is going to multiply the row set by the matching rows, and that new set of rows is then input to the next join. The planner estimates would seem to say, however, that the query will generate a very small result set (no more than 1000 rows at any intermediate point, approximately 1MB of data, and 1/10 of that as the final result), so it may actually be a non-issue. But assuming it is an issue, one question would be whether that join process could be restructured, perhaps broken up into separate operations so you have intermediate results (independent queries or even CTEs) that act as input to the final query (see the sketch below).
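Purely as a hedged sketch of the kind of restructuring I mean (the CTE name is made up, the joins to d, e and T6 are omitted for brevity, and note that on 9.1 a CTE is an optimization fence, so it forces this split - which can help or hurt):

    -- Illustrative only: pre-filter the narrow tables first, then join the
    -- wide table c against the much smaller intermediate result.
    WITH filtered_ab AS (
        SELECT "a".*
          FROM "a"
         INNER JOIN "b" ON ("a"."b_id" = "b"."id")
         WHERE "b"."count" = true
           AND "b"."e_id" = 10
           AND "a"."state" IN (0, 1)
    )
    SELECT *
      FROM filtered_ab
     INNER JOIN "c" ON (filtered_ab."c_id" = "c"."id")
     WHERE "c"."pulldate" >= '1994-04-03 00:00:00-06:00'
     ORDER BY "c"."pulldate" DESC, "c"."pubdate" DESC
     LIMIT 10;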

But at least as the query is written, and assuming accurate planner estimates, that plan seems like a pretty good choice, and it's already taking quite a lot of advantage of your indices. Personally, I'd expect this to be a pretty fast query. Is it performing particularly poorly? What's your environment like, and how have you configured PostgreSQL (especially shared_buffers, effective_cache_size and work_mem)?

– David

David, thanks for your reply. My version is psql (9.1.8) on Ubuntu 12.04. You can see at the bottom of the output that the runtime is 2251.192 ms, which is pretty slow for my needs.

I have not changed any settings yet. I see:

shared_buffers = 24MB

effective_cache_size = 128MB

work_mem = 1MB

Obviously those last two are commented out, but I assume those are the default values if they are not changed in the config.

Also, I had used the wrong user id (I had been using one from my dev environment, which doesn't have as many records as production). Here is the EXPLAIN ANALYZE output with a new user id that has more records:

 Limit  (cost=54749.16..54749.18 rows=10 width=1328) (actual time=2248.845..2248.893 rows=10 loops=1)
   ->  Sort  (cost=54749.16..54838.16 rows=35599 width=1328) (actual time=2248.841..2248.857 rows=10 loops=1)
         Sort Key: c.pulldate, c.pubdate
         Sort Method: top-N heapsort  Memory: 45kB
         ->  Hash Join  (cost=8248.49..53979.88 rows=35599 width=1328) (actual time=676.413..2118.978 rows=23159 loops=1)
               Hash Cond: (c.d_id = t6.id)
               ->  Hash Join  (cost=8226.20..53468.10 rows=35599 width=1150) (actual time=674.759..2029.260 rows=23159 loops=1)
                     Hash Cond: (b.d_id = d.id)
                     ->  Nested Loop  (cost=8203.90..52956.32 rows=35599 width=972) (actual time=673.321..1937.464 rows=23159 loops=1)
                           ->  Seq Scan on e  (cost=0.00..1.19 rows=1 width=157) (actual time=0.013..0.021 rows=1 loops=1)
                                 Filter: (id = 2)
                           ->  Merge Join  (cost=8203.90..52599.14 rows=35599 width=815) (actual time=673.300..1865.240 rows=23159 loops=1)
                                 Merge Cond: (c.id = a.c_id)
                                 ->  Index Scan using c_id_pkey on c  (cost=0.00..43412.63 rows=163753 width=759) (actual time=0.014..719.134 rows=163678 loops=1)
                                       Filter: (pulldate >= '1994-04-03 06:00:00+00'::timestamp with time zone)
                                 ->  Materialize  (cost=8201.76..8379.77 rows=35602 width=56) (actual time=673.270..799.844 rows=23159 loops=1)
                                       ->  Sort  (cost=8201.76..8290.77 rows=35602 width=56) (actual time=673.264..718.404 rows=23159 loops=1)
                                             Sort Key: a.c_id
                                             Sort Method: external merge  Disk: 1584kB
                                             ->  Hash Join  (cost=13.91..4292.31 rows=35602 width=56) (actual time=0.751..594.926 rows=23159 loops=1)
                                                   Hash Cond: (a.b_id = b.id)
                                                   ->  Seq Scan on a  (cost=0.00..3405.80 rows=137754 width=17) (actual time=0.077..299.521 rows=171136 loops=1)
                                                         Filter: (state = ANY ('{0,1}'::integer[]))
                                                   ->  Hash  (cost=12.29..12.29 rows=130 width=39) (actual time=0.512..0.512 rows=130 loops=1)
                                                         Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                                         ->  Seq Scan on b  (cost=0.00..12.29 rows=130 width=39) (actual time=0.012..0.301 rows=130 loops=1)
                                                               Filter: (count AND (e_id = 2))
                     ->  Hash  (cost=17.13..17.13 rows=413 width=178) (actual time=1.408..1.408 rows=413 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 87kB
                           ->  Seq Scan on d  (cost=0.00..17.13 rows=413 width=178) (actual time=0.003..0.611 rows=413 loops=1)
               ->  Hash  (cost=17.13..17.13 rows=413 width=178) (actual time=1.606..1.606 rows=413 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 87kB
                     ->  Seq Scan on d t6  (cost=0.00..17.13 rows=413 width=178) (actual time=0.033..0.786 rows=413 loops=1)
 Total runtime: 2251.192 ms
(34 rows)

And here's a link http://explain.depesz.com/s/WWX to the depesz output. Should I focus on the exclusive measurements? If so, it seems like the major pain points are:

* The date scan

* The merge between a and c

* The merge between a and b

* The filter for state in (0,1)

Any other thoughts? Thanks!

Actually, one more thing I just noticed: the planner is way off on the number of rows in the Sort. Actual rows were 3559.9X higher. Not sure how that affects speed exactly.

@dfd25:

> And here's a link http://explain.depesz.com/s/WWX to the depesz output. Should I focus on the exclusive measurements?
It depends - you sort of want to remain aware of both exclusive and inclusive. Wall-clock time is ultimately based on inclusive, so you could have a step with a small exclusive value that is looped a ton of times and yields a high inclusive value at the outer step. But clearly any large exclusive value (even if it's a step executed only once) is a direct contribution to the final time.

You can see changes in this query now that you have the larger data set: some of the nested loops have been replaced with merge or hash joins. And yeah, I'd probably be looking in the same directions you note as things to attack.

For example, the date and state filters are being applied to an index scan of the respective table, and now there's a decent number of rows involved, so if an index would cut that down it could help. But it'll depend on the current distribution of values and how selective your filter is.

What appears to be happening with pulldate is that the planner chose to scan c first by its primary key index, filtering the tuples by the pulldate condition, and afterwards merge joining to a. (So filter, then join, rather than join, then filter, which avoids matching up rows in c to rows in a that the pulldate comparison would eliminate anyway.) So if your pulldate filter is reasonably selective (it's pulling out ~163K rows, but I don't know how big the table is to start with), an index on it should help, as the planner should use that index rather than the c_id_pkey index. But if your pulldate condition selects a large fraction of the records in c anyway, that may be a marginal win, or the planner may stick with its current scan. The same concept applies to state.

The merge and hash join steps are largely a function of the row count, so if you need all the matched rows, there's not a lot you can do about them. However, one thing that may help: if you don't actually need every column of each complete joined row in your result set, you can cut down on the row size all along the way by selecting only the columns you need, so there is less data for the join steps to move around. It looks like c is your widest table (see the width= values), so if you pick only one table to be more selective with, assuming you don't really need all its columns, that would be the place to start.
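As a sketch of that (the column list here is hypothetical, since the real schema is obfuscated - the point is just to stop carrying every column of c through the joins):

    -- Hypothetical column list; replace SELECT * with only what you need.
    SELECT "a"."state", "b"."id", "c"."id", "c"."pulldate", "c"."pubdate"
      FROM "a"
     INNER JOIN "b" ON ("a"."b_id" = "b"."id")
     INNER JOIN "c" ON ("a"."c_id" = "c"."id")
     -- ... remaining joins and WHERE clause unchanged from the original query ...
     ORDER BY "c"."pulldate" DESC, "c"."pubdate" DESC
     LIMIT 10;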

I guess if it were me, I would go ahead and separately try indices on pulldate and state, plus tune my query to retrieve only the columns I really need.
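Something along these lines, as a sketch (the index names are arbitrary; run ANALYZE afterwards so the planner has fresh statistics, and drop whichever index doesn't earn its keep):

    CREATE INDEX c_pulldate_idx ON "c" (pulldate);
    CREATE INDEX a_state_idx ON "a" (state);
    ANALYZE "c";
    ANALYZE "a";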

On the sort estimate, I think you read that backwards - at the outermost sort the planner was estimating ~36k rows but actually got 10 (probably a mis-estimate on the selectivity of the T6 self-join). So the planner estimate was high, not low (see the arrow next to the "rows x" value on the web site). Large estimate errors (whether over or under) can certainly shift plan selection, but in this case it's for your final sort, which is always going to occur at that point, and the mis-estimate works out in your favor.

While mentioning sorts, you may also notice that the inner sort spilled to disk, as it was about 1.5MB. It didn't take particularly long, but if it had been a slow step in an inner loop, that might have been another place to consider attacking. In such cases you can also consider increasing work_mem (assuming your system memory can support it) to let the sort take place in memory. Note that work_mem is allocated at least once, and sometimes multiple times, per client per query, so you want to be cautious about how high you increase it globally. Alternatively, you can increase it just for a particular query that you know would benefit.
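For the per-query case, a minimal sketch (SET LOCAL confines the change to the enclosing transaction; the 8MB value is just an example):

    BEGIN;
    SET LOCAL work_mem = '8MB';  -- only affects this transaction
    -- ... run the expensive query here ...
    COMMIT;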

In terms of other parameters, shared_buffers of 24MB and effective_cache_size of 128MB are pretty conservative (as are most of PostgreSQL's default settings). And yes, the commented values are the defaults, so your work_mem will be 1MB. You can check live values with the SHOW SQL command. shared_buffers is the memory PostgreSQL itself uses to hold data during query processing, though that's not as critical with PostgreSQL as with some databases, since PostgreSQL assumes heavy use of the OS cache. effective_cache_size is just a hint as to the system memory available for caching, but it's important that it accurately reflects your system, since it can shift the planner's choices to better take advantage of the fact that the OS is most likely already caching the data PostgreSQL wants.
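For example, to see what the server is actually running with:

    SHOW shared_buffers;
    SHOW effective_cache_size;
    SHOW work_mem;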

How big is the system you are running this on? For recent PostgreSQL releases (8.4+) you can bump shared_buffers up quite a bit (it wasn't as effective in earlier versions), though you'll need to adjust the kernel shmmax parameter since Ubuntu defaults to 32MB. I think the current recommendation for systems with 1GB+ available to PostgreSQL is to try around 25%, with diminishing returns above about 40%. Personally I've been using 128MB on a Linode 1024, but could probably go higher, and have 768MB as the effective_cache_size. This is on a dedicated node, so not much besides PostgreSQL is running.
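As a rough, illustrative example of what those postgresql.conf lines might look like on a box with around 1GB available to PostgreSQL (exact values depend on your workload; changing shared_buffers requires a restart and possibly a larger kernel.shmmax):

    shared_buffers = 256MB          # illustrative; roughly 25% of memory available to PostgreSQL
    effective_cache_size = 768MB    # illustrative; a planner hint, no memory is actually allocated
    work_mem = 4MB                  # illustrative; per-sort/per-hash allocation, so be conservative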

I'm not sure any of these parameters are particularly hurting this query, so that may just be some items to try in the future.

– David

PS: I'd recommend upgrading to 9.1.9 (latest version in Ubuntu 12.04 repository), especially if your server accepts network client connections - it has an important security fix. There's no need to do anything with your data as it's completely backwards compatible with your current 9.1.8 cluster, so you can just do an apt-get upgrade in-place.

Thanks again for your help, David.

I have a 3GB linode that hosts my db and application. I realize this isn't a best practice, but should be fine for now.

I tried indexes on pulldate and state (running ANALYZE after each one) and they didn't seem to make much difference (sometimes queries were even slower, although query times have a fair amount of variance). What did seem to work was expanding work_mem to 4MB. I also changed state IN (0,1) to state < 2, and that seemed to shave off a little time consistently.

Here is the result of the new setup: http://explain.depesz.com/s/Y9f

Switched from an Index scan to Seq scan on c.

Afterwards, I tried limiting the column selection to the bare minimum (unfortunately, I need most of table c), which shaved off another 0.2 seconds, down to about 1.5 seconds, and seems to have changed the query plan.

And here is the result of that: http://explain.depesz.com/s/L7OS

I increased shared_buffers to 256MB. That doesn't seem to impact this query, but hopefully can help the database in the future.

I also changed effective_cache_size to 512MB, but I'm not really sure how to determine a good size for this. That also does not seem to make a difference for this query.

And I upgraded to 9.1.9.

Any other suggestions?

@dfd25:

> Here is the result of the new setup: http://explain.depesz.com/s/Y9f
>
> Switched from an Index scan to Seq scan on c.
If I had to guess, the stats are telling PostgreSQL that your filter is going to select a large percentage of table c, so it'll be more efficient to simply scan it rather than bothering with an index. Its row estimate for each sequential scan is pretty much bang on, but I'm not sure what fraction of the respective table sizes the result set represents.

> I also changed effective_cache_size to 512MB, but I'm not really sure how to determine a good size for this. That also does not seem to make a difference for this query.
Basically, just observe your system and see how much working set (real memory used) your current processes require, and set effective_cache_size to what remains (or 90% of that, or so). As long as your system has been running in a steady state for a while, that will probably be pretty close to the "free" column of the "-/+ buffers" line in the output of the "free" command. You're just trying to indicate to PostgreSQL how much memory it can expect the OS to be using as a cache.

> Any other suggestions?
Probably only to ask for the actual schema, table sizes, and intent of the query, since there may be a different way to approach this from the top down. For example, it may be better to structure this as joins among more selective sub-queries rather than one overall join. Or it may prove better to join a to c in a subquery first, before filtering (if, perhaps, a hash join across the keys would be faster). Or it may just be that, given the size of the tables and the amount of data you are selecting, it needs this amount of time.

BTW, something I missed previously: the reason your sort estimate is off by so much is your LIMIT - i.e., that's why you only get 10 rows back. If you really only need the top 10 rows according to your pulldate, pubdate ordering, you may be able to do a subquery directly on c with that limit and then join to that (see the sketch below). That would be a pretty big reduction in the number of rows (160k down to 10) that then need to be joined to anything else. But I'm not absolutely sure it would produce the same logical result without more detail on the actual query goal and schema.
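A rough sketch of that shape, with the caveat above very much in force (if one of c's top-10 rows has no match in a/b/d/e, the inner LIMIT will have cut off a row the full query would have returned):

    -- Hedged sketch: take the top 10 from c first, then join that small set outwards.
    SELECT *
      FROM (SELECT *
              FROM "c"
             WHERE "pulldate" >= '1994-04-03 00:00:00-06:00'
             ORDER BY "pulldate" DESC, "pubdate" DESC
             LIMIT 10) c
     INNER JOIN "a" ON ("a"."c_id" = c."id")
     INNER JOIN "b" ON ("a"."b_id" = "b"."id")
     INNER JOIN "e" ON ("b"."e_id" = "e"."id")
     INNER JOIN "d" ON ("b"."d_id" = "d"."id")
     INNER JOIN "d" T6 ON (c."d_id" = T6."id")
     WHERE "b"."count" = true
       AND "b"."e_id" = 10
       AND "a"."state" IN (0, 1)
     ORDER BY c."pulldate" DESC, c."pubdate" DESC
     LIMIT 10;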

At this point I'd also probably suggest again the pgsql-performance list. There are a lot more people available there that may be able to offer additional suggestions.

– David

OK, I just wanted to quickly share how I managed to speed things up. The summary is a combination of indexing, clustering, and creative coding.

I clustered table a by an index on b_id, and that dropped the query to under a second. Only certain queries are important for performance, and this helps in all of them, so it makes sense to do. I did a similar thing for table b by e_id. Not much of an impact now, but if the table grows significantly it will help down the line, with no real downside in the meantime.
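Roughly speaking, that amounts to something like the following (index names taken from the plans above; note that CLUSTER rewrites the table under an exclusive lock and the physical ordering isn't maintained for new rows, so it needs re-running periodically):

    CLUSTER "a" USING a_b_id;   -- physically order a by its b_id index
    CLUSTER "b" USING b_e_id;   -- same idea for b by its e_id index
    ANALYZE "a";
    ANALYZE "b";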

The other thing I did was limit pulldate to the last 30 days. That should handle most cases; if I don't get enough records, I re-query without the limit, and it's OK in those rare cases to have a slower query. That, in combination with a pulldate index (and the other things mentioned above), got the query down to about 0.35 seconds, which is good enough for now.
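In query terms, that's just swapping the fixed cutoff for a rolling window in the WHERE clause (illustrative; the fallback re-query without the window lives in application code):

    AND "c"."pulldate" >= now() - interval '30 days'   -- rolling 30-day window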

Thanks again for the help!
