Detecting unstable runtimes in PostgreSQL

Do you know the situation: you have a query that is sometimes slow and sometimes lightning fast? The question is usually: why? Maybe this little post can be inspiring and shed some light on some of the performance issues that might bug you in your daily life.

Reasons for unstable runtimes

There are many reasons why the very same type of query might show such “unpredictable” behavior. The following list shows some of the most common ones:

  • Unstable execution plans
  • Excessive buffer usage
  • Locking related troubles

Of course there are other possible reasons (other processes exhausting CPU/IO, checkpoints, etc.). However, given the limited scope of a post, I would like to focus especially on excessive buffer usage and its consequences.
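Before digging into the example: one way to spot buffer-hungry statements on a live system is the pg_stat_statements extension. The following is just a sketch (it assumes the extension has been added to shared_preload_libraries; the LIMIT is arbitrary):

```sql
-- requires: shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- statements that had to read the most blocks from outside shared_buffers
SELECT query,
       calls,
       shared_blks_hit,
       shared_blks_read
FROM   pg_stat_statements
ORDER  BY shared_blks_read DESC
LIMIT  10;
```

Queries with a large shared_blks_read relative to shared_blks_hit are exactly the candidates for the kind of unstable runtimes discussed here.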

Setting up an example

To demonstrate what can happen, I have compiled a simple example. All it does is load 50 million rows into a table. The main point is that the data will be stored on disk in random order. Then an index is created:


test=# CREATE TABLE t_test AS
           SELECT *
           FROM generate_series(1, 50000000) AS id
           ORDER BY random();
SELECT 50000000
test=# CREATE INDEX idx_id ON t_test (id);
CREATE INDEX

After running VACUUM, the following query shows what this is all about:


test=# explain (buffers true, timing true, analyze true, costs true)
           SELECT *
           FROM t_test
           WHERE id < 10000;
                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t_test  (cost=182.72..31985.07 rows=9568 width=4)
                             (actual time=10.298..102.130 rows=9999 loops=1)
   Recheck Cond: (id < 10000)
   Heap Blocks: exact=9775
   Buffers: shared hit=1078 read=8728
   ->  Bitmap Index Scan on idx_id  (cost=0.00..180.32 rows=9568 width=0)
                             (actual time=4.945..4.945 rows=9999 loops=1)
         Index Cond: (id < 10000)
         Buffers: shared hit=6 read=25
 Planning time: 0.162 ms
 Execution time: 103.472 ms
(9 rows)

In this query almost 10,000 buffers are used. The trouble is: if the data comes from RAM (shared_buffers or the OS filesystem cache), everything is fine and runtimes will be within an acceptable range. Even running the query a second time already speeds things up considerably:


test=# explain (buffers true, timing true, analyze true, costs true)
           SELECT *
           FROM t_test
           WHERE id < 10000;
                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t_test  (cost=182.72..31985.07 rows=9568 width=4)
                             (actual time=8.673..25.573 rows=9999 loops=1)
   Recheck Cond: (id < 10000)
   Heap Blocks: exact=9775
   Buffers: shared hit=9806
   ->  Bitmap Index Scan on idx_id  (cost=0.00..180.32 rows=9568 width=0)
                             (actual time=4.037..4.037 rows=9999 loops=1)
         Index Cond: (id < 10000)
         Buffers: shared hit=31
 Planning time: 0.162 ms
 Execution time: 26.604 ms
(9 rows)

We are down to 26 ms.

Random disk I/O

Now let us assume for a second that there are no cache hits at all, because our system contains an old disk from uncle Sam’s hardware store. What would happen to the runtime of the query?

Let us do some math (assuming an average disk seek time of 12 ms):

9775 blocks * 12 ms = 117,300 milliseconds

In the case of an old mechanical disk, the query would inevitably take around 2 minutes to complete. 2 minutes vs. 26 ms – that is quite a significant difference.
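The back-of-the-envelope calculation above can be sketched in a few lines (the 12 ms seek time is, of course, just an assumed figure for an old mechanical disk; 9775 is the Heap Blocks count from the plan):

```python
# Worst-case runtime if every heap block requires one random disk seek.
heap_blocks = 9775      # "Heap Blocks: exact=9775" from the plan above
seek_time_ms = 12       # assumed average seek time of an old mechanical disk

worst_case_ms = heap_blocks * seek_time_ms
print(worst_case_ms)              # 117300 ms
print(worst_case_ms / 60000.0)    # roughly 2 minutes
```

The point is not the exact number but the order of magnitude: every buffer that misses the cache can cost a full random I/O.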

The question is: which conclusions can we draw? The first conclusion is that you should try to write your queries in a way that they do not need so many buffers in the first place. The second thing you can do is to keep related rows physically close together on disk using index-organized (clustered) tables.

In PostgreSQL the CLUSTER command allows you to achieve exactly that:


test=# CLUSTER t_test USING idx_id;
CLUSTER
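Since PostgreSQL does not maintain the clustered order automatically, re-running CLUSTER periodically is a common approach; a hypothetical crontab entry (database name, schedule, and the assumption that psql is on cron’s PATH are all just an example) could look like this:

```
# re-cluster t_test along idx_id every night at 03:00
0 3 * * * psql -d test -c "CLUSTER t_test USING idx_id;"
```

Keep in mind that CLUSTER takes an exclusive lock on the table while it runs, so such a job belongs in a low-traffic window.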

Keep in mind, though, that the clustering effect will degrade over time as rows get changed (which is why you would normally want a cron job there). But running the very same query against the clustered table is A LOT faster:


test=# explain (buffers true, timing true, analyze true, costs true)
           SELECT *
           FROM t_test
           WHERE id < 10000;
                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t_test  (cost=182.72..31985.07 rows=9568 width=4)
                             (actual time=2.021..4.961 rows=9999 loops=1)
   Recheck Cond: (id < 10000)
   Heap Blocks: exact=45
   Buffers: shared hit=5 read=71
   ->  Bitmap Index Scan on idx_id  (cost=0.00..180.32 rows=9568 width=0)
                             (actual time=1.977..1.977 rows=9999 loops=1)
         Index Cond: (id < 10000)
         Buffers: shared hit=4 read=27
 Planning time: 0.349 ms
 Execution time: 6.294 ms
(9 rows)

You can also see that the number of buffers used has been reduced dramatically (45 heap blocks instead of 9775), which keeps the runtime of the query low even if cache hit rates are close to zero.
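Whether a table is still well clustered can be checked from the planner statistics: the correlation column of the pg_stats view is close to 1 (or -1) for a freshly clustered column and drifts toward 0 as the physical order becomes random again (run ANALYZE first so the statistics are current):

```sql
-- physical-vs-logical order correlation of the id column
SELECT tablename, attname, correlation
FROM   pg_stats
WHERE  tablename = 't_test'
AND    attname = 'id';
```

Watching this value is a simple way to decide when re-clustering is actually worth the exclusive lock.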

Hans-Juergen Schoenig
Hans-Jürgen Schönig has 15 years of experience with PostgreSQL. He is a consultant and the CEO of Cybertec Schönig & Schönig GmbH (www.cybertec.at, www.postgresql-support.de), which has served countless customers around the globe.