PostgreSQL 9.4 aggregation filters: They do pay off

In my previous post on PostgreSQL 9.4 I showed aggregate FILTER clauses, which are a neat way to make partial aggregates more readable. Inspired by some comments on that post, I decided to write a follow-up to see what impact the new FILTER clause has on performance.

Loading some demo data

To see the performance impact people can expect, here is some demo data:

test=# BEGIN;
BEGIN
test=# CREATE TABLE t_test (id int, dummy char(200));
CREATE TABLE
test=# INSERT INTO t_test SELECT *, 'dummy' FROM generate_series(1, 20000000) AS x;
INSERT 0 20000000
test=# COMMIT;
COMMIT

The data set is roughly 4.5 GB. On the test box it still fits nicely into memory:

test=# SELECT pg_size_pretty(pg_relation_size('t_test'));
 pg_size_pretty
----------------
 4596 MB
(1 row)

A classical approach

For years I have used the classical approach. The trick is that aggregate functions ignore NULL values, so the idea behind partial aggregation is to map unwanted values to NULL on the fly. The code is the same as in the previous post:

test=# SELECT  count(CASE WHEN id % 2 = 0 THEN 1 ELSE NULL END) AS even,
        count(CASE WHEN id % 2 = 1 THEN 1 ELSE NULL END) AS odd
FROM    t_test;
   even   |   odd
----------+----------
 10000000 | 10000000
(1 row)

Time: 6208.092 ms

In this example the aggregation takes slightly over 6 seconds (with all hint bits set, etc.). This is not too bad.
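As a side note on methodology: the Time: lines in this post come from psql's \timing switch, and scanning the table once right after loading sets the hint bits and pulls the data into the cache, which keeps the timings comparable. A minimal warm-up, assuming the table created above:

test=# \timing
Timing is on.
test=# SELECT count(*) FROM t_test;  -- full sequential scan: sets hint bits, warms the cache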

The modern approach

Let us now see what happens when the new approach is used:

test=# SELECT  count(id) FILTER (WHERE id % 2 = 0) AS even,
        count(id) FILTER (WHERE id % 2 = 1) AS odd
FROM    t_test;
   even   |   odd
----------+----------
 10000000 | 10000000
(1 row)

Time: 5353.172 ms

The query is almost one second faster than the original one. The overhead of the new FILTER clause is simply way smaller than that of the CASE construct. Of course, the more clauses you add, the larger the difference will be.
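To illustrate that point, here is a sketch of the same idea with four partial counts instead of two (timings omitted; with each additional clause the gap between the CASE and the FILTER variant should widen):

test=# SELECT  count(id) FILTER (WHERE id % 4 = 0) AS q0,
        count(id) FILTER (WHERE id % 4 = 1) AS q1,
        count(id) FILTER (WHERE id % 4 = 2) AS q2,
        count(id) FILTER (WHERE id % 4 = 3) AS q3
FROM    t_test;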

More realistic scenarios

Before firing a more complex query, I increased work_mem to make sure that all aggregations can happen in memory:

test=# SET work_mem TO '1 GB';
SET
Time: 0.375 ms
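Whether the grouping really runs in memory can be verified with EXPLAIN: with enough work_mem the planner can choose a HashAggregate node, otherwise it has to fall back to sorting the data for a GroupAggregate:

test=# EXPLAIN SELECT id % 3 AS g, count(*) FROM t_test GROUP BY 1;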

This time an additional GROUP BY clause is added to split the data set into smaller chunks. A GROUP BY clause is what most people will have in their queries anyway when FILTER is used:

test=# SELECT  id % 3 AS g,
        count(CASE WHEN id % 2 = 0 THEN 1 ELSE NULL END) AS even,
        count(CASE WHEN id % 2 = 1 THEN 1 ELSE NULL END) AS odd
FROM    t_test
GROUP BY 1;
 g |  even   |   odd
---+---------+---------
 0 | 3333333 | 3333333
 1 | 3333333 | 3333334
 2 | 3333334 | 3333333
(3 rows)

Time: 22119.977 ms

The result is computed in 22 seconds, which is actually quite ok.

The next query shows the same thing using the new syntax:

test=# SELECT  id % 3 AS g,
        count(id) FILTER (WHERE id % 2 = 0) AS even,
        count(id) FILTER (WHERE id % 2 = 1) AS odd
FROM    t_test
GROUP BY 1;
 g |  even   |   odd
---+---------+---------
 0 | 3333333 | 3333333
 1 | 3333333 | 3333334
 2 | 3333334 | 3333333
(3 rows)

Time: 21379.657 ms

The important thing here is that the performance difference is again around one second. This is expected: the time saved by the FILTER clause is the same regardless of the grouping around it.
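As a closing remark: FILTER is not limited to count. It works with any aggregate function, so a query mixing different partial aggregates could look like this (a sketch on the same table, results not shown):

test=# SELECT  id % 3 AS g,
        sum(id) FILTER (WHERE id % 2 = 0) AS even_sum,
        avg(id) FILTER (WHERE id % 2 = 1) AS odd_avg
FROM    t_test
GROUP BY 1;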

Hans-Jürgen Schönig
Hans-Jürgen Schönig has 15 years of experience with PostgreSQL. He is a consultant and the CEO of Cybertec Schönig & Schönig GmbH (www.cybertec.at, www.postgresql-support.de), which has served countless customers around the globe.