Scaling PostgreSQL: Using more than just one core for aggregation

Scaling PostgreSQL: Modern servers provide people with more and more CPU cores. 16 cores or more are not uncommon these days. Unfortunately, PostgreSQL cannot scale a single query to more than one core. A lot of effort is going into this issue already.

Table of Contents

CYBERTEC experiments

To address the issue, we've done some experiments to see what can be done to scale things up. At this point, we've written a small module which takes an existing execution plan and does some post-processing of the plan using “custom plan nodes”.

The idea is experimental: what happens if the approach described in this post is actually used?

The main challenge we face at a client is:

SELECT      … a LOT of processing going on here …
FROM        large_table
GROUP BY    some_groups

SELECT … a LOT of processing going on here …

FROM large_table

GROUP BY some_groups

So the goal in our case is really to scale out the processing done in the SELECT clause.

Some test data for scaling

For a start, 100 million rows have been created to do a little testing. 20 million rows are in each partition:

parallel=# SELECT count(*) FROM t_test;
  count
-----------
100000000
(1 row)

Time: 8715.324 ms

parallel=# d+ t_test
Table 'public.t_test'

 Column |       Type       | Modifiers | Storage | Stats target | Description
--------+------------------+-----------+---------+--------------+-------------
 id     | integer          | not null  | plain   |              |
 r      | double precision |           | plain   |              |
Child tables: t_test_1,
      t_test_2,
      t_test_3,
      t_test_4,
      t_test_5

parallel=# SELECT count(*) FROM t_test;

count

-----------

100000000

(1 row)

Time: 8715.324 ms

parallel=# d+ t_test

Table 'public.t_test'

--------+------------------+-----------+---------+--------------+-------------

Child tables: t_test_1,

t_test_2,

t_test_3,

t_test_4,

t_test_5

Giving multicore PostgreSQL a try - scaling test

Our prototype module has a simple runtime variable, which allows us to define the number of processes we want to see. Setting parallel.agg_workers to 1 basically means turning the plan-processing code off.

The data in the table is scanned normally. We need around 29 seconds to process the query:

parallel=# SET parallel.agg_workers = 1;
SET
Time: 1.572 ms

parallel=# SELECT sum(r) FROM t_test GROUP BY id < 1000;
      sum
------------------
49998665.9549409
2538.35037390701
(2 rows)

Time: 29477.039 ms

parallel=# SET parallel.agg_workers = 1;

SET

Time: 1.572 ms

parallel=# SELECT sum(r) FROM t_test GROUP BY id < 1000;

sum

------------------

49998665.9549409

2538.35037390701

(2 rows)

Time: 29477.039 ms

Let us try the same with more cores: 7 cores will be used for worker processes.

The 8^th core will be taken by the coordinating process:

parallel=# SET parallel.agg_workers = 7;
SET
Time: 0.324 ms

parallel=# SELECT sum(r) FROM t_test GROUP BY id < 1000;
      sum
------------------
49998665.9549346
2538.35037390701
(2 rows)

Time: 31279.595 ms

parallel=# SET parallel.agg_workers = 7;

SET

Time: 0.324 ms

parallel=# SELECT sum(r) FROM t_test GROUP BY id < 1000;

sum

------------------

49998665.9549346

2538.35037390701

(2 rows)

Time: 31279.595 ms

The important part here is that more processes are actually a net loss. The reason is simple: The table is scanned by a single process at the moment so there is a serious bottleneck on the scan side. The aggregation part is simply not complex enough to provide us with a speedup. Remember: All we do is adding up some values (which is ways too simple). This is by far not enough to justify the overhead of moving data around via shared memory queues.

Handling long `SELECT` clauses when scaling

However, the situation starts to change significantly if we try a query involving a lot of processing in the SELECT-clause. The more stuff we do in the SELECT-clause, the more benefit we can observe.

Here is an example:

parallel=# SELECT
   sum(CASE WHEN r < 0   THEN r ELSE 0 END),
   sum(CASE WHEN r > 0   THEN r ELSE 0 END),
   sum(CASE WHEN r > 0.1 THEN r ELSE 0 END),
   sum(CASE WHEN r > 0.2 THEN r ELSE 0 END),
   sum(CASE WHEN r > 0.3 THEN r ELSE 0 END),
   sum(CASE WHEN r > 0.4 THEN r ELSE 0 END),
   sum(CASE WHEN r > 0.5 THEN r ELSE 0 END),
   sum(CASE WHEN r > 0.6 THEN r ELSE 0 END),
   sum(CASE WHEN r > 0.7 THEN r ELSE 0 END),
   sum(CASE WHEN r > 0.8 THEN r ELSE 0 END),
   sum(CASE WHEN r > 0.9 THEN r ELSE 0 END),
   sum(r) FILTER (WHERE r < 0)
FROM    t_test
GROUP BY id < 1000, id % 5, id % 10 ;

parallel=# SELECT

sum(CASE WHEN r < 0 THEN r ELSE 0 END),

sum(CASE WHEN r > 0 THEN r ELSE 0 END),

sum(CASE WHEN r > 0.1 THEN r ELSE 0 END),

sum(CASE WHEN r > 0.2 THEN r ELSE 0 END),

sum(CASE WHEN r > 0.3 THEN r ELSE 0 END),

sum(CASE WHEN r > 0.4 THEN r ELSE 0 END),

sum(CASE WHEN r > 0.5 THEN r ELSE 0 END),

sum(CASE WHEN r > 0.6 THEN r ELSE 0 END),

sum(CASE WHEN r > 0.7 THEN r ELSE 0 END),

sum(CASE WHEN r > 0.8 THEN r ELSE 0 END),

sum(CASE WHEN r > 0.9 THEN r ELSE 0 END),

sum(r) FILTER (WHERE r < 0)

FROM t_test

GROUP BY id < 1000, id % 5, id % 10 ;

In this case, we do a lot of aggregation and therefore the overhead of moving tuples around is not as big compared to the rest of the query.

The performance difference is already quite huge:

8 CPU cores: 206667.727 ms
1 CPU core:   59226.385 ms

1 2	8 CPU cores: 206667.727 ms 1 CPU core: 59226.385 ms

What we see is a nice 3.5 times speedup. Of course, this is still far from linear. Linear scalability can hardly be achieved with this approach. However, if there are 50-100 aggregation functions in the SELECT-clause (which is far from uncommon), there is clearly a major performance improvement. That tends to get pretty close to “linear” in many cases.

At the moment, actually providing the data to our worker processes is still clearly an issue:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
8411 hs        20   0  507012 183284 178560 R 100.0  1.1  23:19.98 postgres
13014 hs        20   0  503632  25400  24212 R  70.1  0.2   0:08.69 postgres
13013 hs        20   0  503632  25440  24244 R  69.8  0.2   0:08.83 postgres
13015 hs        20   0  503632  25476  24276 R  69.4  0.2   0:08.74 postgres
13017 hs        20   0  503632  25320  24132 R  69.4  0.2   0:08.73 postgres
13012 hs        20   0  503632  25460  24272 R  68.1  0.2   0:08.69 postgres
13018 hs        20   0  503632  24572  23384 R  68.1  0.1   0:08.65 postgres
13016 hs        20   0  503632  25468  24276 S  66.4  0.2   0:08.60 postgres

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

8411 hs 20 0 507012 183284 178560 R 100.0 1.1 23:19.98 postgres

13014 hs 20 0 503632 25400 24212 R 70.1 0.2 0:08.69 postgres

13013 hs 20 0 503632 25440 24244 R 69.8 0.2 0:08.83 postgres

13015 hs 20 0 503632 25476 24276 R 69.4 0.2 0:08.74 postgres

13017 hs 20 0 503632 25320 24132 R 69.4 0.2 0:08.73 postgres

13012 hs 20 0 503632 25460 24272 R 68.1 0.2 0:08.69 postgres

13018 hs 20 0 503632 24572 23384 R 68.1 0.1 0:08.65 postgres

13016 hs 20 0 503632 25468 24276 S 66.4 0.2 0:08.60 postgres

The process dispatching the data is running at 100%, while the worker processes are not able to make full use of the CPU. It seems that some overhead can still be shaved off to provide users with better overall performance. A prototype might be possible fairly soon.

In case you need any assistance, please feel free to contact us.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

0 0 votes

Article Rating

9 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Daniel Cristian

8 years ago

The SELECT with multiple CASE statements appears to have the results in inverse order.

David Fetter

8 years ago

Is your use of CASE rather than (or in addition to, I find the construct here somewhat baffling) FILTER an implementation limitation?

Hans-Jürgen Schönig

8 years ago

Reply to David Fetter

FILTER is basically faster ... i did a blog post on that some time ago. i used the slow version here to create some load for the workers. we tested with FILTER. no showstoppers so far 🙂

David Fetter

8 years ago

Reply to Hans-Jürgen Schönig

Sorry I was unclear. Does your implementation allow using parallel workers in the FILTER case?

Hans-Jürgen Schönig

8 years ago

Reply to David Fetter

yes. we can do that. the code will be released when it is in a useful state. still a long road ahead

David Fetter

8 years ago

Oh, and where's the code? 🙂

man_brain

8 years ago

8 CPU cores: 206667.727 ms

1 CPU core: 59226.385 ms
What we see is a nice 3.5 times speedup...

Seems, like numbers are mixed up here.

Attila Max Ruf

8 years ago

Reply to man_brain

Actually they aren't. In the postgresql FAQ there is an explanation for this behaviour. https://wiki.postgresql.org/wiki/FAQ#How_does_PostgreSQL_use_CPU_resources.3F

man_brain

8 years ago

Reply to Attila Max Ruf

Well, I see from numbers above that query running on 8 CPUs spent ~4 times more time than running on one CPU. That's not a speedup. The link to PostgreSQL FAQ is not about that. Am I missing something?

Stay tuned with our

Scaling PostgreSQL: Using more than just one core for aggregation

CYBERTEC experiments

The main challenge we face at a client is:

Some test data for scaling

Giving multicore PostgreSQL a try - scaling test

The 8^th core will be taken by the coordinating process:

Handling long `SELECT` clauses when scaling

Here is an example:

The performance difference is already quite huge:

At the moment, actually providing the data to our worker processes is still clearly an issue:

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts

Scaling PostgreSQL: Using more than just one core for aggregation

CYBERTEC experiments

The main challenge we face at a client is:

Some test data for scaling

Giving multicore PostgreSQL a try - scaling test

The 8th core will be taken by the coordinating process:

Handling long SELECT clauses when scaling

Here is an example:

The performance difference is already quite huge:

At the moment, actually providing the data to our worker processes is still clearly an issue:

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts

The 8^th core will be taken by the coordinating process:

Handling long `SELECT` clauses when scaling