For a long time PostgreSQL only used a single CPU core to perform expensive aggregations. The idea of “agg” is to change exactly that. Scale your analytical operations to many CPU core and use the full computing power of your system.
- Perform common aggregations in parallel (sum, count, min, max, etc.)
- Support for FILTER-clauses
- Support for contraint exclusion (each partition is handled by a CPU core)
- Super-easy installation
- Works with PostgreSQL 9.5
Our tests have shown almost linear increases for many workloads. In many cases the actual aggregation is a major bottleneck. Especially complex CASE / WHEN clauses need a lot of CPU and slow down aggregations. Those operations are exactly what “agg” is aiming at. By scaling out to many CPU cores most of the pain is eased.
On a large server we have seen speedups up to 30 times for certain queries (analytical workloads). We see an almost linear increase for many operations, which were previously limited to a single CPU core.
agg is ideal for large scale aggreations and analytics.
The main plus is that there is no need to modify your SQL code – agg simply does its work. It speeds up your queries whenever possible. It is totally transparent.
Agg can be downloaded freely from our website (at the end of this page).
To install the package, download the tar archive and extract it:
tar xvfz agg-1.0beta1.tar.gz
Then set PG_CONFIG to the correct path:export PG_CONFIG=/path_to_postgres/bin/pg_config
Then compile the module:cd src make install.
“agg” has to be loaded as module as soon as PostgreSQL fires up. To load “agg” add “agg.so” to shared_preload_libraries in your postgresql.conf:shared_preload_libraries = 'agg.so'
Then restart PostgreSQL.
To enable “agg” in your database of choice, run…CREATE EXTENSION agg;
… in your favorite SQL client. PostgreSQL will load all the necessary components to scale your queries.
The easiest way to test “agg” is to create some simple test data and see what happens:CREATE TABLE t_test (id serial, name test); CREATE TABLE t_test_1 () INHERITS (t_test); CREATE TABLE t_test_2 () INHERITS (t_test); INSERT INTO t_test_1 (name) SELECT 'hans' FROM generate_series(1, 10000000); INSERT INTO t_test_2 (name) SELECT 'paul' FROM generate_series(1, 10000000);
Then run a test query:\timing (in case you are using psql) SELECT name, count(*) FROM t_test GROUP BY 1;
To turn “agg” on you have to configure the number of workers you want PostgreSQL to use:SET agg.hash_workers = 4; # makes aggregations use 4 CPU cores
Setting agg.hash_workers to 0 means that the module is turned off.
In case agg.hash_workers is set to 1 a single worker will be used. This is mostly done for debugging purposes.
HINT: To make that max_worker_processes is set to a reasonably high value in postgresql.conf. Otherwise agg does not have enough worker processes to do its job:test=# SHOW max_worker_processes ; max_worker_processes ---------------------- 100 (1 row)
What agg can do for you:
agg is able to run the following things in parallel:
- Simple aggregations: GROUP BY, HAVING
- Scanning partitioned tables
- Scanning single tables
agg does not scale the following operations:
- Index scans
- All kinds of joins
- Custom aggregates (unless there is explicit support for agg)
- VACUUM, CREATE INDEX, and other admin tasks
- At this point grouping sets are NOT supported
agg was designed for large scale aggregations. If agg finds plan nodes it cannot handle, it simple turns itself off and executes the standard plan generated by PostgreSQL. However, if agg finds out that a plan can be scaled out, it modifies the plan and tries to launch some worker process to run the query in parallel.
The general rule of thumb is: The more complicated the SELECT clause – the better for agg. Queries with countless aggregations and complicated GROUP BY clauses tend to benefit most.
We have seen that large shared buffers seem to be beneficial for agg. So using significantly larger buffers than in “standard” PostgreSQL seems to speed up things on most systems. With large enough buffers almost linear scalability can be achieved up to a reasonable number of CPUs.
It has also proven to be a good idea to set agg.hash_workers to a value equal to the number of partitions (a multiplier is also fine) as long as the number of partitions is reasonably small (maybe up to 20).
agg can also help to scale up a single, non partitioned, table by dispatching data using a shared memory queue. However, it does not pay off for simple aggregations. The simple rule is: The more complicated your aggregations, the better for agg. If you have 10, 20, 30 or more aggregation functions in your SELECT clause, agg will really take off and offer superior performance compared to single core operations.
We have observed that planning time goes up a bit if agg is enabled. However, this does not create any bottlenecks.
Make use of synchronized seq scans: PostgreSQL is able to synchronize sequential scans to significantly reduce I/O for queries running concurrently. Firing up concurrent queries on the same set of tables tends to benefit app because I/O bottlenecks can be reduced considerably.
Supported versions of PostgreSQL:
At this point only PostgreSQL 9.5 is supported.
Support for future released will be available.
Versions prios to PostgreSQL 9.5 are not supported and never will be.
If you are planning to run agg in production, Cybertec Schönig & Schönig GmbH offers professional 24×7 for agg. You can obtain fixes, consulting, training as well as expertise to make your projects successful.
Contact us to find out more.
“agg” is available under the terms of the PostgreSQL license.
We decided to use the same license as PostgreSQL to make it easy for people to use the code.
Please fill in the form, we will send you the download link!