Finally I got around to taking a more extensive look at pgstrom (a module that makes use of GPUs). The entire GPU topic fascinates me, and I was excited to see the first real performance data.

Here is some simple test data:

test=# CREATE TABLE t_test AS
   SELECT       x, 'a'::char(100) AS y, 'b'::char(100) AS z
   FROM         generate_series(1, 5000000) AS x
   ORDER BY random();
SELECT 5000000

5 million rows should be enough to get a first impression of what is going on.
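
If you want to double-check what you are working with, a quick look at the row count and the on-disk size of the table helps (your numbers will vary depending on alignment and fillfactor):

test=# SELECT count(*) AS row_count,
              pg_size_pretty(pg_total_relation_size('t_test')) AS total_size
       FROM   t_test;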

Queries can benefit

To make sure that a real difference can actually be observed, I decided not to use any indexes. In real life this is hardly realistic, because performance would suffer horribly, but pgstrom was not built to speed up index lookups anyway, so this should not be an issue.
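
For the comparisons below, one convenient way to get both a CPU plan and a GPU plan for the same statement is to toggle pgstrom within the session. This is just a sketch and assumes the pg_strom.enabled parameter; the exact GUC name may differ depending on the pgstrom version:

test=# SET pg_strom.enabled = off;   -- assumption: GUC to disable GPU plans for this session
test=# EXPLAIN ANALYZE SELECT count(*) FROM t_test WHERE sqrt(x) > 0 GROUP BY y;
test=# SET pg_strom.enabled = on;    -- GPU custom scans are considered again
test=# EXPLAIN ANALYZE SELECT count(*) FROM t_test WHERE sqrt(x) > 0 GROUP BY y;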

The first thing I tried was to filter and group some data on the CPU:

test=# explain analyze SELECT count(*)
        FROM   t_test
        WHERE  sqrt(x) > 0
        GROUP BY y;
                                     QUERY PLAN
--------------------------------------------------------------------------------------
 HashAggregate  (cost=242892.24..242892.25 rows=1 width=101)
     (actual time=3965.362..3965.362 rows=1 loops=1)
     Group Key: y
     ->  Seq Scan on t_test  (cost=0.00..234558.91 rows=1666665 width=101)
         (actual time=0.033..1417.593 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.459 ms
 Execution time: 3965.495 ms

My box (a 4 GHz AMD) can do that in just under 4 seconds. Note that I am using the standard PostgreSQL storage manager here (no column store or anything like that).

Let us try the same thing on the GPU:

test=# explain analyze SELECT count(*)
FROM   t_test
WHERE  sqrt(x) > 0
GROUP BY y;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=176230.88..176230.89 rows=1 width=101)
    (actual time=2004.355..2004.356 rows=1 loops=1) 
    Group Key: y
    ->  Custom Scan (GpuPreAgg)  (cost=11929.24..171148.30 rows=75 width=108)
        (actual time=1151.310..2003.868 rows=76 loops=1)
        Bulkload: On (density: 100.00%)
        Reduction: Local + Global
        Device Filter: (sqrt((x)::double precision) > '0'::double precision)
        ->  Custom Scan (BulkScan) on t_test  (cost=8929.24..167897.55 rows=5000001 width=101)
            (actual time=12.956..1152.273 rows=5000000 loops=1)
Planning time: 0.550 ms
Execution time: 2299.633 ms
(9 rows)

We see a nice improvement here: execution time drops from roughly 4 seconds to about 2.3 seconds, which is remarkable when you take into consideration that just reading the data already takes more than a second. Moving the aggregation out to the GPU definitely pays off in this case.
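
Keep in mind that y contains only a single distinct value, so the aggregate itself is as cheap as it gets. To see how the benefit behaves with a higher grouping cardinality, a variation like the following could be tried (just a sketch, I have not benchmarked it here):

test=# EXPLAIN ANALYZE SELECT x % 1000 AS grp, count(*)
       FROM   t_test
       WHERE  sqrt(x) > 0
       GROUP  BY x % 1000;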

The interesting thing to notice is that the real improvement comes from the GROUP BY clause. A plain filter alone shows hardly any benefit:

test=# explain analyze SELECT count(*)
FROM   t_test
WHERE  sqrt(x) > 0;

                                     QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=238725.58..238725.59 rows=1 width=0)
      (actual time=1762.829..1762.829 rows=1 loops=1)
      ->  Seq Scan on t_test  (cost=0.00..234558.91 rows=1666665 width=0)
          (actual time=0.055..1311.220 rows=5000000 loops=1)
          Filter: (sqrt((x)::double precision) > '0'::double precision)
Planning time: 0.074 ms
Execution time: 1762.875 ms
(5 rows)

test=# explain analyze SELECT count(*)
FROM   t_test
WHERE  sqrt(x) > 0;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Aggregate  (cost=172064.21..172064.22 rows=1 width=0)
       (actual time=1411.036..1411.036 rows=1 loops=1)
       ->  Custom Scan (GpuPreAgg)  (cost=11929.24..171148.30 rows=75 width=4)
           (actual time=650.590..1410.837 rows=76 loops=1)
           Bulkload: On (density: 100.00%)
           Reduction: NoGroup
           Device Filter: (sqrt((x)::double precision) > '0'::double precision)
           ->  Custom Scan (BulkScan) on t_test  (cost=8929.24..167897.55 rows=5000001 width=0)
               (actual time=16.034..1160.592 rows=5000000 loops=1) 
 Planning time: 1.634 ms
 Execution time: 1745.605 ms
(8 rows)

It certainly makes sense that there is no improvement in this case: moving data around is simply too expensive to make a difference. Remember, GPUs only pay off if the work can be done in parallel and if data arrives fast enough. A single sqrt is not expensive enough to justify shipping the rows to the device, and PostgreSQL cannot feed the GPU fast enough either.
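
To probe where offloading starts to pay off, the per-row computation can be made heavier than a single sqrt. The following is merely a sketch of such a test; I have not measured this variant here:

test=# EXPLAIN ANALYZE SELECT count(*)
       FROM   t_test
       WHERE  sqrt(x) + sin(x) * cos(x) + ln(x + 1) > 0;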

Or queries can be slower

It is important to mention that many queries won’t benefit from the GPU at all. In fact, I would expect that the majority of queries in a typical system will not behave differently.

Here is an example of a query which is actually slower with pgstrom:

test=# explain analyze SELECT *, sum(x) OVER ()
FROM   t_test
WHERE  sqrt(x) > 0;

                                        QUERY PLAN
------------------------------------------------------------------------------------------------
 WindowAgg  (cost=8929.24..188730.88 rows=1666667 width=206)
      (actual time=4449.716..5848.233 rows=5000000 loops=1)
      ->  Custom Scan (GpuScan) on t_test  (cost=8929.24..167897.55 rows=1666667 width=206)
          (actual time=616.879..1899.651 rows=5000000 loops=1)
          Device Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.142 ms
 Execution time: 6260.461 ms
(5 rows)


test=# explain analyze SELECT *, sum(x) OVER ()
FROM   t_test
WHERE  sqrt(x) > 0;
                                              QUERY PLAN
--------------------------------------------------------------------------------------------------
 WindowAgg  (cost=0.00..255392.23 rows=1666665 width=206)
     (actual time=3610.914..4895.518 rows=5000000 loops=1)
     ->  Seq Scan on t_test  (cost=0.00..234558.91 rows=1666665 width=206)
         (actual time=0.038..1389.645 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.324 ms
 Execution time: 5187.048 ms
(5 rows)

In this case the GPU is actually a loss: the query takes noticeably longer than on the CPU, so there is no benefit to be expected here at this stage.

One word about sorts

According to the main developer of pgstrom, sorting is not yet as good as he wants it to be, so I skipped the sort tests for now. As sorts are key to many queries, this is pgstrom functionality I am really looking forward to.

I assume that sorting can greatly benefit from a GPU, because there is a lot of intrinsic parallelism in sort algorithms. The speedup we can expect is hard to predict, but I firmly believe that it can be quite substantial.
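
Once GPU sorting matures, a simple sort-heavy query such as the one below would be a natural candidate for repeating these tests (again just a sketch, no numbers attached):

test=# EXPLAIN ANALYZE SELECT *
       FROM   t_test
       ORDER  BY sqrt(x) DESC, y
       LIMIT  10;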

Stability

What stunned me is that I did not encounter a single segmentation fault during my tests, which I definitely did not expect. My assumption was that there would be more loose ends, but things actually worked as expected most of the time. Given the stage of the project, I am pretty excited: pgstrom certainly feels like the future …
