int4 vs int8 vs uuid vs numeric performance on bigger joins

Some weeks ago at pgDay Paris, during the evening social event, we got into a small "guestimation" with another Postgres enthusiast, about the precepted speed of using integer based ID-s vs UUID based ones for bigger amounts of data. In short he reasoned that one should generally avoid the UUID-s for performance reasons as they take 16 bytes instead of 4/8 bytes for normal integer and thus are slower. While it is true that they do occupy some more bytes on the disk (and are a bit rare in the wild as simpler applications just don't need them) I was still sceptical about the direct translation of those bytes to speed difference. More specifically we talked about bigger joins (full table scans), as on indexed single row operations there shouldn't be any difference anyways. In the end it was hard to say anything meaningful as I was just guessing - but now I finally got to testing it out with a simple test setup and am now laying out the numbers for you to judge also. In addition to UUID I also included int8 and numeric for completeness.

Table of Contents

Test setup

I created 4 pairs of tables with 5 million rows with same ID-s in both of the tables, that would result in a "complete" join. Then I had to make a difficult choice – as the discussion rose from real life situations, it would be pretty meaningless to have only one table consisting of the ID column. Thus, to make the test a bit more real, I decided to add 6 other columns (2 ints, 2 timestamps, 2 short texts) so that the extra 4 bytes wouldn't have such a huge effect on the outcome. Then I launched a new test cluster with the latest Postgresql version 9.6.2, configured the shared_buffers=2GB, work_mem=1GB and maintenance_work_mem=1GB so that the join would be performed in the memory and also enabled the pg_stat_statements extension. Then I left the test script to run over night on my laptop (64-bit Ubuntu 14.04, 16 GB RAM, i5-6200U @ 2.30 GHz, SSD), doing
1) 1000 join iterations for every data type and
2) 100 index building iterations.

The scripts can be found here if you want to try it out yourself. Note that I'm using UNLOGGED tables, a feature introduced with Postgres 9.5, to reduce IO dependence of our test (and speed up inserts).

Tables ended up being around 364..403 MB in size and looked like that in the end:

krl@postgres=# d int4_a
        Unlogged table 'public.int4_a'
 Column │           Type           │ Modifiers
────────┼──────────────────────────┼───────────
 id     │ integer                  │
 col1   │ integer                  │
 col2   │ integer                  │
 col3   │ text                     │
 col4   │ text                     │
 col5   │ timestamp with time zone │
 col6   │ timestamp with time zone │

1

2

3

4

5

6

7

8

9

10

11

krl@postgres=# d int4_a

Unlogged table 'public.int4_a'

Column │ Type │ Modifiers

────────┼──────────────────────────┼───────────

id │ integer │

col1 │ integer │

col2 │ integer │

col3 │ text │

col4 │ text │

col5 │ timestamp with time zone │

col6 │ timestamp with time zone │

And the test queries looked like that:

SELECT count(*) FROM datatypeX_a JOIN datatypeX_b USING(id)

1	SELECT count(*) FROM datatypeX_a JOIN datatypeX_b USING(id)

The results

Datatype	JOIN mean time 5m rows (s)	% diff. to smallest	Index Size (MB)	% diff. to smallest
int4	2.72	-	107	-
int8	2.95	+8 %	107	-
numeric	3.65	+34 %	107	-
uuid	3.07	+13 %	150	+40 %

Looking at the results table, we see that indeed there's a penalty when starting to use any of the bigger data types. It makes kind of sense. But is this +13% growth, int4 vs UUID, little or much? I'd rather think it is small enough to be blissfully dismissed still, given that they benefit application architecture when using shards for example and also they provide "some" security as guessing or looping over them is not so easily possible. The thing that surprised me the most though was the performance of numeric-s. +34%! One should definitely watch out there and avoid them if possible – i.e. a very bad choice for normally incrementing integer ID-s.

The conclusion

As it was a pure syntethical test, in real life scenarios one has more processes fighting for resources, locking, also more bloat, most probably more columns in the tables, thus making waiting for disk access more relevant so that the real performance loss from processing those extra bytes spent on the ID column should be actually smaller.

By the way, by googling I also found a couple of articles on data type speed differences when doing a lot of INSERTs. There the penalty was similarly modest in most cases, so in general – nothing to be afraid of. So in the end I'd still recommend to choose ID column data types based on utility, not performance. And as always - one should always generate a lot of test data carefully within an application specific scenario before betting on a certain approach.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

0 0 votes

Article Rating

Subscribe

10 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Colin 't Hart

7 years ago

UUIDs are 16 bytes, surely?

0

Reply

Kaarel

7 years ago

Reply to Colin 't Hart

Ah, you're absolutely right! Sorry for that, corrected

0

Reply

Colin 't Hart

7 years ago

Can you explain the index size equivalence for 4-byte and 8-byte integers? Can you do the same exercise for multi-column indexes (2, maximum 3 columns -- where all columns are the same data type)?

0

Reply

Kaarel

6 years ago

Reply to Colin 't Hart

A good question. Was a bit puzzled there a bit also at first. But it has to do with "padding" - on 64bit architecture 32b is expanded to 64b. When I did the same test for 2-column indexes (https://gist.github.com/kmoppel/1c76e34c912989cd34e637ef95985de9#file-2-col-sql) the difference emerges:

krl@postgres=# di+ idx_test_*
List of relations
Schema │ Name │ Type │ Owner │ Table │ Size │ Description
────────┼──────────────────┼───────┼───────┼────────────┼────────┼─────────────
public │ idx_test_int4 │ index │ krl │ int4_aa │ 107 MB │
public │ idx_test_int8 │ index │ krl │ int8_aa │ 150 MB │
public │ idx_test_numeric │ index │ krl │ numeric_aa │ 150 MB │
public │ idx_test_uuid │ index │ krl │ uuid_aa │ 237 MB │
(4 rows)

About 2-col join execution times - seems that the performance differences interestingly enough decreased! Can't quite figure out why though...

int4 3718 -
int8 3745 0.7%
uuid 4038 7.8%
numeric 5122 26.8%

0

Reply

Dennis Brouwer

7 years ago

Hi Kaarel, nice article!

I would like to comment that the index size of the UUID column depends a lot on the type of UUID used. If a random UUID (type 4) is used, which is used here (uuid_generate_v4(), from the script), the index is larger in size compared to a timebased UUID (type 1). I've done some testing in the past and the index was about 20% smaller. So if security doesn't matter and a functional UUID type1 would do I'm really curious how that does perform on the same hardware compared to int4,8 and numeric. Would it be possible to spin up the test again by using uuid_generate_v1()?

0

Reply

Nigel Ramsay

7 years ago

Reply to Dennis Brouwer

Thanks for the article. I'd just like to second Dennis's request for the UUID type 1 test. I'd be really interested to see what the performance differences are, as I understand that they are indexed more easily that the random type 4 UUIDs.

0

Reply

Kaarel

6 years ago

Reply to Dennis Brouwer

Thanks! This v1 vs v4 difference sounded suspicious enough to test it out 🙂 I didn't see any difference though - both indexes at 150 MB still...

0

Reply

Dennis Brouwer

6 years ago

Reply to Kaarel

Hi Kaarel, I just realized that this benchmark doesn't reflect a real live situation where the index already exists when data is inserted. I've done a small test with a minimal dataset using both a primary key and an ordinary index and the difference is clear:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE uuid_a (id uuid);
CREATE TABLE uuid_b (id uuid);
-- PRIMARY KEY
--ALTER TABLE uuid_a ADD CONSTRAINT uuid_a_pkey PRIMARY KEY (id);
--ALTER TABLE uuid_b ADD CONSTRAINT uuid_b_pkey PRIMARY KEY (id);
CREATE INDEX ON uuid_a (id);
CREATE INDEX ON uuid_b (id);
-- PRIMARY KEY
--INSERT INTO uuid_a SELECT uuid_generate_v4() FROM generate_series(1, 5*1e6) i; --> runtime 56.9 sec; index 196MB
--INSERT INTO uuid_b SELECT uuid_generate_v1() FROM generate_series(1, 5*1e6) i; --> runtime 44.0 sec; index 150MB
INSERT INTO uuid_a SELECT uuid_generate_v4() FROM generate_series(1, 5*1e6) i; --> runtime 56.6 sec; index 195MB
INSERT INTO uuid_b SELECT uuid_generate_v1() FROM generate_series(1, 5*1e6) i; --> runtime 43.8 sec; index 150MB

Conclusion is that the index for a timebased uuid stays small (150MB) and the random uuid grows some 30% (195-196MB).

0

Reply

Kaarel

6 years ago

Reply to Dennis Brouwer

I see. That is unexpected indeed. I guess it shows that v4 is more random though - more page splits along the way. Thanks for sharing!

0

Reply

med bouzid

6 years ago

Hi Kaarel,

I was about to use Numeric type as primary key for one of my tables, I have users with blogs, and I wanted to track page views of every user's blog, when you have a lot of users you don't know how many visits every user will get for his own blog, so I was afraid to run out of INT maximum size, I retrieve views/visits from that table based on the user_id, I don't do queries based on the id, example of request I do: "SELECT COUNT(*) FROM page_views WHERE user_id = ? AND created_at BETWEEN ? AND ?", what data type you suggest for this kind of table when you create a new row every page visit and you have no idea how popuplar a user's blog is (or will get) ?

0

Reply