Shrinking the storage footprint of data

When writing data structures and table layouts people usually don’t care about the order of columns. Why should they anyway? Columns can be reordered inside a SELECT at will so why care? Well, you should care because the order of columns might have significant impact on the size of your table. That’s right: The size of the table might depend on the order of columns – even if the same data is stored.

Why is that so? Internally there is something called CPU alignment which governs the internal size of a data structure. By cleverly aligning and ordering columns there is some potential to optimize things. You don’t believe it? Let us give it a try:

test=# CREATE TABLE t_test (

            i1         int,

            i2         int,

            i3         int,

            v1        varchar(100),

            v2        varchar(100),

            v3        varchar(100)

);

CREATE TABLE

In this example there are 6 columns. 3 integer columns close to each other and 3 varchar columns also close to each other. Let us add 10 million rows to the table:

test=# INSERT INTO t_test SELECT 10, 20, 30, 'abcd', 'abcd', 'abcd'

            FROM generate_series(1, 10000000);

INSERT 0 10000000

The overall size of the table is 574 MB:

test=# SELECT pg_size_pretty(pg_relation_size('t_test'));

 pg_size_pretty

----------------

 574 MB

(1 row)

Let us try to change the layout of those columns. In the next example there is a varchar field followed by an integer column. This layout is repeated three times:

test=# CREATE TABLE t_test (

            v1        varchar(100),

            i1         int,

            v2        varchar(100),

            i2         int,

            v3        varchar(100),

            i3         int

);

CREATE TABLE

If 10 millions rows are added now …

test=# INSERT INTO t_test SELECT 'abcd', 10, 'abcd', 20, 'abcd', 30

            FROM generate_series(1, 10000000);

INSERT 0 10000000

… the table will be a lot larger:

test=# SELECT pg_size_pretty(pg_relation_size('t_test'));

 pg_size_pretty

----------------

 651 MB

(1 row)

The data in the table is the same – it has just been chosen in a way to demonstrate the effect. If I had used “abc” instead of “abcd” there would be no difference in size because the 4 character string made sure that the string just did not fit into a smaller space.

Conclusion

The important conclusion here is that it definitely makes sense, to pack similar data types together. And, I have found it useful to pack integer columns at the beginning of the tables. In many cases this can yield in some extra percentage points of speed – just because things are a bit smaller than if you don’t make use of this little rule.

Hans-Juergen Schoenig
Hans-Jürgen Schönig has 15 years of experience with PostgreSQL. He is consultant and CEO of the company „Cybertec Schönig & Schönig GmbH“ (www.cybertec.at, www.postgresql-support.de), which has served countless customers around the globe.