Speeding up things with hint bits

PostgreSQL is a highly sophisticated relational database system capable of performing wonderful operations. But, sophistication also means that there is a bit of complexity under the surface, which is not always well understood by users. One thing people usually don’t know about are hint bits. What are hint bits? Actually they are an internal optimization, making sure that visibility checks are speeded up considerably. As you might know PostgreSQL traditionally had to decide for each row whether it was visible to a user or not. The best way of course to speed up the process is to avoid that – and this is when hint bits were invented.

When PostgreSQL finds a row that is visible to all transactions it sets a bit in the row header indicating this fact. Next time someone finds this row they can avoid comparing the row visibility information to their snapshot because the hint bit already tells them that it must be visible.

Creating some demo data

To demonstrate how things work some demo data can be created:

test=# CREATE TABLE t_test (a int, b int);

CREATE TABLE

A simple Perl script can create the data nicely. In this example half a billion rows are used (big enough to demonstrate the point and yet small enough to do the test in reasonable time):

[hs@jacqueline blog]$ cat numbers.pl

#!/usr/bin/perl

 

for ($count = 1; $count <= 500000000; $count++)

{

            print "$count   $count\n";

}

To ensure that autovacuum does not spoil our day, it is turned off for this test:

test=# ALTER TABLE t_test SET (autovacuum_enabled = off);

ALTER TABLE

Then data is loaded into PostgreSQL. The easiest way to do that is to use “COPY … FROM PROGRAM” and to simply read things in through a pipe. It works like this:

test=# \timing

Timing is on.

test=# COPY t_test FROM PROGRAM '/path/numbers.pl';

COPY 500000000

Time: 612716.438 ms

On our system (some Intel i5 and a 750 GB SATA disk) it takes around 10 minutes to load the data. Note that default PostgreSQL parameters have been used (no checkpoint optimizations and so on). “top” leaves us with output as shown in the next listing:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                

27570 hs        20   0  257m  24m  23m R 48.6  0.2   0:10.33 postgres                                                                                               

27574 hs        20   0  119m 1584 1252 R 31.6  0.0   0:06.66 numbers.pl                                                                                             

27356 hs        20   0  256m  16m  16m S  1.3  0.1   0:00.18 postgres

It can be seen clearly that generating numbers actually causes a decent share of the CPU load. Actually this is good because it means that PostgreSQL is cheap relative to the data creation part.

hint bits: First contact …

Up to now nothing special could be observed. Let us try to run a simple SELECT:

test=# SELECT count(*) FROM t_test;

   count  

-----------

 500000000

(1 row)

 

Time: 359669.393 ms

The interesting part here is not the result of the query. The really interesting part is hidden nicely below the surface. Here is vmstat:

[hs@jacqueline blog]$ vmstat 2

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----

 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

 1  0    164 4968092  46412 10534028    0    0  3486  1903    3   13  1  0 91  7  0 

 0  0    164 4967960  46412 10534144    0    0     0     0   78  225  0  0 100  0  0   

 0  0    164 4967960  46420 10534144    0    0     0     8   72  203  0  0 99  1  0     

 0  0    164 4967960  46420 10534144    0    0     0     0  103  301  0  0 100  0  0  

 0  1    164 4946260  46420 10555520    0    0 10586    20  246  437  1  0 97  2  0           

 1  1    164 4732608  46420 10763488    0    0 104000     0 1403 2046 10  2 76 12  0      

 0  1    164 4517592  46420 10972988    0    0 104704     0 1415 2042 10  3 76 11  0      

 1  0    164 4301832  46420 11183036    0    0 105024     0 1410 2038 10  3 75 12  0       

 1  0    164 4082228  46420 11396740    0    0 106880     4 1449 2048 11  3 75 11  0       

 0  1    164 3868576  46420 11604996    0    0 104064     0 1414 2050 10  3 75 12  0      

 1  0    164 3650460  46420 11817164    0    0 106176     0 1415 2058 11  2 75 12  0       

 0  1    164 3433460  46420 12028640    0    0 105664     0 1419 2078 10  2 75 12  0      

 1  1    164 3262588  46420 12194508    0    0 82944 45356 1199 2209  8  3 66 23  0     

 1  1    164 3143176  46420 12310644    0    0 58112 43818  949 1738  6  2 75 18  0      

 1  1    164 3018680  46420 12432100    0    0 60736 36900  940 1760  6  2 75 17  0      

 1  2    164 2897284  46420 12550404    0    0 59136 29726  917 1727  6  2 75 17  0      

 0  2    164 2775020  46420 12669348    0    0 59392 36642  912 1746  6  2 76 17  0      

 0  2    164 2656848  46420 12784480    0    0 57600 36388  883 1687  6  2 75 17  0      

 1  1    164 2537560  46420 12900636    0    0 58048 36644  937 1755  5  2 73 19  0

“bi” (= blocks in) and “bo” (= blocks out) are the interesting two columns here. First things look pretty normal. We start to read at roughly 100 MB / second. A reasonable value for a desktop PC featuring a single SATA drive. But then things start to go south. At some point PostgreSQL starts to write data out to disk. Why that? Well, when a row is loaded PostgreSQL has no idea whether it is visible to others without checking visibility all over again. Therefore PostgreSQL will mark a row as “visible to everyone” when it’s read for the first time. This is exactly where the I/O comes from – when rows are marked as “visible to everybody”. The advantage here is that visibility checks are speeded up tremendously.

When the query is executed again the I/O pattern is totally different:

[hs@jacqueline blog]$ vmstat 2

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----

 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st         

 0  1    164 191076   1688 15325392    0    0 104256     0 1401 2077  8  2 75 15  0          

 0  1    164 183140   1688 15334056    0    0 105792     0 1394 2113  8  2 75 15  0          

 0  1    164 187728   1688 15329596    0    0 106880     0 1429 2121  9  2 75 15  0          

 0  1    164 190828   1688 15327804    0    0 105472     0 1413 2124  8  2 75 15  0          

 2  0    164 184008   1688 15335792    0    0 104640     0 1389 2108  8  2 75 15  0          

 1  0    164 188348   1688 15336880    0    0 106240     0 1466 2107  9  2 75 14  0          

 1  0    164 191944   1692 15337824    0    0 100226     4 1372 2060  8  2 75 15  0

As you can see there is no more writing going on – just a clean 100+ MB / second. Exactly what we want to see. The important part here is that the read is a lot faster now:

test=# SELECT count(*) FROM t_test;

   count  

-----------

 500000000

(1 row)

 

Time: 110263.714 ms

Wow, the time needed to perform this one has gone down dramatically. And more importantly the execution time is pretty stable:

Time: 110382.362 ms

The theory is that reads happen more frequently than writes and therefore investing into hint bits makes an awful lot of sense because it seriously improves runtimes.

Turning to VACUUM

But there is a bit more. What if we VACUUM the table?

Here is vmstat again:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----

 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

 1  1    164 161936    616 15427528    0    0 50718   356  845 1262  4  2 79 15  0

 0  1    164 168136    616 15421384    0    0 87680   636 1267 2517  6  4 75 15  0           

 0  1    164 169624    616 15419204    0    0 88704   624 1254 2529  6  3 75 15  0           

 0  1    164 179296    616 15409532    0    0 90304   560 1279 2435  6  4 75 15  0           

 0  1    164 185868    616 15402548    0    0 89216   628 1319 1898  7  4 74 15  0           

 1  1    164 178924    616 15409580    0    0 88832   628 1295 1960  7  3 75 15  0           

 0  1    164 183636    616 15404972    0    0 80166   624 1196 1802  6  3 75 16  0

 0  1    164 186240    616 15402652    0    0 91648   568 1354 1951  7  4 75 15  0           

 0  2    164 180660    624 15408296    0    0 89472   670 1307 1982  6  3 75 16  0           

 1  1    164 170120    624 15418500    0    0 72260 49908 1080 2035  5  3 70 22  0         

 0  2    164 164788    624 15423656    0    0 67072 22782  995 1955  4  2 70 23  0          

 0  2    164 168012    624 15420556    0    0 61888 22492  984 1928  4  2 73 20  0          

 0  2    164 164044    624 15424496    0    0 60672 29582  959 1887  4  2 73 20  0          

 0  2    164 172352    624 15416452    0    0 67712 22524 1053 2097  5  3 72 20  0

Again some writing is going on. We are seriously bound by the disk here as the following couple of lines taken from “top” indicate:

 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND                                                                                                

27570 hs        20   0  257m  30m  29m R 26.3  0.2   9:34.80 postgres                                                                                                

   59 root      20   0     0    0    0 S  1.7  0.0  18:00.13 kswapd0                                                                                                

   36 root      20   0     0    0    0 S  1.0  0.0  10:42.29 kblockd/2                                                                                               

27245 root      20   0     0    0    0 D  0.3  0.0   0:03.53 flush-8:16                                                                                             

30861 root      20   0 15036 1196  836 R  0.3  0.0   0:03.75 top

VACUUM takes quite a while as well:

test=# VACUUM t_test;

VACUUM

Time: 363773.532 ms

The beauty now is that we not only know that a single row is visible or not – VACUUM can also figure out if an entire block is visible or not.

Again, we see a speed improvement – but this time it is not too big (just two seconds):

test=# SELECT count(*) FROM t_test;

   count  

-----------

 500000000

(1 row)

 

Time: 108715.332 ms

The main benefit comes really from those bits attached to a single line – and this benefit is massive.

Hans-Juergen Schoenig
Hans-Jürgen Schönig has 15 years of experience with PostgreSQL. He is consultant and CEO of the company „Cybertec Schönig & Schönig GmbH“ (www.cybertec.at, www.postgresql-support.de), which has served countless customers around the globe.