Reporting: Creating correct output

Creating reports is a core task of every PostgreSQL database engineer or developer. However, many people think that it is enough to hack up some SQL aggregating some data and execute it. This is not quite true. We have repeatedly seen reports being just plain wrong without people even taking note of it.

How can this happen?

Transaction isolation does make a difference

In our example we want to aggregate some simple data:

test=# CREATE TABLE t_test (name text, gender char(1));

CREATE TABLE

We add two boys and one girl:

test=# INSERT INTO t_test VALUES ('hans', 'm');

INSERT 0 1

test=# INSERT INTO t_test VALUES ('paul', 'm');

INSERT 0 1

test=# INSERT INTO t_test VALUES ('maria', 'f');

INSERT 0 1

Let us start with our report. We want to create a nice and colorful analysis displaying the number of girls, the number of boys, and the overall total.

To prove our point we have written two queries:

test=# BEGIN;

BEGIN

test=# SELECT gender, count(*) FROM t_test GROUP BY 1;

gender | count

--------+-------

m | 2

f | 1

(2 rows)

test=# SELECT count(*) FROM t_test;

count

-------

3

(1 row)

test=# COMMIT;

COMMIT

We start a transaction, do an analysis per gender followed by an overall count (and yes, there are more efficient ways of doing that).

The question now is: Is this report correct or not? Most people would ask: “Why not?”

Here is why:


User 1

BEGIN;

test=# SELECT gender, count(*)
FROM t_test
GROUP BY 1;

 gender | count
--------+-------
   m    |   2
   f    |   1
(2 rows)

.
.
.

test=# SELECT count(*) FROM t_test;

count
-------
2
(1 row)

COMMIT;


User 2

.
.
.
.
.
.
.
.
.
.
.
.
test=# DELETE FROM t_test
WHERE name = 'hans';
DELETE 1
.
.
.
.
.
.
.
.
.

What we see here is that the count does not reflect what we have seen before. The overall count is two – it is supposed to be three. The reason for this (for many users) unexpected behavior is that we are in so called “read committed” mode. In “read committed” mode every SELECT statement will take a so called snapshot. In other words: Every SELECT will already see data committed by some other transaction. In our case the second SELECT will already see the other transaction’s commit. So, over time, a “read committed” transaction can change its view of the data. This implies that a report consisting of more than just one query might not operate on the same data and thus return inconsistent results.

Repeatable read – a way to fix transaction visibility

To fix the problem we can run our transaction in “repeatable read” mode. This will make sure that your transaction will have a consistent view of the data throughout the transaction and therefore return correct results.

Running a transaction in “repeatable read” works like this:

test=# BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN

test=# SELECT gender, count(*) FROM t_test GROUP BY 1;

gender | count

--------+-------

m | 2

f | 1

(2 rows)

test=# SELECT count(*) FROM t_test;

count

-------

3

(1 row)

test=# COMMIT;

COMMIT

If you happen to do reporting, we always recommend to use “repeatable read” because it allows you to do your reporting on a consistent view of the data.

———-
We welcome your feedback on our brand new Facebook page: www.fb.com/cybertec.postgresql

Hans-Juergen Schoenig
Hans-Jürgen Schönig has 15 years of experience with PostgreSQL. He is consultant and CEO of the company „Cybertec Schönig & Schönig GmbH“ (www.postgresq-support.de), which has served countless customers around the globe.