Reducing the impact of locking
“Waiting is the slowest way to execute an operation” – this is how a professor at my university in Vienna usually described bad locking and low concurrency. He could not have been more right. The same applies to performing calculations: The fastest way to calculate something is to try to skip it entirely. In this posting we want to focus on the first half of the story – minimizing locking.
Let us assume we have three tables:
test=# CREATE TABLE a (aid int4);
CREATE TABLE
test=# CREATE TABLE b (bid int4);
CREATE TABLE
test=# CREATE TABLE c (cid int4);
CREATE TABLE
We can populate those tables with a handful of values to get our example started:
test=# INSERT INTO a VALUES (1);
INSERT 0 1
test=# INSERT INTO b VALUES (1);
INSERT 0 1
test=# INSERT INTO c VALUES (1);
INSERT 0 1
SELECT … FOR UPDATE
Concurrency (= doing things in parallel) is an essential thing to consider when writing code. In most applications there is more than just one user at the same time and those users might very well focus their attention on the same set of data.
To make sure that nobody can modify a row while we are processing it, all modern relational databases such as Oracle and PostgreSQL provide a clause called SELECT … FOR UPDATE. All rows returned by SELECT … FOR UPDATE are locked as if they were about to be updated. This helps us avoid concurrency issues and strange behavior (race conditions).
In our example we want to perform a join between table A and table B and lock all rows returned by SELECT. Here is the code:
test=# BEGIN;
BEGIN
test=# SELECT * FROM a, b WHERE aid = bid FOR UPDATE;
 aid | bid
-----+-----
   1 |   1
(1 row)
As you can see, data is returned as expected. For the sake of this example we keep the transaction open. Let us now run a second transaction while the first session is still active:
test=# BEGIN;
BEGIN
test=# SELECT * FROM a, b WHERE aid = bid FOR UPDATE;
-- waiting for the first session to commit
The point here is: the second session has to wait because both sessions try to lock the same rows in tables A and B. This is exactly the desired behavior.
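If waiting is not acceptable at all, PostgreSQL also lets a session fail fast or skip contended rows instead of blocking. Here is a sketch of both options while the first session still holds its locks (NOWAIT raises an error immediately; SKIP LOCKED was added in PostgreSQL 9.5, and the exact error wording may vary between versions):

```sql
-- fail immediately instead of waiting for the first session:
test=# SELECT * FROM a, b WHERE aid = bid FOR UPDATE NOWAIT;
ERROR:  could not obtain lock on row in relation "a"

-- or simply skip rows that are already locked (PostgreSQL 9.5+):
test=# SELECT * FROM a, b WHERE aid = bid FOR UPDATE SKIP LOCKED;
 aid | bid
-----+-----
(0 rows)
```

Both variants are useful for queue-like workloads, but they change the semantics of your application: rows are either reported as errors or silently skipped, not processed later.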
But: What if B was just a simple lookup table? What if A was a payment table and B just a postal code table? What if C was a table storing data about people?
If somebody modifies a payment and has to look up a postal code for that purpose, we would automatically lock out other transactions that want to modify data about people. The lock on the postal code table would silently spread to areas of the code that have nothing to do with payments.
If changes to payments are frequent, the postal code table becomes a poison pill for every writing transaction touching the table storing people. Clearly, this is not acceptable. How can you ever scale to 64 CPUs or beyond if you are shot down by a simple row-level lock?
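In our toy schema this problem is easy to reproduce. While the first session from above still holds its plain FOR UPDATE locks on the join of A and B, a session that only wants to work on people (C) and merely looks up a postal code (B) gets stuck as well:

```sql
test=# BEGIN;
BEGIN
test=# SELECT * FROM b, c WHERE bid = cid FOR UPDATE;
-- waiting: the row in b is already locked by the payment session
```

The people transaction never touches the payment data, yet it is blocked purely because both joins pass through the lookup table B.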
PostgreSQL has the right solution to this kind of problem: SELECT … FOR UPDATE allows you to specify which tables in the join you want to lock. It works like this:
test=# BEGIN;
BEGIN
test=# SELECT * FROM a, b WHERE aid = bid FOR UPDATE OF a;
 aid | bid
-----+-----
   1 |   1
(1 row)
In this example we are telling the system that we are joining A and B but only want to lock A. B stays unlocked, which makes sure that C can be locked undisturbed:
test=# BEGIN;
BEGIN
test=# SELECT * FROM b, c WHERE bid = cid FOR UPDATE OF c;
 bid | cid
-----+-----
   1 |   1
(1 row)
This SELECT statement will only lock rows in table C and leave B untouched, because we don't want to modify B anyway.
As you can see, a small change can have a significant impact on overall performance, because we have allowed a second transaction (and thus a second CPU) to proceed without having to wait on a lock.
The golden rule of thumb is: the more concurrency you allow in your system, the faster you are overall. Never lock more than needed.
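To verify what your statements really lock, you can inspect the pg_locks system view from inside the session holding the locks. A minimal sketch (see the PostgreSQL documentation for the full column list):

```sql
test=# SELECT locktype, relation::regclass, mode, granted
       FROM pg_locks
       WHERE pid = pg_backend_pid();
```

Running this inside the transaction using FOR UPDATE OF a should show row-level lock activity only on the tables you actually asked to lock, which is a quick way to confirm that a lookup table such as B is left alone.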