UPDATED JULY 19, 2023 – When I do training here at CYBERTEC, people often ask how to load data fast and efficiently. There are many ways to achieve this. One way not too many know about is COPY (FREEZE). It makes sure that PostgreSQL can organize data nicely straight away by instantly writing hint bits (as explained in my previous post).

Sample data for our bulk load test

To demonstrate COPY (FREEZE), first generate some data:

$ cat numbers.pl
#!/usr/bin/perl

for ($count = 1; $count <= 500000000; $count++)
{
	print "$count\t$count\n";
}

We will load the data into this table:

CREATE TABLE t_test (
   a integer,
   b integer
);

Importing the data with COPY

To import the data on my test system (single SATA disk in a Linux box) I use a normal COPY:

test=# COPY t_test FROM '/data/numbers.txt';
COPY 500000000
Time: 627509.061 ms

As you can see, it takes around 10 minutes to do so.

Speeding up the load with COPY (FREEZE)

To speed things up, let’s try COPY (FREEZE):

test=# COPY t_test FROM '/data/numbers.txt' (FREEZE);
ERROR:  cannot perform COPY FREEZE because the table was not created or truncated in the current subtransaction

The important thing here is that this command can only operate inside the transaction that created or truncated the table. Otherwise, it doesn’t work (I’ll explain the reason later). So, let’s try the following:

test=# BEGIN;
BEGIN

test=*# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn 
--------------------
 2/A7D6310
(1 row)

test=*# TRUNCATE t_test;
TRUNCATE

test=*# \timing
Timing is on.

test=*# COPY t_test FROM '/data/numbers.txt' FREEZE;
COPY 500000000
Time: 304082.907 ms

test=*# \timing
Timing is off.

test=*# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn 
--------------------
 2/A7DC000
(1 row)

test=*# COMMIT;
COMMIT

The execution time goes down dramatically!

Explanation for the speed of COPY (FREEZE)

The main reason it’s now more efficient is that PostgreSQL did not have to write transaction log data. That’s because I configured wal_level = minimal. With this setting, you can’t use WAL archiving or replication, and PostgreSQL only logs enough data for crash recovery. Because of that, PostgreSQL does not need to write WAL for data modifications that take place in the transaction that created or truncated the table. Skipping WAL writes speeds up data loading considerably! Changing wal_level requires a restart of the database, but it may be worth the pain if you need to load data quickly.

But there’s also another gain: The second major benefit will be visible when the data is read for the first time after the import. As already shown in my previous posts, PostgreSQL has to set hint bits during the first reading operation. This translates to a large number of writes later on down the road.

COPY (FREEZE) has been made to fix exactly this kind of problem – the data are already loaded in the frozen state, and there is no need to write hint bits any more. Frozen rows are unconditionally visible, and that is the reason why the table had to be created or truncated in the same transaction: otherwise, concurrent readers could see the rows before the transaction commits, which would violate transaction isolation.

Analytical applications can benefit greatly
from the COPY (FREEZE) feature!

Finally…

I’ve recently seen some really broad tables (hundreds of columns) in a somewhat inefficiently structured database. Our PostgreSQL support customer complained about strange runtime behavior which could not be easily explained. To help other PostgreSQL users in this same situation, I decided to reveal the secrets of a fairly common performance problem many people don’t understand: Column order and column access.
 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.