Removing duplicate rows in PostgreSQL

After some absence due to a high workload, I finally got around to writing about an issue that has been on my mind for quite a while: removing duplicate rows.

It happens from time to time that PostgreSQL support clients accidentally load data twice or otherwise manage to duplicate data (human error, bugs in applications, etc.). The main question is: how can the problem be fixed? How can developers handle duplicate data and clean things up? This post tries to answer exactly that question.

Detecting duplicate entries

To demonstrate how duplicates can be found, I have prepared some sample data:

CREATE TABLE t_location
(
        country  text,
        city     text
);

The idea is to store a couple of German cities:

COPY t_location FROM stdin DELIMITER ',';
Germany,Berlin
Germany,Frankfurt
Germany,Frankfurt
Germany,Dortmund
Germany,Dortmund
Germany,Dresden
\.

Obviously some cities show up more than once:

test=# SELECT country, city, count(*)
       FROM t_location
       GROUP BY 1, 2;
 country |   city    | count
---------+-----------+-------
 Germany | Dresden   |     1
 Germany | Dortmund  |     2
 Germany | Berlin    |     1
 Germany | Frankfurt |     2
(4 rows)

“Dortmund” and “Frankfurt” are the ones that have been added more than once and need a proper cleanup.
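On a large table you don't want to scan the full group list visually. A HAVING clause, a small variation of the query above, restricts the output to the offending groups:

test=# SELECT country, city, count(*)
       FROM t_location
       GROUP BY 1, 2
       HAVING count(*) > 1;
 country |   city    | count
---------+-----------+-------
 Germany | Dortmund  |     2
 Germany | Frankfurt |     2
(2 rows)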

Identifying rows

Before we approach the problem head-on, I want to tell you about a system column called “ctid”, which PostgreSQL uses to identify the physical position of a row in a table:

test=# SELECT ctid, * FROM t_location;
 ctid  | country |   city
-------+---------+-----------
 (0,1) | Germany | Berlin
 (0,2) | Germany | Frankfurt
 (0,3) | Germany | Frankfurt
 (0,4) | Germany | Dortmund
 (0,5) | Germany | Dortmund
 (0,6) | Germany | Dresden
(6 rows)

(0,1) means that we are looking at the first row in the first block; (0,2) is the second row in the first 8k block, and so on. The trouble is: there is no such thing as “DELETE … LIMIT 1” in PostgreSQL. Therefore we have to find a way to use this “hidden” column to solve the problem. The ctid gives us a free, unique identifier, which can help to delete just a single row.
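Since every ctid is unique within a table, a single duplicate could in principle be removed directly through its physical address. Here is a minimal sketch; the ctid literal is taken from the listing above and will differ on your system (we will not run it here, because the statement shown further down cleans up all duplicates in one go):

DELETE FROM t_location WHERE ctid = '(0,3)';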

Here is a query that returns the first incarnation of each duplicated row:

test=# SELECT t_location, min(ctid)
       FROM t_location
       GROUP BY 1
       HAVING count(*) > 1;
     t_location      |  min
---------------------+-------
 (Germany,Dortmund)  | (0,4)
 (Germany,Frankfurt) | (0,2)
(2 rows)

Now that we have identified all the duplicate rows in the table, it is possible to write a DELETE statement that just kills the undesired data. Note that “t_location” without a column list refers to the entire row as a composite value, which is what makes the row-wise comparison in the WHERE clause possible:

WITH x AS (SELECT t_location dup, min(ctid)
           FROM   t_location
           GROUP BY 1
           HAVING count(*) > 1)
DELETE FROM t_location
USING  x
WHERE  (t_location) = (dup)
       AND t_location.ctid <> x.min
RETURNING *;

 country |   city    |         dup         |  min
---------+-----------+---------------------+-------
 Germany | Dortmund  | (Germany,Dortmund)  | (0,4)
 Germany | Frankfurt | (Germany,Frankfurt) | (0,2)
(2 rows)

As you can see, the two rows we no longer want are removed from the table. Mind that the ctid is not static: UPDATE statements, VACUUM FULL, etc. can change it. Therefore it is highly recommended to perform this kind of operation only while you are the only user on the system and the table is locked appropriately.
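If everything worked, exactly four rows, one per city, are left behind:

test=# SELECT count(*) FROM t_location;
 count
-------
     4
(1 row)

Once the table is clean it can also make sense to add a unique constraint, so that the duplicates cannot sneak back in. This assumes that the (country, city) combination is really meant to be unique in your data model:

test=# ALTER TABLE t_location ADD UNIQUE (country, city);
ALTER TABLE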

Hans-Jürgen Schönig
Hans-Jürgen Schönig has 15 years of experience with PostgreSQL. He is a consultant and the CEO of Cybertec Schönig & Schönig GmbH (www.cybertec.at, www.postgresql-support.de), which has served countless customers around the globe.