Removing duplicate rows in PostgreSQL

After some absence due to a high workload, I finally got around to writing about an issue that has been on my mind for quite a while: removing duplicate rows.

It happens from time to time that PostgreSQL support clients accidentally load data twice or otherwise manage to duplicate data (human error, bugs in applications, etc.). The main question is: how can the problem be fixed? How can developers handle duplicate data and clean things up? This post tries to answer exactly that question.

Detecting duplicate entries

To demonstrate how duplicates can be found, I have prepared some sample data:

CREATE TABLE t_location
(
        country  text,
        city     text
);

The idea is to store a couple of German cities:

COPY t_location FROM stdin DELIMITER ',';
Germany,Berlin
Germany,Frankfurt
Germany,Frankfurt
Germany,Dortmund
Germany,Dortmund
Germany,Dresden
\.

Obviously some cities show up more than once:

test=# SELECT country, city, count(*)
       FROM t_location
       GROUP BY 1, 2;
 country |   city    | count
---------+-----------+-------
 Germany | Dresden   |     1
 Germany | Dortmund  |     2
 Germany | Berlin    |     1
 Germany | Frankfurt |     2
(4 rows)

“Dortmund” and “Frankfurt” are the ones that have been added more than once and need a proper cleanup.
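On a large table you don't want to scan the full group list visually. A HAVING clause, a small variation of the query above, restricts the output to the offending groups:

test=# SELECT country, city, count(*)
       FROM t_location
       GROUP BY 1, 2
       HAVING count(*) > 1;
 country |   city    | count
---------+-----------+-------
 Germany | Dortmund  |     2
 Germany | Frankfurt |     2
(2 rows)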

Identifying rows

Before we approach the problem head-on, I want to tell you about a system column called “ctid”, which PostgreSQL uses to identify the physical position of a row in a table:

test=# SELECT ctid, * FROM t_location;
 ctid  | country |   city
-------+---------+-----------
 (0,1) | Germany | Berlin
 (0,2) | Germany | Frankfurt
 (0,3) | Germany | Frankfurt
 (0,4) | Germany | Dortmund
 (0,5) | Germany | Dortmund
 (0,6) | Germany | Dresden
(6 rows)

(0,1) means that we are looking at the first row in the first block; (0,2) is the second row in the first 8k block, and so on. The trouble is: there is no such thing as “DELETE … LIMIT 1” in PostgreSQL. Therefore we have to find a way to use this “hidden” column to solve the problem. The ctid gives us a free, unique identifier, which can help to delete just a single row.
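Since every ctid is unique within a table, a single duplicate could in principle be removed directly through its physical address. Here is a minimal sketch; the ctid literal is taken from the listing above and will differ on your system (we will not run it here, because the statement shown further down cleans up all duplicates in one go):

DELETE FROM t_location WHERE ctid = '(0,3)';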

Here is a query that returns the first incarnation of each duplicated row:

test=# SELECT t_location, min(ctid)
       FROM t_location
       GROUP BY 1
       HAVING count(*) > 1;
     t_location      |  min
---------------------+-------
 (Germany,Dortmund)  | (0,4)
 (Germany,Frankfurt) | (0,2)
(2 rows)

Now that we have identified all the duplicate rows in the table, it is possible to write a DELETE statement that just kills the undesired data. Note that “t_location” without a column list refers to the entire row as a composite value, which is what makes the row-wise comparison in the WHERE clause possible:

WITH x AS (SELECT t_location dup, min(ctid)
           FROM   t_location
           GROUP BY 1
           HAVING count(*) > 1)
DELETE FROM t_location
USING  x
WHERE  (t_location) = (dup)
       AND t_location.ctid <> x.min
RETURNING *;

 country |   city    |         dup         |  min
---------+-----------+---------------------+-------
 Germany | Dortmund  | (Germany,Dortmund)  | (0,4)
 Germany | Frankfurt | (Germany,Frankfurt) | (0,2)
(2 rows)

As you can see, the two rows we no longer want are removed from the table. Mind that the ctid is not static: UPDATE statements, VACUUM FULL, etc. can change it. Therefore it is highly recommended to perform this kind of operation only while you are the only user on the system and the table is locked appropriately.
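If everything worked, exactly four rows, one per city, are left behind:

test=# SELECT count(*) FROM t_location;
 count
-------
     4
(1 row)

Once the table is clean it can also make sense to add a unique constraint, so that the duplicates cannot sneak back in. This assumes that the (country, city) combination is really meant to be unique in your data model:

test=# ALTER TABLE t_location ADD UNIQUE (country, city);
ALTER TABLE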

Hans-Jürgen Schönig
Hans-Jürgen Schönig has 15 years of experience with PostgreSQL. He is a consultant and the CEO of Cybertec Schönig & Schönig GmbH (www.cybertec.at, www.postgresql-support.de), which has served countless customers around the globe.