Flexible grouping: Some dirty SQL trickery

While doing PostgreSQL consulting for a German client, I stumbled over an interesting issue this week which might be worth sharing with some folks out on the Internet: it is all about grouping.

Suppose you are measuring the same thing several times on different sensors every, say, 15 minutes. Maybe some temperature, some air pressure or whatever. The data might look like this:


CREATE TABLE t_data (t time, val int);

COPY t_data FROM stdin;
14:00	12
14:01	22
14:01	43
14:14	32
14:15	33
14:16	27
14:30	19
\.

The human eye can instantly spot that 14:00 and 14:01 could be candidates for grouping (maybe the differences are just related to latency or some slightly inconsistent timing). The same applies to 14:14 through 14:16. You might want to have this data in the same group during aggregation.

The question now is: How can that be achieved with PostgreSQL?

Some dirty SQL trickery

The first thing to do is to check out the differences from one timestamp to the next:


SELECT *, lag(t, 1) OVER (ORDER BY t)
FROM   t_data;

The lag function offers a nice way to solve this kind of problem:


    t     | val |   lag
----------+-----+----------
 14:00:00 |  12 |
 14:01:00 |  22 | 14:00:00
 14:01:00 |  43 | 14:01:00
 14:14:00 |  32 | 14:01:00
 14:15:00 |  33 | 14:14:00
 14:16:00 |  27 | 14:15:00
 14:30:00 |  19 | 14:16:00
(7 rows)

Now that we have used lag to "move" the time to the next row, there is a simple trick we can apply:


CREATE SEQUENCE seq_a;  -- the sequence handing out group numbers

SELECT *, CASE WHEN t - lag < '10 minutes'
               THEN currval('seq_a')
               ELSE nextval('seq_a')
          END AS g
FROM   (SELECT *, lag(t, 1) OVER (ORDER BY t)
        FROM   t_data) AS x;

Moving the lag into a subselect allows us to start all over again and to create those groups. The trick is: if the difference from one line to the next is large, a new group starts (nextval); otherwise the row stays within the current group (currval). Note that for the first row the difference is NULL, so the ELSE branch fires and opens the first group.
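As an aside, the same grouping can be expressed without a sequence at all: flag the first row of each group and let a running sum turn those flags into group numbers. This is just a sketch of that well-known alternative, not part of the original recipe:

```sql
-- Sequence-free variant (sketch): is_new marks the start of a group,
-- and the running sum of the flags yields the group number g.
SELECT *, sum(is_new) OVER (ORDER BY t) AS g
FROM   (SELECT *,
               CASE WHEN t - lag(t, 1) OVER (ORDER BY t) < '10 minutes'
                    THEN 0
                    ELSE 1
               END AS is_new
        FROM   t_data) AS x;
```

Because the first row's lag is NULL, it is flagged with 1 as well, so the running sum starts at group 1.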

This leaves us with a simple result set:


    t     | val |   lag    | g
----------+-----+----------+---
 14:00:00 |  12 |          | 1
 14:01:00 |  22 | 14:00:00 | 1
 14:01:00 |  43 | 14:01:00 | 1
 14:14:00 |  32 | 14:01:00 | 2
 14:15:00 |  33 | 14:14:00 | 2
 14:16:00 |  27 | 14:15:00 | 2
 14:30:00 |  19 | 14:16:00 | 3
(7 rows)

From now on, life is simple: we can take this output and aggregate on it easily. "GROUP BY g" will give us a nice group for each value of "g".
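For example, the final aggregation step could look like this (the chosen aggregates and column aliases are just illustrative):

```sql
-- Aggregate per group: wrap the grouping query and GROUP BY g.
SELECT g,
       min(t)   AS group_start,  -- first reading in the group
       count(*) AS readings,     -- number of readings in the group
       avg(val) AS avg_val       -- average measured value
FROM   (SELECT *, CASE WHEN t - lag < '10 minutes'
                       THEN currval('seq_a')
                       ELSE nextval('seq_a')
                  END AS g
        FROM   (SELECT *, lag(t, 1) OVER (ORDER BY t)
                FROM   t_data) AS x) AS y
GROUP  BY g
ORDER  BY g;
```

Note that rerunning the query keeps incrementing seq_a, so the absolute values of g will change from run to run; only their role as group labels matters.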

Hans-Juergen Schoenig
Hans-Jürgen Schönig has 15 years of experience with PostgreSQL. He is a consultant and the CEO of "Cybertec Schönig & Schönig GmbH" (www.cybertec.at, www.postgresql-support.de), which has served countless customers around the globe.