
Flexible grouping: Some dirty SQL trickery

11.2015

While doing PostgreSQL consulting for a German client, I stumbled over an interesting issue this week which might be worth sharing with some folks out on the Internet: it is all about grouping.

Suppose you are measuring the same thing on various sensors every, say, 15 minutes. Maybe some temperature, some air pressure, or whatever.
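For the examples that follow, assume a small table t_data holding those measurements (the column types are an assumption; the values are the sample data used throughout this post):

create table t_data (t time, val int);

insert into t_data (t, val) values
    ('14:00', 12), ('14:01', 22), ('14:01', 43),
    ('14:14', 32), ('14:15', 33), ('14:16', 27),
    ('14:30', 19);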

The human eye can instantly spot that 14:00 and 14:01 could be candidates for grouping (maybe the differences are just related to latency or some slightly inconsistent timing). The same applies to 14:14 to 14:16. You might want to have this data in the same group during aggregation.

The question now is: How can that be achieved with PostgreSQL?

Some dirty SQL trickery

The first thing to do is to look at the differences from one timestamp to the next.

The lag function offers a nice way to solve this kind of problem:
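A minimal sketch, using the t_data table from above:

select t, val,
       t - lag(t) over (order by t) as diff
from t_data
order by t;

The diff column holds the gap to the previous row; for the first row, lag returns NULL because there is no predecessor.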

Now that we have used lag to "move" the time to the next row, there is a simple trick which can be applied:

Moving the lag to a subselect allows us to start all over again and to create those groups. The trick now is: if the difference from one row to the next is large, start a new group - otherwise, stay within the group.
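Here is a sketch of that idea, using a sequence to hand out group numbers (the sequence name seq_g and the 10-minute threshold are assumptions):

create sequence seq_g;

select t, val,
       case when diff is null or diff >= '10 minutes'
            then nextval('seq_g')  -- big gap (or first row): open a new group
            else currval('seq_g')  -- small gap: stay in the current group
       end as g
from (select t, val,
             t - lag(t) over (order by t) as diff
      from t_data) as x
order by t;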

This leaves us with a simple result set:
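With the sample data from above, the output looks like this (assuming the sequence starts at 1):

    t     | val |   diff   | g
----------+-----+----------+---
 14:00:00 |  12 |          | 1
 14:01:00 |  22 | 00:01:00 | 1
 14:01:00 |  43 | 00:00:00 | 1
 14:14:00 |  32 | 00:13:00 | 2
 14:15:00 |  33 | 00:01:00 | 2
 14:16:00 |  27 | 00:01:00 | 2
 14:30:00 |  19 | 00:14:00 | 3
(7 rows)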

From now on, life is easy: we can take this output and aggregate on it. "GROUP BY g" will give us one group for each value of "g".
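For instance, to see when each group starts and what its average value is, one sketch might be (reusing the query and sequence from above):

select g,
       min(t) as t_start,
       count(*) as measurements,
       avg(val) as avg_val
from (select t, val,
             case when diff is null or diff >= '10 minutes'
                  then nextval('seq_g')
                  else currval('seq_g')
             end as g
      from (select t, val,
                   t - lag(t) over (order by t) as diff
            from t_data) as x) as y
group by g
order by g;

Because a sequence keeps counting across runs, the group numbers stay unique even when the job is executed again and again - which is exactly why a sequence was used here instead of a windowed sum.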

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

Comments
Corey Huinker
8 years ago

Here's a way to do that without sequences:


select x.*, sum(edge) over (order by t) as group_num
from (select *,
             case when (t - lag(t, 1) over (order by t)) >= '10 minutes'
                  then 1
                  else 0
             end as edge
      from t_data) x
order by t;

t | val | edge | group_num
----------+-----+------+-----------
14:00:00 | 12 | 0 | 0
14:01:00 | 22 | 0 | 0
14:01:00 | 43 | 0 | 0
14:14:00 | 32 | 1 | 1
14:15:00 | 33 | 0 | 1
14:16:00 | 27 | 0 | 1
14:30:00 | 19 | 1 | 2
(7 rows)

Hans-Jürgen Schönig
8 years ago
Reply to Corey Huinker

this is a cool one as well. we used the sequence because you can run jobs continuously.
i like your query :). i got to remember that one.

Corey Huinker
8 years ago

Ah, didn't realize you wanted the group_num to be unique across all queries, not just within this query.

Hans-Jürgen Schönig
8 years ago
Reply to Corey Huinker

in this business case yes. but, your query is pretty cool as well :). maybe i just communicated badly.
