Finding patterns in timeseries: A poor man’s method

A lot has been written about timeseries analysis and handling temporal data in general. Countless papers outlining various strategies have been posted and published all over the internet. However, in many cases a lot of the real technology is hidden behind colorful marketing papers without real meaning and without any useful content.

Still: Analyzing timeseries is criticial and becoming ever more important as more and more companies are starting to use modern techniques and are recognizing the real value of data.

For basic timeseries analysis is definitely more than sufficient and additional tooling is in many cases not needed to get things going in a fast and professional way. A lot can be done with timeseries and in this article I want to share a simple yet powerful idea, which can help to look for trends or known patterns in the data.

Loading test data into PostgreSQL

To show how this can be done, I first created a simple table with a little bit of data in our PostgreSQL database:


CREATE TABLE t_timeseries
(
        id              serial,
        data            numeric
);


COPY t_timeseries FROM stdin DELIMITER ',';
1,11
2,14
3,16
4,9
5,12
6,13
7,14
8,9
9,15
10,9
\.

The question now is: Can we find a certain trend in the data? In this example the question is whether we can find a period during which the value has grown constantly (e.g. “3 times in a row”). However, the idea is to come up with a strategy, which allows for more sophisticated analysis.

Encoding timeseries as strings

One approach, which is fairly easy, is to encode timeseries as strings. The advantage is that standard string crunching approaches can be applied on those strings then easily. When talking about trends and so on it can be useful to calculate the difference between the current value and the previous value. Fortunately PostgreSQL (and SQL in general) provides an easy way to do that:


test=# SELECT  *, data - lag(data, 1)
		OVER (ORDER BY id)  AS diff    
     FROM    t_timeseries;
 id | data | diff
----+------+------
  1 |   11 |     
  2 |   14 |    3
  3 |   16 |    2
  4 |    9 |   -7
  5 |   12 |    3
  6 |   13 |    1
  7 |   14 |    1
  8 |    9 |   -5
  9 |   15 |    6
 10 |    9 |   -6
(10 rows)


The lag function will move the data by one row given the order defined in the OVER-clause. We can now easily calculate the difference from one row to the next.

After this basic introduction it is time to focus on the real trick. The idea is to use the output of the previous SQL statement and analyze the differences from one row to the next. In case the value is higher than zero, we encode it as “u” (for “up”) and in case it is not we use “d” (“down”). What does it buy us? Well, the advantage is that every move of the series is represented as a single character, which makes it easy to process later on. After the encoding process we are using a sliding window. We take all the data from 5 periods (2 before, the current period and 2 later) and turn it into a single string.

The result is as follows:


test=# SELECT  *,
	string_agg(CASE WHEN diff > 0
			THEN 'u'::text
			ELSE 'd'::text END, '')
                OVER (ORDER BY id ROWS BETWEEN 2 PRECEDING
			AND 2 FOLLOWING) AS encoded
FROM    (   
                SELECT  *, data - lag(data, 1)
			OVER (ORDER BY id)  AS diff 
                FROM    
t_timeseries  
                ) AS x;
 id | data | diff | encoded
----+------+------+---------
  1 |   11 |      | duu
  2 |   14 |    3 | duud
  3 |   16 |    2 | duudu
  4 |    9 |   -7 | uuduu
  5 |   12 |    3 | uduuu
  6 |   13 |    1 | duuud
  7 |   14 |    1 | uuudu
  8 |    9 |   -5 | uudud
  9 |   15 |    6 | udud
 10 |    9 |   -6 | dud
(10 rows)

What you see here is that the encoded string starts with just three characters. The reason is that there are no preceding values so we only see what is ahead of us in the future. The query gives us some data along with an encoded string.

patterns in timeseries

For simplicity reasons the output if the query can be turned into a view:


CREATE VIEW v AS
SELECT  *, string_agg(CASE WHEN diff > 0
				THEN 'u'::text
				ELSE 'd'::text END, '')
                OVER (ORDER BY id ROWS BETWEEN 2 PRECEDING
				AND 2 FOLLOWING) AS encoded
FROM    (
        SELECT  *, data - lag(data, 1) OVER (ORDER BY id)  AS diff
        FROM    t_timeseries
) AS x;

Keep in mind that the encoder is pretty simple and is usually not enough to do a real world example. If you are planning to do real world timeseries analysis in PostgreSQL, the encoder (“time series codec”) might be a lot more sophisticated. The point here is just to give you some ideas of what can be done with a fairly simple technique using a standard PostgreSQL database.

Analyzing an encoded string

Now that the data has been encoded it can be analyzed easily using standard PostgreSQL features. Suppose we want to find all parts of the data in which the value was moving up at least 3 times in a row, the following simple query can be used:

test=# SELECT  * 
FROM    v 
WHERE   encoded LIKE '%uuu%';
 id | data | diff | encoded
----+------+------+---------
  5 |   12 |    3 | uduuu
  6 |   13 |    1 | duuud
  7 |   14 |    1 | uuudu
(3 rows)

Usually you might want to use a more sophisticated search algorithm to find a more complex pattern. Regular expressions might be pretty useful to look for more advanced things. It can also make sense to create a “distance” function and use KNN to look for areas in your timeseries, which are similar to what you are looking for. Basically a lot of stuff can be done easily if you are prepared to be a little creative.

Hans-Juergen Schoenig
Hans-Jürgen Schönig has 15 years of experience with PostgreSQL. He is consultant and CEO of the company „Cybertec Schönig & Schönig GmbH“ (www.cybertec.at, www.postgresql-support.de), which has served countless customers around the globe.