Filtering dozens of columns
Is there anybody out there who has not scratched their head when it comes to
indexing? Once in a while we all ask ourselves how to index a very large
table with dozens of columns. This is not PostgreSQL specific but rather
something which will happen in any kind of database system.
The natural question is: Which columns REALLY need an index?
In many cases programmers will know which columns are searched most often, but what if
the combinations used to query the table are totally random? What if filtering on the
10th column is just as likely as filtering on the 15th or the 29th column?
The problem is that it is often simply not possible to index ALL columns -
indexing everything is by far too expensive, and write performance
would be degraded way too much. Again, this is not just true for PostgreSQL but for
any other relational database out there.
Over the weekend I had some time to take a closer look at Oleg Bartunov's
"bloom" filter implementation. The bloom filter package was written some
years ago but was
never ported to a recent version of PostgreSQL. Fortunately it was not too hard
to teach the module some PostgreSQL 9.2, so I gave it a try. The results are
impressive as far as performance and space consumption are concerned.
How does it work? Internally a bloom filter is based on a bit array which is
calculated for a set of columns inside a row. It allows fast and efficient
pre-filtering of data. The main advantage is that the index itself is really
small compared to a whole set of btree indexes.
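To illustrate the general idea (this is a conceptual sketch, not how the PostgreSQL module is actually coded - the bit array size and hash scheme are made up), here is a minimal bloom filter in Python:

```python
import hashlib

class BloomFilter:
    """A minimal bloom filter: a bit array plus k salted hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not present"; True means "maybe present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("a=20")
bf.add("b=15")
print(bf.might_contain("a=20"))   # True - added items are always found
print(bf.might_contain("a=999"))  # False unless we hit a (rare) false positive
```

The key property is the asymmetry: a negative answer is definite, a positive answer may be a false positive - which is exactly why such an index only pre-filters and the actual rows still have to be rechecked.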
Here is an example:
CREATE TABLE tab (a int4, b int4, c int4);

INSERT INTO tab SELECT x, x, x FROM generate_series(1, 100000) AS x;
Then the table can be indexed:
CREATE INDEX bloomidx ON tab USING bloom(a, b, c);
The index is then used to fetch data for any kind of combination:
SET enable_seqscan TO off;

EXPLAIN ANALYZE SELECT * FROM tab WHERE c=20 AND b=15;

                              QUERY PLAN
---------------------------------------------------------------------------
 Bitmap Heap Scan on tab  (cost=0.00..4.02 rows=1 width=12) (actual time=0.972..0.972 rows=0 loops=1)
   Recheck Cond: ((c = 20) AND (b = 15))
   Rows Removed by Index Recheck: 1
   ->  Bitmap Index Scan on bloomidx  (cost=0.00..0.00 rows=1 width=0) (actual time=0.963..0.963 rows=1 loops=1)
         Index Cond: ((c = 20) AND (b = 15))
 Total runtime: 1.056 ms
(6 rows)
Note that the index is used - even without a filter on the "a" column. In case
of a btree on (a, b, c) this would not be the case, because a btree can only be
used efficiently when the leading columns are filtered. A bloom filter has no
such restriction, which makes a lot of sense if you have many different columns.
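Why does any column combination work? A rough sketch of the idea (again an illustration with made-up parameters, not the module's actual code): each row gets one signature into which every (column, value) pair is hashed, and a query only needs to test the bits of the columns it actually filters on - there is no leading-column requirement:

```python
import hashlib

NUM_BITS = 64  # per-row signature length (made-up number)

def bit_positions(column, value, k=2):
    # Hash the (column, value) pair into k positions of the signature.
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{column}={value}".encode()).digest()
        yield int.from_bytes(digest[:4], "big") % NUM_BITS

def row_signature(row):
    # One shared bit array per row, covering ALL indexed columns.
    sig = 0
    for column, value in row.items():
        for pos in bit_positions(column, value):
            sig |= 1 << pos
    return sig

def might_match(sig, conditions):
    # Check only the columns the query filters on - any subset works.
    for column, value in conditions.items():
        for pos in bit_positions(column, value):
            if not (sig >> pos) & 1:
                return False  # definitely no match
    return True  # maybe a match - the heap tuple must be rechecked

sig = row_signature({"a": 20, "b": 15, "c": 7})
print(might_match(sig, {"b": 15, "c": 7}))  # matches without filtering "a"
print(might_match(sig, {"b": 99}))          # filtered out, barring a false positive
</imports> sketch shows, rows that pass the signature test can still be false positives - which is exactly the "Rows Removed by Index Recheck" line in the plan above.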
It is worth mentioning that the index itself is about a third of the size of the
table used in this example. So, space consumption is a lot lower than if we had
used 3 indexes.