So what are Bloom indexes for Postgres?

A new feature called Bloom indexing (an implementation of Bloom filters) slipped in somewhat quietly with the latest 9.6 release, although new index methods are pretty rare. Maybe that's because it's shipped under the contrib modules (a.k.a. extensions) and served as a feature demo for the new Generic WAL interface, which in itself will enable easier implementation of more such new features in the future. At the moment of release I of course looked through the 9.6 release notes, spent a minute there, and thought – a new index type, great! But to be honest, the explanations were not too detailed and I didn't exactly understand all the bits involved, leaving me with the impression that it can only tell Postgres which search value(s) “do not exist” in a table, making its use limited. But now I've taken some time to look deeper into this new functionality and I was positively surprised – it is actually a fully working and potentially very useful index method for real-life problems, one that you should know about.

Working principles, and when to apply them?

First we should explain the intended use case for Bloom filters/indexes – when should we use them? The short answer: when we have a lot of columns and we query many of those columns in more or less random combinations using the equality operator (“=”), e.g. something like “SELECT * FROM tbl WHERE col1 = 2 AND col9 = ‘x’ AND col17 = 865”. Note the emphasis on the “equality operator” part – on the mathematical side, Bloom stores and compares hashes of values, and range operations don't make any sense there.

Basically Postgres calculates a hash for each of your column values and then stores some bits out of each hash as one index entry, together with the row's physical location info (as with every other index). Exactly this “merging” of many column values into one index entry – a signature, in our Bloom context – is where the effectiveness of this index type shines. In short, it can help you save a lot of disk space! Instead of 10 separate normal B-tree indexes you can now have a single Bloom index. It is lossy, though, meaning it won't give you perfect accuracy – matched values always need to be re-checked against the table – but from a probabilistic viewpoint it is “good enough” to be useful. If, of course, it is applied and configured correctly – and hopefully, after reading through this blog post, that shouldn't be a problem :) But okay, given we have thought about our application and have a hunch about a legit use case for Bloom – how does it work?
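To make the “merging” idea concrete, here's a minimal Python sketch of a per-row signature (an illustrative toy, not Postgres's actual hashing – the `SIG_BITS`/`BITS_PER_COL` names and the SHA-256-based seeding are my own stand-ins): each column value sets a couple of bits in a fixed-width signature, and a probe on any column subset matches whenever all of its bits are set – possibly a false positive, hence the mandatory heap recheck.

```python
import hashlib
import random

SIG_BITS = 80      # signature width, matching the extension's default
BITS_PER_COL = 2   # bits set per column value, the default

def _seed(col_no, value):
    # Stable int seed derived from (column, value), so the same value
    # lands on different bits in different columns.
    data = f"{col_no}:{value!r}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def column_bits(col_no, value):
    # Deterministically pick a few signature bit positions for one value.
    rng = random.Random(_seed(col_no, value))
    return {rng.randrange(SIG_BITS) for _ in range(BITS_PER_COL)}

def row_signature(row):
    # OR all per-column bit sets together into one lossy signature.
    sig = set()
    for col_no, value in enumerate(row):
        sig |= column_bits(col_no, value)
    return sig

sig = row_signature((2, "x", 865))

# Probe "col3 = 865": all probe bits must be present in the signature.
# A match only makes the row a candidate -- Postgres rechecks the heap.
assert column_bits(2, 865) <= sig
```

Note that distinct values can collide onto the same bits – that collision is exactly the lossiness that makes the recheck necessary.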

Basic usage of Bloom

So here’s how it goes in the simplest form:


CREATE EXTENSION bloom;  -- make sure “contrib” is installed

CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3)
    WITH (length=80, col1=2, col2=2, col3=4);

Looks somewhat similar to GIN/GiST indexes so far. Create the extension, define your index as using the Bloom index method, and basically start querying, hoping Postgres will use the index. And remember, only equality comparisons are supported, with the additional limitation that currently only the integer and text data types are supported. This shouldn't be too much of a show-stopper for human-input queries though, as we're not too good at remembering/typing floating point numbers and millisecond-precision timestamps.
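The equality-only restriction follows directly from the hashing. A tiny Python sketch (the `h()` function is an illustrative stand-in, not Postgres's per-type hash) shows why – hashes preserve equality but destroy ordering:

```python
import hashlib

def h(value):
    # Illustrative stand-in for a per-type hash function: map any value
    # to a 32-bit integer.
    return int.from_bytes(hashlib.sha256(repr(value).encode()).digest()[:4], "big")

# Equal inputs always produce equal hashes, so "=" can be answered ...
assert h(865) == h(865)
# ... but neighbouring inputs hash to unrelated numbers (their relative
# order is arbitrary), so nothing like "col17 BETWEEN 860 AND 870" can
# be derived from the stored bits.
assert h(864) != h(865)
```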

But what on earth do those extra parameters in the WITH part mean? Documentation says:

The index is created with a signature length of 80 bits, with attributes i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. We could have omitted the length, col1, and col2 specifications since those have the default values.

Things get fuzzy here … how does it add up exactly? A total of 80 bits is mapped to only 2×2 bits + 4 bits. Eh? I couldn't find any good examples or explanations by googling either, so – taram taram taraa … I decided to go for the absolute truth, meaning hitting the source code of the extension.

Going deeper

So after an hour of snooping around in the code (grinding my teeth, while simultaneously feeling a lot of respect for the authors, as this stuff is pretty clever in the end) I think I've got it figured out. In the process I also gave the algorithm a pet name – the “wicked semi-random bit twisting modulo game” 😉 For those who want to check it out themselves, the “light bulb on” parts of the code can be found in the signValue() function in blutils.c, but here's a small excerpt showing which signature bits are set for a single column value.


/*
 * Init hash sequence to map our value into bits. the same values in
 * different columns will be mapped into different bits because of step
 * above
 */
hashVal = DatumGetInt32(FunctionCall1(&state->hashFn[attno], value));
mySrand(hashVal ^ myRand());

for (j = 0; j < state->opts.bitSize[attno]; j++)
{
    /* prevent multiple evaluation in SETBIT macro */
    nBit = myRand() % (state->opts.bloomLength * SIGNWORDBITS);
    SETBIT(sign, nBit);
}

So what does it actually tell us? We see that there's quite some randomness involved – but luckily it answers my question about 2×2 bits + 4 bits adding up to 80 bits nicely. How it works is that for every single indexed column value, separate bit-storing locations (2 locations by default, but user-configurable per column in the WITH part) are calculated within the full signature – so the 80 bits are the total signature width, of which each row's values set only a handful. This basically guarantees that the whole available bit range (80 bits by default) is filled up in a uniform way, making good use of all of our bits. And the randomness “seed” is actually derived from the column index. From a security point of view a predictable seed is normally not something we would want, but here it makes perfect sense – with a truly random seed, indexing and index comparisons wouldn't be repeatable. Quite an elegant solution in the end!
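The loop above can be sketched in Python – a hedged approximation, not a port: `random.Random` stands in for the extension's private mySrand()/myRand() PRNG, and mixing the column number into the seed approximates the `hashVal ^ myRand()` step that makes identical values in different columns set different bits.

```python
import random

SIGNWORDBITS = 16          # bits per signature word in the extension
bloom_length = 5           # length=80 -> 5 words of 16 bits = 80 bits
bits_per_col = [2, 2, 4]   # col1=2, col2=2, col3=4 from the WITH clause

def sign_value(sign, attno, hash_val):
    # Approximation of mySrand(hashVal ^ myRand()): mix the column
    # number into the seed so the same value in different columns
    # lands on different bits.
    rng = random.Random(hash_val * 31 + attno)
    for _ in range(bits_per_col[attno]):
        # "modulo game": every bit position over the whole 80-bit
        # signature is reachable, so the range fills up uniformly.
        n_bit = rng.randrange(bloom_length * SIGNWORDBITS)
        sign.add(n_bit)

sign = set()
for attno, value in enumerate((2, 7, 865)):   # one row's column values
    sign_value(sign, attno, value)

# At most 2 + 2 + 4 = 8 of the 80 signature bits are set for this row
# (fewer if the chosen positions happen to collide).
assert len(sign) <= 8 and all(0 <= b < 80 for b in sign)
```

That's the resolution of the earlier puzzle: 80 is the signature width, while 2×2 + 4 is merely the number of bit positions a single row contributes to it.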

To be continued…

Hope you've followed through so far … it's probably enough material to digest for one blog post, but now that we've familiarized ourselves with the general concept and the inner mechanics, I plan to follow up with a sample test case, looking at performance and other factors, in the next post. Stay tuned!

Kaarel Moppel
I’ve been interested in databases for the last 10 years, the last 6 of them working exclusively with PostgreSQL. And still I’m constantly surprised by its powerful set of features and the fast pace of development by the globally friendly community. In my spare time I enjoy playing soccer and travelling.