There are points in your life when you look into the mirror asking yourself if your model of the world you used to have is really still sustainable …
Some 5 years ago we fired up some 8.1 database servers which have successfully served XX bio transactions since then. Running a large PostgreSQL database successfully on UNIX over a number of years is actually nothing special. So, why care? Well, this thing is special — and trust me, it was never intended to be …
Imagine you start designing a simple database system for a small company (maybe 10 people). Things should be scalable and simply work for the given purpose. You might want to use some partitioning and maybe do a little trickery with indexing. So far so good … nothing special.
But, sometimes things just don’t work out as expected. In this special case it means: End users started to behave somehow totally differently then expected — and: there is A LOT more growth then ever anticipated. So, given the totally unexpected way people use the database, given the seriously higher growth in both users and amount of data the number of tables / partitions in the PostgreSQL systems went up and up and up and up ….
This kind of growth is especially scary when you take into account that in case of high growth it is usually not too easy to make a serious change in architecture within a small period of time (there is never enough time). The business side is usually occupied with sales, important extensions required by the customers and so on. Given a (meanwhile) complex application it is not too easy to simply introduce some PL/Proxy layer to scale out more easily without investing a lot of time and (hardly available manpower). So, you are actually forced to “tweak” things a little to make sure that you can actually buy some time for a change which can reflect the new needs …
And months go by and you keep tweaking and months go by and at some point you end up with something like this:
db=# \x Expanded display is on.
db=# SELECT count(*) FROM pg_tables; -[ RECORD 1 ]-- count | 2574107
I want to point out again that this was never intentional … nobody EVER planned to do put over 2 mio tables in the very same PostgreSQL database on the very same box. This happened definitely WAYS beyond my control — again; it was never intentional and NEVER planned this way. But still, it is impressing to see how well PostgreSQL behaves under really disgusting workloads.
Files, files, files …
Let us think for a moment how PostgreSQL stores tables inside a database instance — they simply go into the same directory unless you use tablespaces:
postgres@db1:/var/lib/postgresql/8.1/main/base/16385$ ls | wc –l
8104310
How can the operating system handle that? Well, first of all you have to adjust some kernel parameters to make this work. Secondly you got to make sure that you filesystem uses a tree down to the file level — otherwise a “seq scan” on the file list will kill you during lookup. Fortunately this was properly set when the system was set up.
Again, this is not sane design and it was never intended — nobody knew beforehand for how long this can be sustained. It just ended up a little scary
.
What really impresses me here is that this thing works like a charm. We did not have a single database crash in 5 years. Interestingly the fileystem is playing along well for years as well — no filesystem corruption for 5 years; it simply works like a charm while serving around 20 mio transactions a week. The load on the system is actually fairly mixed — there is a large number of small transactions but also a fair list of real killers (bulk load, creating / removing hundreds or tables at a time and so on).
Why not consolidate into larger tables?
Well, given the large number of tables people would actually think that using larger tables would make more sense. Given the load this is actually not quite true … recall, using a seriously partitioned structure you basically get a “free index pointer” to a chunk of data without actually having to index the data itself. If you want to read a million rows at once it is a lot cheaper to look up the right table containing and read it sequentially than to actually fetch 1 mio rows from an insane amount of data by traversing some insanely large index. A seq scan on the right 1 mio rows is a lot faster than to actually go through a (largely) uncached, giant index.
Tribute to PostgreSQL 8.1
It is seriously impressing how far you can squeeze even an old release such as 8.1 when things get really hard to manage. Given the vast number of improvements which went into PostgreSQL since 8.1 it is fun too imagine how much better and easier to handle a new release might be. Still, this kind of application is definitely a killer … and PostgreSQL can really show its potential.
PostgreSQL is famous for its scalability and especially for its exceptionally good transaction manager. Sometimes PostgreSQL is observed as so excellent that people forget about proper coding conventions.
Nested savepoints are an area which can cause issues in this area.
First of all: For the purpose of this blog entry a nested savepoint is this:
test=# BEGIN; BEGIN
… some action …
test=# SAVEPOINT a1; SAVEPOINT
… some action …
test=# SAVEPOINT a2; SAVEPOINT
… some action …
test=# SAVEPOINT a3; SAVEPOINT
If you nest thousands and thousands of savepoints inside one transaction without ever releasing a single savepoint you will figure out that after some time the performance of your transaction is about to decline rapidly.
The reason is fairly simple. Here is a brief output of some profiling session:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total time seconds seconds calls s/call s/call name 35.20 0.69 0.69 11442642 0.00 0.00 LWLockAcquire 28.06 1.24 0.55 11442642 0.00 0.00 LWLockRelease 20.41 1.64 0.40 11433942 0.00 0.00 SimpleLruReadPage_ReadOnly 5.61 1.75 0.11 11443596 0.00 0.00 TransactionIdPrecedes 4.08 1.83 0.08 11429226 0.00 0.00 SubTransGetParent 4.08 1.91 0.08 4716 0.00 0.00 SubTransGetTopmostTransaction 0.51 1.92 0.01 9728 0.00 0.00 slot_deform_tuple
Every additional savepoints adds some small additional effort needed inside the database to keep track of the transaction status and so on. so, if your transaction has accumulated thousands of unreleased savepoints you will see a lot of of heavy activity inside the backend which is simply related to locking. Nothing bad is going to happend and PostgreSQL will definitely survive 500 mio savepoints inside the same transaction — however, you might not be happy with the performance after all.
How to get around this issue:
test=# RELEASE SAVEPOINT a1; RELEASE
once a savepoint is not needed anymore it can simply be released. this will resolve the extensive CPU consumption shown in the list above.
So, after all … thank you all for having so much faith in us and that you think we can do millions of savepoints inside a transaction
. yes, we can — but, you should not necessarily do it.
When I am on the road as PostgreSQL consultant I am sometimes surprised about the amount of work which is wasted simply by the lack of knowledge. Recently I have come across a simple problem: How many people are in the room at a given point in time?
It is interesting to see how many lines of code are sometimes needed to solve a problem. Long programs seem to be an especially common fetish in the Java world.
How would you approach a problem like that in PostgreSQL?
Here is a simplified version:
test=# CREATE TABLE t_room (id serial, name text, action text); NOTICE: CREATE TABLE will create implicit sequence "t_room_id_seq" for serial column "t_room.id" CREATE TABLE We create a table with an ID, some name and an action. Now we feed some data:
test=# INSERT INTO t_room (name, action) VALUES ('hans', 'IN');
INSERT 0 1
test=# INSERT INTO t_room (name, action) VALUES ('paul', 'IN');
INSERT 0 1
test=# INSERT INTO t_room (name, action) VALUES ('james', 'IN');
INSERT 0 1
test=# INSERT INTO t_room (name, action) VALUES ('paul', 'OUT');
INSERT 0 1
test=# INSERT INTO t_room (name, action) VALUES ('john', 'IN');
INSERT 0 1
test=# INSERT INTO t_room (name, action) VALUES ('hans', 'OUT');
INSERT 0 1
This is already enough to make the fun start:
test=# SELECT *, sum(CASE WHEN action = 'IN' THEN 1 ELSE -1 END) OVER ( ORDER BY id) AS field FROM t_room; id | name | action | field ----+-------+--------+------- 1 | hans | IN | 1 2 | paul | IN | 2 3 | james | IN | 3 4 | paul | OUT | 2 5 | john | IN | 3 6 | hans | OUT | 2 (6 rows)
Basically we have defined 2 actions: Somebody can enter the room or he can leave it. Order is provided by the id we have used in the example. So, how can we translate this to a counter? First of all we have to define that we want to increment the counter by 1 if somebody goes in and decrement it by 1 if somebody leaves. now we can do a running total.
This can be achieved with a simple windowing function.
Sometimes problems which make people code for days can be solved in one line.
Dealing with banking-related is always an interesting thing. Last week I came across an interesting issue which seems to take some more thinking.
The challenge is actually quite simple: “We have to trace all changes to data and structure”. Well, the first thing does not seem like a huge problem but how about changes to data structures? Well, somebody would argue that …
SET log_statement TO mod;
… will do the job. This will display all DDLs and commands doing changes to data. However, consider …
test=# SET log_statement TO none;
SET
This is a perfectly valid SQL command leaving us with a problem when it comes to tracking and auditing. We cannot tell for certain who has modified what because logging of DDLs can be turned off by user interaction.
It seems some additional and reliable way to prohibit that is needed — alternatively some additional tracing / auditing mode might help as well. It does not seem to be too attractive to have permissions of GUC variables because it might add an extra level of complexity nobody might want.
This was a fairly interesting weekend. This years “Linuxwochen” made a stop in Vienna and people enjoyed 3 days of interesting presentations and talks.
What somehow surprised me was the fact the Microsoft gave some talks on some Open Source components provided by them — mostly in the area of open government data and cloud-compliance. It is amusing to see how bad the sentiment towards those people actually was … but, after all — my PostgreSQL talk had at least twice as many listeners than the “Open Data” talk provided by Microsoft before
.
I gave a talk called “PostgreSQL roadshow” (video will be available soon) which showed some new technologies in the PostgreSQL area such as KNN, PgOpenCL, writable CTEs and so on. It seems people are highly interested in some stuff which goes seriously beyond “key / value”.
