Tuesday, 25 December 2012

The Christmas Shitload Post

Happy Christmas! Believe it or not we are almost done. We’re almost out
of winter!

This post is my present for a brilliant new PostgreSQL year.
You know, Christmas usually means goodness, happiness, joy and
And if you’re waiting for this, well, you will be hardly disappointed.
I decided to close the year writing my feelings, about VACUUM FULL, one
of the  PostgreSQL’s key functionality and an interesting thing
I discovered on the utility script shipped with the debian package for controlling
the clusters, pg ctlcluster.

And believe me if I tell you I feel like the Angry Birds, when the bad piggies
stole their eggs.

VACUUM two face

VACUUM is one of the most important functions shipped with PostgreSQL. It’s
main goal is to free space cleaning the dead rows generated during the normal
database activity.

As the conventional VACUUM simply marks as reusable the dead rows,
and freeze the tuples avoiding the s***t to hit the fan, preventing the XID
wraparound failure, this does not shrink the data file.
If you have a large table with many dead rows after the VACUUM you will
see the table stop growing.
If you need to reclaim space on disk the VACUUM FULL is what do you
need as it shrinks down the datafile locking in accessExclusive mode the affected
relation and preventing the read and the write.

During a VACUUM FULL on a massively bloated table I noticed a strange

The free space on the table’s tablespace lessened and then suddendly stopped
meanwhile the free space on the pg default began to lower until the filesystem
runt out of space with the subsequent failure of the task.

I started investigating the problem increasing the client verbosity with the
SET client min messages=’debug’;

I’ll explain what I’ve found with a simple test using a
copy of the pg attribute table.

To watch what’s happening I’ve switched on the debug message level for the
client and I’ve turned on the trace sort parameter during a single column index
creation on the relname field.
This is the entire procedure output.

db_test =# CREATE TABLE t_vacuum AS
pg_attribute ;
-- repeat this many times in order to
-- fill the table with a good amount of data
db_test =# INSERT INTO t_vacuum
db_test -# SELECT * FROM t_vacuum
db_test -# ;
INSERT 0 2155
SET trace_sort = ’ on ’;
SET client_min_messages = ’ debug ’;
db_test =# CREATE INDEX
idx_attname ON t_vacuum USING btree ( attname );
DEBUG : building index " idx_attname " on table " t_vacuum "
LOG : begin index sort : unique = f , workMem = 16384 , randomAccess = f
LOG : switching to external sort with 59 tapes : CPU 0.04 s /0.48 u sec elapsed 0.53 sec
LOG : performsort starting : CPU 0.04 s /0.72 u sec elapsed 0.77 sec
LOG : finished writing run 1 to tape 0: CPU 0.08 s /2.56 u sec elapsed 2.65 sec
LOG : finished writing final run 2 to tape 1: CPU 0.08 s /2.56 u sec elapsed 2.65 sec
LOG : performsort done ( except 2 - way final merge ): CPU 0.09 s /2.61 u sec elapsed 2.72
LOG : external sort ended , 836 disk blocks used : CPU 0.14 s /2.84 u sec elapsed 3.16 se

As my work mem is small the index sort switch immediately on an external
sort in order to build the index, then generate the btree from the external sort
Now let’s run a VACUUM FULL the same table.

db_test =# VACUUM FULL t_vacuum ;
DEBUG : building index " pg_toast_16411_index " on table " pg_toast_16411 "
LOG : begin index sort : unique = t , workMem = 16384 , randomAccess = f
LOG : begin index sort : unique = f , workMem = 1024 , randomAccess = f
LOG : internal sort ended , 17 KB used : CPU 0.00 s /0.00 u sec elapsed 0.00 sec
LOG : performsort starting : CPU 0.00 s /0.00 u sec elapsed 0.00 sec
LOG : performsort done : CPU 0.00 s /0.00 u sec elapsed 0.00 sec
LOG : internal sort ended , 17 KB used : CPU 0.00 s /0.00 u sec elapsed 0.08 sec
DEBUG : vacuuming " public . t_vacuum "
DEBUG : " t_vacuum " : found 0 removable , 275840 nonremovable row versions in 4756 page
DETAIL : 0 dead row versions cannot be removed yet .
CPU 0.17 s /0.99 u sec elapsed 1.87 sec .
DEBUG : building index " idx_attname " on table " t_vacuum "
LOG : begin index sort : unique = f , workMem = 16384 , randomAccess = f
LOG : switching to external sort with 59 tapes : CPU 0.07 s /0.51 u sec elapsed 0.58 seLOG : finished writing final run 2 to tape 1: CPU 0.14 s /2.58 u sec elapsed 2.73 sec
LOG : performsort done ( except 2 - way final merge ): CPU 0.14 s /2.64 u sec elapsed 2.80
LOG : external sort ended , 836 disk blocks used : CPU 0.21 s /2.85 u sec elapsed 3.27 se
DEBUG : drop auto - cascades to type pg_temp_16402
DEBUG : drop auto - cascades to type pg_temp_16402 []
DEBUG : drop auto - cascades to toast table pg_toast . pg_toast_16411
DEBUG : drop auto - cascades to index pg_toast . pg_toast_16411_index
DEBUG : drop auto - cascades to type pg_toast . pg_toast_16411

The interesting part begins at row 9. The database actually performs the
VACUUM, shrinking down the data file and cleaning the dead rows then at row
13 we found the same index creation output.
This explains well why the VACUUM FULL failed.

As VACUUM FULL acts exactly as CLUSTER,  during the process the
table is read sequentially and the data output sent to a new node file and this
happens on the same tablespace where the original table belongs.

The index build is performed on the same index tablespace except for the
sort on disk which is governed by the temp tablespaces parameter. And if this
is not set then the sort is performed on the pg default.

In conclusion, if you want to run a VACUUM FULL you will need space for the new
file node on the same table’s tablespace. The indexes do require the space on the
same tablespace plus the sort on disk, if you don't have enough memory.

So, if you have space shortage on the pg_default, the command SET temp_tablespaces to a
more spacious tablespace will prevent the pg default to run out of space.

Aggressive shutdown

Working with the hot standby on debian I had a bad experience when I’d set up
a new slave server. After the data copy with the archive command started I ac-
tually started the slave forgetting to set the hot standby parameter to on. After
a while the slave reached the consistent state and I issued the pg ctlcluster com-
mand in order to stop the cluster itself with the –force to have a fast shutdown
instead of a smart one.
The script doesn’t have an option to switch between PostgreSQL’s shutdown
modes but, if without options performs a smart shutdown mode.
Adding the –force option the script first try a fast shutdown mode, if the
database doesn’t shutdown in a reasonable time the script switch to the imme-
diate shutdown, if this is not enough to stop the instance finally a kill -9 on
the postgres process is performed.


If you are too much delicate I suggest you to stop reading, now begins the real

I understand the hard efforts done by the global team to create a first class
product, I left Oracle for PostgreSQL and I don’t regret my choice because
PostgreSQL is a damn good database, probably too much good.

But can anyone explain me why, in the name of one of any of the thousands
fake gods invented by the human race, no one put the VACUUM FULL be-
haviour in the on line manual leaving this HUGE lack of documentation on

Regarding the debian's script, I understand the –force option of pg ctlcluster is not
 to be run in normal condition, but does the author understand what
does it means to perform in normal conditions a smart shutdown?

A user that forget to logout can let the script to run forever.
On the other side if the –force is passed to the script and the database takes too much to
shutdown an immediate shutdown is performed without warnings.

And this will wipe out any bloody unlogged table from the cluster on
the next startup.

Then, as cherry on the top, if the immediate does not work a kill -9 is
performed on the postmaster leaving the shared memory as garbage.

And this works the same way if you use the debian’s init script.

HEAVY MACHINE GUN! (Metal slug’s quote)
So, my last advice, is do not use the wrapper pg ctlcluster for stopping the
clusters created with pg createcluster.
It’s better to shutdown the cluster using the command.

/usr/lib/postgresql/<MAJOR VERSION>/bin/pg ctl stop -m <THEWAYYOUWANT>


Thursday, 20 December 2012

Time and relative arrays in space

Scale what?

Few years ago when I was self employed as PostgreSQL consultant and I was
appointed a big performance tuning task by a great search engine (no, not
THAT search engine).

The application was similar to the Google’s desktop, a grid of boxes with
RSS feeds. Unfortunately the implementation of the two dimensional grid had
bad performance issues.

Any position change with a drag and drop, a delete or an add, even in low
activity periods, required the consequent database update to run for at least
20 seconds. During the peak time the system was unusable as the php hit the
execution limit timeout and this caused any transaction to abort.

The simple design, a table with two integer columns mapping the X-Y co-
ordinates with a unique constraint to avoid duplicates, wasn’t scaling and the
problem was caused by the unique constraint itself.

As the PostgreSQL 8.4 didn’t supported the deferrable unique constraints
the position change required an update for each row in order to keep the sequence
without holes, and no way to run a bulk update because the unique constraint
was violated during the update.

This approach worked well with the stage, once pushed to the live system
the map table became several millions rows big and the updates, with all the
overhead introduced by the indexes itself simply did not scale.

Scale me this!

After one week of analysis and test I’d realized that the solution was to rewrite
completely the application logic changing the conventional table into an aggre-
gated table.

The original table had four fields. userid, boxid,xposition,yposition.

I moved the boxid, from the map table field into a two dimensional text
array binding and adding the userid column as primary key.

In that way I could access the array by userid and get the boxid simply
passing the x-y coordinate. The task to build the box grid became incredibly
simple, with the array unnest.

With this structure any box move,add or delete with drag and drop was
performed over the array values and required only one table row update.

The previous implementation required for the move at least three updates.
For the box delete one database delete with a new sequencing for all the re-
maining rows. For the add a re sequencing for all the rows and an insert.
After the new implementation gone live the improvement was massive.
In peak activiy the longest update was 0.7 s.


After this experience, recently I used the same approach to implement a grid of
FIFO queues simulating a physical structure into the abstract data structure.

Obviously the arrays are not the solution for any scale problem, but if care-
fully designed can be used to aggregate data easing massively the database

As there’s no implementation of foreign keys over the array’s elements is not
possible, in theory, to have a referential integrity on those data structures.
I heard a proposal to have this feature added to PostgreSQL.

IMHO this is absolutely useless and if you need a foreign on an array, then probably
your database design is wrong.

If you are wondering where is it the web product described in this post, then it’s
a shame it was decommissioned after one year I did my job.