Thursday, 29 January 2015

The cluster in action - part 2

2.4 The background writer

Before the spread checkpoints the only solution to ease down the IO spike caused by the checkpoint was to tweak the background writer. This process were introduced with the revolutionary PostgreSQL 8.0. The writer, as the name suggests, works in the background searching for dirty buffers to write on the data files. The writer works in rounds. When the process awakes scans the shared buffer for dirty buffers. When the amount of buffers cleaned reaches the value set in bgwriter_lru_maxpages the process sleeps for the time set in bgwriter_delay.

2.5 The autovacuum

The routine vacuuming is an important task to prevent the table bloat and the dreaded XID wraparound failure. If enabled the autovacuum launcher starts one daemon for each relation with enough dead tuples to trigger the conditions set in autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor. An autovacuum daemon is a normal backend and appears in the view pg_stat_activity. Because the XID wraparound failure is a really serious problem, the autovacuum to prevent wraparound starts even if the autovacuum is turned off.

2.6 The backends

The PostgreSQL backend architecture is the brilliant solution to a nasty problem. How to guarantee the buffers are read only by one session at time and avoid the bottleneck of a long waiting queue. When a backend needs to access a particular tuple, either for read or write, the relation's pages are accessed to find the tuple matching the search criteria. When a buffer is accessed then the backend sets a pin on the buffer which prevents the other backends requiring the same page to wait. As soon as the tuple is found and processed the pin is removed. If the tuple is modified the MVCC enforces the tuple's visibility to the other backends. The process is fine grained and very efficient. Even with an high concurrency rate on the same buffers is very difficult to have the backends entangled.
A backend process is a fork of the main postgres process. It's very important to understand that the backend is not the connection but a server process which interacts with the connection. Usually the backend terminates when the connection disconnects. However, if a client disconnects ungracefully meanwhile a query is running without signalling the backend, the query will continue only to find there's nothing listening on the other side. This is bad for many reasons. First because is consuming a connection slot for nothing. Also the cluster is doing something useless consuming CPU cycles and memory.

Like everything in PostgreSQL the backend architecture is oriented to protect the data and in particular the volatile shared buffer. If for some reasons one of the backend process crashes then the postgres process terminates all the backends in order to prevent the potential shared buffer corruption. The clients should be able to manage this exception resetting the connection.

2.7 Wrap up

The cluster's background activity remains most of the time unnoticed. The users and developers can mostly ignore this aspect of the PostgreSQL architecture leaving the difficult business of understanding the database heartbeat to the DBA, which should have the final word on any potential mistake in the design specs. The next chapters will explore the PostgreSQL's architecture in details, starting with the memory.

Monday, 26 January 2015

The cluster in action - part 1

The cluster in action

PostgreSQL delivers his services ensuring the ACID rules are enforced at any time. This chapter will give an outlook of a ``day in the life'' of a PostgreSQL's cluster. The chapter approach is purposely generic. At this stage is very important to understand the global picture rather the technical details.

After the startup

When the cluster completes the startup procedure it starts accepting the connections. When a connection is successful then the postgres main process forks into a new backend process which is assigned to the connection for the connection's lifetime. The fork is quite expensive and does not work very well for a high rate of connection's requests. The maximum number of connections is set at startup and cannot be changed dynamically. Whether the connection is used or not for each connection slot are consumed 400 bytes of shared memory.
Alongside the client's request the cluster have several subprocesses working in the background.

The write ahead log

The data pages are stored into the shared buffer either for read and write. A mechanism called pinning ensures that only one backend at time is accessing the requested page. If the backend modifies the page then this becomes dirty. A dirty page is not yet written on its data file. However the page's change is first saved on the write ahead log as WAL record and the commit status for the transactions is then in the directory clog or the directory pg_serial, depending on the transaction isolation level. The wal records are stored into a shared buffer's area sized by the parameter wal_buffers before the flush on disk into the pg_xlog directory on fixed length segments. When a WAL segment is full then a a new one is created or recycled. When this happens there is a xlog switch. The writes on the WAL are managed by a background process called WAL writer. This process were first introduced with PostgreSQL 8.3.

The checkpoint

The cluster, on a regular basis, executes an important activity called checkpoint. The frequency of this action is governed by the time and space, measured respectively in seconds and log switches between two checkpoints. The checkpoint scans the shared buffer and writes down to the data files all the dirty pages. When the checkpoint is complete the process determines the checkpoint location and writes this information on the control file stored into the cluster's pg_global tablespace. In the case of unclean shutdown this value is used to determine the WAL segment from where to start the crash recovery.
Before the version 8.3 the checkpoint represented a potential bottleneck because the unavoidable IO spike generated during the writes. That's the reason why the version 8.3 introduced the concept of spread checkpoints. The cluster aims to a particular completion target time measured in percent of the checkpoint timeout. The default values are respectively 0.5 and 5 minutes. This way the checkpoint will spread over a target time of 2.5 minutes. From PostgreSQL 9.2 a new checkpointer process has been created to manage efficiently the checkpoint.

Friday, 23 January 2015


The acronym RTFM stands for Read The Fucking Manual. It's quite obvious that reading the manual is the perfect approach for finding the correct information or at least a hint for finding the solution.

The false confidence or a lack of humbleness make people to forget how important is to read the documents. Understanding the documents is the next step, and is not simple indeed. In particular if the background is poor or the reader have preconceptions which alter the word meanings.

On the same message board mentioned in 1.4 an user asked some details about the binary data. Another "expert" explained how PostgreSQL had two ways of storing binary data. The bytea and the OID, which maximum size limit was respectively  one and two GB. I replied with a simple question. How is possible to store two GB in a four byte integer without strange voodoo rituals?

After an unpleasant argument finally the "expert" admitted he did not understood the mechanism used by PostgreSQL to workaround the TOASTed datum's size limit. If you are curious about that, well, RTFM!

Tuesday, 20 January 2015

Failure is not an option

The failure is not an option. Despite this statement is quite pretentious is also the rule number zero of any decent DBA. The task failure, should this be a simple alter table or an emergency restore, is not acceptable. The database is the core of any application and therefore is the most important element of the infrastructure.

In order to achieve this impossible level, any task should be considered single shot. Everything must run perfectly at the first run like if without rollback plan. However the rollback procedure must be prepared and tested alongside with the main task on the test environment in order to get a smooth transition if the rollback should happen. It's also very important to remember the checklist. This allow the DBA to catch the exact point of any possible failure ensuring a rapid fix when this happens.

Having a plan B gives a multiple approach for the task if something goes wrong. For example, when dealing with normal size databases* is possible to implement many strategies for the disaster recovery having a similar time needed for the recovery. When the amount of data becomes important this is no longer true. For example a logical restore takes more time than a point in time recovery or a standby failover. In this case the plan A is the physical restore.If this does not work then the logical recovery should be used instead.

 * I consider normal sized any database under the 500 GB

Saturday, 10 January 2015

DBA in a nutshell

I've already started writing the second volume of the database administration series.
The first volume was quite basic with enough information to be useful but not too much to scare people.

The second volume is strongly focused on the database administration and I decided to start it with a chapter explaining what it means to be a DBA.

Probably this will dissuade many people to start this exciting career.

So, here we go again.

A database administrator is a strange combination of theory and practice. A mix of strictness and loose rules. It's quite difficult to explain what exactly a DBA does. Working with databases requires passion, knowledge and a strange combination of empathy and pragmatism in order to try to understand what the DBMS is thinking. With the commercial products the knowledge is limited by the vendor's documentations and is improved mostly by the personal experience. With a free DBMS like PostgreSQL the source code's availability facilitates the knowledge acquisition. Also, reading the superb poetry which is the source code, which is C language, creates an intimate relation with the cold binary programs and the geniuses who create them.

A day in the life of a DBA does not have fixed boundaries. It can last one day or several months, for example if there are long procedures to run inside the well controlled maintenance windows. Every day is a different combination of the duties including the routine tasks, the monitoring and the proactive thinking. The latter is probably the distinctive mark between a good and DBA and a cheap professional. The capability of building a mental map for finding any possible issue, projected in the near future, can make the difference between spending the night sleeping or working frantically before the next day begins. Of course when the emergency strikes there is the fourth and most important duty. The emergency handling.

Routine tasks

Under the group of routine tasks fall the procedures which are well consolidated on the documents and in the DBA mind. For example, configuring the PostgreSQL's memory parameters should be something immediate. Any task in this category is successful if remains completely unnoticed. Is not unlikely for the live systems that the routine tasks are performed in antisocial hours like Sunday night or early morning in the working days.


A system without monitoring is an one way ticket to the disaster. Whether solution is used it should be something simple to configure and with a decently low rate of false positives. Having a nagging monitor is exactly the same like not having at all. The important alerts will pass completely unnoticed. Alongside with the general purpose solutions like nagios there is a new interesting PostgreSQL's dedicated called Configuring a passive error detector like tail_n_mail is a very good idea to trap any possible server's misbehaviour.

Proactive thinking

Reacting to the issues is fine. Trying to prevent them is much better. This kind of tasks is strictly related with the monitoring. For example let's consider a master and a slave configured in continuous recovery. Because the slave is made copying the data files we should expect both having the same size. A (not so) small size difference between the master and the standby server can tell us the filesystem have some problems or is configured not correctly on one box. In this example, assuming the filesystem is xfs a bloating box is caused by the dynamic EOF allocation introduced in the newer releases.

Emergency handling

Shit happens, deal with it. Sooner or later a disaster will strike requiring all the experience and competence of the database experts for putting back the system in production state. It doesn't matter if the issue is caused by an user screwing up the database or a power outage or a hardware failure or a fleet of Vogon spaceships ready to demolish the earth. The rule number zero when in emergency is never guess. Guessing is a one way ticket to the disaster, in particular if what is the action can destroy the database.
As example let's consider one of the most common causes of outage in the linux world, the distribution's upgrade. Usually when a linux distribution is upgraded, is very likely PostgreSQL will upgrade the binaries to a different major version. Now the data area is not compatible across the major releases. The immediate effect when a system is upgraded is PostgreSQL will refuse to start because the version's mismatch. The fix is quite simple. A dump and reload of the databases in the data area, or alternatively the upgrade in place with pg_upgrade, will get the cluster in production state. Few years ago a desperate user posted a message asking for help on a message board. One ``expert'' guessed pg_resetxlog would be the solution. Before reading further spend two minutes to read the pg_resetxlog documentation. In the emergency, the ``expert'' guess would transform a simple version's mismatch in a corrupted data area.

Thursday, 1 January 2015

Happy 2015

Finally I finished the first volume of the PostgreSQL book administration.
The book is free to download on slideshare or, if  you want the hard copy there is the button to order on

I'll build an ebook version in the next days and I'll make it available on amazon's kindle and the kobo.

I hope this book will spread the knowledge on PostgreSQL.

Anyway, I've not finished. I've already started a second volume and a manual for the developers willing to learn the noble art of the SQL writing.

Download the book :
Buy the paperback: 

Happy new PostgreSQL year everybody.