Friday, 30 May 2014

Chapter 6 part1 - Data files

The physical layout

This chapter explores the physical storage. We'll start from the data files moving deep into the data blocks and finally to the tuples. After looking to the tablespaces we'll complete the outline started in 5.7 with the MVCC.

Data files

Wherever the database stores the files, the $PGDATA/base or a different tablespace, those files are named, initially, after the relation's object identifier, a 4 byte unsigned integer. The word initially means there's no guarantee the file will stay the same in the future. Some file altering operations like the REINDEX or the VACUUM FULL change the file name leaving the relation's object identifier unchanged. The maximum size allowed for a datafile is 1 GB, then a new segment with the same name and a sequential suffix is created. For example if the relation 34554001 reaches the upper limit a new file named 34554001.1 is added, when this one reaches 1 GB then a 34554001.2 is added and so on.
Alongside the main data files there're some additional forks needed for the database activity.

Free space map

The free space map segment is present for the index and table's data files . It's named after the relation's filenode with the suffix _fsm and is used to track the free space in the relation.

Visibility map

The table's file are also called heap files. Alongside those files there is a second fork called visibility map. Like before this file is named after the relation's filenode with the suffix _vm. Its usage is for tracking the data pages having all the tuples visible to all the active transactions. This fork is used also for the index only scans where the data is retrieved from the index page only.

Initialisation fork

The initialisation fork is an empty table or index page, stored alongside the unlogged relation's data file. As seen in 5.3.2 when the database performs a crash recovery the unlogged relations are zeroed. The initialisation fork is used to reset them and all the relation's accessory forks are deleted.


All the current database's relations are listed in the pg_class system table. The field relfilenode shows the relation's filename.
The oid field, hidden if wildcard is used in the select list, is the internal object identifier. PostgreSQL is shipped with plenty of useful functions to get informations from the relation's OID. For example the function pg_total_relation_size(regclass) returns the space used by the table plus the indices, the additional forks and the eventual TOAST table. The function returns the size bytes. Another function, the pg_size_pretty(bigint), returns a human readable format for better reading.

The field relkind is used to track the relation's kind

Table 6.1: Relkind possible values
Value Relation's kind Read
r ordinary table
i index
S sequence
v view
m materialised view
c composite type
t TOAST table
f foreign table

Monday, 26 May 2014

Chapter 5 part 7 - Transactions


PostgreSQL implements the MVCC which stands for Multi Version Concurrency Control. This offers high efficiency in multi user access for read and write queries. When a new query starts a transaction identifier is assigned, the XID a 32 bit quantity. To determine the transaction's snapshot visibility, all the committed transactions with XID lesser than the current XID are in the past and then visible. Otherwise, all the transactions with XID greater than the current XID are in the future and not visible.
This comparison happens at tuple level using two system fields xmin and xmax having the xid data type. When a transaction creates a new tuple then the transaction's xid is put into the tuple's xmin value. When a transaction deletes a tuple then the xmax value is set to the transaction's xid leaving the tuple in place for read consistency. When a tuple is visible to any transaction is called a live tuple, a tuple which is no longer visible is a dead tuple.

PostgreSQL have no dedicated field for the update's xid. That's because when an UPDATE is issued PostgreSQL creates a new tuple's version with the updated data and sets the xmax value in the old version making it disappear.

A dead tuple can be reclaimed by VACUUM if no longer required by running transactions, anyway tables updated often can result in data bloat for the dead tuples and for the eventual indices.

When designing a new data model, the PostgreSQL's peculiar behaviour on the update should be the first thing to consider, in order to limit the table and index bloat.

Among the xmin,xmax two other system fields the cmin and cmax which data type is CID, command id. Those are similar to the xmin/xmax quantities and usage and their usage is to track the internal transaction's commands, in order to avoid the command execution on the same tuple more than one time. The pratical issue is explained in the well known Halloween Problem. For more informations take a look here .

The SQL standard defines four level of transaction's isolation levels where some phenomena are permitted or forbidden.
Those phenomena are the following.

  • dirty read A transaction reads data written by a concurrent uncommitted transaction
  • nonrepeatable read A transaction re reads the data previously read and finds the data changed by another transaction which has committed since the initial read
  • phantom read A transaction re executes a query returning a set of rows satisfying a search condition and finds that the set of rows satisfying the condition has changed because another recently-committed transaction
Table 5.2 shows the isolation levels with the allowed phenomena. In PostgreSQL it's possible to set all the four isolation levels but only the three more strict are supported. Setting the isolation level to read uncommited fallback to the read committed in any case.
By default the global isolation level is set to read committed, it's possible to change the session's transaction isolation level using the command:
To change the default transaction isolation level cluster wide there is the GUC parameter transaction_isolation.

Table 5.2: Standard SQL Transaction Isolation Levels
Isolation Level Dirty Read Nonrepeatable Read Phantom Read
Read uncommitted Possible Possible Possible
Read committed Not possible Possible Possible
Repeatable read Not possible Not possible Possible
Serializable Not possible Not possible Not possible

Saturday, 24 May 2014

Chapter 5 part 6 - Tablespaces


A tablespace is a logical shortcut for a physical location. This feature was first introduced with the major release 8.0, recently with the 9.2 had a small adjustment to make the dba life easier. When a new relation is created without tablespace specification, the relation's tablespace is set to the GUC prameter default_tablespace or, if this is missing, is set to the database's default tablespace. Anyway, without any specification the default tablespace is the pg_default, corresponding to the $PGDATA/base directory.
In order to create a new tablespace the chosen directory must be owned by the os user which started the postgres process and must be specified as absolute path.
For example, having a folder named /var/lib/postgresql/pg_tbs/ts_test a tablespace we can create a new tablespace ts_test.
OWNER postgres
LOCATION '/var/lib/postgresql/pg_tbs/ts_test' ;
Only superusers can create tablespaces. The OWNER clause is optional, if omitted the tablespace is owned by the user issuing the command.
The tablespaces are cluster wide, each database sees the same list in the pg_tablespace system table.
To create a relation into the tablespace ts_test just add the TABLESPACE clause followed by the tablespace name at creation time.
CREATE TABLE t_ts_test
                i_id serial,
                v_value text
TABLESPACE ts_test ;
It's possible to move a relation from a tablespace to another using the ALTER command.
For example, this is the command to move the previously created table to the pg_default tablespace.
ALTER TABLE t_ts_test SET TABLESPACE pg_default;
The move is transaction safe but requires an exclusive lock on the affected relation. If the relation have a significant size this means no access to the data for the time required by the move.
In addition, changing the tablespaces is not permitted when the backup is in progress, the exclusive lock is not compatible with the locks issued by the schema and data export.
The tablespace feature adds flexibility to the space management. Even if is still primitive a careful design can improve sensibly the performances, for example, putting tables and indices on different devices to maximise the disks bandwidth.
To remove a tablespace there is the DROP TABLESPACE command. The tablespace must be empty before the drop. There's no CASCADE clause to have the tablespace's contents dropped with the tablespace.
postgres=# DROP TABLESPACE ts_test;
ERROR:  tablespace "ts_test" is not empty

postgres=# ALTER TABLE t_ts_test SET TABLESPACE pg_default;
postgres=# DROP TABLESPACE ts_test;
In 6.4 we'll take a look to the how PostgreSQL implements the tablespaces on the physical side.

Thursday, 22 May 2014

Chapter 5 part 5 - Views


A view is the representation of a query, stored in the system catalogue for quick access. All the objects involved in the view are translated to the internal identifiers at the creation time; the same happens for any wild card which is expanded to the column list. An example will explain better the concept. Let's create a simple table. Using the generate_series() function let's put some data into it.
                i_id serial,
                t_content       text


                        generate_series(1,200) as i_counter
        ) t_series;

The SELECT * from t_data or v_data looks exactly the same, the view simply runs the stored SQL used at creation time. If we look to the stored definition in pg_views we'll find the wildcard is expanded into the table's columns.
 db_test=# SELECT * FROM pg_views where viewname='v_data';
-[ RECORD 1 ]--------------------
schemaname | public
viewname   | v_data
viewowner  | postgres
definition |  SELECT t_data.i_id,
           |     t_data.t_content
           |    FROM t_data;
Now let's add a new column to the t_data table and run again the select on the table and the view.
 ALTER TABLE t_data ADD COLUMN d_date date NOT NULL default now()::date;
 db_test=# SELECT * FROM t_data LIMIT 1;
 i_id |            t_content             |   d_date   
    1 | c4ca4238a0b923820dcc509a6f75849b | 2014-05-21
(1 row)

db_test=# SELECT * FROM v_data LIMIT 1;
 i_id |            t_content             
    1 | c4ca4238a0b923820dcc509a6f75849b
(1 row)
The view doesn't show the new column. To update the view definition a new CREATE OR REPLACE VIEW statement must be issued.
db_test=# SELECT * FROM v_data LIMIT 1;
 i_id |            t_content             |   d_date   
    1 | c4ca4238a0b923820dcc509a6f75849b | 2014-05-21
(1 row)
Because the views are referring the objects identifiers they will never invalidate when the referred objects are altered. The CREATE OR REPLACE statement updates the view definition only if the column list adds new attributes in the end. Otherwise, any change to the existing columns requires the view's drop and recreate.
When one or more view are pointing a relation this cannot be dropped. The option CASCADE in the drop statement will drop the dependant objects before the final drop. This is a dangerous approach though. Dropping objects regardless can result in data or functionality loss.
When a drop is blocked by dependant objects the database emits a message with the informations about the dependencies. If the amount of objects is too much big it's better to query the pg_depend table to find out the correct dependencies. This table lists all the dependencies for each object using a peculiar logic. We'll take a deep look in 5.8.
As seen before a view is a logical short cut to a pre saved query. This means the database will follow all the steps to execute exactly the same way if the entire query has been sent via client, except for the network overhead.
Nothing forbids a view to point another view inside the definition or join the one or more views in a different query. This can cause massive regression on the overall performance because each view require an execution plan and mixing the views will cause not efficient planning.
To mark a relation is a view it's a good idea to use a naming prefix like v_. This will distinguish them from the tables marked with the prefix t_. In 9 we'll take a to the naming conventions to let the database schema self explanatory.
PostgreSQL from the version 9.3 supports the updatable simple views. A view is simple if

  • Have exactly one entry in its FROM list, which must be a table or another updatable view
  • Does not contain WITH, DISTINCT, GROUP BY, HAVING,LIMIT, or OFFSET clauses at the top level
  • Does not contain set operations (UNION, INTERSECT or EXCEPT) at the top level
  • All columns in the view's select list must be simple references to columns of the underlying relation. They cannot be expressions, literals or functions. System columns cannot be referenced, either
  • columns of the underlying relation do not appear more than once in the view's select list
  • does not have the security_barrier property
If the view doesn't fit those rules it's still possible to make it updatable using the triggers with the INSTEAD OF clause.
The major version 9.3 introduces also the materialised view concept. This is a physical snapshot of the saved SQL and can be refreshed with the statement REFRESH MATERIALIZED VIEW.

Sunday, 18 May 2014

Chapter 5 part 4 - Indices


An index is a relation capable to map the values in an structured way pointing the table's position where the rows are stored. The presence of an index doesn't mean this will be used for read. By default and index block read have an estimated cost four times than the table block read. However the optimiser can be controlled using the two GUC parameters seq_page_cost for the table sequential read read and random_page_cost for the index. Lowering the latter will make more probable the optimiser will choose the indices when building the execution plan. Creating indices is a complex task. At first sight adding an index could seem a harmless action. Unfortunately their presence adds overhead to the write operations unpredictably. The rule of thumb is add an index only if really needed. Monitoring the index usage is crucial. Querying the statistics view pg_stat_all_indexes is possible to find out if the indices are used or not.
For example, the following query finds all the indices in che public schema, with zero usage from the last database statistics reset.
        AND     idx_scan=0
PostgreSQL supports many index types.
The general purpose B-tree, implementing the Lehman and Yao's high-concurrency B-tree management algorithm. The B-tree can handle equality and range queries and returns ordered data. As the data is actually stored in the page and because the index is not TOASTable, the max length for an index entry is 1/3 of the page size. This is the limitation for the variable length indexed data (e.g. text).

The hash indices can handle only equality and aren't WAL logged. That means their changes are not replayed if the crash recovery occurs, requiring a reindex in case of unclean shutdown.

The GiST indices are the Generalised Search Tree. The GiST is a collection of indexing strategies organized under an infrastructure. They can implement arbitrary indexing schemes like B-trees, R-trees or other. The operator classes shipped with PostgreSQL are for the two elements geometrical data and for the nearest-neighbor search. As the GiST indices are not exact , when scanned the returned set doesn't requires a to remove the false positives.

The GIN indices are the Generalised Inverted Indices. This kind of index is optimised for indexing the composite data types or vectors like the full text search elements. The GIN are exact indices, when scanned the returned set doesn't require recheck.
There's no bitmap index implementation in PostgreSQL. At runtime the executor can emulate partially the bitmap indices reading the B-tree sequentially and matching the occurrences in the on the fly generated bitmap.
The index type shall be specified in the create statement. If the type is omitted then the index will default to the B-tree.
 CREATE INDEX idx_test ON t_test USING hash (t_contents);
As the index maintenance is a delicate matter, the argument is described in depth in 7.

Friday, 16 May 2014

PostgreSQL 9.4 Beta 1 Released

Today the PostgreSQL 9.4 beta1 has been released.
Nice and interesting new features indeed.

For me the top are the ALTER SYSTEM SET, finally a way to control the configuration file via SQL like Oracle does from ages.

Reductions in WAL volume, great news for standby servers shipping  or stream the wal on slow networks.

The full announcement is here

Thursday, 15 May 2014

Chapter 5 part 3 - The Tables


A database is the logical container of the relations. A relation is the relational object making the data accessible. The first kind of relation we'll take a look is the table.
This is the fundamental storage unit for the data. PostgreSQL implements various kind of tables having the full support, a partial implementation or no support at all for the durability.
The table creation is performed using the standard SQL command CREATE TABLE.
The PostgreSQL implementation does not guarantee the data is stored in a particular order. This is a straight MVCC consequence Take a look to 5.7 for more information.

Logged tables

If executed without options, CREATE TABLE creates a logged table. This kind of table implements fully the durability being WAL logged at any time. The data is managed in the shared buffer, logged to the WAL and finally consolidated to the data file.

Unlogged tables

This kind of table were introduced in the 9.1. The data is still consolidated to the data file but the blocks aren't WAL logged. This make the write operations considerably faster at the cost of the data consistency. This kind of table is not crash safe and the database truncate any existing data during the crash recovery. Also, because there's no WAL record the unlogged tables aren't replicated to the physical standby.

Temporary tables

A temporary table's lifespan lasts the time of the connection. This kind of tables are useful for any in memory operation. The temporary table stays in memory as long as the amount of data is no bigger than temp_buffers seen in .

Table inheritance

As PostgreSQL is an Object Relational Database Management System, some of the object oriented programming concepts are implemented. The relations are referred generally as classes and the columns as attributes. The inheritance binds a parent table to one or more child tables which have the same parent's attribute structure. The inheritance can be defined at creation time or later. If a manually defined table shall inherit another the attribute structure shall be the same as the parent's.
The PostgreSQL implementation is rather confusing as the unique constraints aren't globally enforced on the inheritance tree and this prevents the foreign key to refer inherited tables. This limitation makes the table partitioning tricky.

Foreign tables

The foreign tables were first introduced with PostgreSQL 9.1, improving considerably the way the remote data can be accessed. A foreign table requires a called foreign data wrapper to define a foreign server. This can be literally anything. Between the contrib modules PostgreSQL has the file_fdw to create foreign tables referring CSV or COPY formatted flat files. In the the version 9.3 finally appeared the postgres_fdw with the read write for the foreign tables. The postgres_fdw implementation is similar to dblink with a more efficient performance management and the connection caching.

Wednesday, 14 May 2014

WIP PostgreSQL Dba Book

I've put the wip pdf on slideshare. The file is obviously incomplete.
You ar  free to download it and redistribute under the terms of the licence CC Attribution-NonCommercial-ShareAlike.

Here's the preview.

Any feedback is very welcome.

Saturday, 10 May 2014

Chapter 5 part 2 - The Databases

The Databases

In order to establish the connection, PostgreSQL requires a target database. When the connection happens via the client psql, the database can be omitted. In this case the environment variable $PGDATABASE is used. If the variable is not set then the target database defaults to the login user.
This can be confusing because even if the pg_hba.conf authorises the connection, this aborts with the message
postgres@tardis:~$ psql -U test -h localhost
Password for user test: 
psql: FATAL:  database "test" does not exist
In this case if the database to connect to is unknown, the connection should contain the template1 as last parameter. This way the connection will establish to the always present template1 database . After the connection is established a query to the pg_database system table will return the database list.
postgres@tardis:~$ psql -U test -h localhost template1
Password for user test: 
psql (9.3.4)
SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
Type "help" for help.

template1=> SELECT datname FROM pg_database;
(3 rows)
A brief note for the database administrators coming from MS SQL or MySql. The database postgres can be confused for the system database, like the master db or the mysql database.
That's not correct, the postgres database have nothing special and it's created by default only in the recent major PostgreSQL versions because is required by specific tools like pg_bench.
The template0 and template1 are the template databases. A template database is used to build new database copies via the physical file copy.
During the initdb the template1 is initialised with the correct references to the WAL records. The system views and the procedural language PL/PgSQL are then loaded into the template1. Finally the template0 and postgres databases are created from template1.
The database template0 doesn't allow the connections and is used to rebuild template1 if gets corrupted or to build a new database with character encoding or ctype an different from the cluster wide values.
ERROR:  new LC_CTYPE (en_US.UTF-8) is incompatible with the LC_CTYPE of the 
template database (en_GB.UTF-8)
HINT:  Use the same LC_CTYPE as in the template database, or use template0 as 

TEMPLATE template0;
If not specified, the CREATE DATABASE statement will use template1. A database can be renamed or dropped with the ALTER DATABASE and DROP DATABASE statements. Those operations require the exclusive access to the database. If any connection except the one performing the operation is present on the database the operation will abort.
postgres=# ALTER DATABASE db_test RENAME TO db_to_drop;

postgres=# DROP DATABASE db_to_drop;

Saturday, 3 May 2014

This week update

For this week I'll not update the blog with the new chapter.
I'm scattered around time and space and I've little time to concentrate on writing.

Anyway I'll take the occasion to stress a little on the source of informations about PostgreSQL.
As someone will surely notice on my blog there's no indication about the postgresql releases nor informations about the critical releases.

The reason is because it's better to refer to the official website.

There're few websites which display the PostgreSQL releases in a foreign language.
Yes, bunch of idiots, I'm talking about you.

If you follow their outdated informations you will miss critical bugs which can harm your standbys or result in a catastrophic data loss exposing your data directory to dogs and pigs.

So this week's lesson is, follow for any release and don't trust anything except the official releases.