Monday, 20 November 2017

pg_chameleon v2.0.0.alpha2 is out

Has been a while since I wrote a post on this blog.
I've been quite busy on coding pgchameleon's version 2.
I finally managed to release an alpha1 the 11th of November but this release had several issues which prevented the users to have real tests.
However, after a week of debugging I've released pg_chameleon v2.0.0.alpha2 which is now quite usable.

For a complete command line reference and an usage example click here.

The documentation now have a dedicated a page with the configuration file details explained.

Please note this pre-release and despite the changelog shows many fixes, there is still a lot of work to do.
Do not use it in production.

So, why pgchameleon v2 and why the version 1 can't be fixed?

Well, pg_chameleon started as an experiment and the version 1 was built without any clue on the architecture and what should be the final result.

This caused the version 1 to have several issues making very difficult to improve it without dismantling the existing logic.

Building up from fresh start required less time to reorganise the original in the correct way.
The code now is more modular, but is still not very pythonic.
My bad, I'm not a developer after all.

The major changes from the version 1 are the following.

  • Python 3 only development
  • Supports the replica from multiple MySQL schemas withing a single MySQL instance. The source's and target schema names can be different.
  • Conservative approach to the replica. Tables which generate errors are automatically excluded from the replica.
  • Daemonised init_replica process.
  • Daemonised replica process with two separated subprocess, one for the read and one for the replay.
  • Soft replica initialisation. The tables are locked when needed and stored with their log coordinates. The replica daemons will put the database in a consistent status gradually.
  • Rollbar integration for a simpler error detection.

The version 2 improves the display of the show_status command when a source name is specified.

There is also an error log in the replica schema which saves the exception's data during the replay phase.

The error log can be queried with the new command show_errors.

To install the version 2 please run.

python3 -m venv venv
source venv/bin/activate
pip install pip --upgrade 
pip install pg_chameleon==2.0.0a2

Release notes:

Please report any issue on github.

Wednesday, 23 August 2017

Happy birthday pg_chameleon

Today is one year since I started working seriously on pg_chameleon.
With this commit I changed the project's license to the 2 clause BSD and the project's scope, evolving the project into a MySQL to PostgreSQL replica system.

Initially this change was just a try. I needed to synchronise the data between MySQL and PostgreSQL and at that time the only option I had it was to use the MySQL foreign data wrapper, eventually to copy the data locally every now and then. However, because the previous implementation relied on a MySQL replica this approach approach wasn't really feasible.

If you are curious about the background story and how we scaled the analytics database  in Transferwise you can read it here.

I developed pg_chameleon in my spare time. I like to think about it like  my little commute project.

The first test on large datasets happened during the amazing days of the pgconf eu 2016. I remember how the process were incredibly slow, taking the unacceptable amount of time.  Four days to copy a 600GB database. I found the bottlenecks during the nights between the conference days building a faster implementation.

I also had to cope with the sql dialect conversion. The solution is still in progress.

Initially I decided to use an existing library but after few failures I realised that sqlparse didn't fit my needs.
So I took the occasion to learn how to use the regular expressions and I doubled my problems at the same time.

In May I presented the project at the Estonian PostgreSQL User Group and the video is available here.

Currently the project is at the version 1.6 which improves the replay speed and comes with better status view with the replay lag along the read lag.

The upcoming release 1.7 will add an optional threaded mode for the replica, where the read and replay processes will run independently.

This version will also see the support for the type override during the init schema and the ddl replay. This change will make simpler to use pg_chameleon as a migration tool (e.g. conversion of tinyint(1) into a boolean).

However the current replay implementation can result in a broken in case of not compatible data pushed into the data fiels (e.g. insert a value >1 in tinyint(1) will throw a type error on postgres if the data is boolean). I'm working on a solution.

I've also started the development of the version 2 but I've not yet kicked off seriously the coding yet. The reason why is that I'm still learning a lot of things thanks to the feedback I'm getting via github.
I will start the version 2 soon and hopefully I will release the first alpha by the beginning of the next year.

However I'm very happy to see pg_chameleon gaining popularity.

If my work even will help just one person to move from MySQL to PostgreSQL, I feel satisfied.

So, happy birthday to pg_chameleon, my little pet project.