Today is one year since I started working seriously on pg_chameleon.
With this commit I changed the project's license to the 2-clause BSD and widened the project's scope, evolving it into a MySQL to PostgreSQL replica system.
Initially this change was just an experiment. I needed to synchronise the data between MySQL and PostgreSQL, and at that time the only option I had was to use the MySQL foreign data wrapper, eventually copying the data locally every now and then. However, because the previous implementation relied on a MySQL replica, this approach wasn't really feasible.
If you are curious about the background story and how we scaled the analytics database in Transferwise you can read it here.
I developed pg_chameleon in my spare time. I like to think of it as my little commute project.
The first test on large datasets happened during the amazing days of the pgconf eu 2016. I remember how the process was incredibly slow, taking an unacceptable amount of time: four days to copy a 600GB database. During the nights between the conference days I found the bottlenecks and built a faster implementation.
I also had to cope with the sql dialect conversion. The solution is still in progress.
Initially I decided to use an existing library, but after a few failures I realised that sqlparse didn't fit my needs.
So I took the occasion to learn regular expressions, and I doubled my problems at the same time.
In May I presented the project at the Estonian PostgreSQL User Group and the video is available here.
Currently the project is at version 1.6, which improves the replay speed and comes with a better status view showing the replay lag alongside the read lag.
The upcoming release 1.7 will add an optional threaded mode for the replica, where the read and replay processes will run independently.
This version will also add support for type override during the schema init and the DDL replay. This change will make it simpler to use pg_chameleon as a migration tool (e.g. converting tinyint(1) into a boolean).
However, the current replay implementation can break if incompatible data is pushed into the overridden fields (e.g. inserting a value >1 into a tinyint(1) will throw a type error on PostgreSQL if the column is mapped to boolean). I'm working on a solution.
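A minimal sketch of the idea behind such a type override. All names here are illustrative assumptions, not pg_chameleon's actual configuration or API; it only shows why a tinyint(1) value outside 0/1 breaks the replay once the column is mapped to boolean:

```python
# Hypothetical type override map: MySQL column type -> PostgreSQL type.
# Structure and names are illustrative, not pg_chameleon's real config.
TYPE_OVERRIDE = {
    "tinyint(1)": "boolean",
}

def convert_value(mysql_type, value):
    """Convert a replicated value according to the override map.

    Raises ValueError when the incoming data is not compatible with
    the overridden PostgreSQL type, mirroring the type error
    PostgreSQL would raise on replay (e.g. a tinyint(1) holding 2).
    """
    pg_type = TYPE_OVERRIDE.get(mysql_type)
    if pg_type == "boolean":
        if value in (0, 1):
            return value == 1
        raise ValueError(f"value {value!r} cannot be cast to boolean")
    # No override configured: pass the value through unchanged.
    return value

print(convert_value("tinyint(1)", 1))   # True
print(convert_value("int(11)", 42))     # 42
```

The last case is exactly the failure described above: `convert_value("tinyint(1)", 2)` raises, because MySQL happily stores 2 in a tinyint(1) while the boolean target cannot represent it.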
I've also started the development of version 2, but I haven't seriously kicked off the coding yet. The reason is that I'm still learning a lot thanks to the feedback I'm getting via GitHub.
I will start version 2 soon, and hopefully I will release the first alpha by the beginning of next year.
Meanwhile, I'm very happy to see pg_chameleon gaining popularity.
If my work helps even just one person to move from MySQL to PostgreSQL, I will feel satisfied.
So, happy birthday to pg_chameleon, my little pet project.
Wednesday, 23 August 2017
Monday, 19 June 2017
Bugfix emergency release v1.3.1
I discovered a regression when running init_replica, caused by a wrong handling of missing time in the master coordinates. Sorry about that.
After another round of bug fixes I've released version 1.3 of my pet project pg_chameleon.
The package is available on pypi as usual.
The changelog is available here.
If you have any questions or issues to discuss, I created a community on Gitter.
I've also added a RELEASE_NOTES file to explain the changes. As this release changes the relay storage, please have a look at the file if you plan to upgrade.
The initial implementation for the relay data was to have two log tables t_log_replica_1 and t_log_replica_2 with the replica process accessing one table at a time.
This approach allows autovacuum to take care of the unused partition while the other is being written.
The method worked fine with only one replica worker. However, as the flip-flop between the tables is calculated independently for each source, this could cause unwanted bloat on the log tables if several sources are replicating all together.
In this scenario autovacuum will struggle to truncate the empty space at the end of the table.
The pg_chameleon version 1.3 implements log tables per source. Each source now has a dedicated pair of tables, still inherited from the root partition t_log_replica.
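The per-source flip-flop can be sketched as follows. The table naming scheme and the helper below are assumptions for illustration, not pg_chameleon's actual internals; the point is that each source toggles between its own pair of tables, so autovacuum can reclaim the idle one regardless of what the other sources are doing:

```python
# Illustrative sketch of the per-source flip-flop between two log
# tables. The naming scheme t_log_replica_<n>_src_<id> is a guess;
# the real tables inherit from the root partition t_log_replica.

def active_log_table(source_id, flip):
    """Return the log table currently written by a given source.

    Each source owns its own pair of tables. The replica process
    writes one while autovacuum reclaims the other, then `flip`
    toggles between 0 and 1 to swap the roles.
    """
    return f"t_log_replica_{flip + 1}_src_{source_id}"

def next_flip(flip):
    """Swap the active table for one source, independently of others."""
    return 1 - flip

# Two sources flip independently, so a slow source no longer blocks
# vacuum truncation of another source's log table.
print(active_log_table(1, 0))  # t_log_replica_1_src_1
print(active_log_table(2, 1))  # t_log_replica_2_src_2
```

With the pre-1.3 shared pair of tables, one slow source kept rows alive at the end of the shared table, preventing autovacuum from truncating it; per-source pairs remove that coupling.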
The schema upgrade happens automatically at the first run.
Because the upgrade involves a data copy, it could take more time than usual. If the process seems frozen, do not stop it; otherwise you may lose your replica setup.
These are the upgrade steps you should follow:
- Stop all the replica sources. The show_status command must show all the rows in the stopped status
- Take a backup of the schema sch_chameleon with pg_dump
- Upgrade pg_chameleon with
pip install pg_chameleon --upgrade
chameleon.py upgrade_schema --config <your_config> --debug
- When the upgrade is finished start the replica process as usual