Wednesday, March 12, 2008

Custom replication

One of our requirements is to have multiple master databases merge into a single slave database. The built-in MySQL replication engine does not support this kind of setup. A Google search came up with this paper: http://www.shinguz.ch/MySQL/mm-single-slave-repl.pdf. In it, the author describes a way of using a middle-man database slave with federated tables pointing to the real slave. This approach would work, but there are still a few bugs in the federated table engine (especially around auto-increment keys).

So I was thinking what would it take to create a custom solution. Here is what I came up with:

First a little diagram:

[M1]  [M2]
  |    |
   \  /
   [S]

We have 3 servers. M1 and M2 are masters and they do not share any information, so one going down would not affect the other. The data in both M1 and M2 is replicated to S in a timely manner.

Let's say both M1 and M2 have a table t that we want to replicate into S. The structure of t is as follows:

CREATE TABLE `t` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `s` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

The first challenge is to make sure that records in M1 and M2 do not conflict. To achieve that, we set up M1 and M2 to produce non-conflicting auto-increment keys using the settings "auto-increment-offset" and "auto-increment-increment". For instance, M1 would produce records with odd-numbered keys and M2 would produce records with even-numbered keys.
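As a sketch, the same settings can be applied at runtime as well (the my.cnf option names use dashes; the corresponding system variables use underscores):

```sql
-- On M1: odd keys (1, 3, 5, ...)
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset = 1;

-- On M2: even keys (2, 4, 6, ...)
SET GLOBAL auto_increment_increment = 2;
SET GLOBAL auto_increment_offset = 2;
```

Note that SET GLOBAL only affects new connections and does not survive a restart, so the my.cnf settings are still needed for a permanent setup.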

Now, for each table t in M1 and M2, create a corresponding federated table (fed_t1 and fed_t2) in S as follows:

CREATE TABLE `fed_t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `s` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=FEDERATED CONNECTION='mysql://user@masterdbhost:3306/test/t';

So now, S contains a "view" into both M1 and M2.
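Concretely, the two definitions differ only in their CONNECTION strings. Assuming hypothetical hostnames m1host and m2host for the two masters, fed_t2 would look like this:

```sql
-- fed_t2 points at M2's copy of t; m2host is a placeholder hostname
CREATE TABLE `fed_t2` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `s` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=FEDERATED CONNECTION='mysql://user@m2host:3306/test/t';
```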

The next step is to regularly query fed_t1 and fed_t2 and move the new records into table t in S. To keep things simple, assuming that we are only interested in new records, we can first set up a table that remembers the last id copied over from each table. I call this table insert_pos:

mysql> select * from insert_pos;
+----+------------+-----------+
| id | table_name | latest_id |
+----+------------+-----------+
|  1 | fed_t1     |        18 |
|  2 | fed_t2     |        19 |
+----+------------+-----------+

In the above case, it shows that the last record that was copied over from fed_t1 has an id of 18, and the last record from fed_t2 has an id of 19.
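A minimal definition of insert_pos matching the output above (the exact column types here are an assumption) would be:

```sql
CREATE TABLE `insert_pos` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `table_name` varchar(20) NOT NULL,
  `latest_id` int(11) NOT NULL DEFAULT 0,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

-- Seed one row per federated table; starting at 0 makes the first run copy everything
INSERT INTO insert_pos (table_name, latest_id)
VALUES ('fed_t1', 0), ('fed_t2', 0);
```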

So now, it is simply a matter of setting up an event to iterate over the insert_pos table. The algorithm is something like this:

1. For each record in insert_pos:
1.1 Insert into t all records from table_name with an id greater than latest_id, up to the current max id.
1.2 Set latest_id to that max id.
2. End

In my implementation, I have the following stored procedure to do the work:



DELIMITER //

CREATE PROCEDURE populate_t()
BEGIN

  DECLARE remote_table_name VARCHAR(20);
  DECLARE min_id INT;
  DECLARE max_id INT;
  DECLARE done INT DEFAULT 0;

  -- open a cursor to the position table
  DECLARE pos_table_cur CURSOR FOR SELECT table_name, latest_id FROM insert_pos;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  OPEN pos_table_cur;

  -- iterate the position table. For each table, insert records into
  -- the main table and update the position
  REPEAT
    FETCH pos_table_cur INTO remote_table_name, min_id;
    IF NOT done THEN
      -- First, get the max id for the remote table
      DROP TEMPORARY TABLE IF EXISTS remote_max_id;
      SET @get_max = CONCAT('CREATE TEMPORARY TABLE remote_max_id',
                            ' AS SELECT MAX(id) AS max FROM ',
                            remote_table_name);
      PREPARE stmt FROM @get_max;
      EXECUTE stmt;
      DEALLOCATE PREPARE stmt;

      SELECT max INTO max_id FROM remote_max_id;

      -- A NULL max_id means the remote table is empty; skip it,
      -- otherwise CONCAT would produce a NULL statement and PREPARE would fail
      IF max_id IS NOT NULL THEN
        -- Now, insert the new records
        SET @populate = CONCAT('INSERT INTO t (id, s) SELECT id, s FROM ',
                               remote_table_name, ' WHERE id > ', min_id,
                               ' AND id <= ', max_id);
        PREPARE stmt FROM @populate;
        EXECUTE stmt;
        DEALLOCATE PREPARE stmt;

        -- Now update with the latest insert position
        SET @update_pos = CONCAT('UPDATE insert_pos SET latest_id = ',
                                 max_id,
                                 ' WHERE table_name = \'',
                                 remote_table_name, '\'');
        PREPARE stmt FROM @update_pos;
        EXECUTE stmt;
        DEALLOCATE PREPARE stmt;
      END IF;
    END IF;
  UNTIL done END REPEAT;

  CLOSE pos_table_cur;
END //

DELIMITER ;



The actual procedure is rather ugly, since I have to construct the queries dynamically using prepared statements.

Finally, I set up an event to call the procedure every 5 seconds:

CREATE EVENT timed_populate
ON SCHEDULE EVERY 5 SECOND
DO
CALL populate_t;
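Note that MySQL's event scheduler is off by default, so timed_populate will not fire until it is enabled. This can be done at runtime (or with event_scheduler=ON in my.cnf):

```sql
-- Turn on the event scheduler so timed_populate starts running
SET GLOBAL event_scheduler = ON;

-- Verify that the event is registered
SHOW EVENTS;
```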



Preliminary Results

I have only tested the above in a very limited way. However, it seems to be rather reliable. If any of the connections to fed_t* are interrupted, the insert query simply does not execute. Once the connection is re-established, replication works again.

To add a new master database in the above scenario, simply create a federated table from S to the new master, and add the corresponding entry to the insert_pos table. Of course, always make sure that there are no key conflicts!
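As a sketch, adding a hypothetical third master reachable at m3host would take two statements on S (plus adjusting auto_increment_increment on all masters, since with three masters the increment must be at least 3 on each to keep the keys disjoint):

```sql
-- fed_t3 mirrors table t on the new master; m3host is a placeholder hostname
CREATE TABLE `fed_t3` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `s` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=FEDERATED CONNECTION='mysql://user@m3host:3306/test/t';

-- Register it so populate_t() picks it up on its next run
INSERT INTO insert_pos (table_name, latest_id) VALUES ('fed_t3', 0);
```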

1 comment:

Henrik Gemal said...

I have a question about your article:

What if I have multiple masters with different databases and want to replicate them onto one slave?

So I have:
Master1 with database "x"
Master2 with database "y"
Master3 with database "z"

Now I like to have a slave with:
Slave1 with database "x", "y" and "z"

Can that be done?