MySQL Binary Log Connector works in the Change Data Capture

In the previous article, I talked about Change Data Capture as a Gateway to the Big Data and Streaming Platforms. What is the CDC? What is the impact on business metrics? How does our team decide to build in-house our CDC project? How does our team migrate our CDC from Debezium to Gargantua?

In this article, I want to talk more about a few insights into Gargantua. I will use one of the Gargantua connectors for MySQL and explain some of the components. Gargantua MySQL connector is a connector to capture any data changes from the MySQL database by reading Binlog files (the MySQL transaction logs). Gargantua not only captures DML statements — insert, update, delete, commits, and rollbacks, but also afford to monitor DDL statements — create, alter, delete a table.

How does Binlog work in MySQL?

Binlog or binary log is a set of log files that contain information about data modifications made to a MySQL server instance. Basically, the Binlog file is used to enable the MySQL replication feature that allows a server — the master — to send all changes to another server — the slave — and the slave tries to apply all changes to keep up-to-date with the master. If you want to know more about the binary log works in MySQL, you may want to look at this documentation. In short, the replication works as follows:

  1. Whenever the master’s database is modified, the change is written to a binlog. The process is done by the client thread that executed the query that modified the database.
  2. The master has a thread, called the dump thread, that continuously reads the master’s binlog and sends it to the slave.
  3. The slave has a thread, called the IO thread, that receives the binlog that the master’s dump thread sent and writes it to a file: the relay log.
  4. The slave has another thread, called the SQL thread, that continuously reads the relay log and applies the changes to the slave server.

Please take note if you want to work with the CDC & Binlog files, you need to set up your MySQL server to enable the replication stream. You may want to look to this documentation for configuring the replication stream.

In the Binlog file, there are several event types. Each type has a different meaning & purposes. I will not talk about the detail of each event type, but instead, you could read more about Binlog event types through this official binlog-event-type. Each transaction log in MySQL is represented several events in a Binlog file, and I will talk more about it.

Figure 1. Write Rows Event in the Binlog file

In figure 1, you could see how does a typical transaction look like for a single write rows event. I will give you an example so you could understand it more easily. Example: There is a transaction that inserts new data into a particular table (Let’s say member table), then the following events will be created in the Binlog file.

  1. The GTID event. The beginning of a transaction.
  2. The Query event with “BEGIN” as SQL.
  3. The Table Map event contains table identifier information like schema & table name for the member table.
  4. The Write Rows event contains the new data for the member table.
  5. The XID or Query event that contains “COMMIT/ROLLBACK” as a SQL. The end of a transaction.

Therefore, a single transaction with a single write rows event will produce five events in the Binlog file.

Figure 2. Update Rows Event in the Binlog file

In figure 2, you could see how does a typical transaction looks like for a single update rows event. Example: There is a transaction that updates some data in a member table, then the following events will be created in the Binlog file.

  1. The GTID event. The beginning of a transaction.
  2. The Query event with “BEGIN” as SQL.
  3. The Table Map event contains table identifier information like schema & table name for the member table.
  4. The Update Rows event contains both before and after the data updated in the member table.
  5. The XID or Query event that contains “COMMIT/ROLLBACK” as a SQL. The end of a transaction.

Therefore, a single transaction with a single update rows event will produce five events in the Binlog file.

Figure 3. Delete Rows Event in the Binlog file

In figure 3, you could see how does a typical transaction look like for a single delete rows event. Example: There is a transaction that deletes some data in a member table, then the following events will be created in the Binlog file.

  1. The GTID event. The beginning of a transaction.
  2. The Query event with “BEGIN” as SQL.
  3. The Table Map event contains table identifier information like schema & table name for the member table.
  4. The Delete Rows event contains before the data deleted in the member table.
  5. The XID or Query event that contains “COMMIT/ROLLBACK” as a SQL. The end of a transaction.

Therefore, a single transaction with a single delete rows event will produce five events in the Binlog file.

Figure 4. Multiple Data Changes in the Binlog file

In figure 4, you could see how does a typical transaction looks like for multiple data changes. Example: There is a transaction that inserts new data into a cart table and updates some data in a transaction table, then the following events will be created in the Binlog file.

  1. The GTID event. The beginning of a transaction.
  2. The Query event with “BEGIN” as SQL.
  3. The Table Map event contains table identifier information like schema & table name for the cart table.
  4. The Write Rows event contains new data for the cart table.
  5. The Table Map event contains table identifier information like schema & table name for the transaction table.
  6. The Update Rows event contains both before and after the data updated in the transaction table.
  7. The XID or Query event that contains “COMMIT/ROLLBACK” as a SQL. The end of a transaction.

Therefore, a single transaction with multiple data changes will produce seven events in the Binlog file.

Figure 5. Alter Table in the Binlog file

In figure 5, you could see how does a typical transaction looks like for altering a particular table. Example: There is a transaction that modifies schema for adding new columns in a member table, then the following events will be created in the Binlog file.

  1. The GTID event. The beginning of a transaction.
  2. The Query event with “BEGIN” as SQL.
  3. The Query event with “ALTER TABLE `member` ADD new_column datatype” as SQL.
  4. The XID or Query event that contains “COMMIT/ROLLBACK” as a SQL. The end of a transaction.

Therefore, a single transaction with an altered query will produce four events in the Binlog file.

How does Gargantua work with Binlog file?

In the previous section, we already talked about some of the behavior of Binlog events for each transaction type that you must capture in the CDC app. Next, I want to talk about how Gargantua handles those Binlog events. There are three main components in Gargantua.

  1. BinlogReader is an entry-point for accessing and managing the connector. The BinlogReader captures all monitored table information and updates the monitored table schema when there is a DDL statement. The BinlogReader store records in a queue and handles how to put & poll the records synchronously when there is a request both from BinlogClient & BinlogPusher.
  2. BinlogClient opens the stream to MySQL to read the binlog files, then deserialize the rows into records in Debezium format, finally put the records into the queue.
  3. BinlogPusher manages the execution process to poll the records from the queue, then publish the records into the target store. Finally, gets the latest record to update the current binlog checkpoint into Cassandra. The BinlogPusher schedules every period of time.
Figure 6. Gargantua Workflow

In figure 6, you could see how the three Gargantua components work together in the big picture.

Figure 7. BinlogClient Reads & Handles Events Flowchart

In figure 7, you could see how does Gargantua read the Binlog file. The whole process runs in a Thread. The key points you should know are below.

  • The Gargantua records a GTID set and other information so that we could use those checkpoint information to rewind the Binlog events to where to the specific transaction later.
  • The Gargantua stores table schema for all monitored tables in the memory and updates it when there is a DDL statement. The table schemas are used when extracting the data change so that Gargantua can handle any schema changes.
  • The Gargantua uses a buffer to store the data changes in the memory. It helps Gargantua to process the data changes when there is a commit transaction or remove the invalid data changes when there is a rollback transaction.
  • The Gargantua uses a queue to store and send the valid data changes into your target source in order. It allows Gargantua to put and get the data synchronously.
Figure 8. BinlogPusher Sends Data Changes Flowchart

In figure 8, you could see how does Gargantua sends the data changes into another store. The Gargantua spawns another Thread to schedule & run the above process for every period of time depends on the configuration. The key points you should know are below.

  • The Gargantua uses an event listener when sending all data changes so that you can implement your own code and target store such as Kafka, Google PubSub, Azure Event Hubs, etc. You also could decide how to send data on your own. It could be synchronous or asynchronous.
  • The Gargantua uses a checkpoint listener to manage the Binlog source checkpoint so that you can implement your own code. The Binlog source checkpoint contains a connector id, server id, binlog filename, binlog position, GTID set, and XID. You also could decide where the Binlog source checkpoint is stored.

Conclusion

I already talked about how does Binlog works in MySQL and Gargantua work with the Binlog events. I also gave more details in the flowchart about how the Binlog events behavior for some transactions. I hope you could understand how the CDC works in MySQL and get more insight about it now.

Thank you for reading my article.

Data Engineering at Bukalapak, have the interest to become a solution architect