CON5409 - Jadhav-Yahoo Case Study - MySQL GTIDs and Parallel or Multithreaded Replication

Yahoo Case Study: MySQL GTIDs and Parallel or
Multithreaded Replication
PRESENTED BY Stacy Yuan, Yashada Jadhav
October 2015
About Yahoo
Yahoo is focused on making the world's daily habits inspiring and entertaining.
By creating highly personalized experiences for our users, we keep people connected to what matters most to them, across devices and around the world.
In turn, we create value for advertisers by connecting them with the audiences that build their businesses.
More than 1B monthly active users across Yahoo and Tumblr
More than 575M mobile monthly active users across Yahoo and Tumblr
Ad Products Team
Mission Statement: Delivering scalable and cost-efficient data services through innovation and automation, powering Yahoo products
Thousands of production servers
OLTP systems & data marts
Database Design and Architecture
Capacity Planning and Performance Reviews
24x7 Monitoring and Operational Support
MySQL at Yahoo
MySQL powers many mission-critical products within Advertising and User space across Desktop and Mobile
Multiple production configurations based on product requirement
DBaaS setup for multiple products
Yahoo Sports: Mobile friendly
Flickr: Sharded across thousands of servers
Hot:Hot, Hot:Warm Configurations
Versions range from Percona Server 5.1 to 5.6 including Percona
XtraDB Cluster
Operating systems running customized RHEL 5.6 to 6.5
About Stacy
Senior MySQL database administrator

10+ years of experience on various flavors of relational databases.
Focus on performance tuning, code reviews, database deployment
and infrastructure management for MySQL
In her spare time, she enjoys reading books and doing some
volunteer work.
About Yashada
MySQL DevOps Engineer with a background in database design and performance tuning.
4+ years of experience on various flavors of relational databases.
In her spare time, she enjoys listening to music and going to
concerts. She appreciates sarcasm a lot too!
What are the next 45 minutes about?

GTID Replication
Advantages and Disadvantages

Performance when compared to regular replication
Multi threaded slaves
Why do we want MTS?

MTS vs single threaded replication - Performance tests
Rolling out GTID and MTS to a live system with no downtime

GTID and MTS in Production
Operational issues
Monitoring and HA
Backups using xtrabackup
Why go for GTID and MTS
Slave promotion becomes easier with a global transaction ID

Multitenant database systems suffer from problems like resource contention due to bad queries, batch jobs, etc. that affect replication.
MTS without GTID: replication coordinates might no longer be accurate because multiple worker threads apply events in parallel.
MTS with GTID
GTID Replication
File-based Replication
Enables data from one MySQL database server (the master) to be replicated to one or more MySQL database servers (the slaves) using the master's binary log file name and position
Needs a replication user and binary logging enabled on the master
Needs a copy of the master database
The slave connects to the master using master_host, port, the replication user, and the master log file and position.
Each slave pulls the events from the master and executes them locally.
GTID Replication
A global transaction identifier (GTID) is a unique identifier created and
associated with each transaction committed on the server of origin
(master).
GTID is unique not only to the server on which it originated, but is
unique across all servers in a given replication setup.
GTID = source_id:transaction_id
The source_id identifies the originating server.
The transaction_id is a sequence number determined by the order in
which the transaction was committed on this server.
Example:
5c7401d3-3623-11e5-ae8c-78e7d15fd641:1-13476
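The example above is really a GTID set: one source_id followed by a range of transaction_ids. As an illustration only, here is a minimal Python sketch of how such a string breaks down. The function names parse_gtid_set and count_transactions are hypothetical helpers, not part of MySQL.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(start, end), ...]}.

    Ranges are inclusive; a single transaction_id N is stored as (N, N).
    """
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        # The source_id contains hyphens but no colons, so split on ':' first.
        source_id, *intervals = part.split(":")
        ranges = result.setdefault(source_id.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return result


def count_transactions(gtid_set):
    """Total number of transactions covered by a parsed GTID set."""
    return sum(end - start + 1
               for ranges in gtid_set.values()
               for start, end in ranges)


example = parse_gtid_set("5c7401d3-3623-11e5-ae8c-78e7d15fd641:1-13476")
print(example)
print(count_transactions(example))
```

Here the source_id is 5c7401d3-3623-11e5-ae8c-78e7d15fd641 and the set covers transaction_ids 1 through 13476 from that server.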
GTID Replication Advantages
Replication topology is easy to change: the binlog file name and position are no longer required. Instead of
Master_log_file=mysql-bin.***
Master_log_pos=****
we use
master_auto_position=1
Failover is simplified
Performance can be increased on a relay slave by setting sync_binlog=0
Managing multi-tiered replication is easier
Replication Failover Comparison

Regular Rep Failover: if S1 is bad, S4 and S5 need to be rebuilt.
GTID Rep Failover: redirect S4 to M.
[Diagrams: replication topology of master M and slaves S1 to S5 for each case]
GTID Replication Limitations

GTID does not provide replication monitoring
SQL_SLAVE_SKIP_COUNTER does not work
Cannot force a slave to start replication from a specific binlog position
GTID Replication Caveats

Updates involving non-transactional storage engines.
CREATE TABLE ... SELECT statements are not supported.
Temporary tables cannot be created or dropped inside a transaction.
To prevent GTID-based replication from failing on such statements: enforce-gtid-consistency
Replication Performance GTID vs Regular Rep

In terms of performance, GTID replication is almost the same as regular replication, only slightly slower.
Possible reasons:
GTID writes more into the binary log (the GTID information for each transaction)
GTID performs additional checks for transactions
[Chart: GTID vs Regular Replication]
Multi threaded Replication
Replication Performance Issue

Multi-threaded applications write to the master in parallel
But replication from master to slave is single-threaded, so it becomes a bottleneck on a busy system.
[Diagram: Master, Slave]
Multi-Threaded Slaves (MTS)

Two types of thread: coordinator thread and worker threads
The coordinator thread on the slave dispatches work across several worker threads
Each worker thread commits its transactions independently.
Multiple active schemas/databases can take advantage of parallel replication
[Diagram: Master, Slave]
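To make the coordinator/worker idea concrete, the following Python sketch is a toy model of per-database dispatch. It is an assumption-laden simplification, not MySQL's actual scheduling algorithm: each database is pinned to the least-loaded worker the first time it is seen, and every later transaction for that database goes to the same worker. The function name dispatch is hypothetical.

```python
from collections import defaultdict


def dispatch(transactions, num_workers):
    """Toy coordinator: assign (database, trx_id) pairs to worker queues.

    A database is pinned to one worker, so transactions on the same
    database keep their order; different databases can run in parallel.
    """
    db_to_worker = {}
    queues = defaultdict(list)
    for database, trx_id in transactions:
        if database not in db_to_worker:
            # Pick the worker with the shortest queue (lowest index on ties).
            db_to_worker[database] = min(range(num_workers),
                                         key=lambda w: len(queues[w]))
        queues[db_to_worker[database]].append(trx_id)
    return {w: queues[w] for w in range(num_workers)}


stream = [("db1", 1), ("db2", 2), ("db1", 3), ("db3", 4), ("db2", 5)]
print(dispatch(stream, 3))
print(dispatch([("db1", 1), ("db1", 2)], 3))
```

In this model a workload that touches a single database ends up entirely on one worker, which is why multiple active schemas are needed to benefit from parallel replication.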
MTS Prerequisites
MySQL 5.6 or above

Transactions must be independent of each other, i.e. they are on different databases.
Multitenant databases are the best candidates for MTS
For N databases, use N parallel workers:
slave_parallel_workers = N
Example: with 3 databases (db1, db2, db3) on master and slave, it is better to set
slave_parallel_workers = 3
Configure MTS
STOP SLAVE;
SET GLOBAL slave_parallel_workers=3;
START SLAVE;
MTS Execution Gaps and Checkpoint

Events are no longer guaranteed to be consecutive
Execution gaps are tracked
Checkpoints are performed from time to time
Check settings
slave_checkpoint_period default 300 ms
slave_checkpoint_group default 512 trx
Exec_Master_Log_Pos shows the latest checkpoint and not latest
transaction
How to fix execution gaps:
STOP SLAVE;
START SLAVE UNTIL SQL_AFTER_MTS_GAPS;
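A small Python sketch may help show why a checkpoint position lags behind the latest applied transaction. It is a conceptual model only: transactions are numbered 1, 2, 3, ... in master order, and the names low_water_mark and execution_gaps are hypothetical.

```python
def low_water_mark(applied):
    """Highest N such that transactions 1..N have all been applied."""
    applied = set(applied)
    mark = 0
    while mark + 1 in applied:
        mark += 1
    return mark


def execution_gaps(applied):
    """Transactions not yet applied although a later one already is."""
    applied = set(applied)
    if not applied:
        return []
    return [n for n in range(1, max(applied) + 1) if n not in applied]


# Workers have applied these transactions, out of order.
applied = [1, 2, 3, 5, 6, 8]
print(low_water_mark(applied))
print(execution_gaps(applied))
```

In this model the checkpoint can only advance to transaction 3 even though transaction 8 is already applied; transactions 4 and 7 are the gaps that START SLAVE UNTIL SQL_AFTER_MTS_GAPS lets the slave fill in.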
Convert MTS to Single-threaded

Run MTS until no more gaps are found in the relay log
Stop Replication
Configure single threaded slave
Start single threaded slave
START SLAVE UNTIL SQL_AFTER_MTS_GAPS;
SET @@GLOBAL.slave_parallel_workers = 0;
START SLAVE;
MTS Advantages and Limitations
Advantages:
Take advantage of multi-core servers
Changes to each schema applied and committed independently by
worker threads
Smaller risk of data loss
Limitations:
START SLAVE UNTIL is no longer supported (apart from the SQL_AFTER_MTS_GAPS form)
Foreign keys that cross-reference databases will disable MTS
No implicit transaction retry after a transient failure
MTS Caveats
Enforcing foreign key relationships between tables in different databases causes MTS to use sequential mode, which can have a negative impact on performance
With single-database replication, MTS slows down replication performance
MTS without GTID

Exec_Master_Log_Pos in SHOW SLAVE STATUS is misleading.
Skipping replication errors with SQL_SLAVE_SKIP_COUNTER=1 is dangerous
Backups taken from a slave, with either mysqldump or xtrabackup, might not get the right position
GTID comes to the rescue
Performance Testing - GTID with MTS Setup

Test scenario:
one master,
two slaves (one uses single-threaded replication, the other uses multi-threaded replication), both using GTID
[Diagram: Master; Slave1 (GTID Rep); Slave2 (MTS GTID Rep)]
Replication Performance Comparison

QPS increases about 3 to 4 times
Load, CPU, and writes per second increase as well
GTID with MTS enabled: Things to watch out for

Exec_Master_Log_Pos is no longer reliable
Executed_Gtid_Set is the reliable indicator
SQL_SLAVE_SKIP_COUNTER no longer works

START SLAVE UNTIL is not supported
slave_transaction_retries is treated as 0 and cannot be changed.
Rolling out GTID and MTS to production
Online Rollout GTID with MTS in Percona Server

MySQL 5.6 requires downtime to enable GTID, which is not acceptable
With Percona Server 5.6, GTID can be enabled with almost no downtime
The variable gtid_deployment_step plays an important role
Database Servers Setup

Dual masters setup
Masters are set up across different colos.
Each master carries one slave
[Diagram: DNS, Prod master, BCP master, Prod slave, BCP slave]
Enable GTID without downtime

Enable GTID in BCP side
1. Make sure BCP master and BCP slave are in sync
2. Stop mysqld in BCP master and BCP slave, add
gtid_deployment_step=on,
gtid_mode=ON,
enforce-gtid-consistency
into my.cnf, and restart mysqld in both servers.
3. Replication from prod master to BCP master is good.
[Diagram: DNS, Prod master, Prod slave, BCP master (gtid_deployment_step=on), BCP slave (gtid_deployment_step=on)]
Promote BCP master to Prod master

4. Prod master: set global read_only=on
5. BCP master:
set global gtid_deployment_step = off;
set global read_only=off;
6. The replication from BCP master to Prod master is broken.
[Diagram: DNS, Prod master, Prod slave, BCP master (gtid_deployment_step=off), BCP slave (gtid_deployment_step=on)]

Enable GTID in Prod master
7. Enable GTID on old prod master and prod slave
8. Fix replication from BCP master to prod master
CHANGE MASTER TO MASTER_AUTO_POSITION = 1;
START SLAVE;
9. Enable GTID replication from Prod master to BCP master
10. Enable MTS in all servers
stop slave;
set global slave_parallel_workers=16;
start slave;
[Diagram: DNS, Prod master (GTID enabled), Prod slave (GTID enabled), BCP master (gtid_deployment_step=off), BCP slave (gtid_deployment_step=on)]

Switch back
11. Perform switchover back to Prod master
12. Disable gtid_deployment_step across all servers.
[Diagram: DNS, Prod master (GTID enabled), Prod slave (GTID enabled), BCP master (gtid_deployment_step=off), BCP slave (gtid_deployment_step=off)]
Switchover Steps
1. Enable global read_only=on in prod master
2. Sanity check to make sure BCP master has caught up with its master (WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS)
3. Disable read_only in BCP master. BCP master becomes prod master
Failover:
If prod master is unreachable, fail over without steps 1 and 2.
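The switchover steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not Yahoo's tooling: execute(server, sql) is a hypothetical callable that runs a statement on the named server and returns its scalar result, and the server labels prod_master and bcp_master are placeholders. The catch-up check re-verifies with GTID_SUBSET rather than relying on the return value of the wait function.

```python
def switchover(execute, master_reachable=True, timeout_seconds=60):
    """Promote the BCP master; skip the first two steps on failover.

    `execute(server, sql)` is a hypothetical helper supplied by the caller.
    """
    if master_reachable:
        # Step 1: stop writes on the current prod master.
        execute("prod_master", "SET GLOBAL read_only = ON")
        # Step 2: wait until the BCP master has applied everything.
        gtids = execute("prod_master", "SELECT @@GLOBAL.gtid_executed")
        gtids = str(gtids).replace("\n", "")
        execute("bcp_master",
                "SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS('%s', %d)"
                % (gtids, timeout_seconds))
        caught_up = execute(
            "bcp_master",
            "SELECT GTID_SUBSET('%s', @@GLOBAL.gtid_executed)" % gtids)
        if int(caught_up) != 1:
            raise RuntimeError("BCP master has not caught up; aborting")
    # Step 3: open the BCP master for writes; it becomes prod master.
    execute("bcp_master", "SET GLOBAL read_only = OFF")
```

Passing the executor in keeps the sketch independent of any particular MySQL client library and makes it easy to dry-run with a function that only logs the statements.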
GTID and MTS in Production : MySQL Ops
GTID and MTS in production : What did we learn?

Errant Transactions
Replication Monitoring
Building slaves using xtrabackup
Errant Transaction
Errant transactions:
Are executed only on slaves.
Could result from a mistake
Could be intentional by design, such as report tables
Why they cause problems
When the slave becomes the master during failover, it exchanges its own set of executed GTIDs with the other servers, then sends any missing transactions to the slaves.
Errant Transaction Detection and Fix

Detect: GTID_SUBSET(slave Executed_Gtid_Set, master Executed_Gtid_Set)
If it returns true (1), there are no errant transactions.
If it returns false (0), there are errant transactions.
Identify: GTID_SUBTRACT(slave Executed_Gtid_Set, master Executed_Gtid_Set)
It returns the errant GTIDs.
Fix: Inject an empty transaction on all other servers.
If the transaction must be executed on the slave only, use
set sql_log_bin=0;
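The detection logic can also be reproduced outside the server, for example in a monitoring script that already has both Executed_Gtid_Set strings. The Python sketch below mimics what GTID_SUBSET and GTID_SUBTRACT compute; the function names are hypothetical, and the second UUID in the example is made up for illustration.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(start, end), ...]}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        source_id, *intervals = part.split(":")
        ranges = result.setdefault(source_id.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return result


def subtract_ranges(ranges, remove):
    """Parts of `ranges` not covered by `remove` (inclusive ranges)."""
    result = []
    for start, end in sorted(ranges):
        cursor = start
        for r_start, r_end in sorted(remove):
            if r_end < cursor or r_start > end:
                continue
            if r_start > cursor:
                result.append((cursor, r_start - 1))
            cursor = max(cursor, r_end + 1)
            if cursor > end:
                break
        if cursor <= end:
            result.append((cursor, end))
    return result


def gtid_subtract(left, right):
    """GTIDs present in `left` but not in `right`, as a parsed set."""
    a, b = parse_gtid_set(left), parse_gtid_set(right)
    remaining = {}
    for source_id, ranges in a.items():
        rest = subtract_ranges(ranges, b.get(source_id, []))
        if rest:
            remaining[source_id] = rest
    return remaining


def gtid_subset(left, right):
    """True if every GTID in `left` is also in `right`."""
    return not gtid_subtract(left, right)


def format_gtid_set(gtid_set):
    """Render a parsed set back into 'uuid:1-5:7,uuid2:3' form."""
    parts = []
    for source_id in sorted(gtid_set):
        intervals = "".join(
            ":%d" % s if s == e else ":%d-%d" % (s, e)
            for s, e in gtid_set[source_id])
        parts.append(source_id + intervals)
    return ",".join(parts)


master = "68fb0071-299b-11e5-9cd6-78e7d15dbe38:1-500"
# The second UUID is a made-up slave server_uuid for illustration.
slave = ("68fb0071-299b-11e5-9cd6-78e7d15dbe38:1-480,"
         "aaaaaaaa-0000-0000-0000-000000000001:1-2")
print(gtid_subset(slave, master))
print(format_gtid_set(gtid_subtract(slave, master)))
```

A slave that is merely behind the master still counts as a subset; only GTIDs that the master has never seen show up in the subtraction.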
Inject Empty Transaction

sql_slave_skip_counter=n no longer works
Execute a fake trx with the GTID that you want to skip
For example: GTID=68fb0071-299b-11e5-9cd6-78e7d15dbe38:501
STOP SLAVE;
SET GTID_NEXT="68fb0071-299b-11e5-9cd6-78e7d15dbe38:501";
BEGIN; COMMIT;
SET GTID_NEXT="AUTOMATIC";
START SLAVE;
SHOW SLAVE STATUS\G # Verification
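When several GTIDs need to be skipped, generating the statements from a script avoids copy-paste mistakes. The following Python sketch only builds the statement list shown above; it does not connect to MySQL, and the function name empty_transaction_statements is a hypothetical helper.

```python
import re

# One GTID: a server UUID, a colon, and a positive transaction number.
GTID_RE = re.compile(
    r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}:[1-9][0-9]*$")


def empty_transaction_statements(gtid):
    """Return the statements that inject an empty transaction for `gtid`."""
    if not GTID_RE.match(gtid):
        raise ValueError("not a single GTID: %r" % gtid)
    return [
        "STOP SLAVE",
        "SET GTID_NEXT='%s'" % gtid,
        "BEGIN",
        "COMMIT",
        "SET GTID_NEXT='AUTOMATIC'",
        "START SLAVE",
    ]


for statement in empty_transaction_statements(
        "68fb0071-299b-11e5-9cd6-78e7d15dbe38:501"):
    print(statement + ";")
```

Validating the GTID first rejects ranges such as 1-501, since GTID_NEXT takes a single GTID rather than a set.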
MySQL Replication Monitoring
Seconds_Behind_Master
A good approximation of how late the slave is, but only when the slave is actively processing updates.
If the network is slow or there are few updates on the master, this is NOT a good measurement.
MySQL Replication Monitoring at Yahoo

MySQL Health Heartbeat
1. Master generates heartbeat by updating timestamp (last_update)
2. Slave checks the difference between current time and last_update
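The slave-side check in step 2 amounts to a timestamp subtraction. A minimal Python sketch follows, assuming the heartbeat row's last_update value has already been read from the slave; the function names and the 60-second threshold are illustrative assumptions, not values from the talk.

```python
from datetime import datetime


def replication_lag_seconds(last_update, now):
    """Seconds between the replicated heartbeat and the current time."""
    # Clamp at zero in case of small clock differences between hosts.
    return max(0.0, (now - last_update).total_seconds())


def is_lagging(last_update, now, threshold_seconds=60):
    """True if the heartbeat is older than the (assumed) alert threshold."""
    return replication_lag_seconds(last_update, now) > threshold_seconds


last_update = datetime(2015, 10, 1, 12, 0, 0)   # value replicated from master
now = datetime(2015, 10, 1, 12, 0, 30)          # current time on the slave
print(replication_lag_seconds(last_update, now))
print(is_lagging(last_update, now))
```

Because the master writes the heartbeat continuously, this measure keeps working when the master is otherwise idle, which is the case where Seconds_Behind_Master is least informative.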
GTID MTS Monitoring Challenges
SHOW SLAVE STATUS

Seconds_Behind_Master is still a good indication of the replication lag
Retrieved_Gtid_Set: List of GTIDs received by the I/O thread, cleared after a server restart
Executed_Gtid_Set: List of GTIDs executed by the SQL thread
Auto_position: 1 if GTID-based replication is enabled
MySQL 5.7 exposes this information through performance_schema
Build Slaves Using Xtrabackup

Start Xtrabackup from either master or slave
If the backup is taken from the master,
Please check the file xtrabackup_binlog_info in the backup folder
If the backup is from slave,
Please check the file xtrabackup_slave_info
$ cat xtrabackup_slave_info
SET GLOBAL gtid_purged='ffee1ff8-363f-11e5-af47-9cb654954cac:1-29123533';
CHANGE MASTER TO MASTER_AUTO_POSITION=1
Build Slave Using Xtrabackup

Enable Replication in Slave
Issue
mysql> SET GLOBAL gtid_purged='ffee1ff8-363f-11e5-af47-9cb654954cac:1-29123533';
ERROR 1840 (HY000): @@GLOBAL.GTID_PURGED can only be set when
@@GLOBAL.GTID_EXECUTED is empty.
How to fix
RESET MASTER;
SET GLOBAL gtid_purged='ffee1ff8-363f-11e5-af47-9cb654954cac:1-29123533';
CHANGE MASTER TO MASTER_HOST='mastername', MASTER_USER='rep_user', MASTER_PASSWORD='rep_password', MASTER_AUTO_POSITION = 1;
START SLAVE;
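For repeated rebuilds, the statements above can be derived from the contents of xtrabackup_slave_info. The Python sketch below is a hypothetical helper that only extracts the gtid_purged value and assembles the statement list; the host, user, and password arguments are placeholders supplied by the caller.

```python
import re


def gtid_purged_from_slave_info(contents):
    """Extract the gtid_purged value from xtrabackup_slave_info text."""
    match = re.search(r"gtid_purged\s*=\s*'([^']*)'", contents, re.IGNORECASE)
    if not match:
        raise ValueError("no gtid_purged setting found")
    return match.group(1).replace("\n", "")


def rebuild_statements(contents, host, user, password):
    """Statements to run on the freshly restored slave, in order."""
    gtid = gtid_purged_from_slave_info(contents)
    return [
        "RESET MASTER",  # empties GTID_EXECUTED so gtid_purged can be set
        "SET GLOBAL gtid_purged='%s'" % gtid,
        "CHANGE MASTER TO MASTER_HOST='%s', MASTER_USER='%s', "
        "MASTER_PASSWORD='%s', MASTER_AUTO_POSITION=1" % (host, user, password),
        "START SLAVE",
    ]


info = ("SET GLOBAL gtid_purged="
        "'ffee1ff8-363f-11e5-af47-9cb654954cac:1-29123533';\n"
        "CHANGE MASTER TO MASTER_AUTO_POSITION=1\n")
for statement in rebuild_statements(info, "mastername", "rep_user",
                                    "rep_password"):
    print(statement + ";")
```

RESET MASTER comes first because, as the error above shows, gtid_purged can only be set while GTID_EXECUTED is empty.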
Build Slave Using Xtrabackup

Still issue?
mysql> start slave;
ERROR 1872 (HY000): Slave failed to initialize relay log info structure from the repository
RESET SLAVE;
START SLAVE;
Summary
GTID
MTS
GTID with MTS performance comparison
GTID with MTS online rollout
Things to watch out for
Rebuild slave
We would love to talk more ..
http://mysqlatyahoo.tumblr.com
YJ
yashada@yahoo-inc.com
https://www.linkedin.com/pub/yashada-jadhav/18/659/a6
Stacy Yuan
syuan@yahoo-inc.com
https://www.linkedin.com/pub/stacy-yuan/53/577/324
Appendix
Appendix: 1. GTID with MTS Configuration and Setup
Hardware: 24 CPUs, 48 GB memory

Database: innodb_buffer_pool_size = 32 GB
innodb_log_file_size = 1G
innodb_thread_concurrency = 0
Case1:
Master: Create 8 schemas and 8 users; each user accesses one schema
Create 8 tables with 1 million rows each in every schema
Eight sysbench runs are executed concurrently on the master.
num-threads=2
max-time=900
oltp-read-only=off
