RFC: Atomic Distributed Transactions #16245

harshit-gangal · 2024-06-21T15:54:53Z

Introduction

This document focuses on reintroducing the atomic distributed transaction implementation and addressing the shortcomings with improved and robust support.

Background

Existing System Overview

Vitess has three transaction modes; those are Single, Multi and TwoPC.

In Single Mode, any transaction that spans more than one shard is rolled back immediately. This mode keeps the transaction to a single shard and provides ACID-compliant transactions.

In Multi Mode, a commit on a multi-shard transaction is handled with a best-effort commit. Any commit failure on a shard rolls back the non-committed transactions. The previously committed shard transactions and the failure shard need application-side handling.

In TwoPC Mode, a commit on a multi-shard transaction follows a sequence of steps to achieve an atomic distributed commit. The existing design document is extensive and explains all the component interactions needed to support it. It also highlights the different failure scenarios and how they should be handled.

Existing Implementation

A Two-Phase commit protocol requires a Transaction Manager (TM) and Resource Managers (RMs).

Resource Managers are the participating VTTablets for the transaction. Their role is to prepare the transaction and return a success or failure response. During the prepare phase, RMs store all the queries executed on that transaction in recovery logs as statements. If an RM fails, upon coming back online, it prepares all the transactions using the transaction recovery logs by executing the statements before accepting any further transactions or queries.

The Transaction Manager role is handled by VTGate. On commit, VTGate creates a transaction record and stores it in one of the participating RMs, designating it as the Metadata Manager (MM). VTGate then issues a prepare request to the other involved RMs. If any RM responds with a failure, VTGate decides to roll back the transaction and stores this decision in the MM. VTGate then issues a rollback prepared request to all the involved RMs.

If all RMs respond successfully, VTGate decides to commit the transaction. It issues a start commit to the MM, which commits the ongoing transaction and stores the commit decision in the transaction record. VTGate then issues a commit prepared request to the other involved RMs. After committing on all RMs, VTGate concludes by removing the transaction record from the MM.

All MMs have a watcher service that monitors unresolved transactions and transmits them to the TM for resolution.

Benefits of the Existing Approach:

The application does not have to communicate upfront about transactions going cross-shard.
TM maintains the transaction metadata with RM keeping itself stateless.
Storing transaction metadata with one of the RM stating it as MM avoids the Prepare phase for the MM.
The transaction is committed with non-2PC workflow if the transaction exists on a single shard.

Problem Statement

The existing implementation of atomic distributed commit is a modified version of the Two-Phase Commit (2PC) protocol that addresses its inherent issues while making practical trade-offs. This approach efficiently handles single-shard transactions and adopts a realistic method for managing transactions across multiple shards. However, there are issues with the watchdog design, as well as other reliability concerns. Additionally, there are workflow improvements and performance enhancements that need to be addressed. This document will highlight these issues and provide solutions with the rework.

Existing Issues and Proposal

1. Distributed Transaction Identifier (DTID) Generation

The Transaction Manager (TM) designates the first participant of the transaction as MM. It generates the DTID using MM’s shard prefix and the transaction ID. This method ensures uniqueness across shards, it introduces potential conflicts due to the auto-increment transaction ID being reset upon a VTTablet restart.

Impact:

Additional Recovery Workflow: To prevent collisions, on startup all VTTablets must ensure their last transaction ID is set to the maximum value of the in-flight distributed transactions.
Risk of DTID Collision: If synchronization is missed or fails, VTGate might generate DTIDs that collide with existing in-flight transactions. This could result in incorrect transactions being modified or committed, leading to data corruption and loss of transactional integrity.
Exhaustion of ID Space: The skipping of ID ranges will likely continue, and over time, there is a potential risk of reaching the limit of the auto-increment ID range.

Proposals:

Proposal 1: Use a centralized sequence generator that will ensure unique ID across keyspace and shards. This sequence will then be mapped to a shard using a hash function. That shard primary will become the MM for that transaction.
Proposal 2: TM will create the DTID using UUIDv7 or Nano ID and will follow similar steps as Proposal 1. There is a possibility that the DTID generated might not be unique and can fail on Create Transaction Record API on MM which will lead to transaction rollback.
Proposal 3: The first participant of the transaction will be the MM. It will create the DTID using a 32-byte keyspace-shard name as a prefix along with a 14-byte Nano ID when TM invokes the Create Transaction Record API.
Proposal 4: TM creates the DTID using Proposal 3 and sends it to MM to store as part of the Create Transaction Record API. There is a possibility that the DTID generated might not be unique and can fail on Create Transaction Record API on MM which will lead to transaction rollback.

Conclusion:

Proposal 1 is good but it adds a dependency on a new system to provide the DTID. Proposal 2 reduces that dependency by having TM generate the DTID, but it risks generating duplicate DTID which might fail on Create Transaction Record API, leading to transaction rollback. Proposal 3 ensures the DTID is unique but results in a long DTID key. Proposal 4 also risks DTID collisions, causing transaction rollback on the Create Transaction Record API call.

Proposals 1 & 2 can map the DTID to non-participating RMs, making it the MM. These additional network calls will increase the system’s commit latency. Proposals 3 & 4 avoid this extra hop but significantly increase the DTID size. The larger DTID size outweighs the efficiency gains from using one of the participating RMs as MM in the overall commit process.

Proposal 3 looks like the most balanced and reliable option here.

2. Transaction Resolution Design

The MM is currently being provided with a fixed IP address of the TM on startup to invoke TM ResolveTransaction API for unresolved transactions.

Impact:

Operational Overhead: If the TM's IP Address changes then MM will not be able to contact the TM without a manual update and restart of MM.
Single Point of Failure: If the TM is down or not reachable, then MM will not be able to provide unresolved transactions.
Scalability: Having a single TM to resolve the transactions will limit the ability to scale horizontally.

Proposals:

Proposal 1: MM will notify VTGates via the vttablet health stream for pending unresolved transactions. VTGate will invoke MM to retrieve the unresolved transaction details. MM will manage the multiple invocations from VTGates and provide the different sets of unresolved transactions to each VTGate. VTGate will then resolve the transaction based on the current state of the transaction.
Proposal 2: VTOrc will track the unresolved transactions from the MM. VTOrch will then notify the VTGates to resolve the transaction. This would need service discovery for VTGates.

Conclusion:

Proposal 1 is the more practical choice as it utilizes existing infrastructure, which is proven and already used for other purposes like real-time stats and schema tracking. Unlike Proposal 2, which requires full-fledged development of a VTGate service discovery system.

3. Connection Settings

The current implementation does not store changes in the connection settings in the transaction recovery logs. Its omission risks the integrity and consistency of the distributed transaction during a failure recovery scenario.

Impact:

Data Integrity Risk: When the transaction is recovered using the recovery logs, the session state may not reflect the connection settings that were in effect when the transaction was originally executed. This can lead to inconsistencies in transaction behaviour such as differences in character sets, time zones, isolation levels and other session-specific settings.
Failure Recovery Issue: During the failure recovery, the lack of stored connection settings can lead to some statements failing to get applied which will lead to failed prepared transactions and loss of atomicity of the transaction.

Proposal: Along with redo statement logs, the connections settings as set statements will be stored in the sequence of when they were executed. On recovery, the same sequence will be used to prepare the transaction.

4. Prepared Transactions Connection Stability

The current implementation assumes a stable MySQL connection after preparing a transaction on an RM. Any connection disruption will roll back the transaction and may cause data inconsistency due to modifications by other concurrent transactions.

Impact:

Transaction Atomicity Loss: Recovery of the prepared transaction can fail as the underlying data might have changed and hence this transaction will be called as unrecoverable losing the atomicity of the transaction.

Proposals:

Proposal 1: All the database connections that are part of the distributed transaction to MySQL would only be allowed on the Unix socket. Prepare Transaction step will fail if they are using a network connection.
Proposal 2: The locking mechanism needs to be moved to MySQL this will ensure the rows part of the transaction remains locked for modification by other transactions even when the connection is disconnected. MySQL should be able to provide a recovery mechanism for such locked rows. MySQL does solve this problem via the XA protocol.

Conclusion:

Proposal 1 is recommended for immediate adoption to enhance connection stability and prevent unreliable TCP connections. If testing identifies issues with Unix socket stability, Proposal 2 will be implemented to leverage MySQL's XA protocol for transactional integrity and recovery.

5. Transaction Recovery Logs Application Reliability

The current implementation stores the transaction recovery logs as DML statements. On transaction recovery, while applying the statements from these logs it is not expected to fail as the current shutdown and startup workflow ensure that no other DMLs leak into the database. Still, there remains a risk of statement failure during the redo log application, potentially resulting in lost modifications without clear tracking of modified rows.

Impact:

Lack of Visibility: In case of a failed statement application, it becomes unclear which rows were originally modified, complicating data recovery.

Proposals:

Proposal 1: Implement a copy-on-write approach, where raw mutations are stored in a separate shadow table. Upon commit, these mutations are materialized to the primary data tables. This will ensure that any failure during redo log application can be traced back to the original modifications, improving data recovery and maintaining transactional integrity.
Proposal 2: MySQL provides XA protocol support which handles the transaction recovery logs for distributed transactions. This will eliminate the need to store the recovery logs by Vitess.

Conclusion:

Currently, neither proposal will be implemented, as the expectation is that redo log applications should not fail during recovery. Should any recovery tests fail due to redo log application issues, Proposal 2 will be prioritized for its inherent advantages over Proposal 1.

6. Unsupported Consistent Lookup Vindex

The current implementation disallows the use of consistent lookup vindexes and upfront rejects any distributed transaction commit involving them.

Impact:

Operational Disruption: Existing Vitess clusters using consistent lookup vindexes must drop them before enabling distributed commits, causing operational disruption and requiring significant changes to the existing setup.

Proposal: Allow the consistent lookup vindex to continue. The pre-transaction will continue to work as-is. Any failure on the pre-transaction commit will roll back the main transaction. The post-transaction will only continue once the distributed transaction is completed. Otherwise, the post-transaction will be rolled back.

7. Resharding, Move Tables and Online Schema Change not Accounted

The current implementation has not handled the complications of running a resharding workflow, a move tables workflow, or an online schema change workflow in parallel with in-flight prepared distributed transactions.

Impact:

Transaction Atomicity Loss: Running these workflows in parallel with distributed transactions can destabilize prepared transactions, potentially compromising the atomicity guarantee.

Proposals:

Proposal 1: All the VReplication-related workflow needs to account for ongoing distributed transactions during their cut-over. A new tablet manager API will be added to check if it is safe to continue the cutover. This API will take a lock on the involved table and the workflow needs to unlock them at the end of it. VTOrc needs to handle the unlocking of the table if the workflow is abandoned for any reason.
Proposal 2: Look into different kinds of workflow and see if a safe cutover can happen and have different case-by-case implementations to support distributed transactions along the cutover.

Conclusion:

Proposal 1 is relatively easier to argue about the expectation. All workflows will use the same strategy. The new API can be extended to be used for other flows as well.

Exploratory Work

MySQL XA was considered as an alternative to having RMs manage the transaction recovery logs and hold up the row locks until a commit or rollback occurs.

There are currently 20 open bugs on XA. On MySQL 8.0.33, reproduction steps were followed for all the bugs, and 8 still persist. Out of these 8 bugs, 4 have patches attached that resolve the issues when applied. For the remaining 4 issues, changes will need to be made either in the code or the workflow to ensure they are resolved.

MySQL’s XA seems a probable candidate if we encounter issues with our implementation of handling distributed transactions that XA can resolve. XA’s usage is currently neither established nor ruled out in this design.

Rework Design

Commit Phase Interaction

The Component interaction for different cases

Case 1: All components respond with success.

sequenceDiagram
  participant App as App
  participant G as VTGate
  participant MM as VTTablet/MM
  participant RM1 as VTTablet/RM1
  participant RM2 as VTTablet/RM2

  App ->>+ G: Commit
  G ->> MM: Create Transaction Record
  MM -->> G: Success
  par
  G ->> RM1: Prepare Transaction
  G ->> RM2: Prepare Transaction
  RM1 -->> G: Success
  RM2 -->> G: Success
  end
  G ->> MM: Store Commit Decision
  MM -->> G: Success
  opt Any failure here does not impact the reponse to the application
  par
  G ->> RM1: Commit Prepared Transaction
  G ->> RM2: Commit Prepared Transaction
  RM1 -->> G: Success
  RM2 -->> G: Success
  end
  G ->> MM: Delete Transaction Record
  end
  G ->>- App: OK Packet

Case 2: When the Commit Descision from MM responds with an error. In this case, the watcher service needs to resolve the transaction as it is not certain whether the commit decision persisted or not.

sequenceDiagram
  participant App as App
  participant G as VTGate
  participant MM as VTTablet/MM
  participant RM1 as VTTablet/RM1
  participant RM2 as VTTablet/RM2

  App ->>+ G: Commit
  G ->> MM: Create Transaction Record
  MM -->> G: Success
  par
  G ->> RM1: Prepare Transaction
  G ->> RM2: Prepare Transaction
  RM1 -->> G: Success
  RM2 -->> G: Success
  end
  G ->> MM: Store Commit Decision
  MM -->> G: Failure
  G ->>- App: Err Packet

Case 3: When a Prepare Transaction fails. TM will decide to roll back the transaction. If any rollback returns a failure, the watcher service will resolve the transaction.

sequenceDiagram
  participant App as App
  participant G as VTGate
  participant MM as VTTablet/MM
  participant RM1 as VTTablet/RM1
  participant RM2 as VTTablet/RM2

  App ->>+ G: Commit
  G ->> MM: Create Transaction Record
  MM -->> G: Success
  par
  G ->> RM1: Prepare Transaction
  G ->> RM2: Prepare Transaction
  RM1 -->> G: Failure
  RM2 -->> G: Success
  end
  par
  G ->> MM: Store Rollback Decision
  G ->> RM1: Rollback Prepared Transaction
  G ->> RM2: Rollback Prepared Transaction
  MM -->> G: Success / Failure
  RM1 -->> G: Success / Failure
  RM2 -->> G: Success / Failure
  end
  opt Rollback success on MM and RMs
    G ->> MM: Delete Transaction Record
  end
  G ->>- App: Err Packet

Case 4: When Create Transaction Record fails. TM will roll back the transaction.

sequenceDiagram
  participant App as App
  participant G as VTGate
  participant MM as VTTablet/MM
  participant RM1 as VTTablet/RM1
  participant RM2 as VTTablet/RM2

  App ->>+ G: Commit
  G ->> MM: Create Transaction Record
  MM -->> G: Failure
  par
  G ->> RM1: Rollback Transaction
  G ->> RM2: Rollback Transaction
  RM1 -->> G: Success / Failure
  RM2 -->> G: Success / Failure
  end
  G ->>- App: Err Packet

Transaction Resolution Watcher

If there are long pending distributed transactions in the MM. This watcher service will ensure that TM is invoked to resolve them.

sequenceDiagram
  participant G1 as VTGate
  participant G2 as VTGate
  participant MM as VTTablet/MM
  participant RM1 as VTTablet/RM1
  participant RM2 as VTTablet/RM2

  MM -) G1: Unresolved Transaction
  MM -) G2: Unresolved Transaction
  Note over G1,MM: MM sends this over health stream.
  loop till no more unresolved transactions
  G1 ->> MM: Provide Transaction details
  G2 ->> MM: Provide Transaction details
  MM -->> G2: Distributed Transaction ID details
  Note over G2,MM: This VTGate recieves the transaction to resolve.
  end
  alt Transaction State: Commit
  G2 ->> RM1: Commit Prepared Transaction
  G2 ->> RM2: Commit Prepared Transaction
  else Transaction State: Rollback
  G2 ->> RM1: Rollback Prepared Transaction
  G2 ->> RM2: Rollback Prepared Transaction
  else Transaction State: Prepare
  G2 ->> MM: Store Rollback Decision
  MM -->> G2: Success
  opt Only when Rollback Decision is stored
  G2 ->> RM1: Rollback Prepared Transaction
  G2 ->> RM2: Rollback Prepared Transaction
  end
  end
  opt Commit / Rollback success on MM and RMs
  G2 ->> MM: Delete Transaction Record
  end

Improvements and Enhancements

Track which shards have DML applied as not all transactions open will have DML. This can reduce the participating RMs in the distributed transaction. Any shard that does not have a DML applied but an open transaction can be closed without impacting the atomicity of the transaction.
All the atomic transactions related APIs will be idempotent so that if TM tries to resolve the same DTID multiple times, the outcome will not be impacted.

Implementation Plan

Task Breakdown:

Record and store connection settings to the transaction recovery log
Reject distributed commit on a network connection
Modify the commit phase component interaction in TM based on new flows.
Implement the new transaction resolution design
Tracking DMLs on shard transactions and using them for improved commit phase.
Implement new DTID generator logic
Modify existing API implementation to be idempotent
Add an API to the tablet manager to provide lock/unlock on tables
Add support for consistent lookup vindex with distributed transaction

Testing Strategy

This is the most important piece to ensure all cases are covered, and APIs are tested thoroughly to ensure correctness and determine scalability.

Test Plan

Basic Tests

Commit or rollback of transactions, and handling prepare failures leading to transaction rollbacks.

Test Case	Expectation
Distributed Transaction - Commit	Transaction committed, transaction record cleared, and metrics updated
Distributed Transaction - Rollback	Transaction rollbacked, transaction record cleared, and metrics updated
Distributed Transaction - Commit (Prepare to Fail on MM)	Transaction rollbacked, metrics updated
Distributed Transaction - Commit (Prepare to Fail on RM)	Transaction rollbacked, transaction record updated, metrics updated

Resilient Tests

Handling failures of components like VTGate, VTTablet, or MySQL during the commit or recovery steps.

Test Case	Expectation
Distributed Transaction - Store Commit Decision fails on MM	Transaction recovered based on transaction state.
Distributed Transaction - Prepared Commit fail on RM	Transaction recovered and committed
Distributed Transaction - Delete Transaction Record fail	Recovery and transaction record removed
Distributed Transaction - VTGate restart on commit received	Transaction rolled back on timeout
Distributed Transaction - VTGate restart after transaction record created on MM	Recovery and transaction rolled back
Distributed Transaction - VTGate restart after transaction prepared on a subset of RMs	Recovery and transaction rolled back
Distributed Transaction - VTGate restart after transaction prepared on all RMs	Recovery and transaction rolled back
Distributed Transaction - VTGate restart after storing the commit decision on MM	Recovery and transaction committed
Distributed Transaction - VTGate restart after transaction prepared commit on a subset of RMs	Recovery and transaction committed
Distributed Transaction - VTGate restart after transaction prepared commit on all RMs	Recovery and transaction record removed

The failure on MM and RM includes the VTTablet and MySQL interuption cases.

System Tests

Tests Involving multiple moving parts such as distributed transactions with Reparenting (PRS & ERS), Resharding, OnlineDDL, and MoveTables.

Stress Tests

Tests will run conflicting transactions (single and multi-shard), and validate on error metrics related to distributed transaction failure.

Reliability tests

A continuous stream of transactions (single and distributed) will be executed, with all successful commits recorded along with the expected rows. The binary log events will be streamed continuously and validated against the ordering of the change stream and the successful transactions.

This test should run over an extended period, potentially lasting a few days or a week, and must endure various scenarios including:

Failure of different components (e.g., VTGate, VTTablets, MySQL)
Reparenting (PRS & ERS)
Resharding
Online DDL operations

Deployment Plan

The existing implementation has remained experimental therefore no compatibility guarantees will be maintained with the new design changes.

Monitoring

The existing monitoring support will continue as per the old design.

Future Enhancements

1. Read Isolation Guarantee

The current system lacks isolation guarantees, placing the burden on the application to manage it. Implementing read isolation will enable true cross-shard ACID transactions.

2. Distributed Deadlock Avoidance

The current system can encounter cross-shard deadlocks, which are only resolved when one of the transactions times out and is rolled back. Implementing distributed deadlock avoidance will address this issue more efficiently.

deepthi · 2024-06-24T21:49:14Z

This is very well written. Everything I could think of on a first read has already been addressed.
One minor suggestion is to change this wording to be clearer (it has been used in 2 places):
Distributed Transaction - VTGate restart after transaction prepared on partial RMs -> Distributed Transaction - VTGate restart after transaction prepared on a subset of RMs

GuptaManan100 · 2024-06-25T10:38:44Z

Just some thoughts and questions -

With respect to Distributed Transaction Identifier (DTID) Generation. In proposal 3 why do we need a nano ID? If we are going to get the DTID from the MM, then can't it just use auto-incrementing numbers after the keyspace shard? Since the transaction will be written to the MM's mysql table for the creation to be successful. We can initialize the initial value with 1 more than the max available in the said table. This will prevent collisions even across restarts, reparents, or any other issues.

For Transaction Resolution Design, In proposal 2, we can get VTOrc to just run the Resolution code too. It is only going to entail calling RPCs against vttablet. If we create a common function in a package, we can get VTOrc to do it too. For example, we don't get VTOrc to tell vtctld to run PRS/ERS, it just runs it. The same concept here can be used.

For Prepared Transactions Connection Stability. I'm just summarizing what we already discussed before. If we decide to go ahead with Proposal 1, then we need to make sure that the Unix socket connections can't be dropped by MySQL. If for whatever reason MySQL drops a connection that is running a distributed transaction, any other write that was blocked on the same rows might go through, and then the distributed transaction's state would be unrecoverable.
There is a proper fix for read and write exclusion but it entails doing locking at a level higher than MySQL connection using a shadow table as implemented by dropbox. That has a significant performance penalty however.
Edit: ☝️ is essentially the same as proposed in the next section's first proposal too. We would just overload the shadow table for also read/write exclusion.

systay · 2024-06-28T07:51:18Z

Could you explain case 4? Why do we rollback the transactions on RM1 and RM2? There was no transaction preparation done before this, right?

harshit-gangal · 2024-06-28T10:10:36Z

Could you explain case 4? Why do we rollback the transactions on RM1 and RM2? There was no transaction preparation done before this, right?

We still have open transactions so we should roll them back otherwise they will remain holding locks till transaction timeout is not achieved.

harshit-gangal added the Type: RFC Request For Comment label Jun 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Atomic Distributed Transactions #16245

RFC: Atomic Distributed Transactions #16245

harshit-gangal commented Jun 21, 2024 •

edited

Loading

deepthi commented Jun 24, 2024 •

edited

Loading

GuptaManan100 commented Jun 25, 2024

systay commented Jun 28, 2024

harshit-gangal commented Jun 28, 2024

RFC: Atomic Distributed Transactions #16245

RFC: Atomic Distributed Transactions #16245

Comments

harshit-gangal commented Jun 21, 2024 • edited Loading

Introduction

Background

Existing System Overview

Existing Implementation

Benefits of the Existing Approach:

Problem Statement

Existing Issues and Proposal

1. Distributed Transaction Identifier (DTID) Generation

2. Transaction Resolution Design

3. Connection Settings

4. Prepared Transactions Connection Stability

5. Transaction Recovery Logs Application Reliability

6. Unsupported Consistent Lookup Vindex

7. Resharding, Move Tables and Online Schema Change not Accounted

Exploratory Work

Rework Design

Commit Phase Interaction

Transaction Resolution Watcher

Improvements and Enhancements

Implementation Plan

Testing Strategy

Test Plan

Basic Tests

Resilient Tests

System Tests

Stress Tests

Reliability tests

Deployment Plan

Monitoring

Future Enhancements

1. Read Isolation Guarantee

2. Distributed Deadlock Avoidance

deepthi commented Jun 24, 2024 • edited Loading

GuptaManan100 commented Jun 25, 2024

systay commented Jun 28, 2024

harshit-gangal commented Jun 28, 2024

harshit-gangal commented Jun 21, 2024 •

edited

Loading

deepthi commented Jun 24, 2024 •

edited

Loading