RFC: Atomic Distributed Transactions #16245
Comments
This is very well written. Everything I could think of on a first read has already been addressed.
Just some thoughts and questions - With respect to For For
Could you explain case 4? Why do we rollback the transactions on RM1 and RM2? There was no transaction preparation done before this, right?
We still have open transactions, so we should roll them back; otherwise they will keep holding locks until the transaction timeout is reached.
Introduction
This document focuses on reintroducing the atomic distributed transaction implementation and addressing the shortcomings with improved and robust support.
Background
Existing System Overview
Vitess has three transaction modes: Single, Multi, and TwoPC.
In Single Mode, any transaction that spans more than one shard is rolled back immediately. This mode keeps the transaction to a single shard and provides ACID-compliant transactions.
In Multi Mode, a commit on a multi-shard transaction is handled as a best-effort commit. A commit failure on any shard rolls back the transactions that have not yet been committed. The shards that were already committed and the shard that failed need application-side handling.
In TwoPC Mode, a commit on a multi-shard transaction follows a sequence of steps to achieve an atomic distributed commit. The existing design document is extensive and explains all the component interactions needed to support it. It also highlights the different failure scenarios and how they should be handled.
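The best-effort behavior of Multi Mode can be illustrated with a toy model. This is a minimal sketch, not Vitess code: the `Shard` class, the `fail_commit` flag, and `best_effort_commit` are hypothetical names introduced only for illustration.

```python
class Shard:
    """Toy shard session; fail_commit simulates a commit error."""

    def __init__(self, name, fail_commit=False):
        self.name = name
        self.fail_commit = fail_commit
        self.state = "open"

    def commit(self):
        if self.fail_commit:
            raise RuntimeError(f"commit failed on {self.name}")
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def best_effort_commit(shards):
    """Commit shard by shard; on the first failure roll back the rest.

    Returns (committed_names, failed_name); failed_name is None on success.
    """
    committed = []
    for i, shard in enumerate(shards):
        try:
            shard.commit()
        except RuntimeError:
            # Shards not yet committed can still be rolled back.
            for remaining in shards[i + 1:]:
                remaining.rollback()
            return committed, shard.name
        committed.append(shard.name)
    return committed, None


shards = [Shard("-80"), Shard("80-c0", fail_commit=True), Shard("c0-")]
result = best_effort_commit(shards)
print(result)                      # (['-80'], '80-c0')
print([s.state for s in shards])   # ['committed', 'open', 'rolled_back']
```

The first shard stays committed while the last one is rolled back, which is exactly the partial outcome the application has to handle in this mode.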
Existing Implementation
A Two-Phase commit protocol requires a Transaction Manager (TM) and Resource Managers (RMs).
Resource Managers are the participating VTTablets for the transaction. Their role is to prepare the transaction and return a success or failure response. During the prepare phase, RMs store all the queries executed on that transaction in recovery logs as statements. If an RM fails, upon coming back online, it prepares all the transactions using the transaction recovery logs by executing the statements before accepting any further transactions or queries.
The Transaction Manager role is handled by VTGate. On commit, VTGate creates a transaction record and stores it in one of the participating RMs, designating it as the Metadata Manager (MM). VTGate then issues a prepare request to the other involved RMs. If any RM responds with a failure, VTGate decides to roll back the transaction and stores this decision in the MM. VTGate then issues a rollback prepared request to all the involved RMs.
If all RMs respond successfully, VTGate decides to commit the transaction. It issues a start commit to the MM, which commits the ongoing transaction and stores the commit decision in the transaction record. VTGate then issues a commit prepared request to the other involved RMs. After committing on all RMs, VTGate concludes by removing the transaction record from the MM.
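The life cycle of the transaction record on the MM can be sketched as a small state machine. This is an illustrative model under stated assumptions: the state names (`PREPARE`, `COMMIT`, `ROLLBACK`), the `MetadataManager` class, and its method names are hypothetical and are not taken from this document.

```python
from enum import Enum


class State(Enum):
    # Hypothetical state names for the transaction record kept on the MM.
    PREPARE = "PREPARE"
    COMMIT = "COMMIT"
    ROLLBACK = "ROLLBACK"


# A decision is made once and never reversed.
ALLOWED = {
    State.PREPARE: {State.COMMIT, State.ROLLBACK},
    State.COMMIT: set(),
    State.ROLLBACK: set(),
}


class MetadataManager:
    """In-memory stand-in for the MM's transaction record storage."""

    def __init__(self):
        self.records = {}  # dtid -> (state, participants)

    def create_record(self, dtid, participants):
        if dtid in self.records:
            raise ValueError(f"duplicate DTID: {dtid}")
        self.records[dtid] = (State.PREPARE, tuple(participants))

    def set_decision(self, dtid, decision):
        state, participants = self.records[dtid]
        if decision not in ALLOWED[state]:
            raise ValueError(
                f"cannot move {dtid} from {state.name} to {decision.name}"
            )
        self.records[dtid] = (decision, participants)

    def conclude(self, dtid):
        # Removing the record marks the transaction as fully resolved.
        state, _ = self.records[dtid]
        if state is State.PREPARE:
            raise ValueError(f"{dtid} has no decision yet")
        del self.records[dtid]
```

Any record that still exists after a crash tells the resolver which direction to finish in: a stored decision is replayed, and a record without a decision can still be rolled back.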
All MMs have a watcher service that monitors unresolved transactions and transmits them to the TM for resolution.
Benefits of the Existing Approach:
Problem Statement
The existing implementation of atomic distributed commit is a modified version of the Two-Phase Commit (2PC) protocol that addresses its inherent issues while making practical trade-offs. This approach efficiently handles single-shard transactions and adopts a realistic method for managing transactions across multiple shards. However, there are issues with the watchdog design, as well as other reliability concerns. Additionally, there are workflow improvements and performance enhancements that need to be addressed. This document will highlight these issues and provide solutions with the rework.
Existing Issues and Proposal
1. Distributed Transaction Identifier (DTID) Generation
The Transaction Manager (TM) designates the first participant of the transaction as the MM. It generates the DTID from the MM's shard prefix and the transaction ID. Although this method ensures uniqueness across shards, it introduces potential conflicts because the auto-increment transaction ID is reset when a VTTablet restarts.
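The collision risk is easy to reproduce in a toy model. The sketch below assumes a DTID of the form `<shard prefix>:<transaction ID>` and an in-memory counter that restarts at 1; the `Tablet` class and the exact DTID layout are assumptions made for illustration only.

```python
import itertools


class Tablet:
    """Toy VTTablet whose transaction ID counter lives only in memory."""

    def __init__(self, shard_prefix):
        self.shard_prefix = shard_prefix
        self._txid = itertools.count(1)

    def restart(self):
        # A restart resets the auto-increment counter.
        self._txid = itertools.count(1)

    def next_dtid(self):
        return f"{self.shard_prefix}:{next(self._txid)}"


mm = Tablet("ks:-80")
before = mm.next_dtid()
mm.restart()
after = mm.next_dtid()
print(before, after, before == after)  # ks:-80:1 ks:-80:1 True
```

If the transaction behind `before` is still unresolved when the tablet restarts, the new transaction is issued the same identifier.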
Impact:
Proposals:
Conclusion:
Proposal 1 is good, but it adds a dependency on a new system to provide the DTID. Proposal 2 reduces that dependency by having the TM generate the DTID, but it risks generating a duplicate DTID, which would fail the Create Transaction Record API call and lead to a transaction rollback. Proposal 3 ensures the DTID is unique but results in a long DTID key. Proposal 4 also risks DTID collisions, causing a transaction rollback on the Create Transaction Record API call.
Proposals 1 & 2 can map the DTID to a non-participating RM, making that RM the MM. The additional network calls will increase the system's commit latency. Proposals 3 & 4 avoid this extra hop but significantly increase the DTID size. The larger DTID size outweighs the efficiency gains from using one of the participating RMs as MM in the overall commit process.
Proposal 3 looks like the most balanced and reliable option here.
2. Transaction Resolution Design
The MM is currently given a fixed IP address of the TM at startup, which it uses to invoke the TM's ResolveTransaction API for unresolved transactions.
Impact:
Proposals:
Conclusion:
Proposal 1 is the more practical choice because it uses existing infrastructure that is proven and already serves other purposes such as real-time stats and schema tracking. Proposal 2, by contrast, requires full-fledged development of a VTGate service discovery system.
3. Connection Settings
The current implementation does not store connection setting changes in the transaction recovery logs. This omission risks the integrity and consistency of the distributed transaction during failure recovery.
Impact:
Proposal: Along with the redo statement logs, the connection settings will be stored as SET statements in the order in which they were executed. On recovery, the same sequence will be used to prepare the transaction.
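A minimal sketch of such an ordered log follows. The `RecoveryLog` class and its methods are hypothetical; the point is only that SET statements and DML share one sequence, so a replay restores each setting before the statements that depend on it.

```python
class RecoveryLog:
    """Ordered per-transaction log of SET and DML statements."""

    def __init__(self):
        self._entries = {}  # dtid -> list of statements in execution order

    def record(self, dtid, statement):
        self._entries.setdefault(dtid, []).append(statement)

    def replay(self, dtid, execute):
        # Re-run everything in the original order so each DML sees the
        # same session settings it ran with the first time.
        for statement in self._entries.get(dtid, []):
            execute(statement)


log = RecoveryLog()
log.record("dtid-1", "SET sql_mode = 'ANSI_QUOTES'")
log.record("dtid-1", 'INSERT INTO "t" (id) VALUES (1)')

replayed = []
log.replay("dtid-1", replayed.append)
print(replayed)
```

The example uses `sql_mode = 'ANSI_QUOTES'` because the INSERT that follows only parses with double-quoted identifiers when that mode is active; replaying the DML without the preceding SET would fail.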
4. Prepared Transactions Connection Stability
The current implementation assumes a stable MySQL connection after preparing a transaction on an RM. Any connection disruption will roll back the transaction and may cause data inconsistency due to modifications by other concurrent transactions.
Impact:
Proposals:
Conclusion:
Proposal 1 is recommended for immediate adoption to improve connection stability and avoid unreliable TCP connections. If testing identifies issues with Unix socket stability, Proposal 2 will be implemented to leverage MySQL's XA protocol for transactional integrity and recovery.
5. Transaction Recovery Logs Application Reliability
The current implementation stores the transaction recovery logs as DML statements. Applying these statements during transaction recovery is not expected to fail, because the current shutdown and startup workflow ensures that no other DMLs leak into the database. Still, there remains a risk that a statement fails while the redo log is being applied, potentially resulting in lost modifications without clear tracking of the modified rows.
Impact:
Proposals:
Conclusion:
Currently, neither proposal will be implemented, as the expectation is that redo log application should not fail during recovery. Should any recovery tests fail due to redo log application issues, Proposal 2 will be prioritized for its inherent advantages over Proposal 1.
6. Unsupported Consistent Lookup Vindex
The current implementation disallows the use of consistent lookup vindexes and upfront rejects any distributed transaction commit involving them.
Impact:
Proposal: Allow the consistent lookup vindex to continue. The pre-transaction will continue to work as-is. Any failure on the pre-transaction commit will roll back the main transaction. The post-transaction will only continue once the distributed transaction is completed. Otherwise, the post-transaction will be rolled back.
7. Resharding, Move Tables and Online Schema Change not Accounted
The current implementation has not handled the complications of running a resharding workflow, a move tables workflow, or an online schema change workflow in parallel with in-flight prepared distributed transactions.
Impact:
Proposals:
Conclusion:
Proposal 1 makes the expected behavior easier to reason about. All workflows will use the same strategy. The new API can be extended to be used for other flows as well.
Exploratory Work
MySQL XA was considered as an alternative to having RMs manage the transaction recovery logs and hold up the row locks until a commit or rollback occurs.
There are currently 20 open bugs on XA. On MySQL 8.0.33, reproduction steps were followed for all the bugs, and 8 still persist. Out of these 8 bugs, 4 have patches attached that resolve the issues when applied. For the remaining 4 issues, changes will need to be made either in the code or the workflow to ensure they are resolved.
MySQL’s XA seems a probable candidate if we encounter issues with our implementation of handling distributed transactions that XA can resolve. XA’s usage is currently neither established nor ruled out in this design.
Rework Design
Commit Phase Interaction
The component interaction for the different cases:
Case 1: All components respond with success.
Case 2: When the Commit Decision call to the MM responds with an error. In this case, the watcher service needs to resolve the transaction, as it is not certain whether the commit decision was persisted.
Case 3: When a Prepare Transaction fails. TM will decide to roll back the transaction. If any rollback returns a failure, the watcher service will resolve the transaction.
Case 4: When Create Transaction Record fails. TM will roll back the transaction.
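The four cases can be traced with a small failure-injection sketch. It is not the Vitess implementation: `FakeRM`, the operation names (`create_transaction_record`, `prepare`, `start_commit`, and so on), and the `(outcome, needs_watcher)` return value are hypothetical and chosen only to mirror the steps described above.

```python
class FakeRM:
    """In-memory stand-in for a participating VTTablet."""

    def __init__(self, name, fail_on=()):
        self.name = name
        self.fail_on = set(fail_on)
        self.calls = []

    def call(self, op):
        self.calls.append(op)
        if op in self.fail_on:
            raise RuntimeError(f"{self.name}: {op} failed")


def _rollback(mm, rms):
    try:
        mm.call("set_rollback")
        for rm in rms:
            rm.call("rollback_prepared")
        mm.call("conclude")
    except RuntimeError:
        # Case 3, second half: a failed rollback is left to the watcher.
        return "unresolved", True
    return "rolled_back", False


def commit(mm, rms):
    """Drive one distributed commit; returns (outcome, needs_watcher)."""
    try:
        mm.call("create_transaction_record")
    except RuntimeError:
        # Case 4: nothing is prepared, but the open transactions still
        # hold locks, so roll them back on every participant.
        for rm in [mm, *rms]:
            rm.call("rollback")
        return "rolled_back", False

    try:
        for rm in rms:
            rm.call("prepare")
    except RuntimeError:
        # Case 3: a prepare failed, so the TM decides to roll back.
        return _rollback(mm, rms)

    try:
        mm.call("start_commit")
    except RuntimeError:
        # Case 2: unknown whether the decision was persisted.
        return "unresolved", True

    try:
        for rm in rms:
            rm.call("commit_prepared")
        mm.call("conclude")
    except RuntimeError:
        # Not one of the four listed cases; assumed to be left to the watcher.
        return "unresolved", True
    # Case 1: every component responded with success.
    return "committed", False
```

For example, `commit(FakeRM("mm", fail_on=["start_commit"]), [FakeRM("rm1")])` returns `("unresolved", True)`, matching Case 2, where only the watcher can find out which way the transaction went.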
Transaction Resolution Watcher
If distributed transactions remain pending in the MM for a long time, this watcher service ensures that the TM is invoked to resolve them.
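One pass of such a watcher might look like the sketch below. The `abandon_after` threshold, the record layout (DTID mapped to creation time in seconds), and the `resolve` callback standing in for the TM's ResolveTransaction API are all assumptions for illustration.

```python
import time


def find_unresolved(records, now, abandon_after):
    """Return DTIDs whose record has been pending longer than abandon_after.

    records maps dtid -> creation time in seconds.
    """
    return sorted(
        dtid for dtid, created in records.items() if now - created > abandon_after
    )


def watch_once(records, resolve, now=None, abandon_after=60.0):
    """Run one watcher pass, handing every stale DTID to resolve()."""
    now = time.time() if now is None else now
    for dtid in find_unresolved(records, now, abandon_after):
        resolve(dtid)


records = {"ks:-80:1": 0.0, "ks:-80:2": 100.0}
seen = []
watch_once(records, seen.append, now=120.0, abandon_after=60.0)
print(seen)  # ['ks:-80:1']
```

Passing `now` explicitly keeps the pass deterministic in tests; a real watcher would run it periodically against the MM's stored records.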
Improvements and Enhancements
Implementation Plan
Task Breakdown:
Testing Strategy
This is the most important piece to ensure all cases are covered, and APIs are tested thoroughly to ensure correctness and determine scalability.
Test Plan
Basic Tests
Commit or rollback of transactions, and handling prepare failures leading to transaction rollbacks.
Resilient Tests
Handling failures of components like VTGate, VTTablet, or MySQL during the commit or recovery steps.
Failures on the MM and RMs include the VTTablet and MySQL interruption cases.
System Tests
Tests involving multiple moving parts, such as distributed transactions with Reparenting (PRS & ERS), Resharding, OnlineDDL, and MoveTables.
Stress Tests
Tests will run conflicting transactions (single and multi-shard) and validate the error metrics related to distributed transaction failures.
Reliability tests
A continuous stream of transactions (single and distributed) will be executed, with all successful commits recorded along with the expected rows. The binary log events will be streamed continuously and validated against the ordering of the change stream and the successful transactions.
This test should run over an extended period, potentially lasting a few days or a week, and must endure various scenarios including:
Deployment Plan
The existing implementation has remained experimental; therefore, no compatibility guarantees will be maintained with the new design changes.
Monitoring
The existing monitoring support will continue as per the old design.
Future Enhancements
1. Read Isolation Guarantee
The current system lacks isolation guarantees, placing the burden on the application to manage it. Implementing read isolation will enable true cross-shard ACID transactions.
2. Distributed Deadlock Avoidance
The current system can encounter cross-shard deadlocks, which are only resolved when one of the transactions times out and is rolled back. Implementing distributed deadlock avoidance will address this issue more efficiently.