
group/transactions: compaction fixes #19931

Merged: 7 commits into redpanda-data:dev from group_co_stm, Jun 28, 2024

Conversation

@bharathv (Contributor) commented Jun 20, 2024

Fixes two main issues related to compaction of group transactions:

  • There was no enforcement of max_collectible_offset, so open transactions were candidates for compaction.
  • tx/compaction: don't deduplicate control batches #16295 disabled compaction of all control batches; since all transactional batches in the group topic are control batches, this effectively disabled compaction of the topic.

These issues are addressed by adding a new state machine that enforces max_collectible_offset and then allowing compaction of control batch types from group transactions.
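For context, a minimal sketch of the tracking idea behind the new state machine. The names group_tx_tracker_stm, _all_txs, begin_offsets and producer_to_begin appear in the diff snippets below; the handler names and signatures here are illustrative, not the merged code:

```cpp
// Hedged sketch: a fence batch opens a transaction at its offset; the
// matching commit/abort closes it. The set of open begin offsets then
// bounds max_collectible_offset (see the sketch further down).
void group_tx_tracker_stm::handle_fence(
  const kafka::group_id& group,
  model::producer_identity pid,
  model::offset fence_offset) {
    auto& state = _all_txs[group]; // per_group_state
    state.producer_to_begin[pid] = fence_offset;
    state.begin_offsets.insert(fence_offset);
}

void group_tx_tracker_stm::handle_commit_or_abort(
  const kafka::group_id& group, model::producer_identity pid) {
    auto it = _all_txs.find(group);
    if (it == _all_txs.end()) {
        return;
    }
    auto& state = it->second;
    if (auto p = state.producer_to_begin.find(pid);
        p != state.producer_to_begin.end()) {
        state.begin_offsets.erase(p->second);
        state.producer_to_begin.erase(p);
    }
}
```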

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fixes compaction-related issues with transactional data in the groups topic. This prevents a case where the groups topic grew unbounded due to ineffective compaction.

@bharathv (Contributor, author) commented:

/dt

@vbotbuildovich (Collaborator) commented Jun 20, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/50483#01903499-4b14-4281-8fe3-3ff604ca2c45:

"rptest.tests.scaling_up_test.ScalingUpTest.test_moves_with_local_retention.use_topic_property=True"

new failures in https://buildkite.com/redpanda/redpanda/builds/50483#01903499-4b16-46f1-8744-bc44f9f0834e:

"rptest.transactions.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=delete.workload=Workload.TX_UNIQUE_KEYS"
"rptest.transactions.transactions_test.TxUpgradeTest.upgrade_does_not_change_tx_coordinator_assignment_test"

new failures in https://buildkite.com/redpanda/redpanda/builds/50483#01903499-4b18-47bc-b679-07b621e5740a:

"rptest.transactions.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=compact.workload=Workload.TX_UNIQUE_KEYS"

new failures in https://buildkite.com/redpanda/redpanda/builds/50498#01903792-80c0-4daa-944f-8c8e5e87664a:

"rptest.transactions.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=delete.workload=Workload.TX_UNIQUE_KEYS"
"rptest.transactions.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=compact.workload=Workload.TX"

new failures in https://buildkite.com/redpanda/redpanda/builds/50498#01903792-80c2-4d61-a546-f2c6ec6bc20b:

"rptest.transactions.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=compact.workload=Workload.TX_UNIQUE_KEYS"

new failures in https://buildkite.com/redpanda/redpanda/builds/50509#01903998-bf65-426e-b1a1-611a9db57536:

"rptest.transactions.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=delete.workload=Workload.TX_UNIQUE_KEYS"

new failures in https://buildkite.com/redpanda/redpanda/builds/50509#01903998-bf67-47c0-8473-c9523c31fcdf:

"rptest.transactions.compaction_e2e_test.CompactionE2EIdempotencyTest.test_basic_compaction.initial_cleanup_policy=compact.workload=Workload.TX_UNIQUE_KEYS"
"rptest.tests.raft_availability_test.RaftAvailabilityTest.test_leadership_transfer"

@bharathv (Contributor, author) commented:

/dt


@bharathv (Contributor, author) commented:

/dt

@bharathv (Contributor, author) commented:

/dt

@bharathv (Contributor, author) commented:

/ci-repeat 3

@bharathv force-pushed the group_co_stm branch 2 times, most recently from cb72c7f to 8e293de on June 23, 2024 at 23:04
@bharathv marked this pull request as ready for review on June 23, 2024 at 23:12
```cpp
        return std::tie(begin_offsets, producer_to_begin);
    }
};
using all_txs_t = absl::btree_map<kafka::group_id, per_group_state>;
```
Member: any reason not to use chunked_hash_map?

bharathv (author): chunked_hash_map is not copyable because of the backing fragmented_vector, so the snapshot code would have required manual iteration; given the scale here is very small, I used the absl variant.

```cpp
template<class Base>
class group_data_parser {
protected:
    ss::future<> parse(model::record_batch b) {
```
Member: it looks to me like we don't have to return a future from all the handle methods?

bharathv (author): it's needed for handle_raft_data but not for the rest of them; I just kept it consistent as 'future' proofing, in case some 'future' code wants to do something more interesting than adding to in-memory state.
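For illustration, the shape being discussed might look like this (a sketch, not the PR's code; handle_fence and update_in_memory_state are hypothetical names):

```cpp
// Only handle_raft_data genuinely needs to be asynchronous; the other
// handlers return a ready future so every batch type shares one signature.
ss::future<> handle_fence(model::record_batch b) {
    update_in_memory_state(std::move(b)); // hypothetical synchronous update
    return ss::now();
}
```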

@bharathv (Contributor, author) commented:

Last force push is a rebase to fix conflicts from recent group changes.


```cpp
bool group_tx_tracker_stm_factory::is_applicable_for(
  const storage::ntp_config& config) const {
    auto nt = model::topic_namespace{config.ntp().ns, config.ntp().tp.topic};
```
Member: here we copy the topic and namespace for every single batch; maybe we can compare the topic and namespace directly, or use model::topic_namespace_view.
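A hedged sketch of the suggested fix, assuming model::topic_namespace_view is constructible from an ntp and that a kafka_consumer_offsets_nt constant is available for comparison (both assumptions, not confirmed by this thread):

```cpp
bool group_tx_tracker_stm_factory::is_applicable_for(
  const storage::ntp_config& config) const {
    // Compare through non-owning views instead of materializing a
    // model::topic_namespace copy on every call.
    return model::topic_namespace_view(config.ntp())
           == model::topic_namespace_view(model::kafka_consumer_offsets_nt);
}
```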

```cpp
public:
    group_data_parser() {
        static_assert(
          GroupDataParserBase<Base>,
```
Contributor: can't we just write template<GroupDataParserBase Base> instead of a static assert?

bharathv (author): The concept would be checked when the class template is instantiated, and at that point the CRTP-derived type is still incomplete, so the compiler complains. By the time the constructor is instantiated the type is complete, hence the static assert.
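A standalone illustration of the CRTP/incomplete-type issue described above (all names are illustrative):

```cpp
#include <concepts>

template<class T>
concept HasHandlers = requires(T t) { t.handle_fence(); };

// template<HasHandlers Base> class parser { ... }; would fail: at the
// point where parser<my_stm> is instantiated as a base class below,
// my_stm is still an incomplete type, so the concept cannot be satisfied.
template<class Base>
class parser {
public:
    parser() {
        // The constructor body is only instantiated when a my_stm object
        // is actually constructed, and by then Base is a complete type.
        static_assert(HasHandlers<Base>, "Base must provide the handlers");
    }
};

struct my_stm : parser<my_stm> {
    void handle_fence() {}
};

int main() { my_stm s; }
```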

```diff
@@ -884,6 +884,11 @@ class group final : public ss::enable_lw_shared_from_this<group> {
     std::vector<model::topic_partition>
     get_expired_offsets(std::chrono::seconds retention_period);
 
+    bool use_new_fence_batch_type() const {
```
Contributor: nit: I'm always suspicious of using "new" in naming anything, because after a year the thing is not so new anymore and it is less clear what "new" is referring to :)

bharathv (author): renamed and clarified in a comment.

```diff
@@ -77,6 +77,7 @@ enum class feature : std::uint64_t {
     cluster_topic_manifest_format_v2 = 1ULL << 45U,
     node_local_core_assignment = 1ULL << 46U,
     unified_tx_state = 1ULL << 47U,
+    group_fence_batch_type_switch = 1ULL << 48U,
```
Contributor: nit: it's a bit unclear to me what "type switch" means here. I guess the incompatible change is that we are using a new batch type; it would be nice to consistently use a single name in all the places referring to it.


```cpp
model::offset group_tx_tracker_stm::max_collectible_offset() {
    auto result = last_applied_offset();
    for (const auto& [_, group_state] : _all_txs) {
```
Contributor: hmm, I wonder what the scale here is? Maybe we should just recalculate it once in maybe_end_tx?

bharathv (author): a few hundreds to at most ~1000 at the higher end, I'd think. Also note this is in-flight state, that is, groups with open transactions (not historical groups in the log), so it's very unlikely this even reaches double digits IMO. On top of that, this is only called from the compaction path when computing the compaction boundary, which is not a hot path.

Contributor: ok, sounds good
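For readers, a hedged sketch of the computation under discussion, completing the truncated snippet above (it assumes begin_offsets is an ordered set and that model::prev_offset is available; the merged code may differ in details):

```cpp
model::offset group_tx_tracker_stm::max_collectible_offset() {
    auto result = last_applied_offset();
    for (const auto& [_, group_state] : _all_txs) {
        if (group_state.begin_offsets.empty()) {
            continue; // no open transactions in this group
        }
        // Nothing at or past the earliest open transaction's fence may
        // be compacted away yet.
        result = std::min(
          result, model::prev_offset(*group_state.begin_offsets.begin()));
    }
    return result;
}
```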

@andrwng (Contributor) left a comment:

Generally the compaction stuff looks good. I don't fully understand the nuances of the new batch type, so deferring to others there, but IIUC we are free to remove the new group tx fences, right?

```diff
@@ -50,6 +50,7 @@ enum class record_batch_type : int8_t {
       = 29, // place holder for last batch in a segment that was aborted
     role_management_cmd = 30, // role management command
     client_quota = 31, // client quota command
+    group_fence_tx = 32, // fence batch in group transactions
```
Contributor: Should this go into offset_translator_batch_types? Is the offsets topic consumable with Kafka offsets?

bharathv (author): This is in the consumer offsets topic, which doesn't have Kafka-offset-based consumption (it's not a data partition).

```cpp
// Wait until all segments are compacted and only two remain
co_await log->flush();
co_await log->force_roll(ss::default_priority_class());
co_await tests::cooperative_spin_wait_with_timeout(30s, [&]() {
```
Contributor: nit: might be able to use RPTEST_REQUIRE_EVENTUALLY_CORO here?

bharathv (author): ya, I was trying to remember this name but couldn't, thanks, will switch.
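The suggested change might look like this (a sketch; the predicate body and the segment_count accessor are assumptions based on the surrounding test, not the committed code):

```cpp
// Wait until compaction leaves exactly two segments, failing the test
// if that doesn't happen within the timeout.
RPTEST_REQUIRE_EVENTUALLY_CORO(
  30s, [&] { return log->segment_count() == 2; });
```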

```diff
@@ -367,105 +367,50 @@ ss::future<> index_rebuilder_reducer::do_index(model::record_batch&& b) {
     });
 }
 
-void tx_reducer::consume_aborted_txs(model::offset upto) {
+void tx_reducer::refresh_ongoing_aborted_txs(const model::record_batch& b) {
```
Contributor:

> prepare batches are no longer used

What does this mean for old topics that have legacy prepare batches?

bharathv (author): they were only in pre-transactions-GA code (>2 years ago), so we don't really support them anymore (after 22.3.0).

Commit messages:

.. into a separate utility so it can be shared by multiple code paths that parse this log data.

This helps clamp the max collectible offset on group partitions so it does not exceed the earliest open transaction.

Prior to this change, group_tx_fence shared the fence record batch type with data partitions (tx_fence). This made the compaction logic complicated, particularly because different compaction rules applied to the fence batch in groups and data partitions. With the new feature, the group fence has a separate dedicated batch type, so it is easy to disambiguate the two fence types.

Remove all control batch parsing. Control batches cannot be compacted (after redpanda-data#15404) and prepare batches are no longer used, so most branches here are dead code. No logical changes.

We will add another transactional type for consumer offsets in a later commit. Renaming to disambiguate the different transactional stms.

Enables compaction of group transaction markers, which was broken prior to this change. Group transaction batches are omitted in the compaction reducer pass; check the code comments for more details.

The test generates interleavings of transactions across multiple groups in the same consumer offsets partition, and repeatedly compacts and checks some invariants.
```cpp
// committed data has already been rewritten as separate raft_data batches,
// so no need to retain originally written group_prepare_tx batches while
// the transaction is in progress.
return is_compactible_control_batch(b.header().type);
```
Contributor: So simple!

I wonder why we needed a separate batch type. If we know that we are in the consumer offsets topic, can't we just compact all fence batch types? Is this because, in theory, ordinary user transactions are also possible for the consumer offsets topic?

bharathv (author): right, this is more of a code structuring problem: we don't have the "topic context" in all the places that need it. That is fixable with some (ugly) refactoring, but even otherwise there is no reason to use the same batch type for the fence in data and consumer offsets topics, especially when every other batch type (data, commit/abort) is different. This change simplifies things because we can now just look at the batch header to decide whether it is compactible, and all the batch types are neatly grouped by topic (no overlap).
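A minimal sketch of the resulting check, assuming the group-transaction batch types visible in this PR and its surrounding code (group_prepare_tx, group_commit_tx, group_abort_tx); the exact set in the merged code may differ:

```cpp
bool is_compactible_control_batch(model::record_batch_type t) {
    // Compactibility is decided purely from the batch header type: all
    // group-transaction control batches below max_collectible_offset
    // are safe to drop during compaction.
    switch (t) {
    case model::record_batch_type::group_fence_tx:
    case model::record_batch_type::group_prepare_tx:
    case model::record_batch_type::group_commit_tx:
    case model::record_batch_type::group_abort_tx:
        return true;
    default:
        return false;
    }
}
```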

@piyushredpanda merged commit c37412c into redpanda-data:dev on Jun 28, 2024 (15 of 18 checks passed).
```cpp
}

private:
    ss::future<> parse_fence(model::record_batch b) {
```
Member: would probably be worth trying to move some of these implementations out of the header and into .cc files.
