lw heartbeats during inflight appends #19974

Merged 5 commits on Jun 27, 2024

Contributor

@bharathv bharathv commented Jun 24, 2024

If in-flight appends get stuck (e.g., because of disk pressure on the follower), the leader may spuriously step down due to heartbeat loss, because heartbeats are suppressed for the duration of the append. This commit switches to lightweight (lw) heartbeats while appends are in flight to avoid this scenario.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Avoid spurious step downs if the append_entries are blocked for longer than usual on the followers.
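
The change described above can be pictured with a minimal sketch. Everything here is hypothetical (`heartbeat_action`, `follower_state`, `pick_before`, `pick_after` are illustrative names, not Redpanda's heartbeat manager API), and it assumes a lightweight heartbeat is simply a cheaper liveness message that can be sent while an append is outstanding.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration; not Redpanda's heartbeat_manager API.
enum class heartbeat_action { skip, send_regular, send_lightweight };

struct follower_state {
    // append_entries requests sent to this follower and not yet acknowledged.
    std::uint32_t inflight_appends = 0;
};

// Behavior before the change, as described in the PR text: heartbeats are
// suppressed for as long as an append is in flight. If the append is stuck,
// the leader sees no heartbeat replies and may step down.
inline heartbeat_action pick_before(const follower_state& f) {
    return f.inflight_appends > 0 ? heartbeat_action::skip
                                  : heartbeat_action::send_regular;
}

// Behavior after the change: never go silent. While appends are in flight,
// fall back to a lightweight heartbeat. "send_regular" stands for whatever
// the manager would normally send (an assumption of this sketch).
inline heartbeat_action pick_after(const follower_state& f) {
    return f.inflight_appends > 0 ? heartbeat_action::send_lightweight
                                  : heartbeat_action::send_regular;
}
```

The only difference between the two functions is the in-flight branch: `skip` becomes `send_lightweight`, so a stuck append no longer starves the leader of heartbeat replies.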

@bharathv
Contributor Author

/dt

@bharathv
Contributor Author

/dt

@bharathv bharathv changed the title [draft] lw heartbeats during inflight appends lw heartbeats during inflight appends Jun 25, 2024
@bharathv bharathv marked this pull request as ready for review June 25, 2024 14:40
@bharathv
Contributor Author

Failure unrelated: #19954

src/v/raft/tests/basic_raft_fixture_test.cc (resolved)
src/v/raft/tests/basic_raft_fixture_test.cc (outdated, resolved)
@@ -449,7 +449,7 @@ class consensus {
 };

 // precondition: is_elected_leader() must be true.
-suppress_heartbeats_guard suppress_heartbeats(vnode);
+inflight_appends_guard append_inflight(vnode);
Contributor

nit: would be nice to have a verb in the function name, something like "track_inflight_append", with the current name there is an impression that we are appending something

Contributor Author

done.

src/v/raft/heartbeat_manager.cc (outdated, resolved)
src/v/raft/heartbeat_manager.cc (resolved)
Simulates stuck append entries, which suppress heartbeats, resulting
in a leader step down.
Now that there is only one variant, code internal to the heartbeat
manager is easy to follow without the _v2 suffix.
ztlpn previously approved these changes Jun 27, 2024
@@ -449,7 +453,7 @@ class consensus {
 };

 // precondition: is_elected_leader() must be true.
-suppress_heartbeats_guard suppress_heartbeats(vnode);
+track_inflight_appends_guard append_inflight(vnode);
Contributor

hah, what I meant was to add "track" to the function name, not the class name (so that function name has a verb in it) :)

Contributor Author

lmao 🤦 , let me fix it again.

Contributor Author

one more stamp please :-)
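
For readers following the naming thread, here is a minimal sketch of the convention the reviewer asked for: a noun for the RAII guard class and a verb for the function that returns it. The names `follower_stats`, `inflight_appends_guard`, and `track_inflight_append` are illustrative, and the sketch takes a stats struct rather than the `vnode` argument shown in the diff; it is not the merged code.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical per-follower bookkeeping; not the real Redpanda type.
struct follower_stats {
    std::uint32_t inflight_appends = 0;
};

// Noun-named RAII guard: increments the counter on construction and
// decrements it on destruction, so an append is counted as in flight
// for exactly the lifetime of the guard.
class inflight_appends_guard {
public:
    explicit inflight_appends_guard(follower_stats& s) noexcept
      : _stats(&s) {
        ++_stats->inflight_appends;
    }
    // Moving transfers responsibility for the decrement.
    inflight_appends_guard(inflight_appends_guard&& o) noexcept
      : _stats(std::exchange(o._stats, nullptr)) {}
    inflight_appends_guard(const inflight_appends_guard&) = delete;
    inflight_appends_guard& operator=(const inflight_appends_guard&) = delete;
    inflight_appends_guard& operator=(inflight_appends_guard&&) = delete;
    ~inflight_appends_guard() {
        if (_stats != nullptr) {
            --_stats->inflight_appends;
        }
    }

private:
    follower_stats* _stats;
};

// Verb-named function: reads as "track an in-flight append" at the call
// site, rather than suggesting that something is being appended.
inline inflight_appends_guard track_inflight_append(follower_stats& s) {
    return inflight_appends_guard{s};
}
```

A call site would then read `auto guard = track_inflight_append(stats);`, with the counter dropping back automatically when `guard` goes out of scope.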

With the upcoming change, we don't necessarily suppress heartbeats if
appends are in flight, so the current naming scheme may confuse
readers. No logical changes.
If the inflight appends are stuck, we do not want the leader to
mistakenly step down due to majority heartbeat loss. Instead, this commit
switches to using lw heartbeats in that window.
@bharathv
Contributor Author

Failure unrelated #20316

@piyushredpanda piyushredpanda merged commit 1a28f98 into redpanda-data:dev Jun 27, 2024
17 of 21 checks passed
@vbotbuildovich
Collaborator

/backport v24.1.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-19974-v24.1.x-53 remotes/upstream/v24.1.x
git cherry-pick -x 4e926224134d35734e7d8dcaf5b276a9b2bec471 efc6c72fd5eb422a8f1a31af28b66d47e27d0a42 8663d4efa582ef154f672efe3dc839b6aa88f2c2 5fde255b592f197ee90dfe2e8964978150a27e2d 6619817decc79453900852d9943f8e93da6b148f

Workflow run logs.

@@ -60,7 +60,6 @@ enum class feature : std::uint64_t {
 force_partition_reconfiguration = 1ULL << 26U,
 raft_append_entries_serde = 1ULL << 28U,
 delete_records = 1ULL << 29U,
-lightweight_heartbeats = 1ULL << 30U,
Member

@dotnwat dotnwat Jun 29, 2024

@bharathv is this really "dead" in the sense that this value could be contained in a controller log somewhere and need to be replayed?

Contributor Author

It's a retired flag: it is enabled by default in any cluster installing this version. More details here:
https://github.com/redpanda-data/redpanda/blob/dev/src/v/features/feature_table.h#L88
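
One way to picture the "retired" idea raised in this thread is sketched below. It is a hypothetical illustration, not the contents of `feature_table.h`: the `retired_features` list, `lookup_result`, and `classify` are invented names, and the assumption is that a retired feature's name stays recognizable (so an old log entry that mentions it can still be replayed) while its bit is left unused in the enum.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string_view>

// Active features, using the bit values visible in the diff above.
enum class feature : std::uint64_t {
    raft_append_entries_serde = 1ULL << 28U,
    delete_records = 1ULL << 29U,
    // Bit 30 is deliberately not reused in this sketch: it used to belong
    // to a feature that is now retired.
};

// Hypothetical list of retired feature names that older logs may mention.
constexpr std::array<std::string_view, 1> retired_features = {
  "lightweight_heartbeats"};

enum class lookup_result { active, retired, unknown };

// Classify a feature name read from persisted state. Retired names are
// accepted and treated as always on instead of being reported as errors.
inline lookup_result classify(std::string_view name) {
    if (name == "raft_append_entries_serde" || name == "delete_records") {
        return lookup_result::active;
    }
    for (auto r : retired_features) {
        if (r == name) {
            return lookup_result::retired;
        }
    }
    return lookup_result::unknown;
}
```

Under this assumption, replay code would skip any state change for a `retired` name and reject only `unknown` ones.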

6 participants