fix: shutdown does not kill walredo processes #8150
Merged
While investigating Pageserver logs from the cases where systemd hangs during shutdown (https://github.com/neondatabase/cloud/issues/11387), I noticed that even if Pageserver shuts down cleanly [1], there are lingering walredo processes.
While systemd should never lock up like it does, maybe we can avoid hitting that bug by cleaning up properly.
Changes
This PR adds a shutdown method to `WalRedoManager` and hooks it up to tenant shutdown.
We keep track of intent to shut down through the new `enum ProcessOnceCell` stored inside the pre-existing `redo_process` field.
A gate is added to keep track of running processes, using the new type `struct Process`.
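To make the moving parts concrete, here is a minimal synchronous sketch of the idea, not the code in this PR. Only the names `WalRedoManager`, `ProcessOnceCell`, `Process`, `redo_process`, and the shutdown method come from the description above. The variant names, the `Gate` stand-in, `get_or_spawn`, `live_processes`, and the `Cancelled` error are assumptions made for illustration, and no real child process is spawned.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical stand-in for a gate: counts live guards, `close` waits for zero.
#[derive(Default)]
struct Gate {
    live: Mutex<usize>,
    all_gone: Condvar,
}

/// Held by every running process; dropping it leaves the gate.
struct GateGuard(Arc<Gate>);

impl Gate {
    fn enter(self: &Arc<Self>) -> GateGuard {
        *self.live.lock().unwrap() += 1;
        GateGuard(Arc::clone(self))
    }

    /// Blocks until every guard has been dropped.
    fn close(&self) {
        let mut live = self.live.lock().unwrap();
        while *live > 0 {
            live = self.all_gone.wait(live).unwrap();
        }
    }
}

impl Drop for GateGuard {
    fn drop(&mut self) {
        *self.0.live.lock().unwrap() -= 1;
        self.0.all_gone.notify_all();
    }
}

/// Stand-in for a walredo child process (assumption: real code would own the child).
struct Process {
    _guard: GateGuard,
}

/// Assumed variants: the description only names the enum itself.
enum ProcessOnceCell {
    NotSpawned,
    Spawned(Arc<Process>),
    ManagerShutDown,
}

#[derive(Debug, PartialEq)]
struct Cancelled;

struct WalRedoManager {
    redo_process: Mutex<ProcessOnceCell>,
    launched_processes: Arc<Gate>,
}

impl WalRedoManager {
    fn new() -> Self {
        WalRedoManager {
            redo_process: Mutex::new(ProcessOnceCell::NotSpawned),
            launched_processes: Arc::new(Gate::default()),
        }
    }

    /// Returns the current process, spawning one lazily unless shutdown was requested.
    fn get_or_spawn(&self) -> Result<Arc<Process>, Cancelled> {
        let mut cell = self.redo_process.lock().unwrap();
        if matches!(*cell, ProcessOnceCell::NotSpawned) {
            let guard = self.launched_processes.enter();
            *cell = ProcessOnceCell::Spawned(Arc::new(Process { _guard: guard }));
        }
        match &*cell {
            ProcessOnceCell::Spawned(p) => Ok(Arc::clone(p)),
            _ => Err(Cancelled),
        }
    }

    /// Records the intent to shut down, drops the cell's process handle,
    /// then waits until all launched processes have left the gate.
    fn shutdown(&self) {
        let old = std::mem::replace(
            &mut *self.redo_process.lock().unwrap(),
            ProcessOnceCell::ManagerShutDown,
        );
        drop(old);
        self.launched_processes.close();
    }

    fn live_processes(&self) -> usize {
        *self.launched_processes.live.lock().unwrap()
    }
}
```

In this sketch, a caller that still holds an `Arc<Process>` keeps the gate open, so `shutdown` returns only once the last handle is dropped. After that, `get_or_spawn` returns `Err(Cancelled)` instead of launching a new process.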
Future Work
Requests that don't need the redo process will not observe the shutdown (see doc comment).
Making them observe it would be nice for completeness' sake, but doesn't provide much benefit because `Tenant` and `Timeline` already shut down all walredo users.
Testing
I did manual testing to confirm that the problem exists before this PR and that it's gone after.
Setup:
`neon_local` with a single tenant, create some data using `pgbench`
strace -e kill,wait4 -f -p "$(pgrep pageserver)"
neon_local pageserver stop
With this PR, we always observe
Before this PR, we'd usually observe just
Refs
refs https://github.com/neondatabase/cloud/issues/11387
Footnotes
[1] Meaning, pageserver finishes its shutdown procedure and calls `exit(0)` on its own terms, instead of hitting the systemd unit's `TimeoutSec=` limit and getting SIGKILLed.