Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: shutdown does not kill walredo processes #8150

Merged
merged 8 commits into from
Jun 27, 2024
Merged

Conversation

problame
Copy link
Contributor

@problame problame commented Jun 25, 2024

While investigating Pageserver logs from the cases where systemd hangs during shutdown (https://github.com/neondatabase/cloud/issues/11387), I noticed that even if Pageserver shuts down cleanly1, there are lingering walredo processes.

While systemd should never lock up like it does, maybe we can avoid hitting that bug by cleaning up properly.

Changes

This PR adds a shutdown method to WalRedoManager and hooks it up to tenant shutdown.

We keep track of intent to shutdown through the new enum ProcessOnceCell stored inside the pre-existing redo_process field.
A gate is added to keep track of running processes, using the new type struct Process.

Future Work

Requests that don't need the redo process will not observe the shutdown (see doc comment).
Doing so would be nice for completeness sake, but doesn't provide much benefit because Tenant and Timeline already shut down all walredo users.

Testing

I did manual testing to confirm that the problem exists before this PR and that it's gone after.
Setup:

  • neon_local with a single tenant, create some data using pgbench
  • ensure walredo process is running, not pid
  • watch strace -e kill,wait4 -f -p "$(pgrep pageserver)"
  • neon_local pageserver stop

With this PR, we always observe

$ strace -e kill,wait4 -f -p "$(pgrep pageserver)"
...
[pid 591120] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=591215, si_uid=1000} ---
[pid 591134] kill(591174, SIGKILL)      = 0
[pid 591134] wait4(591174,  <unfinished ...>
[pid 591142] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=591174, si_uid=1000, si_status=SIGKILL, si_utime=0, si_stime=0} ---
[pid 591134] <... wait4 resumed>[{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], 0, NULL) = 591174
...
+++ exited with 0 +++

Before this PR, we'd usually observe just

...
[pid 596239] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=596455, si_uid=1000} ---
...
+++ exited with 0 +++

Refs

refs https://github.com/neondatabase/cloud/issues/11387

Footnotes

  1. Meaning, pageserver finishes its shutdown procedure and calls exit(0) on its own terms, instead of hitting the systemd unit's TimeoutSec= limit and getting SIGKILLed.

@problame problame added the c/storage/pageserver Component: storage: pageserver label Jun 25, 2024
@problame problame marked this pull request as ready for review June 25, 2024 09:48
@problame problame requested a review from a team as a code owner June 25, 2024 09:48
@problame problame requested a review from jcsp June 25, 2024 09:48
Copy link

2922 tests run: 2805 passed, 0 failed, 117 skipped (full report)


Flaky tests (1)

Postgres 15

Code coverage* (full report)

  • functions: 32.5% (6876 of 21141 functions)
  • lines: 50.1% (53658 of 107014 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
e402c46 at 2024-06-25T10:33:54.354Z :recycle:

@problame problame merged commit 66b0bf4 into main Jun 27, 2024
65 checks passed
@problame problame deleted the problame/walredo-shutdown branch June 27, 2024 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants