fix: shutdown does not kill walredo processes #8150

problame · 2024-06-25T08:57:47Z

While investigating Pageserver logs from the cases where systemd hangs during shutdown (https://github.com/neondatabase/cloud/issues/11387), I noticed that even if Pageserver shuts down cleanly¹, there are lingering walredo processes.

While systemd should never lock up like it does, maybe we can avoid hitting that bug by cleaning up properly.

Changes

This PR adds a shutdown method to WalRedoManager and hooks it up to tenant shutdown.

We keep track of intent to shutdown through the new enum ProcessOnceCell stored inside the pre-existing redo_process field.
A gate is added to keep track of running processes, using the new type struct Process.

Future Work

Requests that don't need the redo process will not observe the shutdown (see doc comment).
Doing so would be nice for completeness sake, but doesn't provide much benefit because Tenant and Timeline already shut down all walredo users.

Testing

I did manual testing to confirm that the problem exists before this PR and that it's gone after.
Setup:

neon_local with a single tenant, create some data using pgbench
ensure walredo process is running, not pid
watch strace -e kill,wait4 -f -p "$(pgrep pageserver)"
neon_local pageserver stop

With this PR, we always observe

$ strace -e kill,wait4 -f -p "$(pgrep pageserver)"
...
[pid 591120] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=591215, si_uid=1000} ---
[pid 591134] kill(591174, SIGKILL)      = 0
[pid 591134] wait4(591174,  <unfinished ...>
[pid 591142] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=591174, si_uid=1000, si_status=SIGKILL, si_utime=0, si_stime=0} ---
[pid 591134] <... wait4 resumed>[{WIFSIGNALED(s) && WTERMSIG(s) == SIGKILL}], 0, NULL) = 591174
...
+++ exited with 0 +++

Before this PR, we'd usually observe just

...
[pid 596239] --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=596455, si_uid=1000} ---
...
+++ exited with 0 +++

Refs

refs https://github.com/neondatabase/cloud/issues/11387

Meaning, pageserver finishes its shutdown procedure and calls exit(0) on its own terms, instead of hitting the systemd unit's TimeoutSec= limit and getting SIGKILLed. ↩

… the process

…g when it exits

github-actions · 2024-06-25T10:33:55Z

2922 tests run: 2805 passed, 0 failed, 117 skipped (full report)

Flaky tests (1)

Postgres 15

test_subscriber_restart: release

Code coverage* (full report)

functions: 32.5% (6876 of 21141 functions)
lines: 50.1% (53658 of 107014 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
e402c46 at 2024-06-25T10:33:54.354Z :recycle:}

pageserver/src/walredo.rs

problame added 7 commits June 24, 2024 19:38

the fix: shutdown method for WalRedoManager, so the Arc<> can outlive…

7bc7d7c

… the process

pull in minimum amount of changes to make previous commit compile

80bd9c8

aesthetic fixes

8802354

clippy

9c76d26

fix bench

b015e40

self-review

ca68b41

make shutdown() more powerful by guaranteeing no processes are runnin…

b666458

…g when it exits

problame added the c/storage/pageserver Component: storage: pageserver label Jun 25, 2024

refine guarantees

e402c46

problame marked this pull request as ready for review June 25, 2024 09:48

problame requested a review from a team as a code owner June 25, 2024 09:48

problame requested a review from jcsp June 25, 2024 09:48

jcsp reviewed Jun 26, 2024

View reviewed changes

pageserver/src/walredo.rs Show resolved Hide resolved

jcsp approved these changes Jun 26, 2024

View reviewed changes

problame merged commit 66b0bf4 into main Jun 27, 2024
65 checks passed

problame deleted the problame/walredo-shutdown branch June 27, 2024 13:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: shutdown does not kill walredo processes #8150

fix: shutdown does not kill walredo processes #8150

problame commented Jun 25, 2024 •

edited

Loading

github-actions bot commented Jun 25, 2024

Postgres 15

fix: shutdown does not kill walredo processes #8150

fix: shutdown does not kill walredo processes #8150

Conversation

problame commented Jun 25, 2024 • edited Loading

Changes

Future Work

Testing

Refs

Footnotes

github-actions bot commented Jun 25, 2024

2922 tests run: 2805 passed, 0 failed, 117 skipped (full report)

Postgres 15

Code coverage* (full report)

problame commented Jun 25, 2024 •

edited

Loading