k3s is unable to start after stopped for a while #10253

harryzcy · 2024-05-30T17:14:50Z

Environmental Info:
K3s Version: v1.28.10+k3s1

Node(s) CPU architecture, OS, and Version: Linux rpi-3 5.15.0-1055-raspi #58-Ubuntu SMP PREEMPT Sat May 4 03:52:40 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration: 1 server, 2 agents

Describe the bug:

I stopped the server and all agents for a few days, and now the cluster cannot restart.

Steps To Reproduce:

Expected behavior:

Cluster starts successfully

Actual behavior:

Errors such as "Error syncing pod, skipping" keep recurring, agents cannot join the server, and the server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error

Additional context / logs:

Gist: https://gist.github.com/harryzcy/34f7bb0a54defffda64377f17b863609

brandond · 2024-05-30T18:42:02Z

You didn't mention that you're running longhorn on these nodes, but that appears to be the case? The apiserver isn't coming ready, probably because it can't talk to longhorn to complete aggregated api discovery. Pods appear to be crashlooping because the LH CSI isn't registered:

E0530 12:44:49.589154 32845 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-29339ee5-93ea-4b92-b811-1af27fbf8f8c podName:116fd16b-b81c-4750-a0be-682ed21c14f6 nodeName:}" failed. No retries permitted until 2024-05-30 12:44:50.589051758 -0400 EDT m=+12.882424672 (durationBeforeRetry 1s). Error: UnmountVolume.TearDown failed for volume "pvc-29339ee5-93ea-4b92-b811-1af27fbf8f8c" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-29339ee5-93ea-4b92-b811-1af27fbf8f8c") pod "116fd16b-b81c-4750-a0be-682ed21c14f6" (UID: "116fd16b-b81c-4750-a0be-682ed21c14f6") : kubernetes.io/csi: Unmounter.TearDownAt failed to get CSI client: driver name driver.longhorn.io not found in the list of registered CSI drivers
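To see which CSI drivers kubelet reports as unregistered across a longer log, the error text above can be matched with a regular expression. This is a minimal sketch: the function name is made up for illustration, and it assumes the message wording stays the same as in the line quoted above.

```python
import re

# Matches the kubelet error text quoted above; the capture group is the driver name.
CSI_NOT_REGISTERED = re.compile(
    r"driver name (\S+) not found in the list of registered CSI drivers"
)


def unregistered_csi_drivers(log_lines):
    """Return the sorted, de-duplicated CSI driver names reported as unregistered."""
    drivers = set()
    for line in log_lines:
        match = CSI_NOT_REGISTERED.search(line)
        if match:
            drivers.add(match.group(1))
    return sorted(drivers)


# Shortened version of the log line above, used only as sample input.
sample = [
    "E0530 12:44:49.589154 32845 nestedpendingoperations.go:348] ... "
    "Unmounter.TearDownAt failed to get CSI client: driver name "
    "driver.longhorn.io not found in the list of registered CSI drivers",
]
print(unregistered_csi_drivers(sample))
```

The input can be any iterable of log lines, for example the lines of a saved k3s log file.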

Maybe check the pod logs on the server node to see what's wrong with the LH CSI deployment? You need to get LH fixed on the server node so that it can finish api discovery and become ready for agents to join.
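If the apiserver answers at all, its verbose readiness output (for example from `kubectl get --raw='/readyz?verbose'`) marks passing checks with `[+]` and failing ones with `[-]`. The sketch below pulls out the failing check names. The helper name and the sample text are illustrative assumptions, not output captured from this cluster.

```python
def failed_checks(readyz_output):
    """Return the names of checks marked as failed ("[-]") in verbose readyz output."""
    failed = []
    for line in readyz_output.splitlines():
        line = line.strip()
        if line.startswith("[-]"):
            # A failing line looks like "[-]name failed: reason"; keep only the name.
            failed.append(line[3:].split(" ", 1)[0])
    return failed


# Illustrative sample only; the real check names depend on the cluster.
sample_output = """\
[+]ping ok
[-]etcd failed: reason withheld
[+]log ok
readyz check failed
"""
print(failed_checks(sample_output))
```

Whatever check shows up as failed is a hint about which component to look at first. It does not replace reading the Longhorn pod logs.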

harryzcy · 2024-05-31T07:40:27Z

@brandond Yes, I have Longhorn installed, but the longhorn-csi-plugin pod is not starting correctly. It cannot connect to longhorn-backend, and I can't seem to restart the pod. Is there any way to skip API discovery for Longhorn?

Also, it seems that pods in the kube-system namespace are not starting correctly either. How can I get the pods that belong to a deployment to start? The deployments and replicasets seem fine, but their pods are not starting.

brandond · 2024-05-31T15:46:25Z

Have you tried force-deleting the pods on the agents so that they get rescheduled to the server node? I think that the LH pods will need to come up before other things will work properly. You might create an issue in the LH repo to ask if there are any specific cold-start procedures that you need to observe.
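A force delete with kubectl uses `--force --grace-period=0`. Here is a small sketch that builds those commands for a list of pods and, by default, only prints them. The pod name below and the `longhorn-system` namespace are assumptions for illustration, so substitute whatever `kubectl get pods -A -o wide` shows on the agents.

```python
import subprocess


def force_delete_command(namespace, pod):
    """Build the kubectl argv that force-deletes one pod without a grace period."""
    return [
        "kubectl", "delete", "pod", pod,
        "--namespace", namespace,
        "--force", "--grace-period=0",
    ]


def force_delete(pods, dry_run=True):
    """Print (dry_run=True) or run the force-delete command for each (namespace, pod)."""
    for namespace, pod in pods:
        cmd = force_delete_command(namespace, pod)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Hypothetical pod name; replace with the pods stuck on the agent nodes.
stuck_pods = [("longhorn-system", "longhorn-csi-plugin-abcde")]
force_delete(stuck_pods)  # dry run: prints the command instead of executing it
```

Force deletion removes the pod object without waiting for the node to confirm termination, so it is best kept to pods on nodes that are known to be down.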
