k3s is unable to start after stopped for a while #10253

Open
harryzcy opened this issue May 30, 2024 · 3 comments

Comments

@harryzcy

harryzcy commented May 30, 2024

Environmental Info:
K3s Version: v1.28.10+k3s1

Node(s) CPU architecture, OS, and Version: Linux rpi-3 5.15.0-1055-raspi #58-Ubuntu SMP PREEMPT Sat May 4 03:52:40 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration: 1 server, 2 agents

Describe the bug:

I stopped the server and all agents for a few days, and now the cluster will not restart.

Steps To Reproduce:

Expected behavior:

Cluster starts successfully

Actual behavior:

Errors such as "Error syncing pod, skipping" keep recurring, agents cannot join the server, and the server is not ready: https://127.0.0.1:6443/v1-k3s/readyz: 500 Internal Server Error

Additional context / logs:

Gist: https://gist.github.com/harryzcy/34f7bb0a54defffda64377f17b863609

@brandond
Contributor

You didn't mention that you're running longhorn on these nodes, but that appears to be the case? The apiserver isn't coming ready, probably because it can't talk to longhorn to complete aggregated api discovery. Pods appear to be crashlooping because the LH CSI isn't registered:

E0530 12:44:49.589154 32845 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/driver.longhorn.io^pvc-29339ee5-93ea-4b92-b811-1af27fbf8f8c podName:116fd16b-b81c-4750-a0be-682ed21c14f6 nodeName:}" failed. No retries permitted until 2024-05-30 12:44:50.589051758 -0400 EDT m=+12.882424672 (durationBeforeRetry 1s). Error: UnmountVolume.TearDown failed for volume "pvc-29339ee5-93ea-4b92-b811-1af27fbf8f8c" (UniqueName: "kubernetes.io/csi/driver.longhorn.io^pvc-29339ee5-93ea-4b92-b811-1af27fbf8f8c") pod "116fd16b-b81c-4750-a0be-682ed21c14f6" (UID: "116fd16b-b81c-4750-a0be-682ed21c14f6") : kubernetes.io/csi: Unmounter.TearDownAt failed to get CSI client: driver name driver.longhorn.io not found in the list of registered CSI drivers

Maybe check the pod logs on the server node to see what's wrong with the LH CSI deployment? You need to get LH fixed on the server node so that it can finish api discovery and become ready for agents to join.
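For anyone triaging a similar log, here is a minimal Python sketch that scans kubelet log lines for the "not found in the list of registered CSI drivers" message quoted above and groups the affected volumes by driver. The function name and both regular expressions are my own assumptions, derived only from that single log line, so they may need adjusting for other message variants.

```python
import re

# Assumption: both patterns are modelled on the one kubelet line quoted in this thread.
CSI_NOT_REGISTERED = re.compile(
    r"driver name (?P<driver>\S+) not found in the list of registered CSI drivers"
)
VOLUME = re.compile(r'for volume "(?P<volume>[^"]+)"')


def unregistered_csi_drivers(log_lines):
    """Map each unregistered CSI driver name to the set of volumes that failed on it."""
    result = {}
    for line in log_lines:
        match = CSI_NOT_REGISTERED.search(line)
        if match is None:
            continue  # not a CSI registration error
        volumes = result.setdefault(match.group("driver"), set())
        volume = VOLUME.search(line)
        if volume is not None:
            volumes.add(volume.group("volume"))
    return result
```

On a systemd-based install the input could come from something like `journalctl -u k3s` piped into a file and read line by line. If `driver.longhorn.io` shows up in the result, the Longhorn CSI plugin has not registered with the kubelet on that node.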

@harryzcy
Author

harryzcy commented May 31, 2024

@brandond Yes, I have longhorn installed, but the longhorn-csi-plugin pod is not starting correctly. It cannot connect to longhorn-backend, and I can't seem to restart the pod. Is there any way to skip api discovery with longhorn?

Also, it seems pods in the kube-system namespace are not starting correctly either. How can I get the pods that belong to a deployment to start? The deployments and replicasets seem fine, but their pods are not starting.

@brandond
Contributor

Have you tried force-deleting the pods on the agents so that they get rescheduled to the server node? I think that the LH pods will need to come up before other things will work properly. You might create an issue in the LH repo to ask if there are any specific cold-start procedures that you need to observe.
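As a rough illustration of the force-delete suggestion, the Python sketch below builds one `kubectl delete pod ... --grace-period=0 --force` command for each pod that is bound to an agent node. It only constructs the argument lists and does not run anything. The function name and the node and pod names in the usage example are hypothetical, and the pod-to-node listing is assumed to be collected separately, for example from `kubectl get pods -A -o wide`.

```python
def force_delete_commands(pods, agent_nodes):
    """Build kubectl force-delete commands for pods bound to agent nodes.

    pods: iterable of (namespace, pod_name, node_name) tuples
    agent_nodes: iterable of agent node names
    """
    agents = set(agent_nodes)
    return [
        ["kubectl", "delete", "pod", name, "--namespace", namespace,
         "--grace-period=0", "--force"]
        for namespace, name, node in pods
        if node in agents  # leave pods already on the server node alone
    ]


# Hypothetical names, for illustration only.
pods = [
    ("longhorn-system", "longhorn-csi-plugin-abcde", "agent-1"),
    ("kube-system", "coredns-12345", "server-1"),
]
for command in force_delete_commands(pods, ["agent-1", "agent-2"]):
    print(" ".join(command))
```

Force deletion removes the pod object from the API without waiting for the kubelet on the stopped agent to confirm, so it seems worth reviewing the generated commands before running them, particularly for pods that have Longhorn volumes attached.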
