karmada-controller-manager restarts frequently (crashes, restarts, runs, and crashes again in a loop) #5087

Closed
buptjinguodong opened this issue Jun 24, 2024 · 10 comments
Labels
kind/question Indicates an issue that is a support question.

Comments

@buptjinguodong

What happened:
karmada-controller-manager-54c8657457-c5k8b keeps crashing. Here are the logs:
W0624 01:28:11.425228 1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ServiceImport ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
I0624 01:28:11.425239 1 reflector.go:295] Stopping reflector *v1.Secret (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425252 1 reflector.go:295] Stopping reflector *v1.Namespace (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425264 1 reflector.go:295] Stopping reflector *v1.ClusterRole (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425285 1 reflector.go:295] Stopping reflector *v1alpha1.ClusterOverridePolicy (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425383 1 reflector.go:295] Stopping reflector *v1alpha2.ResourceBinding (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425419 1 reflector.go:295] Stopping reflector *v1.Service (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425434 1 reflector.go:295] Stopping reflector *v1alpha1.Work (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425473 1 reflector.go:295] Stopping reflector *v1alpha2.ClusterResourceBinding (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425437 1 reflector.go:295] Stopping reflector *v1.EndpointSlice (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425627 1 reflector.go:295] Stopping reflector *v1alpha1.ServiceImport (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425660 1 reflector.go:295] Stopping reflector *v1alpha1.Cluster (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
I0624 01:28:11.425772 1 internal.go:550] "Stopping and waiting for webhooks"
I0624 01:28:11.425909 1 internal.go:553] "Stopping and waiting for HTTP servers"
I0624 01:28:11.426369 1 server.go:251] "Shutting down metrics server with timeout of 1 minute" logger="controller-runtime.metrics"
I0624 01:28:11.426657 1 server.go:43] "shutting down server" kind="health probe" addr="[::]:10357"
I0624 01:28:11.426750 1 internal.go:557] "Wait completed, proceeding to shutdown the manager"
E0624 01:28:11.427301 1 controllermanager.go:199] controller manager exits unexpectedly: failed to wait for workload-rebalancer caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.WorkloadRebalancer
E0624 01:28:11.427353 1 run.go:74] "command failed" err="failed to wait for workload-rebalancer caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.WorkloadRebalancer"

What you expected to happen:
Normally it should keep running, but it restarts frequently (the pod status shows "44 (4m27s ago) 5d19h"): it crashes, restarts, runs for a while, and crashes again.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
no

Environment:

  • Karmada version: v1.10.0
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
  • Others: CentOS7.6
@buptjinguodong buptjinguodong added the kind/bug Categorizes issue or PR as related to a bug. label Jun 24, 2024
@chaosi-zju
Member

Hi @buptjinguodong, thank you for your feedback~

E0624 01:28:11.427301 1 controllermanager.go:199] controller manager exits unexpectedly: failed to wait for workload-rebalancer caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.WorkloadRebalancer

It seems there is an error in the WorkloadRebalancer controller.
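When the log is long, the failing type can be pulled out mechanically. The following is a minimal sketch, not part of Karmada: `failing_kinds` is a hypothetical helper name, and it assumes the log uses the exact "timed out waiting for cache to be synced for Kind" wording shown above.

```shell
# failing_kinds (hypothetical helper): read controller-manager logs on stdin
# and print each Kind whose cache failed to sync, one per line, de-duplicated.
# Assumes the message format seen in the logs quoted in this issue.
failing_kinds() {
  sed -n 's/.*timed out waiting for cache to be synced for Kind \*\([A-Za-z0-9.]*\).*/\1/p' | sort -u
}

# Example usage against the previous (crashed) container of the pod:
#   kubectl logs -n karmada-system karmada-controller-manager-54c8657457-c5k8b --previous | failing_kinds
```

For the logs in this issue it would report only `v1alpha1.WorkloadRebalancer`, which narrows the problem to that one resource type.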

However, there is too little information here. Can you provide more, such as:

  1. How did you install Karmada?
  2. What did you do in the installed environment before the crash? If you used the WorkloadRebalancer feature, what does the WorkloadRebalancer you created look like?
  3. Are there any other error logs before these logs?
  4. Have other Karmada components restarted, such as etcd or karmada-apiserver?

@buptjinguodong
Author

  1. How did you install Karmada?

Not using NFS storage; using local storage directly:

karmadactl init \
  --crds /root/karmada/karmada-deployment/crds.tar.gz \
  --namespace='karmada-system' \
  --port 30000 \
  --etcd-image='registry.k8s.io/etcd:3.5.9-0' \
  --etcd-init-image='registry.cn-hangzhou.aliyuncs.com/xxx/alpine:3.19.1' \
  --etcd-replicas=1 \
  --karmada-aggregated-apiserver-replicas=1 \
  --karmada-apiserver-replicas=1 \
  --karmada-controller-manager-replicas=1 \
  --karmada-kube-controller-manager-replicas=1 \
  --karmada-scheduler-replicas=1 \
  --karmada-webhook-replicas=1 \
  --karmada-apiserver-image='registry.k8s.io/kube-apiserver:v1.27.7' \
  --karmada-kube-controller-manager-image='registry.k8s.io/kube-controller-manager:v1.27.7' \
  --karmada-aggregated-apiserver-image='registry.cn-hangzhou.aliyuncs.com/xx/karmada-aggregated-apiserver:v1.10.0' \
  --karmada-controller-manager-image='registry.cn-hangzhou.aliyuncs.com/xx/karmada-controller-manager:v1.10.0' \
  --karmada-scheduler-image='registry.cn-hangzhou.aliyuncs.com/xx/karmada-scheduler:v1.10.0' \
  --karmada-webhook-image='registry.cn-hangzhou.aliyuncs.com/xx/karmada-webhook:v1.10.0' \
  --karmada-data='/home/appuser/karmada/etc' \
  --karmada-pki='/home/appuser/karmada/etc/pki' \
  --karmada-image-pull-secrets karmada-imagepullsecret

  2. What did you do in the installed environment before the crash? If you used the WorkloadRebalancer feature, what does the WorkloadRebalancer you created look like?
    Nothing. I did not do anything; it crashed on its own.
    I also have not used the WorkloadRebalancer feature myself.

  3. Are there any other error logs before these logs?
    I0624 08:47:11.476018 1 controller.go:242] "All workers finished" controller="clusterresourcebinding" controllerGroup="work.karmada.io" controllerKind="ClusterResourceBinding"
    I0624 08:47:11.476026 1 controller.go:242] "All workers finished" controller="namespace" controllerGroup="" controllerKind="Namespace"
    I0624 08:47:11.476032 1 controller.go:242] "All workers finished" controller="federatedresourcequota" controllerGroup="policy.karmada.io" controllerKind="FederatedResourceQuota"
    I0624 08:47:11.476042 1 controller.go:242] "All workers finished" controller="resourcebinding" controllerGroup="work.karmada.io" controllerKind="ResourceBinding"
    I0624 08:47:11.476051 1 controller.go:242] "All workers finished" controller="resourcebinding" controllerGroup="work.karmada.io" controllerKind="ResourceBinding"
    I0624 08:47:11.476061 1 controller.go:242] "All workers finished" controller="work" controllerGroup="work.karmada.io" controllerKind="Work"
    I0624 08:47:11.476069 1 internal.go:546] "Stopping and waiting for caches"
    W0624 08:47:11.476479 1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1.Secret ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
    W0624 08:47:11.476905 1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
    I0624 08:47:11.476948 1 reflector.go:295] Stopping reflector *v1.Service (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    W0624 08:47:11.477196 1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ClusterOverridePolicy ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
    W0624 08:47:11.477234 1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
    W0624 08:47:11.477316 1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.Work ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
    I0624 08:47:11.477070 1 reflector.go:295] Stopping reflector *v1.ClusterRoleBinding (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477522 1 reflector.go:295] Stopping reflector *v1alpha1.ServiceImport (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477550 1 reflector.go:295] Stopping reflector *v1alpha1.Remedy (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477568 1 reflector.go:295] Stopping reflector *v1alpha1.OverridePolicy (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477585 1 reflector.go:295] Stopping reflector *v1alpha1.FederatedHPA (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    W0624 08:47:11.477589 1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha2.ResourceBinding ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
    I0624 08:47:11.477600 1 reflector.go:295] Stopping reflector *v1alpha1.CronFederatedHPA (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477626 1 reflector.go:295] Stopping reflector *v1alpha1.Cluster (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    W0624 08:47:11.477635 1 reflector.go:462] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha2.ClusterResourceBinding ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
    I0624 08:47:11.477644 1 reflector.go:295] Stopping reflector *v1.ClusterRole (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477661 1 reflector.go:295] Stopping reflector *v1.Namespace (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477676 1 reflector.go:295] Stopping reflector *v1.Secret (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477735 1 reflector.go:295] Stopping reflector *v1alpha2.ResourceBinding (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477750 1 reflector.go:295] Stopping reflector *v1alpha1.ClusterOverridePolicy (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477862 1 reflector.go:295] Stopping reflector *v1.EndpointSlice (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477894 1 reflector.go:295] Stopping reflector *v1alpha1.Work (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.477910 1 reflector.go:295] Stopping reflector *v1alpha2.ClusterResourceBinding (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.478092 1 reflector.go:295] Stopping reflector *v1alpha1.FederatedResourceQuota (0s) from sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105
    I0624 08:47:11.478494 1 internal.go:550] "Stopping and waiting for webhooks"
    I0624 08:47:11.478526 1 internal.go:553] "Stopping and waiting for HTTP servers"
    I0624 08:47:11.479170 1 server.go:43] "shutting down server" kind="health probe" addr="[::]:10357"
    I0624 08:47:11.479325 1 server.go:251] "Shutting down metrics server with timeout of 1 minute" logger="controller-runtime.metrics"
    I0624 08:47:11.479420 1 internal.go:557] "Wait completed, proceeding to shutdown the manager"
    E0624 08:47:11.479492 1 controllermanager.go:199] controller manager exits unexpectedly: failed to wait for workload-rebalancer caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.WorkloadRebalancer
    E0624 08:47:11.479534 1 run.go:74] "command failed" err="failed to wait for workload-rebalancer caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.WorkloadRebalancer"

  4. Have other Karmada components restarted, such as etcd or karmada-apiserver?
    The other Karmada components are running normally.

@RainbowMango
Member

@buptjinguodong Could you post the output of karmadactl version?
Just a guess: you may not be using v1.10.0 of karmadactl.

@chaosi-zju
Member

Hi @buptjinguodong

karmadactl init
--crds /root/karmada/karmada-deployment/crds.tar.gz

Is this crds.tar.gz the v1.10.0 version of the CRD file?

Can you post the /root/karmada/karmada-deployment/crds.tar.gz file and the full karmada-controller-manager log file?

@buptjinguodong
Author

@buptjinguodong Could you post the output of karmadactl version? Just a guess: you may not be using v1.10.0 of karmadactl.

[root@master-1 nginx-fhpa]# karmadactl version
karmadactl version: version.Info{GitVersion:"v1.10.0", GitCommit:"c181917f3c4f2349ffbaa806d5b4839c11866b82", GitTreeState:"clean", BuildDate:"2024-05-31T10:14:20Z", GoVersion:"go1.21.10", Compiler:"gc", Platform:"linux/amd64"}

@buptjinguodong
Author

Hi @buptjinguodong

karmadactl init
--crds /root/karmada/karmada-deployment/crds.tar.gz

Is this crds.tar.gz the v1.10.0 version of the CRD file?

Can you post the /root/karmada/karmada-deployment/crds.tar.gz file and the full karmada-controller-manager log file?

It might be v1.10.0. Here are the files.

crds.tar.gz
karmada-controller-manager.log

@chaosi-zju
Member

It might be v1.10.0. Here are the files.

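No, your CRD archive is outdated; it is not the v1.10.0 version.

It does not contain the WorkloadRebalancer type, which was newly introduced in v1.10.0. Without that CRD installed, the v1.10.0 karmada-controller-manager cannot sync its WorkloadRebalancer cache, so it times out and exits.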

Please download version 1.10.0 CRDs: https://github.com/karmada-io/karmada/releases/download/v1.10.0/crds.tar.gz
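Before running karmadactl init with a local archive, it can be checked for a given type. This is a rough sketch under stated assumptions: `crds_has_kind` is a hypothetical helper, not a karmadactl feature, and it assumes the archive holds CRD YAML files in which the type appears on a line ending in `kind: <Name>`.

```shell
# crds_has_kind ARCHIVE KIND (hypothetical helper)
# Exit 0 if any file in the gzip-compressed tar ARCHIVE contains a line
# ending in "kind: KIND"; exit non-zero otherwise (including unreadable archives).
crds_has_kind() {
  # -O streams every extracted file to stdout instead of writing to disk.
  tar -xzOf "$1" | grep -E "kind: $2[[:space:]]*\$" >/dev/null
}

# Example usage:
#   crds_has_kind /root/karmada/karmada-deployment/crds.tar.gz WorkloadRebalancer \
#     && echo "WorkloadRebalancer CRD present" \
#     || echo "WorkloadRebalancer CRD missing: archive is likely older than v1.10.0"
```

The trailing anchor keeps a short name such as `Work` from matching longer kinds like `WorkloadRebalancer`.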

@buptjinguodong
Author

It works now, thanks.
This issue can be closed.

@RainbowMango
Member

Thanks for the information.

/close

/remove-kind bug
/kind question

@karmada-bot karmada-bot added the kind/question Indicates an issue that is a support question. label Jun 27, 2024
@karmada-bot
Collaborator

@RainbowMango: Closing this issue.

In response to this:

Thanks for the information.

/close

/remove-kind bug
/kind question

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@karmada-bot karmada-bot removed the kind/bug Categorizes issue or PR as related to a bug. label Jun 27, 2024