Increase waiting time to verify workload completion. #2463

Draft: wants to merge 1 commit into base: main

mszadkow
Contributor

@mszadkow mszadkow commented Jun 21, 2024

The timeout can be increased because the time the pods need to complete is outside Kueue's control: that process is controlled by Kubernetes.
The change does not affect the happy path.

What type of PR is this?

/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #2429

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

The timeout can be increased because the time the pods need to complete is outside Kueue's control.
That time is controlled by Kubernetes.
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. labels Jun 21, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 21, 2024
@k8s-ci-robot
Contributor

Hi @mszadkow. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jun 21, 2024

netlify bot commented Jun 21, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit b8735fc
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66758d964feae80008da80fd
😎 Deploy Preview https://deploy-preview-2463--kubernetes-sigs-kueue.netlify.app

@trasc
Contributor

trasc commented Jun 21, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 21, 2024
@@ -541,7 +541,7 @@ var _ = ginkgo.Describe("Pod groups", func() {
 var p corev1.Pod
 g.Expect(k8sClient.Get(ctx, replKey, &p)).To(gomega.Succeed())
 g.Expect(p.Status.Phase).To(gomega.Equal(corev1.PodSucceeded))
-}, util.Timeout, util.Interval).Should(gomega.Succeed())
+}, util.LongTimeout, util.Interval).Should(gomega.Succeed())
Contributor

I'm wondering, why does it take so much time?

Contributor Author

From what I understood, the longer timeout here is just a guard against a "hiccup" in our testing infra, because it has been overwhelmed in the past.
Also, it doesn't affect the happy path.

Contributor

@mbobrovskyi mbobrovskyi Jun 24, 2024

Yes, but why do we have a "hiccup"? How much time is needed in a successful case?

Contributor Author

@mszadkow mszadkow Jun 24, 2024

We discussed in the daily that it is most probably resource exhaustion or something in the test infra, but I dug deeper...

A successful run takes around 1-2 seconds, e.g.:

 STEP: Verify the replacement pods of the default priority workload complete @ 06/21/24 17:12:58.032
 STEP: Verify the default priority workload is finished @ 06/21/24 17:13:00.065

And all the Kueue events arrived before the Eventually expired:

2024-06-21T17:13:00.029650262Z stderr F 2024-06-21T17:13:00.02943541Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:649	Workload update event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}, "queue": "queue", "status": "finished", "prevStatus": "admitted", "clusterQueue": "cq"}
2024-06-21T17:13:00.029797574Z stderr F 2024-06-21T17:13:00.029638912Z	LEVEL(-3)	localqueue-reconciler	core/localqueue_controller.go:199	Got Workload event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}}
2024-06-21T17:13:00.029807114Z stderr F 2024-06-21T17:13:00.029726863Z	LEVEL(-2)	cluster-queue-reconciler	core/clusterqueue_controller.go:330	Got generic event	{"obj": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}, "kind": "/, Kind="}
2024-06-21T17:13:00.057307688Z stderr F 2024-06-21T17:13:00.057121706Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:649	Workload update event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-vm8bl"}, "queue": "queue", "status": "finished", 

However, for the failed run we didn't get these events at all before the Eventually expired (at 15:28:01.191); see the errors below.
The workload was admitted only after the expected time, so it had no chance to finish because teardown kicked in.

2024-06-17T15:28:01.509816691Z stderr F 2024-06-17T15:28:01.509626579Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:649	Workload update event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "queue": "queue", "status": "admitted", "clusterQueue": "cq"}
2024-06-17T15:28:01.509941372Z stderr F 2024-06-17T15:28:01.509698379Z	LEVEL(-2)	multikueue/workload.go:160	Reconcile Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "4f245af4-326e-497c-bb9a-bba9e27b6f2e"}
2024-06-17T15:28:01.509961772Z stderr F 2024-06-17T15:28:01.50977886Z	LEVEL(-2)	multikueue/workload.go:185	Skip Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "4f245af4-326e-497c-bb9a-bba9e27b6f2e"}
2024-06-17T15:28:01.509967423Z stderr F 2024-06-17T15:28:01.509812271Z	LEVEL(-2)	core/workload_controller.go:144	Reconciling Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "a6331098-1635-47f2-a35a-718f73eda864", "workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.510008473Z stderr F 2024-06-17T15:28:01.509816931Z	LEVEL(-3)	localqueue-reconciler	core/localqueue_controller.go:199	Got Workload event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.510028353Z stderr F 2024-06-17T15:28:01.509859591Z	LEVEL(-2)	cluster-queue-reconciler	core/clusterqueue_controller.go:330	Got generic event	{"obj": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "kind": "/, Kind="}
2024-06-17T15:28:01.52873076Z stderr F 2024-06-17T15:28:01.528530457Z	LEVEL(-2)	jobframework/reconciler.go:308	Reconciling Job	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "76daee36-48c8-41ea-a75f-e2c88f3ef61d", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.528989573Z stderr F 2024-06-17T15:28:01.528844161Z	LEVEL(-2)	multikueue/workload.go:160	Reconcile Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "bf59503b-e97c-496f-a0b6-6b188715fc06"}
2024-06-17T15:28:01.528998833Z stderr F 2024-06-17T15:28:01.528854941Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:649	Workload update event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "queue": "queue", "status": "admitted", "clusterQueue": "cq"}
2024-06-17T15:28:01.529002803Z stderr F 2024-06-17T15:28:01.528905302Z	LEVEL(-2)	multikueue/workload.go:185	Skip Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "bf59503b-e97c-496f-a0b6-6b188715fc06"}
2024-06-17T15:28:01.529200485Z stderr F 2024-06-17T15:28:01.529041383Z	LEVEL(-3)	localqueue-reconciler	core/localqueue_controller.go:199	Got Workload event	{"workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.529208905Z stderr F 2024-06-17T15:28:01.529069784Z	LEVEL(-2)	core/workload_controller.go:144	Reconciling Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "namespace": "pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "a55e8f9f-87c7-40ee-beab-b0f25a018d86", "workload": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}}
2024-06-17T15:28:01.529212665Z stderr F 2024-06-17T15:28:01.529100014Z	LEVEL(-2)	cluster-queue-reconciler	core/clusterqueue_controller.go:330	Got generic event	{"obj": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "kind": "/, Kind="}
2024-06-17T15:28:01.548056673Z stderr F 2024-06-17T15:28:01.547834571Z	LEVEL(-2)	jobframework/reconciler.go:308	Reconciling Job	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.548095684Z stderr F 2024-06-17T15:28:01.547926642Z	LEVEL(-3)	jobframework/reconciler.go:390	update reclaimable counts if implemented by the job	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.56416916Z stderr F 2024-06-17T15:28:01.563885097Z	ERROR	jobframework/reconciler.go:400	Updating reclaimable pods	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.56419449Z stderr F sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob
2024-06-17T15:28:01.56419795Z stderr F 	/workspace/pkg/controller/jobframework/reconciler.go:400
2024-06-17T15:28:01.56420105Z stderr F sigs.k8s.io/kueue/pkg/controller/jobs/pod.(*Reconciler).Reconcile
2024-06-17T15:28:01.56420359Z stderr F 	/workspace/pkg/controller/jobs/pod/pod_controller.go:121
2024-06-17T15:28:01.56420603Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
2024-06-17T15:28:01.56420877Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
2024-06-17T15:28:01.56421138Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.5642136Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
2024-06-17T15:28:01.564216191Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.564218841Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.564221141Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.564223411Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2024-06-17T15:28:01.564226091Z stderr F 2024-06-17T15:28:01.564014328Z	ERROR	controller/controller.go:329	Reconciler error	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "eada761e-5458-4d61-875b-fd2eecaa7665", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.564228601Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.564230811Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
2024-06-17T15:28:01.564255761Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.564258331Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.564260651Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.564262871Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2024-06-17T15:28:01.569613573Z stderr F 2024-06-17T15:28:01.56942685Z	LEVEL(-2)	jobframework/reconciler.go:308	Reconciling Job	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.569639643Z stderr F 2024-06-17T15:28:01.569526852Z	LEVEL(-3)	jobframework/reconciler.go:390	update reclaimable counts if implemented by the job	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod"}
2024-06-17T15:28:01.57802327Z stderr F 2024-06-17T15:28:01.577808728Z	LEVEL(-3)	pod/expectations.go:59	Observed UID	{"pod": {"name":"replacement-for-default-priority-group-1","namespace":"pod-e2e-hm8rn"}, "store": "finalizedPods", "key": {"name":"default-priority-group","namespace":"pod-e2e-hm8rn"}, "uid": "97a58d62-5b4b-4cdc-b17f-c6a8d49640b2"}
2024-06-17T15:28:01.587391209Z stderr F 2024-06-17T15:28:01.587178517Z	ERROR	jobframework/reconciler.go:400	Updating reclaimable pods	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "job": "group/pod-e2e-hm8rn/default-priority-group", "gvk": "/v1, Kind=Pod", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.58742253Z stderr F sigs.k8s.io/kueue/pkg/controller/jobframework.(*JobReconciler).ReconcileGenericJob
2024-06-17T15:28:01.58742704Z stderr F 	/workspace/pkg/controller/jobframework/reconciler.go:400
2024-06-17T15:28:01.58743087Z stderr F sigs.k8s.io/kueue/pkg/controller/jobs/pod.(*Reconciler).Reconcile
2024-06-17T15:28:01.58743446Z stderr F 	/workspace/pkg/controller/jobs/pod/pod_controller.go:121
2024-06-17T15:28:01.58743766Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
2024-06-17T15:28:01.58744038Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
2024-06-17T15:28:01.5874432Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.58744607Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
2024-06-17T15:28:01.58744881Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.5874517Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.58745479Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.5874577Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2024-06-17T15:28:01.58746151Z stderr F 2024-06-17T15:28:01.587283298Z	ERROR	controller/controller.go:329	Reconciler error	{"controller": "v1_pod", "namespace": "group/pod-e2e-hm8rn", "name": "default-priority-group", "reconcileID": "5a75cf3b-e1a9-4e2c-9f37-ad6a81c5923f", "error": "admission webhook \"vworkload.kb.io\" denied the request: status.reclaimablePods[c3465716]: Required value: cannot be removed"}
2024-06-17T15:28:01.58746463Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
2024-06-17T15:28:01.58746712Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:329
2024-06-17T15:28:01.58746939Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
2024-06-17T15:28:01.58747168Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
2024-06-17T15:28:01.58747391Z stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
2024-06-17T15:28:01.5874763Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227

Contributor Author

@mbobrovskyi, does it seem to you that the extended timeout could save us here?

@mbobrovskyi
Contributor

/test pull-kueue-test-scheduling-perf-main

@alculquicondor
Contributor

/cc

@alculquicondor
Contributor

We generally use LongTimeout when waiting for Pods to finish, so I'm ok with this change.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 25, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: aab9fa18db0c616965505740dcc494f9626a6c2d

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mszadkow

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 25, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Flaky e2e test for Pod groups
5 participants