AWX not able to delete the worker pods after finished running #15247

Open
5 of 11 tasks
chinna44 opened this issue Jun 3, 2024 · 6 comments

chinna44 commented Jun 3, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

We recently upgraded AWX from 22.5.0 to 23.9.0. It is deployed on EKS 1.28.

After the upgrade, we observed that the worker pods for some jobs (not all), specifically inventory syncs, are not deleted even after the job has completed. These pods linger for hours or days until we delete them manually. I don't see any other errors.

The status of an affected worker pod is shown below:
NAME                          READY   STATUS     RESTARTS   AGE
automation-job-462026-6zf7c   1/2     NotReady   0          3m23s

The AWX control plane EE logs contain the following errors for the worker pods that are not deleted:
Error deleting pod automation-job-462026-6zf7c: client rate limiter Wait returned an error: context canceled
Context was canceled while reading logs for pod awx-workers/automation-job-462026-6zf7c. Assuming pod has finished

The pod description shows the following (confidential data omitted):
Containers:
  worker:
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
    Ready:          False
    Restart Count:  0
  authenticator:
    State:          Running
    Ready:          True
    Restart Count:  0

The automation-job-462026-6zf7c pod contains two containers: worker and authenticator.

When a pod is stuck, the worker container has terminated while the authenticator container keeps running. The attached files show what we see in each container:
worker-container.txt
authenticator-container.txt
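
The stuck state described above (worker terminated, sidecar still running) can be detected mechanically. The following is a minimal sketch, not part of AWX: it assumes the pod list was exported with `kubectl get pods -n awx-workers -o json` and parsed into a dict, and that the container is named `worker` as in this report. The function name `stuck_pods` is hypothetical.

```python
def stuck_pods(pod_list):
    """Return names of pods whose 'worker' container has terminated
    while at least one other container is still running.

    `pod_list` is the parsed JSON of `kubectl get pods -o json`.
    The container name 'worker' is an assumption taken from this report.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        # Map container name -> state dict (keys: running / terminated / waiting)
        states = {s["name"]: s.get("state", {}) for s in statuses}
        worker_done = "terminated" in states.get("worker", {})
        others_running = any(
            "running" in state
            for name, state in states.items()
            if name != "worker"
        )
        if worker_done and others_running:
            stuck.append(pod["metadata"]["name"])
    return stuck
```

To use it, load the exported JSON with `json.load` and pass the result to `stuck_pods`; the returned names are candidates for manual cleanup after review.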

For now we are testing this in a non-production environment, but it is currently a blocker for upgrading production. Please take a look and provide a fix, or suggest the best AWX version to use if this is a known issue.

AWX version

23.9.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Run many AWX jobs whose pods include both the worker and authenticator containers. (We observed the issue mainly with inventory sync jobs.)

Expected results

AWX deletes all the pods that finished running.

Actual results

Some AWX worker pods get stuck and are never deleted.

Additional information

No response


chronicc commented Jun 5, 2024

I observe the same issues on Kubernetes 1.27 with AWX 23.0.0.

The pods that are not deleted are those whose AWX jobs were deleted immediately after the pod failed. It looks like AWX only knows about existing pods through the jobs inside AWX.

If this is the case, the pod should be actively removed from Kubernetes when the job is deleted, OR the API output of the job should indicate whether the pod has already been deleted in Kubernetes.
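
As a rough illustration of the mismatch described above, orphaned pods could be found by comparing pod names against the job IDs AWX still knows about. This is a sketch under assumptions: the naming pattern `automation-job-<id>-<suffix>` is inferred from the single pod name in this issue, and the helper names `job_id_from_pod` and `orphaned_pods` are hypothetical.

```python
import re

# Assumed pattern, inferred from "automation-job-462026-6zf7c" in this issue.
POD_NAME_RE = re.compile(r"^automation-job-(\d+)-[a-z0-9]+$")


def job_id_from_pod(name):
    """Extract the numeric job ID from a pod name, or None if it does not match."""
    match = POD_NAME_RE.match(name)
    return int(match.group(1)) if match else None


def orphaned_pods(pod_names, known_job_ids):
    """Return pod names whose job ID is not among the IDs AWX still knows about."""
    known = set(known_job_ids)
    orphans = []
    for name in pod_names:
        job_id = job_id_from_pod(name)
        if job_id is not None and job_id not in known:
            orphans.append(name)
    return orphans
```

Here `known_job_ids` would come from whatever list of job IDs the AWX API still returns; how that list is gathered is left open.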

@TheRealHaoLiu
Member

Can you give us the output of /api/v2/jobs/462026?

@chinna44
Author

chinna44 commented Jun 7, 2024

@TheRealHaoLiu, below is the output.
I want to highlight again that the pod is not deleted for only a few inventory sync jobs, all of which completed successfully.

ansible-inventory [core 2.15.5]
config file = /ansible.cfg
configured module search path = ['/cyberark-ansible-modules/lib/ansible/modules', '/runner/project']
ansible python module location = /usr/local/lib/python3.9/site-packages/ansible
ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections:/usr/share/automation-controller/collections
executable location = /usr/local/bin/ansible-inventory
python version = 3.9.18 (main, Jan 24 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (/usr/bin/python3)
jinja version = 3.0.0
libyaml = False
Using /ansible.cfg as config file
[DEPRECATION WARNING]: DEFAULT_GATHER_TIMEOUT option, the module_defaults
keyword is a more generic version and can apply to all calls to the
M(ansible.builtin.gather_facts) or M(ansible.builtin.setup) actions, use
module_defaults instead. This feature will be removed from ansible-core in
version 2.18. Deprecation warnings can be disabled by setting
deprecation_warnings=False in ansible.cfg.
redirecting (type: inventory) ansible.builtin.aws_ec2 to amazon.aws.aws_ec2
Using inventory plugin 'ansible_collections.amazon.aws.plugins.inventory.aws_ec2' to process inventory source '/runner/inventory/aws_ec2.yml'
Parsed /runner/inventory/aws_ec2.yml inventory source with auto plugin
8.867 INFO Processing JSON output...
8.868 INFO Loaded 1 groups, 0 hosts
8.898 INFO Inventory import completed for AWS-sandbox-Windows in 0.0s

@TheRealHaoLiu
Member

@chinna44 that does not look like the output from the api endpoint... that looks like the stdout of the job

@chinna44
Author

@TheRealHaoLiu yes, you are correct. I'm sorry about that.

Below is the output for the endpoint /api/v2/jobs/462026. I could not see the job details for this or any other inventory sync job of this kind. Please let me know if you need the details in some other form.

HTTP 404 Not Found
Allow: GET, DELETE, HEAD, OPTIONS
Content-Type: application/json
Vary: Accept
X-API-Node: awx-web-c8bc64f45-h7xwt
X-API-Product-Name: AWX
X-API-Product-Version: 23.9.0
X-API-Time: 0.057s

{
    "detail": "Not found."
}
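
One possible reading of the 404 is that the ID belongs to a different resource type: AWX exposes inventory syncs under `inventory_updates` rather than `jobs`, and `unified_jobs` lists all job types. It could equally mean the record was deleted, as suggested earlier in the thread. The sketch below only builds the paths worth checking; the helper name `candidate_endpoints` is hypothetical and nothing here has been verified against this deployment.

```python
def candidate_endpoints(job_id, base="/api/v2"):
    """Build API paths where a record with this ID might live.

    The pod name only carries a numeric ID, so the resource type is unknown;
    an inventory sync would not be found under /jobs/.
    """
    return [
        f"{base}/jobs/{job_id}/",
        f"{base}/inventory_updates/{job_id}/",
        f"{base}/unified_jobs/?id={job_id}",
    ]
```

For the pod in this issue, `candidate_endpoints(462026)` yields the three paths to try in the API browser, in order.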

@chinna44
Author

@TheRealHaoLiu please let me know if you need any other details
