Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job with slices might fail when multiple slices try to create new jobevent partition #15267

Open
5 of 11 tasks
tomsiewert opened this issue Jun 11, 2024 · 0 comments
Open
5 of 11 tasks

Comments

@tomsiewert
Copy link
Contributor

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

If a job with slices is being created and a new jobevent partition is needed, it can happen that multiple slices try to create a new partition and one of the jobs is faster, thus the job might go into "error" state because the partition already exists.

It's a very time-specific issue and I was only able to trigger it twice at this time.

AWX version

24.4.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Create a job with multiple slices. Make sure that the latest jobevent table partition is big enough to create a new one.

Expected results

All jobs will be created and started.

Actual results

One or more jobs fails with DuplicateObject / ProgrammingError error.

Additional information

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/utils.py", line 87, in _execute
    return self.cursor.execute(sql)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/psycopg/cursor.py", line 732, in execute
    raise ex.with_traceback(None)
psycopg.errors.DuplicateObject: type "main_jobevent_20240611_13" already exists

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/jobs.py", line 499, in run
    self.pre_run_hook(self.instance, private_data_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/jobs.py", line 1073, in pre_run_hook
    super(RunJob, self).pre_run_hook(job, private_data_dir)
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/tasks/jobs.py", line 427, in pre_run_hook
    create_partition(instance.event_class._meta.db_table, start=instance.created)
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/awx/main/utils/common.py", line 1154, in create_partition
    cursor.execute(
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/utils.py", line 67, in execute
    return self._execute_with_wrappers(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/utils.py", line 84, in _execute
    with self.db.wrap_database_errors:
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/django/db/backends/utils.py", line 87, in _execute
    return self.cursor.execute(sql)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/awx/venv/awx/lib64/python3.11/site-packages/psycopg/cursor.py", line 732, in execute
    raise ex.with_traceback(None)
django.db.utils.ProgrammingError: type "main_jobevent_20240611_13" already exists
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant