Umbrella issue for "redpanda failed to start in 20 sec" #19959

Open
travisdowns opened this issue Jun 23, 2024 · 2 comments
Labels
area/storage ci-failure ci-ignore Automatic ci analysis tools ignore this issue kind/bug Something isn't working

Comments

@travisdowns
Member

travisdowns commented Jun 23, 2024

There are several ci-failure issues in which Redpanda failed to start within 20 seconds, and I believe they share the same underlying cause.

This is an umbrella issue for them so that we have one stream of investigation. We can't close the other issues because of how PT works, but we can point them here.

A typical example:

TimeoutError('Redpanda service docker-rp-7 failed to start within 20 sec')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 103, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 971, in test_shadow_indexing
    self.start_redpanda(num_nodes=3, install_opts=install_opts)
  File "/root/tests/rptest/tests/end_to_end.py", line 125, in start_redpanda
    self.redpanda.start(auto_assign_node_id=new_bootstrap,
  File "/root/tests/rptest/services/redpanda.py", line 2552, in start
    raise e
  File "/root/tests/rptest/services/redpanda.py", line 2544, in start
    self.for_nodes(to_start, start_one)
  File "/root/tests/rptest/services/redpanda.py", line 1391, in for_nodes
    return list(executor.map(cb, nodes))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/tests/rptest/services/redpanda.py", line 2536, in start_one
    self.start_node(node,
  File "/root/tests/rptest/services/redpanda.py", line 2871, in start_node
    self.start_service(node, start_rp)
ducktape.errors.TimeoutError: Redpanda service docker-rp-7 failed to start within 20 sec
Test requested 5 nodes, used only 3

JIRA Link: CORE-4276

@michael-redpanda
Contributor

Maybe we add a call to

def raise_on_bad_logs(self, allow_list=None):
    """
    Raise a BadLogLines exception if any nodes' logs contain errors
    not permitted by `allow_list`
    :param allow_list: list of compiled regexes, or None for default
    :return: None
    """
    allow_list = prepare_allow_list(allow_list)
    _searchable_nodes = []
    for node in self.nodes:
        if self._skip_if_no_redpanda_log and not node.account.exists(
                RedpandaServiceBase.STDOUT_STDERR_CAPTURE):
            self.logger.info(
                f"{RedpandaServiceBase.STDOUT_STDERR_CAPTURE} not found on {node.account.hostname}. Skipping log scan."
            )
            continue
        _searchable_nodes.append(node)
    lsearcher = LogSearchLocal(self._context, allow_list, self.logger,
                               RedpandaServiceBase.STDOUT_STDERR_CAPTURE)
    lsearcher.search_logs(_searchable_nodes)
on nodes that fail to start within 20 seconds? At least start with that, then we can hopefully further distinguish the reason for the 20-second timeout.
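
A rough sketch of the idea, not the actual rptest code: on a startup timeout, scan the node's log before re-raising, so that a bad log line (for example an assertion) is reported instead of the generic timeout. Everything here is a hypothetical stand-in: `StartTimeout` plays the role of `ducktape.errors.TimeoutError`, the local `BadLogLines` plays the role of the exception raised by `raise_on_bad_logs`, and `start_node` / `scan_logs` are placeholder callables.

```python
class StartTimeout(Exception):
    """Hypothetical stand-in for ducktape.errors.TimeoutError."""


class BadLogLines(Exception):
    """Hypothetical stand-in for the exception raised by a log scan."""


def start_with_log_scan(start_node, scan_logs, node):
    """Start `node`; if startup times out, scan its logs before re-raising.

    `start_node(node)` is assumed to raise StartTimeout on a slow start, and
    `scan_logs(node)` is assumed to raise BadLogLines if the log contains
    errors that are not allow-listed.
    """
    try:
        start_node(node)
    except StartTimeout as timeout:
        try:
            scan_logs(node)
        except BadLogLines as bad:
            # Report the more specific failure, chained to the timeout.
            raise bad from timeout
        # The log scan found nothing: keep the original timeout error.
        raise
```

With this shape, a node that asserts at startup would surface as a bad-log-lines failure with the timeout as its cause, while a node that is merely slow would still surface as the plain 20-second timeout.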

@michael-redpanda
Contributor

That probably won't catch some of the failures above, but it would have caught the failure in #18845, since that one asserted at startup.
