Umbrella issue for "redpanda failed to start in 20 sec" #19959

travisdowns · 2024-06-23T02:12:52Z

There are several ci-failure issues in which Redpanda failed to start within 20 seconds, and I believe they share the same underlying cause.

This is an umbrella issue for these so we have one stream of investigation. We can't close the other issues because of how PT works, but we can point them here.

A typical example:

TimeoutError('Redpanda service docker-rp-7 failed to start within 20 sec')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 103, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 971, in test_shadow_indexing
    self.start_redpanda(num_nodes=3, install_opts=install_opts)
  File "/root/tests/rptest/tests/end_to_end.py", line 125, in start_redpanda
    self.redpanda.start(auto_assign_node_id=new_bootstrap,
  File "/root/tests/rptest/services/redpanda.py", line 2552, in start
    raise e
  File "/root/tests/rptest/services/redpanda.py", line 2544, in start
    self.for_nodes(to_start, start_one)
  File "/root/tests/rptest/services/redpanda.py", line 1391, in for_nodes
    return list(executor.map(cb, nodes))
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/tests/rptest/services/redpanda.py", line 2536, in start_one
    self.start_node(node,
  File "/root/tests/rptest/services/redpanda.py", line 2871, in start_node
    self.start_service(node, start_rp)
ducktape.errors.TimeoutError: Redpanda service docker-rp-7 failed to start within 20 sec
Test requested 5 nodes, used only 3

JIRA Link: CORE-4276


michael-redpanda · 2024-06-28T13:50:06Z

Maybe we add a call to

redpanda/tests/rptest/services/redpanda.py

Lines 1548 to 1570 in 5f07574

    def raise_on_bad_logs(self, allow_list=None):
        """
        Raise a BadLogLines exception if any nodes' logs contain errors
        not permitted by `allow_list`
        :param allow_list: list of compiled regexes, or None for default
        :return: None
        """
        allow_list = prepare_allow_list(allow_list)
        _searchable_nodes = []
        for node in self.nodes:
            if self._skip_if_no_redpanda_log and not node.account.exists(
                    RedpandaServiceBase.STDOUT_STDERR_CAPTURE):
                self.logger.info(
                    f"{RedpandaServiceBase.STDOUT_STDERR_CAPTURE} not found on {node.account.hostname}. Skipping log scan."
                )
                continue
            _searchable_nodes.append(node)
        lsearcher = LogSearchLocal(self._context, allow_list, self.logger,
                                   RedpandaServiceBase.STDOUT_STDERR_CAPTURE)
        lsearcher.search_logs(_searchable_nodes)

on nodes that fail to start within 20 seconds? We could at least start with that, and then hopefully distinguish further between the reasons for the 20-second timeout.
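To make the idea concrete, here is a rough sketch of the control flow I have in mind. It is not the actual rptest code: `StartupTimeout`, the local `BadLogLines` class, and the `start_node` / `scan_logs` callables are hypothetical stand-ins for `ducktape.errors.TimeoutError`, the real `BadLogLines`, `start_node`, and `raise_on_bad_logs`, so the snippet runs on its own.

```python
class StartupTimeout(Exception):
    """Hypothetical stand-in for ducktape.errors.TimeoutError."""


class BadLogLines(Exception):
    """Hypothetical stand-in for the exception raised by the log scan."""


def start_with_log_scan(start_node, scan_logs, node):
    """Start `node`; if startup times out, scan the logs before giving up.

    `start_node` and `scan_logs` are assumed callables standing in for
    the service's start_node and raise_on_bad_logs methods.
    """
    try:
        start_node(node)
    except StartupTimeout as timeout:
        try:
            scan_logs()
        except BadLogLines as bad:
            # A bad log line explains the timeout better than the timeout
            # itself, so surface it and keep the timeout as the cause.
            raise bad from timeout
        # Logs look clean: re-raise the original timeout unchanged.
        raise
```

With something like this, a node that asserts at startup would be reported as a bad-log-lines failure rather than a generic 20-second timeout, while a timeout with clean logs would still show up as a timeout.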

michael-redpanda · 2024-06-28T13:52:16Z

That probably won't catch some of the above failures, but it would have caught the failure in #18845, since that one asserted at startup.

travisdowns added the kind/bug, ci-failure, and ci-ignore labels Jun 23, 2024

travisdowns added the area/storage label Jun 23, 2024

travisdowns mentioned this issue Jun 23, 2024

CI Failure (Assertion in feature table) in ControlCharacterPermittedAfterUpgrade.test_upgrade_from_pre_v23_2 #18013

Open
