CI Failure (decommissioning stopped making progress) in `NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool` #19933

mmaslankaprv · 2024-06-20T07:50:11Z

https://buildkite.com/redpanda/redpanda/builds/50454#01903098-f01c-43b5-9802-57743666b009

Module: rptest.tests.node_pool_migration_test
Class:  NodePoolMigrationTest
Method: test_migrating_redpanda_nodes_to_new_pool
Arguments:
{
  "balancing_mode": "off",
  "cleanup_policy": "compact",
  "test_mode": "no_tiered_storage"
}

test_id:    rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off.test_mode=TestMode.NO_TIRED_STORAGE.cleanup_policy=compact
status:     FAIL
run time:   3 minutes 56.293 seconds


    AssertionError('Node 1 decommissioning stopped making progress')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 105, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 367, in test_migrating_redpanda_nodes_to_new_pool
    self._wait_for_nodes_removed(decommissioned_ids)
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 150, in _wait_for_nodes_removed
    return [r for r in result]
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 150, in <listcomp>
    return [r for r in result]
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 147, in <lambda>
    lambda id: self._wait_for_node_removed(id, decommissioned_ids),
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 139, in _wait_for_node_removed
    waiter.wait_for_removal()
  File "/root/tests/rptest/utils/node_operations.py", line 205, in wait_for_removal
    assert self._made_progress(
AssertionError: Node 1 decommissioning stopped making progress

JIRA Link: CORE-4258

The text was updated successfully, but these errors were encountered:

mmaslankaprv · 2024-06-20T07:55:35Z

The problem that causes node 1 decommission to stop is related with stuck recovery of kafka/migration-test-workload/0 partition.

recovery_stm requests a read:

TRACE 2024-06-19 13:11:34,923 [shard 1:raft] raft - [follower: {id: 6, revision: 170}] [group_id:64, {kafka/migration-test-workload/0}] - recovery_stm.cc:261 - Reading batches, starting from: 6973

It is stopped after more than 2 minutes of waiting:

TRACE 2024-06-19 13:13:52,671 [shard 1:raft] raft - [follower: {id: 6, revision: 170}] [group_id:64, {kafka/migration-test-workload/0}] - recovery_stm.cc:270 - Read no batches for recovery, stopping

The recovery continues after compaction finishes:

DEBUG 2024-06-19 13:13:52,671 [shard 1:lc  ] storage-gc - disk_log_impl.cc:790 - [{kafka/migration-test-workload/0}] Final compacted segment {offset_tracker:{term:1, base_offset:7192, committed_offset:7447, dirty_offset:7447}, compacted_segment=1, finished_self_compaction=1, finished_windowed_compaction=1, generation=89, reader={/var/lib/redpanda/data/kafka/migration-test-workload/0_41/7192-1-v1.log, (92511 bytes)}, writer=nullptr, cache=nullptr, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/migration-test-workload/0_41/7192-1-v1.base_index, offsets:0, index:{header_bitflags:0, base_offset:0, max_offset:7447, base_timestamp:{timestamp: 1718802677049}, max_timestamp:{timestamp: 1718802677196}, batch_timestamps_are_monotonic:1, with_offset:true, non_data_timestamps:0, broker_timestamp:{{timestamp: 1718802677191}}, num_compactible_records_appended:{22}, index(3,3,3)}, step:32768, needs_persistence:0}}

travisdowns · 2024-06-24T02:06:06Z

ci-ignore because https://redpandadata.slack.com/archives/C06MCRY75U7/p1718996592057029

mmaslankaprv added kind/bug Something isn't working ci-failure labels Jun 20, 2024

mmaslankaprv mentioned this issue Jun 21, 2024

[v24.1.x] Refined leadership changed metrics #19891

Merged

travisdowns added the ci-ignore Automatic ci analysis tools ignore this issue label Jun 24, 2024

michael-redpanda changed the title ~~CI Failure (decommissioning stopped making progress) in Class.method~~ CI Failure (decommissioning stopped making progress) in NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool Jun 27, 2024

michael-redpanda added the team/replication helper for jira sync label Jun 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI Failure (decommissioning stopped making progress) in `NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool` #19933

CI Failure (decommissioning stopped making progress) in `NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool` #19933

mmaslankaprv commented Jun 20, 2024 •

edited by jira bot

Loading

mmaslankaprv commented Jun 20, 2024

travisdowns commented Jun 24, 2024

CI Failure (decommissioning stopped making progress) in NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool #19933

CI Failure (decommissioning stopped making progress) in NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool #19933

Comments

mmaslankaprv commented Jun 20, 2024 • edited by jira bot Loading

mmaslankaprv commented Jun 20, 2024

travisdowns commented Jun 24, 2024

CI Failure (decommissioning stopped making progress) in `NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool` #19933

CI Failure (decommissioning stopped making progress) in `NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool` #19933

mmaslankaprv commented Jun 20, 2024 •

edited by jira bot

Loading