CI Failure (decommissioning stopped making progress) in NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool #19933

Open
mmaslankaprv opened this issue Jun 20, 2024 · 2 comments
Labels: ci-failure, ci-ignore (Automatic ci analysis tools ignore this issue), kind/bug (Something isn't working), team/replication (helper for jira sync)

mmaslankaprv commented Jun 20, 2024

https://buildkite.com/redpanda/redpanda/builds/50454#01903098-f01c-43b5-9802-57743666b009

Module: rptest.tests.node_pool_migration_test
Class:  NodePoolMigrationTest
Method: test_migrating_redpanda_nodes_to_new_pool
Arguments:
{
  "balancing_mode": "off",
  "cleanup_policy": "compact",
  "test_mode": "no_tiered_storage"
}
test_id:    rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off.test_mode=TestMode.NO_TIRED_STORAGE.cleanup_policy=compact
status:     FAIL
run time:   3 minutes 56.293 seconds


    AssertionError('Node 1 decommissioning stopped making progress')
Traceback (most recent call last):
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/tests/runner_client.py", line 276, in run_test
    return self.test_context.function(self.test)
  File "/opt/.ducktape-venv/lib/python3.10/site-packages/ducktape/mark/_mark.py", line 535, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 105, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 367, in test_migrating_redpanda_nodes_to_new_pool
    self._wait_for_nodes_removed(decommissioned_ids)
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 150, in _wait_for_nodes_removed
    return [r for r in result]
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 150, in <listcomp>
    return [r for r in result]
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 147, in <lambda>
    lambda id: self._wait_for_node_removed(id, decommissioned_ids),
  File "/root/tests/rptest/tests/node_pool_migration_test.py", line 139, in _wait_for_node_removed
    waiter.wait_for_removal()
  File "/root/tests/rptest/utils/node_operations.py", line 205, in wait_for_removal
    assert self._made_progress(
AssertionError: Node 1 decommissioning stopped making progress

JIRA Link: CORE-4258
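
The traceback passes through `concurrent.futures` because the per-node waits run in a thread pool, and an exception raised in a worker thread is re-raised in the caller when the result iterator is consumed. A minimal sketch of that pattern follows. The two functions are simplified, hypothetical stand-ins and not the rptest implementation; in particular, the failure of node 1 is hard-coded for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def wait_for_node_removed(node_id):
    # Hypothetical stand-in for the real waiter: node 1 is hard-coded to
    # fail so that the propagation path can be observed.
    assert node_id != 1, f"Node {node_id} decommissioning stopped making progress"
    return node_id


def wait_for_nodes_removed(node_ids):
    # One worker per node; map() returns a lazy iterator over the results.
    with ThreadPoolExecutor(max_workers=len(node_ids)) as executor:
        result = executor.map(wait_for_node_removed, node_ids)
        # Consuming the iterator re-raises the first worker exception here,
        # in the calling thread, which is why the list comprehension shows
        # up in the traceback.
        return [r for r in result]


try:
    wait_for_nodes_removed([1, 2, 3])
    message = None
except AssertionError as e:
    message = str(e)

print(message)
```

The assertion is raised inside the worker, stored on its future, and surfaces only when the caller iterates over the results, matching the `result_iterator` and `__get_result` frames above.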

@mmaslankaprv mmaslankaprv added kind/bug Something isn't working ci-failure labels Jun 20, 2024
mmaslankaprv (Member, Author) commented

Node 1 decommissioning stops making progress because recovery of the kafka/migration-test-workload/0 partition is stuck.

recovery_stm requests a read:

TRACE 2024-06-19 13:11:34,923 [shard 1:raft] raft - [follower: {id: 6, revision: 170}] [group_id:64, {kafka/migration-test-workload/0}] - recovery_stm.cc:261 - Reading batches, starting from: 6973

The read returns no batches more than 2 minutes later, and recovery stops:

TRACE 2024-06-19 13:13:52,671 [shard 1:raft] raft - [follower: {id: 6, revision: 170}] [group_id:64, {kafka/migration-test-workload/0}] - recovery_stm.cc:270 - Read no batches for recovery, stopping
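
The length of the stall can be checked directly from the two log timestamps. A small sketch, assuming the `LEVEL date time` prefix seen in the lines above; `parse_log_time` is a hypothetical helper, and the log lines are truncated to that prefix for brevity:

```python
from datetime import datetime, timedelta

# Timestamp layout used in the log lines above: date, time, then milliseconds
# after a comma.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(line):
    # Hypothetical helper: the first three whitespace-separated fields are
    # the log level, the date, and the time.
    _, date, time = line.split()[:3]
    return datetime.strptime(f"{date} {time}", LOG_TIME_FORMAT)


# Prefixes of the two recovery_stm log lines quoted above.
read_requested = "TRACE 2024-06-19 13:11:34,923 [shard 1:raft] raft"
read_stopped = "TRACE 2024-06-19 13:13:52,671 [shard 1:raft] raft"

stall = parse_log_time(read_stopped) - parse_log_time(read_requested)
print(stall)
```

The difference comes to 2 minutes 17.748 seconds, during which the read for recovery did not return.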

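Recovery continues only after compaction of the partition finishes. The final compaction log line carries the same timestamp (13:13:52,671) as the stopped read above: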

DEBUG 2024-06-19 13:13:52,671 [shard 1:lc  ] storage-gc - disk_log_impl.cc:790 - [{kafka/migration-test-workload/0}] Final compacted segment {offset_tracker:{term:1, base_offset:7192, committed_offset:7447, dirty_offset:7447}, compacted_segment=1, finished_self_compaction=1, finished_windowed_compaction=1, generation=89, reader={/var/lib/redpanda/data/kafka/migration-test-workload/0_41/7192-1-v1.log, (92511 bytes)}, writer=nullptr, cache=nullptr, compaction_index:nullopt, closed=0, tombstone=0, index={file:/var/lib/redpanda/data/kafka/migration-test-workload/0_41/7192-1-v1.base_index, offsets:0, index:{header_bitflags:0, base_offset:0, max_offset:7447, base_timestamp:{timestamp: 1718802677049}, max_timestamp:{timestamp: 1718802677196}, batch_timestamps_are_monotonic:1, with_offset:true, non_data_timestamps:0, broker_timestamp:{{timestamp: 1718802677191}}, num_compactible_records_appended:{22}, index(3,3,3)}, step:32768, needs_persistence:0}}

@travisdowns travisdowns added the ci-ignore Automatic ci analysis tools ignore this issue label Jun 24, 2024
@michael-redpanda michael-redpanda changed the title CI Failure (decommissioning stopped making progress) in Class.method CI Failure (decommissioning stopped making progress) in NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool Jun 27, 2024
@michael-redpanda michael-redpanda added the team/replication helper for jira sync label Jun 27, 2024