[SUPPORT] The clean service can't clean historical version files after the savepoint instant when i set hoodie.archive.beyond.savepoint=true #11405

Open
gfcoder opened this issue Jun 6, 2024 · 1 comment


gfcoder commented Jun 6, 2024

The clean service does not clean historical file versions written after the savepoint instant when I set hoodie.archive.beyond.savepoint=true.

To Reproduce

  1. Set hoodie.archive.beyond.savepoint=true
  2. Use the default clean policy (KEEP_LATEST_COMMITS)
  3. Use the default archive policy
  4. Start the Flink job
  5. After several commits, create a savepoint
  6. After several clean periods, check the partition data

Expected behavior
Old commit data should be cleaned up according to the clean policy.

Environment Description

  • Hudi version: 0.13.1
  • Flink version: 1.14.4
  • Hadoop version: 3.1.0
  • Storage: HDFS

Additional context
I found that in the HoodieDefaultTimeline.getFirstNonSavepointCommit method, the 'savepointTimestamps' set is always empty, even though the savepoint instant already exists.
This happens because CleanPlanner.getFilesToCleanKeepingLatestCommits calls fileSystemView.getAllFileGroups, which retrieves all file groups in the partition path. However, the HoodieTimeline held by each HoodieFileGroup only contains instants with the following actions: COMMIT_ACTION, DELTA_COMMIT_ACTION, COMPACTION_ACTION, LOG_COMPACTION_ACTION, REPLACE_COMMIT_ACTION. Savepoint instants are therefore not part of that timeline. Consequently, when getFirstNonSavepointCommit is called, it never returns the first instant beyond the savepoint instant. As a result, historical version files are never cleaned.

CleanPlanner.getFilesToCleanKeepingLatestCommits -> fileSystemView.getAllFileGroups -> AbstractTableFileSystemView.addFilesToView -> this.visibleCommitsAndCompactionTimeline = visibleActiveTimeline.**getWriteTimeline** -> fileGroup.getAllFileSlices -> HoodieDefaultTimeline.getFirstNonSavepointCommit
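
To make the reported mechanism easier to follow, here is a minimal, self-contained Java sketch. It is a simplified model, not the Hudi source: the `Instant` class, the `SavepointTimelineSketch` class, the helper names, the lowercase action strings, and the sample timestamps are all assumptions made for illustration. It only shows that once a timeline is filtered down to write actions, a savepoint lookup on it finds nothing, so the "first non-savepoint commit" falls back to the first instant.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified stand-in for illustration only; this is NOT the Hudi source.
class SavepointTimelineSketch {

    // Hypothetical minimal instant: a timestamp plus an action name.
    static final class Instant {
        final String timestamp;
        final String action;

        Instant(String timestamp, String action) {
            this.timestamp = timestamp;
            this.action = action;
        }
    }

    // Action names assumed for this sketch, mirroring the constants listed in the issue.
    static final Set<String> WRITE_ACTIONS = new HashSet<>(Arrays.asList(
            "commit", "deltacommit", "compaction", "logcompaction", "replacecommit"));

    // Keeps only write actions, so savepoint instants are dropped.
    static List<Instant> writeTimeline(List<Instant> instants) {
        return instants.stream()
                .filter(i -> WRITE_ACTIONS.contains(i.action))
                .collect(Collectors.toList());
    }

    // Timestamps of the savepoint instants visible in the given timeline.
    static Set<String> savepointTimestamps(List<Instant> instants) {
        return instants.stream()
                .filter(i -> "savepoint".equals(i.action))
                .map(i -> i.timestamp)
                .collect(Collectors.toSet());
    }

    // First instant whose timestamp is not savepointed.
    // If no savepoint is visible, this is simply the first instant.
    static Optional<Instant> firstNonSavepointCommit(List<Instant> instants) {
        Set<String> savepoints = savepointTimestamps(instants);
        return instants.stream()
                .filter(i -> !savepoints.contains(i.timestamp))
                .findFirst();
    }

    // Assumed active timeline after archival has moved past a savepoint at 001:
    // commits 002-004 are archived, the savepointed commit 001 is retained.
    static List<Instant> sampleTimeline() {
        return Arrays.asList(
                new Instant("001", "commit"),
                new Instant("001", "savepoint"),
                new Instant("005", "commit"),
                new Instant("006", "commit"));
    }

    public static void main(String[] args) {
        List<Instant> full = sampleTimeline();
        List<Instant> writeOnly = writeTimeline(full);

        System.out.println("savepoints in full timeline:  " + savepointTimestamps(full));
        System.out.println("savepoints in write timeline: " + savepointTimestamps(writeOnly));
        System.out.println("first non-savepoint (full):   "
                + firstNonSavepointCommit(full).get().timestamp);
        System.out.println("first non-savepoint (write):  "
                + firstNonSavepointCommit(writeOnly).get().timestamp);
    }
}
```

In this model the full timeline resolves the first non-savepoint commit to 005, while the write-only timeline sees no savepoints and resolves it to 001. That matches the empty 'savepointTimestamps' set described above; how this then interacts with the real clean planner is as reported in this issue and is not reproduced by the sketch.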

@danny0405
Contributor

@nsivabalan can you give some insights here?

Projects
Status: Awaiting Triage