Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add version tags #2482

Draft
wants to merge 12 commits into
base: main
Choose a base branch
from
Draft

feat: add version tags #2482

wants to merge 12 commits into from

Conversation

dsgibbons
Copy link

Closes #588

This PR takes a different course from what was proposed in #588 and the subsequent PR #605. Rather than using the existing tag field in the manifest, this PR adds a new file at the root of the dataset called _refs. This new file contains a field called tags, which is a mutable map from tag names to versions. When a user wants to checkout a specific tag, we do a lookup to find the version number, and then simply call checkout_version under-the-hood (not implemented yet). By considering tags outside of the manifests file, we get the following benefits (all of which are consistent with the more familiar git tag):

  • Allow users to freely create and delete tags retroactively.
  • Allow multiple tags to point to the same underlying version.
  • Easily enforce that tag names are unique.
  • Allow users to re-use a previously deleted tags(so long as the name is unique among all other tags at the time of reuse).

So far, I've just added the basic functionality to create _refs as the dataset is created. Over the next week, I'll flesh out the other methods to support creating, reading, and deleting tags. I thought I'd raise a draft PR now though in case there are major issues with the high-level design that will prevent this PR from ever being accepted.

Copy link

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@dsgibbons dsgibbons changed the title Add version tags feat: add version tags Jun 17, 2024
@github-actions github-actions bot added the enhancement New feature or request label Jun 17, 2024
@wjones127
Copy link
Contributor

Feedback on tags

I think the design for tags looks good to me. Some caveats to keep in mind for it:

  1. If this is a mutable file, it must be read as a whole, rather than with some of the Lance readers that use get_range.
  2. There will be no strong guarantees that concurrent writers won't clobber each other. If two concurrent writers try to add a tag at the same time, it will be possible that only one will win.
    • If people care about fixing this, we might just want to implement some fancy catalog integration.
  3. cleanup_files() should probably be programmed to refuse to cleanup a version that has been tagged. I think users should have to explicitly delete the tag before it could be cleaned up.

A few things to consider for tags:

  1. Instead of using protobufs, perhaps we should just use JSON. This makes the file easily human readable, which makes for easier debugging.
  2. In Apache Iceberg, they allow you to set an expiration on a tag: https://iceberg.apache.org/docs/1.5.1/branching/#historical-tags. We don't have to do this immediately, but the possibility of additional fields attached to a tag does suggest that a simple map from name to version isn't quite the right data structure. Perhaps the keys should be an object with a version field.

Feedback on heads

I'm not sure heads will work well. My assumption is that there would be multiple versions that can be written to to create new versions, moving the respective head. Is that correct?

Our commit protocol is carefully designed to handle concurrent writers, but only for one linear sequence of versions. I don't see how it would work with multiple heads. I think you should consider limiting the scope to just tags for now. If you do want to tackle heads, I think you should read through the commit process, and propose changes to that which would support heads.

@wjones127 wjones127 self-requested a review June 17, 2024 22:26
@dsgibbons
Copy link
Author

Thank you for your prompt feedback @wjones127

  1. If this is a mutable file, it must be read as a whole, rather than with some of the Lance readers that use get_range.
  2. There will be no strong guarantees that concurrent writers won't clobber each other. If two concurrent writers try to add a tag at the same time, it will be possible that only one will win.

...

  1. Instead of using protobufs, perhaps we should just use JSON. This makes the file easily human readable, which makes for easier debugging.

If we want to go with a human readable form, is it worth considering a file tree like git? So instead of having a single JSON, we could have a file for each tag:

└── refs
    └── tags
        ├── v1
        ├── v2
        └── v3

This would help to minimise the chance of conflicts, as it would require concurrent writers to be writing to the same tag simultaneously. What do you think? Happy to use JSON otherwise.

If people care about fixing this, we might just want to implement some fancy catalog integration.

Can you expand on what you mean by "fancy catalog integration"?

  1. cleanup_files() should probably be programmed to refuse to cleanup a version that has been tagged. I think users should have to explicitly delete the tag before it could be cleaned up.

I was thinking the same thing. Should updating cleanup_files() be part of this PR?

  1. In Apache Iceberg, they allow you to set an expiration on a tag: https://iceberg.apache.org/docs/1.5.1/branching/#historical-tags. We don't have to do this immediately, but the possibility of additional fields attached to a tag does suggest that a simple map from name to version isn't quite the right data structure. Perhaps the keys should be an object with a version field.

Yes we can do that. I wasn't aware of this feature, so good to know.

Feedback on heads

Sounds good - I'll remove heads for now and think about this more.

@wjones127
Copy link
Contributor

This would help to minimise the chance of conflicts, as it would require concurrent writers to be writing to the same tag simultaneously.

This seems like a good idea. It means listing the tags will require a list directory operation, which can be slow on S3, but we won't ever have that many tags so I don't think it would be a big deal.

Another thought: an additional field we'd like to have will be manifestSize, providing the full size in bytes of the manifest file for that version. This can be used later to optimize the read of the manifest.

Can you expand on what you mean by "fancy catalog integration"?

Some systems (Iceberg would be a good example), use an separate catalog service to store the versions and other metadata about a table. This would handle the concurrent transactions to update tags more easily. But it also makes it harder for users to setup, as they would have to configure and host the catalog.

@codecov-commenter
Copy link

codecov-commenter commented Jun 21, 2024

Codecov Report

Attention: Patch coverage is 82.71605% with 14 lines in your changes missing coverage. Please review.

Project coverage is 79.63%. Comparing base (0bf765b) to head (52fdd5b).
Report is 10 commits behind head on main.

Files Patch % Lines
rust/lance/src/dataset.rs 84.12% 1 Missing and 9 partials ⚠️
rust/lance/src/dataset/refs.rs 77.77% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2482      +/-   ##
==========================================
+ Coverage   79.55%   79.63%   +0.07%     
==========================================
  Files         208      208              
  Lines       59348    59268      -80     
  Branches    59348    59268      -80     
==========================================
- Hits        47216    47196      -20     
+ Misses       9371     9266     -105     
- Partials     2761     2806      +45     
Flag Coverage Δ
unittests 79.63% <82.71%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dsgibbons
Copy link
Author

@wjones127 I'm almost ready to mark this as "Ready for review". I just need to update the doc comments and check the test coverage. I also want to address the point you made previously:

cleanup_files() should probably be programmed to refuse to cleanup a version that has been tagged. I think users should have to explicitly delete the tag before it could be cleaned up.

I couldn't find any functions called cleanup_files() in the codebase (other than being in mentioned in some of the doc comments). Could you please elaborate on where I can find this?

@wjones127
Copy link
Contributor

cleanup_files() should probably be programmed to refuse to cleanup a version that has been tagged. I think users should have to explicitly delete the tag before it could be cleaned up.

I couldn't find any functions called cleanup_files() in the codebase (other than being in mentioned in some of the doc comments). Could you please elaborate on where I can find this?

It would be this file here: https://github.com/lancedb/lance/blob/main/rust/lance/src/dataset/cleanup.rs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support version tagging
3 participants