feat: add version tags #2482
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
Feedback on
|
Thank you for your prompt feedback @wjones127
...
If we want to go with a human readable form, is it worth considering a file tree like git? So instead of having a single JSON, we could have a file for each tag:
This would help to minimise the chance of conflicts, as it would require concurrent writers to be writing to the same tag simultaneously. What do you think? Happy to use JSON otherwise.
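To make the one-file-per-tag idea concrete, here is a minimal sketch on a local filesystem using only the Rust standard library. The `_refs/tags/<tag name>` layout, the helper names (`tag_path`, `create_tag`, `read_tag`), and the plain-text version encoding are all assumptions for illustration, not something this PR defines. A real implementation would go through the dataset's object store rather than `std::fs`.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical layout: `<dataset root>/_refs/tags/<tag name>`.
fn tag_path(root: &Path, tag: &str) -> PathBuf {
    root.join("_refs").join("tags").join(tag)
}

/// Create a tag by writing its own file. `create_new(true)` makes the open
/// fail if the file already exists, so two writers can only collide when
/// they target the same tag name.
fn create_tag(root: &Path, tag: &str, version: u64) -> io::Result<()> {
    let path = tag_path(root, tag);
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(path)?;
    // Assumption: the version is stored as plain decimal text.
    write!(file, "{}", version)
}

/// Read the version number a tag points at.
fn read_tag(root: &Path, tag: &str) -> io::Result<u64> {
    let text = fs::read_to_string(tag_path(root, tag))?;
    text.trim()
        .parse()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

With this shape, creating `v1` and `v2` concurrently touches two different files, and a second attempt to create `v1` returns an error instead of silently overwriting the first.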
Can you expand on what you mean by "fancy catalog integration"?
I was thinking the same thing. Should updating
Yes we can do that. I wasn't aware of this feature, so good to know.
Sounds good - I'll remove heads for now and think about this more.
This seems like a good idea. It means listing the tags will require a list directory operation, which can be slow on S3, but we won't ever have that many tags so I don't think it would be a big deal. Another thought: an additional field we'd like to have will be
Some systems (Iceberg would be a good example) use a separate catalog service to store the versions and other metadata about a table. This would handle concurrent transactions that update tags more easily. But it also makes setup harder for users, as they would have to configure and host the catalog.
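For the file-per-tag layout discussed earlier in this comment, listing tags reduces to a single directory listing. The sketch below shows that on a local filesystem with the standard library. The function name `list_tags` and the assumption that each tag file holds a decimal version number are hypothetical; on S3 the `read_dir` call would become a list-objects request against the object store.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

/// List every tag in a directory that holds one file per tag.
/// Assumption: the file name is the tag name and the file body is the
/// version number as decimal text.
fn list_tags(tags_dir: &Path) -> io::Result<BTreeMap<String, u64>> {
    let mut tags = BTreeMap::new();
    // One directory listing, then one small read per tag.
    for entry in fs::read_dir(tags_dir)? {
        let entry = entry?;
        if !entry.file_type()?.is_file() {
            continue; // skip subdirectories
        }
        let name = entry.file_name().to_string_lossy().into_owned();
        let version = fs::read_to_string(entry.path())?
            .trim()
            .parse::<u64>()
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        tags.insert(name, version);
    }
    Ok(tags)
}
```

The cost is one listing plus one read per tag, which stays small as long as the number of tags does.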
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2482      +/-   ##
==========================================
+ Coverage   79.55%   79.63%   +0.07%
==========================================
  Files         208      208
  Lines       59348    59268      -80
  Branches    59348    59268      -80
==========================================
- Hits        47216    47196      -20
+ Misses       9371     9266     -105
- Partials     2761     2806      +45
@wjones127 I'm almost ready to mark this as "Ready for review". I just need to update the doc comments and check the test coverage. I also want to address the point you made previously:
I couldn't find any functions called |
It would be this file here: https://github.com/lancedb/lance/blob/main/rust/lance/src/dataset/cleanup.rs
Closes #588
This PR takes a different course from what was proposed in #588 and the subsequent PR #605. Rather than using the existing `tag` field in the manifest, this PR adds a new file at the root of the dataset called `_refs`. This new file contains a field called `tags`, which is a mutable map from tag names to versions. When a user wants to check out a specific tag, we do a lookup to find the version number, and then simply call `checkout_version` under the hood (not implemented yet). By considering tags outside of the manifests file, we get the following benefits (all of which are consistent with the more familiar git tag):

So far, I've just added the basic functionality to create `_refs` as the dataset is created. Over the next week, I'll flesh out the other methods to support creating, reading, and deleting tags. I thought I'd raise a draft PR now though in case there are major issues with the high-level design that will prevent this PR from ever being accepted.
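
The lookup-then-checkout flow in this description can be sketched roughly as follows. This is an in-memory illustration only: the `Refs` struct, the `RefsError` enum, and the method names are hypothetical, and the `checkout_version` closure parameter is a stand-in for the dataset's real checkout call, whose signature is not shown here.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory form of the `_refs` file: a mutable map from
/// tag names to version numbers.
#[derive(Debug, Default)]
struct Refs {
    tags: HashMap<String, u64>,
}

#[derive(Debug, PartialEq)]
enum RefsError {
    TagExists(String),
    TagNotFound(String),
}

impl Refs {
    /// Add a tag; refuse to overwrite an existing one.
    fn create_tag(&mut self, name: &str, version: u64) -> Result<(), RefsError> {
        if self.tags.contains_key(name) {
            return Err(RefsError::TagExists(name.to_string()));
        }
        self.tags.insert(name.to_string(), version);
        Ok(())
    }

    /// Remove a tag and return the version it pointed at.
    fn delete_tag(&mut self, name: &str) -> Result<u64, RefsError> {
        self.tags
            .remove(name)
            .ok_or_else(|| RefsError::TagNotFound(name.to_string()))
    }

    /// Look up the version number for a tag.
    fn resolve(&self, name: &str) -> Result<u64, RefsError> {
        self.tags
            .get(name)
            .copied()
            .ok_or_else(|| RefsError::TagNotFound(name.to_string()))
    }
}

/// Resolve the tag, then hand the version to the caller-supplied
/// `checkout_version` stand-in.
fn checkout_tag<T>(
    refs: &Refs,
    name: &str,
    checkout_version: impl FnOnce(u64) -> T,
) -> Result<T, RefsError> {
    refs.resolve(name).map(checkout_version)
}
```

Because the tag map lives outside the manifests, creating or deleting a tag in this sketch never touches a version itself; it only changes which version number a name resolves to.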