[ENH] Cross-Version Collection Migration #2400

Open
wants to merge 1 commit into
base: anton/distributed_param_management
atroyn
Contributor

@atroyn atroyn commented Jun 22, 2024

Description of changes

This PR creates a migration path from previous versions of Chroma to the new version, which adds collection configuration storage. The migration is idempotent and non-destructive.

Since every collection must now have a configuration, loading an old collection would raise an error; this showed up as cross-version persistence failures.

With this approach, that no longer happens. This is a first step toward providing user-facing migration tooling. For now there is just this one script, but as we add more of these, they can be composed in a more intelligent way.
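A minimal sketch of what "idempotent and non-destructive" means in practice, using plain dictionaries as a stand-in for stored collections. The names `DEFAULT_CONFIGURATION_JSON`, `configuration_json_str`, and `migrate_records` are hypothetical and are not taken from the Chroma codebase; the real migration operates on a client.

```python
import json
from typing import Any, Dict, List

# Hypothetical default; the real default configuration is defined by Chroma.
DEFAULT_CONFIGURATION_JSON = json.dumps({})


def migrate_records(records: List[Dict[str, Any]]) -> int:
    """Add a default configuration to records that lack one.

    Returns the number of records changed. Existing configurations and all
    other fields are left untouched (non-destructive), and a second run
    changes nothing (idempotent).
    """
    changed = 0
    for record in records:
        if record.get("configuration_json_str") is None:
            record["configuration_json_str"] = DEFAULT_CONFIGURATION_JSON
            changed += 1
    return changed


old_store = [
    # Stand-in for a collection created by an old version: no configuration.
    {"name": "legacy", "metadata": {"k": "v"}},
    # Stand-in for a collection that already has a (made-up) configuration.
    {"name": "current", "configuration_json_str": '{"example": 1}'},
]
first_run = migrate_records(old_store)
second_run = migrate_records(old_store)
```

Running the function twice is the same check the automated test relies on: the second pass should find nothing left to change.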

This PR adds a new command to the chroma CLI, chroma migrate, which migrates all collections in a specified path (and, optionally, a tenant and database), with ./chroma as the default path.
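The command's argument shape can be sketched as follows. This is an illustration only: the `@app.command()` decorator in the diff suggests the real command is registered on an existing CLI app, whereas this sketch uses `argparse` from the standard library to stay self-contained. The flag names `--path`, `--tenant`, and `--database` are assumptions; the description only states that a path, tenant, and database can be given and that the path defaults to ./chroma.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="chroma")
    subcommands = parser.add_subparsers(dest="command", required=True)
    migrate = subcommands.add_parser("migrate", help="Migrate all collections")
    # Flag names are assumptions made for this sketch.
    migrate.add_argument("--path", default="./chroma")
    migrate.add_argument("--tenant", default=None)
    migrate.add_argument("--database", default=None)
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    return build_parser().parse_args(argv)


# With no options, only the default path is set.
defaults = parse(["migrate"])
# Tenant and database narrow the migration to one scope.
scoped = parse(
    ["migrate", "--path", "/data/chroma", "--tenant", "t1", "--database", "db1"]
)
```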

Test plan

Manual Test:

  • Create a collection using an old version of Chroma with a persistent client
  • Load the database with a persistent client from this version. list_collections() should fail with a JSON parsing error (since configurations don't exist)
  • Run the migration function over the client from the new version.
  • list_collections() should work as expected.

Automated:
test_cross_version_persist passes locally and in CI.

All tests should pass by this point in the stack.

Documentation Changes

The migration and the migration tool are documented at https://docs.trychroma.com/deployment/migration

Additionally, when a collection tries and fails to load a CollectionConfiguration from JSON, the error points the user to the same migration documentation.
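A rough sketch of that error path, assuming a standalone loader. `InvalidConfigurationError` and `load_configuration` are hypothetical names used only here, and the message wording is illustrative; only the documentation URL comes from this PR.

```python
import json
from typing import Any, Dict

MIGRATION_DOCS_URL = "https://docs.trychroma.com/deployment/migration"


class InvalidConfigurationError(ValueError):
    """Hypothetical error type used only in this sketch."""


def load_configuration(json_str: Any) -> Dict[str, Any]:
    """Parse a stored configuration, pointing at the migration docs on failure."""
    try:
        # json.loads raises TypeError for None and JSONDecodeError for bad text.
        parsed = json.loads(json_str)
    except (TypeError, json.JSONDecodeError) as exc:
        raise InvalidConfigurationError(
            "Unable to load the collection configuration from JSON. "
            "If this collection was created with an older version of Chroma, "
            f"see {MIGRATION_DOCS_URL} for how to migrate it."
        ) from exc
    return parsed
```

Chaining with `from exc` keeps the original parsing error available in the traceback while the user-facing message carries the link.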

TODO:

  • Documentation of migration
  • Migration tool in CLI

Please tag your PR title with one of: [ENH | BUG | DOC | TST | BLD | PERF | TYP | CLN | CHORE]. See https://docs.trychroma.com/contributing#contributing-code-and-ideas

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

Contributor Author

atroyn commented Jun 22, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

@atroyn atroyn changed the title Cross-version collection migrate [ENH] Cross-Version Collection Migration Jun 22, 2024
@atroyn atroyn force-pushed the anton/parameter_management branch from 9da85c6 to 876de4c Compare June 22, 2024 01:52
@atroyn atroyn force-pushed the anton/collection_migrations branch 2 times, most recently from 4e51a90 to fb78fc6 Compare June 22, 2024 01:54
@atroyn atroyn changed the base branch from anton/parameter_management to anton/distributed_param_management June 22, 2024 01:54
@atroyn atroyn force-pushed the anton/distributed_param_management branch from 50ecef5 to c8f99a4 Compare June 22, 2024 15:30
@atroyn atroyn force-pushed the anton/collection_migrations branch from fb78fc6 to 72962a4 Compare June 22, 2024 15:31
@atroyn atroyn force-pushed the anton/distributed_param_management branch from c8f99a4 to 3781435 Compare June 23, 2024 03:49
@atroyn atroyn force-pushed the anton/collection_migrations branch from 72962a4 to d121f6c Compare June 23, 2024 03:49
@atroyn atroyn force-pushed the anton/distributed_param_management branch from 3781435 to 9e82022 Compare June 23, 2024 17:37
@atroyn atroyn force-pushed the anton/collection_migrations branch from d121f6c to 4e0f924 Compare June 23, 2024 17:37
@atroyn atroyn force-pushed the anton/distributed_param_management branch from 9e82022 to 646b55d Compare June 23, 2024 19:29
@atroyn atroyn force-pushed the anton/collection_migrations branch from 4e0f924 to d45d0a0 Compare June 23, 2024 19:29
@atroyn atroyn force-pushed the anton/distributed_param_management branch from 646b55d to 7efdcc3 Compare June 23, 2024 19:53
@atroyn atroyn force-pushed the anton/collection_migrations branch 2 times, most recently from 4f97ea8 to e17f7ed Compare June 23, 2024 20:33
@atroyn atroyn force-pushed the anton/distributed_param_management branch from 7efdcc3 to e4b8410 Compare June 24, 2024 20:19
@atroyn atroyn force-pushed the anton/collection_migrations branch from e17f7ed to 071ab50 Compare June 24, 2024 20:19
@atroyn atroyn force-pushed the anton/distributed_param_management branch from e4b8410 to c224667 Compare June 24, 2024 20:37
@atroyn atroyn force-pushed the anton/collection_migrations branch from 071ab50 to 625451c Compare June 24, 2024 20:37
@atroyn atroyn force-pushed the anton/distributed_param_management branch from c224667 to 5516b7e Compare June 24, 2024 22:08
@atroyn atroyn force-pushed the anton/collection_migrations branch from 625451c to 8917034 Compare June 24, 2024 22:08
@atroyn atroyn force-pushed the anton/distributed_param_management branch from 5516b7e to c1fb6d6 Compare June 24, 2024 22:44
@atroyn atroyn force-pushed the anton/collection_migrations branch from 8917034 to 782f125 Compare June 24, 2024 22:44
@atroyn atroyn force-pushed the anton/distributed_param_management branch from c1fb6d6 to b96e4aa Compare June 24, 2024 23:15
@atroyn atroyn force-pushed the anton/collection_migrations branch from 782f125 to 03a9139 Compare June 24, 2024 23:16
@atroyn atroyn force-pushed the anton/distributed_param_management branch from b96e4aa to 5d4ca28 Compare June 25, 2024 16:26
@atroyn atroyn force-pushed the anton/collection_migrations branch from 03a9139 to 47f0966 Compare June 25, 2024 16:26
@atroyn atroyn force-pushed the anton/distributed_param_management branch from 5d4ca28 to 3d27445 Compare June 25, 2024 18:52
@atroyn atroyn force-pushed the anton/collection_migrations branch from 47f0966 to 5ee7c97 Compare June 25, 2024 18:52
@atroyn atroyn marked this pull request as ready for review June 25, 2024 18:55
@@ -97,5 +100,23 @@ def docs() -> None:
    webbrowser.open("https://docs.trychroma.com")


@app.command()  # type: ignore
def migrate(
Collaborator

I think it's a bit odd that you have to specify the tenant and database, and I would prefer that we add some tooling to migrate all of them. This feels like pushing our constraints (no list tenants) onto our users and delivering a worse UX.

Contributor Author

My thought here was that if you are using the tenant and database args, you know what you're doing and what you want to migrate. I could instead crawl the entire sysDB too. Let's discuss.

(The error here is that I forgot to pass these args; fixed.)

from tqdm import tqdm


def migrate_collections(
Collaborator

Do we want to version these? How does this extend in the future?

Contributor Author

Let's discuss. My thought here was we would wait for the next one to happen and then write the right abstraction, but you make the point elsewhere that we do already have other migrations. Will take a look at those.
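For the sake of discussion, one possible shape for versioned migrations is an ordered registry plus a recorded version. This is a hypothetical sketch, not a proposal taken from the PR or from Chroma's existing migrations; `schema_version`, `MIGRATIONS`, and `run_migrations` are invented names, and a dictionary stands in for the stored state.

```python
from typing import Any, Callable, Dict, List, Tuple

Store = Dict[str, Any]
Migration = Tuple[int, Callable[[Store], None]]


def add_default_configuration(store: Store) -> None:
    # Leaves an existing configuration untouched.
    store.setdefault("configuration", {})


# Ordered by version; later migrations would be appended here.
MIGRATIONS: List[Migration] = [(1, add_default_configuration)]


def run_migrations(store: Store) -> List[int]:
    """Apply every migration newer than the store's recorded version."""
    applied: List[int] = []
    current = int(store.get("schema_version", 0))
    for version, migrate in sorted(MIGRATIONS, key=lambda m: m[0]):
        if version > current:
            migrate(store)
            store["schema_version"] = version
            applied.append(version)
    return applied
```

Recording the version after each step means a rerun skips work that has already been done, which keeps the composed migration idempotent.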

Collaborator

@HammadB HammadB left a comment

I think we should remove and deprecate https://github.com/chroma-core/chroma-migrate since we are adding this

class EmptyConfiguration(Configuration):
    definitions = {}

def patched_from_json_str(
Collaborator

@HammadB HammadB Jun 28, 2024

I think we may be able to run this as a SQL migration, which I would prefer. I'm not really a fan of this UX, or of introducing a new migration hook that isn't future-proof in a nice way. Forcing people to run a migration tool for a change like this feels heavy.

Contributor Author

Doing this in SQL is scary because it would require direct string manipulation without validation.
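To make the trade-off concrete, here is a small sketch using an in-memory SQLite database. The table and column names are hypothetical and do not reflect Chroma's actual schema. A pure-SQL update can fill in missing values, but it cannot tell whether strings already stored are valid; doing the work in Python allows each value to be parsed first.

```python
import json
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration only.
conn.execute("CREATE TABLE collections (id TEXT PRIMARY KEY, config_json_str TEXT)")
conn.executemany(
    "INSERT INTO collections VALUES (?, ?)",
    [("a", None), ("b", '{"example": 1}'), ("c", "{not valid json")],
)

# The pure-SQL equivalent would be a blind string write such as:
#   UPDATE collections SET config_json_str = '{}' WHERE config_json_str IS NULL;


def migrate_with_validation(connection: sqlite3.Connection) -> List[str]:
    """Fill in missing configurations and report rows that fail to parse."""
    invalid: List[str] = []
    rows = connection.execute(
        "SELECT id, config_json_str FROM collections"
    ).fetchall()
    for row_id, raw in rows:
        if raw is None:
            connection.execute(
                "UPDATE collections SET config_json_str = ? WHERE id = ?",
                (json.dumps({}), row_id),
            )
            continue
        try:
            json.loads(raw)
        except json.JSONDecodeError:
            invalid.append(row_id)
    connection.commit()
    return invalid


invalid_ids = migrate_with_validation(conn)
stored = dict(conn.execute("SELECT id, config_json_str FROM collections"))
```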

@@ -312,6 +313,7 @@ def test_cycle_versions(
    system = config.System(settings)
    system.start()
    client = ClientCreator.from_system(system)
    migrate_collections(client)
Collaborator

@HammadB HammadB Jun 28, 2024

Should we only run this if the old version is one we expect to fail?

Contributor Author

My thought is that any migration we run here must be idempotent in the case there's nothing to be done, and doing this test this way helps ensure that.

@atroyn atroyn force-pushed the anton/distributed_param_management branch from 3d27445 to c55656f Compare June 28, 2024 15:46
@atroyn atroyn force-pushed the anton/collection_migrations branch from 5ee7c97 to 8509e67 Compare June 28, 2024 15:46
@atroyn atroyn force-pushed the anton/distributed_param_management branch from c55656f to ec7dbb6 Compare June 28, 2024 16:21
@atroyn atroyn force-pushed the anton/collection_migrations branch from 8509e67 to 37f7fc2 Compare June 28, 2024 16:21
@atroyn atroyn force-pushed the anton/distributed_param_management branch from ec7dbb6 to fafcc3e Compare June 28, 2024 16:43
@atroyn atroyn force-pushed the anton/collection_migrations branch from 37f7fc2 to 094b6ea Compare June 28, 2024 16:44
@atroyn atroyn force-pushed the anton/distributed_param_management branch from fafcc3e to 14643db Compare June 28, 2024 17:05
@atroyn atroyn force-pushed the anton/collection_migrations branch from 094b6ea to 7e4c972 Compare June 28, 2024 17:05
2 participants