Multi-target vector search #5138

Open
1 task done
dirkkul opened this issue Jun 11, 2024 · 8 comments

@dirkkul
Contributor

dirkkul commented Jun 11, 2024

Describe your feature request

tracking issue


@dirkkul dirkkul mentioned this issue Jun 11, 2024
4 tasks
@dirkkul dirkkul self-assigned this Jun 11, 2024
@hadfield

Migrating some comments/questions from:
#4955

I would like to suggest the ability to configure scoring/ranking, such as, for a nearText case, sorting by the minimum average distance based on a distance metric (such as cosine) and including some weighting, so if this was using 3 vectors, weights might be [0.4, 0.3, 0.3] to more heavily weight the first vector. Depending on the distance metric, there may need to be some normalization, especially if the vectors are coming from different embedding models.
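
To make the suggestion concrete, here is a minimal sketch of weighted ranking over three vectors per object. It is not Weaviate's API: `cosine_distance`, `weighted_rank`, and the toy data are hypothetical names and values chosen for illustration, and no normalization across embedding models is attempted.

```python
import math

def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def weighted_rank(query_vectors, objects, weights):
    """Rank object ids by the weighted sum of per-vector cosine distances.

    query_vectors: one query vector per target, in the same order as weights.
    objects: {object_id: [vector for target 0, vector for target 1, ...]}.
    """
    scored = []
    for object_id, vectors in objects.items():
        score = sum(
            w * cosine_distance(q, v)
            for w, q, v in zip(weights, query_vectors, vectors)
        )
        scored.append((score, object_id))
    # Lower combined distance ranks first.
    return [object_id for _, object_id in sorted(scored)]

# Toy data: three 2-d vectors per object (hypothetical values).
objects = {
    "a": [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    "b": [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]],
}
query = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
ranking = weighted_rank(query, objects, weights=[0.4, 0.3, 0.3])
```

With weights [0.4, 0.3, 0.3], object "b" ranks first because it matches two of the three vectors; raising the first weight to 0.8 flips the order in favor of "a".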

@hadfield

Migrating some comments/questions from:
#4955

Related to this, but a different usage scenario, is a query that extends across collections that involves more than one vector.

Given a data model like:

Document (Collection), Topic (Collection), Image (Collection)

Document:
content (Vector, Text Embedding)

Topic:
description (Vector, Text Embedding)
multiModalDescription (Vector, MultiModal Embedding)

Image:
content (Vector, MultiModal Embedding)

Query:

  • Document content: nearText("cute kittens")
  • Matching Documents provide vectors to find nearby Topics based on closeness in the Text Embedding space
  • Matching Topics provide vectors to find nearby Images based on closeness in the MultiModal Embedding space
So the Topics collection has two vectors and serves to "join" the two embedding spaces allowing queries to traverse across the embedding spaces. One scenario when this arises is when there is an existing dataset for "documents" and an existing dataset for "images" and you want to query across them without having to modify the current data (or the processes that maintain it).
I briefly discussed this use-case with @bobvanluijt a few months back at an event in NYC.
Hopefully I articulated what I mean, but let me know if clarifications are needed, or if I'm on the wrong track.
If this use-case is completely separate, I guess an issue could be added?
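
The traversal described above could be sketched roughly as follows. This is a brute-force illustration only, assuming the query text has already been embedded into the text space; `nearest`, `traverse`, and all ids and vectors are hypothetical, and Euclidean distance stands in for whatever metric the collections would use.

```python
import math

def nearest(query_vector, items, vector_name, k):
    """Return the k items whose named vector is closest to query_vector."""
    return sorted(
        items, key=lambda item: math.dist(query_vector, item[vector_name])
    )[:k]

def traverse(query_vector, documents, topics, images, k=1):
    """Document -> Topic (text space) -> Image (multimodal space)."""
    results = []
    for doc in nearest(query_vector, documents, "content", k):
        # The Document vector finds Topics in the shared text embedding space.
        for topic in nearest(doc["content"], topics, "description", k):
            # The Topic's second vector bridges into the multimodal space.
            for image in nearest(
                topic["multiModalDescription"], images, "content", k
            ):
                results.append((doc["id"], topic["id"], image["id"]))
    return results

# Toy data: 2-d text embeddings, 3-d multimodal embeddings (made-up values).
documents = [
    {"id": "doc-kittens", "content": [1.0, 0.0]},
    {"id": "doc-taxes", "content": [0.0, 1.0]},
]
topics = [
    {"id": "topic-cats", "description": [0.9, 0.1],
     "multiModalDescription": [1.0, 0.0, 0.0]},
    {"id": "topic-finance", "description": [0.1, 0.9],
     "multiModalDescription": [0.0, 0.0, 1.0]},
]
images = [
    {"id": "img-cat", "content": [0.9, 0.1, 0.0]},
    {"id": "img-chart", "content": [0.0, 0.1, 0.9]},
]
paths = traverse([0.95, 0.05], documents, topics, images)
```

Note that neither the Document nor the Image data needs to change; only the Topic objects carry vectors in both spaces.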

@hadfield

Migrating some comments/questions from:
#4955

For the parallel N vector query case, is there the concept of optimizing the ordering, such that the vector that has the fewest nearby results can be a gating factor on the others? In document search, if you were querying for "happy" AND "aardvark" you would search for "aardvark" first, which presumably would be less frequent and help filter the "happy" results. The situation with vectors is not exactly the same, but I thought a similar process might help.

In a query I would use this for, one of the vectors would have something like 1000x as many nearby vectors as the others, so it could be bad performance-wise to enumerate them all only to intersect them with the other, much smaller sets.
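
A rough sketch of the gating idea, under stated assumptions: the per-target hit counts in `estimated_hits` are taken as given (how to estimate them cheaply is left open), the scan is brute force rather than an ANN index, and `gated_search` and `l1` are hypothetical helper names.

```python
def l1(a, b):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def gated_search(query, objects, max_distance, estimated_hits):
    """Intersect per-target neighbor sets, most selective target first.

    Returns the surviving object ids and the number of distance
    computations performed, to show the effect of the ordering.
    """
    order = sorted(query, key=lambda target: estimated_hits[target])
    candidates = set(objects)
    comparisons = 0
    for target in order:
        survivors = set()
        # Later targets only look at objects that passed the earlier ones.
        for object_id in candidates:
            comparisons += 1
            d = l1(query[target], objects[object_id][target])
            if d <= max_distance:
                survivors.add(object_id)
        candidates = survivors
    return candidates, comparisons

# Toy data: every object is near the query in "common", only one in "rare".
objects = {
    i: {"common": [0.0], "rare": [0.0 if i == 0 else 5.0]}
    for i in range(1000)
}
query = {"common": [0.0], "rare": [0.0]}

rare_first = gated_search(query, objects, 1.0, {"rare": 1, "common": 1000})
common_first = gated_search(query, objects, 1.0, {"rare": 1000, "common": 1})
```

In this toy setup, searching the rare target first needs 1001 distance computations, while searching the common target first needs 2000; both return the same single object.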

@dirkkul
Contributor Author

dirkkul commented Jun 27, 2024

Hi @hadfield

> I would like to suggest the ability to configure scoring/ranking, such as, for a nearText case, sorting by the minimum average distance based on a distance metric (such as cosine) and including some weighting, so if this was using 3 vectors, weights might be [0.4, 0.3, 0.3] to more heavily weight the first vector. Depending on the distance metric, there may need to be some normalization, especially if the vectors are coming from different embedding models.

This will be included. The options will be:

  • sum
  • minimum
  • average
  • manual weights
  • relative scores (same as hybrid)
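
As an illustration of how these options might differ, here is a minimal sketch that combines per-target distances for one object. It is an assumption-laden toy, not the actual implementation: `combine` and `relative_scores` are hypothetical names, and "relative scores" is interpreted here as min-max normalization of each target's distances across the result set before summing.

```python
def combine(distances, mode, weights=None):
    """Combine per-target distances for one object into a single score.

    distances: {target_name: distance}, where lower is better.
    """
    values = list(distances.values())
    if mode == "sum":
        return sum(values)
    if mode == "minimum":
        return min(values)
    if mode == "average":
        return sum(values) / len(values)
    if mode == "manual_weights":
        return sum(weights[name] * d for name, d in distances.items())
    raise ValueError(f"unknown mode: {mode}")

def relative_scores(per_object):
    """Min-max normalize each target across all objects, then sum."""
    targets = list(next(iter(per_object.values())).keys())
    combined = {object_id: 0.0 for object_id in per_object}
    for target in targets:
        column = [d[target] for d in per_object.values()]
        low, span = min(column), max(column) - min(column)
        for object_id, d in per_object.items():
            # If all distances are equal, this target contributes 0.
            combined[object_id] += (d[target] - low) / span if span else 0.0
    return combined

# Distances on very different scales, e.g. from two embedding models.
per_object = {
    "a": {"text": 0.25, "image": 10.0},
    "b": {"text": 0.5, "image": 30.0},
    "c": {"text": 0.75, "image": 20.0},
}
normalized = relative_scores(per_object)
```

Without normalization, the "image" distances dominate a plain sum; after min-max scaling both targets contribute on the same 0 to 1 range.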

@dirkkul
Contributor Author

dirkkul commented Jun 27, 2024

> Related to this, but a different usage scenario, is a query that extends across collections that involves more than one vector.

This won't be added in the near future - this is more complex to add and would need more work.

@dirkkul
Contributor Author

dirkkul commented Jun 27, 2024

> For the parallel N vector query case, is there the concept of optimizing the ordering, such that the vector that has the fewest nearby results can be a gating factor on the others? In document search, if you were querying for "happy" AND "aardvark" you would search for "aardvark" first, which presumably would be less frequent and help filter the "happy" results. The situation with vectors is not exactly the same, but I thought a similar process might help.

All searches run concurrently, so there is no explicit order. In my testing, multi-target vector search is not much slower than single-target vector search (<10%).
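
The concurrent shape described here can be sketched as below. This only illustrates the structure (one independent search per target, results collected afterwards); it says nothing about the real implementation, and `search_target` and `multi_target_search` are hypothetical names. Python threads will not speed up this CPU-bound toy; the point is that no target waits on another.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def search_target(target, query_vector, objects, k):
    """Brute-force top-k search over one named vector."""
    ranked = sorted(
        objects,
        key=lambda object_id: math.dist(query_vector, objects[object_id][target]),
    )
    return target, ranked[:k]

def multi_target_search(query, objects, k):
    """Run one search per target concurrently and collect the results."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(search_target, target, vector, objects, k)
            for target, vector in query.items()
        ]
        # Each future yields (target, ids); there is no ordering between targets.
        return dict(future.result() for future in futures)

# Toy data with two named vectors per object (made-up values).
objects = {
    "a": {"text": [0.0], "image": [1.0]},
    "b": {"text": [1.0], "image": [0.0]},
}
query = {"text": [0.0], "image": [0.0]}
per_target = multi_target_search(query, objects, k=1)
```

The per-target result lists would then be merged with one of the combination options listed earlier in the thread.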

@hadfield

> > Related to this, but a different usage scenario, is a query that extends across collections that involves more than one vector.
>
> This won't be added in the near future - this is more complex to add and would need more work.

Ok, I'll open a new issue specific to this to track it.

@hadfield

> > For the parallel N vector query case, is there the concept of optimizing the ordering, such that the vector that has the fewest nearby results can be a gating factor on the others? In document search, if you were querying for "happy" AND "aardvark" you would search for "aardvark" first, which presumably would be less frequent and help filter the "happy" results. The situation with vectors is not exactly the same, but I thought a similar process might help.
>
> All searches run concurrently, so there is no explicit order. In my testing, multi-target vector search is not much slower than single-target vector search (<10%).

I would suggest that your tests include wildly imbalanced result counts for the individual vectors of the query, to explore the performance of such cases. For example, a ratio of 10,000+ to 1 for an Object O with vectors A and B that is near very many objects in A and very few objects in B.
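
One way such a test set could be generated is sketched below, as a starting point only. The generator `make_imbalanced`, the helper `count_near`, the dimensions, and the radii are all hypothetical choices; the data is synthetic and says nothing about real-world performance.

```python
import math
import random

def make_imbalanced(n_objects, n_near_b, dim=8, seed=0):
    """Objects whose A vectors all cluster together, while only
    n_near_b of the B vectors sit near the B query point."""
    rng = random.Random(seed)

    def jitter(center):
        return [center + rng.uniform(-0.01, 0.01) for _ in range(dim)]

    objects = []
    for i in range(n_objects):
        objects.append({
            "A": jitter(0.0),                            # always near the query
            "B": jitter(0.0 if i < n_near_b else 10.0),  # mostly far away
        })
    return objects

def count_near(objects, target, query_vector, radius):
    """Count objects whose named vector lies within radius of the query."""
    return sum(
        1 for obj in objects
        if math.dist(query_vector, obj[target]) <= radius
    )

objects = make_imbalanced(n_objects=10_000, n_near_b=1)
query_vector = [0.0] * 8
near_a = count_near(objects, "A", query_vector, radius=0.1)
near_b = count_near(objects, "B", query_vector, radius=0.1)
```

Here `near_a` is 10,000 and `near_b` is 1, giving the 10,000 to 1 ratio; a benchmark would then time a multi-target query against a single-target query on this data.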

2 participants