[Bug]: Milvus takes forever to load partitions eventually throwing timeout errors #34073
Did you load the partition before the insert?
Yes, each partition is a chat on our CVP. Follow-up questions on a chat eventually lead to insert and manual flush operations.
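To make the per-chat write path above concrete, here is a minimal sketch. It assumes a pymilvus-style `Collection` object exposing `load`, `insert`, and `flush`; the function and argument names are illustrative, not taken from the issue.

```python
# Sketch of the reported write path: load one chat's partition, insert the
# new vectors, then flush manually. Assumes a pymilvus-style Collection API;
# `append_chat_vectors` and `chat_partition` are hypothetical names.
def append_chat_vectors(collection, chat_partition, rows):
    # Load only the partition backing this chat, not the whole collection.
    collection.load(partition_names=[chat_partition])
    result = collection.insert(rows, partition_name=chat_partition)
    # A manual flush seals the growing segment; doing this on every chat
    # turn, across many partitions, produces many small segments -- which
    # is the concern raised later in this thread.
    collection.flush()
    return result
```

With real pymilvus, `collection` would be a `Collection` instance; the sketch works against any object exposing those three methods, which also makes it easy to unit-test with a fake.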
/assign @ganderaj |
I think for some reason you have too many segments in your cluster. This could happen because of:
The immediate reason your cluster crashed is this failure:

```
[2024/06/21 15:50:38.313 +00:00] [WARN] [meta/coordinator_broker.go:148] ["get recovery info failed"] [collectionID=449660350148322962] [partitionIDis="[]"] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace:
/go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).updateNextTarget\n |
\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:283\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).check\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:201\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*taskDispatcher[...]).schedule.func1.1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/task_dispatcher.go:101\n | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211
github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *status.Error"]
```
There are many timeout failures when querycoordv2 calls datacoord. Suggestions:
This cluster is a bit messed up, but overall it looks like datacoord has a bottleneck with too many collections and segments. If possible, I would recommend rebuilding a new cluster and rethinking your schema.
Hello @xiaofan-luan, thank you for your review. Per our current design, we have around 3600 vectors spanned across 630 collections and 1715 partitions. We acknowledge that it is not ideal to work with too many partitions in Milvus, and we are working on a new design based on a partition-key multi-tenancy strategy. However, in the meantime we wanted a better understanding of the challenges in Milvus, since the behavior is only intermittent. Can you please help me with the below:
Restore command: Restore log which includes error messages: milvus_backup-restore_error.txt
So you can try to upgrade and see if it works. If not, I will need a pprof of datacoord to check which execution path eats your CPU.
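For reference, Milvus components expose Go's standard pprof endpoints on the metrics port. The helper below only builds the URLs you would fetch (e.g. with `curl`) from the datacoord pod; it assumes the default metrics/pprof port 9091, which may differ in your deployment.

```python
# Build the pprof URLs to capture from datacoord. The port 9091 default and
# the 30 s CPU sample are assumptions; adjust to your deployment. The actual
# fetch (curl / kubectl port-forward) is left to the operator.
def datacoord_pprof_urls(host="localhost", port=9091, cpu_seconds=30):
    base = f"http://{host}:{port}/debug/pprof"
    return {
        "cpu": f"{base}/profile?seconds={cpu_seconds}",  # CPU profile sample
        "heap": f"{base}/heap",                          # heap allocations
        "goroutine": f"{base}/goroutine?debug=2",        # full goroutine dump
    }
```

For example, after `kubectl port-forward` to the datacoord pod, `curl -o datacoord-cpu.pprof` against the `"cpu"` URL captures the profile that can be opened with `go tool pprof`.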
I have an interesting observation to share. We have different Milvus implementations at our organization, and I chose a database (10 collections, 10 partitions, and 73K vectors) from our Dev environment. I tried to restore this database on the newly created Milvus (v2.4.1) instance, which ended with a similar failure message: test_backup_error.txt. To validate the integrity of the backup, I attempted to restore it onto the same Milvus instance where the backup was taken, which completed successfully. Considering the restore is failing consistently with 3 different databases from 2 Milvus instances, I assume something is wrong with my newly created Milvus instance. Can you please review the error message and suggest a potential cause?
/assign @wayblink |
This is not a backup issue.
Is there an existing issue for this?
Environment
Current Behavior
Our self-hosted Milvus cluster is deployed on AWS using EKS. All components of the cluster are healthy and no resource crunch is observed. However, the partition load API takes forever to complete and at times times out in our application.
Expected Behavior
Ideally, a partition load should not exceed 10s.
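One client-side guard that matches this 10s expectation is to call the load with a per-attempt timeout and retry with backoff. The helper below is library-agnostic: `load_fn` is any zero-argument callable, e.g. a lambda wrapping pymilvus's `Collection.load(partition_names=[...], timeout=10)` (the `timeout` keyword is part of pymilvus, but the wiring here is an illustrative assumption, not from the issue).

```python
import time

# Retry a load call with exponential backoff. If every attempt fails (e.g.
# pymilvus raises MilvusException on a timeout), re-raise the last error so
# the application still sees the failure instead of hanging indefinitely.
def load_with_retries(load_fn, attempts=3, base_delay=1.0):
    last_err = None
    for attempt in range(attempts):
        try:
            return load_fn()
        except Exception as err:
            last_err = err
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise last_err
```

This turns an unbounded hang into a bounded, observable failure, which also makes slow-load incidents easier to alert on.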
Steps To Reproduce
No response
Milvus Log
Attached are the Milvus logs captured from all the Milvus components: milvus-logs.tar.gz
Error messages as received by our application when a Partition load is too slow:
Anything else?
Our application leveraging Milvus is a CVP stack. At the moment, we map each partition to a "chat" for a user, who gets a dedicated collection. Hence the need for frequent partition loads as user activity grows.
A multi-tenancy approach using a partition key is on our roadmap. However, we would like to understand what causes partition loads to be substantially slow and remediate this performance issue as soon as possible.
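To illustrate the partition-key idea mentioned above: rather than one physical partition per chat, each tenant/chat id is hashed into a fixed number of buckets, so the physical partition (and segment) count stays bounded as users grow. Milvus does this internally for a field declared with `is_partition_key=True`; the hash below is only a stand-in to show the bounded fan-out, not Milvus's actual hash function.

```python
import hashlib

# Stand-in for partition-key routing: deterministically map a chat id to one
# of a fixed number of buckets. With this scheme, any number of chats fans
# out to at most `num_partitions` physical partitions. The default of 64 is
# an arbitrary example, not a Milvus default.
def partition_bucket(chat_id: str, num_partitions: int = 64) -> int:
    digest = hashlib.sha256(chat_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Under such a design, the 1715 per-chat partitions described in this issue would collapse into a small fixed set of physical partitions, with the chat id stored as a scalar field and filtered at query time.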