
[Bug]: Milvus takes forever to load partitions eventually throwing timeout errors #34073

Open
ganderaj opened this issue Jun 21, 2024 · 12 comments
Labels
kind/bug Issues or changes related a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@ganderaj

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.4.1
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): v2.3.7+
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

A self-hosted Milvus cluster is deployed on AWS using EKS. All components of the cluster are healthy and no resource crunch is observed. However, the partition load API takes a very long time to complete and at times times out in our application.

Expected Behavior

Ideally, a partition load should not take more than 10 seconds.

Steps To Reproduce

No response

Milvus Log

Attached are the milvus logs captured from all the milvus components -- milvus-logs.tar.gz

Error messages received by our application when a partition load is too slow:

ERROR:pymilvus.decorators:RPC error: [get_loading_progress], <MilvusException: (code=1, message=partition not loaded[partition=449705667935668507])>, <Time:{'RPC start': '2024-06-21 15:23:11.733993', 'RPC error': '2024-06-21 15:23:11.737653'}>

ERROR:pymilvus.decorators:RPC error: [wait_for_loading_partitions], <MilvusException: (code=1, message=partition not loaded[partition=449705667935668507])>, <Time:{'RPC start': '2024-06-21 15:13:11.256893', 'RPC error': '2024-06-21 15:23:11.737829'}>

WARNING:pymilvus.decorators:Retry timeout: 30s

ERROR:pymilvus.decorators:RPC error: [load_partitions], <MilvusException: (code=1, message=Retry timeout: 30s, message=partition not loaded[partition=449705667935668507])>, <Time:{'RPC start': '2024-06-21 15:13:08.111562', 'RPC error': '2024-06-21 15:23:11.737951'}>

ERROR:app.milvus_helper:Error in insert_data Milvus while loading partition: <MilvusException: (code=1, message=Retry timeout: 30s, message=partition not loaded[partition=449705667935668507])>

ERROR:app.milvus_helper:Traceback (most recent call last):
  File "/python-docker/database/milvus_helper.py", line 204, in insert_data
    partition.load(timeout=30)  # _async=True
  File "/usr/local/lib/python3.10/site-packages/pymilvus/orm/partition.py", line 206, in load
    return conn.load_partitions(
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 127, in handler
    raise e from e
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 123, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 162, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 95, in handler
    raise MilvusException(
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 30s, message=partition not loaded[partition=449705667935668507])>

DEBUG:app.milvus_helper:skipping partition loading in insert_data to avoid user impact
INFO:app:/chat_upload - chat metadata table values updated : True
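
For reference, a minimal sketch of the load-and-wait pattern the traceback above points to, using the pymilvus partition load and loading-progress APIs; the connection endpoint, collection name, and partition name below are hypothetical:

```python
from pymilvus import connections, Collection, Partition, utility

# Hypothetical endpoint, collection, and partition names.
connections.connect(host="localhost", port="19530")
collection = Collection("chat_docs")
partition = Partition(collection, "chat_12345")

# Load only this partition; a longer timeout surfaces slow loads as latency
# rather than the hard 30 s retry failures shown in the logs above.
partition.load(timeout=120)

# Alternatively, poll the load state instead of blocking on load():
progress = utility.loading_progress("chat_docs", partition_names=["chat_12345"])
print(progress)  # e.g. {'loading_progress': '100%'}
```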

Anything else?

Our application leveraging Milvus is a CVP stack. At the moment, we map each partition to a "chat" for a user, and each user gets a dedicated collection. Hence the need for frequent partition loads as user activity increases.

A multi-tenancy approach using a partition key is on our roadmap. In the meantime, however, we would like to understand what causes partition loads to be substantially slow and address this performance issue as soon as possible.

@ganderaj ganderaj added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 21, 2024
@xiaofan-luan
Contributor

Did you load the partition before the insert?

@ganderaj
Author

Did you load the partition before the insert?

Yes, a partition corresponds to a chat on our CVP. Follow-up questions on a chat will eventually lead to insert and manual flush operations.

@yanliang567
Contributor

/assign @ganderaj
please try as per the comments above

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 22, 2024
@xiaofan-luan
Contributor

I think for some reason you have too many segments in your cluster. This could happen because:

  1. you have too many small collections -> a Milvus cluster should not exceed 10,000 collections
  2. you call flush every time you do a small insertion -> from the logs, each segment holds only tens of entities, for example (see the batching sketch after this list):
    [2024/06/21 15:50:01.531 +00:00] [INFO] [datacoord/index_service.go:605] ["completeIndexInfo success"] [collectionID=449705667927213932] [indexID=449705667927213961] [totalRows=3] [indexRows=3] [pendingIndexRows=0] [state=Finished] [failReason=]
    [2024/06/21 15:50:01.531 +00:00] [INFO] [datacoord/index_service.go:732] ["DescribeIndex success"] [traceID=1c70deb1a01aab13fb497332856d7b39] [collectionID=449705667927213932] [indexName=]
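
To illustrate point 2, here is a rough sketch (hypothetical names, assuming the row-based insert API of pymilvus 2.3+) of buffering rows and inserting them in batches, without calling flush() after every small insert, so Milvus can seal and flush segments on its own:

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # hypothetical endpoint
collection = Collection("chat_docs")                 # hypothetical collection

_buffer = []

def add_row(row, batch_size=512):
    """Accumulate rows and insert them in batches.

    No manual collection.flush() per insert: Milvus seals and flushes
    segments automatically, which avoids creating thousands of tiny segments.
    """
    _buffer.append(row)
    if len(_buffer) >= batch_size:
        collection.insert(_buffer)
        _buffer.clear()
```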

The current reason your cluster is crashing is:

[2024/06/21 15:50:38.313 +00:00] [WARN] [meta/coordinator_broker.go:148] ["get recovery info failed"] [collectionID=449660350148322962] [partitionIDis="[]"] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249\n | 
github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).updateNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:283\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).check\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:201\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*taskDispatcher[...]).schedule.func1.1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/task_dispatcher.go:101\n | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *status.Error"]

@xiaofan-luan
Contributor

There are many timeout failures when querycoordv2 calls datacoord.

Suggestions:

  1. stop calling flush manually, if you are doing so
  2. stop creating more collections; if you are doing multi-tenancy, think of using partition keys (see the sketch after this list)
  3. if the cluster still cannot recover after that, please take a pprof of datacoord (port 9091) to see what its current performance bottleneck is
  4. change queryCoord.brokerTimeout to a longer value (5s by default); this might be a bad idea, but your cluster might be able to recover
  5. check how many segments you have with birdwatcher: https://github.com/milvus-io/birdwatcher
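
For suggestion 2, a rough sketch of what a partition-key based schema could look like with pymilvus (field names, dimension, and partition count are assumptions, not taken from this cluster); one collection serves all chats and Milvus routes rows by hashing the key field, so no per-chat partition creation or loading is needed:

```python
from pymilvus import (
    connections, FieldSchema, CollectionSchema, DataType, Collection,
)

connections.connect(host="localhost", port="19530")  # hypothetical endpoint

fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # Partition key: rows are hashed into a fixed set of internal partitions.
    FieldSchema(name="chat_id", dtype=DataType.VARCHAR, max_length=64,
                is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="all chats in one collection")
# num_partitions is optional; support for the kwarg depends on the pymilvus version.
collection = Collection("chats", schema, num_partitions=64)

# Queries filter on the key instead of naming a partition, e.g.:
# collection.search(vectors, "embedding", search_params, limit=5,
#                   expr='chat_id == "some-chat"')
```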

This cluster is a bit messed up, but overall it looks like datacoord has a bottleneck due to too many collections and segments. If possible, I would recommend rebuilding a new cluster and rethinking your schema.

@ganderaj
Author

Hello @xiaofan-luan

Thank you for your review. Per our current design, we have around 3,600 vectors spread across 630 collections and 1,715 partitions. We acknowledge that working with so many partitions on Milvus is not ideal, and we are working on a new design based on a partition-key multi-tenancy strategy.

However, in the meantime we wanted to better understand the challenges in Milvus, since the behavior is only intermittent. Could you please help with the following:

  • Is queryCoord.brokerTimeout an allowed parameter in extraConfigFiles, which goes into the Helm values file?
  • I have built a new test cluster and attempted to restore this database onto it, which failed with "context deadline exceeded" error messages. Is this expected behavior?

@ganderaj
Author

Restore command:
nohup /PATH/TO/milvus-backup restore --databases docqa_prod --name backup_None_2406221600 --restore_index true --config sbx_restore_prod03.yaml &

Restore log, which includes the error messages: milvus_backup-restore_error.txt

@xiaofan-luan
Contributor

Replying to the two questions above:
  1. Yes, you can change brokerTimeout with extraConfigFiles (see the example below).
  2. Some calls in the cluster are really slow (GetRecoveryInfoV2). This is an interface that iterates over metadata. I don't have information about why it is slow (a pprof is needed), but it is likely related to the partition/collection count. We've optimized some key paths in later 2.4 versions.
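
For question 1, a hedged example of how this might look in the Milvus Helm chart's extraConfigFiles section (the value is an assumption; check your chart version's milvus.yaml for the exact key and units):

```yaml
extraConfigFiles:
  user.yaml: |+
    queryCoord:
      # Assumed to be in milliseconds (default 5000, i.e. the 5 s mentioned above).
      brokerTimeout: 30000
```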

@xiaofan-luan
Contributor

So you can try to upgrade and see if it works. If not, I will need a pprof of datacoord to check which execution path is eating your CPU.
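
One possible way to capture that profile, assuming the datacoord pod exposes the standard Go pprof endpoints on the metrics port 9091 mentioned above (pod name and namespace are placeholders):

```bash
# Forward the datacoord metrics port, then grab a 30 s CPU profile.
kubectl -n <milvus-namespace> port-forward pod/<datacoord-pod> 9091:9091 &
go tool pprof -http=:8080 "http://localhost:9091/debug/pprof/profile?seconds=30"
```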

@ganderaj
Author

@xiaofan-luan

I have an interesting observation to share. We have different Milvus implementations in our organization, and I chose a database (10 collections, 10 partitions, and 73K vectors) from our Dev environment.

I tried to restore this database onto the newly created Milvus (v2.4.1) instance, which ended up with a similar failure message: test_backup_error.txt. To validate the integrity of the backup, I attempted to restore it onto the same Milvus instance where the backup was taken, which completed successfully.

Considering that restores are failing consistently with 3 different databases from 2 Milvus instances, I assume something is wrong with my newly created Milvus instance. Could you please review the error message and suggest a potential cause?

@yanliang567
Contributor

/assign @wayblink
/unassign

@sre-ci-robot sre-ci-robot assigned wayblink and unassigned yanliang567 Jun 24, 2024
@xiaofan-luan
Contributor

/assign @wayblink /unassign

This is not a backup issue.
The problem is that requests to datacoord time out.
