
[Bug]: Milvus takes forever to load partitions eventually throwing timeout errors #34073

Open
ganderaj opened this issue Jun 21, 2024 · 12 comments
Labels
kind/bug Issues or changes related a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@ganderaj

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.4.1
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): v2.3.7+
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

A self-hosted Milvus cluster is deployed on AWS using EKS. All components of the cluster are healthy and no resource crunch is observed. However, the partition load API takes a very long time to complete and at times times out in our application.

Expected Behavior

Ideally, a partition load should not take more than 10 seconds.

Steps To Reproduce

No response

Milvus Log

Attached are the milvus logs captured from all the milvus components -- milvus-logs.tar.gz

Error messages received by our application when a partition load is too slow:

ERROR:pymilvus.decorators:RPC error: [get_loading_progress], <MilvusException: (code=1, message=partition not loaded[partition=449705667935668507])>, <Time:{'RPC start': '2024-06-21 15:23:11.733993', 'RPC error': '2024-06-21 15:23:11.737653'}>

ERROR:pymilvus.decorators:RPC error: [wait_for_loading_partitions], <MilvusException: (code=1, message=partition not loaded[partition=449705667935668507])>, <Time:{'RPC start': '2024-06-21 15:13:11.256893', 'RPC error': '2024-06-21 15:23:11.737829'}>

WARNING:pymilvus.decorators:Retry timeout: 30s

ERROR:pymilvus.decorators:RPC error: [load_partitions], <MilvusException: (code=1, message=Retry timeout: 30s, message=partition not loaded[partition=449705667935668507])>, <Time:{'RPC start': '2024-06-21 15:13:08.111562', 'RPC error': '2024-06-21 15:23:11.737951'}>

ERROR:app.milvus_helper:Error in insert_data Milvus while loading partition: <MilvusException: (code=1, message=Retry timeout: 30s, message=partition not loaded[partition=449705667935668507])>

ERROR:app.milvus_helper:Traceback (most recent call last):
  File "/python-docker/database/milvus_helper.py", line 204, in insert_data
    partition.load(timeout=30)  # _async=True
  File "/usr/local/lib/python3.10/site-packages/pymilvus/orm/partition.py", line 206, in load
    return conn.load_partitions(
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 127, in handler
    raise e from e
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 123, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 162, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 95, in handler
    raise MilvusException(
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 30s, message=partition not loaded[partition=449705667935668507])>

DEBUG:app.milvus_helper:skipping partition loading in insert_data to avoid user impact
INFO:app:/chat_upload - chat metadata table values updated : True
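
For reference, a minimal sketch of the load-and-wait pattern the traceback above points to, using the pymilvus partition load and loading-progress APIs; the connection endpoint, collection name, and partition name below are hypothetical:

```python
from pymilvus import connections, Collection, Partition, utility

# Hypothetical endpoint, collection, and partition names.
connections.connect(host="localhost", port="19530")
collection = Collection("chat_docs")
partition = Partition(collection, "chat_12345")

# Load only this partition; a longer timeout surfaces slow loads as latency
# rather than the hard 30 s retry failures shown in the logs above.
partition.load(timeout=120)

# Alternatively, poll the load state instead of blocking on load():
progress = utility.loading_progress("chat_docs", partition_names=["chat_12345"])
print(progress)  # e.g. {'loading_progress': '100%'}
```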

Anything else?

Our application leveraging Milvus is a CVP stack. At the moment, we map each partition to a "chat" for a user, and each user gets a dedicated collection. Hence the need for frequent partition loads as user activity increases.

A multi-tenancy approach using a partition key is on our roadmap. In the meantime, however, we would like to understand what causes partition loads to be substantially slow and address this performance issue as soon as possible.

@ganderaj ganderaj added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 21, 2024
@xiaofan-luan
Contributor

Did you load the partition before the insert?

@ganderaj
Author

Did you load the partition before the insert?

Yes, a partition corresponds to a chat on our CVP. Follow-up questions on a chat will eventually lead to insert and manual flush operations.

@yanliang567
Contributor

/assign @ganderaj
please try as per the comments above

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 22, 2024
@xiaofan-luan
Contributor

I think for some reason you have too many segments in your cluster. This could happen because:

  1. you have too many small collections -> a Milvus cluster should not exceed 10,000 collections
  2. you call flush every time you do a small insertion -> from the logs, each segment holds only tens of entities, for example (see the batching sketch after this list):
    [2024/06/21 15:50:01.531 +00:00] [INFO] [datacoord/index_service.go:605] ["completeIndexInfo success"] [collectionID=449705667927213932] [indexID=449705667927213961] [totalRows=3] [indexRows=3] [pendingIndexRows=0] [state=Finished] [failReason=]
    [2024/06/21 15:50:01.531 +00:00] [INFO] [datacoord/index_service.go:732] ["DescribeIndex success"] [traceID=1c70deb1a01aab13fb497332856d7b39] [collectionID=449705667927213932] [indexName=]
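
To illustrate point 2, here is a rough sketch (hypothetical names, assuming the row-based insert API of pymilvus 2.3+) of buffering rows and inserting them in batches, without calling flush() after every small insert, so Milvus can seal and flush segments on its own:

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # hypothetical endpoint
collection = Collection("chat_docs")                 # hypothetical collection

_buffer = []

def add_row(row, batch_size=512):
    """Accumulate rows and insert them in batches.

    No manual collection.flush() per insert: Milvus seals and flushes
    segments automatically, which avoids creating thousands of tiny segments.
    """
    _buffer.append(row)
    if len(_buffer) >= batch_size:
        collection.insert(_buffer)
        _buffer.clear()
```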

The current reason your cluster is crashing is:

[2024/06/21 15:50:38.313 +00:00] [WARN] [meta/coordinator_broker.go:148] ["get recovery info failed"] [collectionID=449660350148322962] [partitionIDis="[]"] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249\n | 
github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).updateNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:283\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).check\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:201\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*taskDispatcher[...]).schedule.func1.1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/task_dispatcher.go:101\n | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *status.Error"]

@xiaofan-luan
Contributor

There are many timeout failures when querycoordv2 calls datacoord.

Suggestions:

  1. stop calling flush manually, if you are doing so
  2. stop creating more collections; if you are doing multi-tenancy, think of using partition keys (see the sketch after this list)
  3. if the cluster still cannot recover after that, please take a pprof of datacoord (port 9091) to see what its current performance bottleneck is
  4. change queryCoord.brokerTimeout to a longer value (5s by default); this might be a bad idea, but your cluster might be able to recover
  5. check how many segments you have with birdwatcher: https://github.com/milvus-io/birdwatcher
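
For suggestion 2, a rough sketch of what a partition-key based schema could look like with pymilvus (field names, dimension, and partition count are assumptions, not taken from this cluster); one collection serves all chats and Milvus routes rows by hashing the key field, so no per-chat partition creation or loading is needed:

```python
from pymilvus import (
    connections, FieldSchema, CollectionSchema, DataType, Collection,
)

connections.connect(host="localhost", port="19530")  # hypothetical endpoint

fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # Partition key: rows are hashed into a fixed set of internal partitions.
    FieldSchema(name="chat_id", dtype=DataType.VARCHAR, max_length=64,
                is_partition_key=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields, description="all chats in one collection")
# num_partitions is optional; support for the kwarg depends on the pymilvus version.
collection = Collection("chats", schema, num_partitions=64)

# Queries filter on the key instead of naming a partition, e.g.:
# collection.search(vectors, "embedding", search_params, limit=5,
#                   expr='chat_id == "some-chat"')
```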

This cluster is a bit messed up, but overall it looks like datacoord has a bottleneck due to too many collections and segments. If possible, I would recommend rebuilding a new cluster and rethinking your schema.

@ganderaj
Author

Hello @xiaofan-luan

Thank you for your review. Per our current design, we have around 3,600 vectors spread across 630 collections and 1,715 partitions. We acknowledge that working with so many partitions on Milvus is not ideal, and we are working on a new design based on a partition-key multi-tenancy strategy.

However, in the meantime we wanted to better understand the challenges in Milvus, since the behavior is only intermittent. Could you please help with the following:

  • Is queryCoord.brokerTimeout an allowed parameter in extraConfigFiles, which goes into the Helm values file?
  • I have built a new test cluster and attempted to restore this database onto it, which failed with "context deadline exceeded" error messages. Is this expected behavior?

@ganderaj
Author

Restore command:
nohup /PATH/TO/milvus-backup restore --databases docqa_prod --name backup_None_2406221600 --restore_index true --config sbx_restore_prod03.yaml &

Restore log, which includes the error messages: milvus_backup-restore_error.txt

@xiaofan-luan
Contributor

Replying to the two questions above:
  1. Yes, you can change brokerTimeout with extraConfigFiles (see the example below).
  2. Some calls in the cluster are really slow (GetRecoveryInfoV2). This is an interface that iterates over metadata. I don't have information about why it is slow (a pprof is needed), but it is likely related to the partition/collection count. We've optimized some key paths in later 2.4 versions.
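
For question 1, a hedged example of how this might look in the Milvus Helm chart's extraConfigFiles section (the value is an assumption; check your chart version's milvus.yaml for the exact key and units):

```yaml
extraConfigFiles:
  user.yaml: |+
    queryCoord:
      # Assumed to be in milliseconds (default 5000, i.e. the 5 s mentioned above).
      brokerTimeout: 30000
```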

@xiaofan-luan
Contributor

So you can try to upgrade and see if it works. If not, I will need a pprof of datacoord to check which execution path is eating your CPU.
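
One possible way to capture that profile, assuming the datacoord pod exposes the standard Go pprof endpoints on the metrics port 9091 mentioned above (pod name and namespace are placeholders):

```bash
# Forward the datacoord metrics port, then grab a 30 s CPU profile.
kubectl -n <milvus-namespace> port-forward pod/<datacoord-pod> 9091:9091 &
go tool pprof -http=:8080 "http://localhost:9091/debug/pprof/profile?seconds=30"
```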

@ganderaj
Author

@xiaofan-luan

I have an interesting observation to share. We have different Milvus implementations in our organization, and I chose a database (10 collections, 10 partitions, and 73K vectors) from our Dev environment.

I tried to restore this database onto the newly created Milvus (v2.4.1) instance, which ended up with a similar failure message: test_backup_error.txt. To validate the integrity of the backup, I attempted to restore it onto the same Milvus instance where the backup was taken, which completed successfully.

Considering that restores are failing consistently with 3 different databases from 2 Milvus instances, I assume something is wrong with my newly created Milvus instance. Could you please review the error message and suggest a potential cause?

@yanliang567
Contributor

/assign @wayblink
/unassign

@sre-ci-robot sre-ci-robot assigned wayblink and unassigned yanliang567 Jun 24, 2024
@xiaofan-luan
Contributor

/assign @wayblink /unassign

This is not a backup issue.
The problem is that requests to datacoord time out.
