
[Bug]: QueryNode OOMKilled when concurrently upserting data into 1024 partitions #34058

Open
ThreadDao opened this issue Jun 21, 2024 · 2 comments
Labels: kind/bug · severity/critical · triage/accepted
Milestone: 2.4.6

@ThreadDao (Contributor)

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20240621-7d1d5a83-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):   pulsar  
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Deploy Milvus with the following config:

  components:
    dataNode:
      replicas: 1
      resources:
        limits:
          cpu: "8"
          memory: 16Gi
        requests:
          cpu: "4"
          memory: 8Gi
    indexNode:
      replicas: 3
      resources:
        limits:
          cpu: "8"
          memory: 8Gi
        requests:
          cpu: "4"
          memory: 2Gi
    mixCoord:
      replicas: 1
      resources:
        limits:
          cpu: "4"
          memory: 16Gi
        requests:
          cpu: "2" 
          memory: 8Gi 
    proxy:
      resources:
        limits:
          cpu: "1" 
          memory: 8Gi 
    queryNode:
      replicas: 2
      resources:
        limits:
          cpu: "16"
          memory: 72Gi
        requests:
          cpu: "4" 
          memory: 64Gi
  config:
    dataCoord:
      segment:
        sealProportion: 1.52e-05
    log:
      level: debug
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces
      sampleFraction: 1

Test steps:

  1. create a collection with 1 shard, enable partition-key with 1024 partitions
  2. create hnsw index {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 8, 'efConstruction': 200}}
  3. insert 10m-128d entities -> flush
  4. concurrent requests: search + upsert + flush, with the parameters below (a rough pymilvus sketch of these steps follows them)
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'metric_type': 'L2', 
                                                    'dim': 128,
                                                    'scalars_params': {'int64_1': {'params': {'is_partition_key': True}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': '10m',
                                                    'ni_per': 50000},
                                 'collection_params': {'other_fields': ['int64_1'],
                                                       'shards_num': 1,
                                                       'num_partitions': 1024},
                                 'load_params': {},
                                 'release_params': {'release_of_reload': False},
                                 'index_params': {'index_type': 'HNSW',
                                                  'index_param': {'M': 8,
                                                                  'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 30,
                                                       'during_time': '3h', 
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 10,
                                                       'params': {'nq': 100,
                                                                  'top_k': 100,
                                                                  'output_fields': ['int64_1'],
                                                                  'search_param': {'ef': 128}, 
                                                                  'timeout': 120}},
                                                      {'type': 'flush',
                                                       'weight': 1,
                                                       'params': {'timeout': 120}},
                                                      {'type': 'upsert',
                                                       'weight': 19,
                                                       'params': {'nb': 200,
                                                                  'timeout': 120,
                                                                  'start_id': 0,
                                                                  'random_id': True, 
                                                                  'random_vector': True}}]},
            'run_id': 2024062191801273,
            'datetime': '2024-06-21 03:06:20.115933',
            'client_version': '2.2'},
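
For reference, a rough pymilvus sketch of steps 1–4. The endpoint, collection name, and the scaled-down row counts are illustrative assumptions; the actual run uses the locust-based harness with the parameters above rather than plain threads.

import random
import threading
import time

import numpy as np
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

DIM = 128
connections.connect(host="localhost", port="19530")  # assumed endpoint

# 1. collection with 1 shard; int64_1 is the partition key spread over 1024 partitions
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("int64_1", DataType.INT64, is_partition_key=True),
    FieldSchema("vector", DataType.FLOAT_VECTOR, dim=DIM),
]
coll = Collection("partition_key_oom_repro",  # hypothetical collection name
                  CollectionSchema(fields), shards_num=1, num_partitions=1024)

# 2. HNSW index with the parameters from the report
coll.create_index("vector", {"index_type": "HNSW", "metric_type": "L2",
                             "params": {"M": 8, "efConstruction": 200}})

# 3. insert in 50k batches (the real test loads the 10m-row, 128d sift set; scaled down here)
for start in range(0, 200_000, 50_000):
    ids = list(range(start, start + 50_000))
    keys = [random.randint(0, 1023) for _ in ids]
    vectors = np.random.random((len(ids), DIM)).astype(np.float32)
    coll.insert([ids, keys, vectors])
coll.flush()
coll.load()

# 4. 30 concurrent workers mixing search/upsert/flush with the 10/19/1 weights above
def do_search():
    nq = np.random.random((100, DIM)).astype(np.float32)
    coll.search(nq, "vector", {"metric_type": "L2", "params": {"ef": 128}},
                limit=100, output_fields=["int64_1"], timeout=120)

def do_upsert():
    ids = [random.randint(0, 10_000_000) for _ in range(200)]  # random_id: True
    keys = [random.randint(0, 1023) for _ in ids]
    vectors = np.random.random((len(ids), DIM)).astype(np.float32)
    coll.upsert([ids, keys, vectors], timeout=120)

def do_flush():
    coll.flush(timeout=120)

def worker():
    while True:
        random.choices([do_search, do_upsert, do_flush], weights=[10, 19, 1])[0]()

for _ in range(30):
    threading.Thread(target=worker, daemon=True).start()
time.sleep(3 * 3600)  # the original workload ran for 3h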

QueryNode OOMKilled

The querynode was OOMKilled after about two minutes of concurrent requests, at around 2024-06-21 03:40:52.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

compact-opt-flush3-milvus-datanode-5d9d59d55d-s58tn               1/1     Running                           0               5h27m   10.104.16.107   4am-node21   <none>           <none>
compact-opt-flush3-milvus-indexnode-5ff8f4ff46-5mxlk              1/1     Running                           0               5h27m   10.104.30.6     4am-node38   <none>           <none>
compact-opt-flush3-milvus-indexnode-5ff8f4ff46-7fnh8              1/1     Running                           0               5h27m   10.104.34.192   4am-node37   <none>           <none>
compact-opt-flush3-milvus-indexnode-5ff8f4ff46-nf9bw              1/1     Running                           0               5h27m   10.104.16.108   4am-node21   <none>           <none>
compact-opt-flush3-milvus-mixcoord-8458f66976-qd5n5               1/1     Running                           0               5h27m   10.104.1.130    4am-node10   <none>           <none>
compact-opt-flush3-milvus-proxy-57cdc4f669-nxpwb                  1/1     Running                           0               5h27m   10.104.18.103   4am-node25   <none>           <none>
compact-opt-flush3-milvus-querynode-0-67fdd8499f-nn6pg            1/1     Running                           2 (3h24m ago)   5h27m   10.104.14.127   4am-node18   <none>           <none>
compact-opt-flush3-milvus-querynode-0-67fdd8499f-w8w4l            1/1     Running                           3 (133m ago)    5h27m   10.104.26.169   4am-node32   <none>           <none>

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 21, 2024
@ThreadDao ThreadDao added the severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. label Jun 21, 2024
@ThreadDao ThreadDao added this to the 2.4.5 milestone Jun 21, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 21, 2024
@yanliang567 yanliang567 removed their assignment Jun 21, 2024
@xiaofan-luan (Contributor)

With so many partitions, we might need to change the concurrency of compaction and add more datanodes.
Currently I think that if we add more datanodes and let compaction catch up, then it will work for us.
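
A hedged values-override sketch along those lines, in the same format as the deploy config above. The replica count is illustrative, and maxParallelTaskNum is only assumed to be the relevant dataCoord compaction-concurrency knob:

  components:
    dataNode:
      replicas: 3              # illustrative: scale out from 1 so compaction can keep up with 1024 partitions
  config:
    dataCoord:
      compaction:
        maxParallelTaskNum: 20 # assumed knob for compaction concurrency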

@XuanYang-cn (Contributor) commented Jun 24, 2024

Even though there are 50K segments, the question is why 2 × 64 GB querynodes cannot hold 7 GB of data in memory.

@yanliang567 yanliang567 modified the milestones: 2.4.5, 2.4.6 Jun 26, 2024