A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Hi, one of our environments is behaving strangely. After each deployment on that env, memory usage of Vector spikes to dangerous values. It's a preprod environment with the same configuration as the prod environments, but much lower traffic per Vector container. Some numbers:
Common for all envs:
- We're using pretty large containers with 13 vCPUs and 26 GB of memory.
- We have autoscaling set at 20% memory and 65% CPU utilization.
- We have a lot of transformations, which pushes Vector startup time to ~9 minutes.
- For buffer_full on all sinks, we're dropping newest.
Prod:
- CPU utilization cycles at around 50-90% (occasional spikes to 100%).
- Memory is usually ~3-4 GB (occasionally spikes to 6 GB).
- With those numbers Vector processes data reliably, without drops.
Preprod:
- A single preprod container processes around 15% of the traffic going through a prod one.
- CPU utilization is ~20-25%. During deploys it spikes to 40-50%.
- Memory is usually ~1 GB. During deploys it spikes to 11-13 GB (50-60%). It rarely reaches 13 GB, but when it does it can cause the container to crash.
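For a rough sense of scale, the figures above can be put against the container limit. This is only a back-of-the-envelope sketch: it assumes every percentage is measured against the 26 GB container memory, which the report does not state, and the names used are made up for illustration.

```python
# Figures quoted in the report. Assumption: the 26 GB container limit is the
# denominator for every percentage.
CONTAINER_MEMORY_GB = 26
AUTOSCALE_MEMORY_THRESHOLD = 0.20  # autoscaling trigger quoted in the report


def share_of_container(used_gb: float) -> float:
    """Return memory use as a fraction of the container limit."""
    return used_gb / CONTAINER_MEMORY_GB


# (low, high) memory use in GB for each situation described above.
observations = {
    "prod typical (3-4 GB)": (3, 4),
    "prod spike (6 GB)": (6, 6),
    "preprod typical (~1 GB)": (1, 1),
    "preprod deploy spike (11-13 GB)": (11, 13),
}

for label, (low, high) in observations.items():
    low_pct = share_of_container(low) * 100
    high_pct = share_of_container(high) * 100
    over = share_of_container(high) > AUTOSCALE_MEMORY_THRESHOLD
    print(f"{label}: {low_pct:.0f}-{high_pct:.0f}% of limit, above autoscale threshold: {over}")
```

Under that assumption, 11-13 GB works out to roughly 42-50% of the limit, so the 50-60% quoted above may be measured against a different baseline.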
Below is an image of memory utilization of one container after deploy
My guess about what could be happening is that Vector struggles with quick scaling of a single transformation. In our case Vector has a lot of transformations that are specific to a single service, and preprod takes traffic from a small set of services. When looking at utilization of transformations after deploy, we see spikes, but it looks relatively healthy and nothing else is different on that env, so I'm probably wrong.
Utilization of each transformation
Happy to take any suggestions on what the issue might be here.
Memory use is very workload and configuration dependent so this is likely to be difficult to track down without the configuration and at least a description of the workload (with as many specifics as possible).
If you want, you could try to engage in some memory profiling using valgrind or a similar tool.
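Before reaching for a full profiler, it may also help to record how resident memory evolves through the ~9 minute startup. The following is a minimal sketch and not part of Vector: it assumes a Linux container where `/proc` is readable and that the PID of the Vector process is known. `parse_rss_kb` and `sample_rss` are hypothetical helper names.

```python
import re
import time
from pathlib import Path
from typing import Optional

# Matches the resident set size line in /proc/<pid>/status,
# e.g. "VmRSS:    1048576 kB".
_VMRSS_RE = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_rss_kb(status_text: str) -> Optional[int]:
    """Extract VmRSS in kB from the text of /proc/<pid>/status, if present."""
    match = _VMRSS_RE.search(status_text)
    return int(match.group(1)) if match else None


def sample_rss(pid: int, interval_s: float = 5.0, samples: int = 12) -> list[int]:
    """Poll the RSS of a process a fixed number of times; return readings in kB."""
    readings = []
    for _ in range(samples):
        rss = parse_rss_kb(Path(f"/proc/{pid}/status").read_text())
        if rss is not None:
            readings.append(rss)
            print(f"{time.strftime('%H:%M:%S')} rss={rss / 1024 / 1024:.2f} GiB")
        time.sleep(interval_s)
    return readings

# Hypothetical usage, with the PID of the Vector process:
#   sample_rss(pid=1234, interval_s=5.0, samples=120)
```

This only shows when the spike happens relative to startup, not which component allocates the memory. For that, a heap profiler such as the one suggested above is still needed.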
That's fair. It'll probably be tricky to describe the pipeline in detail, since we have 100s of VRLs. But at a high level, it processes all of our telemetry before delivering it to Datadog.
You might recall our pipeline. We met on a video call last year. I work for a gaming company from the U.S. We wrote the Vector metric_metadata transformer that was added to Vector as part of log_to_metric (allowing datatype conversion from log to metric when the underlying data is compatible).
We'll look into memory profiling and try to get some more details. Thanks, Jesse.
Below is an image of memory utilization of one container after deploy
![vector_preprod](https://private-user-images.githubusercontent.com/27132600/338549548-5306be44-3894-4c60-97b8-21ab220fff5e.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk0MDM1NDIsIm5iZiI6MTcxOTQwMzI0MiwicGF0aCI6Ii8yNzEzMjYwMC8zMzg1NDk1NDgtNTMwNmJlNDQtMzg5NC00YzYwLTk3YjgtMjFhYjIyMGZmZjVlLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjI2VDEyMDA0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTU3MzY5NzkyZTE4YWRlZTMwZTkwZDE4NjAwMjVhODg2ZmZlNTQyZWNmMzVlNGFkMmU4NDU2NjNlMDgxZjU3ZDgmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.4bSMQJ5nQv-3628fhpes9rcDoDSnnVJPAGTy8uP95N4)
Utilization of each transformation
![vector_transformation_utilization](https://private-user-images.githubusercontent.com/27132600/338554241-b4c88071-9ef6-4ce8-8c29-e498779c82cd.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk0MDM1NDIsIm5iZiI6MTcxOTQwMzI0MiwicGF0aCI6Ii8yNzEzMjYwMC8zMzg1NTQyNDEtYjRjODgwNzEtOWVmNi00Y2U4LThjMjktZTQ5ODc3OWM4MmNkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjI2VDEyMDA0MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg0OGI3MjZiYTZlNjc1NTcyZDZlNjYzYmQ3MDhmNjM1N2I3MTMyMDIwNzJmMDE1ZWMxM2ViYjVmMmFjNDRhNDAmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.f21wTkH9LZfHN1JCWrryiB1AqGwu52INFcQxFmL9GOY)
Configuration
No response
Version
vector 0.38.0 (aarch64-unknown-linux-musl ea0ec6f 2024-05-07 14:34:39.794027186)