
Add internal metrics expiration by default #20562

Open
jcmcken opened this issue May 24, 2024 · 6 comments · May be fixed by #20710
Labels
  • domain: observability (anything related to monitoring/observing Vector)
  • domain: performance (anything related to Vector's performance)
  • type: feature (a value-adding code addition that introduces new functionality)

Comments

jcmcken commented May 24, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

If you look through the issue tracker, you'll notice many issues where users report a memory leak after enabling internal metrics. I have also encountered this issue with relatively recent versions of the Vector agent.

There's a global setting, expire_metrics_secs, that is intended to help with this problem. However, never expiring metrics seems like a strange default. Given the number of users who encounter this particular issue, and just applying the principle of least surprise, it seems as if metrics should, by default, expire automatically after some period of time. The alternative, requiring the user to discover (likely in production) that their Vector agents are OOMing due to metrics collection (of all things), seems like a poor experience for operators. In my specific case, when I enabled the metrics for testing in a sandbox environment with very low throughput, the "memory leak" was so slight that it was not noticeable until multiple days had passed. Only when you throw a real workload at it do you begin to see issues more readily.

Attempted Solutions

As I mentioned, expire_metrics_secs does technically solve this problem for me, but it seems to have a poor default value (never expire). I would like to save other Vector users the pain of discovering this issue in their live environments.
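For anyone else who lands here, this is roughly the shape of the workaround, as a minimal vector.toml sketch; the source/sink names, the exporter address, and the 300-second value are illustrative rather than recommendations.

```toml
# Global option: drop internal metric series that have not been updated for
# this many seconds. The value below is illustrative; tune it for your setup.
expire_metrics_secs = 300

# A typical internal-metrics pipeline (component names and address are illustrative).
[sources.internal]
type = "internal_metrics"

[sinks.prometheus]
type = "prometheus_exporter"
inputs = ["internal"]
address = "0.0.0.0:9598"
```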

Proposal

I propose that expire_metrics_secs either be given a default value, or that we introduce some other mechanism to ensure that metrics remain bounded in memory by default. I'm not sure of the technical implications of the latter. The former, I suppose, introduces a backwards-incompatible change.

References

No response

Version

0.37.1

jcmcken added the type: feature label on May 24, 2024
jszwedko added the domain: observability label on May 28, 2024
@jszwedko (Member)

@jcmcken do you happen to know which metrics were increasing without bounds? We attempted to remove all high-cardinality tags from internal metrics, which should alleviate the need to have them expire. I'm wondering if maybe we missed one.

jszwedko added the domain: performance label on May 28, 2024
jcmcken (Author) commented May 29, 2024

> @jcmcken do you happen to know which metrics were increasing without bounds? We attempted to remove all high-cardinality tags from internal metrics, which should alleviate the need to have them expire. I'm wondering if maybe we missed one.

Is there a suggested way to determine this? We have these metrics hooked up to Prometheus, but it's not clear whether the cardinality would translate and get stored there. When I run this query over the past hour:

topk(10, count by (__name__, job)({__name__=~"^vector_.*"}))

... the top offenders are:

  • component_received_event_bytes (~17k metrics)
  • component_received_events_total (~17k metrics)

The next highest is less than half of these two. These are high-cardinality, but the label values shouldn't really be changing.

For some more data: we installed Vector (without the expiration) in two larger, non-prod clusters with about 50 nodes each. Based on hourly averages, aggregate memory usage has grown by 0.5 GB over 24 hours in one cluster, and by about 2 GB in the other.

@jszwedko (Member)

Thanks @jcmcken, that is helpful. Are you able to see the labels on the component_received_event_bytes and component_received_events_total metrics? It sounds like maybe there is one that shouldn't be there.

jcmcken (Author) commented May 31, 2024

@jszwedko

  • component_id
  • component_kind
  • component_type
  • container
  • endpoint
  • host
  • instance
  • job
  • namespace
  • pod

And I think I see what the problem is now: it's the pod label. We're using the kubernetes_logs source, and it emits one metric series per pod per component for every pod in the cluster that emits logs (not just the Vector pods). We have thousands of pods, with a non-trivial number coming and going all the time, and that churn is what's causing the metrics to grow.

So I guess there's a tradeoff here. If we want to see which particular pod might be sending too many events, for example, then we do need the pod label (and namespace). But that implies the metrics will keep growing over time unless we expire them, i.e. zero out / delete metrics for pods which no longer exist. If we only wanted metrics on the Vector agents themselves, and not all the other pods, that would be another way to keep the metrics in check.

Maybe there just needs to be an additional warning in the kubernetes_logs documentation about this particular issue instead of a change of behavior. Not sure what the best approach would be.

@jszwedko (Member)

Ah, I see, interesting. I didn't realize the kubernetes_logs source was still adding those tags by default. I believe those tags should be opt-in since:

  • They do have value, as you note, and so we shouldn't just remove them
  • Unbounded cardinality can cause issues for downstream systems (in addition to Vector)

As an example, we made the file tag on internal metrics published by the file source opt-in: https://vector.dev/docs/reference/configuration/sources/file/#internal_metrics.include_file_tag. I think we should do something similar here.
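For reference, the file-source precedent looks roughly like this (a sketch: the source name and path are illustrative, and only the internal_metrics.include_file_tag option comes from the linked docs). A kubernetes_logs equivalent, e.g. an opt-in pod tag, would presumably follow the same shape.

```toml
# Sketch of the existing file-source precedent: the high-cardinality `file`
# tag is only added to internal metrics when explicitly opted in.
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]  # illustrative path

[sources.app_logs.internal_metrics]
include_file_tag = true  # opt in; leaving this unset keeps the tag off
```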

Separately, though, maybe it does make sense to have internal metrics expire by default since this does seem like more desirable behavior for users.

HenrikDK commented Jun 21, 2024

+1 to this.

We just ran into this in production. We're only using the internal_metrics source and a Prometheus remote_write sink, and Vector keeps endlessly accumulating metric series. Eventually our central Grafana Mimir instance started choking, and that's how I found this thread.

This is what Vector's internal metrics looked like over the last 30 days in our Vector monitoring dashboard:

[screenshot: Vector internal metric series count growing over the last 30 days]

I don't mean to berate anyone, but it seems pretty wild to me that vector ships with a default setup for metrics collection that behaves like this.

jszwedko added a commit that referenced this issue Jun 21, 2024

3 participants