
Add internal metrics expiration by default #20562

Open
jcmcken opened this issue May 24, 2024 · 6 comments · May be fixed by #20710
Labels
  • domain: observability (anything related to monitoring/observing Vector)
  • domain: performance (anything related to Vector's performance)
  • type: feature (a value-adding code addition that introduces new functionality)

Comments

jcmcken commented May 24, 2024

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Use Cases

If you look through the issue tracker, you'll notice many issues where users report a memory leak after enabling internal metrics. I have also encountered this issue with relatively recent versions of the Vector agent.

There's a global setting, expire_metrics_secs, that is intended to help with this problem. However, never expiring metrics seems like a strange default. Given the number of users who encounter this particular issue, and just applying the principle of least surprise, it seems as if metrics should, by default, expire automatically after some period of time. The alternative, requiring the user to discover (likely in production) that their Vector agents are OOMing due to metrics collection (of all things), seems like a poor experience for operators. In my specific case, when I enabled the metrics for testing in a sandbox environment with very low throughput, the "memory leak" was so slight that it was not noticeable until multiple days had passed. Only when you throw a real workload at it do you begin to see issues more readily.

Attempted Solutions

As I mentioned, expire_metrics_secs does technically solve this problem for me, but it seems to have a poor default value (never expire). I would like to save other Vector users the pain of discovering this issue in their live environments.
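For anyone else who lands here, this is roughly the shape of the workaround, as a minimal vector.toml sketch; the source/sink names, the exporter address, and the 300-second value are illustrative rather than recommendations.

```toml
# Global option: drop internal metric series that have not been updated for
# this many seconds. The value below is illustrative; tune it for your setup.
expire_metrics_secs = 300

# A typical internal-metrics pipeline (component names and address are illustrative).
[sources.internal]
type = "internal_metrics"

[sinks.prometheus]
type = "prometheus_exporter"
inputs = ["internal"]
address = "0.0.0.0:9598"
```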

Proposal

I propose that expire_metrics_secs either be given a default value, or that we introduce some other mechanism to ensure that metrics remain bounded in memory by default. I'm not sure of the technical implications of the latter. The former, I suppose, introduces a backwards-incompatible change.

References

No response

Version

0.37.1

jcmcken added the type: feature label on May 24, 2024
jszwedko added the domain: observability label on May 28, 2024
@jszwedko (Member)

@jcmcken do you happen to know which metrics were increasing without bounds? We attempted to remove all high-cardinality tags from internal metrics, which should alleviate the need to have them expire. I'm wondering if maybe we missed one.

jszwedko added the domain: performance label on May 28, 2024
jcmcken (Author) commented May 29, 2024

> @jcmcken do you happen to know which metrics were increasing without bounds? We attempted to remove all high-cardinality tags from internal metrics, which should alleviate the need to have them expire. I'm wondering if maybe we missed one.

Is there a suggested way to determine this? We have these metrics hooked up to Prometheus, but it's not clear whether the cardinality would translate and get stored there. When I run this query over the past hour:

topk(10, count by (__name__, job)({__name__=~"^vector_.*"}))

... the top offenders are:

  • component_received_event_bytes (~17k metrics)
  • component_received_events_total (~17k metrics)

The next highest is less than half of these two. These are high-cardinality, but the label values shouldn't really be changing.

For some more data: we installed Vector (without the expiration) in two larger, non-prod clusters with about 50 nodes each. Based on hourly averages, aggregate memory usage has grown by 0.5 GB over 24 hours in one cluster, and by about 2 GB in the other.

@jszwedko (Member)

Thanks @jcmcken, that is helpful. Are you able to see the labels on the component_received_event_bytes and component_received_events_total metrics? It sounds like maybe there is one that shouldn't be there.

jcmcken (Author) commented May 31, 2024

@jszwedko

  • component_id
  • component_kind
  • component_type
  • container
  • endpoint
  • host
  • instance
  • job
  • namespace
  • pod

And I think I see what the problem is now: it's the pod label. We're using the kubernetes_logs source, and it emits one metric series per pod per component for every pod in the cluster that emits logs (not just the Vector pods). We have thousands of pods, with a non-trivial number coming and going all the time, and that churn is what's causing the metrics to grow.

So I guess there's a tradeoff here. If we want to see which particular pod might be sending too many events, for example, then we do need the pod label (and namespace). But that implies the metrics will keep growing over time unless we expire them, i.e. zero out / delete metrics for pods which no longer exist. If we only wanted metrics on the Vector agents themselves, and not all the other pods, that would be another way to keep the metrics in check.

Maybe there just needs to be an additional warning in the kubernetes_logs documentation about this particular issue instead of a change of behavior. Not sure what the best approach would be.

@jszwedko (Member)

Ah, I see, interesting. I didn't realize the kubernetes_logs source was still adding those tags by default. I believe those tags should be opt-in since:

  • They do have value, as you note, and so we shouldn't just remove them
  • Unbounded cardinality can cause issues for downstream systems (in addition to Vector)

As an example, we made the file tag on internal metrics published by the file source opt-in: https://vector.dev/docs/reference/configuration/sources/file/#internal_metrics.include_file_tag. I think we should do something similar here.
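For reference, the file-source precedent looks roughly like this (a sketch: the source name and path are illustrative, and only the internal_metrics.include_file_tag option comes from the linked docs). A kubernetes_logs equivalent, e.g. an opt-in pod tag, would presumably follow the same shape.

```toml
# Sketch of the existing file-source precedent: the high-cardinality `file`
# tag is only added to internal metrics when explicitly opted in.
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]  # illustrative path

[sources.app_logs.internal_metrics]
include_file_tag = true  # opt in; leaving this unset keeps the tag off
```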

Separately, though, maybe it does make sense to have internal metrics expire by default since this does seem like more desirable behavior for users.

HenrikDK commented Jun 21, 2024

+1 to this.

We just ran into this in production. We're only using the internal_metrics source and a Prometheus remote_write sink, and Vector keeps endlessly accumulating metric series. Eventually our central Grafana Mimir instance started choking, and that's how I found this thread.

This is what Vector's internal metrics looked like over the last 30 days in our Vector monitoring dashboard:

[screenshot: Vector internal metric series count growing over the last 30 days]

I don't mean to berate anyone, but it seems pretty wild to me that vector ships with a default setup for metrics collection that behaves like this.

jszwedko added a commit that referenced this issue Jun 21, 2024

3 participants