Add internal metrics expiration by default #20562
Comments
@jcmcken do you happen to know which metrics were increasing without bounds? We attempted to remove all high-cardinality tags from internal metrics, which should alleviate the need to have them expire. I'm wondering if maybe we missed one.
Is there a suggested way to determine this? We have these metrics hooked up to Prometheus, but it's not clear whether the cardinality would translate and get stored there. When I run this query over the past hour:
... the top offenders are:
The next highest is less than half of these two. These are high-cardinality, but the label values shouldn't really be changing. For some more data: we installed Vector (without the expiration) in 2 larger, non-prod clusters with about 50 nodes each. Based on hourly averages, memory usage has grown 0.5 GB in 24 hours in aggregate for one cluster, and about 2 GB for the other.
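The exact query and results above weren't preserved. A PromQL query along these lines is one way to rank Vector's internal metric names by series count in Prometheus (a sketch, assuming the default `vector_` metric namespace; not the query from the comment):

```promql
# Top 10 Vector internal metric names by number of series currently stored
# (illustrative sketch; not the exact query referenced in the comment above).
topk(10, count by (__name__) ({__name__=~"vector_.*"}))
```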
Thanks @jcmcken, that is helpful. Are you able to see the labels that are on the
And I see what the problem is now, I think: it's the

So I guess there's a tradeoff here. If we want to see which particular pod might be sending too many events, for example, then we do need the

Maybe there just needs to be an additional warning in the
Ah, I see, interesting. I didn't realize the
As an example, we had made the

Separately, though, maybe it does make sense to have internal metrics expire by default since this does seem like more desirable behavior for users.
+1 to this. We just ran into this in production. We're only using internal_metrics and a remote_write, and Vector keeps endlessly accumulating metric series; eventually our central Grafana Mimir instance started choking, and I found this thread. This is what Vector's internal metrics looked like over the last 30 days, in our Vector monitoring dashboard:

I don't mean to berate anyone, but it seems pretty wild to me that Vector ships with a default setup for metrics collection that behaves like this.
Closes: #20562

Signed-off-by: Jesse Szwedko <[email protected]>
A note for the community
Use Cases
If you look through the issue tracker, you'll notice many issues where users report a memory leak when enabling internal metrics. I have also encountered this issue with relatively recent versions of the Vector agent.

There's a global setting, `expire_metrics_secs`, that is intended to help with this problem. However, never expiring metrics seems like a strange default. Given the number of users who encounter this particular issue, and just applying the principle of least surprise, it seems as if metrics should, by default, expire automatically after some period of time. The alternative, requiring the user to discover (likely in production) that their Vector agents are OOMing due to metrics collection (of all things), seems like a poor user experience for operators. In my specific case, when testing with metrics enabled in a sandbox environment with very low throughput, the "memory leak" was so slight that it was not noticeable until multiple days had passed. Only when you throw a real workload at it do you begin to see issues more readily.

Attempted Solutions
As I mentioned, `expire_metrics_secs` does solve this problem for me, technically. But it seems to have a poor default value. I would like to save other Vector users the pain of discovering this issue in their live environments.
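For reference, a minimal sketch of the global setting discussed above, as TOML config (the 300-second value and component names are illustrative, not a recommendation from this thread):

```toml
# Global option: internal metric series not updated within this window are
# dropped. The value below is illustrative only.
expire_metrics_secs = 300

# Minimal internal-metrics pipeline, assuming the prometheus_exporter sink.
[sources.internal]
type = "internal_metrics"

[sinks.prom]
type = "prometheus_exporter"
inputs = ["internal"]
address = "0.0.0.0:9598"
```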
Proposal

I propose that `expire_metrics_secs` either have a default, or that we introduce some other mechanism to ensure that metrics remain bounded in memory by default. I'm not sure of the technical implications of the latter. The former, I suppose, introduces a backwards-incompatible change.

References
No response
Version
0.37.1