
Set GOMAXPROCS and GOMEMLIMIT environment variables #6977

Open · wants to merge 1 commit into base: master
Conversation

@jnoordsij commented Apr 30, 2024

Pull Request Motivation

Set GOMAXPROCS and GOMEMLIMIT environment variables based on container resources.

Inspired by traefik/traefik-helm-chart#1029, this should reduce potential CPU throttling and OOMKills on containers.
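For illustration only (not part of the PR text): a minimal sketch of the intended effect, assuming hypothetical limits of cpu: 2 and memory: 512Mi in values.yaml. The downward API resolves the resourceFieldRef entries added by this change to plain integers, so the container would see GOMAXPROCS=2 and GOMEMLIMIT=536870912 (512Mi in bytes).

#values.yaml (hypothetical values, for illustration)
resources:
  limits:
    cpu: 2
    memory: 512Mi
# Resulting container environment, as resolved by the Kubernetes downward API:
#   GOMAXPROCS=2
#   GOMEMLIMIT=536870912   # 512Mi expressed in bytes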

Kind

/kind feature

Release Note

Autoset `GOMAXPROCS` and `GOMEMLIMIT` environment variables for cert-manager pods based on requested CPU and memory values

@cert-manager-prow cert-manager-prow bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. area/deploy Indicates a PR modifies deployment configuration labels Apr 30, 2024
@cert-manager-prow (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joshvanl for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cert-manager-prow cert-manager-prow bot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 30, 2024
@cert-manager-prow (Contributor)

Hi @jnoordsij. Thanks for your PR.

I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cert-manager-prow cert-manager-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 30, 2024
@inteon (Member) commented May 1, 2024

@wallrj FYI

@wallrj wallrj self-requested a review May 9, 2024 05:55
@SgtCoDFish (Member) left a comment


Hey, appreciate you getting involved with the project!

I don't think we can merge this as is, and I think to merge something like this we'd need to discuss further about how to do this safely and in a way that won't lead to confusing outcomes. I'd encourage you to attend one of our meetings if you want to discuss this!

Comment on lines +166 to +177
{{- if (.Values.resources.limits).cpu }}
- name: GOMAXPROCS
valueFrom:
resourceFieldRef:
resource: limits.cpu
{{- end }}
{{- if (.Values.resources.limits).memory }}
- name: GOMEMLIMIT
valueFrom:
resourceFieldRef:
resource: limits.memory
{{- end }}
(Member)

suggestion: A valid value for limits.cpu would be 100m. That value wouldn't be valid for GOMAXPROCS as far as I can tell, and I think it'd be ignored.

Similarly, a valid amount of memory in limits.memory would be 1e6 or 120M, and both of those would be invalid entries for GOMEMLIMIT (this says that supported suffixes are "B, KiB, MiB, GiB, and TiB").

I don't think we can generally apply the values of resource limits like this. This won't have the expected effect for a lot of totally valid resource limits.

It's maybe possible to construct some Helm function to convert resource limits to GOMAXPROCS / GOMEMLIMIT, but I think that'd be hard to do and a pain to debug. My instinct is that it's probably not worth the effort to add this - what do you think?
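To make the concern concrete, a hypothetical sketch of what a naive pass-through of the raw limit strings would look like (this is not what the template above actually does; the resourceFieldRef behaviour is discussed below):

# Hypothetical env entries if the raw Kubernetes quantities were injected as-is
env:
  - name: GOMAXPROCS
    value: "100m"   # a valid limits.cpu quantity, but not an integer the Go runtime accepts
  - name: GOMEMLIMIT
    value: "120M"   # a valid limits.memory quantity, but "M" is not among the suffixes GOMEMLIMIT supports (B, KiB, MiB, GiB, TiB)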

@inteon (Member) May 13, 2024


It might actually work: https://billglover.me/2022/09/14/use-the-kubernetes-downwards-api-to-set-gomemlimit/
That article seems to suggest that the downward API actually returns the computed integer value.

@SgtCoDFish (Member) May 13, 2024


Then this PR needs confirmation, tests and documentation of that 😁 That article might be worth adding to the PR description but an article isn't really enough to get this change over the line.

Specifically, I'd like to see examples of what GOMAXPROCS is set to if my CPU limits are 0.01, 1m or 2.5 as initial test cases. I'd also like to see what GOMEMLIMIT is set to if I specify 1e6, or 1KB as memory limits.

@jnoordsij (Author)


resourceFieldRef is a Kubernetes directive created specifically for passing resource-related values: it rounds the CPU value up to the nearest whole number (e.g. 250m becomes 1) and passes the memory as a plain numeric byte value, so 64Mi results in the environment variable being set to 67108864. This makes it compatible by design with what the Go runtime expects.

An example is documented within Kubernetes documentation itself: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/#use-container-fields-as-values-for-environment-variables.

Would referencing the Kubernetes documentation suffice here, given that this basically ensures the correct behavior by design? And if so, what would be a suitable place to add this reference?
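For reference, a minimal sketch of the pattern from the linked Kubernetes documentation, using the example values from the comment above (hypothetical limits; the resolved values are shown as comments):

# Hypothetical container snippet using the downward API
resources:
  limits:
    cpu: 250m        # rounded up by the downward API: GOMAXPROCS=1
    memory: 64Mi     # converted to bytes by the downward API: GOMEMLIMIT=67108864
env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        resource: limits.memory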

(Member)

Thank you for the link!

Honestly, having the docs linked in this PR is probably enough. I don't think we need to complicate the Helm chart with a link, and anyone who cares will be able to git blame and find this PR easily enough.

I'll enable testing for this PR 👌

@SgtCoDFish (Member)

/ok-to-test

@cert-manager-prow cert-manager-prow bot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 14, 2024
@SgtCoDFish SgtCoDFish added kind/feature Categorizes issue or PR as related to a new feature. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels May 14, 2024
@SgtCoDFish SgtCoDFish added this to the 1.15 milestone May 14, 2024
@inteon inteon modified the milestones: 1.15, 1.16 May 21, 2024
@inteon (Member) commented Jun 17, 2024

@jnoordsij the next step is to verify that this PR actually improves the performance of cert-manager. @wallrj might be able to help here, as he did some benchmarking of cert-manager in the past.

@wallrj (Member) commented Jun 20, 2024

@jnoordsij thanks for bringing this to our attention. I've been testing the effect of GOMEMLIMIT by setting the environment variables directly.

My conclusion from the experiments below is that setting GOMEMLIMIT equal to the memory limit has the following advantages:

  1. Significantly lower memory usage (at the cost of slightly higher CPU usage)
  2. Fewer OOMKILL container failures
  3. Improved ability to recover from OOMKILL failures and continue reconciling Certificates.

Memory Limit 200MiB / GOMEMLIMIT 200MiB

In this experiment I set both a memory limit of 200Mi and GOMEMLIMIT of 200MiB.

  1. cert-manager reconciles 5000 self-signed RSA 2048 Certificate resources in ~500s (same as the baseline experiment below).
  2. Peak memory usage was ~195MiB (around half the memory usage in the baseline experiment below).
  3. CPU usage looks higher and choppier during the ramp-up phase (compared to the baseline experiment below), maybe due to increased GC activity?
  4. The cert-manager controller was OOMKilled only once (highlighted in the second graph), and it recovered and was able to complete the benchmark.
  5. The Grafana graph shows the cert-manager-controller exceeding its memory limit; I don't know if that's a glitch in the report or a bug in the quota enforcement.
#values.yaml
resources:
  requests:
    cpu: 1
    memory: 200Mi
  limits:
    memory: 200Mi
extraEnv:
  - name: GOMEMLIMIT
    value: 200MiB

prometheus:
  enabled: true
  servicemonitor:
    enabled: true

config:
  apiVersion: controller.config.cert-manager.io/v1alpha1
  kind: ControllerConfiguration
  kubernetesAPIQPS: 10000
  kubernetesAPIBurst: 10000
  maxConcurrentChallenges: 400
  numberOfConcurrentWorkers: 8
  featureGates:
    ServerSideApply: true

[Grafana screenshots: memory and CPU usage graphs for this experiment]

Memory Limit 200MiB

In this experiment I only set a memory limit, not GOMEMLIMIT.

  1. cert-manager fails to reconcile 5000 self-signed RSA 2048 Certificate resources
  2. The cert-manager controller was OOMKilled when memory usage approached 200MiB, and it was OOMKilled again each time the container was restarted.
richard@LAPTOP-HJEQ9V9G:~$ kubectl get pods -n venafi --watch
NAME                                                   READY   STATUS    RESTARTS   AGE
cert-manager-69f6f7585f-vnplg                          1/1     Running   0          4m29s
cert-manager-approver-policy-75664b78fc-jsnp8          1/1     Running   0          3m49s
cert-manager-cainjector-56846796ff-j8rs7               1/1     Running   0          4m29s
cert-manager-webhook-6979b54d5f-brgmg                  1/1     Running   0          4m29s
tlspk-monitoring-kube-state-metrics-68945cb7fd-l5pwb   1/1     Running   0          34s
trust-manager-9445c9f58-d6p2x                          1/1     Running   0          3m26s
venafi-enhanced-issuer-6687f4dcd5-26cxn                1/1     Running   0          3m49s




cert-manager-69f6f7585f-vnplg                          0/1     OOMKilled   0          9m51s
cert-manager-69f6f7585f-vnplg                          1/1     Running     1 (2s ago)   9m52s
cert-manager-69f6f7585f-vnplg                          0/1     OOMKilled   1 (4s ago)   9m54s
cert-manager-69f6f7585f-vnplg                          0/1     CrashLoopBackOff   1 (7s ago)   10m
cert-manager-69f6f7585f-vnplg                          1/1     Running            2 (19s ago)   10m
cert-manager-69f6f7585f-vnplg                          0/1     OOMKilled          2 (21s ago)   10m
cert-manager-69f6f7585f-vnplg                          0/1     CrashLoopBackOff   2 (7s ago)    10m
cert-manager-69f6f7585f-vnplg                          1/1     Running            3 (33s ago)   10m
cert-manager-69f6f7585f-vnplg                          0/1     OOMKilled          3 (35s ago)   10m
cert-manager-69f6f7585f-vnplg                          0/1     CrashLoopBackOff   3 (3s ago)    10m
cert-manager-69f6f7585f-vnplg                          1/1     Running            4 (45s ago)   11m
cert-manager-69f6f7585f-vnplg                          0/1     OOMKilled          4 (46s ago)   11m
cert-manager-69f6f7585f-vnplg                          0/1     CrashLoopBackOff   4 (7s ago)    11m
cert-manager-69f6f7585f-vnplg                          1/1     Running            5 (92s ago)   13m
cert-manager-69f6f7585f-vnplg                          0/1     OOMKilled          5 (93s ago)   13m
cert-manager-69f6f7585f-vnplg                          0/1     CrashLoopBackOff   5 (5s ago)    13m
cert-manager-69f6f7585f-vnplg                          1/1     Running            6 (2m42s ago)   15m
cert-manager-69f6f7585f-vnplg                          0/1     OOMKilled          6 (2m44s ago)   15m
cert-manager-69f6f7585f-vnplg                          0/1     CrashLoopBackOff   6 (1s ago)      15m
cert-manager-69f6f7585f-vnplg                          1/1     Running            7 (5m10s ago)   20m
cert-manager-69f6f7585f-vnplg                          0/1     OOMKilled          7 (5m12s ago)   21m
cert-manager-69f6f7585f-vnplg                          0/1     CrashLoopBackOff   7 (10s ago)     21m
#values.yaml
resources:
  requests:
    cpu: 1
    memory: 200Mi
  limits:
    memory: 200Mi

prometheus:
  enabled: true
  servicemonitor:
    enabled: true

config:
  apiVersion: controller.config.cert-manager.io/v1alpha1
  kind: ControllerConfiguration
  kubernetesAPIQPS: 10000
  kubernetesAPIBurst: 10000
  maxConcurrentChallenges: 400
  numberOfConcurrentWorkers: 8
  featureGates:
    ServerSideApply: true

[Grafana screenshots for this experiment]

Without memory limit

In this baseline experiment, neither a memory limit nor GOMEMLIMIT was set.

  1. cert-manager reconciled 5000 self-signed RSA 2048 Certificate resources in ~500s.
  2. Peak memory usage was 421MiB in the cert-manager-controller.
#values.yaml
resources:
  requests:
    cpu: 1
    memory: 200Mi

prometheus:
  enabled: true
  servicemonitor:
    enabled: true

config:
  apiVersion: controller.config.cert-manager.io/v1alpha1
  kind: ControllerConfiguration
  kubernetesAPIQPS: 10000
  kubernetesAPIBurst: 10000
  maxConcurrentChallenges: 400
  numberOfConcurrentWorkers: 8
  featureGates:
    ServerSideApply: true

[Grafana screenshots for this experiment]

@jnoordsij (Author)

@wallrj thanks a lot for your efforts to benchmark the changes, looking forward to your further findings!

Regarding the exceeding of the memory limit: I've observed similar measurements on my own applications in the past, but I have not yet been able to find a thorough explanation. My best guess so far is some kind of mismeasurement and/or reporting artifact caused by a restart of the container, with the reported value showing the sum of both the just-killed container and the new one, although this is mostly speculative on my part.

@inteon (Member) commented Jun 20, 2024

/hold

@cert-manager-prow cert-manager-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 20, 2024
@inteon inteon removed this from the 1.16 milestone Jun 20, 2024
@wallrj (Member) commented Jun 20, 2024

@jnoordsij wrote:

thanks a lot for your efforts to benchmark the changes, looking forward to your further findings!

I've updated the comment with the remaining findings.
Not claiming this is the most scientific analysis, but it seems to me that setting GOMEMLIMIT does have some clear benefits.

I have some doubts about setting GOMEMLIMIT equal to the resources.limits.memory.

  1. This article recommends setting GOMEMLIMIT to be 5-10% below the hard memory limit (see the sketch after this list).
  2. This article recommends setting GOMEMLIMIT to be "slightly below" the cgroups memory limit.
  3. I wondered if it might ever be advantageous to set only resources.requests.memory and GOMEMLIMIT (omitting resources.limits.memory) as a sort of soft memory limit.
  4. If some users are already setting GOMEMLIMIT by adding extraEnv values when deploying the Helm chart, then the changes in this PR will force those users to adopt this new mechanism.
  5. Users will no longer be able to set bespoke GOMEMLIMIT values.
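A hedged sketch of what that headroom recommendation could look like with the chart's existing extraEnv mechanism, assuming a hypothetical 200Mi container limit (the 180MiB value is illustrative, roughly 10% below the limit):

#values.yaml (hypothetical)
resources:
  requests:
    memory: 200Mi
  limits:
    memory: 200Mi
extraEnv:
  - name: GOMEMLIMIT
    value: 180MiB   # ~10% headroom below the 200Mi container limit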

I might start by trying to document the advantages of setting GOMEMLIMIT in https://cert-manager.io/docs/devops-tips/scaling-cert-manager/#set-appropriate-memory-requests-and-limits

@jnoordsij wrote:

Regarding the exceeding of the memory limit: I've observed similar measurements on my own applications in the past, but I have not yet been able to find a thorough explanation. My best guess so far is some kind of mismeasurement and/or reporting artifact caused by a restart of the container, with the reported value showing the sum of both the just-killed container and the new one, although this is mostly speculative on my part.

I found a couple of possibly related issues:
