
Add pod diagnostics before scaling down to zero in scaler #15326

Open
skonto wants to merge 2 commits into main

Conversation

Contributor

@skonto skonto commented Jun 12, 2024

Fixes #14157

Proposed Changes

  • Replaces "If deployment is never available propagate the container msg" (#14835).

  • Adds pod diagnostics, as that was left pending here; I am wondering what is needed to remove activationTimeoutBuffer.

  • The idea is to mark the revision with resourcesAvailable=false and the pa with ScaleTargetInitialized=false just
    before we apply the scale down to zero, after we have timed out and failed the activation here.
    This would trigger the following condition in the revision lifecycle and pa status propagation:
    if !ps.IsScaleTargetInitialized() && !resUnavailable && ps.ServiceName != "" {
    A revision with no resources available will be set to Ready=False (due to its condSet), and that will propagate the condition up to the ksvc. A rough sketch of this flow follows the test manifest below.

  • Tested with:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    metadata:
      annotations:
        serving.knative.dev/progress-deadline: "45s"
    spec:
      timeoutSeconds: 30
      containers:
        - image: ghcr.io/knative/helloworld-go:latest
          ports:
            - containerPort: 8080
          env:
            - name: TARGET
              value: "World"

Steps to reproduce: first run the ksvc and let it scale down to zero, then remove the revision image from the local registry and disable network access so the image cannot be fetched, and finally issue a new request.
The status of the Serving resources then becomes:

{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "serving.knative.dev/v1",
            "kind": "Service",
            "metadata": {
                "annotations": {
....
            },
            "status": {
                "address": {
                    "url": "http://hello.default.svc.cluster.local"
                },
                "conditions": [
                    {
                        "lastTransitionTime": "2024-06-12T13:02:06Z",
                        "message": "Revision \"hello-00001\" failed with message: Initial scale was never achieved.",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "ConfigurationsReady"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T13:02:05Z",
                        "message": "Revision \"hello-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T13:02:05Z",
                        "message": "Revision \"hello-00001\" failed to become ready.",
                        "reason": "RevisionMissing",
                        "status": "False",
                        "type": "RoutesReady"
                    }
                ],
    }
}
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "serving.knative.dev/v1",
            "kind": "Revision",
            "metadata": {
                "annotations": {
                    "serving.knative.dev/creator": "minikube-user",
                    "serving.knative.dev/progress-deadline": "45s",
                    "serving.knative.dev/routes": "hello",
                    "serving.knative.dev/routingStateModified": "2024-06-12T12:57:33Z"
                },
...

            "status": {
                "actualReplicas": 0,
                "conditions": [
                    {
                        "lastTransitionTime": "2024-06-12T13:02:06Z",
                        "message": "The target is not receiving traffic.",
                        "reason": "NoTraffic",
                        "severity": "Info",
                        "status": "False",
                        "type": "Active"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T12:57:51Z",
                        "status": "True",
                        "type": "ContainerHealthy"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T13:02:06Z",
                        "message": "Initial scale was never achieved",
                        "reason": "ProgressDeadlineExceeded",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T13:02:06Z",
                        "message": "Initial scale was never achieved",
                        "reason": "ProgressDeadlineExceeded",
                        "status": "False",
                        "type": "ResourcesAvailable"
                    }
                ],
...
}
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "serving.knative.dev/v1",
            "kind": "Configuration",
            "metadata": {
...
                "name": "hello",
                "namespace": "default",
...
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2024-06-12T13:02:06Z",
                        "message": "Revision \"hello-00001\" failed with message: Initial scale was never achieved.",
                        "reason": "RevisionFailed",
                        "status": "False",
                        "type": "Ready"
                    }
                ],
...
}
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "autoscaling.internal.knative.dev/v1alpha1",
            "kind": "PodAutoscaler",

  ...
            "spec": {
                "protocolType": "http1",
                "reachability": "Reachable",
                "scaleTargetRef": {
                    "apiVersion": "apps/v1",
                    "kind": "Deployment",
                    "name": "hello-00001-deployment"
                }
            },
            "status": {
                "actualScale": 0,
                "conditions": [
                    {
                        "lastTransitionTime": "2024-06-12T13:02:05Z",
                        "message": "The target is not receiving traffic.",
                        "reason": "NoTraffic",
                        "status": "False",
                        "type": "Active"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T13:02:05Z",
                        "message": "The target is not receiving traffic.",
                        "reason": "NoTraffic",
                        "status": "False",
                        "type": "Ready"
                    },
                    {
                        "lastTransitionTime": "2024-06-12T12:58:51Z",
                        "message": "K8s Service is not ready",
                        "reason": "NotReady",
                        "status": "Unknown",
                        "type": "SKSReady"
                    },
                    ...
                ],
                "desiredScale": 0,
                "metricsServiceName": "hello-00001-private",
                "observedGeneration": 1,
                "serviceName": "hello-00001"
            }
        }
    ],
}

After we bring the image back, a new request works as expected and the resource statuses return to normal.
Release Note


@skonto skonto requested a review from dprotaso June 12, 2024 13:22
@skonto skonto self-assigned this Jun 12, 2024
@knative-prow knative-prow bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 12, 2024
@knative-prow knative-prow bot requested a review from izabelacg June 12, 2024 13:22
@skonto skonto removed their assignment Jun 12, 2024
@knative-prow knative-prow bot requested a review from ReToCode June 12, 2024 13:22
@skonto skonto changed the title Add podchecking before scaling down to zero in scaler Add pod diagnostics before scaling down to zero in scaler Jun 12, 2024

codecov bot commented Jun 12, 2024

Codecov Report

Attention: Patch coverage is 21.21212% with 26 lines in your changes missing coverage. Please review.

Project coverage is 84.60%. Comparing base (62ce45c) to head (248d6e8).
Report is 8 commits behind head on main.

Files Patch % Lines
pkg/reconciler/autoscaling/kpa/scaler.go 21.05% 13 Missing and 2 partials ⚠️
pkg/resources/pods.go 0.00% 7 Missing ⚠️
pkg/apis/autoscaling/v1alpha1/pa_lifecycle.go 0.00% 2 Missing ⚠️
pkg/reconciler/autoscaling/kpa/kpa.go 60.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15326      +/-   ##
==========================================
- Coverage   84.76%   84.60%   -0.16%     
==========================================
  Files         218      218              
  Lines       13504    13534      +30     
==========================================
+ Hits        11447    11451       +4     
- Misses       1690     1713      +23     
- Partials      367      370       +3     


@skonto skonto changed the title Add pod diagnostics before scaling down to zero in scaler [wip] Add pod diagnostics before scaling down to zero in scaler Jun 12, 2024
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2024
Contributor Author

skonto commented Jun 12, 2024

error: the server doesn't have a resource type "ksvc"

@skonto skonto changed the title [wip] Add pod diagnostics before scaling down to zero in scaler Add pod diagnostics before scaling down to zero in scaler Jun 12, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2024
Contributor Author

skonto commented Jun 14, 2024

/retest


knative-prow bot commented Jun 14, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: skonto
Once this PR has been reviewed and has the lgtm label, please ask for approval from dprotaso. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor Author

skonto commented Jun 18, 2024

@dprotaso gentle ping.

1 similar comment
Contributor Author

skonto commented Jun 25, 2024

@dprotaso gentle ping.

Comment on lines +220 to +223
func (pas *PodAutoscalerStatus) MarkScaleTargetNotInitialized(reason, message string) {
	podCondSet.Manage(pas).MarkFalse(PodAutoscalerConditionScaleTargetInitialized, reason, message)
}

Member

We should double-check usages of this condition, because before it would always go Unknown => (True|False) and then remain unchanged.

I can't recall if there's code that assumes that it never changes.

Member

Maybe we want to introduce a new condition, to surface subsequent scaling issues.

Contributor Author

@skonto skonto Jun 26, 2024

I can't recall if there's code that assumes that it never changes.

Don't the tests cover revision transitions? I checked the pa status propagation for the revision reconciliation and we have specific cases where this matters, but they don't seem affected. I can take another look if there is a scenario where this might be a problem. In general we should be able to set this to False (for whatever reason), since it is a legitimate value, and then any reconciliation should take that condition into consideration and adjust. Here we go from True to False.
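
For the True -> False transition mentioned above, a minimal test-style sketch could look like the following. It assumes the existing MarkScaleTargetInitialized/IsScaleTargetInitialized helpers in pa_lifecycle.go plus the MarkScaleTargetNotInitialized helper added by this PR; it is not part of the diff.

package v1alpha1

import "testing"

// Sketch of a transition test, assuming the pa_lifecycle.go helpers named in this PR.
func TestScaleTargetInitializedTransition(t *testing.T) {
	pas := &PodAutoscalerStatus{}
	pas.InitializeConditions()

	pas.MarkScaleTargetInitialized()
	if !pas.IsScaleTargetInitialized() {
		t.Fatal("expected ScaleTargetInitialized to be True")
	}

	// New in this PR: the condition can now go back from True to False.
	pas.MarkScaleTargetNotInitialized("ProgressDeadlineExceeded", "Initial scale was never achieved")
	if pas.IsScaleTargetInitialized() {
		t.Fatal("expected ScaleTargetInitialized to be False after marking not initialized")
	}
}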

Comment on lines +119 to +122
pod, err := podCounter.GetAnyPod()
if err != nil {
	return fmt.Errorf("error getting a pod for the revision: %w", err)
}
Member

Fetching a pod here seems premature

Contributor Author

@skonto skonto Jun 26, 2024

Could you elaborate? We already do that in that function via the pod accessor, for getting the state, a few lines below. We only check the pod status when handling the scale-to-zero case, and only if we have to.
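
To make the "pod diagnostics" part concrete, the kind of check being discussed is roughly the following; the function name and placement are illustrative, not the actual helper added to pkg/resources/pods.go.

package resources

import (
	corev1 "k8s.io/api/core/v1"
)

// waitingReason is a sketch of a pod diagnostic: it returns the reason/message of the
// first container stuck in a waiting state (e.g. ImagePullBackOff, ErrImagePull),
// or ok=false when no container is waiting.
func waitingReason(pod *corev1.Pod) (reason, message string, ok bool) {
	for _, cs := range pod.Status.ContainerStatuses {
		if w := cs.State.Waiting; w != nil && w.Reason != "" {
			return w.Reason, w.Message, true
		}
	}
	return "", "", false
}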

@@ -114,10 +114,16 @@ func (c *Reconciler) ReconcileKind(ctx context.Context, pa *autoscalingv1alpha1.
	if err := c.ReconcileMetric(ctx, pa, resolveScrapeTarget(ctx, pa)); err != nil {
		return fmt.Errorf("error reconciling Metric: %w", err)
	}
	podCounter := resourceutil.NewPodAccessor(c.podsLister, pa.Namespace, pa.Labels[serving.RevisionLabelKey])

	pod, err := podCounter.GetAnyPod()
Member

We should be getting a pod that isn't ready - e.g. you could have min scale = 10 and the last pod can't be scheduled (due to resource constraints).

Contributor Author

@skonto skonto Jun 26, 2024

We are not targeting all pods; we are targeting the scenario with the image issue. If someone wants to cover all cases they can extend the work here later. Maybe I should change the PR title: here we are adding pod diagnostics only for the image issue, or similar issues where all pods are stuck and deployment reconciliation cannot catch it due to known K8s limitations (the progress deadline cannot catch all cases).
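
For completeness, the reviewer's suggestion (diagnose a pod that isn't ready rather than any pod) could be sketched as below; the helper names are hypothetical and this is not part of the current diff.

package resources

import (
	corev1 "k8s.io/api/core/v1"
)

// isPodReady reports whether the pod has condition Ready=True.
func isPodReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// firstNotReadyPod returns the first pod that is not Ready, or nil if all pods are Ready.
// With minScale=10 and one unschedulable pod, this would pick the stuck pod for diagnostics.
func firstNotReadyPod(pods []*corev1.Pod) *corev1.Pod {
	for _, p := range pods {
		if !isPodReady(p) {
			return p
		}
	}
	return nil
}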

Comment on lines +398 to +410
		return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
			rev, err := client.ServingV1().Revisions(pa.Namespace).Get(ctx, pa.Name, metav1.GetOptions{})
			if err != nil {
				return err
			}
			rev.Status.MarkResourcesAvailableFalse(w.Reason, w.Message)
			if _, err = client.ServingV1().Revisions(pa.Namespace).UpdateStatus(ctx, rev, metav1.UpdateOptions{}); err != nil {
				return err
			}
			return nil
		})
	}
}
Member

This is sort of violating our abstractions - if we wanted to propagate this error message to the revision we would have to do it through a PodAutoscaler condition.

Contributor Author

@skonto skonto Jun 26, 2024

have to do it through a PodAutoscaler condition.

Normally yes, but due to the distributed status logic (unclear, undocumented) and how things are implemented, this is safer imho, as it makes the decision locally and avoids influencing anything else down the code path. 🤷 I can try to change it, but it will require propagating this decision down to the pa status update (not ideal, as that code is many lines below); I had it that way previously. Let's see.
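
For context, the abstraction-preserving alternative discussed here would roughly split the work as sketched below, with stand-in interfaces instead of the real Knative types (a sketch only, not a proposed diff).

package propagationsketch

// Stand-in interfaces; the real types would be PodAutoscalerStatus and RevisionStatus.
type paStatus interface {
	MarkScaleTargetNotInitialized(reason, message string)
	IsScaleTargetInitialized() bool
}

type revStatus interface {
	MarkResourcesAvailableFalse(reason, message string)
}

// Scaler side: only record the failure on the PodAutoscaler condition.
func recordFailedActivation(pas paStatus, reason, message string) {
	pas.MarkScaleTargetNotInitialized(reason, message)
}

// Revision reconciler side: translate the PA condition during status propagation,
// instead of the scaler updating the Revision status directly as in this PR.
func propagateToRevision(pas paStatus, rs revStatus, reason, message string) {
	if !pas.IsScaleTargetInitialized() {
		rs.MarkResourcesAvailableFalse(reason, message)
	}
}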

Labels
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Error for failed revision is not reported due to scaling to zero
2 participants