recommend 503 status code for a service with no healthy endpoints #3121

Open
wants to merge 4 commits into
base: main

Conversation

dprotaso
Contributor

@dprotaso dprotaso commented May 30, 2024

What type of PR is this?

/kind cleanup
/kind documentation

What this PR does / why we need it:

A specific error case for the 503 status code is a Kubernetes Service with no healthy endpoints.

It was recommended that we keep this case here - #1210 (comment)

It was subsequently removed in this PR - #1243 (comment)

I think having this differentiator is important because it allows consumers (e.g. Knative) to know whether the 5xx is being returned by the user's pod or by the gateway.

This behaviour is already present in numerous Gateway Implementations (Istio, Contour, Linkerd)

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

HTTPRoute - 503 Status Code MAY be returned for Kubernetes Services that don't have any healthy endpoints

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/documentation Categorizes issue or PR as related to documentation. labels May 30, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 30, 2024
Member

@robscott robscott left a comment

Thanks @dprotaso!

@@ -263,6 +263,9 @@ type HTTPRouteRule struct {
// invalid, 50 percent of traffic must receive a 500. Implementations may
Member

@youngnick do you remember if the intent here was "exactly 500" or "5xx"?

Contributor

It was definitely "exactly 500", we logged #1200 to do that.

I've been trying to remember why we moved this to "exactly 500", and I think it was to do with partial validity rules.
There's a bunch of discussion in #1112, and even more in #1211 about it. There's also some discussion on #1511, with @mikemorris' comment #1151 (comment) being a good summary.

I seem to recall not being confident at the time that we didn't want to overcomplicate the spec. It's already pretty complicated, because we were discussing if "zero endpoints" means "not valid" or not.

Looking back, I think the answer we've landed on is that we treat the references between objects differently to possibly-transient conditions on the proxy anyway. ResolvedRefs is for references.

I don't think we should do this until we've gone back through those discussions and checked that we're not breaking any of the assumptions that we made then - or if we are, then we update other documentation as well to make it clearer.

However, if we can all agree that "zero endpoints" should be considered a transient state that does not impact the validity of the HTTPRoute, then returning a 503 in that case is okay.

Like I said, we need to clarify what happens here in the other listed cases for 500 errors.

  • What happens when there are multiple BackendRefs and one has no endpoints? As it stands, this update leaves that unclear.
  • What happens when all the BackendRefs have no endpoints? (Note that this covers the case where there's only one backend that has no endpoints).

I think the answer should be something like:

  • Having no endpoints does not make an HTTPBackendRef invalid in configuration terms
  • However, a backend with no endpoints MAY (tbh this might need to be SHOULD or even MUST) be treated as invalid for traffic management purposes and return a 503 error code. This means that, if there are multiple backendRefs:
    • each backendRef must get the correct proportion of traffic, even if that means the proportion of traffic bound for that backendRef all gets a 503. This is to ensure that weighted load balancing failures don't happen silently. (There's a case where you're doing a gradual failover, one of the services gets 503, and you don't notice until you flip the weight to 100 percent on the faulty one that we have to avoid.)
    • if all backendRefs have no endpoints, then all traffic that matches that rule will get a 503.

These are basically the same rules as above for 500s, we're basically making a class of traffic that's "invalid at a traffic level, but not at a config level" by doing this.

Contributor Author

@dprotaso dprotaso Jun 6, 2024

> These are basically the same rules as above for 500s, we're basically making a class of traffic that's "invalid at a traffic level, but not at a config level" by doing this.

Yeah - that all sounds good - what further edits do you think this PR requires?

Contributor Author

> There's a case where you're doing a gradual failover, one of the services gets 503, and you don't notice until you flip the weight to 100 percent on the faulty one that we have to avoid.

Can you elaborate on this a bit more?

Contributor

Agreed with @youngnick's summary - went back to read some of my old comments and this seems to align with my thinking at that time.

Contributor Author

Do you want me to codify parts of your comment into the godoc @youngnick ?

Contributor

Yes, I added a suggestion to that effect. Once that's done, this LGTM.

apis/v1/httproute_types.go (outdated, resolved)
@robscott
Member

I think this makes sense, thanks @dprotaso! Would like a LGTM from @mikemorris or @youngnick though.

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso, robscott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2024
Comment on lines +268 to +269
// If an implementation chooses to do this, all of the above rules for 500 responses
// MUST also apply.
Member

@dprotaso @youngnick Not sure I follow this - which of the following is the correct interpretation?

  1. You can use a 503 in this case, but everything else should be precisely a 500.
  2. If you use a 503 in this case, you should use a 503 anywhere else we've recommended using a 500.

I'm assuming it's the first one, but it's not completely obvious to me so could be worth clarifying the spec here.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/documentation Categorizes issue or PR as related to documentation. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
5 participants