Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a graceful shutdown timeout before closing the default ClientFactory #5718

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ikhoon
Copy link
Contributor

@ikhoon ikhoon commented Jun 3, 2024

Motivation:

The default ClientFactory closes immediately when a JVM shuts down. The closed ClientFactory prevents all clients using the default factory from processing requests.

This is not a problem when the client is used alone. However, many clients are used in a server to fetch or deliver data when the server receives a request. If the ClientFactory is closed while the server is still in graceful shutdown mode, all requests will fail. This behavior cannot be called a graceful shutdown.

In this case, an EndpointSelectionTimeoutException is raised by HealthCheckedEndointGroup due to the termination of the default ClientFactory used by HttpHealthChecker.

com.linecorp.armeria.client.UnprocessedRequestException: com.linecorp.armeria.client.endpoint.EndpointSelectionTimeoutException:
  Failed to select within 6400 ms an endpoint from: HealthCheckedEndpointGroup{endpoints=[], numEndpoints=0, candidates=[Endpoint{example.com, ipAddr=x.x.x.x, weight=1000}, ...], numCandidates=8, ...,
  initialized=true, initialSelectionTimeoutMillis=10000, selectionTimeoutMillis=6400, contextGroupChain=[]}
    at com.linecorp.armeria.client.UnprocessedRequestException.of(UnprocessedRequestException.java:45)
    at com.linecorp.armeria.client.HttpClientDelegate.earlyFailedResponse(HttpClientDelegate.java:228)
...
Caused by: com.linecorp.armeria.client.endpoint.EndpointSelectionTimeoutException:
  Failed to select within 6400 ms an endpoint from: HealthCheckedEndpointGroup{endpoints=[], numEndpoints=0, candidates=[Endpoint{example.com, ipAddr=x.x.x.x, weight=1000}, ...], numCandidates=8, ...,
  initialized=true, initialSelectionTimeoutMillis=10000, selectionTimeoutMillis=6400, contextGroupChain=[]}
    at com.linecorp.armeria.client.endpoint.EndpointSelectionTimeoutException.get(EndpointSelectionTimeoutException.java:48)
    at com.linecorp.armeria.client.endpoint.AbstractEndpointSelector.lambda$select$0(AbstractEndpointSelector.java:117)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
    ...
    8 common frames omitted

I propose to add a delay before closing the default ClientFactory so that the server handles the request during graceful shutdown.

Modifications:

  • Add Flags.defaultClientFactoryGracefulShutdownTimeoutMillis() that indicates the default time to wait before closing the default ClientFactory.
  • Add TestFlagsProvider to override defaultClientFactoryGracefulShutdownTimeoutMillis for rapid iterative testing.

Result:

  • HealthCheckedEndpointGroup no longer raises EndpointSelectionTimeoutException when a server is stopped by the JVM shutdown hook.
  • You can now set a delay before the default ClientFactory shuts down via Flags.defaultClientFactoryGracefulShutdownTimeoutMillis(). If not set, 10 seconds is used by default.

…tory`

Motivation:

The default `ClientFactory` closes immediately when a JVM shuts down.
The closed `ClientFactory` prevents the all clients using the default
factory from processing requests.

This is not a problem when the client is used alone. However, many
clients are used in a server to fetch or deliver data when the server
receives a request. If the `ClientFactory` is closed while the server is
still in graceful shutdown mode, all requests will fail.  This behavior
cannot be called a graceful shutdown.

In this case, an `EndpointSelectionTimeoutException` is raised by
`HealthCheckedEndointGroup` due to termination of the default
`ClientFactory` used by `HttpHealthChecker`.
```java
com.linecorp.armeria.client.UnprocessedRequestException: com.linecorp.armeria.client.endpoint.EndpointSelectionTimeoutException:
  Failed to select within 6400 ms an endpoint from: HealthCheckedEndpointGroup{endpoints=[], numEndpoints=0, candidates=[Endpoint{example.com, ipAddr=x.x.x.x, weight=1000}, ...], numCandidates=8, ...,
  initialized=true, initialSelectionTimeoutMillis=10000, selectionTimeoutMillis=6400, contextGroupChain=[]}
    at com.linecorp.armeria.client.UnprocessedRequestException.of(UnprocessedRequestException.java:45)
    at com.linecorp.armeria.client.HttpClientDelegate.earlyFailedResponse(HttpClientDelegate.java:228)
...
Caused by: com.linecorp.armeria.client.endpoint.EndpointSelectionTimeoutException:
  Failed to select within 6400 ms an endpoint from: HealthCheckedEndpointGroup{endpoints=[], numEndpoints=0, candidates=[Endpoint{example.com, ipAddr=x.x.x.x, weight=1000}, ...], numCandidates=8, ...,
  initialized=true, initialSelectionTimeoutMillis=10000, selectionTimeoutMillis=6400, contextGroupChain=[]}
    at com.linecorp.armeria.client.endpoint.EndpointSelectionTimeoutException.get(EndpointSelectionTimeoutException.java:48)
    at com.linecorp.armeria.client.endpoint.AbstractEndpointSelector.lambda$select$0(AbstractEndpointSelector.java:117)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
    ...
    8 common frames omitted
```

I propose to add a delay before closing the default `ClientFactory` so
that the server handles the request during graceful shutdown.

Modifications:

- Add `Flags.defaultClientFactoryGracefulShutdownTimeoutMillis()`
  that indicates the default time to wait before closing the default
  `ClientFactory`.
- Add `TestFlagsProvider` to override `defaultClientFactoryGracefulShutdownTimeoutMillis`
  for rapid iterative testing.

Result:

- `HealthCheckedEndpointGroup` no longer raises `EndpointSelectionTimeoutException`
  when a server is stopped by the JVM shutdown hook.
- You can now set a delay before the default `ClientFactory` shuts down
  via `Flags.defaultClientFactoryGracefulShutdownTimeoutMillis()`.
  If not set, 10 seconds is used by default.
@ikhoon ikhoon added the defect label Jun 3, 2024
@ikhoon ikhoon added this to the 1.29.0 milestone Jun 3, 2024
@@ -97,6 +97,7 @@ final class DefaultFlagsProvider implements FlagsProvider {
static final String DNS_CACHE_SPEC = "maximumSize=4096";
static final long DEFAULT_UNLOGGED_EXCEPTIONS_REPORT_INTERVAL_MILLIS = 10000;
static final long DEFAULT_HTTP1_CONNECTION_CLOSE_DELAY_MILLIS = 3000;
static final long DEFAULT_CLIENT_FACTORY_GRACEFUL_SHUTDOWN_TIMEOUT_MILLIS = 10000;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure if this value was sensible. 😅 A shorter value should be needed for rapid restart and a longer value is useful for a more graceful shutdown.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can probably do it when we reach 2.0:

  • Move ServerBuilder.DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT to Flags
  • Increase ServerBuilder.DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT value from zero
  • Use the value for the client factory as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Let me create an issue based on your idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question) What do you think of just disabling this timeout by default or making it shorter (1s)?

As mentioned in the PR description, this problem occurs when:

  • The DefaultClientFactory is used within a server.
  • gracefulShutdown is used for the server and requests are received during the graceful shutdown period despite the health check signal.
  • Server.closeOnJvmShutdown is being used to shut down the server
    which I don't think is the case for many users.

However, with the current code anyone who has armeria in their classpath and loads DefaultClientFactory.class will observe their jvm shutting down after 10s which I'm not sure is good user experience 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jrhee17. We should not wait this long. I'd actually prefer this graceful shutdown feature to be opt-in only.

Copy link
Contributor Author

@ikhoon ikhoon Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of 10 seconds was derived from Server's the default request timeout. Since many users want to gracefully shut down the server, they will set the graceful shutdown timeout to a value longer than the request timeout. They don't want to fail requests that come in just before the health check goes down. So, I set the default client factory to 10 seconds, the same as the default request timeout.

I think this part is controversial, so I would like to discuss it further.

Currently the server's graceful shutdown is disabled for this operation, so opt-in seems to be more sensible. However, I would recommend that most LINE internal users enable defaultClientFactoryGracefulShutdownTimeoutMillis() option, so I question whether opt-in is truly a good thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For prod, yeah we need it, but for testing and dev, where devs spent most of their time, we probably don't need it? So I'd say we should keep it opt-in only unless we have a notion of 'profile' or 'environment'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. I wasn't sure about the default behavior either. We need to think more about this issue and decide on a direction.

unless we have a notion of 'profile' or 'environment'.

It sounds like a good idea. We may provide a different FlagsProvider depending on profile.

Copy link
Member

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks!

@@ -97,6 +97,7 @@ final class DefaultFlagsProvider implements FlagsProvider {
static final String DNS_CACHE_SPEC = "maximumSize=4096";
static final long DEFAULT_UNLOGGED_EXCEPTIONS_REPORT_INTERVAL_MILLIS = 10000;
static final long DEFAULT_HTTP1_CONNECTION_CLOSE_DELAY_MILLIS = 3000;
static final long DEFAULT_CLIENT_FACTORY_GRACEFUL_SHUTDOWN_TIMEOUT_MILLIS = 10000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can probably do it when we reach 2.0:

  • Move ServerBuilder.DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT to Flags
  • Increase ServerBuilder.DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT value from zero
  • Use the value for the client factory as well.

Copy link
Contributor

@jrhee17 jrhee17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks good overall 👍 Left a couple questions 🙇

Going forward, I wonder if we may want to maintain a data structure similar to spring's DefaultSingletonBeanRegistry and define dependencies between Server and ClientFactory so that startup/shutdown orders are better defined.

@@ -97,6 +97,7 @@ final class DefaultFlagsProvider implements FlagsProvider {
static final String DNS_CACHE_SPEC = "maximumSize=4096";
static final long DEFAULT_UNLOGGED_EXCEPTIONS_REPORT_INTERVAL_MILLIS = 10000;
static final long DEFAULT_HTTP1_CONNECTION_CLOSE_DELAY_MILLIS = 3000;
static final long DEFAULT_CLIENT_FACTORY_GRACEFUL_SHUTDOWN_TIMEOUT_MILLIS = 10000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question) What do you think of just disabling this timeout by default or making it shorter (1s)?

As mentioned in the PR description, this problem occurs when:

  • The DefaultClientFactory is used within a server.
  • gracefulShutdown is used for the server and requests are received during the graceful shutdown period despite the health check signal.
  • Server.closeOnJvmShutdown is being used to shut down the server
    which I don't think is the case for many users.

However, with the current code anyone who has armeria in their classpath and loads DefaultClientFactory.class will observe their jvm shutting down after 10s which I'm not sure is good user experience 🤔

@@ -80,6 +81,16 @@ final class DefaultClientFactory implements ClientFactory {
try {
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
if (!shutdownHookDisabled) {
final long gracefulShutdownTimeoutMillis =
Flags.defaultClientFactoryGracefulShutdownTimeoutMillis();
if (gracefulShutdownTimeoutMillis > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This option is ignored if you manually close the default {@linkplain ClientFactory} with {@link ClientFactory#closeDefault()}

I saw this in the javadocs, is this true? 🤔

https://github.com/jrhee17/armeria/tree/poc/long-jvm-shutdown

...
14:21:16.132 [main] INFO  com.linecorp.armeria.common.Flags - Using Tls engine: OpenSSL BoringSSL, 0x1010107f
14:21:16.165 [main] DEBUG c.l.armeria.client.ClientFactory - Closed the default client factories
14:21:26.178 [Thread-1] DEBUG c.l.armeria.client.ClientFactory - Closing the default client factories
14:21:26.179 [Thread-1] DEBUG c.l.armeria.client.ClientFactory - Closed the default client factories

Flags.defaultClientFactoryGracefulShutdownTimeoutMillis();
if (gracefulShutdownTimeoutMillis > 0) {
try {
Thread.sleep(gracefulShutdownTimeoutMillis);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Shouldn't we allow a user specify the graceful shutdown timeout for each ClientFactory? This PR adds the sleep only to the default factory, which doesn't sound right.
  • What should we do when a user sends a request during this period? Should we reject them?
  • Can we skip waiting when there are no pending requests? Do we need grace period just like we have on the server side shutdown? Just adding a fixed amount of sleep doesn't sound right.

Copy link
Contributor Author

@ikhoon ikhoon Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we allow a user specify the graceful shutdown timeout for each ClientFactory?

Good idea.

Can we skip waiting when there are no pending requests? Do we need grace period just like we have on the server side shutdown? Just adding a fixed amount of sleep doesn't sound right.

If the server allows new requests for 10 seconds considering inflight requests and health check interval, the client should not be closed even if there are no active requests on the client.

What should we do when a user sends a request during this period? Should we reject them?

Since the server controls incoming traffic through health checks, I am not sure if it is right for the client to reject new requests.

The more I think about it, the problem cannot be solved without linking the lifecycles of the server and client.

@@ -97,6 +97,7 @@ final class DefaultFlagsProvider implements FlagsProvider {
static final String DNS_CACHE_SPEC = "maximumSize=4096";
static final long DEFAULT_UNLOGGED_EXCEPTIONS_REPORT_INTERVAL_MILLIS = 10000;
static final long DEFAULT_HTTP1_CONNECTION_CLOSE_DELAY_MILLIS = 3000;
static final long DEFAULT_CLIENT_FACTORY_GRACEFUL_SHUTDOWN_TIMEOUT_MILLIS = 10000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jrhee17. We should not wait this long. I'd actually prefer this graceful shutdown feature to be opt-in only.

@trustin trustin modified the milestones: 1.29.0, 1.30.0 Jun 7, 2024
ikhoon added a commit to ikhoon/armeria that referenced this pull request Jun 10, 2024
…gration

Motivation:

I tried to add a graceful shutdown timeout to not immediately close the
default `ClientFactory`. line#5718 (comment)
However, it is a breaking change and no agreement has been reached on
the default behavior.

I think that line#5718 may take some time to be merged or may not be merged
soon.

This problem can be solved more easily in Spring Boot. Spring Boot
starts the application in the main method and terminates with JVM
shutdown. By shutting down the default `ClientFactory` using
`SpringApplicationShutdownHook`, the graceful shutdown can be supported
without breaking changes for Spring integration, which is used by most
users.

Modifications:

- Use `SpringApplicationShutdownHandlers` to close the default
  `ClientFactory` for Spring integration.

Result:

In Spring integration, the default `Clientfactory` is gracefully closed
after the server is shut down.
@ikhoon ikhoon marked this pull request as draft June 10, 2024 13:07
@ikhoon ikhoon removed this from the 1.30.0 milestone Jun 10, 2024
minwoox pushed a commit that referenced this pull request Jun 11, 2024
…gration (#5742)

Motivation:

I tried to add a graceful shutdown timeout to not immediately close the default `ClientFactory`. #5718 (comment) However, it is a breaking change and no agreement has been reached on the default behavior.

I think that #5718 may take some time to be merged or may not be merged.

This problem can be solved more easily in Spring Boot. Spring Boot starts the application in the main method and terminates with JVM shutdown. By shutting down the default `ClientFactory` using `SpringApplicationShutdownHook`, the graceful shutdown can be supported without breaking changes for Spring integration, which is used by most users.

Modifications:

- Use `SpringApplicationShutdownHandlers` to close the default `ClientFactory` for Spring integration.

Result:

In Spring integration, the default `ClientFactory` is now gracefully closed after the server is shut down.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants