Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return non-0 exit code from copilot upload failure #5444

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ddl-ebrown
Copy link
Contributor

Tracking issue

Why are the changes needed?

  • The current copilot behavior is to:

    • try to upload the specified file
    • if an error occurs, try to write an error file instead
      • if the error file write succeeds, return success -- otherwise error

    This behavior is not ideal, because it suppresses real issues like
    the inability to find the expected output file. While a case can be
    made that the sidecar succeeded in handling its responsibility,
    important information is lost in the process that is useful to other
    containers in the job.

    Generallyk, it's better to consider the inability to
    find the original file a failure of the job to produce the expected
    file in the expected location (useful especially for container tasks).
    That means that a failure to upload the file should also return a
    non-0 exit code.

  • The new behavior is therefore:

    • try to upload the specified file
    • if an error occurs, try to write an error file instead
      • if the error file write succeeds, return the original error
      • if the error file write fails, return a joined version of the errors

What changes were proposed in this pull request?

How was this patch tested?

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@ddl-ebrown ddl-ebrown force-pushed the return-copilot-error-from-file-upload-failure branch from 40a7724 to 49ddaa8 Compare June 3, 2024 05:17
@ddl-ebrown
Copy link
Contributor Author

ddl-ebrown commented Jun 3, 2024

Opening this one up for discussion as I think the current behavior is confusing. Not sure if there should be richer exit codes returned / handled elsewhere -- but I don't think returning 0 right now is correct. If a file was supposed to be present and it wasn't OR if the file could not upload, those should be non-0 exit codes.

Copy link

codecov bot commented Jun 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.99%. Comparing base (bba8c11) to head (69a83fe).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5444   +/-   ##
=======================================
  Coverage   60.99%   60.99%           
=======================================
  Files         793      793           
  Lines       51325    51325           
=======================================
  Hits        31305    31305           
  Misses      17136    17136           
  Partials     2884     2884           
Flag Coverage Δ
unittests-datacatalog 69.31% <ø> (ø)
unittests-flyteadmin 58.71% <ø> (ø)
unittests-flytecopilot 17.79% <ø> (ø)
unittests-flytectl 67.97% <ø> (ø)
unittests-flyteidl 79.04% <ø> (ø)
unittests-flyteplugins 61.81% <ø> (ø)
unittests-flytepropeller 57.32% <ø> (ø)
unittests-flytestdlib 65.82% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

 - The current copilot behavior is to:

   * try to upload the specified file
   * if an error occurs, try to write an error file instead
     * if the error file write succeeds, return success -- otherwise error

   This behavior is not ideal, because it suppresses real issues like
   the inability to find the expected output file. While a case can be
   made that the sidecar succeeded in handling its responsibility,
   important information is lost in the process that is useful to other
   containers in the job.

   Generallyk, it's better to consider the inability to
   find the original file a failure of the job to produce the expected
   file in the expected location (useful especially for container tasks).
   That means that a failure to upload the file should also return a
   non-0 exit code.

 - The new behavior is therefore:

   * try to upload the specified file
   * if an error occurs, try to write an error file instead
     * if the error file write succeeds, return the original error
     * if the error file write fails, return a joined version of the
       errors

Signed-off-by: ddl-ebrown <[email protected]>
@ddl-ebrown ddl-ebrown force-pushed the return-copilot-error-from-file-upload-failure branch from 49ddaa8 to 69a83fe Compare June 14, 2024 04:42
@ddl-ebrown ddl-ebrown marked this pull request as ready for review June 14, 2024 04:56
@@ -91,7 +91,7 @@ func TestUploadOptions_Upload(t *testing.T) {
ok, err := containerwatcher.FileExists(success)
assert.NoError(t, err)
assert.True(t, ok, "sucessfile not created")
assert.NoError(t, uopts.Sidecar(ctx))
assert.Error(t, uopts.Sidecar(ctx))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this actually should be an error

@wild-endeavor
Copy link
Contributor

I like the joining of the errors but personally I think the original behavior of "if the error file write succeeds, return success -- otherwise error" more if only because this is the pattern that flytekit follows for the vast majority of its tasks. I think to change this we'd also have to understand how the plugin would handle the case. An error file implies a clean exit, implying that a user error occurred rather than a system error. The case where copilot is not able to find the expected output file is an okay scenario for making it a non-0 exit code, since that's kind of like a system level error, but in the normal case I think it should still be 0.

If you really need it, maybe you can opt in via an environment variable or something... interpret all errors as exit code 1. However I'm not sure if propeller checks for an errors.pb file if that happens though, so if there's information there then it might get lost. This is kinda of an unspoken API, which I don't like. There's a contract between task and engine that's not formally spelled out. I think this is why i'm averse to changing it.

@ddl-ebrown
Copy link
Contributor Author

I like the joining of the errors but personally I think the original behavior of "if the error file write succeeds, return success -- otherwise error" more if only because this is the pattern that flytekit follows for the vast majority of its tasks. I think to change this we'd also have to understand how the plugin would handle the case. An error file implies a clean exit, implying that a user error occurred rather than a system error. The case where copilot is not able to find the expected output file is an okay scenario for making it a non-0 exit code, since that's kind of like a system level error, but in the normal case I think it should still be 0.

If you really need it, maybe you can opt in via an environment variable or something... interpret all errors as exit code 1. However I'm not sure if propeller checks for an errors.pb file if that happens though, so if there's information there then it might get lost. This is kinda of an unspoken API, which I don't like. There's a contract between task and engine that's not formally spelled out. I think this is why i'm averse to changing it.

CI is green, but not sure if all your concerns are covered in the tests. Could you help me better understand the informal contract? I'm not sure which plugin you're referring to here / the semantics involved.

IMHO, exit codes are the right way to handle the differentiation between system error and user error. While I was at Puppet, we added a --detailed-exitcode switch for exactly this scenario. The initial error code implementation of 1 / 0 was too naive, and it became evident that there was a need to communicate more granular statuses back (and not swallow errors). To not break backwards compatibility, that switch was born (details at https://www.puppet.com/docs/puppet/8/man/agent#usage-notes). Terraform followed the same model (https://developer.hashicorp.com/terraform/cli/commands/plan#detailed-exitcode) as did other tools .

I think it would be pretty straightforward to add a switch and follow that sort of model if you think what's here breaks a contract somewhere?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants