[CudaGraph] Handle exceptions thrown while capturing cuda graph #17113

Merged: 3 commits merged into apache:main from the cuda_graph_exception_handling branch on Jun 27, 2024

Conversation

Lunderberg (Contributor):

Prior to this commit, an exception thrown during the capture of a CUDA graph would result in `std::terminate` being called. This commit updates the implementation of `"vm.builtin.cuda_graph.run_or_capture"` such that a thrown exception can be recovered from and does not cause any changes to the state of TVM's CUDA graph cache. (A minimal sketch of the resulting RAII structure follows the list below.)

- The call to `cudaStreamDestroy` was previously skipped; it is now performed by the RAII-style destructor of a `ScopedCUDAStream` class.

- The call to `cudaStreamEndCapture` was previously skipped; the end of CUDA graph capture is now performed by the RAII-style destructor of a `CUDACaptureStream` class.

- Restoration of `CUDAThreadEntry::ThreadLocal()->stream` was previously skipped; it is now restored by the RAII-style destructor of the `CUDACaptureStream` class.

- Previously, an error raised from `cudaGraphInstantiate` would leave the `capture_cache_` in an ill-formed state. Now, the `capture_cache_` is only updated after a valid `CUDAGraphCapturedState` has been fully constructed.

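A minimal, self-contained sketch of the RAII structure described above, not the PR's exact code: the class names `ScopedCUDAStream` and `CUDACaptureStream` come from the description, but TVM's `CUDA_CALL` error-checking macro is omitted, and the thread-local stream slot (`CUDAThreadEntry::ThreadLocal()->stream` in TVM) is stood in for here by a plain `cudaStream_t*` parameter.

```cpp
#include <cuda_runtime_api.h>

// Owns a CUDA stream; the destructor runs even during stack unwinding, so the
// stream is always destroyed when an exception escapes the capture body.
class ScopedCUDAStream {
 public:
  ScopedCUDAStream() { cudaStreamCreate(&stream_); }
  ~ScopedCUDAStream() { cudaStreamDestroy(stream_); }
  ScopedCUDAStream(const ScopedCUDAStream&) = delete;
  ScopedCUDAStream& operator=(const ScopedCUDAStream&) = delete;
  operator cudaStream_t() const { return stream_; }

 private:
  cudaStream_t stream_ = nullptr;
};

// Begins capture on a fresh stream and swaps it into the caller-provided
// thread-local slot; the destructor ends capture and restores the previous
// stream no matter how the scope is exited.
class CUDACaptureStream {
 public:
  CUDACaptureStream(cudaStream_t* thread_local_slot, cudaGraph_t* output_graph)
      : thread_local_slot_(thread_local_slot),
        previous_stream_(*thread_local_slot),
        output_graph_(output_graph) {
    *thread_local_slot_ = capture_stream_;
    cudaStreamBeginCapture(capture_stream_, cudaStreamCaptureModeGlobal);
  }

  ~CUDACaptureStream() {
    // Intentionally not wrapped in error checking that throws: throwing from
    // a destructor while another exception is unwinding calls std::terminate.
    cudaStreamEndCapture(capture_stream_, output_graph_);
    *thread_local_slot_ = previous_stream_;
  }

 private:
  cudaStream_t* thread_local_slot_;
  cudaStream_t previous_stream_;
  ScopedCUDAStream capture_stream_;
  cudaGraph_t* output_graph_;
};
```

The fourth point is then a matter of ordering: the captured graph is instantiated first, and `capture_cache_` only receives an entry once a fully constructed `CUDAGraphCapturedState` exists, so a failure in `cudaGraphInstantiate` propagates as an exception without leaving a half-initialized cache entry behind.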
Lunderberg requested a review from vinx13 on June 24, 2024.
Review thread on the `CUDACaptureStream` destructor in the diff:

    CUDA_CALL(cudaStreamBeginCapture(capture_stream_, cudaStreamCaptureModeGlobal));
  }
  ~CUDACaptureStream() {
    cudaStreamEndCapture(capture_stream_, output_graph_);
vinx13 (Member):

This should be enclosed with `CUDA_CALL` to check the return code.

Lunderberg (Contributor, Author):
I tested it with `CUDA_CALL` to verify, and using `CUDA_CALL` here would re-introduce the same bug that this change is intended to fix. If the stack is already unwinding due to a thrown exception, throwing another exception results in `std::terminate` being called. To avoid this, destructors shouldn't throw exceptions (stackoverflow link).

There are ways to use `std::uncaught_exceptions` to determine whether the stack is currently being unwound, and to conditionally throw only when it is not. However, those approaches tend to be fairly context-dependent and probably aren't worth using in this case.
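For reference, a minimal sketch of the `std::uncaught_exceptions` idea mentioned above (C++17). This is illustrative only and not part of the PR; the `CaptureEndGuard` class and its members are hypothetical.

```cpp
#include <cuda_runtime_api.h>
#include <exception>
#include <stdexcept>

// Ends CUDA graph capture in the destructor, but only reports a failure by
// throwing when the destructor is NOT itself running as part of unwinding.
class CaptureEndGuard {
 public:
  CaptureEndGuard(cudaStream_t stream, cudaGraph_t* graph)
      : stream_(stream),
        graph_(graph),
        exceptions_on_entry_(std::uncaught_exceptions()) {}

  ~CaptureEndGuard() noexcept(false) {
    cudaError_t err = cudaStreamEndCapture(stream_, graph_);
    if (err != cudaSuccess &&
        std::uncaught_exceptions() == exceptions_on_entry_) {
      // No exception is propagating through this frame, so throwing is safe.
      throw std::runtime_error(cudaGetErrorString(err));
    }
    // Otherwise swallow the error: throwing while unwinding would call
    // std::terminate, which is the failure mode this PR removes.
  }

 private:
  cudaStream_t stream_;
  cudaGraph_t* graph_;
  int exceptions_on_entry_;
};
```

As noted above, whether this extra machinery is worth it is context-dependent; the merged change simply avoids throwing from the destructors.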

vinx13 merged commit a84adaf into apache:main on Jun 27, 2024 (21 checks passed).
Lunderberg deleted the cuda_graph_exception_handling branch on June 27, 2024 at 21:09.