
CUDA backend does not work with rust nightly #1867

Open
jggc opened this issue Jun 8, 2024 · 2 comments

jggc (Contributor) commented Jun 8, 2024

Describe the bug
It is not actually clear whether this is a bug or a feature/documentation request, but here it goes:

Running Rust nightly 2024-05-30, no matter how I set up libtorch I end up with:

2024-06-08T16:45:38.148370Z ERROR burn_train::learner::application_logger: PANIC => panicked at /home/user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tch-0.15.0/src/wrappers/tensor_generated.rs:7988:40:                  
called `Result::unwrap()` on an `Err` value: Torch("Could not run 'aten::empty.memory_format' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective....

The reason I am reporting this is that it is at least the third time I have hit this same error, for different reasons such as:

  • Running Rust stable (a while ago nightly was required by burn)
  • Running Rust nightly
  • Running the wrong CUDA version
  • Running the wrong libtorch version
  • Wrong environment variable setup

My point
I think this error is totally unhelpful, and there is a lot of room for improvement regarding the tch-gpu setup.

What do you think?

Should we:

  1. Implement pre-flight checks (a rough sketch follows below this list)
  2. Improve and consolidate documentation
  3. Improve the error message; reading "operator does not exist" does not hint well at where the issue is, IMHO.
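
To make option 1 concrete, here is a rough, hypothetical sketch of what a pre-flight check could look like. None of this exists in burn today; the function name, the messages, and the choice of checks (environment variables plus tch::Cuda::is_available()) are just illustrations.

```rust
// Hypothetical pre-flight check, not an existing burn API: fail early with a
// readable message instead of panicking deep inside tch when the tch-gpu
// setup is wrong.
fn preflight_check_tch_gpu() -> Result<(), String> {
    // tch-rs reads LIBTORCH (or LIBTORCH_USE_PYTORCH) at build time; a runtime
    // check like this only catches the "variable was set in another shell" case.
    if std::env::var("LIBTORCH").is_err() && std::env::var("LIBTORCH_USE_PYTORCH").is_err() {
        return Err(
            "Neither LIBTORCH nor LIBTORCH_USE_PYTORCH is set; tch may have been \
             built against the wrong (or a CPU-only) libtorch."
                .to_string(),
        );
    }
    // Reports false when libtorch_cuda was stripped by the linker or a CPU-only
    // libtorch was used, which is exactly the situation behind the panic above.
    if !tch::Cuda::is_available() {
        return Err(
            "tch reports that CUDA is unavailable: check the libtorch/CUDA versions \
             and make sure libtorch_cuda is actually linked."
                .to_string(),
        );
    }
    Ok(())
}
```

Calling something like this when the backend is constructed would at least turn the panic into an actionable message.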

caelunshun commented Jun 10, 2024

I've fixed this by depending on tch version 0.15 and adding tch::maybe_init_cuda() to the start of main(). This seems to stop the linker from removing the libtorch_cuda dependency, which is what causes that error message (at least in my case).
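
For reference, a minimal sketch of that workaround, assuming `tch = "0.15"` is already declared as a dependency in Cargo.toml alongside burn:

```rust
fn main() {
    // Referencing tch's CUDA support up front keeps the linker from stripping
    // the libtorch_cuda dependency, which is what triggers the
    // "Could not run 'aten::empty.memory_format' ... 'CUDA' backend" panic.
    tch::maybe_init_cuda();

    // ... build the LibTorch/tch backend and run training as usual ...
}
```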

This problem could definitely be documented better; it took me a couple of hours to figure this out.

laggui (Member) commented Jun 10, 2024

I agree that when running into issues with tch, the actual error is never really clear.

What happens most often is trying to use the CUDA version when the environment variable was set in another shell (not persistent): you run your program and get an error similar to the one you posted. Cargo gets confused, and re-resolving tch-rs based on changes to the environment variable never seemed to work for me, so I end up cleaning the cache and rebuilding the package.

We tried to improve the setup, but the environment variables are required by tch-rs, so it is not straightforward to circumvent (I tried). At the very least, we could definitely add some documentation for common issues. The best we can do about the error message from the torch side is probably to match the generic error message and add some tips/cues.
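
For illustration, a rough sketch of what that matching could look like; the function name and the wording of the hint are placeholders, not an existing burn API:

```rust
// Hypothetical helper: append setup hints when a tch error looks like the
// missing-CUDA-backend failure from this issue.
fn add_tch_cuda_hint(err: &str) -> String {
    if err.contains("Could not run 'aten::") && err.contains("'CUDA' backend") {
        format!(
            "{err}\nHint: this usually means libtorch_cuda was not linked, or the \
             LIBTORCH environment variable pointed to the wrong libtorch when the \
             tch crate was built. Check the tch-gpu setup, then clean and rebuild \
             the tch package."
        )
    } else {
        err.to_string()
    }
}
```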

We're open to suggestions!
