
Check failed: error == cudaSuccess (8 vs. 0) on the nvidia-docker2, RTX3070 #1038

Tommy-gif-ai opened this issue May 3, 2023 · 0 comments

Please use the caffe-users list for usage, installation, or modeling questions, or other requests for help.
Do not post such requests to Issues. Doing so interferes with the development of Caffe.

Please read the guidelines for contributing before submitting this issue.

Issue summary

I proceeded as follows.

  1. Install the NVIDIA Container Toolkit

$ distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
    && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker

  2. Pull the BVLC Caffe Docker image

$ docker pull bvlc/caffe:gpu

  3. GPU test inside Docker

$ docker run --gpus all -it --rm bvlc/caffe:gpu

Then I ran make, make test, and make runtest with the MNIST dataset in Caffe, and confirmed that training and validation ran on the GPU (RTX 3070).

After that, I followed what I saw here to run the SSD. When executing this command, the following error occurs (the full log is attached as a file):

$ python examples/ssd/ssd_pascal.py


I0503 12:11:42.210090 59284 solver.cpp:295] Learning Rate Policy: multistep
I0503 12:11:42.216208 59284 blocking_queue.cpp:50] Data layer prefetch queue empty
F0503 12:11:42.328768 59284 im2col.cu:61] Check failed: error == cudaSuccess (8 vs. 0) invalid device function
*** Check failure stack trace: ***
@ 0x7f15e23a15cd google::LogMessage::Fail()
@ 0x7f15e23a3433 google::LogMessage::SendToLog()
@ 0x7f15e23a115b google::LogMessage::Flush()
@ 0x7f15e23a3e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f15e2cd9b1a caffe::im2col_gpu<>()
@ 0x7f15e2baa829 caffe::BaseConvolutionLayer<>::conv_im2col_gpu()
@ 0x7f15e2baa926 caffe::BaseConvolutionLayer<>::forward_gpu_gemm()
@ 0x7f15e2caad96 caffe::ConvolutionLayer<>::Forward_gpu()
@ 0x7f15e2c64642 caffe::Net<>::ForwardFromTo()
@ 0x7f15e2c64767 caffe::Net<>::Forward()
@ 0x7f15e2bcaff0 caffe::Solver<>::Step()
@ 0x7f15e2bcba7e caffe::Solver<>::Solve()
@ 0x40b9c4 train()
@ 0x407590 main
@ 0x7f15e1311840 __libc_start_main
@ 0x407db9 _start
@ (nil) (unknown)
Aborted
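For reference, the `(8 vs. 0)` in the check failure is the raw CUDA runtime error code. A minimal sketch of looking up a few of the codes Caffe users hit most often (names taken from the CUDA runtime's `cudaError_t` enum; this table is illustrative, not exhaustive):

```python
# Map a few common CUDA runtime error codes to their enum names and meanings.
# Code 8 (cudaErrorInvalidDeviceFunction) is the one in the log above: the
# loaded binary contains no kernel compiled for the running GPU's architecture.
CUDA_ERRORS = {
    0: ("cudaSuccess", "no error"),
    2: ("cudaErrorMemoryAllocation", "out of memory"),
    8: ("cudaErrorInvalidDeviceFunction", "invalid device function"),
    10: ("cudaErrorInvalidDevice", "invalid device ordinal"),
}

def describe(code):
    name, meaning = CUDA_ERRORS.get(code, ("unknown", "see cudaError_t docs"))
    return f"{code}: {name} ({meaning})"

print(describe(8))  # 8: cudaErrorInvalidDeviceFunction (invalid device function)
```

In practice, error 8 almost always means the binary was built without a `-gencode` entry covering the running GPU's compute capability (8.6 for an RTX 3070).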

At first, gpu1, gpu2, and gpu3 were all enabled as well, which caused a different CUDA error. After checking existing Issues and modifying the Python script accordingly, that error disappeared.

But the error above still occurs.
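For context on that modification: `ssd_pascal.py` selects devices via a comma-separated `gpus` string, so restricting it to the single visible device looks roughly like this (the exact variable layout in the script may differ; treat this as an assumption):

```python
# In examples/ssd/ssd_pascal.py, the device list is a comma-separated string.
# With only one RTX 3070 present, restrict it to device 0.
gpus = "0"  # the stock script lists several devices here, e.g. "0,1,2,3"
gpulist = gpus.split(",")
num_gpus = len(gpulist)
print(num_gpus, gpulist)  # 1 ['0']
```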

Steps to reproduce

If you are having difficulty building Caffe or training a model, please ask the caffe-users mailing list. If you are reporting a build error that seems to be due to a bug in Caffe, please attach your build configuration (either Makefile.config or CMakeCache.txt) and the output of the make (or cmake) command.

Your system configuration

Operating system: Ubuntu 18.04.6 LTS
(I used nvidia-docker 2. Please refer to the above work procedure.)
Compiler: I haven't changed any settings related to the compiler (see Makefile.config below).

# Refer to http://caffe.berkeleyvision.org/installation.html
# Contributions simplifying and improving our build system are welcome!

# cuDNN acceleration switch (uncomment to build with cuDNN).
USE_CUDNN := 1

# CPU-only switch (uncomment to build without GPU support).
#CPU_ONLY := 1

# uncomment to disable IO dependencies and corresponding data layers
USE_OPENCV := 0
USE_LEVELDB := 0
USE_LMDB := 0

# uncomment to allow MDB_NOLOCK when reading LMDB files (only if necessary)
# You should not set this flag if you will be reading LMDBs with any
# possibility of simultaneous read and write
ALLOW_LMDB_NOLOCK := 1

# Uncomment if you're using OpenCV 3
OPENCV_VERSION := 3

# To customize your choice of compiler, uncomment and set the following.
# N.B. the default for Linux is g++ and the default for OSX is clang++
CUSTOM_CXX := g++

# CUDA directory contains bin/ and lib/ directories that we need.
CUDA_DIR := /usr/local/cuda
# On Ubuntu 14.04, if cuda tools are installed via
# "sudo apt-get install nvidia-cuda-toolkit" then use this instead:
# CUDA_DIR := /usr

# CUDA architecture setting: going with all of them.
# For CUDA < 6.0, comment the lines after *_35 for compatibility.
CUDA_ARCH := -gencode arch=compute_20,code=sm_21 \
		-gencode arch=compute_30,code=sm_30 \
		-gencode arch=compute_35,code=sm_35 \
		-gencode arch=compute_50,code=sm_50 \
		-gencode arch=compute_52,code=sm_52 \
		-gencode arch=compute_61,code=sm_61 \
		-gencode arch=compute_61,code=compute_61 \
		-gencode arch=compute_86,code=sm_86 \
		-gencode arch=compute_86,code=compute_86

# DCUDA_ARCH_NAME="Manual" -DCUDA_ARCH_BIN="52 60" -DCUDA_ARCH_PTX="60"

# BLAS choice:
# atlas for ATLAS (default)
# mkl for MKL
# open for OpenBlas
BLAS := atlas
#BLAS := open

# Custom (MKL/ATLAS/OpenBLAS) include and lib directories.
# Leave commented to accept the defaults for your choice of BLAS
# (which should work)!
# BLAS_INCLUDE := /path/to/your/blas
# BLAS_LIB := /path/to/your/blas

# Homebrew puts openblas in a directory that is not on the standard search path
# BLAS_INCLUDE := $(shell brew --prefix openblas)/include
# BLAS_LIB := $(shell brew --prefix openblas)/lib

# This is required only if you will compile the matlab interface.
# MATLAB directory should contain the mex binary in /bin.
# MATLAB_DIR := /usr/local
# MATLAB_DIR := /Applications/MATLAB_R2012b.app

# NOTE: this is required only if you will compile the python interface.
# We need to be able to find Python.h and numpy/arrayobject.h.
PYTHON_INCLUDE := /usr/include/python2.7 \
		/usr/lib/python2.7/dist-packages/numpy/core/include

# Anaconda Python distribution is quite popular. Include path:
# Verify anaconda location, sometimes it's in root.
# ANACONDA_HOME := $(HOME)/anaconda2
# PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
#		$(ANACONDA_HOME)/include/python2.7 \
#		$(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include

# Uncomment to use Python 3 (default is Python 2)
# PYTHON_LIBRARIES := boost_python3 python3.5m
# PYTHON_INCLUDE := /usr/include/python3.5m \
#		/usr/lib/python3.5/dist-packages/numpy/core/include

# We need to be able to find libpythonX.X.so or .dylib.
PYTHON_LIB := /usr/lib
# PYTHON_LIB := $(ANACONDA_HOME)/lib

# Homebrew installs numpy in a non standard path (keg only)
# PYTHON_INCLUDE += $(dir $(shell python -c 'import numpy.core; print(numpy.core.__file__)'))/include
# PYTHON_LIB += $(shell brew --prefix numpy)/lib

# Uncomment to support layers written in Python (will link against Python libs)
WITH_PYTHON_LAYER := 1

# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include \
		/usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu/hdf5/serial

# If Homebrew is installed at a non standard location (for example your home directory) and you use it for general dependencies
# INCLUDE_DIRS += $(shell brew --prefix)/include
# LIBRARY_DIRS += $(shell brew --prefix)/lib

# Uncomment to use pkg-config to specify OpenCV library paths.
# (Usually not necessary -- OpenCV libraries are normally installed in one of the above $LIBRARY_DIRS.)
USE_PKG_CONFIG := 1

# N.B. both build and distribute dirs are cleared on make clean
BUILD_DIR := build
DISTRIBUTE_DIR := distribute

# Uncomment for debugging. Does not work on OSX due to BVLC#171
DEBUG := 1

# The ID of the GPU that 'make runtest' will use to run unit tests.
TEST_GPUID := 0

# enable pretty build (comment to see full commands)
Q ?= @
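A quick way to sanity-check that the CUDA_ARCH line above actually covers the RTX 3070 (compute capability 8.6, i.e. sm_86) is to parse the -gencode flags. A small sketch (the arch string is abridged from the Makefile above):

```python
import re

# The CUDA_ARCH value from the Makefile.config above (abridged to a few entries).
cuda_arch = (
    "-gencode arch=compute_20,code=sm_21 "
    "-gencode arch=compute_52,code=sm_52 "
    "-gencode arch=compute_61,code=sm_61 "
    "-gencode arch=compute_86,code=sm_86"
)

# Collect the real-architecture (sm_XX) targets named in the flags.
sms = set(re.findall(r"code=sm_(\d+)", cuda_arch))
print(sorted(sms))  # ['21', '52', '61', '86']
print("86" in sms)  # True: the RTX 3070's sm_86 is listed
```

Note that listing sm_86 only helps if the nvcc doing the build supports it; an nvcc as old as the CUDA 8.0 one shown further down cannot generate sm_86 code and would reject those entries.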

CUDA version (if applicable):
I still don't understand this part. The result of the nvidia-smi command and the result of the nvcc -V command are different, and I don't know whether that is because of the Caffe Docker image. Because of this, I changed the CUDA configuration part of the Makefile several times, but the result is the same (see the Makefile above; for reference, I upgraded the CUDA version from 11 to 12).
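For what it's worth, these two commands report different things: nvidia-smi prints the newest CUDA version the installed driver supports, while nvcc -V prints the version of the CUDA toolkit that actually compiles the code. A tiny sketch of extracting the toolkit version from the nvcc -V text shown below:

```python
import re

# Output line from `nvcc -V` as shown below (the toolkit inside the container).
nvcc_output = "Cuda compilation tools, release 8.0, V8.0.61"

match = re.search(r"release (\d+\.\d+)", nvcc_output)
toolkit = float(match.group(1))
print(toolkit)  # 8.0

# sm_86 (RTX 3070) support first appeared in CUDA toolkit 11.1,
# so this toolkit cannot emit code for that GPU.
print(toolkit >= 11.1)  # False
```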

This is the result of 'nvidia-smi':
root@a50f950a8134:/opt/caffe# nvidia-smi
Wed May 3 12:30:35 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.47 Driver Version: 531.68 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 On | 00000000:01:00.0 On | N/A |
| 33% 33C P8 16W / 220W| 1231MiB / 8192MiB | 5% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

This is the result of 'nvcc -V':

root@a50f950a8134:/opt/caffe# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61
root@a50f950a8134:/opt/caffe#

CUDNN version (if applicable):

BLAS: Even after installing OpenBLAS, an error occurred, so ATLAS was used instead (see the Makefile above).
Python or MATLAB version (for pycaffe and matcaffe respectively):
