ValueError: cost matrix is infeasible when training mask2former_r50_8xb2-160k_ade20k-512x512.py #3706

Open
tms2003 opened this issue Jun 14, 2024 · 0 comments

tms2003 commented Jun 14, 2024

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug
A clear and concise description of what the bug is.

Reproduction

  1. What command or script did you run?

     bash tools/dist_train.sh configs/mask2former/mask2former_r50_8xb2-160k_ade20k-512x512.py 4 --work-dir work_dirs/mask2former_r50_8xb2-160k_ade20k-512x512 --amp
  2. Did you make any modifications on the code or config? Did you understand what you have modified?

no

  3. What dataset did you use?

Environment

  1. Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.
sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 3080 Ti
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.140
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.1.0
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.0
OpenCV: 4.10.0
MMEngine: 0.10.4
MMSegmentation: 1.2.2+b040e14
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

If applicable, paste the error traceback here.


06/14 08:08:07 - mmengine - INFO - paramwise_options -- decode_head.query_feat.weight:lr_mult=1.0                                                   
06/14 08:08:07 - mmengine - INFO - paramwise_options -- decode_head.query_feat.weight:decay_mult=0.0                                                
06/14 08:08:07 - mmengine - INFO - paramwise_options -- decode_head.level_embed.weight:lr=0.0001                                                    
06/14 08:08:07 - mmengine - INFO - paramwise_options -- decode_head.level_embed.weight:weight_decay=0.0                                             
06/14 08:08:07 - mmengine - INFO - paramwise_options -- decode_head.level_embed.weight:lr_mult=1.0                                                  
06/14 08:08:07 - mmengine - INFO - paramwise_options -- decode_head.level_embed.weight:decay_mult=0.0                                               
06/14 08:08:07 - mmengine - WARNING - The prefix is not set in metric class IoUMetric.                                                              
06/14 08:08:07 - mmengine - INFO - load model from: torchvision://resnet50                                                                          
06/14 08:08:07 - mmengine - INFO - Loads checkpoint by torchvision backend from path: torchvision://resnet50                                        
06/14 08:08:07 - mmengine - WARNING - The model and loaded state dict do not match exactly                                                          
                                                                                                                                                    
unexpected key in source state_dict: fc.weight, fc.bias                                                                                             
                                                                                                                                                    
06/14 08:08:07 - mmengine - WARNING - "FileClient" will be deprecated in future. Please use io functions in https://mmengine.readthedocs.io/en/latest/api/fileio.html#file-io
06/14 08:08:07 - mmengine - WARNING - "HardDiskBackend" is the alias of "LocalBackend" and the former will be deprecated in future.                 
06/14 08:08:07 - mmengine - INFO - Checkpoints will be saved to /home/incar/tms/source/rsdemo/mmsegmentation/work_dirs/mask2former_r50_8xb2-160k_ade20k-512x512.
/home/incar/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1695392036766/work/aten/src/ATen/native/TensorShape.cpp:3526.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
    return forward_call(*args, **kwargs)
  File "/home/incar/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/incar/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/incar/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/incar/miniconda3/envs/openmmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/incar/tms/source/rsdemo/mmsegmentation/mmseg/models/segmentors/base.py", line 94, in forward
    return self.loss(inputs, data_samples)
  File "/home/incar/tms/source/rsdemo/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 178, in loss
    loss_decode = self._decode_head_forward_train(x, data_samples)
  File "/home/incar/tms/source/rsdemo/mmsegmentation/mmseg/models/segmentors/encoder_decoder.py", line 139, in _decode_head_forward_train
    loss_decode = self.decode_head.loss(inputs, data_samples,
  File "/home/incar/tms/source/rsdemo/mmsegmentation/mmseg/models/decode_heads/mask2former_head.py", line 126, in loss
    losses = self.loss_by_feat(all_cls_scores, all_mask_preds,
  File "/home/incar/tms/source/rsdemo/mmdetection/mmdet/models/dense_heads/maskformer_head.py", line 348, in loss_by_feat
    losses_cls, losses_mask, losses_dice = multi_apply(
  File "/home/incar/tms/source/rsdemo/mmdetection/mmdet/models/utils/misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/incar/tms/source/rsdemo/mmdetection/mmdet/models/dense_heads/mask2former_head.py", line 273, in _loss_by_feat_single
    avg_factor) = self.get_targets(cls_scores_list, mask_preds_list,
  File "/home/incar/tms/source/rsdemo/mmdetection/mmdet/models/dense_heads/maskformer_head.py", line 237, in get_targets
    results = multi_apply(self._get_targets_single, cls_scores_list,
  File "/home/incar/tms/source/rsdemo/mmdetection/mmdet/models/utils/misc.py", line 219, in multi_apply
    return tuple(map(list, zip(*map_results)))
  File "/home/incar/tms/source/rsdemo/mmdetection/mmdet/models/dense_heads/mask2former_head.py", line 222, in _get_targets_single
    assign_result = self.assigner.assign(
  File "/home/incar/tms/source/rsdemo/mmdetection/mmdet/models/task_modules/assigners/hungarian_assigner.py", line 131, in assign
    matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
ValueError: cost matrix is infeasible 
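
A note for anyone triaging this: as far as I know, SciPy's `linear_sum_assignment` raises `ValueError: cost matrix is infeasible` when no complete assignment with a finite total cost exists, which in practice points to `+inf` entries in the cost matrix. Since the run uses `--amp`, one plausible (unconfirmed) source is float16 overflow while the matching costs are computed. The helper below is only a diagnostic sketch, not part of MMDetection or MMSegmentation: `summarize_nonfinite` is a hypothetical name, and it assumes the cost matrix has been converted to nested Python lists (for a tensor, `cost.tolist()`), for example just before the `linear_sum_assignment(cost)` call in `hungarian_assigner.py` shown in the traceback.

```python
import math


def summarize_nonfinite(cost):
    """Summarize non-finite entries of a rectangular 2-D cost matrix.

    `cost` is a list of equally long rows of floats (hypothetical input
    format chosen for this sketch). Returns a tuple of:
      - counts of NaN, +inf and -inf entries,
      - indices of rows that contain no finite entry,
      - indices of columns that contain no finite entry.
    """
    counts = {"nan": 0, "posinf": 0, "neginf": 0}
    n_cols = len(cost[0]) if cost else 0
    row_has_finite = [False] * len(cost)
    col_has_finite = [False] * n_cols

    for i, row in enumerate(cost):
        for j, value in enumerate(row):
            if math.isnan(value):
                counts["nan"] += 1
            elif value == math.inf:
                counts["posinf"] += 1
            elif value == -math.inf:
                counts["neginf"] += 1
            else:
                # A finite entry keeps this row and column assignable.
                row_has_finite[i] = True
                col_has_finite[j] = True

    blocked_rows = [i for i, ok in enumerate(row_has_finite) if not ok]
    blocked_cols = [j for j, ok in enumerate(col_has_finite) if not ok]
    return counts, blocked_rows, blocked_cols


if __name__ == "__main__":
    # Toy matrix standing in for the real (queries x ground-truth masks) costs.
    toy_cost = [[1.0, math.inf], [math.nan, math.inf]]
    print(summarize_nonfinite(toy_cost))
```

If the summary shows `+inf` entries, or a column with no finite entry, that would be consistent with the error above; rerunning the same command without `--amp` is one way to check whether mixed precision is involved.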

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!
