
Distributed setup error #19

Open
WeiminLee opened this issue May 20, 2024 · 1 comment
Comments

@WeiminLee

My code keeps hanging in torch.distributed.init_process_group. How can I fix this?

Environment: single machine, multiple GPUs

Environment variable settings:
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '4'  # because I only want to use 4 of the GPUs
os.environ['LOCAL_RANK'] = '0'
os.environ['MASTER_ADDR'] = '127.0.0.1'  # address of rank 0
os.environ['MASTER_PORT'] = '29500'  # any free port
os.environ['NCCL_IB_DISABLE'] = "1"
os.environ['NCCL_IBEXT_DISABLE'] = "1"

The code below keeps timing out. Is there something wrong with my settings?

import datetime
import torch.distributed

args.dist_url = "env://"
args.dist_backend = "nccl"

torch.distributed.init_process_group(
    backend=args.dist_backend,
    init_method=args.dist_url,
    world_size=args.world_size,
    rank=args.rank,
    timeout=datetime.timedelta(seconds=10),  # allow auto-downloading and de-compressing
)
torch.distributed.barrier()

@Coobiw
Owner

Coobiw commented May 20, 2024

os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '4'  # because I only want to use 4 of the GPUs
os.environ['LOCAL_RANK'] = '0'

Do not set these three variables yourself.

If you only want to use four of the GPUs, just prefix the launch command with CUDA_VISIBLE_DEVICES=x,x,x,x, as in the sketch below.

If it still times out, remove the two NCCL environment variables.
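A minimal sketch of this advice, assuming the script is launched with torchrun (the script name train.py, the process count, and the GPU indices are placeholders, not taken from this repo): torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT per process, so the script no longer hard-codes them.

# Assumed launch command (script name and GPU ids are placeholders):
#   CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 train.py

import os
import datetime
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    init_method="env://",  # rank/world size are read from the env vars torchrun sets
    timeout=datetime.timedelta(minutes=10),
)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)  # pin this process to its own visible GPU
dist.barrier()

The original setup hangs because it hard-codes RANK=0 and WORLD_SIZE=4 in a single process: init_process_group then waits for three other ranks that never join, until the timeout fires.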
