-
Notifications
You must be signed in to change notification settings - Fork 21.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[distributed] NCCL result code update #128777
Conversation
The nccl result codes are outdated. This PR fixes pytorch#128756.
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128777
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ No FailuresAs of commit cc6d86d with merge base 6079c50 (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
The committers listed above are authorized under a signed CLA. |
@@ -44,8 +44,9 @@ enum class ncclResult { | |||
InternalError = 3, | |||
InvalidArgument = 4, | |||
InvalidUsage = 5, | |||
NumResults = 6, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to append RemoteError=8 without changing the previous enum item values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are the official values.
While remapping (6<->8) can be an option, it can be confusing and may cause issues down the road.
Is there any issue with updating the previous values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK the definition was copied from NCCL. Is it better to drop our version and use the official definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the nccl's official doc has the same values. My opinion is to use the official values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to define an alias enum by using? I don't know why the old definition was copied. But it seems unnecessary now.
@Skylion007 Since the PR is now approved, is there any task left from my end? |
@pytorchbot merge |
Merge failedReason: This PR needs a If not, please add the To add a label, you can comment to pytorchbot, for example For more information, see Details for Dev Infra teamRaised by workflow job |
@pytorchbot label "topic: not user facing" |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
The nccl result codes are outdated. This PR fixes #128756.
Fixes #128756