[distributed] NCCL result code update #128777

Closed

myungjin
Contributor

The NCCL result codes are outdated. This PR fixes #128756.

Fixes #128756

@myungjin myungjin requested a review from eqy as a code owner June 15, 2024 19:04

pytorch-bot bot commented Jun 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128777

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit cc6d86d with merge base 6079c50 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


linux-foundation-easycla bot commented Jun 15, 2024

CLA Signed


The committers listed above are authorized under a signed CLA.

@@ -44,8 +44,9 @@ enum class ncclResult {
InternalError = 3,
InvalidArgument = 4,
InvalidUsage = 5,
NumResults = 6,
Collaborator

@cyyever cyyever Jun 15, 2024

Is it possible to append RemoteError=8 without changing the previous enum item values?

Contributor Author

Here are the official values.
While remapping (6<->8) can be an option, it can be confusing and may cause issues down the road.
Is there any issue with updating the previous values?
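
For reference, the renumbering under discussion can be sketched roughly as follows. This is a minimal sketch, not the PR's actual diff: the `sketch` namespace and the `to_int` helper are hypothetical, and the enumerator names for values 0 to 2, 6, and 7 are assumed from NCCL's `ncclResult_t` naming (only `InternalError`, `InvalidArgument`, `InvalidUsage`, and `NumResults` appear in the hunk above).

```cpp
// Hypothetical mirror of NCCL's result codes, for illustration only.
// `sketch` is not a PyTorch namespace.
namespace sketch {

enum class ncclResult {
  Success = 0,             // name assumed from ncclSuccess
  UnhandledCudaError = 1,  // name assumed from ncclUnhandledCudaError
  SystemError = 2,         // name assumed from ncclSystemError
  InternalError = 3,
  InvalidArgument = 4,
  InvalidUsage = 5,
  RemoteError = 6,         // new code; takes the slot NumResults used to hold
  InProgress = 7,          // new code
  NumResults = 8,          // sentinel moves from 6 to 8
};

// Hypothetical helper: convert an enumerator to its underlying integer.
constexpr int to_int(ncclResult r) {
  return static_cast<int>(r);
}

// Compile-time guards so the mirror cannot silently drift again.
static_assert(to_int(ncclResult::InvalidUsage) == 5, "existing codes keep their values");
static_assert(to_int(ncclResult::RemoteError) == 6, "RemoteError must be 6");
static_assert(to_int(ncclResult::InProgress) == 7, "InProgress must be 7");
static_assert(to_int(ncclResult::NumResults) == 8, "NumResults must be 8");

}  // namespace sketch
```

In a translation unit that does include `nccl.h`, the same guards could compare each mirrored enumerator against the corresponding `nccl*` constant instead of a literal, which would catch the kind of drift reported in #128756 at compile time.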

Collaborator

OK, the definition was copied from NCCL. Would it be better to drop our version and use the official definition?

Contributor Author

@myungjin myungjin Jun 16, 2024

Yes, NCCL's official documentation has the same values. In my opinion, we should use the official values.

Collaborator

@cyyever cyyever Jun 16, 2024

Could you try defining an alias for the enum with `using`? I don't know why the old definition was copied, but it seems unnecessary now.
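
The alias idea could look something like the sketch below. It is only an illustration under stated assumptions: the `typedef enum` is a stand-in for what `<nccl.h>` would declare (reproduced here so the snippet compiles on its own), and the `sketch_alias` namespace and `is_failure` helper are hypothetical names, not part of PyTorch or NCCL.

```cpp
#include <type_traits>

// Stand-in for the declaration <nccl.h> would provide. Real code would
// include the header rather than repeat these values.
typedef enum {
  ncclSuccess = 0,
  ncclUnhandledCudaError = 1,
  ncclSystemError = 2,
  ncclInternalError = 3,
  ncclInvalidArgument = 4,
  ncclInvalidUsage = 5,
  ncclRemoteError = 6,
  ncclInProgress = 7,
  ncclNumResults = 8
} ncclResult_t;

namespace sketch_alias {  // hypothetical namespace

// The alias introduces no new type, so there are no values to keep in sync.
using ncclResult = ::ncclResult_t;

static_assert(std::is_same<ncclResult, ::ncclResult_t>::value,
              "alias names the official type");

// Hypothetical helper: treat anything other than success or in-progress
// as a failure.
constexpr bool is_failure(ncclResult r) {
  return r != ncclSuccess && r != ncclInProgress;
}

}  // namespace sketch_alias
```

The trade-off is that an alias requires the NCCL header to be visible wherever the alias is declared, whereas a copied enum does not; that may or may not be why the definition was mirrored in the first place.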

@Skylion007 Skylion007 added this to the 2.4.0 milestone Jun 16, 2024
@myungjin
Contributor Author

@Skylion007 Since the PR is now approved, is there anything left for me to do?

@cyyever cyyever added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 18, 2024
@cyyever
Collaborator

cyyever commented Jun 20, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@cyyever
Collaborator

cyyever commented Jun 20, 2024

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 20, 2024
@cyyever
Collaborator

cyyever commented Jun 20, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged open source topic: not user facing topic category
Development

Successfully merging this pull request may close these issues.

Outdated ncclResult code
5 participants