Fix spmd reduce scatter python test #7500

Merged 1 commit on Jun 26, 2024

Conversation

bhavya01
Collaborator

These tests were failing on v5e-8 with the following error:

$ python test/spmd/test_xla_sharding.py
.....................................................F
Actual: tensor([[8.],
        [8.],
        [8.],
        [8.],
        [8.],
        [8.],
        [8.],
        [8.]]) 
        
Expected: tensor([[4.],
        [4.],
        [4.],
        [4.],
        [4.],
        [4.],
        [4.],
        [4.]])
F........
======================================================================
FAIL: test_spmd_reduce_scatter (__main__.BasicXlaShardingTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/disks/bbahl/pytorch/xla/test/spmd/test_xla_sharding.py", line 1235, in test_spmd_reduce_scatter
    torch.testing.assert_close(x.cpu(), expected_x)
  File "/mnt/disks/bbahl/miniconda3/envs/torchnightly/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1524, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 8 / 8 (100.0%)
Greatest absolute difference: 4.0 at index (0, 0) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 0) (up to 1.3e-06 allowed)

======================================================================
FAIL: test_spmd_reduce_scatter_canonical_index (__main__.BasicXlaShardingTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/disks/bbahl/pytorch/xla/test/spmd/test_xla_sharding.py", line 1257, in test_spmd_reduce_scatter_canonical_index
    torch.testing.assert_close(x.cpu(), expected_x)
  File "/mnt/disks/bbahl/miniconda3/envs/torchnightly/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1524, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 8 / 8 (100.0%)
Greatest absolute difference: 4.0 at index (0, 0) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0, 0) (up to 1.3e-06 allowed)

----------------------------------------------------------------------
Ran 63 tests in 2.984s

FAILED (failures=2)

It seems the expected tensor contents should depend on the number of devices we shard across: on v5e-8 the actual value is 8, while the test expects 4.
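To illustrate why the value tracks the device count, here is a minimal pure-Python sketch of a sum reduce-scatter over tensors of ones. It is not the diff from this PR and does not call torch_xla; `simulate_reduce_scatter_sum` and `expected_shard` are hypothetical helpers, and the 8x8 input of ones is an assumption chosen to match the shapes in the log above.

```python
def simulate_reduce_scatter_sum(per_device_inputs):
    """Sum the inputs elementwise across devices, then split columns evenly.

    per_device_inputs: one 2-D list per simulated device, all the same shape.
    Returns one column shard of the reduced result per device.
    """
    n_devices = len(per_device_inputs)
    rows = len(per_device_inputs[0])
    cols = len(per_device_inputs[0][0])
    assert cols % n_devices == 0, "columns must divide evenly across devices"

    # Reduce: elementwise sum over all devices.
    reduced = [[sum(dev[r][c] for dev in per_device_inputs)
                for c in range(cols)]
               for r in range(rows)]

    # Scatter: device d keeps columns [d * shard, (d + 1) * shard).
    shard = cols // n_devices
    return [[row[d * shard:(d + 1) * shard] for row in reduced]
            for d in range(n_devices)]


def expected_shard(n_devices, rows=8, cols=8):
    """Expected per-device shard when every device contributes ones.

    The value scales with n_devices instead of being a fixed constant.
    """
    return [[float(n_devices)] * (cols // n_devices) for _ in range(rows)]


for n in (4, 8):
    ones = [[[1.0] * 8 for _ in range(8)] for _ in range(n)]
    shards = simulate_reduce_scatter_sum(ones)
    assert all(s == expected_shard(n) for s in shards)
```

Under these assumptions, 4 devices give shards of shape 8x2 filled with 4, while 8 devices give shards of shape 8x1 filled with 8, which is consistent with the mismatch reported on v5e-8.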

@bhavya01 bhavya01 self-assigned this Jun 25, 2024
@bhavya01 bhavya01 merged commit c654f12 into master Jun 26, 2024
23 checks passed
@alanwaketan
Collaborator

LGTM. Sorry for the late review.
