See the discussion on #993. @andyl98 has reported that when running on 8x A100 with 1 TB of DRAM, they hit a CPU OOM during checkpoint save. They also point out that the OOM does not occur when `FullOptimStateDictConfig` is not used.
In an ideal world I think we should only need (model params) + (optimizer params) = (70B * 2) + (70B * 2 * 2) = 140 GB + 280 GB = 420 GB in bf16, so it seems like the unsharding is being done inefficiently (at least with respect to CPU RAM)?
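A minimal sketch of that back-of-the-envelope estimate, for anyone who wants to plug in other model sizes. It assumes 2 bytes per bf16 value, 1 GB = 1e9 bytes, and two optimizer state tensors per parameter (my reading of the `2 * 2` factor, as with Adam-style optimizers; not confirmed in the thread). The function name `ideal_full_checkpoint_gb` is hypothetical and not part of any library.

```python
def ideal_full_checkpoint_gb(num_params, bytes_per_value=2, optim_states_per_param=2):
    """Estimate CPU RAM (GB, 1 GB = 1e9 bytes) for one full, unsharded copy.

    Assumptions (not from any library): every value is stored in bf16
    (2 bytes) and the optimizer keeps `optim_states_per_param` tensors
    per model parameter.
    """
    model_bytes = num_params * bytes_per_value
    optim_bytes = num_params * optim_states_per_param * bytes_per_value
    total_bytes = model_bytes + optim_bytes
    return model_bytes / 1e9, optim_bytes / 1e9, total_bytes / 1e9


# 70B parameters, as in the estimate above.
model_gb, optim_gb, total_gb = ideal_full_checkpoint_gb(70_000_000_000)
print(f"model: {model_gb:.0f} GB, optimizer: {optim_gb:.0f} GB, total: {total_gb:.0f} GB")
```

This only counts a single gathered copy; any extra full-size buffers or per-rank copies made during unsharding would push the real peak above this number.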