
Distributed checkpointing user guide #9494

Open
mikolajblaz wants to merge 5 commits into base: yuya/add_checkpoints_section

Conversation

mikolajblaz

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Mikołaj Błaż <[email protected]>

# Distributed checkpoint save
sharded_state_dict = {
    # (0, rank, world_size) = (sharding axis, this rank's offset along that axis, number of fragments along that axis)
    'weight': dist_checkpointing.ShardedTensor.from_rank_offsets('weight', local_ten, (0, rank, world_size))

can you add a comment explaining (0, rank, world_size)?
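
For illustration, a minimal sketch of how the rank-offset tuples compose when a tensor is sharded along more than one axis (the rank and group-size variables are placeholders, not from this PR):

# Each tuple is (axis, offset of this rank's chunk along that axis, number of fragments along that axis).
sharded_weight = dist_checkpointing.ShardedTensor.from_rank_offsets(
    'decoder.weight',              # unique key in the checkpoint
    local_chunk,                   # this rank's local tensor
    (0, row_rank, row_groups),     # axis 0 is split into row_groups pieces; this rank holds piece row_rank
    (1, col_rank, col_groups),     # axis 1 is split into col_groups pieces; this rank holds piece col_rank
)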

4. All other objects are treated as "common" and saved according to a sharded strategy (see `Save and load strategies`_)
5. All ShardedObjects are extracted from point (3) objects and saved with a common strategy (see `Save and load strategies`_)
6. All ShardedTensors are saved.
7. `metadata.json` file with backend and version metadata is saved to the checkpoint directory.

maybe put a link to source code here to show where those steps happen
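
For orientation, a hedged sketch of the kind of mixed sharded state dict these steps operate on (keys, shapes and the rank/world_size variables are made up; ShardedObject is assumed to be exposed at the package level and to take (key, data, global_shape, global_offset)):

sharded_state_dict = {
    # step 6: a sharded tensor
    'weight': dist_checkpointing.ShardedTensor.from_rank_offsets('weight', local_weight, (0, rank, world_size)),
    # step 5: a sharded (non-tensor) object, one per rank
    'rng_state': dist_checkpointing.ShardedObject('rng_state', rng_state, (world_size,), (rank,)),
    # step 4: a plain "common" object, assumed identical on all ranks
    'iteration': iteration,
}
dist_checkpointing.save(sharded_state_dict, ckpt_dir)  # step 7: also writes metadata.json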

The sharded state dict is processed in the following way:

1. The ShardedTensorFactories are applied
2. LocalNonPersistentObject are extracted from the sharded state dict and ignored

LocalNonPersistentObject wasn't explained. What's this?
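
For context, a hedged sketch of how such a wrapper is used (in Megatron-Core the class is spelled LocalNonpersitentObject and is assumed to be importable from the package root; the key and value here are made up):

# Wrap a value that should not be written to the checkpoint but should still
# appear in the state dict returned by load().
sharded_state_dict = {
    'data_loader_position': dist_checkpointing.LocalNonpersitentObject(local_position),
}
# During save this entry is ignored; during load it is unwrapped and placed
# back into the resulting state dict unchanged.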

return dist_checkpointing.load(sharded_state_dict, ckpt_dir, fully_parallel_load_strategy)


The `dist_checkpointing` package provides default strategies for some sharded backends, so it's enough to specify a tuple `(backend, version)` as a saving strategy.

  • explain what backends and versions are here?
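
A hedged illustration of the tuple form (the checkpoint path is a placeholder):

# Instead of constructing a strategy object explicitly, pass (backend, version)
# and let dist_checkpointing resolve the default save strategy for that backend.
dist_checkpointing.save(sharded_state_dict, '/ckpts/iter_0000100', ('torch_dist', 1))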

from megatron.core.dist_checkpointing.strategies.torch import TorchDistLoadShardedStrategy, TorchDistSaveShardedStrategy
from megatron.core.dist_checkpointing.strategies.fully_parallel import FullyParallelLoadStrategyWrapper, FullyParallelSaveStrategyWrapper

# Base strategy: save ShardedTensors with the `torch_dist` backend, format version 1.
base_save_strategy = TorchDistSaveShardedStrategy('torch_dist', 1)

add comments
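
For example, a hedged sketch of how the base strategy might be wrapped (assuming the wrappers accept the base strategy followed by an optional parallelization process group; data_parallel_group is a placeholder):

# Distribute the saving/loading work across the ranks of the given process group.
fully_parallel_save_strategy = FullyParallelSaveStrategyWrapper(base_save_strategy, data_parallel_group)
fully_parallel_load_strategy = FullyParallelLoadStrategyWrapper(TorchDistLoadShardedStrategy(), data_parallel_group)

dist_checkpointing.save(sharded_state_dict, ckpt_dir, fully_parallel_save_strategy)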

The `dist_checkpointing` package provides default strategies for some sharded backends, so it's enough to specify a tuple `(backend, version)` as a saving strategy.
Backends and versions are stored in a `metadata.json` file inside the checkpoint so that the loading strategy can be determined automatically (provided that there exists a default loading strategy for a given backend and version).

For "sharded" strategies, currently the backends supported by default are based on `torch.distributed.checkpoint` format (`torch_dist` backend) and Zarr format (`zarr` backend).

add a bit more to explain the difference?

Note: in order to reuse model ShardedTensors to create optimizer ShardedTensors, the model **ShardedTensors must wrap model parameters**, not just tensors
(obtaining a state dict with model parameters can be achieved by passing `keep_vars=True` to the model `state_dict` function).
Otherwise the correspondence between model ShardedTensors and optimizer states is impossible to recreate.
This is the reason for introducing ShardedTensorFactories - we have to register the original model parameter as `ShardedTensorFactory.data` and apply any subsequent transformations as a factory function in order to make sure that the same transformation can be applied to the optimizer states.

show an example source code in mcore if there's any
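
A hedged sketch of the `keep_vars=True` point (sharding every entry along axis 0 is for illustration only; real code shards each parameter according to its own parallel layout):

# Build model ShardedTensors around the actual nn.Parameter objects
# (keep_vars=True) so that optimizer states can later be mapped back to them.
param_state_dict = model.state_dict(keep_vars=True)   # values are parameters, not detached tensors
model_sharded_state_dict = {
    key: dist_checkpointing.ShardedTensor.from_rank_offsets(key, param, (0, tp_rank, tp_size))
    for key, param in param_state_dict.items()
}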


Extra flattening comes with an efficiency challenge during checkpoint resharding.
Since flattening is applied after the global tensor is sharded into the grid of local chunks, loading after resharding requires accessing non-contiguous data fragments.
An example solution for that is implemented in the `dist_checkpointing/strategies/resharding.py` module and involves saving the flattened tensor with a different global shape than the original one.

please use github path

* - 3
- [5, 9]
* - 5
- [10, 11]

why does DP affect the local shards?
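
A hedged sketch of why that happens: with a Distributed Optimizer, each rank's flattened TP chunk is additionally split across DP ranks, so the DP size determines which flattened range each rank owns (all numbers below are made up for illustration):

# A 12-element axis sharded by TP=2, then each local chunk flattened and
# split across DP=3 ranks (as the Distributed Optimizer does).
global_len, tp, dp = 12, 2, 3
tp_chunk = global_len // tp            # 6 elements per TP rank
dp_slice = tp_chunk // dp              # 2 flattened elements per DP rank
for tp_rank in range(tp):
    for dp_rank in range(dp):
        start = dp_rank * dp_slice     # offset inside this TP rank's flattened chunk
        print(tp_rank, dp_rank, (start, start + dp_slice))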

------------

Model parallel training requires parallelism-aware checkpointing.
Megatron-Core provides a checkpointing library capable of handling all types of parallelisms used in LLMs training.

Megatron Core provides a checkpointing library capable of handling all types of parallelisms used in LLM training.


Model parallel training requires parallelism-aware checkpointing.
Megatron-Core provides a checkpointing library capable of handling all types of parallelisms used in LLMs training.
Although distributed checkpointing library is targeted at Megatron-Core model, it can be used with other models as well, provided an appropriate integration.
@jgerh jgerh Jun 26, 2024

Although the distributed checkpointing library is targeted for the Megatron Core model, it can also be used with other models, as long as proper integration is implemented.

Although distributed checkpointing library is targeted at Megatron-Core model, it can be used with other models as well, provided an appropriate integration.

The library provides two main entrypoints: `dist_checkpointing.save` and `dist_checkpointing.load` which are meant to replace the `torch.save` and `torch.load` in the regular checkpointing flow.
Apart from that it provides mechanism to define different types of local tensors placement in the global checkpoint.
@jgerh jgerh Jun 26, 2024

Apart from that, it provides a mechanism to define the different types of local tensors placement in the global checkpoint.

Apart from that it provides mechanism to define different types of local tensors placement in the global checkpoint.
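
For reference, a hedged sketch of the two entrypoints used as drop-in replacements (the checkpoint directory and the construction of sharded_state_dict are placeholders):

# Regular flow: torch.save(model.state_dict(), ckpt_path) / torch.load(ckpt_path)
# Distributed flow: a sharded state dict and a checkpoint *directory* instead of a file.
dist_checkpointing.save(sharded_state_dict, '/ckpts/iter_0000100')

# Loading fills in the ShardedTensors declared in the provided sharded state dict
# and returns a regular state dict (assuming keys match the model's state_dict).
state_dict = dist_checkpointing.load(sharded_state_dict, '/ckpts/iter_0000100')
model.load_state_dict(state_dict)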


Basic sharding

Basic Sharding

# For some distributed checkpoint backends this is actually what happens underneath.


Supported entities

Supported Entities

-------------
It's the primary use case of distributed checkpointing - tensors sharding.
Allows to define how PyTorch tensors are sharded across the workload.
See `Tensors transformations`_ section for more details on ShardedTensors.

See the Tensors transformations_ section for more details on ShardedTensors.

Comment on lines +160 to +162
This class allows to defer tensors transformations until the actual saving.
A factory can expand a tensor into an arbitrary sub state dict (including all supported entities listed above).
The need for such deferral will be explained in the `Tensors transformations`_ section.
@jgerh jgerh Jun 26, 2024

The ShardedTensorFactory class defers tensors transformations until they are actually saved.

This is a simple wrapper that allows to express the fact that the object wrapped with this class should end up in the final loaded state dict during loading.
During saving such objects are ignored.

Arbitrary object

Arbitrary Object




Entrypoints

Entry Points


Entrypoints
===========
There are several useful user entrypoints for checkpoint saving and loading.
@jgerh jgerh Jun 26, 2024

There are several useful user entry points for checkpoint saving and loading.

Requires providing a sharded state dict to save and saving strategies for handling different entities (see `Save and load strategies`_ for detailed explanation).
The sharded state dict is processed in the following way:

1. The ShardedTensorFactories are applied

  1. The ShardedTensorFactories are applied.

The sharded state dict is processed in the following way:

1. The ShardedTensorFactories are applied
2. LocalNonPersistentObject are extracted from the sharded state dict and ignored
@jgerh jgerh Jun 26, 2024

  1. The LocalNonPersistentObject is extracted from the sharded state dict and ignored.


1. The ShardedTensorFactories are applied
2. LocalNonPersistentObject are extracted from the sharded state dict and ignored
3. ShardedBase objects are extracted
@jgerh jgerh Jun 26, 2024

  1. The ShardedBase objects are extracted.

1. The ShardedTensorFactories are applied
2. LocalNonPersistentObject are extracted from the sharded state dict and ignored
3. ShardedBase objects are extracted
4. All other objects are treated as "common" and saved according to a sharded strategy (see `Save and load strategies`_)

  1. All other objects are treated as "common" and saved according to a sharded strategy (see Save and load strategies_).

2. LocalNonPersistentObject are extracted from the sharded state dict and ignored
3. ShardedBase objects are extracted
4. All other objects are treated as "common" and saved according to a sharded strategy (see `Save and load strategies`_)
5. All ShardedObjects are extracted from point (3) objects and saved with a common strategy (see `Save and load strategies`_)

All ShardedObjects are extracted from point (3) objects and saved with a common strategy (see Save and load strategies_).

Requires providing a sharded state dict (in order to implicitly define mappings between local tensors and checkpoint tensors) and loading strategies.
In practice, the same sharded state dict can be usually used for both saving and loading (the sharded state dict for loading will just contain tensors with uninitialized data).

The sharded state dict provided as an input is processed in the following way:
@jgerh jgerh Jun 26, 2024

When the sharded state dict is provided as input, it is processed in the following way:


The sharded state dict provided as an input is processed in the following way:

1. "common" state dict is loaded from the checkpoint. This forms the base of the resulting state dict
@jgerh jgerh Jun 26, 2024

  1. The "common" state dict is loaded from the checkpoint. This forms the base of the resulting state dict.

4. All other objects are treated as "common" and saved according to a sharded strategy (see `Save and load strategies`_)
5. All ShardedObjects are extracted from point (3) objects and saved with a common strategy (see `Save and load strategies`_)
6. All ShardedTensors are saved.
7. `metadata.json` file with backend and version metadata is saved to the checkpoint directory.

  1. The metadata.json file with backend and version metadata is saved to the checkpoint directory.

The sharded state dict provided as an input is processed in the following way:

1. "common" state dict is loaded from the checkpoint. This forms the base of the resulting state dict
2. The ShardedTensorFactories from the input sharded state dict are applied

  1. The ShardedTensorFactories from the input sharded state dict are applied.


1. "common" state dict is loaded from the checkpoint. This forms the base of the resulting state dict
2. The ShardedTensorFactories from the input sharded state dict are applied
3. LocalNonPersistentObject are extracted from the input sharded state dict, unwrapped and added to the resulting state dict

  1. The LocalNonPersistentObject is extracted from the input sharded state dict, unwrapped and added to the resulting state dict.

1. "common" state dict is loaded from the checkpoint. This forms the base of the resulting state dict
2. The ShardedTensorFactories from the input sharded state dict are applied
3. LocalNonPersistentObject are extracted from the input sharded state dict, unwrapped and added to the resulting state dict
4. ShardedObjects are extracted and loaded from the checkpoint into the resulting state dict

  1. The ShardedObjects are extracted and loaded from the checkpoint into the resulting state dict.

2. The ShardedTensorFactories from the input sharded state dict are applied
3. LocalNonPersistentObject are extracted from the input sharded state dict, unwrapped and added to the resulting state dict
4. ShardedObjects are extracted and loaded from the checkpoint into the resulting state dict
5. ShardedTensors are extracted and loaded from the checkpoint into the resulting state dict

  1. The ShardedTensors are extracted and loaded from the checkpoint into the resulting state dict.

3. LocalNonPersistentObject are extracted from the input sharded state dict, unwrapped and added to the resulting state dict
4. ShardedObjects are extracted and loaded from the checkpoint into the resulting state dict
5. ShardedTensors are extracted and loaded from the checkpoint into the resulting state dict
6. Factory merges are applied (see `Optimizers`_ for explanation)

  1. Factory merges are applied (see Optimizers_ for explanation).

* - 5
- [10, 11]

The same tensor after sharding by TP=6, flattening and sharding by DP=1:

After sharding by TP=6 and flattening and sharding by DP=1, the resulting local shards are as follows:

- [5, 11]


Arbitrary transformations

Arbitrary Transformations

For example, if the model weights are supposed to be transposed in the checkpoint, it's almost impossible to implement a performant factory function that is capable of transposing a flattened and sliced tensor, because the flattening and slicing should happen in the transposed dimension.


Application integration

Application Integration


I reviewed the intro.rst file and provided edits here: #9503

Basic sharding
--------------

The main way to define relationship of a plain local PyTorch tensor to tensors on other ranks is by wrapping it in a `ShardedTensor` class.
@jgerh jgerh Jun 27, 2024

The main way to define the relationship of a plain, local PyTorch tensor to tensors on other ranks is by wrapping it in a ShardedTensor class.

A factory can expand a tensor into an arbitrary sub state dict (including all supported entities listed above).
The need for such deferral will be explained in the `Tensors transformations`_ section.

LocalNonpersitentObject

LocalNonpersistentObject


LocalNonpersitentObject
-----------------------
This is a simple wrapper that allows to express the fact that the object wrapped with this class should end up in the final loaded state dict during loading.
@jgerh jgerh Jun 27, 2024

LocalNonpersistentObject is a simple wrapper indicating that the object wrapped with this class should end up in the final loaded state dict during loading.


Arbitrary object
----------------
All objects different than dicts, lists and the instances of the classes listed above are treated as "common" objects.

All objects different than dicts, lists, and the instances of the classes listed above are treated as "common" objects.

----------------
All objects different than dicts, lists and the instances of the classes listed above are treated as "common" objects.

During saving, all such objects in the sharded state dict passed to `dist_checkpointing.save` are assumed to be duplicated across ranks and therefore saved only by a single coordinator rank (rank 0).

During saving, all such objects in the sharded state dict passed to dist_checkpointing.save are assumed to be duplicated across ranks. Therefore, they are saved only by a single coordinator rank (rank 0).


dist_checkpointing.load
-----------------------
The main entrypoint for checkpoint loading.

The dist_checkpointing.load function is the main entry point for checkpoint loading.

dist_checkpointing.load
-----------------------
The main entrypoint for checkpoint loading.
Requires providing a sharded state dict (in order to implicitly define mappings between local tensors and checkpoint tensors) and loading strategies.

It requires providing a sharded state dict (in order to implicitly define mappings between local tensors and checkpoint tensors) and loading strategies.


Optimizers
==========
This module gives helper tools to the user to simplify constructing ShardedTensors for optimizer states.

The Optimizers module provides helper tools to the user to simplify constructing ShardedTensors for optimizer states.

Optimizers
==========
This module gives helper tools to the user to simplify constructing ShardedTensors for optimizer states.
The ShardedTensors that define local to sharded tensors mapping for model parameters should be reused for optimizer states to avoid code duplication.

The ShardedTensors that define local-to-sharded tensors mapping for model parameters should be reused for optimizer states to avoid code duplication.

This should support most optimizer cases, but some of them might require custom sharded state dict creation.
A good example is a Distributed Optimizer which flattens the parameters - see `Tensors transformations`_ section for more details.

Note: in order to reuse model ShardedTensors to create optimizer ShardedTensors, the model **ShardedTensors must wrap model parameters**, not just tensors

Note: In order to reuse model ShardedTensors to create optimizer ShardedTensors, the model ShardedTensors must wrap model parameters, not just tensors


Shape mismatch
--------------
The `allow_shape_mismatch` flag allows to relax the requirement of matching global tensor shapes during loading.

The allow_shape_mismatch flag relaxes the requirement of matching global tensor shapes during loading.
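
A hedged sketch of where the flag is typically set (padded vocabulary embeddings are the usual example; the key and variables are placeholders):

# Allow the global shape stored in the checkpoint to differ from the runtime one,
# e.g. when the embedding table is padded to a divisible vocabulary size.
sharded_embedding = dist_checkpointing.ShardedTensor.from_rank_offsets(
    'embedding.word_embeddings.weight',
    local_embedding,
    (0, tp_rank, tp_size),
    allow_shape_mismatch=True,
)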


Flattening
----------
The `flattened_range` attribute allows to declare the fact that `ShardedTensor.data` is actually a slice of a flattened model parameter.

The flattened_range attribute declares that ShardedTensor.data represents a slice of a flattened model parameter.


Arbitrary transformations
-------------------------
The way to apply arbitrary transformations to the tensors during saving and loading is with ShardedTensorFactory, which allows to define such transformations as a function that can be reapplied to any ShardedTensor (in particular, a ShardedTensor representing optimizer states).
@jgerh jgerh Jun 27, 2024

The way to apply arbitrary transformations to the tensors during saving and loading is with ShardedTensorFactory. It defines such transformations as a function that can be reapplied to any ShardedTensor (in particular, a ShardedTensor representing optimizer states).

In order to apply such transformation both to model and optimizer parameters in a consistent manner, it's necessary to encode them as factory functions (with original model parameter as the `data` input so that the optimizer params can be properly mapped to model ShardedTensors).

Note that implementing some transformations might be challenging or impossible while supporting flattening for a Distributed Optimizer case.
For example, if the model weights are supposed to be transposed in the checkpoint, it's almost impossible to implement a performant factory function that is capable of transposing a flattened and sliced tensor, because the flattening and slicing should happen in the transposed dimension.
@jgerh jgerh Jun 27, 2024

For example, if the model weights are supposed to be transposed in the checkpoint, it's almost impossible to implement a performant factory function that is capable of transposing a flattened and sliced tensor. This is because the flattening and slicing should happen in the transposed dimension.

The only thing required from the application side is preparing a sharded state dict with ShardedTensors, ShardedObjects, etc. (representing the sharding of the data employed by the application)
and using the `dist_checkpointing.save` and `dist_checkpointing.load` entrypoints as replacements for `torch.save` and `torch.load`.

In Megatron-Core the sharded state dict preparation is already implemented in a `sharded_state_dict` method added to all Megatron-Core models and modules, which allows to create sharded state dicts in a composable way.

In Megatron Core, the sharded state dictionary preparation is already implemented in a sharded_state_dict method which creates the sharded state dicts in a composable way.
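
A hedged sketch of that integration pattern (the gpt_model name and the checkpoint path are placeholders; sharded_state_dict() is the method the text refers to):

# Build the mapping with the model's own sharded_state_dict(), then use the
# dist_checkpointing entrypoints in place of torch.save / torch.load.
sharded_state_dict = gpt_model.sharded_state_dict()
dist_checkpointing.save(sharded_state_dict, '/ckpts/iter_0000100')

# On a (possibly resized) job, rebuild the sharded state dict and load into it:
loaded = dist_checkpointing.load(gpt_model.sharded_state_dict(), '/ckpts/iter_0000100')
gpt_model.load_state_dict(loaded)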

}
dist_checkpointing.save(sharded_state_dict, dist_ckpt_root)

During load the distributed checkpoint can be easily read even if the job size changes (contrary to native checkpoints that require the same number of ranks).

During load, the distributed checkpoint can be easily read even if the job size changes (contrary to native checkpoints that require the same number of ranks).

dist_checkpointing.save(sharded_state_dict, dist_ckpt_root)

During load the distributed checkpoint can be easily read even if the job size changes (contrary to native checkpoints that require the same number of ranks).
The main difference wrt. `torch.load` is that the user has to provide the definition of the sharded state dict that needs to be loaded.

The main difference with torch.load is that the user has to provide the definition of the sharded state dict that needs to be loaded.


ShardedBase
-----------
Base class for expressing any kind of sharding.

ShardedBase is the base class for expressing any kind of sharding.

ShardedBase
-----------
Base class for expressing any kind of sharding.
Each sharded entity must be uniquely identified by its `key`, carry some `data` to be saved or loaded and define `replica_id` which helps identify data redundancy.

Each sharded entity must be uniquely identified by its key, carry some data to be saved or loaded, and define replica_id which helps identify data redundancy.


ShardedTensor
-------------
It's the primary use case of distributed checkpointing - tensors sharding.

ShardedTensor is the primary use case for distributed checkpointing - tensor sharding.

ShardedTensor
-------------
It's the primary use case of distributed checkpointing - tensors sharding.
Allows to define how PyTorch tensors are sharded across the workload.

It defines how PyTorch tensors are distributed across the workload.


dist_checkpointing.save
-----------------------
The only entrypoint for checkpoint saving.

The dist_checkpointing.save function is the only entry point for checkpoint saving.

dist_checkpointing.save
-----------------------
The only entrypoint for checkpoint saving.
Requires providing a sharded state dict to save and saving strategies for handling different entities (see `Save and load strategies`_ for detailed explanation).

It requires providing a sharded state dict to save and saving strategies for handling different entities (see Save and load strategies_ for detailed explanation).

@@ -0,0 +1,392 @@
Distributed checkpoints

Distributed Checkpoints

Distributed checkpoints
=======================

This guide provides details about the distributed checkpoints format from Megatron-Core.

This guide provides details about the distributed checkpoints format from Megatron Core.

@mikolajblaz mikolajblaz changed the title from "Mblaz/docs dist ckpt" to "Distributed checkpointing user guide" on Jun 28, 2024