megatron-fsdp

Name: megatron-fsdp
Version: 0.1.0rc5
Summary: **Megatron-FSDP** is an NVIDIA-developed PyTorch extension that provides a high-performance implementation of Fully Sharded Data Parallelism (FSDP)
Upload time: 2025-10-06 04:13:10
Requires Python: >=3.10
License: Apache 2.0
Keywords: nlp, nlu, deep learning, machine learning, gpu, language, nvidia, pytorch, torch, transformer
            <div align="center">

# 🚀 Megatron-FSDP

</div>

<div align="center">

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-3100/)

</div>

## ✨ What is Megatron-FSDP?

**Megatron-FSDP** is an NVIDIA-developed PyTorch extension that provides a high-performance implementation of Fully Sharded Data Parallelism (FSDP). It offers seamless cross-compatibility with major deep learning frameworks and parallelism libraries, making it easy to scale your PyTorch models across multiple GPUs and nodes.

Megatron-FSDP can provide up to a 25% speedup and 23% memory savings compared to FSDP2.

### Compatibility

- **[PyTorch DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html)**
- **[Megatron Core](https://github.com/NVIDIA/Megatron-LM)**
- **[TransformerEngine](https://github.com/NVIDIA/TransformerEngine)**

## ✨ Features

- **Easy Integration**: Simple `fully_shard` function for quick model parallelization
- **High Performance**: Optimized for NVIDIA GPUs with efficient memory management
- **Cross-Framework**: Works seamlessly with PyTorch, Hugging Face Transformers, Megatron-LM, Megatron Bridge, and TransformerEngine
- **Scalable**: Supports both single-node multi-GPU and multi-node distributed training
- **Flexible Configuration**: Configurable sharding strategies and process groups

## ⚡ Optimizations

- **Advanced Bucketing**: Data-type-aware bucketing system that minimizes the overhead of collective operations
- **Buffer Management**: Zero-copy communication achieved by reorganizing the storage of parameters and main gradients with the `ParamAndGradBuffer` class
- **Communication Overlapping**: Improved overlap of parameter all-gather and gradient reduce-scatter with compute
- **User-Buffer-Registration NCCL Communication**: Offloads NCCL collective communication to NVLink/IB SHARP to reduce GPU SM usage for communication
- **FP8 Mixed Precision with Transformer Engine**: Compatibility with Transformer Engine enables efficient FP8 mixed-precision training
- **Gradient Accumulation Fusion with Transformer Engine**: Removes the explicit gradient copy to the communication buffer in the backward pass

<!-- ## 📊 Performance  -->

## 📦 Installation

```bash
pip install megatron-fsdp
```

- PyPI: https://pypi.org/project/megatron-fsdp/
- Source Code: https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/distributed/fsdp/src
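
To sanity-check the installation, you can import the primary entrypoint used throughout the examples below (a quick check of this README's own import path, not an official verification command):

```bash
python -c "from megatron_fsdp import fully_shard; print('Megatron-FSDP import OK')"
```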

## 🚀 Quick Start

### Basic Usage

Transform your PyTorch model to use Fully Sharded Data Parallelism with just a few lines:

```python
import torch
from megatron_fsdp import fully_shard

# Your existing model and optimizer
model = YourModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Enable FSDP with Megatron-FSDP.
# Alternatively, you can use fully_shard_model() followed by fully_shard_optimizer()!
model, optimizer = fully_shard(
    model,
    optimizer,
    device_mesh=device_mesh, # Your global DeviceMesh.
    dp_shard_dim="dp_shard_cp", # Sharding across the flattened DP-CP mesh.
    fsdp_unit_modules=[YourTransformerBlock], # Modules to shard.
)

# Your model is now ready for distributed training!
```
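
The snippet above assumes a `device_mesh` has already been constructed. As a minimal sketch (an assumption of this example, not prescribed by Megatron-FSDP), a single flattened data-parallel dimension named to match `dp_shard_dim` could be created as follows; the Advanced Features section below shows the full multi-dimensional mesh for CP, TP, and HSDP:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh

# Assumes torch.distributed is already initialized (e.g. launched via torchrun).
world_size = torch.distributed.get_world_size()

# One flat DP dimension; its name must match the dp_shard_dim passed to fully_shard.
device_mesh = init_device_mesh(
    "cuda",
    mesh_shape=(world_size,),
    mesh_dim_names=("dp_shard_cp",),
)
```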

### Comparison with FSDP2

`fully_shard` / `fully_shard_model` / `fully_shard_optimizer` are simple entrypoints into `MegatronFSDP`.

- No need to call `fully_shard` on every sub-module: just pass your sub-module classes or import paths to `fully_shard`!
- A one-line change to enable sharding that seamlessly preserves the structure of your training loop.

Compare this with FSDP2:

```python
import torch
from torch.distributed.fsdp import fully_shard

# Your existing model and optimizer
model = YourModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Enable FSDP with FSDP2
for module in model.modules():
    if isinstance(module, YourTransformerBlock): # Sub-Modules to shard
        fully_shard(module)
fully_shard(model)

# Your model is now ready for distributed training!
```

## `fully_shard` / `MegatronFSDP` API - Advanced Features

```python
import torch
from megatron_fsdp import fully_shard

# Initialize DeviceMesh. The sizes below (dp_outer_size, dp_shard_size, cp_size, tp_size)
# are placeholders for your parallelism degrees; their product must equal the world size.
device_mesh = torch.distributed.device_mesh.init_device_mesh(
    "cuda",
    mesh_shape=(dp_outer_size, dp_shard_size, cp_size, tp_size),
    mesh_dim_names=("dp_outer", "dp_shard", "cp", "tp"),
)
# Only relevant when using HSDP, where we also need the full DP group for data parallelism.
# This sub-mesh can be provided to distributed samplers or dataloaders.
device_mesh[("dp_outer", "dp_shard")]._flatten("dp")
# Only required if using CP. Otherwise, just pass dp_shard to FSDP.
device_mesh[("dp_shard", "cp")]._flatten("dp_shard_cp")
# Only required if using HSDP. Otherwise, don't pass hybrid_fsdp_group.
device_mesh[("dp_outer", "dp_shard", "cp")]._flatten("hsdp")
hsdp_group = device_mesh["hsdp"].get_group()

# Fully-shards your model and distributes your optimizer.
model, optimizer = fully_shard(
    # PyTorch (Root) Module
    model,
    # PyTorch Optimizer
    optimizer,
    # Device Mesh
    device_mesh=device_mesh,
    # Always required for FSDP or HSDP.
    dp_shard_dim="dp_shard_cp",
    # Set this argument to use HSDP instead of FSDP. Otherwise, set this to None.
    dp_outer_dim="dp_outer",
    # Only required for TP-sensitive models (e.g. Megatron-LM / TransformerEngine) or when using DTensor-based TP.
    # Otherwise, set this to None.
    tp_dim="tp",
    # Only required when using HSDP. Otherwise, set this to None.
    hybrid_fsdp_group=hsdp_group,
    # FSDP Sharding Strategy: no_shard (0) / optim (1) / optim_grads (2) / optim_grads_params (3)
    zero_dp_strategy=3,
    outer_dp_sharding_strategy=1,
    # Sharded Modules
    fsdp_unit_modules=[...],
    # Initialize the model on devices in shards to avoid OOM. Requires device("meta")-init for model.
    init_model_with_meta_device=True,
    # Reduce gradients in FP32.
    grad_reduce_in_fp32=False,
    # Store distributed optimization state in FP32.
    preserve_fp32_weights=True,
    # Sync parameters and gradients each step. Allows for gradient transformations after backward pass,
    # and synchronizes parameters and gradients across HSDP groups, but deactivates compute-communication
    # overlap going into the subsequent training step.
    sync_model_each_microbatch=True,
    # Preprocess state dict for DCP checkpointing. Required for Torch Distributed Checkpoint.
    preproc_state_dict_for_dcp_ckpt=True,
)

# Save model and optimizer state.
torch.distributed.checkpoint.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, checkpoint_id=str(CKPT_DIR))

# Load model and optimizer state.
ckpt_state_dict = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
torch.distributed.checkpoint.load(state_dict=ckpt_state_dict, checkpoint_id=str(CKPT_DIR))
model.load_state_dict(ckpt_state_dict["model"])
optimizer.load_state_dict(ckpt_state_dict["optimizer"])
```

- `zero_dp_strategy` (and `outer_dp_sharding_strategy`) configure different degrees of zero-redundancy data parallelism as described in [ZeRO (Zero Redundancy Optimizer)](https://arxiv.org/abs/1910.02054). These strategies reduce CUDA memory utilization during model training by distributing model parameters, gradients, and optimizer states across the devices in the DP `ProcessGroup`, and collectively communicating subsets of parameters and gradients to specific devices when they are needed for computation or differentiation. More aggressive sharding strategies entail more communication overhead: `no_shard` is the least memory efficient but most communication efficient, while `optim_grads_params` is the most memory efficient but least communication efficient. `outer_dp_sharding_strategy` takes the same options but applies to the (required) "outer" DP group (`dp_outer_dim` / `hybrid_fsdp_group`) when using [Hybrid-Sharded Data Parallelism (HSDP)](https://arxiv.org/pdf/2304.11277); only `no_shard` (DP replication) and `optim` (optimizer state hybrid sharding, which requires `zero_dp_strategy='optim_grads_params'`) are supported.
  - Default: `optim_grads_params` or `3` for `zero_dp_strategy` and `no_shard` or `0` for `outer_dp_sharding_strategy`
  - `0` or `no_shard` implies that your model is not sharded. Similar memory usage to `DDP`.
  - `1` or `optim` implies that your optimizer state is sharded for distributed optimization. Similar to optimizer state sharding in `ZeRO-DP`.
  - `2` or `optim_grads` implies that your optimizer state and gradients are sharded. Similar to `ZeRO-2`.
  - `3` or `optim_grads_params` implies that your optimizer state, gradients, and training parameters are sharded. Similar to `ZeRO-3`.
- `fsdp_unit_modules` is a list of sub-module classes or `str` import-paths associated with modules that you want `MegatronFSDP` to fully-shard.
  - Required if `1`, `2`, or `3` is specified as the sharding strategy. Defaults to `None`.
- `device_mesh` is a [`torch.distributed.DeviceMesh`](https://docs.pytorch.org/docs/stable/distributed.html#devicemesh) that informs `MegatronFSDP` of your distributed environment for sharding in conjunction with hardware configuration and other parallelisms.
  - `dp_shard_dim` is the name of the sub-mesh required for FSDP sharding, and is commonly the flattened combination of the data parallel (DP) and context parallel (CP) sub-meshes.
    - Because model parameters are replicated across the DP-CP group, the gradients produced on DP and CP ranks during the backward pass are reduced together and normalized by the DP-CP world size. For more information about how ring attention shards the sequence dimension through the attention and non-attention layers of the Transformer, refer to: [Ring Attention with Blockwise Transformers for Near-Infinite Context](https://arxiv.org/abs/2310.01889).
  - `dp_outer_dim` is the name of the sub-mesh corresponding to the "outer" DP group, which is required for replication or sharding in HSDP. `fully_shard` will perform HSDP if `dp_outer_dim` is specified.
  - `tp_dim` is the name of the sub-mesh used for tensor parallelism (TP), which is required for `(FSDP, TP)`-strided sharding when using Megatron-LM or Torch-native `DTensor` TP.
    - For more information about tensor parallelism, refer to: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053).
  - `hybrid_fsdp_group` is the `ProcessGroup` which contains all ranks in the flattened `dp_shard_dim` and `dp_outer_dim` sub-meshes utilized to specify the `(DP-Outer, DP-Shard)` sharded coordinate system for the weight and gradient buffers. Required for HSDP.
- `init_model_with_meta_device` has `MegatronFSDP` initialize your `meta`-device model in shards on every CUDA device, to avoid OOM when initializing extremely large models that cannot fit on a single device. Users can initialize their model on a [`meta` device](https://docs.pytorch.org/docs/stable/meta.html) (`with torch.device('meta'): ...`), and `MegatronFSDP` will then shard and initialize the model parameters layer by layer via the customizable `module.reset_parameters` method, which prevents the entire model from ever being allocated in memory at any point during runtime. See the sketch after this list.
    - Defaults to `False`.
    - Note that the `device` argument, which places your model on a specific device or rank, is deactivated when `init_model_with_meta_device=True`.
- `grad_reduce_in_fp32` will reduce gradients in `FP32` precision (in contrast to the lower `BF16` or `FP8` model training precision).
    - Defaults to `False`.
    - `torch.distributed.fsdp.MixedPrecisionPolicy` will be supported in the near future.
- `preserve_fp32_weights` will preserve an `FP32`-precision copy of the model parameters used for optimization.
    - Defaults to `True`.
    - `torch.distributed.fsdp.MixedPrecisionPolicy` will be supported in the near future.
- `overlap_grad_reduce` and `overlap_param_gather` will overlap gradient [`reduce-scatter`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#reducescatter) and parameter [`all-gather`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#allgather) group communications with backward and forward compute, respectively, using asynchronous calls and pre-fetching. (In the case of `no_shard`, parameters are not gathered, but gradient [`all-reduce`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#allreduce) is overlapped.)
    - Both default to `True`.
- `sync_model_each_microbatch` will trigger a `wait` (`MegatronFSDP.finish_grad_sync()`) on gradient reduction, parameter de-allocation, and optimizer parameter / gradient installation (in preparation for `optimizer.step()`) after every forward-backward pass. When using HSDP, parameters and gradients will be all-gathered and reduced, respectively, on the "outer" DP group every training step instead of every optimization cycle. This behavior is desirable for a transparent and user-friendly sharded training loop where post-backward transformations on the gradient and a clean compute / memory state are needed between training iterations, but it hurts performance in situations where optimization is delayed (e.g. gradient accumulation), since the communication of the previous training iteration could otherwise be overlapped with the compute of the next. It also overrides the `is_last_microbatch` / `microbatch_count` logic in `MegatronFSDP`.
    - Defaults to `True` for `fully_shard`, but defaults to `False` when using the `MegatronFSDP` class directly.
- `keep_fp8_transpose_cache_when_using_custom_fsdp` will keep the FP8 transpose cache when using `MegatronFSDP`. This option incurs (number of parameters $\times$ 1 byte) of memory overhead, but allows skipping the weight transpose operation in the backward pass. This feature provides no benefit on the Blackwell architecture.
    - **Only effective when using Megatron-LM.**
    - Defaults to `False`.
- `nccl_ub` will allocate and register NCCL user buffers for the param and grad buffers. This option enables an SM-efficient NCCL algorithm that can improve the performance of overlapped computation. The flag is much more effective when used together with SHARP if the FSDP communication spans both NVLink and IB domains. Enabling this option causes additional memory overhead because it requires enabling the `fsdp_double_buffer` option.
    - **Only effective when using Megatron-LM.**
    - Defaults to `False`.
- `fsdp_double_buffer` will use persistently allocated double buffers for the temporary memory needed by `MegatronFSDP` communication. Persistent double buffers may increase peak VRAM utilization, but are required to register NCCL user buffers (`nccl_ub=True`) for `MegatronFSDP`. Currently, this is only supported for simple repetitive model structures such as GPT.
    - **Only effective when using Megatron-LM.**
    - Defaults to `False`. Automatically overridden to `True` when `nccl_ub` is enabled.
- `preproc_state_dict_for_dcp_ckpt` adds `model.state_dict()` and `optimizer.state_dict()` post-hooks that modify the model and optimizer state in preparation for `torch.distributed.checkpoint.{save,load}` ([Torch DCP](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html)) checkpointing. Specifically, it adds `__create_write_items__` and `__create_chunk_list__` methods to Tensors utilized by Torch DCP to redistribute parameters when saving and loading model and optimizer checkpoints. Can be deactivated should the user need a custom distributed checkpointing strategy.
    - Defaults to `True`.
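
As referenced in the `init_model_with_meta_device` bullet above, the following is a minimal sketch of meta-device initialization. It reuses the placeholder `YourModel` / `YourTransformerBlock` names from the Quick Start, and assumes `device_mesh` was constructed as shown earlier and that your modules implement `reset_parameters`:

```python
import torch
from megatron_fsdp import fully_shard

# Construct the model on the meta device: no real parameter storage is allocated yet.
with torch.device("meta"):
    model = YourModel()

# Optimizer state is created lazily, so it can be constructed over meta parameters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Megatron-FSDP materializes and shards the parameters layer by layer via each
# module's reset_parameters(), so the full model is never allocated on one device.
model, optimizer = fully_shard(
    model,
    optimizer,
    device_mesh=device_mesh,
    dp_shard_dim="dp_shard_cp",
    fsdp_unit_modules=[YourTransformerBlock],
    init_model_with_meta_device=True,  # Requires the meta-device construction above.
)
```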

            
