<div align="center">
Megatron-LM & Megatron Core
===========================
<h4>GPU-optimized library for training transformer models at scale</h4>
[Documentation](https://docs.nvidia.com/Megatron-Core/developer-guide/latest/index.html)
[Changelog](./CHANGELOG.md)
[License](./LICENSE)
<div align="left">
## ⚡ Quick Start
```bash
# 1. Install Megatron Core with required dependencies
pip install megatron-core
pip install --no-build-isolation "transformer-engine[pytorch]"
# 2. Clone repository for examples
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
```
**→ [Complete Installation Guide](#installation)** - Docker, pip variants (`dev`, `lts`, etc.), source installation, and system requirements
# Latest News
- 📣 NEW! **[DeepSeek & MoE Training with FP8](https://github.com/yanring/Megatron-MoE-ModelZoo)** examples are now available, including optimized configurations for `DeepSeek-V3`, `Qwen2` and `Mixtral` models with FP8 precision support.
- **[2025/05]** Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training ([blog](https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/)).
<details>
<summary>Previous News</summary>
- **[2024/07]** Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training ([blog](https://developer.nvidia.com/blog/train-generative-ai-models-more-efficiently-with-new-nvidia-Megatron-Core-functionalities/)).
- **[2024/06]** Megatron Core added support for Mamba-based models. Check out our paper [An Empirical Study of Mamba-based Language Models](https://arxiv.org/pdf/2406.07887) and [code example](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba).
- **[2024/01 Announcement]** NVIDIA has released the core capabilities in **Megatron-LM** into [**Megatron Core**](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the [Megatron Core intro](#megatron-core-composable-library) for more details.
</details>
<details>
<summary>Table of Contents</summary>
**Getting Started**
- [Quick Start](#-quick-start)
- [Latest News](#latest-news)
- [Megatron Overview](#megatron-overview)
  - [Project Structure](#project-structure)
  - [Megatron-LM: Reference Implementation](#megatron-lm-reference-implementation)
  - [Megatron Core: Composable Library](#megatron-core-composable-library)
- [Installation](#installation)
  - [Docker (Recommended)](#-docker-recommended)
  - [Pip Installation](#pip-installation)
  - [Source Installation](#source-installation)
  - [System Requirements](#system-requirements)

**Core Features**
- [Performance Benchmarking](#performance-benchmarking)
  - [Weak Scaling Results](#weak-scaling-results)
  - [Strong Scaling Results](#strong-scaling-results)
- [Ecosystem Libraries](#ecosystem-libraries)

**Training**
- [Training](#training)
  - [Getting Started](#getting-started)
  - [Data Preparation](#data-preparation)
- [Parallelism Strategies](#parallelism-strategies)
  - [Data Parallelism (DP)](#data-parallelism-dp)
  - [Tensor Parallelism (TP)](#tensor-parallelism-tp)
  - [Pipeline Parallelism (PP)](#pipeline-parallelism-pp)
  - [Context Parallelism (CP)](#context-parallelism-cp)
  - [Expert Parallelism (EP)](#expert-parallelism-ep)
  - [Parallelism Selection Guide](#parallelism-selection-guide)
- [Performance Optimizations](#performance-optimizations)

**Resources**
- [Examples](./examples/) - Training scripts and tutorials
- [Documentation](https://docs.nvidia.com/Megatron-Core/) - Official docs
- [Community & Support](#community--support) - Get help and contribute
  - [Getting Help](#getting-help)
  - [Contributing](#contributing)
  - [Citation](#citation)
</details>
# Megatron Overview
## Project Structure
```
Megatron-LM/
├── megatron/
│ ├── core/ # Megatron Core (kernels, parallelism, building blocks)
│ │ ├── models/ # Transformer models
│ │ ├── transformer/ # Transformer building blocks
│ │ ├── tensor_parallel/ # Tensor parallelism
│ │ ├── pipeline_parallel/ # Pipeline parallelism
│ │ ├── distributed/ # Distributed training (FSDP, DDP)
│ │ ├── optimizer/ # Optimizers
│ │ ├── datasets/ # Dataset loaders
│ │ ├── inference/ # Inference engines
│ │ └── export/ # Model export (e.g. TensorRT-LLM)
│ ├── training/ # Training scripts
│ ├── inference/ # Inference server
│ ├── legacy/ # Legacy components
│ └── post_training/ # Post-training (RLHF, etc.)
├── examples/ # Ready-to-use training examples
├── tools/ # Utility tools
├── tests/ # Comprehensive test suite
└── docs/ # Documentation
```
### Megatron-LM: Reference Implementation
**Reference implementation** that includes Megatron Core plus everything needed to train models.
**Best for:**
- **Training state-of-the-art foundation models** at scale with cutting-edge performance on latest NVIDIA hardware
- **Research teams** exploring new architectures and training techniques
- **Learning distributed training** concepts and best practices
- **Quick experimentation** with proven model configurations
**What you get:**
- Pre-configured training scripts for GPT, Llama, DeepSeek, Qwen, and more
- End-to-end examples from data prep to evaluation
- Research-focused tools and utilities
### Megatron Core: Composable Library
**Composable library** with GPU-optimized building blocks for custom training frameworks.
**Best for:**
- **Framework developers** building on top of modular and optimized components
- **Research teams** needing custom training loops, optimizers, or data pipelines
- **ML engineers** requiring fault-tolerant training pipelines
**What you get:**
- Composable transformer building blocks (attention, MLP, etc.)
- Advanced parallelism strategies (TP, PP, DP, EP, CP)
- Pipeline schedules and distributed optimizers
- Mixed precision support (FP16, BF16, FP8)
- GPU-optimized kernels and memory management
- High-performance dataloaders and dataset utilities
- Model architectures (Llama, Qwen, GPT, Mixtral, Mamba, etc.)
## Ecosystem Libraries
**Libraries used by Megatron Core:**
- **[Megatron Energon](https://github.com/NVIDIA/Megatron-Energon)** 📣 **NEW!** - Multi-modal data loader (text, images, video, audio) with distributed loading and dataset blending
- **[Transformer Engine](https://github.com/NVIDIA/TransformerEngine)** - Optimized kernels and FP8 mixed precision support
- **[Resiliency Extension (NVRx)](https://github.com/NVIDIA/nvidia-resiliency-ext)** - Fault tolerant training with failure detection and recovery
**Libraries using Megatron Core:**
- **[NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)** - Enterprise framework with cloud-native support and end-to-end examples
- **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, and distillation
**Compatible with:** [HuggingFace Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)
# Installation
## 🐳 Docker (Recommended)
We strongly recommend using the previous month's release of the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) rather than the latest one. Megatron Core releases are always based on, and tested against, the previous month's NGC container, so this choice gives the best compatibility and stability.
This container comes with all dependencies pre-installed with compatible versions and optimized configurations for NVIDIA GPUs:
- PyTorch (latest stable version)
- CUDA, cuDNN, NCCL (latest stable versions)
- Support for FP8 on NVIDIA Hopper, Ada, and Blackwell GPUs
- For best performance, use NVIDIA Turing GPU architecture generations and later
```bash
# Run container with mounted directories
docker run --runtime=nvidia --gpus all -it --rm \
  -v /path/to/megatron:/workspace/megatron \
  -v /path/to/dataset:/workspace/dataset \
  -v /path/to/checkpoints:/workspace/checkpoints \
  nvcr.io/nvidia/pytorch:25.04-py3
```
## Pip Installation
Megatron Core provides two sets of pip extras, each matched to an NGC PyTorch container:
- `dev`: Moving head that supports the most recent upstream dependencies
- `lts`: Long-term support of NGC PyTorch 24.01
Either extra can be combined with `mlm`, which adds the package dependencies for Megatron-LM on top of Megatron Core.
```bash
# Install the latest release with the dev extra (most recent upstream dependencies)
pip install "megatron-core[dev]"
```
```bash
# Install the lts extra (long-term support, NGC PyTorch 24.01)
pip install "megatron-core[lts]"
```
For a version of Megatron Core with only torch, run:
```bash
pip install megatron-core
```
For dependencies required by Megatron-LM, please run:
```bash
pip install "megatron-core[mlm]"
```
## Source Installation
Install from source for development or to pick up the latest features.
For Hybrid models, Megatron Core requires [mamba](https://github.com/state-spaces/mamba). If the pre-built wheel on PyPI does not fit your environment, you can fall back to the install script that Megatron Core uses in its CI system. To use it, install `uv` first:
```bash
export UV_VERSION=0.7.2
export PATH="$HOME/.local/bin:$PATH"
curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh
export UV_PROJECT_ENVIRONMENT=./venv
export PATH="$UV_PROJECT_ENVIRONMENT/bin:$PATH"
export UV_LINK_MODE=copy
```
Run the following command to build upstream dependencies from source:
```bash
# Clone and install
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
# Optional: checkout specific release
git checkout core_r0.13.0
# Pick one environment: dev or lts
bash docker/common/install.sh --environment dev
```
## System Requirements
### Hardware Requirements
- **FP8 Support**: NVIDIA Hopper, Ada, Blackwell GPUs
- **Recommended**: NVIDIA Turing architecture or later
### Software Requirements
- **CUDA/cuDNN/NCCL**: Latest stable versions
- **PyTorch**: Latest stable version
- **Transformer Engine**: Latest stable version
- **Python**: 3.12 recommended
# Performance Benchmarking
For our latest performance benchmarking results, please refer to [NVIDIA NeMo Framework Performance Summary](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance_summary.html).
Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to **47% Model FLOP Utilization (MFU)** on H100 clusters.

**Benchmark Configuration:**
- **Vocabulary size**: 131,072 tokens
- **Sequence length**: 4096 tokens
- **Model scaling**: Varied hidden size, attention heads, and layers to achieve target parameter counts
- **Communication optimizations**: Fine-grained overlapping with DP (`--overlap-grad-reduce`, `--overlap-param-gather`), TP (`--tp-comm-overlap`), and PP (enabled by default)
**Key Results:**
- **6144 H100 GPUs**: Successfully benchmarked 462B parameter model training
- **Superlinear scaling**: MFU increases from 41% to 47-48% with model size
- **End-to-end measurement**: Throughputs include all operations (data loading, optimizer steps, communication, logging)
- **Production ready**: Full training pipeline with checkpointing and fault tolerance
- *Note: Performance results measured without training to convergence*
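To make the MFU figures above concrete, the sketch below estimates MFU from a measured throughput. It rests on two assumptions that this README does not state: the common approximation of about 6 FLOPs per parameter per token for a forward plus backward pass (which ignores attention FLOPs), and a peak FLOP/s figure that you supply for your GPU and precision. The function name and the sample numbers are hypothetical and chosen only to show the arithmetic.

```python
def estimate_mfu(num_params: float,
                 tokens_per_sec_per_gpu: float,
                 peak_flops_per_gpu: float) -> float:
    """Approximate Model FLOP Utilization as achieved FLOP/s over peak FLOP/s.

    Assumes ~6 FLOPs per parameter per token (forward + backward) and ignores
    attention FLOPs, so this is a rough estimate rather than Megatron's own
    accounting.
    """
    if peak_flops_per_gpu <= 0:
        raise ValueError("peak_flops_per_gpu must be positive")
    achieved_flops = 6.0 * num_params * tokens_per_sec_per_gpu
    return achieved_flops / peak_flops_per_gpu


if __name__ == "__main__":
    # Hypothetical inputs, not measured results.
    mfu = estimate_mfu(num_params=1e9,
                       tokens_per_sec_per_gpu=1000.0,
                       peak_flops_per_gpu=12e12)
    print(f"Estimated MFU: {mfu:.1%}")
```

For long sequences the attention term becomes significant, so treat this estimate as a lower bound on the model FLOPs actually executed.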
## Weak Scaling Results
Our weak-scaling results show superlinear scaling: MFU increases from 41% for the smallest model considered to 47-48% for the largest models. This is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.

## Strong Scaling Results
We also strong scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.

# Training
## Getting Started
### Simple Training Example
```bash
# Distributed training example (2 GPUs, mock data)
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
```
### Llama-3 Training Example
```bash
# 8 GPUs, FP8 precision, mock data
./examples/llama/train_llama3_8b_fp8.sh
```
## Data Preparation
### JSONL Data Format
```json
{"text": "Your training text here..."}
{"text": "Another training sample..."}
```
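If your corpus is not yet in this format, a short script can produce it. The following is a minimal sketch using only the Python standard library; the helper name `write_jsonl` and the output file name are illustrative and not part of Megatron-LM.

```python
import json
from pathlib import Path


def write_jsonl(texts, path):
    """Write one JSON object per line, each with a single "text" field."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each record stays on
            # one line; ensure_ascii=False keeps non-ASCII text readable.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    write_jsonl(
        ["Your training text here...", "Another training sample..."],
        "data.jsonl",
    )
```

The resulting `data.jsonl` can then be passed to the preprocessing script through `--input`.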
### Basic Preprocessing
```bash
python tools/preprocess_data.py \
--input data.jsonl \
--output-prefix processed_data \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model /path/to/tokenizer.model \
--workers 8 \
--append-eod
```
### Key Arguments
- `--input`: Path to input JSON/JSONL file
- `--output-prefix`: Prefix for output binary files (.bin and .idx)
- `--tokenizer-type`: Tokenizer type (`HuggingFaceTokenizer`, `GPT2BPETokenizer`, etc.)
- `--tokenizer-model`: Path to tokenizer model file
- `--workers`: Number of parallel workers for processing
- `--append-eod`: Add end-of-document token
<!-- **→ [Complete Data Preparation Guide](./docs/data-preparation.md)** - Comprehensive guide covering advanced preprocessing, dataset collection, deduplication, and optimization strategies -->
# Parallelism Strategies
## Data Parallelism (DP)
### Standard Data Parallel
```bash
# Standard DDP - replicate model on each GPU
torchrun --nproc_per_node=8 pretrain_gpt.py \
--data-parallel-sharding-strategy no_shard
```
### Fully Sharded Data Parallel (FSDP)
```bash
# Megatron's optimized FSDP (~15% faster than PyTorch FSDP2)
--use-custom-fsdp
# PyTorch FSDP2
--use-torch-fsdp2
# Sharding strategies
--data-parallel-sharding-strategy optim # Shard optimizer states (ZeRO-1)
--data-parallel-sharding-strategy optim_grads # Shard gradients + optimizer (ZeRO-2)
--data-parallel-sharding-strategy optim_grads_params # Shard parameters + gradients + optimizer (ZeRO-3)
```
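To get a feel for what each sharding strategy saves, the sketch below estimates per-GPU memory for model state. It assumes the accounting from the ZeRO paper for mixed-precision Adam (2 bytes of weights, 2 bytes of gradients, and 12 bytes of optimizer state per parameter). Megatron's actual buffers may differ, activations are not included, and the function name is hypothetical.

```python
# Bytes per parameter for mixed-precision Adam, following the accounting in the
# ZeRO paper. This is an assumption; Megatron's real buffers may differ.
PARAM_BYTES = 2   # 16-bit weights
GRAD_BYTES = 2    # 16-bit gradients
OPTIM_BYTES = 12  # FP32 master weights + Adam momentum + Adam variance

# How many of those bytes each strategy shards across data-parallel ranks.
SHARDED_BYTES = {
    "no_shard": 0,
    "optim": OPTIM_BYTES,
    "optim_grads": OPTIM_BYTES + GRAD_BYTES,
    "optim_grads_params": OPTIM_BYTES + GRAD_BYTES + PARAM_BYTES,
}


def model_state_bytes_per_gpu(num_params: float, dp_size: int, strategy: str) -> float:
    """Estimate per-GPU bytes for weights, gradients, and optimizer state."""
    if dp_size < 1:
        raise ValueError("dp_size must be at least 1")
    if strategy not in SHARDED_BYTES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    total = PARAM_BYTES + GRAD_BYTES + OPTIM_BYTES
    sharded = SHARDED_BYTES[strategy]
    # Unsharded state is replicated on every rank; sharded state is split evenly.
    return num_params * ((total - sharded) + sharded / dp_size)


if __name__ == "__main__":
    # Hypothetical model: 7.5B parameters on 64 data-parallel ranks.
    for name in SHARDED_BYTES:
        gb = model_state_bytes_per_gpu(7.5e9, 64, name) / 1e9
        print(f"{name:>20}: {gb:6.1f} GB")
```

Under these assumptions, most of the saving comes from sharding optimizer state, which is why `optim` alone already cuts model-state memory substantially.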
## Tensor Parallelism (TP)
Split individual model layers across GPUs:
```bash
--tensor-model-parallel-size 4 # 4-way tensor parallelism
--sequence-parallel # Enable sequence parallelism (recommended with TP)
```
## Pipeline Parallelism (PP)
Split model depth across GPUs:
```bash
--pipeline-model-parallel-size 8 # 8 pipeline stages
--virtual-pipeline-model-parallel-size 4 # Virtual pipeline for better load balancing
```
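The benefit of a virtual pipeline can be estimated with the bubble formula from the interleaved-schedule analysis in the Megatron-LM pipeline parallelism paper: bubble time relative to ideal compute time is roughly `(p - 1) / (v * m)` for `p` pipeline stages, `m` microbatches, and `v` model chunks per pipeline rank. The sketch below assumes that the virtual pipeline size plays the role of `v`; the microbatch count is a hypothetical input, and the extra communication of the interleaved schedule is not modeled.

```python
def pipeline_bubble_fraction(pp_size: int, num_microbatches: int, vpp_size: int = 1) -> float:
    """Approximate pipeline bubble time as a fraction of ideal compute time.

    Uses (p - 1) / (v * m), where p is the number of pipeline stages, m the
    number of microbatches per step, and v the number of model chunks per
    pipeline rank (1 means no interleaving).
    """
    if pp_size < 1 or num_microbatches < 1 or vpp_size < 1:
        raise ValueError("all sizes must be at least 1")
    return (pp_size - 1) / (vpp_size * num_microbatches)


if __name__ == "__main__":
    # Hypothetical setup: 8 stages and 32 microbatches per step.
    plain = pipeline_bubble_fraction(8, 32)
    interleaved = pipeline_bubble_fraction(8, 32, vpp_size=4)
    print(f"bubble without interleaving: {plain:.1%}")
    print(f"bubble with 4 virtual stages: {interleaved:.1%}")
```

The formula also shows the other lever: increasing the number of microbatches per step shrinks the bubble just as a larger virtual pipeline does.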
## Context Parallelism (CP)
Split long sequences across GPUs for handling long contexts:
```bash
--context-parallel-size 2 # 2-way context parallelism
--cp-comm-type p2p # Communication: p2p, a2a, allgather, a2a+p2p
--hierarchical-context-parallel-sizes 2 4 # Hierarchical context parallelism
```
## Expert Parallelism (EP)
For Mixture of Experts (MoE) models:
```bash
--expert-model-parallel-size 4 # 4-way expert parallelism
--num-experts 8 # 8 experts per MoE layer
--moe-grouped-gemm # Optimize expert computation
```
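With the flags above, 8 experts spread over 4 expert-parallel ranks leaves 2 experts per rank. The sketch below shows that arithmetic. It assumes that the expert count must divide evenly by the expert-parallel size and that experts are assigned to ranks in contiguous blocks; both are assumptions made for illustration, and the function name is hypothetical.

```python
def local_expert_indices(num_experts: int, ep_size: int, ep_rank: int) -> list[int]:
    """Return the expert indices held by one expert-parallel rank.

    Assumes an even split and a contiguous block assignment, so rank r holds
    experts [r * n_local, (r + 1) * n_local).
    """
    if ep_size < 1 or num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank out of range")
    n_local = num_experts // ep_size
    start = ep_rank * n_local
    return list(range(start, start + n_local))


if __name__ == "__main__":
    # Matches the flags above: 8 experts, 4-way expert parallelism.
    for rank in range(4):
        print(rank, local_expert_indices(8, 4, rank))
```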
## Combining Parallelism Strategies
### Parallelism Selection Guide
Based on [NVIDIA NeMo production configurations](https://github.com/NVIDIA/NeMo/tree/main/scripts/performance/recommended_model_configs):
| Model | Size | GPUs | TP | PP | CP | EP | Notes |
|-------|------|------|----|----|----|----|-------|
| **Llama-3** | 8B | 8 | 1 | 1 | 2 | 1 | CP for long seqlen (8K) |
| **Llama-3** | 70B | 64 | 4 | 4 | 2 | 1 | TP+PP |
| **Llama-3.1** | 405B | 1024 | 8 | 8 | 2 | 1 | 3D parallelism for scale |
| **GPT-3** | 175B | 128-512 | 4 | 8 | 1 | 1 | Large model config |
| **Mixtral** | 8x7B | 64 | 1 | 4 | 1 | 8 | EP for MoE |
| **Mixtral** | 8x22B | 256 | 4 | 4 | 8 | 8 | Combined TP+EP for large MoE |
| **DeepSeek-V3** | 671B | 1024 | 2 | 16 | 1 | 64 | Large MoE config |
### MoE-Specific Requirements
**Important**: When combining Expert Parallelism (EP) with Tensor Parallelism (TP), **Sequence Parallelism (SP) must be enabled**.
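A configuration from the table can be sanity-checked before launching a job. The sketch below assumes that the GPU count factors as TP × PP × CP × DP, so the data-parallel size is whatever remains after the other three, and it encodes the sequence-parallelism rule stated above. The `ParallelConfig` class is a hypothetical helper, not a Megatron Core API, and it does not model how expert parallelism is mapped onto ranks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    """Hypothetical container for the parallelism sizes in the table above."""
    world_size: int
    tp: int = 1
    pp: int = 1
    cp: int = 1
    ep: int = 1
    sequence_parallel: bool = False

    def data_parallel_size(self) -> int:
        """Return DP, assuming world_size = TP * PP * CP * DP."""
        model_parallel = self.tp * self.pp * self.cp
        if self.world_size % model_parallel != 0:
            raise ValueError(
                f"world_size {self.world_size} is not divisible by "
                f"TP*PP*CP = {model_parallel}"
            )
        return self.world_size // model_parallel

    def validate(self) -> None:
        """Raise ValueError if the configuration breaks a known rule."""
        self.data_parallel_size()
        if self.ep > 1 and self.tp > 1 and not self.sequence_parallel:
            raise ValueError("EP combined with TP requires sequence parallelism")


if __name__ == "__main__":
    llama3_70b = ParallelConfig(world_size=64, tp=4, pp=4, cp=2)
    llama3_70b.validate()
    print("Llama-3 70B data-parallel size:", llama3_70b.data_parallel_size())
```

For the Mixtral 8x22B row, `validate()` only passes once `sequence_parallel=True` is set, which corresponds to adding `--sequence-parallel` on the command line.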
## Performance Optimizations
| Feature | Flag | Benefit |
|---------|------|---------|
| **FlashAttention** | `--attention-backend` | Faster attention and lower memory usage |
| **FP8 Training** | `--fp8-hybrid` | Faster training |
| **Activation Checkpointing** | `--recompute-activations` | Reduced memory usage |
| **Data Parallelism Communication Overlap** | `--overlap-grad-reduce` | Faster distributed training |
| **Distributed Optimizer** | `--use-distributed-optimizer` | Reduced checkpointing time |
**→ [NVIDIA NeMo Framework Performance Tuning Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#performance-tuning-guide)** - Comprehensive performance optimization guide covering advanced tuning techniques, communication overlaps, memory optimizations, and profiling options.
### FlashAttention
[FlashAttention](https://github.com/Dao-AILab/flash-attention) is a fast and memory-efficient attention algorithm. We recommend the default usage, which uses cuDNN for attention via Transformer Engine and provides up to 50% speedups on forward and 84% on backward propagation with FP8 kernels. The `flash-attn` package is also supported via `--use-flash-attn`.
### Mixed Precision Training
```bash
--fp16 # Standard FP16
--bf16 # BFloat16 (recommended for large models)
--fp8-hybrid # FP8 training (Hopper, Ada, and Blackwell GPUs)
```
### Activation Checkpointing and Recomputation
```bash
# For limited memory
--recompute-activations
# For extreme memory constraints
--recompute-granularity full \
--recompute-method uniform
```
### Data Parallelism Communication Overlap
```bash
--overlap-grad-reduce
--overlap-param-gather
```
### Distributed Optimizer
```bash
--use-distributed-optimizer
```
# Community & Support
## Getting Help
- 📖 **[Documentation](https://docs.nvidia.com/Megatron-Core/)** - Official documentation
- 🐛 **[Issues](https://github.com/NVIDIA/Megatron-LM/issues)** - Bug reports and feature requests
## Contributing
We ❤️ contributions! Ways to contribute:
- 🐛 **Report bugs** - Help us improve reliability
- 💡 **Suggest features** - Shape the future of Megatron Core
- 📝 **Improve docs** - Make Megatron Core more accessible
- 🔧 **Submit PRs** - Contribute code improvements
**→ [Contributing Guide](./CONTRIBUTING.md)**
## Citation
```bibtex
@article{megatron-lm,
title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
journal={arXiv preprint arXiv:1909.08053},
year={2019}
}
```
Raw data
{
"_id": null,
"home_page": null,
"name": "megatron-core",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "NVIDIA <nemo-toolkit@nvidia.com>",
"keywords": "NLP, NLU, deep, gpu, language, learning, learning, machine, nvidia, pytorch, torch, transformer",
"author": null,
"author_email": "NVIDIA <nemo-toolkit@nvidia.com>",
"download_url": "https://files.pythonhosted.org/packages/46/8f/89b294eb1e3c47518af48fbc79af75a275d55e88c750fbaf5aa785d5e29a/megatron_core-0.14.0.tar.gz",
"platform": null,
"description": "<div align=\"center\">\n\nMegatron-LM & Megatron Core\n===========================\n<h4>GPU-optimized library for training transformer models at scale</h4>\n\n[](https://docs.nvidia.com/Megatron-Core/developer-guide/latest/index.html)\n[](./CHANGELOG.md)\n[](./LICENSE)\n\n<div align=\"left\">\n\n## \u26a1 Quick Start\n\n```bash\n# 1. Install Megatron Core with required dependencies\npip install megatron-core\npip install --no-build-isolation transformer-engine[pytorch]\n\n# 2. Clone repository for examples\ngit clone https://github.com/NVIDIA/Megatron-LM.git\ncd Megatron-LM\n```\n\n**\u2192 [Complete Installation Guide](#installation)** - Docker, pip variants (dev,lts,etc.), source installation, and system requirements\n\n# Latest News\n\n- \ud83d\udce3 NEW! **[DeepSeek & MoE Training with FP8](https://github.com/yanring/Megatron-MoE-ModelZoo)** examples are now available, including optimized configurations for `DeepSeek-V3`, `Qwen2` and `Mixtral` models with FP8 precision support.\n- **[2025/05]** Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training ([blog](https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/)). \n\n<details>\n<summary>Previous News</summary>\n\n- **[2024/07]** Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training ([blog](https://developer.nvidia.com/blog/train-generative-ai-models-more-efficiently-with-new-nvidia-Megatron-Core-functionalities/)). \n- **[2024/06]** Megatron Core added supports for Mamba-based models. 
Check out our paper [An Empirical Study of Mamba-based Language Models](https://arxiv.org/pdf/2406.07887) and [code example](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba).\n- **[2024/01 Announcement]** NVIDIA has released the core capabilities in **Megatron-LM** into [**Megatron Core**](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the [Megatron Core intro](#Megatron Core) for more details.\n\n</details>\n\n<details>\n<summary>Table of Contents</summary>\n\n**Getting Started**\n- [Quick Start](#-quick-start)\n- [Latest News](#latest-news)\n- [Megatron Overview](#megatron-overview)\n - [Project Structure](#project-structure)\n - [Megatron-LM: Reference Implementation](#megatron-lm-reference-implementation)\n - [Megatron Core: Production Library](#megatron-core-production-library)\n- [Installation](#installation) \n - [Docker (Recommended)](#-docker-recommended)\n - [Pip Installation](#-pip-installation)\n - [Source Installation](#-source-installation)\n - [System Requirements](#system-requirements)\n\n**Core Features**\n- [Performance Benchmarking](#performance-benchmarking)\n - [Weak Scaling Results](#weak-scaling-results)\n - [Strong Scaling Results](#strong-scaling-results)\n- [Ecosystem Libraries](#ecosystem-libraries)\n\n**Training**\n- [Training](#training)\n - [Getting Started](#getting-started)\n - [Data Preparation](#data-preparation)\n- [Parallelism Strategies](#parallelism-strategies)\n - [Data Parallelism (DP)](#data-parallelism-dp)\n - [Tensor Parallelism (TP)](#tensor-parallelism-tp)\n - [Pipeline Parallelism (PP)](#pipeline-parallelism-pp)\n - [Context Parallelism (CP)](#context-parallelism-cp)\n - [Expert Parallelism (EP)](#expert-parallelism-ep)\n - [Parallelism Selection Guide](#parallelism-selection-guide)\n- 
[Performance Optimizations](#performance-optimizations)\n\n**Resources**\n- [Examples](./examples/) - Training scripts and tutorials\n- [Documentation](https://docs.nvidia.com/Megatron-Core/) - Official docs\n- [Community & Support](#-community--support) - Get help and contribute\n - [Getting Help](#getting-help)\n - [Contributing](#contributing)\n - [Citation](#citation)\n\n</details>\n\n# Megatron Overview\n\n## Project Structure\n```\nMegatron-LM/\n\u251c\u2500\u2500 megatron/ \n\u2502 \u251c\u2500\u2500 core/ # Megatron Core (kernels, parallelism, building blocks)\n\u2502 \u2502 \u251c\u2500\u2500 models/ # Transformer models\n\u2502 \u2502 \u251c\u2500\u2500 transformer/ # Transformer building blocks\n\u2502 \u2502 \u251c\u2500\u2500 tensor_parallel/ # Tensor parallelism\n\u2502 \u2502 \u251c\u2500\u2500 pipeline_parallel/ # Pipeline parallelism\n\u2502 \u2502 \u251c\u2500\u2500 distributed/ # Distributed training (FSDP, DDP)\n\u2502 \u2502 \u251c\u2500\u2500 optimizer/ # Optimizers\n\u2502 \u2502 \u251c\u2500\u2500 datasets/ # Dataset loaders\n\u2502 \u2502 \u251c\u2500\u2500 inference/ # Inference engines\n\u2502 \u2502 \u2514\u2500\u2500 export/ # Model export (e.g. 
TensorRT-LLM)\n\u2502 \u251c\u2500\u2500 training/ # Training scripts\n\u2502 \u251c\u2500\u2500 inference/ # Inference server\n\u2502 \u251c\u2500\u2500 legacy/ # Legacy components\n\u2502 \u2514\u2500\u2500 post_training/ # Post-training (RLHF, etc.)\n\u251c\u2500\u2500 examples/ # Ready-to-use training examples\n\u251c\u2500\u2500 tools/ # Utility tools\n\u251c\u2500\u2500 tests/ # Comprehensive test suite\n\u2514\u2500\u2500 docs/ # Documentation\n```\n\n### Megatron-LM: Reference Implementation\n**Reference implementation** that includes Megatron Core plus everything needed to train models.\n\n**Best for:**\n- **Training state-of-the-art foundation models** at scale with cutting-edge performance on latest NVIDIA hardware\n- **Research teams** exploring new architectures and training techniques\n- **Learning distributed training** concepts and best practices \n- **Quick experimentation** with proven model configurations\n\n**What you get:**\n- Pre-configured training scripts for GPT, LLama, DeepSeek, Qwen, and more.\n- End-to-end examples from data prep to evaluation\n- Research-focused tools and utilities\n\n### Megatron Core: Composable Library \n**Composable library** with GPU-optimized building blocks for custom training frameworks.\n\n**Best for:**\n- **Framework developers** building on top of modular and optimized components\n- **Research teams** needing custom training loops, optimizers, or data pipelines\n- **ML engineers** requiring fault-tolerant training pipelines\n\n**What you get:**\n- Composable transformer building blocks (attention, MLP, etc.)\n- Advanced parallelism strategies (TP, PP, DP, EP, CP)\n- Pipeline schedules and distributed optimizers\n- Mixed precision support (FP16, BF16, FP8)\n- GPU-optimized kernels and memory management\n- High-performance dataloaders and dataset utilities\n- Model architectures (LLaMA, Qwen, GPT, Mixtral, Mamba, etc.)\n\n## Ecosystem Libraries\n\n**Libraries used by Megatron Core:**\n\n- **[Megatron 
Energon](https://github.com/NVIDIA/Megatron-Energon)** \ud83d\udce3 **NEW!** - Multi-modal data loader (text, images, video, audio) with distributed loading and dataset blending\n- **[Transformer Engine](https://github.com/NVIDIA/TransformerEngine)** - Optimized kernels and FP8 mixed precision support\n- **[Resiliency Extension (NVRx)](https://github.com/NVIDIA/nvidia-resiliency-ext)** - Fault tolerant training with failure detection and recovery\n\n**Libraries using Megatron Core:**\n\n- **[NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)** - Enterprise framework with cloud-native support and end-to-end examples\n- **[TensorRT Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)** - Model optimization toolkit for quantization, pruning, and distillation\n\n**Compatible with:** [HuggingFace Accelerate](https://github.com/huggingface/accelerate), [Colossal-AI](https://github.com/hpcaitech/ColossalAI), [DeepSpeed](https://github.com/microsoft/DeepSpeed)\n\n# Installation\n\n## \ud83d\udc33 Docker (Recommended)\n\nWe strongly recommend using the previous releases of [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) rather than the latest one for optimal compatibility with Megatron Core release and testing. 
Our releases are always based on the previous month's NGC container, so this ensures compatibility and stability.\n\nThis container comes with all dependencies pre-installed with compatible versions and optimized configurations for NVIDIA GPUs:\n\n- PyTorch (latest stable version)\n- CUDA, cuDNN, NCCL (latest stable versions)\n- Support for FP8 on NVIDIA Hopper, Ada, and Blackwell GPUs\n- For best performance, use NVIDIA Turing GPU architecture generations and later\n\n```bash\n# Run container with mounted directories\ndocker run --runtime --nvidia --gpus all -it --rm \\\n -v /path/to/megatron:/workspace/megatron \\\n -v /path/to/dataset:/workspace/dataset \\\n -v /path/to/checkpoints:/workspace/checkpoints \\\n nvcr.io/nvidia/pytorch:25.04-py3\n```\n\n## Pip Installation\n\nMegatron Core offers support for two NGC PyTorch containers:\n\n- `dev`: Moving head that supports the most recent upstream dependencies\n- `lts`: Long-term support of NGC PyTorch 24.01\n\nBoth containers can be combined with `mlm` which adds package dependencies for Megatron-LM on top of Megatron Core.\n\n```bash\n# Install the latest release with minimal dependencies (no Transformer Engine)\npip install megatron-core[dev]\n```\n\n```bash\n# Install packages for LTS support NGC PyTorch 24.01\npip install megatron-core[lts]\n```\n\nFor a version of Megatron Core with only torch, run:\n\n```bash\npip install megatron-core\n```\n\nFor dependencies required by Megatron-LM, please run:\n\n```bash\npip install megatron-core[mlm]\n```\n\n## Source Installation\n\nFor development or latest features:\n\nFor Hybrid models, Megatron Core requires [mamba](https://github.com/state-spaces/mamba). If the pre-built wheel in PyPI does not fit your environment, you can fall back to an install script Megatron Core uses in its CI system. 
For this, please install `uv` first:\n\n```bash\nexport UV_VERSION=0.7.2\nexport PATH=\"$HOME/.local/bin:$PATH\"\ncurl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh\nexport UV_PROJECT_ENVIRONMENT=./venv\nexport PATH=\"$UV_PROJECT_ENVIRONMENT/bin:$PATH\"\nexport UV_LINK_MODE=copy\n```\n\nRun the following command to build upstream dependencies from source:\n\n```bash\n# Clone and install\ngit clone https://github.com/NVIDIA/Megatron-LM.git\ncd Megatron-LM\n\n# Optional: checkout specific release\ngit checkout core_r0.13.0\n\nbash docker/common/install.sh --environment {dev,lts}\n```\n\n## System Requirements\n\n### Hardware Requirements\n- **FP8 Support**: NVIDIA Hopper, Ada, Blackwell GPUs\n- **Recommended**: NVIDIA Turing architecture or later\n\n### Software Requirements\n- **CUDA/cuDNN/NCCL**: Latest stable versions\n- **PyTorch**: Latest stable version\n- **Transformer Engine**: Latest stable version\n- **Python**: 3.12 recommended\n\n# Performance Benchmarking\n\nFor our latest performance benchmarking results, please refer to [NVIDIA NeMo Framework Performance Summary](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance_summary.html).\n\nOur codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to **47% Model FLOP Utilization (MFU)** on H100 clusters.\n\n\n\n**Benchmark Configuration:**\n- **Vocabulary size**: 131,072 tokens\n- **Sequence length**: 4096 tokens \n- **Model scaling**: Varied hidden size, attention heads, and layers to achieve target parameter counts\n- **Communication optimizations**: Fine-grained overlapping with DP (`--overlap-grad-reduce`, `--overlap-param-gather`), TP (`--tp-comm-overlap`), and PP (enabled by default)\n\n**Key Results:**\n- **6144 H100 GPUs**: Successfully benchmarked 462B parameter model training\n- **Superlinear scaling**: MFU increases from 41% to 47-48% with model size\n- **End-to-end measurement**: Throughputs include all operations 
(data loading, optimizer steps, communication, logging)\n- **Production ready**: Full training pipeline with checkpointing and fault tolerance\n- *Note: Performance results measured without training to convergence*\n\n## Weak Scaling Results\nOur weak scaled results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.\n\n\n\n## Strong Scaling Results\nWe also strong scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to larger vocabulary size) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, leading to a reduction in MFU from 47% to 42%.\n\n\n\n# Training\n\n## Getting Started\n\n### Simple Training Example\n```bash\n# Distributed training example (2 GPUs, mock data)\ntorchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py\n```\n\n### LLama-3 Training Example\n```bash\n# 8 GPUs, FP8 precision, mock data\n./examples/llama/train_llama3_8b_fp8.sh\n```\n\n## Data Preparation\n\n### JSONL Data Format\n```json\n{\"text\": \"Your training text here...\"}\n{\"text\": \"Another training sample...\"}\n```\n\n### Basic Preprocessing\n```bash\npython tools/preprocess_data.py \\\n --input data.jsonl \\\n --output-prefix processed_data \\\n --tokenizer-type HuggingFaceTokenizer \\\n --tokenizer-model /path/to/tokenizer.model \\\n --workers 8 \\\n --append-eod\n```\n\n### Key Arguments\n- `--input`: Path to input JSON/JSONL file\n- `--output-prefix`: Prefix for output binary files (.bin and .idx)\n- `--tokenizer-type`: Tokenizer type (`HuggingFaceTokenizer`, `GPT2BPETokenizer`, etc.)\n- `--tokenizer-model`: Path to tokenizer model file\n- `--workers`: Number of parallel workers for processing\n- `--append-eod`: Add end-of-document token\n\n<!-- **\u2192 [Complete 
Data Preparation Guide](./docs/data-preparation.md)** - Comprehensive guide covering advanced preprocessing, dataset collection, deduplication, and optimization strategies -->\n\n# Parallelism Strategies\n\n## Data Parallelism (DP)\n\n### Standard Data Parallel\n```bash\n# Standard DDP - replicate model on each GPU\ntorchrun --nproc_per_node=8 pretrain_gpt.py \\\n --data-parallel-sharding-strategy no_shard\n```\n\n### Fully Sharded Data Parallel (FSDP)\n```bash\n# Megatron's optimized FSDP (~15% faster than PyTorch FSDP2)\n--use-custom-fsdp\n\n# PyTorch FSDP2\n--use-torch-fsdp2\n\n# Sharding strategies\n--data-parallel-sharding-strategy optim # Shard optimizer states (ZeRO-1)\n--data-parallel-sharding-strategy optim_grads # Shard gradients + optimizer (ZeRO-2)\n--data-parallel-sharding-strategy optim_grads_params # Shard parameters + gradients + optimizer (ZeRO-3)\n```\n\n## Tensor Parallelism (TP)\nSplit individual model layers across GPUs:\n```bash\n--tensor-model-parallel-size 4 # 4-way tensor parallelism\n--sequence-parallel # Enable sequence parallelism (recommended with TP)\n```\n\n## Pipeline Parallelism (PP)\nSplit model depth across GPUs:\n```bash\n--pipeline-model-parallel-size 8 # 8 pipeline stages\n--virtual-pipeline-model-parallel-size 4 # Virtual pipeline for better load balancing\n```\n\n## Context Parallelism (CP)\nSplit long sequences across GPUs for handling long contexts:\n```bash\n--context-parallel-size 2 # 2-way context parallelism\n--cp-comm-type p2p # Communication: p2p, a2a, allgather, a2a+p2p\n--hierarchical-context-parallel-sizes 2 4 # Hierarchical context parallelism\n```\n\n## Expert Parallelism (EP)\nFor Mixture of Experts (MoE) models:\n```bash\n--expert-model-parallel-size 4 # 4-way expert parallelism\n--num-experts 8 # 8 experts per MoE layer\n--moe-grouped-gemm # Optimize expert computation\n```\n\n## Combining Parallelism Strategies\n\n### Parallelism Selection Guide\n\nBased on [NVIDIA NeMo production 
configurations](https://github.com/NVIDIA/NeMo/tree/main/scripts/performance/recommended_model_configs):\n\n| Model | Size | GPUs | TP | PP | CP | EP | Notes |\n|-------|------|------|----|----|----|----|-------|\n| **Llama-3** | 8B | 8 | 1 | 1 | 2 | 1 | CP for long sequence length (8K) |\n| **Llama-3** | 70B | 64 | 4 | 4 | 2 | 1 | TP+PP |\n| **Llama-3.1** | 405B | 1024 | 8 | 8 | 2 | 1 | 3D parallelism for scale |\n| **GPT-3** | 175B | 128-512 | 4 | 8 | 1 | 1 | Large model config |\n| **Mixtral** | 8x7B | 64 | 1 | 4 | 1 | 8 | EP for MoE |\n| **Mixtral** | 8x22B | 256 | 4 | 4 | 8 | 8 | Combined TP+EP for large MoE |\n| **DeepSeek-V3** | 671B | 1024 | 2 | 16 | 1 | 64 | Large MoE config |\n\n### MoE-Specific Requirements\n\n**Important**: When combining Expert Parallelism (EP) with Tensor Parallelism (TP), **Sequence Parallelism (SP) must be enabled**.\n\n## Performance Optimizations\n\n| Feature | Flag | Benefit |\n|---------|------|---------|\n| **FlashAttention** | `--attention-backend` | Faster attention and lower memory usage |\n| **FP8 Training** | `--fp8-hybrid` | Faster training |\n| **Activation Checkpointing** | `--recompute-activations` | Reduced memory usage |\n| **Data Parallelism Communication Overlap** | `--overlap-grad-reduce` | Faster distributed training |\n| **Distributed Optimizer** | `--use-distributed-optimizer` | Reduced checkpointing time |\n\n**\u2192 [NVIDIA NeMo Framework Performance Tuning Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance-guide.html#performance-tuning-guide)** - Comprehensive performance optimization guide covering advanced tuning techniques, communication overlaps, memory optimizations, and profiling options.\n\n### FlashAttention\n[FlashAttention](https://github.com/Dao-AILab/flash-attention) is a fast and memory-efficient attention algorithm. 
We recommend the default usage, which uses cuDNN for attention via Transformer Engine and provides up to 50% speedups on forward and 84% on backward propagation with FP8 kernels. The `flash-attn` package is also supported via `--use-flash-attn`.\n\n### Mixed Precision Training\n```bash\n--fp16 # Standard FP16\n--bf16 # BFloat16 (recommended for large models)\n--fp8-hybrid # FP8 training (Hopper, Ada, and Blackwell GPUs)\n```\n\n### Activation Checkpointing and Recomputation\n```bash\n# For limited memory\n--recompute-activations\n\n# For extreme memory constraints\n--recompute-granularity full \\\n--recompute-method uniform\n```\n\n### Data Parallelism Communication Overlap\n\n```bash\n--overlap-grad-reduce\n--overlap-param-gather\n```\n\n### Distributed Optimizer\n```bash\n--use-distributed-optimizer\n```\n\n# Community & Support\n\n## Getting Help\n- \ud83d\udcd6 **[Documentation](https://docs.nvidia.com/Megatron-Core/)** - Official documentation\n- \ud83d\udc1b **[Issues](https://github.com/NVIDIA/Megatron-LM/issues)** - Bug reports and feature requests\n\n## Contributing\nWe \u2764\ufe0f contributions! Ways to contribute:\n- \ud83d\udc1b **Report bugs** - Help us improve reliability\n- \ud83d\udca1 **Suggest features** - Shape the future of Megatron Core\n- \ud83d\udcdd **Improve docs** - Make Megatron Core more accessible\n- \ud83d\udd27 **Submit PRs** - Contribute code improvements\n\n**\u2192 [Contributing Guide](./CONTRIBUTING.md)**\n\n## Citation\n```bibtex\n@article{megatron-lm,\n title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},\n author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},\n journal={arXiv preprint arXiv:1909.08053},\n year={2019}\n}\n```\n",
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "Megatron Core - a library for efficient and scalable training of transformer based models",
"version": "0.14.0",
"project_urls": {
"Download": "https://github.com/NVIDIA/Megatron-LM/releases",
"Homepage": "https://github.com/NVIDIA/Megatron-LM/megatron/core"
},
"split_keywords": [
"nlp",
" nlu",
" deep",
" gpu",
" language",
" learning",
" learning",
" machine",
" nvidia",
" pytorch",
" torch",
" transformer"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "eb365c8ece7f8b9b5278f50b24d177a0e766063bb8831a798c20c1a07a18d400",
"md5": "fef3cd85207cd149931410a8787ac1e0",
"sha256": "51dd5e0fb3ea801ec73b4634d23aaf95a1c928cfcfe52b06429f5582849bcdb7"
},
"downloads": -1,
"filename": "megatron_core-0.14.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl",
"has_sig": false,
"md5_digest": "fef3cd85207cd149931410a8787ac1e0",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 2114559,
"upload_time": "2025-10-08T15:04:17",
"upload_time_iso_8601": "2025-10-08T15:04:17.560228Z",
"url": "https://files.pythonhosted.org/packages/eb/36/5c8ece7f8b9b5278f50b24d177a0e766063bb8831a798c20c1a07a18d400/megatron_core-0.14.0-cp310-cp310-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "58a8f4b6396fed9a48222a614c5da3b3b714091b60bd570f94d7477b0bf5fde5",
"md5": "d68bc3eb603af118d9f1c84382c05fa6",
"sha256": "cdeace5fe5d989dc859dcbb3ce3da08f8be44a9a07ed0058145d2327039a273a"
},
"downloads": -1,
"filename": "megatron_core-0.14.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl",
"has_sig": false,
"md5_digest": "d68bc3eb603af118d9f1c84382c05fa6",
"packagetype": "bdist_wheel",
"python_version": "cp310",
"requires_python": ">=3.10",
"size": 2145821,
"upload_time": "2025-10-08T15:04:19",
"upload_time_iso_8601": "2025-10-08T15:04:19.603876Z",
"url": "https://files.pythonhosted.org/packages/58/a8/f4b6396fed9a48222a614c5da3b3b714091b60bd570f94d7477b0bf5fde5/megatron_core-0.14.0-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "bab7bc5f62cd10f288fe1062e7e64aac938f5ae16ba80ad4f11c7c6eb900de88",
"md5": "efa5dcf3d50a1ae3d9eecd8002eaeffe",
"sha256": "cf8caeedfc024c6b17ce3501a263be454adbace489bacb59697c099dd41d5355"
},
"downloads": -1,
"filename": "megatron_core-0.14.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl",
"has_sig": false,
"md5_digest": "efa5dcf3d50a1ae3d9eecd8002eaeffe",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.10",
"size": 2125618,
"upload_time": "2025-10-08T15:04:21",
"upload_time_iso_8601": "2025-10-08T15:04:21.269483Z",
"url": "https://files.pythonhosted.org/packages/ba/b7/bc5f62cd10f288fe1062e7e64aac938f5ae16ba80ad4f11c7c6eb900de88/megatron_core-0.14.0-cp311-cp311-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "a625dc71867d46abb494c5569d67dd67eb5da32a01b8e2a20a443ade1ed4fbf0",
"md5": "a8ab807292d307a66df7eb002237dcd0",
"sha256": "e011fbb8c514f58ccae4441f21b5b0b6e7684522bb59f94ea69341686f0de019"
},
"downloads": -1,
"filename": "megatron_core-0.14.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl",
"has_sig": false,
"md5_digest": "a8ab807292d307a66df7eb002237dcd0",
"packagetype": "bdist_wheel",
"python_version": "cp311",
"requires_python": ">=3.10",
"size": 2157450,
"upload_time": "2025-10-08T15:04:22",
"upload_time_iso_8601": "2025-10-08T15:04:22.888574Z",
"url": "https://files.pythonhosted.org/packages/a6/25/dc71867d46abb494c5569d67dd67eb5da32a01b8e2a20a443ade1ed4fbf0/megatron_core-0.14.0-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b6ff5c9a8affbf29a9f0227543e684daee977a6891521e318fc6f02159a578eb",
"md5": "8ef5cb04430f52ad05d8f564db64ac34",
"sha256": "52af4afbf58b1ae0556e9b209b02b3b8eee626aa34c57d48ebb2ba7116e572cb"
},
"downloads": -1,
"filename": "megatron_core-0.14.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl",
"has_sig": false,
"md5_digest": "8ef5cb04430f52ad05d8f564db64ac34",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.10",
"size": 2146301,
"upload_time": "2025-10-08T15:04:24",
"upload_time_iso_8601": "2025-10-08T15:04:24.743794Z",
"url": "https://files.pythonhosted.org/packages/b6/ff/5c9a8affbf29a9f0227543e684daee977a6891521e318fc6f02159a578eb/megatron_core-0.14.0-cp312-cp312-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "82750d3ad7c9b43fbe0eb5378452b46b3943502ae377b0bd122cf6836c5b055f",
"md5": "24bb2b5acf324962ab88b3704de771d0",
"sha256": "d989f93ea64a9c2c24d1a447c218f0566d04f2195afcba2e423f89ad8b717596"
},
"downloads": -1,
"filename": "megatron_core-0.14.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl",
"has_sig": false,
"md5_digest": "24bb2b5acf324962ab88b3704de771d0",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.10",
"size": 2170140,
"upload_time": "2025-10-08T15:04:26",
"upload_time_iso_8601": "2025-10-08T15:04:26.223731Z",
"url": "https://files.pythonhosted.org/packages/82/75/0d3ad7c9b43fbe0eb5378452b46b3943502ae377b0bd122cf6836c5b055f/megatron_core-0.14.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "abbe95d6b3db7e89aad412ee19ddd0d9d61ca7d3dc4413d796ba14a3d0d725eb",
"md5": "88ad156377e10c3e2440024d6d1517de",
"sha256": "e4bf2b19763e469021f72d12c38147c052a1057805e7da0ea799a80ac74cc421"
},
"downloads": -1,
"filename": "megatron_core-0.14.0-cp313-cp313-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl",
"has_sig": false,
"md5_digest": "88ad156377e10c3e2440024d6d1517de",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.10",
"size": 2145061,
"upload_time": "2025-10-08T15:04:27",
"upload_time_iso_8601": "2025-10-08T15:04:27.990641Z",
"url": "https://files.pythonhosted.org/packages/ab/be/95d6b3db7e89aad412ee19ddd0d9d61ca7d3dc4413d796ba14a3d0d725eb/megatron_core-0.14.0-cp313-cp313-manylinux_2_24_aarch64.manylinux_2_28_aarch64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "5df60cb45c7ec234ed845578a5f3727d6839b706fa55e476b0234e11428dfa20",
"md5": "174249f2f8e2128bac254e56f8052885",
"sha256": "401fd02a3f3bac62325f75621494026681e6fad9b2afad495da45cbe1e05a940"
},
"downloads": -1,
"filename": "megatron_core-0.14.0-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl",
"has_sig": false,
"md5_digest": "174249f2f8e2128bac254e56f8052885",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.10",
"size": 2170799,
"upload_time": "2025-10-08T15:04:29",
"upload_time_iso_8601": "2025-10-08T15:04:29.136099Z",
"url": "https://files.pythonhosted.org/packages/5d/f6/0cb45c7ec234ed845578a5f3727d6839b706fa55e476b0234e11428dfa20/megatron_core-0.14.0-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "468f89b294eb1e3c47518af48fbc79af75a275d55e88c750fbaf5aa785d5e29a",
"md5": "29fbc80004ad8177c9b2cca37703de21",
"sha256": "fcfd4bb2bc3eb83a646aacab4d47ab8786a6dcb7bd9fe3ae9cbba8eb185158ec"
},
"downloads": -1,
"filename": "megatron_core-0.14.0.tar.gz",
"has_sig": false,
"md5_digest": "29fbc80004ad8177c9b2cca37703de21",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 804761,
"upload_time": "2025-10-08T15:04:33",
"upload_time_iso_8601": "2025-10-08T15:04:33.172575Z",
"url": "https://files.pythonhosted.org/packages/46/8f/89b294eb1e3c47518af48fbc79af75a275d55e88c750fbaf5aa785d5e29a/megatron_core-0.14.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-08 15:04:33",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "NVIDIA",
"github_project": "Megatron-LM",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "megatron-core"
}