simpletuner

Name: simpletuner
Version: 2.2.1
Home page: https://github.com/bghira/SimpleTuner
Summary: Stable Diffusion 2.x and XL tuner.
Upload time: 2025-09-13 08:10:46
Maintainer: bghira
Author: bghira
Requires Python: <3.13,>=3.11
License: None
Keywords: stable-diffusion, machine-learning, deep-learning, pytorch, cuda, rocm, diffusion-models, ai, image-generation
# SimpleTuner 💹

> ℹ️ No data is sent to any third parties except through the opt-in flags `report_to` and `push_to_hub`, or through webhooks, all of which must be manually configured.

**SimpleTuner** is geared towards simplicity, with a focus on making the code easily understood. This codebase serves as a shared academic exercise, and contributions are welcome.

If you'd like to join our community, we can be found [on Discord](https://discord.gg/CVzhX7ZA) via Terminus Research Group.
If you have any questions, please feel free to reach out to us there.

## Table of Contents

- [Design Philosophy](#design-philosophy)
- [Tutorial](#tutorial)
- [Features](#features)
  - [Core Training Features](#core-training-features)
  - [Model Architecture Support](#model-architecture-support)
  - [Advanced Training Techniques](#advanced-training-techniques)
  - [Model-Specific Features](#model-specific-features)
  - [Quickstart Guides](#quickstart-guides)
- [Hardware Requirements](#hardware-requirements)
- [Toolkit](#toolkit)
- [Setup](#setup)
- [Troubleshooting](#troubleshooting)

## Design Philosophy

- **Simplicity**: Aiming to have good default settings for most use cases, so less tinkering is required.
- **Versatility**: Designed to handle a wide range of image quantities - from small datasets to extensive collections.
- **Cutting-Edge Features**: Only incorporates features that have proven efficacy, avoiding the addition of untested options.

## Tutorial

Please explore this README fully before embarking on [the tutorial](/documentation/TUTORIAL.md), as this README contains vital information you may need to know first.

For a quick start without reading the full documentation, you can use the [Quick Start](/documentation/QUICKSTART.md) guide.

For memory-constrained systems, see the [DeepSpeed document](/documentation/DEEPSPEED.md) which explains how to use πŸ€—Accelerate to configure Microsoft's DeepSpeed for optimiser state offload.

For multi-node distributed training, [this guide](/documentation/DISTRIBUTED.md) will help you adapt the configurations from the INSTALL and Quickstart guides for multi-node training and optimise them for image datasets numbering in the billions of samples.

---

## Features

SimpleTuner provides comprehensive training support across multiple diffusion model architectures with consistent feature availability:

### Core Training Features

- **Multi-GPU training** - Distributed training across multiple GPUs with automatic optimization
- **Advanced caching** - Image, video, and caption embeddings cached to disk for faster training
- **Aspect bucketing** - Support for varied image/video sizes and aspect ratios
- **Memory optimization** - Most models trainable on a 24GB GPU, many on 16GB with optimizations
- **DeepSpeed integration** - Train large models on smaller GPUs with gradient checkpointing and optimizer state offload
- **S3 training** - Train directly from cloud storage (Cloudflare R2, Wasabi S3)
- **EMA support** - Exponential moving average weights for improved stability and quality
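
As an illustration of the EMA item above, here is a minimal, framework-agnostic sketch of how exponential-moving-average weights are typically maintained alongside the trained weights. This shows the general technique rather than SimpleTuner's exact implementation, and the `decay=0.999` value is purely illustrative:

```python
import copy

import torch


@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.999) -> None:
    """Blend the live training weights into a frozen shadow copy after each optimizer step."""
    for ema_param, param in zip(ema_model.parameters(), model.parameters()):
        # ema <- decay * ema + (1 - decay) * current
        ema_param.lerp_(param.detach(), 1.0 - decay)


# Usage: keep a frozen copy of the model and update it after every optimizer step;
# the EMA copy is what gets saved or evaluated for more stable results.
model = torch.nn.Linear(8, 8)
ema_model = copy.deepcopy(model).requires_grad_(False)
update_ema(ema_model, model, decay=0.999)
```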

### Model Architecture Support

| Model | Parameters | PEFT LoRA | Lycoris | Full-Rank | ControlNet | Quantization | Flow Matching | Text Encoders |
|-------|------------|-----------|---------|-----------|------------|--------------|---------------|---------------|
| **Stable Diffusion XL** | 3.5B | ✓ | ✓ | ✓ | ✓ | int8/nf4 | ✗ | CLIP-L/G |
| **Stable Diffusion 3** | 2B-8B | ✓ | ✓ | ✓* | ✓ | int8/fp8/nf4 | ✓ | CLIP-L/G + T5-XXL |
| **Flux.1** | 12B | ✓ | ✓ | ✓* | ✓ | int8/fp8/nf4 | ✓ | CLIP-L + T5-XXL |
| **Auraflow** | 6.8B | ✓ | ✓ | ✓* | ✓ | int8/fp8/nf4 | ✓ | UMT5-XXL |
| **PixArt Sigma** | 0.6B-0.9B | ✗ | ✓ | ✓ | ✓ | int8 | ✗ | T5-XXL |
| **Sana** | 0.6B-4.8B | ✗ | ✓ | ✓ | ✗ | int8 | ✓ | Gemma2-2B |
| **Lumina2** | 2B | ✓ | ✓ | ✓ | ✗ | int8 | ✓ | Gemma2 |
| **Kwai Kolors** | 5B | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ChatGLM-6B |
| **LTX Video** | 5B | ✓ | ✓ | ✓ | ✗ | int8/fp8 | ✓ | T5-XXL |
| **Wan Video** | 1.3B-14B | ✓ | ✓ | ✓* | ✗ | int8 | ✓ | UMT5 |
| **HiDream** | 17B (8.5B MoE) | ✓ | ✓ | ✓* | ✓ | int8/fp8/nf4 | ✓ | CLIP-L + T5-XXL + Llama |
| **Cosmos2** | 2B-14B | ✗ | ✓ | ✓ | ✗ | int8 | ✓ | T5-XXL |
| **OmniGen** | 3.8B | ✓ | ✓ | ✓ | ✗ | int8/fp8 | ✓ | T5-XXL |
| **Qwen Image** | 20B | ✓ | ✓ | ✓* | ✗ | int8/nf4 (req.) | ✓ | T5-XXL |
| **SD 1.x/2.x (Legacy)** | 0.9B | ✓ | ✓ | ✓ | ✓ | int8/nf4 | ✗ | CLIP-L |

*✓ = Supported, ✗ = Not supported, \* = Requires DeepSpeed for full-rank training*

### Advanced Training Techniques

- **TREAD** - Token-wise dropout for Flux and Wan models, including Kontext training
- **Masked loss training** - Superior convergence with segmentation/depth guidance
- **Prior regularization** - Enhanced training stability for character consistency
- **Gradient checkpointing** - Configurable intervals for memory/speed optimization
- **Loss functions** - L2, Huber, Smooth L1 with scheduling support
- **SNR weighting** - Min-SNR gamma weighting for improved training dynamics
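
To make the SNR-weighting item above concrete, here is a conceptual sketch of Min-SNR gamma loss weighting (Hang et al., 2023) for an epsilon-prediction objective. The function name and the `gamma=5.0` default are illustrative assumptions, not SimpleTuner's internal API:

```python
import torch


def min_snr_weights(alphas_cumprod: torch.Tensor, timesteps: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    """Per-sample loss weights: min(SNR, gamma) / SNR for epsilon prediction."""
    alpha_bar = alphas_cumprod[timesteps]        # cumulative alpha at each sampled timestep
    snr = alpha_bar / (1.0 - alpha_bar)          # signal-to-noise ratio
    return torch.clamp(snr, max=gamma) / snr


# Usage: scale the per-sample MSE before reducing it to a scalar loss, e.g.
# loss = (min_snr_weights(alphas_cumprod, timesteps) * per_sample_mse).mean()
```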

### Model-Specific Features

- **Flux Kontext** - Edit conditioning and image-to-image training for Flux models
- **PixArt two-stage** - eDiff training pipeline support for PixArt Sigma
- **Flow matching models** - Advanced scheduling with beta/uniform distributions (see the sketch after this list)
- **HiDream MoE** - Mixture of Experts gate loss augmentation
- **T5 masked training** - Enhanced fine details for Flux and compatible models
- **QKV fusion** - Memory and speed optimizations (Flux, Lumina2)
- **TREAD integration** - Selective token routing for Wan and Flux models
- **Classifier-free guidance** - Optional CFG reintroduction for distilled models
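
The flow-matching scheduling mentioned in the list above comes down to how training timesteps are drawn and how latents are interpolated toward noise. Below is a minimal rectified-flow-style sketch, assuming the common `x_t = (1 - t) * x0 + t * noise` formulation; the Beta parameters are illustrative placeholders, and the real options are documented in the OPTIONS reference linked later in this README:

```python
import torch


def sample_timesteps(batch_size: int, mode: str = "uniform", alpha: float = 2.0, beta: float = 2.0) -> torch.Tensor:
    """Draw flow-matching timesteps in (0, 1) from a uniform or beta distribution."""
    if mode == "beta":
        return torch.distributions.Beta(alpha, beta).sample((batch_size,))
    return torch.rand(batch_size)


def flow_matching_pair(x0: torch.Tensor, t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Interpolate clean latents toward noise; return the noisy input and velocity target."""
    noise = torch.randn_like(x0)
    t = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast t over the latent dimensions
    x_t = (1.0 - t) * x0 + t * noise
    target = noise - x0                       # the velocity the model learns to predict
    return x_t, target
```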

### Quickstart Guides

Detailed quickstart guides are available for all supported models:

- **[Flux.1 Guide](/documentation/quickstart/FLUX.md)** - Includes Kontext editing support and QKV fusion
- **[Stable Diffusion 3 Guide](/documentation/quickstart/SD3.md)** - Full and LoRA training with ControlNet
- **[Stable Diffusion XL Guide](/documentation/quickstart/SDXL.md)** - Complete SDXL training pipeline
- **[Auraflow Guide](/documentation/quickstart/AURAFLOW.md)** - Flow-matching model training
- **[PixArt Sigma Guide](/documentation/quickstart/SIGMA.md)** - DiT model with two-stage support
- **[Sana Guide](/documentation/quickstart/SANA.md)** - Lightweight flow-matching model
- **[Lumina2 Guide](/documentation/quickstart/LUMINA2.md)** - 2B parameter flow-matching model
- **[Kwai Kolors Guide](/documentation/quickstart/KOLORS.md)** - SDXL-based with ChatGLM encoder
- **[LTX Video Guide](/documentation/quickstart/LTXVIDEO.md)** - Video diffusion training
- **[Wan Video Guide](/documentation/quickstart/WAN.md)** - Video flow-matching with TREAD support
- **[HiDream Guide](/documentation/quickstart/HIDREAM.md)** - MoE model with advanced features
- **[Cosmos2 Guide](/documentation/quickstart/COSMOS2IMAGE.md)** - Multi-modal image generation
- **[OmniGen Guide](/documentation/quickstart/OMNIGEN.md)** - Unified image generation model
- **[Qwen Image Guide](/documentation/quickstart/QWEN_IMAGE.md)** - 20B parameter large-scale training

---

## Hardware Requirements

### General Requirements

- **NVIDIA**: RTX 3080+ recommended (tested up to H200)
- **AMD**: 7900 XTX 24GB and MI300X verified (higher memory usage vs NVIDIA)
- **Apple**: M3 Max+ with 24GB+ unified memory for LoRA training

### Memory Guidelines by Model Size

- **Large models (12B+)**: A100 80GB for full-rank, 24GB+ for LoRA/Lycoris
- **Medium models (2B-8B)**: 16GB+ for LoRA, 40GB+ for full-rank training
- **Small models (<2B)**: 12GB+ sufficient for most training types

**Note**: Quantization (int8/fp8/nf4) significantly reduces memory requirements. See individual [quickstart guides](#quickstart-guides) for model-specific requirements.
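
As a back-of-the-envelope illustration of why quantization helps, weight memory scales with bytes per parameter. For a 12B-parameter model, the weights alone come to roughly the figures below (activations, gradients, and optimizer state are extra):

```python
def weight_memory_gib(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return params_billions * 1e9 * bytes_per_param / 1024**3


print(round(weight_memory_gib(12, 2.0), 1))  # bf16 -> ~22.4 GiB
print(round(weight_memory_gib(12, 1.0), 1))  # int8 -> ~11.2 GiB
print(round(weight_memory_gib(12, 0.5), 1))  # nf4  -> ~5.6 GiB
```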

## Setup

SimpleTuner can be installed via pip for most users:

```bash
# Base installation (CPU-only PyTorch)
pip install simpletuner

# CUDA users (NVIDIA GPUs)
pip install simpletuner[cuda]

# ROCm users (AMD GPUs)
pip install simpletuner[rocm]

# Apple Silicon users (M1/M2/M3/M4 Macs)
pip install simpletuner[apple]
```
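
Note that some shells (zsh in particular) treat the square brackets in extras as glob patterns, so quote them, e.g. `pip install 'simpletuner[cuda]'`.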

For manual installation or development setup, see the [installation documentation](/documentation/INSTALL.md).

## Troubleshooting

Enable debug logs for more detailed insight by adding `export SIMPLETUNER_LOG_LEVEL=DEBUG` to your environment file (`config/config.env`).

For performance analysis of the training loop, setting `SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL=DEBUG` will produce timestamped log output that helps highlight any issues in your configuration.

For a comprehensive list of options available, consult [this documentation](/documentation/OPTIONS.md).

            
