ademamix

Name: ademamix
Version: 0.0.1
Summary: AdEMAMix is a PyTorch optimizer that combines two EMAs to better utilize past gradients, offering improved convergence and model retention over AdamW.
Author: Oguz Vuruskaner
Requires-Python: <4.0,>=3.8
License: MIT
Upload time: 2024-09-25 04:35:35
# AdEMAMix Optimizer

![PyPI](https://img.shields.io/pypi/v/ademamix) ![License](https://img.shields.io/github/license/ovuruska/torch_ademamix)
## Overview

AdEMAMix is an optimizer that builds on the widely used AdamW algorithm, combining two Exponential Moving Averages (EMAs) of past gradients to make better use of older gradient information. This enables faster convergence and lower training loss, which is particularly beneficial for large-scale models such as LLMs and image classifiers. For a comprehensive understanding of AdEMAMix, including its methodology and performance benchmarks, refer to the original paper: "The AdEMAMix Optimizer: Better, Faster, Older" by Matteo Pagliardini, Pierre Ablin, and David Grangier, available at [https://arxiv.org/abs/2409.03137](https://arxiv.org/abs/2409.03137).
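
For intuition, the update rule can be sketched roughly as follows. This is a simplified, illustrative sketch of the rule described in the paper, not this package's exact implementation; the function name and default values below are placeholders.

```python
import torch

def ademamix_step(param, grad, m1, m2, nu, step,
                  lr=1e-3, betas=(0.9, 0.999, 0.9999),
                  alpha=5.0, eps=1e-8, weight_decay=1e-2):
    """Illustrative single-tensor AdEMAMix update (simplified sketch)."""
    beta1, beta2, beta3 = betas

    m1.mul_(beta1).add_(grad, alpha=1 - beta1)             # fast EMA (as in Adam)
    m2.mul_(beta3).add_(grad, alpha=1 - beta3)             # slow EMA that retains old gradients
    nu.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # second-moment EMA

    m1_hat = m1 / (1 - beta1 ** step)                      # bias correction
    nu_hat = nu / (1 - beta2 ** step)

    update = (m1_hat + alpha * m2) / (nu_hat.sqrt() + eps)  # mix the two momenta
    param.add_(update + weight_decay * param, alpha=-lr)    # decoupled weight decay (AdamW-style)
```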
## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
- [Contributing](#contributing)
- [License](#license)
- [References](#references)

## Installation
You can install the package via PyPI:

```bash
pip install adememix
```

Alternatively, you can clone the repository and install it manually:

```bash
pip install git+https://github.com/ovuruska/torch_ademamix.git
```

## Usage

The AdEMAMix optimizer can be used in PyTorch just like any other optimizer. Here is a complete usage example:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from ademamix import AdEMAMix

# Model definition (example)
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)
    
    def forward(self, x):
        return self.fc(x)

# Create model and a toy dataset / data loader (replace with your own data)
model = SimpleModel()
dataset = TensorDataset(torch.randn(320, 10), torch.randn(320, 1))
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Training parameters
num_epochs = 10
num_iterations = len(train_loader) * num_epochs

# Optimizer setup
optimizer = AdEMAMix(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999, 0.9999),
    alpha=5.0,
    weight_decay=1e-2,
    T_alpha=num_iterations,
    T_beta3=num_iterations
)

# Loss function
criterion = nn.MSELoss()

# Training loop
for epoch in range(num_epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        if batch_idx % 100 == 0:
            print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item()}')
```

### Setting T_alpha and T_beta3 Parameters

The `T_alpha` and `T_beta3` parameters control the time-dependent behavior of AdEMAMix. These parameters are typically set equal to the total number of iterations (`num_iterations`).

- `T_alpha`: Controls how the alpha parameter changes over time.
- `T_beta3`: Controls how the beta3 parameter changes over time.

Things to consider when setting these parameters:

1. Calculate the total number of iterations for your training:
   ```python
   num_iterations = len(train_loader) * num_epochs
   ```

2. Set `T_alpha` and `T_beta3` equal to this number:
   ```python
   optimizer = AdEMAMix(
       model.parameters(),
       lr=1e-3,
       T_alpha=num_iterations,
       T_beta3=num_iterations,
       # ... other parameters ...
   )
   ```

3. If your training duration is very long or short, you can adjust these values. For example:
   - For faster adaptation: `T_alpha = num_iterations // 2`
   - For slower adaptation: `T_alpha = num_iterations * 2`

4. You can fine-tune the optimizer's behavior by setting `T_alpha` and `T_beta3` to different values.

Note: If you leave these parameters as `None`, AdEMAMix will use constant alpha and beta3 values.
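
For reference, the paper schedules `alpha` with a linear warmup and warms `beta3` up from `beta1` by interpolating linearly in the EMA half-life. The sketch below is illustrative only, assuming the schedules described in the paper; the exact form used by this package may differ.

```python
import math

def alpha_scheduler(step, alpha=5.0, T_alpha=None):
    """Linear warmup of alpha from 0 to its final value over T_alpha steps."""
    if T_alpha is None:
        return alpha
    return min(step / T_alpha, 1.0) * alpha

def beta3_scheduler(step, beta1=0.9, beta3=0.9999, T_beta3=None):
    """Warm beta3 up from beta1 to its final value over T_beta3 steps,
    interpolating linearly in 1/log(beta), i.e. roughly in EMA half-life."""
    if T_beta3 is None:
        return beta3
    t = min(step / T_beta3, 1.0)
    log_b1, log_b3 = math.log(beta1), math.log(beta3)
    return min(math.exp(log_b1 * log_b3 / ((1 - t) * log_b3 + t * log_b1)), beta3)
```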

### Advanced Usage

AdEMAMix allows for fine-tuning in different scenarios:

```python
optimizer = AdEMAMix(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999, 0.9999),  # (beta1, beta2, beta3)
    alpha=5.0,
    weight_decay=1e-2,
    T_alpha=num_iterations,
    T_beta3=num_iterations,
    amsgrad=True,  # Use AMSGrad variant
    foreach=True,  # For faster updates (if supported)
    maximize=False,  # If True, maximizes the loss (e.g., for GAN training)
    capturable=True,  # For use with CUDA graphs
    differentiable=False  # Set True if you need the optimizer step to be differentiable
)
```



You can easily replace your existing Adam or AdamW optimizer with AdEMAMix by modifying just a few lines in your existing code.
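
For example, a drop-in swap from AdamW might look like this (the hyperparameter values below are illustrative, not tuned recommendations):

```python
# Before: standard AdamW
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
#                               betas=(0.9, 0.999), weight_decay=1e-2)

# After: AdEMAMix adds a third beta and an alpha; the training loop is unchanged.
from ademamix import AdEMAMix

optimizer = AdEMAMix(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999, 0.9999),
    alpha=5.0,
    weight_decay=1e-2,
    T_alpha=num_iterations,
    T_beta3=num_iterations,
)
```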

## Contributing
Contributions are welcome! Feel free to submit a pull request or open an issue if you encounter any bugs or have suggestions for improvements. Please ensure that your contributions adhere to our code of conduct and follow the repository guidelines.

## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## References
If you use AdEMAMix in your research, please cite the following paper:

1. Matteo Pagliardini, Pierre Ablin, David Grangier (2024). *The AdEMAMix Optimizer: Better, Faster, Older*. [arXiv:2409.03137](https://arxiv.org/abs/2409.03137)

---


            
