# RapidFire AI
Rapid experimentation for easier, faster, and more impactful fine-tuning and post-training for LLMs and other DL models.
## Overview
RapidFire AI is a new experiment execution framework that transforms your LLM customization experimentation from slow, sequential processes into rapid, intelligent workflows with hyperparallelized training, dynamic real-time experiment control, and automatic multi-GPU system orchestration.

## Getting Started
### Prerequisites
- [NVIDIA GPU with Compute Capability 7.x or 8.x](https://developer.nvidia.com/cuda-gpus)
- [NVIDIA CUDA Toolkit 11.8+](https://developer.nvidia.com/cuda-toolkit-archive)
- [Python 3.12.x](https://www.python.org/downloads/)
- [PyTorch 2.7.1+](https://pytorch.org/get-started/previous-versions/) with the corresponding forward-compatible prebuilt CUDA binaries
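The prerequisites above can be spot-checked from Python before installing. The following is a minimal sketch, not part of RapidFire AI: the function name `check_prerequisites` and the report keys are hypothetical, and it only uses `sys` plus PyTorch's `torch.cuda` helpers if PyTorch happens to be installed. For a fuller report, use `rapidfireai doctor`.

```python
import sys


def check_prerequisites():
    """Return a small report on the Python / PyTorch / GPU prerequisites.

    Hypothetical helper for illustration; not part of the rapidfireai package.
    """
    report = {
        # The README asks for Python 3.12.x.
        "python_ok": sys.version_info[:2] == (3, 12),
        "torch": None,
    }
    try:
        import torch  # optional: only inspected if already installed
    except ImportError:
        return report

    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        major, minor = torch.cuda.get_device_capability(0)
        report["compute_capability"] = f"{major}.{minor}"
        # The README lists Compute Capability 7.x or 8.x.
        report["capability_ok"] = major in (7, 8)
    return report


if __name__ == "__main__":
    for key, value in check_prerequisites().items():
        print(f"{key}: {value}")
```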
### Installation/Starting
```bash
virtualenv -p python3 oss_venv
source oss_venv/bin/activate
# from pypi
pip install rapidfireai
# install specific dependencies and initialize rapidfire
rapidfireai init
# start the rapidfire server
rapidfireai start
# open up example notebook and start experiment
```
### Troubleshooting
For a quick system diagnostics report (Python env, relevant packages, GPU/CUDA, and key environment variables), run:
```bash
rapidfireai doctor
```
If you encounter port conflicts, you can kill existing processes:
```bash
lsof -t -i:5002 | xargs kill -9 # mlflow
lsof -t -i:8080 | xargs kill -9 # dispatcher
lsof -t -i:3000 | xargs kill -9 # frontend server
```
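Before killing anything, it can help to see which of these ports are actually occupied. The sketch below uses only the Python standard library; the port numbers come from the commands above, while the helper name `port_in_use` and the `RAPIDFIRE_PORTS` mapping are assumptions made for illustration.

```python
import socket

# Port numbers taken from the troubleshooting commands above.
RAPIDFIRE_PORTS = {"mlflow": 5002, "dispatcher": 8080, "frontend": 3000}


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, port in RAPIDFIRE_PORTS.items():
        status = "in use" if port_in_use(port) else "free"
        print(f"{name:<10} {port:<5} {status}")
```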
## Documentation
Browse or reference the full documentation, example use case tutorials, all API details, dashboard details, and more [here](https://rapidfire-ai-oss-docs.readthedocs-hosted.com/).
## Key Features
### MLflow Integration
Full MLflow support for experiment tracking and metrics visualization. A named RapidFire AI experiment corresponds to an MLflow experiment for comprehensive governance.
### Interactive Control Operations (IC Ops)
First-of-its-kind dynamic real-time control over runs in flight. These operations can be invoked through the dashboard:
- Stop active runs; puts them in a dormant state
- Resume stopped runs; makes them active again
- Clone and modify existing runs, with or without warm starting from parent’s weights
- Delete unwanted or failed runs
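The four operations above can be read as transitions on a run's state. The following is a toy model of that idea, not the RapidFire AI implementation: the `Run` class, the `RunState` names, and the string used for `weights` are all hypothetical and exist only to illustrate stop, resume, clone (with or without warm start), and delete.

```python
from dataclasses import dataclass, field
from enum import Enum
from itertools import count
from typing import Optional


class RunState(Enum):
    ACTIVE = "active"
    STOPPED = "stopped"   # dormant, can be resumed
    DELETED = "deleted"


_run_ids = count(1)  # simple id generator for the sketch


@dataclass
class Run:
    config: dict
    weights: Optional[str] = None  # stand-in for a checkpoint reference
    state: RunState = RunState.ACTIVE
    run_id: int = field(default_factory=lambda: next(_run_ids))
    parent_id: Optional[int] = None

    def stop(self):
        if self.state is not RunState.ACTIVE:
            raise ValueError("only active runs can be stopped")
        self.state = RunState.STOPPED

    def resume(self):
        if self.state is not RunState.STOPPED:
            raise ValueError("only stopped runs can be resumed")
        self.state = RunState.ACTIVE

    def delete(self):
        self.state = RunState.DELETED

    def clone(self, overrides, warm_start=False):
        """Create a modified copy; warm_start reuses the parent's weights."""
        return Run(
            config={**self.config, **overrides},
            weights=self.weights if warm_start else None,
            parent_id=self.run_id,
        )
```

For example, `parent.clone({"lr": 5e-5}, warm_start=True)` yields a new active run that starts from the parent's weights, while `warm_start=False` starts it from scratch.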
### Multi-GPU Support
The Scheduler automatically handles multiple GPUs on the machine and divides resources across all running configs for optimal resource utilization.
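This README does not describe the Scheduler's actual policy. As a rough intuition for what dividing GPUs across running configs can look like, here is a simple round-robin assignment; the function name `assign_round_robin` and the policy itself are illustrative assumptions, not the Scheduler's algorithm.

```python
from collections import defaultdict


def assign_round_robin(config_names, num_gpus):
    """Spread configs over GPU indices 0..num_gpus-1 in round-robin order."""
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    plan = defaultdict(list)
    for i, name in enumerate(config_names):
        plan[i % num_gpus].append(name)
    return dict(plan)


if __name__ == "__main__":
    print(assign_round_robin(["cfg-a", "cfg-b", "cfg-c", "cfg-d", "cfg-e"], 2))
```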
### Search and AutoML Support
Built-in procedures for searching over configuration knob combinations, including Grid Search and Random Search. Easy to integrate with AutoML procedures. Native support for some popular AutoML procedures and customized automation of IC Ops coming soon.
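To make the two search strategies concrete, the sketch below enumerates knob combinations in plain Python. It does not use the RapidFire AI API: the function names and the example knobs (`learning_rate`, `lora_rank`) are assumptions, and the random search here samples each knob independently with replacement from a discrete list.

```python
import itertools
import random


def grid_search(space):
    """Return every combination of knob values (the Cartesian product)."""
    keys = list(space)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*(space[k] for k in keys))
    ]


def random_search(space, num_samples, seed=0):
    """Return num_samples configs, choosing each knob value at random."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        {key: rng.choice(values) for key, values in space.items()}
        for _ in range(num_samples)
    ]


if __name__ == "__main__":
    space = {"learning_rate": [1e-4, 5e-5], "lora_rank": [8, 16, 32]}
    print(len(grid_search(space)), "grid configs")
    print(random_search(space, num_samples=3))
```

Grid search grows multiplicatively with the number of values per knob, which is why random search is often preferred once the space gets large.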
## Directory Structure
```
rapidfireai/
├── automl/ # Search and AutoML algorithms for knob tuning
├── backend/ # Core backend components (controller, scheduler, worker)
├── db/ # Database interface and SQLite operations
├── dispatcher/ # Flask-based web API for UI communication
├── frontend/ # Frontend components (dashboard, IC Ops implementation)
├── ml/ # ML training utilities and trainer classes
├── utils/ # Utility functions and helper modules
└── experiment.py # Main experiment lifecycle management
```
## Architecture
RapidFire AI adopts a microservices-inspired, loosely coupled distributed architecture with:
- **Dispatcher**: Web API layer for UI communication
- **Database**: SQLite for state persistence
- **Controller**: Central orchestrator running in user process
- **Workers**: GPU-based training processes
- **Dashboard**: Experiment tracking and visualization dashboard
This design enables efficient resource utilization while providing a seamless user experience for AI experimentation.
## Components
### Dispatcher
The Dispatcher provides a REST API for the web UI. It can run as a single Flask app or behind Gunicorn for load balancing. It handles interactive control requests and exposes the current state of the runs in the experiment.
### Database
Uses SQLite for persistent storage of metadata about experiments, runs, and artifacts. The Controller also uses it to exchange scheduling state with Workers. It exposes a clean asynchronous interface for all DB operations, including experiment lifecycle management and run tracking.
### Controller
Runs as part of the user’s console or Notebook process. Orchestrates the entire training lifecycle including model creation, worker management, and scheduling. The `run_fit` logic handles sample preprocessing, model creation for given knob configurations, worker initialization, and continuous monitoring of training progress across distributed workers.
### Worker
Handles the actual model training and inference on the GPUs. Workers poll the Database for tasks, load dataset chunks, and execute training runs with checkpointing and progress reporting. Currently, each model is expected to fit on a single GPU at its given batch size.
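The pattern of a Controller writing tasks to SQLite and Workers polling for them can be sketched with the standard `sqlite3` module. Everything below is hypothetical: the `tasks` table, its columns, and the function names are invented for illustration and do not reflect RapidFire AI's real schema or interface.

```python
import sqlite3

# Hypothetical schema; the real RapidFire AI tables are not shown in this README.
SCHEMA = """
CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,
    run_name TEXT NOT NULL,
    chunk INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    worker_id INTEGER
)
"""


def enqueue(conn, run_name, chunk):
    """Controller side: record a task for some worker to pick up."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO tasks (run_name, chunk) VALUES (?, ?)", (run_name, chunk)
        )


def claim_next(conn, worker_id):
    """Worker side: claim the oldest pending task, or return None."""
    with conn:
        row = conn.execute(
            "SELECT id, run_name, chunk FROM tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Guarded update: only succeeds if the task is still pending.
        cur = conn.execute(
            "UPDATE tasks SET status = 'running', worker_id = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, row[0]),
        )
        return row if cur.rowcount == 1 else None


def finish(conn, task_id):
    """Worker side: mark a claimed task as done."""
    with conn:
        conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
```

A worker loop would call `claim_next` repeatedly, sleeping briefly whenever it returns `None`, and call `finish` after each chunk is trained.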
### Experiment
Manages the complete experiment lifecycle, including creation, naming conventions, and cleanup. Experiments are automatically named with unique suffixes if conflicts exist, and all experiment metadata is tracked in the Database. An experiment's running tasks are automatically cancelled when the process ends abruptly.
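The conflict-avoiding naming can be pictured as follows. The exact suffix format RapidFire AI uses is not stated here, so the `_<n>` suffix and the function name `unique_experiment_name` are assumptions chosen for the sketch.

```python
def unique_experiment_name(requested, existing):
    """Return `requested`, or `requested_<n>` with the smallest free n."""
    if requested not in existing:
        return requested
    suffix = 1
    while f"{requested}_{suffix}" in existing:
        suffix += 1
    return f"{requested}_{suffix}"


if __name__ == "__main__":
    taken = {"sft-demo", "sft-demo_1"}
    print(unique_experiment_name("sft-demo", taken))
```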
### Dashboard
A fork of MLflow that enables full tracking and visualization of all experiments and runs. It features a new panel for Interactive Control Ops that can be performed on any active runs.
## Developing with RapidFire AI
### Prerequisites
- Python 3.12.x
- Git
- Ubuntu/Debian system (for apt package manager)
```bash
# Run these commands one after the other on a fresh Ubuntu machine
# install dependencies
sudo apt update -y
# clone the repository
git clone https://github.com/RapidFireAI/rapidfireai.git
# navigate to the repository
cd ./rapidfireai
# install basic dependencies
sudo apt install -y python3-virtualenv
virtualenv -p python3 oss_venv
source oss_venv/bin/activate
pip3 install ipykernel
pip3 install jupyter
pip3 install "huggingface-hub[cli]"
export PATH="$HOME/.local/bin:$PATH"
hf auth login --token <your_token>
# checkout the develop branch
git checkout develop
# install the Python dependencies listed in requirements.txt
pip3 install -r requirements.txt
# Install correct version of vllm and flash-attn
# uv pip install vllm==0.10.1.1 --torch-backend=cu126  # or --torch-backend=cu118
# uv pip install flash-attn==1.0.9 --no-build-isolation  # or flash-attn==2.8.3
# install frontend packages
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash - && sudo apt-get install -y nodejs
npm install npm@10.5.1
sudo apt install -y yarn
# only if you hit node versioning errors: remove the previous version of node with the two commands below, then rerun the install lines above
sudo apt-get remove --purge nodejs libnode-dev libnode72 npm
sudo apt autoremove --purge
# check installations
node -v # 22.x
npm -v # 10.5.1
# still inside venv, run the start script to begin all 3 servers
chmod +x ./rapidfireai/start_dev.sh
./rapidfireai/start_dev.sh start
# run the notebook from within your IDE
# make sure the notebook is running in the oss_venv virtual environment
# head to settings in Cursor/VSCode and search for venv and add the path - $HOME/rapidfireai/oss_venv
# we cannot run a Jupyter notebook directly since there are restrictions on Jupyter being able to create child processes
# VSCode can port-forward localhost:3000 where the rf-frontend server will be running
# for port clash issues -
lsof -t -i:8080 | xargs kill -9 # dispatcher
lsof -t -i:5002 | xargs kill -9 # mlflow
lsof -t -i:3000 | xargs kill -9 # frontend
```
Raw data
{
"_id": null,
"home_page": null,
"name": "rapidfireai",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": "ai, rapidfire, rapidfireai, deep-learning, artificial-intelligence, machine-learning, mlflow, experiment-tracking",
"author": null,
"author_email": "\"RapidFire AI Inc.\" <support@rapidfire.ai>",
"download_url": "https://files.pythonhosted.org/packages/e4/60/0386b2cb1f4ff76cbc68b25f5807ac8b67b81dd9d0799ce293f3b7d81fbc/rapidfireai-0.9.10.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "RapidFire AI: Rapid Experimentation Engine for Customizing LLMs",
"version": "0.9.10",
"project_urls": {
"Homepage": "https://rapidfire.ai"
},
"split_keywords": [
"ai",
" rapidfire",
" rapidfireai",
" deep-learning",
" artificial-intelligence",
" machine-learning",
" mlflow",
" experiment-tracking"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "9ba8342f314a2b010a88c9fa0b02a7984aa0007b3156abee9ea5614c2d10e167",
"md5": "b16648c3168568ff0b01e22c11aa3984",
"sha256": "2920d5f264a68663e37318fc07791ef13902c9bbb3a29f9fa61c9874fe69d724"
},
"downloads": -1,
"filename": "rapidfireai-0.9.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b16648c3168568ff0b01e22c11aa3984",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12",
"size": 5629935,
"upload_time": "2025-09-05T22:36:42",
"upload_time_iso_8601": "2025-09-05T22:36:42.581693Z",
"url": "https://files.pythonhosted.org/packages/9b/a8/342f314a2b010a88c9fa0b02a7984aa0007b3156abee9ea5614c2d10e167/rapidfireai-0.9.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e4600386b2cb1f4ff76cbc68b25f5807ac8b67b81dd9d0799ce293f3b7d81fbc",
"md5": "3fa1d23b4b9fbb6f43e5ed498824dfe4",
"sha256": "381fe3cbf86b06226c5b475e27ed3fb2937500144e75fa4a343b15123b76b384"
},
"downloads": -1,
"filename": "rapidfireai-0.9.10.tar.gz",
"has_sig": false,
"md5_digest": "3fa1d23b4b9fbb6f43e5ed498824dfe4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12",
"size": 5524374,
"upload_time": "2025-09-05T22:36:44",
"upload_time_iso_8601": "2025-09-05T22:36:44.447979Z",
"url": "https://files.pythonhosted.org/packages/e4/60/0386b2cb1f4ff76cbc68b25f5807ac8b67b81dd9d0799ce293f3b7d81fbc/rapidfireai-0.9.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-05 22:36:44",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "rapidfireai"
}