<div align="center">
<img src="https://raw.githubusercontent.com/aihpi/aihpi-cluster/main/00_aisc/img/logo_aisc_bmftr.jpg" alt="AI Service Centre Logo" width="400">
<h1>aihpi - AI High Performance Infrastructure</h1>
</div>
A Python package for simplified distributed job submission on SLURM clusters with container support. Built on top of submitit with additional features specifically designed for AI/ML workloads.
## Installation
```bash
# Basic installation
pip install aihpi
# With experiment tracking support (quoted so shells such as zsh do not expand the brackets)
pip install "aihpi[tracking]"
# With all optional dependencies
pip install "aihpi[all]"
```
## Quick Start
```python
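from aihpi import SlurmJobExecutor, JobConfig

def my_training_function():
    """Placeholder: replace with your own training entry point."""
    ...

# Describe the resources the job needs
config = JobConfig(
    job_name="my-training",
    num_nodes=1,
    gpus_per_node=2,
    walltime="01:00:00",
    partition="gpu",
    login_node="10.130.0.6",  # Your SLURM login node IP
)

# Submit the function to the cluster
executor = SlurmJobExecutor(config)
job = executor.submit_function(my_training_function)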
```
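The `walltime` above uses SLURM's `HH:MM:SS` notation, while submitit expresses timeouts in minutes through its `timeout_min` parameter. How aihpi converts between the two is not documented here, so the following is only a sketch of one possible conversion. The helper name `walltime_to_minutes` is hypothetical and is not part of the aihpi API.

```python
def walltime_to_minutes(walltime: str) -> int:
    """Convert 'HH:MM:SS' or 'D-HH:MM:SS' to whole minutes, rounding up.

    Hypothetical helper for illustration; not part of aihpi.
    """
    days = 0
    if "-" in walltime:
        # Optional leading day count, e.g. "1-12:00:00"
        day_part, walltime = walltime.split("-", 1)
        days = int(day_part)

    parts = walltime.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected HH:MM:SS, got {walltime!r}")
    hours, minutes, seconds = (int(p) for p in parts)

    total_seconds = ((days * 24 + hours) * 60 + minutes) * 60 + seconds
    # Ceiling division so a partial minute is not silently dropped
    return -(-total_seconds // 60)


if __name__ == "__main__":
    print(walltime_to_minutes("01:00:00"))
```

Rounding up rather than down is a deliberate choice here: a job that asks for one extra second should not end up with less time than requested.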
## Features
- **Simple API**: Configure and submit jobs with minimal code
- **Command Line Interface**: `aihpi` CLI for easy job submission and management
- **Distributed Training**: Automatic setup for multi-node distributed training
- **Container Support**: First-class support for Pyxis/Enroot containers
- **Container Submission**: Submit jobs from within containers via SSH to login nodes
- **LlamaFactory Integration**: Built-in support for LlamaFactory training
- **Job Monitoring**: Real-time job status tracking and log streaming
- **Experiment Tracking**: Integration with Weights & Biases, MLflow, and local tracking
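To give a sense of what "automatic setup" for multi-node training usually involves, the sketch below derives each process's rank from environment variables that SLURM sets for every task (`SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NTASKS`, `SLURM_NODEID`). This is an assumption about the general approach, not a description of aihpi's internals. The names `DistributedInfo` and `read_slurm_env` are hypothetical.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistributedInfo(NamedTuple):
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node
    world_size: int  # total number of processes in the job
    node_rank: int   # index of this node within the allocation


def read_slurm_env(env: Optional[Mapping[str, str]] = None) -> DistributedInfo:
    """Read rank information from SLURM's per-task environment variables.

    Falls back to single-process defaults when run outside SLURM.
    Hypothetical helper for illustration; not part of aihpi.
    """
    env = os.environ if env is None else env
    return DistributedInfo(
        rank=int(env.get("SLURM_PROCID", 0)),
        local_rank=int(env.get("SLURM_LOCALID", 0)),
        world_size=int(env.get("SLURM_NTASKS", 1)),
        node_rank=int(env.get("SLURM_NODEID", 0)),
    )


if __name__ == "__main__":
    print(read_slurm_env())
```

A training script would typically pass these values to its framework's process-group initialisation; accepting the environment as an argument keeps the function easy to test off-cluster.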
## Command Line Usage
```bash
# Submit a Python job
aihpi run train.py --config config.py
# Submit with monitoring
aihpi run train.py --config config.py --monitor
# Submit distributed job
aihpi run train.py --config distributed_config.py
# Monitor a running job
aihpi monitor 12345 --follow
```
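For readers who want to check a job's state from their own scripts, the sketch below queries SLURM's standard `squeue` command directly. It is independent of how `aihpi monitor` is implemented, which this README does not describe. The function names are hypothetical.

```python
import subprocess

# SLURM states in which a job is still queued or running
ACTIVE_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING", "SUSPENDED"}


def parse_squeue_state(output: str) -> str:
    """Return the first non-empty line of squeue output, or 'UNKNOWN'."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[0] if lines else "UNKNOWN"


def is_active(state: str) -> bool:
    """True while the job is still queued or running."""
    return state in ACTIVE_STATES


def query_job_state(job_id: int) -> str:
    """Ask squeue for the state of one job (requires SLURM on PATH).

    -h suppresses the header and -o "%T" prints the long-form state.
    """
    result = subprocess.run(
        ["squeue", "-j", str(job_id), "-h", "-o", "%T"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_squeue_state(result.stdout)
```

Note that `squeue` stops listing a job some time after it finishes, so an `UNKNOWN` result can simply mean the job has already left the queue.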
## Documentation & Examples
For detailed documentation, examples, and setup instructions, visit:
- **GitHub Repository**: [aihpi/aihpi-cluster](https://github.com/aihpi/aihpi-cluster)
- **Full Documentation**: [README.md](https://github.com/aihpi/aihpi-cluster#readme)
## Requirements
- Python ≥ 3.8
- Access to a SLURM cluster
- submitit ≥ 1.4.0
## License
MIT License
---
## Acknowledgements
<div align="center">
<img src="https://raw.githubusercontent.com/aihpi/aihpi-cluster/main/00_aisc/img/logo_bmftr_de.png" alt="BMFTR Logo" width="170"/>
</div>
The [AI Service Centre Berlin Brandenburg](http://hpi.de/kisz) is funded by the [Federal Ministry of Research, Technology and Space](https://www.bmbf.de/) under the funding code 01IS22092.
Raw data
{
"_id": null,
"home_page": "https://github.com/username/aihpi",
"name": "aihpi",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "slurm, distributed, training, ai, ml, pytorch, llamafactory",
"author": "Felix Boelter",
"author_email": "Felix Boelter <felix.boelter@hpi.de>",
"download_url": "https://files.pythonhosted.org/packages/bd/4f/f77cbd80ed7a51fdbe951c700dca11a95a1f944d69eb3284db8924cd620f/aihpi-0.1.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "AI High Performance Infrastructure - Distributed job submission for SLURM clusters",
"version": "0.1.5",
"project_urls": {
"Bug Reports": "https://github.com/aihpi/aihpi-cluster/issues",
"Documentation": "https://github.com/aihpi/aihpi-cluster#readme",
"Homepage": "https://github.com/aihpi/aihpi-cluster",
"Source": "https://github.com/aihpi/aihpi-cluster"
},
"split_keywords": [
"slurm",
" distributed",
" training",
" ai",
" ml",
" pytorch",
" llamafactory"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "6448414e390ea6b09c1f4b7c54f22a30f0bae8441bc8b17a25efe333fafad40b",
"md5": "a9366cc3e639fc391edf476641b56ce7",
"sha256": "ec86651377ce24af3b4a326e1352e2595a2132840682bccdc1e558c5c38fe760"
},
"downloads": -1,
"filename": "aihpi-0.1.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a9366cc3e639fc391edf476641b56ce7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 28627,
"upload_time": "2025-09-10T19:38:07",
"upload_time_iso_8601": "2025-09-10T19:38:07.519529Z",
"url": "https://files.pythonhosted.org/packages/64/48/414e390ea6b09c1f4b7c54f22a30f0bae8441bc8b17a25efe333fafad40b/aihpi-0.1.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "bd4ff77cbd80ed7a51fdbe951c700dca11a95a1f944d69eb3284db8924cd620f",
"md5": "aa3d30440fd4e0cb9e7973ad372233eb",
"sha256": "26ae7f177ddac2ff887292e54409bec93b5d5b548209e0ce6f26d40d03b15a65"
},
"downloads": -1,
"filename": "aihpi-0.1.5.tar.gz",
"has_sig": false,
"md5_digest": "aa3d30440fd4e0cb9e7973ad372233eb",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 43730,
"upload_time": "2025-09-10T19:38:08",
"upload_time_iso_8601": "2025-09-10T19:38:08.911082Z",
"url": "https://files.pythonhosted.org/packages/bd/4f/f77cbd80ed7a51fdbe951c700dca11a95a1f944d69eb3284db8924cd620f/aihpi-0.1.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-10 19:38:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "username",
"github_project": "aihpi",
"github_not_found": true,
"lcname": "aihpi"
}