mlflow-slurm


Name: mlflow-slurm
Version: 1.0.6
Home page: https://github.com/ncsa/mlflow-slurm
Summary: Backend implementation for running MLFlow projects on Slurm
Upload time: 2024-12-07 19:41:23
Maintainer: Ben Galewsky
Requires Python: >=3.6
Keywords: mlflow
Requirements: mlflow-skinny, jinja2

# MLFlow-Slurm
A backend for executing MLFlow projects on the Slurm batch system.

## Usage
Install this package in the environment from which you will be submitting jobs.
If you are submitting jobs from within other jobs, make sure this package is
listed in your conda or pip environment.
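
In a plain pip environment, that is simply:
```shell
pip install mlflow-slurm
```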

Just list `slurm` as your `--backend` when running the project, and include a
JSON config file to control how the batch script is constructed:
```shell
mlflow run --backend slurm \
          --backend-config slurm_config.json \
          examples/sklearn_elasticnet_wine
```

It will generate a batch script named after the job id and submit it via the
Slurm `sbatch` command. It will tag the run with the Slurm JobID.

## Configure Jobs
You can set values in a JSON file to control job submission. The supported
properties in this file are:

| Config File Setting | Use                                                                                                            |
|---------------------|----------------------------------------------------------------------------------------------------------------|
| partition           | Which Slurm partition should the job run in?                                                                   |
| account             | What account name to run under                                                                                 |
| environment         | List of additional environment variables to add to the job                                                     |
| exports             | List of environment variables to export to the job                                                             |
| gpus_per_node       | On GPU partitions, how many GPUs to allocate per node                                                          |
| gres                | SLURM Generic RESources requests                                                                               |
| mem                 | Amount of memory to allocate to CPU jobs                                                                       |
| modules             | List of modules to load before starting job                                                                    |
| nodes               | Number of nodes to request from SLURM                                                                          |
| ntasks              | Number of tasks to run on each node                                                                            |
| exclusive           | Set to `true` to ensure jobs don't share a node with other jobs                                                |
| time                | Maximum wall time the job may run                                                                              |
| sbatch-script-file  | Name of batch file to be produced. Leave blank to have service generate a script file name based on the run ID |
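
For illustration, a config file combining several of these settings might look
like the sketch below. All values are site-specific placeholders, and the exact
value formats (e.g. `modules` as a list, `time` as an `HH:MM:SS` string) are
assumptions inferred from the table above rather than documented behavior:
```json
{
  "partition": "gpu",
  "account": "my-project-account",
  "nodes": 1,
  "ntasks": 1,
  "gpus_per_node": 2,
  "mem": "16G",
  "time": "04:00:00",
  "modules": ["anaconda3", "cuda"],
  "exclusive": false
}
```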

## Sequential Worker Jobs
There are occasions where a job can't finish within the maximum allowable wall
time. If you are able to write out a checkpoint file, you can use sequential
worker jobs to continue the job where it left off. This is useful for training
deep learning models and other long-running jobs.

To use this, you just need to provide a parameter to the `mlflow run` command:
```shell
mlflow run --backend slurm -c ../../slurm_config.json -P sequential_workers=3 .
```
This will submit the job as normal, but also submit 3 additional jobs that
each depend on the previous job. As soon as the first job terminates, the next
job will start. This will continue until all jobs have completed.
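
This chaining presumably corresponds to Slurm's job-dependency mechanism. A
hand-rolled sketch of the equivalent behavior (the `train.sbatch` script name
is hypothetical) would look roughly like:
```shell
# Submit the first job and capture its numeric job ID.
jid1=$(sbatch --parsable train.sbatch)

# afterany: start each follow-up job once the previous one terminates,
# regardless of its exit status -- matching "as soon as the first job
# terminates, the next job will start".
jid2=$(sbatch --parsable --dependency=afterany:$jid1 train.sbatch)
jid3=$(sbatch --parsable --dependency=afterany:$jid2 train.sbatch)
```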

## Development
The Slurm Docker deployment is handy for testing and development. You can start
up a Slurm environment with the included docker-compose file.
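
For example, from the repository root (the compose file's exact name and
location are assumptions here; older installs may need `docker-compose`
instead of `docker compose`):
```shell
# Bring up the test Slurm cluster in the background.
docker compose up -d

# Tear it down when finished.
docker compose down
```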
