ngcpp


* Name: ngcpp
* Version: 0.2.6
* Summary: A toolkit for ngc
* Upload time: 2023-09-26 21:21:10
* Author: Qinsheng
* Requires Python: >=3.8,<3.12
* License: MIT
* Keywords: jam, experiment, ngc
* Requirements: No requirements were recorded.

# Install

```shell
pip install ngcpp
sudo apt install fzf # interactive fuzzy finder

# if the apt-installed fzf does not work or behaves strangely, install it from the upstream repository
git clone --depth 1 https://github.com/junegunn/fzf.git ~/.fzf
~/.fzf/install
```
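
Because fzf can be installed in two ways, it can help to confirm which executable is actually picked up from `PATH` before using the interactive commands. The following is a minimal sketch using only the Python standard library; the helper name `find_fzf` is made up for illustration and is not part of ngcpp.

```python
import shutil
from typing import Optional


def find_fzf() -> Optional[str]:
    """Return the path of the `fzf` executable found on PATH, or None."""
    return shutil.which("fzf")


path = find_fzf()
if path is None:
    print("fzf not found on PATH; install it with one of the methods above")
else:
    print(f"fzf found at {path}")
```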

# Features

## cluster `ngc_cluster --help`

* `ngc_cluster usage --help`: List user usage
* `ngc_cluster hang --help`: List all of your hanging jobs
* `ngc_cluster list --help`: List your jobs in one cluster
* `ngc_cluster alias --help`: List all available cluster aliases

## job `ngc_job --help`

* `ngc_job kill --help`: Kill your jobs interactively
* `ngc_job result --help`: Download results for your jobs interactively
* `ngc_job bash --help`: Run bash inside one selected job (supports auto-resuming)


## resume `ngc_resume --help`


### Automated Job Resubmission

Use the polling mechanism to automatically resubmit your job under the following circumstances:

* The job is terminated by the system.
* The job encounters a failure.
* The job hangs.

This ensures that critical jobs are consistently reattempted.
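
The polling idea can be sketched as a loop that checks every tracked job at a fixed interval and relaunches those in a bad state. This is only an illustration under stated assumptions, not the actual `ngc_resume` implementation: the status labels, the function `poll_and_resubmit`, and the `get_status` and `resubmit` callables are all hypothetical placeholders.

```python
import time
from typing import Callable, Dict

# Hypothetical status labels; the real NGC job states may be named differently.
RESUBMIT_STATES = {"KILLED_BY_SYSTEM", "FAILED", "HANGING"}


def poll_and_resubmit(
    jobs: Dict[str, str],
    get_status: Callable[[str], str],
    resubmit: Callable[[str], None],
    interval_s: float = 60.0,
    max_rounds: int = 1,
) -> int:
    """Check each tracked job once per round and resubmit it when needed.

    `jobs` maps a job name to the command that launches it. `get_status` and
    `resubmit` stand in for whatever queries and launches jobs on the cluster.
    Returns the number of resubmissions performed.
    """
    resubmitted = 0
    for round_idx in range(max_rounds):
        for name, cmd in jobs.items():
            if get_status(name) in RESUBMIT_STATES:
                resubmit(cmd)
                resubmitted += 1
        # Wait between rounds, but not after the last one.
        if round_idx + 1 < max_rounds:
            time.sleep(interval_s)
    return resubmitted
```

Passing the status query and the launcher as callables keeps the loop independent of any particular CLI, so it can be exercised with plain functions before being wired to real commands.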

### How to Use

To track a specific job, define it in the `ngc_resume.yaml` file as follows:

```yaml
"10": # Cluster alias, e.g., prd10
    ml-model.test_unet: # Job name; ensure that the job name matches the `--name` argument in the command line
        ngc batch run --name "ml-model.test_unet" --priority HIGH --order 50 --preempt RUNONCE --min-timeslice 0s --total-runtime 1209600s --ace nv-us-west-2 --instance dgxa100.40g.8.norm --commandline "sleep 10h" --result /result --array-type "PYTORCH" --replicas "18" --image "another_docker_image" --org your_org --team your_team
    ml-model.test_another_unet:
        # Define another job here

"77": # Cluster alias, e.g., prd77
    job_name: cmd
```

### Key Considerations

* Ensure that job names within one cluster are unique.
* Ensure that the job name **matches** the `--name` argument in the command line.
* The `cmd` section can be copied from `https://bc.ngc.nvidia.com/jobs/{job_id}?tab=overview`.
* We determine hanging jobs by checking their TensorCore usage.
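
The name-matching rule above can be checked mechanically before starting the tracker. The sketch below assumes the YAML file has already been parsed into a nested dictionary (cluster alias, then job name, then command); the helpers `extract_name` and `find_name_mismatches` are hypothetical and not part of ngcpp, and the commands are shortened versions of the example above.

```python
import shlex
from typing import Dict, List, Optional


def extract_name(cmd: str) -> Optional[str]:
    """Return the value passed to `--name` in a command line, or None if absent."""
    tokens = shlex.split(cmd)
    for i, tok in enumerate(tokens):
        if tok == "--name" and i + 1 < len(tokens):
            return tokens[i + 1]
    return None


def find_name_mismatches(config: Dict[str, Dict[str, Optional[str]]]) -> List[str]:
    """List 'cluster/job' entries whose key differs from `--name` in their command.

    Entries with no command at all (e.g. a placeholder with only a comment)
    are reported as well.
    """
    problems = []
    for cluster, jobs in config.items():
        for job_name, cmd in (jobs or {}).items():
            if not cmd or extract_name(cmd) != job_name:
                problems.append(f"{cluster}/{job_name}")
    return problems


# Stand-in for the parsed contents of ngc_resume.yaml (commands shortened).
config = {
    "10": {
        "ml-model.test_unet": 'ngc batch run --name "ml-model.test_unet" --commandline "sleep 10h"',
        "ml-model.test_another_unet": 'ngc batch run --name "ml-model.typo" --commandline "sleep 1h"',
    },
}
print(find_name_mismatches(config))  # prints ['10/ml-model.test_another_unet']
```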

### TODO

- [ ] Implement a feature to automatically copy the command.

# dev

## install poetry

