| Field | Value |
|---|---|
| Name | ngcpp |
| Version | 0.2.6 |
| Summary | A toolkit for ngc |
| upload_time | 2023-09-26 21:21:10 |
| docs_url | None |
| author | Qinsheng |
| requires_python | >=3.8,<3.12 |
| license | MIT |
| keywords | jam, experiment, ngc |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# Install
```shell
pip install ngcpp
sudo apt install fzf # interactive fuzzy finder
# if the apt-installed fzf does not work or behaves strangely
git clone --depth 1 https://github.com/junegunn/fzf.git ~/.fzf
~/.fzf/install
```
# Feature
## cluster `ngc_cluster --help`
* `ngc_cluster usage --help`: List user usage
* `ngc_cluster hang --help`: List all your hanging jobs
* `ngc_cluster list --help`: List your jobs in one cluster
* `ngc_cluster alias --help`: List all available cluster aliases
## job `ngc_job --help`
* `ngc_job kill --help`: Kill your jobs interactively
* `ngc_job result --help`: Download results for your jobs interactively
* `ngc_job bash --help`: Exec bash for one selected job (supports auto-resuming)
## resume `ngc_resume --help`
### Automated Job Resubmission
Utilize a "polling" mechanism to automatically resubmit your job under the following circumstances:
* The job is terminated by the system.
* The job encounters a failure.
* The job hangs.

With this approach, critical jobs are consistently reattempted.
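
The polling idea described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: the status labels in `RESUBMIT_STATES`, the function `poll_and_resubmit`, and the `get_status` / `submit` callables are all hypothetical names introduced here, standing in for whatever `ngc_resume` does internally to query and launch jobs.

```python
import time
from typing import Callable, Dict

# Hypothetical status labels; the states reported by the real tool may differ.
RESUBMIT_STATES = {"KILLED_BY_SYSTEM", "FAILED", "HANGING"}


def poll_and_resubmit(
    jobs: Dict[str, str],
    get_status: Callable[[str], str],
    submit: Callable[[str], None],
    interval_s: float = 60.0,
    max_rounds: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll every tracked job and resubmit the ones that need it.

    `jobs` maps a job name to the command that launches it.
    `get_status` and `submit` are supplied by the caller (assumed helpers).
    Returns the number of resubmissions performed.
    """
    resubmitted = 0
    for round_idx in range(max_rounds):
        for name, cmd in jobs.items():
            # Resubmit only jobs that were killed, failed, or hang.
            if get_status(name) in RESUBMIT_STATES:
                submit(cmd)
                resubmitted += 1
        # Wait between polling rounds, but not after the last one.
        if round_idx < max_rounds - 1:
            sleep(interval_s)
    return resubmitted
```

Passing the status lookup and the submit action in as callables keeps the loop testable without a cluster; a real run would plug in functions that talk to NGC.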
### How to Use
To track a specific job, define it in the `ngc_resume.yaml` file as follows:
```yaml
"10": # Cluster alias, e.g., prd10
  ml-model.test_unet: # Job name; ensure that the job name matches the `--name` argument in the command line
    ngc batch run --name "ml-model.test_unet" --priority HIGH --order 50 --preempt RUNONCE --min-timeslice 0s --total-runtime 1209600s --ace nv-us-west-2 --instance dgxa100.40g.8.norm --commandline "sleep 10h" --result /result --array-type "PYTORCH" --replicas "18" --image "another_docker_image" --org your_org --team your_team
  ml-model.test_another_unet:
    # Define another job here
"77": # Cluster alias, e.g., prd77
  job_name: cmd
```
### Key Considerations:
* Ensure that job names within one cluster are unique.
* Ensure that the job name **matches** the `--name` argument in the command line.
* The `cmd` section can be copied from `https://bc.ngc.nvidia.com/jobs/{job_id}?tab=overview`.
* We determine hanging jobs by checking their TensorCore usage.
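
The name-matching rule above is easy to get wrong when copying commands, so a small pre-flight check can help. The sketch below is not part of ngcpp: `extract_name` and `find_mismatches` are hypothetical helpers, and the input is assumed to be the already-parsed content of `ngc_resume.yaml` (for example via PyYAML's `yaml.safe_load`), i.e. a mapping of cluster alias to `{job_name: cmd}`.

```python
import shlex
from typing import Dict, List, Optional


def extract_name(cmd: str) -> Optional[str]:
    """Return the value passed to `--name` in a command line, or None if absent."""
    tokens = shlex.split(cmd)
    for i, tok in enumerate(tokens):
        if tok == "--name" and i + 1 < len(tokens):
            return tokens[i + 1]
        if tok.startswith("--name="):
            return tok[len("--name="):]
    return None


def find_mismatches(config: Dict[str, Dict[str, Optional[str]]]) -> List[str]:
    """List jobs whose YAML key differs from the `--name` in their command."""
    problems = []
    for cluster, jobs in config.items():
        for job_name, cmd in (jobs or {}).items():
            # A job without a command yields None, which never matches.
            found = extract_name(cmd or "")
            if found != job_name:
                problems.append(f"{cluster}/{job_name}: --name is {found!r}")
    return problems
```

Uniqueness within a cluster needs no extra check here, because job names are mapping keys and a parsed mapping cannot hold the same key twice.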
### TODO
- [ ] Implement a feature to automatically copy the command.
# dev
## install poetry
##
Raw data
{
"_id": null,
"home_page": "",
"name": "ngcpp",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8,<3.12",
"maintainer_email": "",
"keywords": "jam,experiment,ngc",
"author": "Qinsheng",
"author_email": "qsh.zh27@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/14/59/808ed8bd64a6c2338f675f9a83be6ec1fdf8c59e66c48c9e448f49fe6527/ngcpp-0.2.6.tar.gz",
"platform": null,
"description": "# Install\n\n```shell\npip install ngcpp\nsudo apt install fzf # iteractive fuzzy finder\n\n# if apt installed fzf does not work or give strange behavior\ngit clone --depth 1 https://github.com/junegunn/fzf.git ~/.fzf\n~/.fzf/install\n```\n\n# Feature\n\n## cluster `ngc_cluster --help`\n\n* `ngc_cluster usage --help`: List user usage\n* `ngc_cluster hang --help`: List your all hanging jobs\n* `ngc_cluster list --help`: List your jobs in one cluster\n* `ngc_cluster alias --help`: List all available cluster aliases\n\n## job `ngc_job --help`\n\n* `ngc_job kill --help`: kill your jobs interactively\n* `ngc_job result --help`: download results for your jobs interactively\n* `ngc_job bash --help`: exec bash for one selected job~(support autoresuming)\n\n\n## resume `ngc_resume --help`\n\n\n### Automated Job Resubmission\n\nUtilize a \"polling\" mechanism to automatically resubmit your job under the following circumstances:\n* The job is terminated by the system.\n* The job encounters a failure.\n* The job hangs.\nBy implementing this approach, you ensure that critical jobs are consistently reattempted,\n\n### How to Use\n\nTo track a specific job, define it in the `ngc_resume.yaml` file as follows:\n\n```yaml\n\"10\": # Cluster alias, e.g., prd10\n ml-model.test_unet: # Job name; ensure that the job name matches the `--name` argument in the command line\n ngc batch run --name \"ml-model.test_unet\" --priority HIGH --order 50 --preempt RUNONCE --min-timeslice 0s --total-runtime 1209600s --ace nv-us-west-2 --instance dgxa100.40g.8.norm --commandline \"sleep 10h\" --result /result --array-type \"PYTORCH\" --replicas \"18\" --image \"another_docker_image\" --org your_org --team your_team\n ml-model.test_another_unet:\n # Define another job here\n\n\"77\": # Cluster alias, e.g., prd77\n job_name: cmd\n```\n\n### Key Considerations:\n\n* Ensure job names in one cluster is unique\n* Ensure that the job name **matches** the --name argument in the command 
line.\n* The `cmd` section can be copied from `https://bc.ngc.nvidia.com/jobs/{job_id}?tab=overview`.\n* We determine hanging jobs by checking their TensorCore usage.\n\n### TODO\n\n- [ ] Implement a feature to automatically copy the command.\n\n# dev\n\n## install poetry\n\n##\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A toolkit for ngc",
"version": "0.2.6",
"project_urls": null,
"split_keywords": [
"jam",
"experiment",
"ngc"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8422c54904206c7067b3759f3df437b8d1283e1d911642e37de8fad83d6f7d6a",
"md5": "a9ad00d97ff205d2228c677307e16b3a",
"sha256": "55b1d30db6c1cc5057ca49d2d7c8466ffe4de0a77bf92aff9cb2465b0a03f897"
},
"downloads": -1,
"filename": "ngcpp-0.2.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a9ad00d97ff205d2228c677307e16b3a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8,<3.12",
"size": 18118,
"upload_time": "2023-09-26T21:21:08",
"upload_time_iso_8601": "2023-09-26T21:21:08.598662Z",
"url": "https://files.pythonhosted.org/packages/84/22/c54904206c7067b3759f3df437b8d1283e1d911642e37de8fad83d6f7d6a/ngcpp-0.2.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1459808ed8bd64a6c2338f675f9a83be6ec1fdf8c59e66c48c9e448f49fe6527",
"md5": "b38f2813eae60a054ad10d47b08c980e",
"sha256": "27087c6a529c9a871f62ac63bcb3444fda133e7845f182704d1759d20d0fa816"
},
"downloads": -1,
"filename": "ngcpp-0.2.6.tar.gz",
"has_sig": false,
"md5_digest": "b38f2813eae60a054ad10d47b08c980e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8,<3.12",
"size": 14280,
"upload_time": "2023-09-26T21:21:10",
"upload_time_iso_8601": "2023-09-26T21:21:10.219878Z",
"url": "https://files.pythonhosted.org/packages/14/59/808ed8bd64a6c2338f675f9a83be6ec1fdf8c59e66c48c9e448f49fe6527/ngcpp-0.2.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-09-26 21:21:10",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "ngcpp"
}