# Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl)
This repository holds PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library (oneCCL).
## Introduction
[PyTorch](https://github.com/pytorch/pytorch) is an open-source machine learning framework.
[Intel® oneCCL](https://github.com/oneapi-src/oneCCL) (collective communications library) is a library for efficient distributed deep learning training, implementing collectives such as `allreduce`, `allgather`, and `alltoall`. For more information on oneCCL, please refer to the [oneCCL documentation](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html).
The `oneccl_bindings_for_pytorch` module implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup. It currently works only on Linux.
## Capability
The table below shows which functions are available for use with CPU / Intel dGPU tensors.
| | CPU | GPU |
| :--------------- | :---: | :---: |
| `send` | × | √ |
| `recv` | × | √ |
| `broadcast` | √ | √ |
| `all_reduce` | √ | √ |
| `reduce` | √ | √ |
| `all_gather` | √ | √ |
| `gather` | √ | √ |
| `scatter` | × | × |
| `reduce_scatter` | √ | √ |
| `all_to_all` | √ | √ |
| `barrier` | √ | √ |
## PyTorch API Alignment
We recommend using Anaconda as your Python package management system. The following table lists the corresponding branches (tags) of `oneccl_bindings_for_pytorch` and the supported PyTorch versions.
| `torch` | `oneccl_bindings_for_pytorch` |
| :-------------------------------------------------------------: | :-----------------------------------------------------------------------: |
| `master` | `master` |
| [v2.2.0](https://github.com/pytorch/pytorch/tree/v2.2.0) | [ccl_torch2.2.0+cpu](https://github.com/intel/torch-ccl/tree/ccl_torch2.2.0%2Bcpu) |
| [v2.1.0](https://github.com/pytorch/pytorch/tree/v2.1.0) | [ccl_torch2.1.0+cpu](https://github.com/intel/torch-ccl/tree/ccl_torch2.1.0%2Bcpu) |
| [v2.0.1](https://github.com/pytorch/pytorch/tree/v2.0.1) | [ccl_torch2.0.100](https://github.com/intel/torch-ccl/tree/ccl_torch2.0.100) |
| [v1.13](https://github.com/pytorch/pytorch/tree/v1.13) | [ccl_torch1.13](https://github.com/intel/torch-ccl/tree/ccl_torch1.13) |
| [v1.12.1](https://github.com/pytorch/pytorch/tree/v1.12.1) | [ccl_torch1.12.100](https://github.com/intel/torch-ccl/tree/ccl_torch1.12.100) |
| [v1.12.0](https://github.com/pytorch/pytorch/tree/v1.12.0) | [ccl_torch1.12](https://github.com/intel/torch-ccl/tree/ccl_torch1.12) |
| [v1.11.0](https://github.com/pytorch/pytorch/tree/v1.11.0) | [ccl_torch1.11](https://github.com/intel/torch-ccl/tree/ccl_torch1.11) |
| [v1.10.0](https://github.com/pytorch/pytorch/tree/v1.10.0) | [ccl_torch1.10](https://github.com/intel/torch-ccl/tree/ccl_torch1.10) |
| [v1.9.0](https://github.com/pytorch/pytorch/tree/v1.9.0) | [ccl_torch1.9](https://github.com/intel/torch-ccl/tree/ccl_torch1.9) |
| [v1.8.1](https://github.com/pytorch/pytorch/tree/v1.8.1) | [ccl_torch1.8](https://github.com/intel/torch-ccl/tree/ccl_torch1.8) |
| [v1.7.1](https://github.com/pytorch/pytorch/tree/v1.7.1) | [ccl_torch1.7](https://github.com/intel/torch-ccl/tree/ccl_torch1.7) |
| [v1.6.0](https://github.com/pytorch/pytorch/tree/v1.6.0) | [ccl_torch1.6](https://github.com/intel/torch-ccl/tree/ccl_torch1.6) |
| [v1.5-rc3](https://github.com/pytorch/pytorch/tree/v1.5.0-rc3) | [beta09](https://github.com/intel/torch-ccl/tree/beta09) |
Usage details can be found in the README of the corresponding branch. The following sections describe usage for this version; if you want to use another version of torch-ccl, please check out the corresponding branch (tag). For pytorch-1.5.0-rc3, [#PR28068](https://github.com/pytorch/pytorch/pull/28068) and [#PR32361](https://github.com/pytorch/pytorch/pull/32361) are needed to dynamically register the external ProcessGroup and enable the `alltoall` collective communication primitive. A patch file covering these two PRs is available in the `patches` directory and can be applied directly.
## Requirements
- Python 3.8 or later and a C++17 compiler
- PyTorch v2.2.0
## Build Option List
The following build options are supported in Intel® oneCCL Bindings for PyTorch*.
| Build Option | Default Value | Description |
| :---------------------------------- | :------------- | :-------------------------------------------------------------------------------------------------- |
| COMPUTE_BACKEND                     |                | Set the oneCCL `COMPUTE_BACKEND`; set to `dpcpp` to use the DPC++ compiler and enable support for Intel XPU |
| USE_SYSTEM_ONECCL                   | OFF            | Use the oneCCL library installed on the system |
| CCL_PACKAGE_NAME                    | oneccl-bind-pt | Set the wheel name |
| ONECCL_BINDINGS_FOR_PYTORCH_BACKEND | cpu            | Set the backend |
| CCL_SHA_VERSION                     | False          | Add the git HEAD SHA version to the wheel name |
## Launch Option List
The following launch options are supported in Intel® oneCCL Bindings for PyTorch*.
| Launch Option | Default Value | Description |
| :--------------------------------------- | :------------ | :-------------------------------------------------------------------- |
| ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE  | 0             | Set the verbose level in oneccl_bindings_for_pytorch |
| ONECCL_BINDINGS_FOR_PYTORCH_ENV_WAIT_GDB | 0             | Set to 1 to force oneccl_bindings_for_pytorch to wait for GDB to attach |
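As a hedged illustration of these launch options (assuming a two-rank `mpirun` launch and a placeholder script name `example.py`), verbose logging could be enabled for a single run like this:

```bash
# Enable verbose logging from oneccl_bindings_for_pytorch for this launch
export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1
mpirun -n 2 python example.py
```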
## Installation
### Install from Source
1. Clone the `oneccl_bindings_for_pytorch` repository.
```bash
git clone https://github.com/intel/torch-ccl.git && cd torch-ccl
git submodule sync
git submodule update --init --recursive
```
2. Install `oneccl_bindings_for_pytorch`
```bash
# for CPU Backend Only
python setup.py install
# for XPU Backend: use DPC++ Compiler to enable support for Intel XPU
# build with oneCCL from third party
COMPUTE_BACKEND=dpcpp python setup.py install
# build without oneCCL
export INTELONEAPIROOT=${HOME}/intel/oneapi
USE_SYSTEM_ONECCL=ON COMPUTE_BACKEND=dpcpp python setup.py install
```
### Install PreBuilt Wheel
Wheel files are available for the following Python versions.
| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | Python 3.11 |
| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: | :---------: |
| 2.2.0 | | | √ | √ | √ | √ |
| 2.1.0 | | | √ | √ | √ | √ |
| 2.0.100 | | | √ | √ | √ | √ |
| 1.13 | | √ | √ | √ | √ | |
| 1.12.100 | | √ | √ | √ | √ | |
| 1.12.0 | | √ | √ | √ | √ | |
| 1.11.0 | | √ | √ | √ | √ | |
| 1.10.0 | √ | √ | √ | √ | | |
```bash
python -m pip install oneccl_bind_pt==2.0.100 -f https://developer.intel.com/ipex-whl-stable-xpu
```
### Runtime Dynamic Linking
- If oneccl_bindings_for_pytorch is built without oneCCL (using the oneCCL installed on the system), dynamically link oneCCL from the oneAPI Base Toolkit (recommended usage):
```bash
source $basekit_root/ccl/latest/env/vars.sh
```
Note: Make sure you have installed the [oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html#base-kit) when using Intel® oneCCL Bindings for PyTorch* on Intel® GPUs.
- If oneccl_bindings_for_pytorch is built with oneCCL from the third-party submodule or installed from a prebuilt wheel:
To dynamically link the oneCCL and Intel MPI libraries:
```bash
source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh
```
To dynamically link oneCCL only (not including Intel MPI):
```bash
source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/vars.sh
```
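To check which installation these scripts are sourced from, the bindings expose their install directory as `torch_ccl.cwd` (the same attribute used in the commands above); a quick sanity check might look like:

```bash
# Print the directory containing the bundled env/ scripts
python -c "import oneccl_bindings_for_pytorch as torch_ccl; print(torch_ccl.cwd)"
```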
## Usage
example.py
```python
import os
import torch.nn.parallel
import torch.distributed as dist
import oneccl_bindings_for_pytorch
...
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))
backend = 'ccl'
dist.init_process_group(backend, ...)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d my size = %d" % (my_rank, my_size))
...
model = torch.nn.parallel.DistributedDataParallel(model, ...)
...
```
(When oneccl_bindings_for_pytorch is built without oneCCL, source the system oneCCL and, if needed, MPI environments before launching:)
```bash
source $basekit_root/ccl/latest/env/vars.sh
source $basekit_root/mpi/latest/env/vars.sh
mpirun -n <N> -ppn <PPN> -f <hostfile> python example.py
```
## Performance Debugging
To debug the performance of communication primitives, PyTorch's [Autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler)
can be used to inspect the time spent inside oneCCL calls.
Example:
profiling.py
```python
import torch.nn.parallel
import torch.distributed as dist
import oneccl_bindings_for_pytorch
import os
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))
backend = 'ccl'
dist.init_process_group(backend)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d my size = %d" % (my_rank, my_size))
x = torch.ones([2, 2])
y = torch.ones([4, 4])
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for _ in range(10):
        dist.all_reduce(x)
        dist.all_reduce(y)
dist.barrier()
print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))
```
```bash
mpirun -n 2 -l python profiling.py
```
```bash
[0] my rank = 0 my size = 2
[0] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[0] Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls Input Shapes
[0] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[0] oneccl_bindings_for_pytorch::allreduce 91.41% 297.900ms 91.41% 297.900ms 29.790ms 10 [[2, 2]]
[0] oneccl_bindings_for_pytorch::wait::cpu::allreduce 8.24% 26.845ms 8.24% 26.845ms 2.684ms 10 [[2, 2], [2, 2]]
[0] oneccl_bindings_for_pytorch::wait::cpu::allreduce 0.30% 973.651us 0.30% 973.651us 97.365us 10 [[4, 4], [4, 4]]
[0] oneccl_bindings_for_pytorch::allreduce 0.06% 190.254us 0.06% 190.254us 19.025us 10 [[4, 4]]
[0] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[0] Self CPU time total: 325.909ms
[0]
[1] my rank = 1 my size = 2
[1] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[1] Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls Input Shapes
[1] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[1] oneccl_bindings_for_pytorch::allreduce 96.03% 318.551ms 96.03% 318.551ms 31.855ms 10 [[2, 2]]
[1] oneccl_bindings_for_pytorch::wait::cpu::allreduce 3.62% 12.019ms 3.62% 12.019ms 1.202ms 10 [[2, 2], [2, 2]]
[1] oneccl_bindings_for_pytorch::allreduce 0.33% 1.082ms 0.33% 1.082ms 108.157us 10 [[4, 4]]
[1] oneccl_bindings_for_pytorch::wait::cpu::allreduce 0.02% 56.505us 0.02% 56.505us 5.651us 10 [[4, 4], [4, 4]]
[1] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[1] Self CPU time total: 331.708ms
[1]
```
## Known Issues
For point-to-point communication, directly calling `dist.send`/`dist.recv` after initializing the process group in a launch script will trigger a runtime error. In the current implementation, all ranks of the group are expected to participate in this call in order to create the communicators, whereas `dist.send`/`dist.recv` involves only a pair of ranks. As a result, `dist.send`/`dist.recv` should be used after a collective call, which ensures all ranks' participation. A solution that supports calling `dist.send`/`dist.recv` directly after process group initialization is still under investigation.
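The workaround above can be sketched as follows. This is an illustrative example only: it uses the `gloo` backend and a single-process setup so it runs anywhere, standing in for `backend='ccl'` with a multi-rank `mpirun` launch as in the Usage section.

```python
import os
import torch
import torch.distributed as dist

# Stand-in setup for illustration; with oneccl_bindings_for_pytorch you
# would use backend='ccl' and launch multiple ranks via mpirun.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29501')
dist.init_process_group('gloo', rank=0, world_size=1)

x = torch.ones(2, 2)
# Run a collective FIRST: every rank participates, so the
# communicators get created without error.
dist.all_reduce(x)

# Only after a collective is point-to-point safe to use.
if dist.get_world_size() > 1:
    if dist.get_rank() == 0:
        dist.send(x, dst=1)
    else:
        dist.recv(x, src=0)

print(x.sum().item())  # 4.0 with a single rank (all_reduce is a no-op)
dist.destroy_process_group()
```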
## License
[BSD License](https://github.com/intel/torch-ccl/blob/master/LICENSE)