# Deep Learning Communication (DLcomm) Benchmark
**DLComm** is a communication benchmark designed for Deep Learning and AI workloads. Collective communication performance is often the primary bottleneck in AI training, inference, reasoning, and large-scale applications. DLComm emulates the communication patterns of the latest large language models (LLMs) and AI applications at scale, specifically targeting deployments of 50,000 GPUs and beyond.
The benchmark is provided as an executable that can be configured to test various communication patterns within different AI distributed runtime frameworks. It uses a modular design to support all levels of communicator groups across GPUs, with flexible configurations for payload sizes, AI frameworks, and collective communication backends. DLComm enables testing on diverse systems, supports modifying scale-up and scale-out algorithms, and verifies correctness after communication operations.
Unlike traditional communication benchmarks, DLComm is built with the philosophy of reflecting real-world communication performance of the application as accurately as possible. It captures the interplay between Python runtimes, AI frameworks, and collective communication libraries (CCL) to provide insights that are directly relevant to actual AI workloads.
The GIF below shows a simple model of how different collective communications are performed over a group of GPUs: the x-axis is `num_gpus_per_node` and the y-axis is `num_compute_nodes`. Each square is a GPU on a compute node, and each set of blinking bright rectangles represents a different collective executing in order.

## Installation

```bash
pip install -r requirements.txt
pip install DLcomm
```
## Running the benchmark
## YAML configuration file
Workload characteristics for DLComm are specified in a YAML configuration file. The main configuration file is located at `dl_comm/config/config.yaml`, and a sample configuration is also available at `examples/config.yaml` for reference.
Below is an example configuration file:
```yaml
framework: pytorch        # tensorflow / jax / titan / monarch
ccl_backend: ccl          # rccl / nccl / xccl (Note: PyTorch 2.7+ users should use 'xccl' instead of 'ccl' for Intel oneCCL)
ccl_debug: on             # on / off - enables CCL debug logging and algorithm selection reporting
use_profiler: unitrace
barrier: on               # on / off - on: adds an MPI barrier before timer printing for accurate timing; off: only rank 0 prints

comm_group:
  mode: combined          # within_node / across_node / combined / flatview -> only one of the four is used

  flatview:
    num_compute_nodes: 2
    num_gpus_per_node: 12
    gpu_ids_per_node: [0,1,2,3,4,5,6,7,8,9,10,11]
    collective:
      name: allgather             # allgather / reducescatter / broadcast
      op: sum                     # max / min / prod / sum
      scale_up_algorithm: topo
      scale_out_algorithm: ring   # rabinseifner
      iterations: 5
      payload:
        dtype: bfloat16           # float64 / int32 / int64 / bfloat16 / float8 / float32
        count: 1024
        buffer_size: 1KB          # in bytes, e.g. 4096 -> float32 (4 B) x 1024 elements
      verify_correctness: on

  combined:
    within_node:
      num_compute_nodes: 2
      num_gpus_per_node: 12
      gpu_ids_per_node: [0,1,2,3,4,5,6,7,8,9,10,11]
      collective:
        name: allgather           # allgather / reducescatter / broadcast
        op: sum                   # max / min / prod / sum
        scale_up_algorithm: ring
        scale_out_algorithm: ring # rabinseifner
        iterations: 2
        payload:
          dtype: bfloat16         # float64 / int32 / int64 / bfloat16 / float8 / float32
          count: 1024
          buffer_size: 1KB        # in bytes, e.g. 4096 -> float32 (4 B) x 1024 elements
        verify_correctness: on

    across_node:
      num_compute_nodes: 2
      num_gpus_per_node: 3
      gpu_ids_per_node: [0,1,3]
      collective:
        name: alltoall            # allgather / reducescatter / broadcast
        op: sum                   # max / min / prod / sum
        scale_up_algorithm: ring
        scale_out_algorithm: ring # rabinseifner
        iterations: 4
        payload:
          dtype: bfloat16         # float64 / int32 / int64 / bfloat16 / float8 / float32
          count: 1024
          buffer_size: 1KB        # in bytes, e.g. 4096 -> float32 (4 B) x 1024 elements
        verify_correctness: on
```
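The `payload` block determines the message size per rank: as the inline comment suggests, `buffer_size` is simply the element size of `dtype` multiplied by `count` (for example, 1024 float32 elements of 4 bytes each give 4096 bytes). Below is a minimal sketch of that arithmetic using PyTorch dtype sizes; the helper name is illustrative and not part of DLcomm.

```python
import torch

def payload_bytes(dtype: torch.dtype, count: int) -> int:
    """Payload size in bytes: element size of dtype times element count."""
    return torch.tensor([], dtype=dtype).element_size() * count

print(payload_bytes(torch.float32, 1024))   # 4096 bytes (4 KB)
print(payload_bytes(torch.bfloat16, 1024))  # 2048 bytes (2 KB)
```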
### Important Note for PyTorch Users
**Backend Naming**: The value to use for the `ccl_backend` field depends on your PyTorch version:
- **PyTorch < 2.7**: Use `ccl_backend: ccl` for Intel oneCCL
- **PyTorch 2.7+**: Use `ccl_backend: xccl` for Intel oneCCL
Make sure to use the correct backend name for your PyTorch version to avoid initialization errors.
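For reference, the sketch below shows roughly how the backend choice maps onto `torch.distributed` initialization; it is illustrative, not DLcomm's internal code. On pre-2.7 PyTorch the `ccl` backend is registered by importing Intel's `oneccl_bindings_for_pytorch`, while PyTorch 2.7+ ships the native `xccl` backend.

```python
import torch
import torch.distributed as dist

# Pick the backend name expected by the installed PyTorch version.
major, minor = (int(v) for v in torch.__version__.split(".")[:2])
backend = "xccl" if (major, minor) >= (2, 7) else "ccl"

if backend == "ccl":
    # Importing the bindings registers the 'ccl' backend with torch.distributed (pre-2.7 PyTorch).
    import oneccl_bindings_for_pytorch  # noqa: F401

# Rank, world size, and rendezvous info are normally supplied by the launcher
# (e.g. MPI or torchrun environment variables).
dist.init_process_group(backend=backend)
```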
## How to contribute
We welcome contributions from the community to the benchmark code.
If you would like to contribute, please open an issue at https://github.com/argonne-lcf/DLcomm_benchmark/issues and contact the ALCF DLComm team: Kaushik Velusamy (kaushik.v@anl.gov) or Musa Cim (mtc5693@psu.edu).
## Citation and Reference
## Acknowledgments
This work used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility under Contract DE-AC02-06CH11357 and is supported in part by the National Science Foundation under awards OCI-1835764 and CSR-1814872.
## License
Apache License 2.0
Copyright (c) 2025, UChicago Argonne, LLC All Rights Reserved
If you have questions about your rights to use or distribute this software, please contact Argonne Intellectual Property Office at partners@anl.gov
NOTICE. This Software was developed under funding from the U.S. Department of Energy and the U.S. Government consequently retains certain rights. As such, the U.S. Government has been granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in the Software to reproduce, distribute copies to the public, prepare derivative works, and perform publicly and display publicly, and to permit others to do so.