aiconfigurator


Name: aiconfigurator
Version: 0.1.0
Summary: aiconfigurator: automatic disaggregated serving offline configuration
Upload time: 2025-08-12 19:10:42
Requires Python: >=3.9
Keywords: decode, distributed, dynamo, gpu, inference, llm, nvidia, prefill
            <!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# aiconfigurator
In disaggregated serving today, it's quite difficult to find a proper config
that actually captures the benefits of disaggregation: how many prefill workers and decode workers
do you need, and what parallelism should each worker use? Once you add SLA targets,
TTFT (Time-To-First-Token) and TPOT (Time-Per-Output-Token), solving the
throughput @ latency problem becomes even more complicated.

We're introducing aiconfigurator to help you find a good reference point to start from on your
disaggregated serving journey. The tool searches the configuration space for a good deployment config
based on your requirements: which model you want to serve, how many GPUs you have, and which
GPU type. It then automatically generates the config files you need to deploy with Dynamo.

It's based on modeling LLM inference with data collected on a target machine for a specific framework.
It searches thousands of different configurations in the background in tens of seconds, runs on any machine,
and provides both a CLI tool and a webapp.

Let's get started.


# Build and Install
## Install from PyPI
```bash
pip3 install aiconfigurator
```
## Build and install from source
1. Install git-lfs: apt-get install git-lfs (Linux) or brew install git-lfs (macOS)
2. Clone the repo
3. (optional) python3 -m venv myenv && source myenv/bin/activate  (Python >= 3.9 is required)
4. (optional) pip3 install --upgrade pip (if you hit an error about setup.py not being found)
5. pip3 install "." (a combined example is shown below)
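Put together, a typical install from source looks like this (a sketch assuming a Debian-based Linux shell and the repository URL from the package metadata):
```bash
apt-get install git-lfs                             # brew install git-lfs on macOS
git clone https://github.com/ai-dynamo/aiconfigurator.git
cd aiconfigurator
python3 -m venv myenv && source myenv/bin/activate  # optional, needs Python >= 3.9
pip3 install --upgrade pip                          # optional
pip3 install "."
```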

## Build with Dockerfile
```bash
  # This will create a ./dist/ folder containing the wheel file
  docker build -f docker/Dockerfile --no-cache --target build -t aiconfigurator:latest .
  docker create --name aic aiconfigurator:latest && docker cp aic:/workspace/dist dist/ && docker rm aic
```
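Once the wheel has been copied out, you can install it locally. A minimal sketch, assuming the wheel lands directly under ./dist/ (the exact filename depends on the version):
```bash
pip3 install dist/aiconfigurator-*.whl
```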

# Run
## CLI
```bash
  aiconfigurator cli --model QWEN3_32B --total_gpus 32 --system h200_sxm
```
With **3 basic args**, it reports the estimated best deployment result and the deployment details  
With **--save_dir DIR**, it also writes out the framework configs so you can deploy with Dynamo  
With **-h**, you can see more information about the optional args for customizing your deployment target  
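For example, to also write out the deployment configs (the output directory name here is only an illustration):
```bash
  aiconfigurator cli --model QWEN3_32B --total_gpus 32 --system h200_sxm --save_dir ./results
```
The report below comes from the basic three-arg invocation.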

```
********************************************************************************
*                      Dynamo aiconfigurator Final Results                     *
********************************************************************************
  ----------------------------------------------------------------------------
  Input Configuration & SLA Target:
    Model: QWEN3_32B (is_moe: False)
    Total GPUs: 512
    I/O Length (tokens): Input=4000, Output=500
    SLA Target: TTFT <= 300.0ms, TPOT <= 10.0ms
  ----------------------------------------------------------------------------
  Overall best system chosen: disagg at 812.48 tokens/s/gpu (2.39x better)
    - Agg Actual Best: 340.48 tokens/s/gpu  100.83 tokens/s/user | TTFT: 188.91ms TPOT: 9.92ms
    - Disagg Actual Best: 812.48 tokens/s/gpu  109.12 tokens/s/user | TTFT: 276.94ms TPOT: 9.16ms
  ----------------------------------------------------------------------------
  Pareto Frontier:
               QWEN3_32B Pareto Frontier: tokens/s/gpu vs tokens/s/user         
      ┌────────────────────────────────────────────────────────────────────────┐
1600.0┤ dd Disagg                                                              │
      │ aa Agg                                                                 │
      │ XX Best                                                                │
      │                                                                        │
1333.3┤    a                                                                   │
      │     a                                                                  │
      │      aaaa    d                                                         │
      │          a    ddddddddd                                                │
1066.7┤           a           dd                                               │
      │            aa           dddddddd                                       │
      │              aaa                dd                                     │
      │                 a                 d                                    │
 800.0┤                 a                  dddddddXdd                          │
      │                  aaaa                        d                         │
      │                      aaa                      d                        │
      │                         aa                     d                       │
 533.3┤                           aaaaaa                dd                     │
      │                                 aa                dd                   │
      │                                   aa                dd                 │
      │                                     aaaaaa            ddd              │
 266.7┤                                          aaaaa           d             │
      │                                              aaaaaaa                   │
      │                                                     aaaaaaa            │
      │                                                                        │
   0.0┤                                                                        │
      └┬─────────────────┬─────────────────┬────────────────┬─────────────────┬┘
       0                45                90               135              180 
tokens/s/gpu                         tokens/s/user                              

  ----------------------------------------------------------------------------
  Worker Setup:
    Model: QWEN3_32B (is_moe: False)
    Disagg Prefill: h200_sxm (trtllm)
    Disagg Decode:  h200_sxm (trtllm)
    Prefill Quantization: GEMM: fp8_block, KVCache: fp8, FMHA: fp8
    Decode Quantization: GEMM: fp8_block, KVCache: fp8, FMHA: fp8
    Agg: h200_sxm (trtllm)
    Quantization: GEMM: fp8_block, KVCache: fp8, FMHA: fp8
  ----------------------------------------------------------------------------
  Deployment Details:
    (p) stands for prefill, (d) stands for decode, bs stands for batch size, a replica stands for the smallest scalable unit xPyD of the disagg system
    Some math: total gpus used = replicas * gpus/replica
               gpus/replica = (p)gpus/worker * (p)workers + (d)gpus/worker * (d)workers; for Agg, gpus/replica = gpus/worker
               gpus/worker = tp * pp * dp = etp * ep * pp for MoE models; tp * pp for dense models (underlined numbers are the actual values in math)

Disagg Top Configurations: (Sorted by tokens/s/gpu)
+------+--------------+---------------+-------------+------------------+----------+----------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
| Rank | tokens/s/gpu | tokens/s/user | concurrency | total_gpus(used) | replicas |  gpus/replica  | (p)workers | (p)gpus/worker | (p)parallel | (p)bs | (d)workers | (d)gpus/worker | (d)parallel | (d)bs |
+------+--------------+---------------+-------------+------------------+----------+----------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
|  1   |    812.48    |     109.12    |      60     |  512 (512=64x8)  |    64    |  8 (=4x1+1x4)  |     4      |    1 (=1x1)    |    tp1pp1   |   1   |     1      |    4 (=4x1)    |    tp4pp1   |   60  |
|  2   |    802.97    |     100.56    |     204     | 512 (500=20x25)  |    20    | 25 (=13x1+3x4) |     13     |    1 (=1x1)    |    tp1pp1   |   1   |     3      |    4 (=4x1)    |    tp4pp1   |   68  |
|  3   |    802.09    |     106.73    |     192     | 512 (500=20x25)  |    20    | 25 (=13x1+3x4) |     13     |    1 (=1x1)    |    tp1pp1   |   1   |     3      |    4 (=4x1)    |    tp4pp1   |   64  |
|  4   |    767.19    |     114.22    |     156     | 512 (506=22x23)  |    22    | 23 (=11x1+3x4) |     11     |    1 (=1x1)    |    tp1pp1   |   1   |     3      |    4 (=4x1)    |    tp4pp1   |   52  |
|  5   |    761.70    |     111.61    |     224     | 512 (496=16x31)  |    16    | 31 (=15x1+4x4) |     15     |    1 (=1x1)    |    tp1pp1   |   1   |     4      |    4 (=4x1)    |    tp4pp1   |   56  |
+------+--------------+---------------+-------------+------------------+----------+----------------+------------+----------------+-------------+-------+------------+----------------+-------------+-------+
Agg Top Configurations: (Sorted by tokens/s/gpu)
+------+--------------+---------------+-------------+------------------+----------+--------------+-------------+----------+----+
| Rank | tokens/s/gpu | tokens/s/user | concurrency | total_gpus(used) | replicas | gpus/replica | gpus/worker | parallel | bs |
+------+--------------+---------------+-------------+------------------+----------+--------------+-------------+----------+----+
|  1   |    340.48    |     100.83    |      15     | 512 (512=128x4)  |   128    |      4       |   4 (=4x1)  |  tp4pp1  | 15 |
|  2   |    326.78    |     104.48    |      14     | 512 (512=128x4)  |   128    |      4       |   4 (=4x1)  |  tp4pp1  | 14 |
|  3   |    307.50    |     105.57    |      13     | 512 (512=128x4)  |   128    |      4       |   4 (=4x1)  |  tp4pp1  | 13 |
|  4   |    296.61    |     107.15    |      24     |  512 (512=64x8)  |    64    |      8       |   8 (=8x1)  |  tp8pp1  | 24 |
|  5   |    265.44    |     115.81    |      20     |  512 (512=64x8)  |    64    |      8       |   8 (=8x1)  |  tp8pp1  | 20 |
+------+--------------+---------------+-------------+------------------+----------+--------------+-------------+----------+----+
********************************************************************************
INFO 2025-07-28 17:23:10,701 main.py:1035] Configuration completed in 48.18 seconds
```
The results indicate that, when you deploy Qwen3-32B on h200_sxm in fp8, disagg can deliver **2.39x** the throughput of an agg deployment under an SLA of TTFT<=300ms and TPOT<=10ms with ISL:OSL of 4000:500  
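For example, applying the deployment math above to the Rank 1 disagg row:
```
gpus/replica    = (p)workers x (p)gpus/worker + (d)workers x (d)gpus/worker
                = 4 x 1 + 1 x 4 = 8
total gpus used = replicas x gpus/replica = 64 x 8 = 512 (of 512 total)
```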
Try a different ISL:OSL and different TTFT and TPOT limits with, say, 
```bash
  aiconfigurator cli --model QWEN3_32B --total_gpus 32 --system h200_sxm --ttft 200 --tpot 10 --isl 8000 --osl 200
```
You will get different answers.  

### Customized config for aiconfigurator
If you want to customize even more, including the search space and the quantization of each component, all of these parameters are defined in a yaml file.  
The built-in yaml files are under src/aiconfigurator/cli/templates/trtllm/xxx_default.yaml (in the future, trtllm can be other backend names).  
Refer to the yaml file and modify what you want, then pass your customized yaml file via **--yaml_path**, 
```bash
  aiconfigurator cli --model QWEN3_32B --total_gpus 32 --system h200_sxm --ttft 200 --tpot 10 --isl 8000 --osl 200 --yaml_path customized_config.yaml
```
For how to tune these parameters, please refer to [Advanced Tuning](docs/advanced_tuning.md) for details
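As a starting point, you can copy one of the built-in templates and edit it before passing it via --yaml_path (the template filename placeholder follows the path mentioned above):
```bash
  cp src/aiconfigurator/cli/templates/trtllm/xxx_default.yaml ./customized_config.yaml
  # edit the search space and per-component quantization in customized_config.yaml
```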

### Generate configs for Dynamo
In aiconfigurator cli, if you specify --save_dir, we'll generate configs for deploying with Dynamo.
This is an **important** feature to bridge the gap between configuration and Dynamo deployment.  
The folder structure looks like this:
````
backend_configs/
├── agg/
│   ├── agg_config.yaml
│   └── node_0_run.sh
└── disagg/
    ├── decode_config.yaml
    ├── prefill_config.yaml
    ├── node_0_run.sh
    ├── node_1_run.sh
    └── ...
````
Please refer to [Deployment Guide](docs/dynamo_deployment_guide.md) for details


## Webapp
```bash
  aiconfigurator webapp
```
Visit 127.0.0.1:7860  
Make sure to read [Advanced Tuning](docs/advanced_tuning.md) and the readme tab of the webapp before you run experiments.


## Tuning with advanced features
There are many features, such as different quantizations and different parallel strategies, for tuning performance 
beyond the default configurations. These apply to both the CLI and the webapp. Please refer to [Advanced Tuning](docs/advanced_tuning.md) for details


# How it works
## Modeling and mechanism

To estimate the inference performance of an LLM, the following must be considered:
1. compute cost: gemm, attention, others
2. communication cost: all-reduce for tensor parallelism, p2p for pipeline parallelism

We break the LLM inference process down into operations, i.e., gemm, attention, communication, embedding, elementwise operations, and others.  
We collect operation execution times on a given piece of hardware.  
We then estimate the execution time of a given config by composing these operation times, using interpolation/extrapolation.  
On top of that, we model inflight-batching (aggregated) and disaggregated serving.  
Finally, we search the thousands of possible combinations for the best config and generate configs for Dynamo based on the results.
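As a rough sketch of the idea (not the exact internal model), the estimated iteration time of a candidate config is composed from the interpolated operation times:
```
T_iter(config) ≈ Σ_ops t_op(shape, quantization, hardware)   # gemm, attention, elementwise, ...
               + t_allreduce(tp) + t_p2p(pp)                 # communication
TTFT ≈ prefill iteration time; TPOT ≈ steady-state decode iteration time
```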

## Support list 
Models: GPT, LLAMA(2,3), MOE, QWEN, DEEPSEEK_V3  
OPs: MHA/GQA/MLA(FP8,FP16,FP32 fmha), 8bit kvcache, GEMM(FP16, 8/4bit WO, SQ, FP8), AllReduce(FP16), Embedding, P2P, ElementWise, NCCL(all2all, allgather, reducescatter), MoE(FP16, FP8, W4AFP8)  
TRTLLM Versions: 0.20.0, 1.0.0rc3  
Parallel modes: Tensor-parallel; Pipeline-parallel; Expert Tensor-parallel/Expert-parallel; Attention DP for DEEPSEEK and MoE  
Scheduling: Static; IFB(continuous batching); Disaggregated serving; MTP for DEEPSEEK

## Data Collection
Data collection is a standalone process for building the database used by aiconfigurator. By default, you don't have to collect the data yourself.
Small version differences in the database will not introduce a huge perf difference. For example, you can use trtllm 1.0.0rc3 data on h200_sxm and deploy the generated 
configs with a Dynamo + trtllm 1.0.0rc4 worker.

If you want to go through the process, please refer to this [guidance](collector/README.md) under the collector folder


# Known issues
1. MoE memory estimation of the trtllm backend needs to account for workspace  
2. Results are relatively too optimistic in the low-speed, high-throughput region.  
> **Note**: the result is not a final, absolute one. It can be inaccurate due to modeling gaps, or it may indicate a performance improvement opportunity. It aims to align with the framework's current implementation and to provide configuration suggestions. Please verify with real benchmarks using our generated configs and do follow-up tuning.

            
