# Vidur: LLM Inference Simulator
Vidur is a high-fidelity and extensible LLM inference simulator. It can help you with:
1. Capacity planning and finding the best deployment configuration for your LLM deployments.
2. Testing new research ideas such as new scheduling algorithms and optimizations like speculative decoding.
3. Studying the system performance of models under different workloads and configurations.
... all without access to GPUs except for a quick initial profiling phase.
Please refer to our [MLSys'24 paper](https://arxiv.org/abs/2405.05465) for more details.
We have a [talk with live demo](https://mlsys.org/virtual/2024/poster/2667) that captures the capabilities of the system.
## Supported Models
| Model / Device | A100 80GB DGX | H100 DGX | 4xA100 80GB Pairwise NVLink Node | 8xA40 Pairwise NVLink Node |
| --- | --- | --- | --- | --- |
| `meta-llama/Llama-3-8B` | ✅ | ✅ | ❌ | ❌ |
| `meta-llama/Llama-3-70B` | ✅ | ✅ | ❌ | ❌ |
* __Instructions on adding a new model to existing or new SKUs can be found [here](docs/profiling.md)__.
* All models support a maximum context length of 2M tokens.
* Pipeline parallelism is supported for all models. The PP dimension must divide the number of layers in the model.
* In DGX nodes, there are 8 GPUs, fully connected via NVLink. So TP1, TP2, TP4 and TP8 are supported.
* In 4x pairwise NVLink nodes, there are 4 GPUs, so TP1, TP2, and TP4 are supported. TP4 here is less performant than TP4 in DGX nodes because only (GPU1, GPU2) and (GPU3, GPU4) are connected via NVLink; the interconnect between these pairs is slower.
* You can use any combination of TP and PP. For example, you can run LLaMA2-70B with TP2-PP2 on a 4xA100 80GB Pairwise NVLink Node, as sketched below.
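
A minimal sketch of how such a TP2-PP2 configuration might be expressed on the command line, using the `--replica_config_tensor_parallel_size` and `--replica_config_num_pipeline_stages` flags from the full example under "Running the simulator" (the model name here is illustrative; other parameters keep their defaults):

```sh
# Sketch: a 70B model with TP2-PP2 on A100s.
# Flag names are taken from the full example below; the model name is
# illustrative and all other parameters are left at their defaults.
python -m vidur.main \
    --replica_config_device a100 \
    --replica_config_model_name meta-llama/Llama-2-70b-hf \
    --replica_config_tensor_parallel_size 2 \
    --replica_config_num_pipeline_stages 2
```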
## Chrome Trace
Vidur exports a Chrome trace for each simulation. The trace is written to the `simulator_output` directory and can be viewed by navigating to `chrome://tracing/` or `edge://tracing/` and loading the file.
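
A quick way to locate the trace after a run (a sketch; it assumes the trace is written as a JSON file, the standard format for `chrome://tracing`, and the exact filename and directory layout under `simulator_output` depend on the run):

```sh
# List JSON files under simulator_output/; the chrome trace is among them.
find simulator_output -name '*.json'
```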

## Setup
### Using `mamba`
To run the simulator, create a mamba environment with the given dependency file.
```sh
mamba env create -p ./env -f ./environment.yml
mamba env update -f environment-dev.yml
```
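
To activate the environment afterwards (a sketch, assuming the `./env` prefix used above):

```sh
mamba activate ./env
```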
### Using `venv`
1. Ensure that you have Python 3.10 installed on your system. Refer to <https://www.bitecode.dev/p/installing-python-the-bare-minimum>
2. `cd` into the repository root
3. Create a virtual environment with the `venv` module: `python3.10 -m venv .venv`
4. Activate the virtual environment using `source .venv/bin/activate`
5. Install the dependencies using `python -m pip install -r requirements.txt`
6. Run `deactivate` to deactivate the virtual environment
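
For convenience, steps 2-5 as a single shell session (assuming `python3.10` is on your `PATH`; `<repository-root>` is a placeholder for your checkout path):

```sh
cd <repository-root>                       # step 2
python3.10 -m venv .venv                   # step 3: create the environment
source .venv/bin/activate                  # step 4: activate it
python -m pip install -r requirements.txt  # step 5: install dependencies
```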
### Using `conda` (Least recommended)
To run the simulator, create a conda environment with the given dependency file.
```sh
conda env create -p ./env -f ./environment.yml
conda env update -f environment-dev.yml
```
## Setting up wandb (Optional)
First, set up your account on `https://<your-org>.wandb.io/` or public wandb, obtain the API key, and then run the following command:
```sh
wandb login --host https://<your-org>.wandb.io
```
To opt out of wandb, pick any one of the following methods:
1. `export WANDB_MODE=disabled` in your shell, or add it to `~/.zshrc` or `~/.bashrc`. Remember to reload using `source ~/.zshrc`.
2. Set `wandb_project` and `wandb_group` to `""` in `vidur/config/default.yml`. Also, remove these CLI params from the shell command with which the simulator is invoked.
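
The first method can also be applied to a single run without touching your shell profile, since `WANDB_MODE` is a standard wandb environment variable:

```sh
# Disable wandb logging for this invocation only.
WANDB_MODE=disabled python -m vidur.main
```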
## Running the simulator
To run the simulator, execute the following command from the repository root,
```sh
python -m vidur.main
```
or, for a fuller example with the main parameters spelled out:
```sh
python -m vidur.main \
--replica_config_device a100 \
--replica_config_model_name meta-llama/Llama-2-7b-hf \
--cluster_config_num_replicas 1 \
--replica_config_tensor_parallel_size 1 \
--replica_config_num_pipeline_stages 1 \
--request_generator_config_type synthetic \
--length_generator_config_type trace \
--interval_generator_config_type static \
--[trace|zipf|uniform|fixed]_request_length_generator_config_max_tokens 4096 \
--trace_request_length_generator_config_trace_file ./data/processed_traces/arxiv_summarization_stats_llama2_tokenizer_filtered_v2.csv \
--synthetic_request_generator_config_num_requests 128 \
--replica_scheduler_config_type vllm \
--[vllm|lightllm|orca|faster_transformer|sarathi]_scheduler_config_batch_size_cap 256 \
--[vllm|lightllm]_scheduler_config_max_tokens_in_batch 4096
```
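
In the bracketed flags above (e.g. `--[vllm|lightllm|...]_scheduler_config_*`), pick the prefix that matches the configured type. A sketch of the resolved flag names for the choices made in this example (`vllm` scheduler, `trace` length generator), with the same values as above:

```sh
python -m vidur.main \
    --replica_scheduler_config_type vllm \
    --vllm_scheduler_config_batch_size_cap 256 \
    --vllm_scheduler_config_max_tokens_in_batch 4096 \
    --length_generator_config_type trace \
    --trace_request_length_generator_config_max_tokens 4096
```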
The simulator supports a wide range of simulation parameters; the full list is documented [here](docs/launch_parameters.md).
The metrics will be logged to wandb directly and a copy will be stored in the `simulator_output` directory along with the chrome trace. A description of all the logged metrics can be found [here](docs/metrics.md).
## Formatting Code
To format code, execute the following command:
```sh
make format
```