llm-execution-time-predictor


Name: llm-execution-time-predictor
Version: 0.1.2
Home page: None
Summary: LLM batch inference latency predictor and profiler CLI tool
Upload time: 2025-07-29 03:16:27
Maintainer: None
Docs URL: None
Author: None
Requires Python: >=3.8
License: MIT
Keywords: llm, inference, latency, prediction, profiling, machine-learning
Requirements: sglang, fire, gradio, numpy, matplotlib, scikit-learn, pandas, plotly
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.
# LLM Execution Time Predictor

A small utility that helps train a regression model to predict prefill/decode execution times.
Given the batch size and input length, prefill/decode execution times are highly predictable.

This can be plugged into a simulator for faster experiments.

A more sophisticated version exists at https://github.com/microsoft/vidur, but it trains a model for every component of the forward pass. This utility instead profiles the full model forward pass as a single unit to simplify research.

The tool https://modal.com/llm-almanac/advisor is a nice visualizer, but it doesn't let you train a local version or specify an exact batch size/input length.

## Installation

### Option 1: Install from PyPI (Recommended)
```bash
pip install llm_execution_time_predictor
```

### Option 2: Install from Source
Clone the repository (https://github.com/vikranth22446/llm_execution_time_predictor), then install the dependencies:
```bash
pip install -r requirements.txt
```

## Features used by the prefill/decode predictors
A very small set of features is used to train the predictor (see the sketch after this list):
- Num new tokens: total tokens processed/generated.
  - For decode, it's the batch size; for prefill, it's the full input chunk.
- Product ext cost: represents the cost of attention.
  - For prefill, attention is O(seq_len^2), so we use bs * input_len^2.
  - For decode, it's just O(seq_len).
- Total context tokens: total tokens processed across the batch (bs * input_len), representing cache usage.
- Batch size.

The prediction target is the kernel (model forward) execution time, and models are trained and tested for both prefill and decode.
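As a rough illustration of how these features map to numbers, here is a minimal sketch. The function and variable names, and the exact formulas, are illustrative assumptions based on the descriptions above, not the package's internal API:

```python
# Illustrative feature vectors in the order
# [num_new_tokens, prod_ext_ctx, num_context_tokens, batch_size].

def prefill_features(batch_size: int, input_len: int) -> list:
    num_new_tokens = batch_size * input_len        # the full input chunk is processed
    prod_ext_ctx = batch_size * input_len ** 2     # attention cost scales as O(seq_len^2)
    num_context_tokens = batch_size * input_len    # tokens held across the batch (cache usage)
    return [num_new_tokens, prod_ext_ctx, num_context_tokens, batch_size]


def decode_features(batch_size: int, context_len: int) -> list:
    num_new_tokens = batch_size                    # one new token per sequence in the batch
    prod_ext_ctx = batch_size * context_len        # attention cost scales as O(seq_len)
    num_context_tokens = batch_size * context_len  # cached context tokens across the batch
    return [num_new_tokens, prod_ext_ctx, num_context_tokens, batch_size]


print(prefill_features(8, 1024))   # [8192, 8388608, 8192, 8]
print(decode_features(8, 1024))    # [8, 8192, 8192, 8]
```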

## Usage

### Using the PyPI Package
```bash
# Profile a model and generate benchmark data
llm-execution-time-predictor profile <model_name> --tp_size <tp_size>

# Train models from benchmark data
llm-execution-time-predictor train_models <config_name> <benchmark_file> [--predictor-file <output_file>]

# Make predictions using trained models
llm-execution-time-predictor predict <predictor_file> <config_name> --mode <prefill/decode> --bs <batch_size> --input-len <input_length>

# View trained models and make interactive predictions (CLI)
llm-execution-time-predictor view [--predictor-file <predictor_file>]

# Launch web-based viewer with interactive plots
llm-execution-time-predictor webview [--predictor-file <predictor_file>] [--host <host>] [--port <port>]
```

### Using from Source
```bash
# Profile a model and generate benchmark data
python llm_execution_time_predictor/llm_forward_predictor_cli.py profile <model_name> --tp_size <tp_size>

# Train models from benchmark data
python llm_execution_time_predictor/llm_forward_predictor_cli.py train_models <config_name> <benchmark_file> [--predictor-file <output_file>]

# Make predictions using trained models
python llm_execution_time_predictor/llm_forward_predictor_cli.py predict <predictor_file> <config_name> --mode <prefill/decode> --bs <batch_size> --input-len <input_length>

# View trained models and make interactive predictions (CLI)
python llm_execution_time_predictor/llm_forward_predictor_cli.py view [--predictor-file <predictor_file>]

# Launch web-based viewer with interactive plots
python llm_execution_time_predictor/llm_forward_predictor_cli.py webview [--predictor-file <predictor_file>] [--host <host>] [--port <port>]
```

The trained predictor file format:
```json
{
    "config_name": {
        "prefill": {
            "weights": [0.1234, 0.5678, 0.9012, 0.3456],
            "bias": 0.0123,
            "model_type": "linear"
        },
        "decode": {
            "weights": [0.2345, 0.6789, 0.0123, 0.4567],
            "bias": 0.0456,
            "model_type": "linear"
        }
    }
}
```

Feature order: `[num_new_tokens, prod_ext_ctx, num_context_tokens, batch_size]`
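
If you want to consume a trained predictor file directly (e.g. from a simulator) rather than going through the `predict` subcommand, here is a minimal sketch under the linear form implied by the format above (`dot(weights, features) + bias`). The helper name and file/config names are taken from the quickstart below and are only illustrative:

```python
import json


def predict_time(predictor_file, config_name, mode, features):
    """Evaluate a trained linear model from the predictor file: dot(weights, features) + bias."""
    with open(predictor_file) as f:
        predictors = json.load(f)
    model = predictors[config_name][mode]          # mode is "prefill" or "decode"
    assert model["model_type"] == "linear"
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]


# Example: decode step with bs=8 and 1024 context tokens per sequence, using the
# feature order [num_new_tokens, prod_ext_ctx, num_context_tokens, batch_size].
bs, ctx = 8, 1024
features = [bs, bs * ctx, bs * ctx, bs]
print(predict_time("trained_predictors.json", "tp1_config", "decode", features))
```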

## Webviewer
![Web Viewer](webview_demo.png)

## Quickstart workflow

### Using PyPI Package
```bash
llm-execution-time-predictor profile Qwen/Qwen3-4B --tp_size 1
llm-execution-time-predictor train_models tp1_config benchmark_data_Qwen_Qwen3-4B_TP_1_PP_1.json --predictor-file trained_predictors.json
llm-execution-time-predictor predict trained_predictors.json tp1_config --mode decode --bs 8 --input-len 1024
llm-execution-time-predictor webview --predictor-file trained_predictors.json
```

### Using from Source
```bash
python llm_execution_time_predictor/llm_forward_predictor_cli.py profile Qwen/Qwen3-4B --tp_size 1
python llm_execution_time_predictor/llm_forward_predictor_cli.py train_models tp1_config benchmark_data_Qwen_Qwen3-4B_TP_1_PP_1.json --predictor-file trained_predictors.json
python llm_execution_time_predictor/llm_forward_predictor_cli.py predict trained_predictors.json tp1_config --mode decode --bs 8 --input-len 1024
python llm_execution_time_predictor/llm_forward_predictor_cli.py webview --predictor-file trained_predictors.json
```

# TODO
1. Fix vLLM forcing a single batch: with the vLLM backend, vLLM might currently run more than one batch, making some of the profiling inaccurate and skewing the model. There is currently no good solution for this.

# Ack
Co-contributors: [Dongming Li](https://github.com/dongmingli-Ben) and [Zijian He](https://github.com/jiange91)

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "llm-execution-time-predictor",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "llm, inference, latency, prediction, profiling, machine-learning",
    "author": null,
    "author_email": "Vikranth Srivatsa <vsrivatsa@users.noreply.github.com>",
    "download_url": "https://files.pythonhosted.org/packages/dd/28/bcf2dcec22512a2c3a32761a587c0ab992d239660b51496074073ce50df1/llm_execution_time_predictor-0.1.2.tar.gz",
    "platform": null,
    "description": "# LLM Execution Time Predictor\n\nA small utility to help train a regression model given to predict prefill/decode times. \nBy using the batch size and input, the prefill/decode execution times are very predictable.\n\nThis can be plugged into a simulator for faster experiments.\n\nA more complicated version is done by https://github.com/microsoft/vidur but it trains every component of the model forwarding. This utility instead just profiles the full model forwarding as a unit to simplify research.\n\nThe tool https://modal.com/llm-almanac/advisor is nice visualizer but it doesn't let you train a local version and specify an exact bs/input\n\n## Installation\n\n### Option 1: Install from PyPI (Recommended)\n```bash\npip install llm_execution_time_predictor\n```\n\n### Option 2: Install from Source\n```bash\npip install -r requirements.txt\n```\n\n## Using Prefill/Decode execution time for predictors\nA very small set of features are used to train the predictor.\nNum new tokens: total tokens processed/generated:\n- for decode, it's the batch size. for prefill, it's the full input chunk\nProduct ext cost: Represents the cost of attention\n- For prefill, it's O(seq_len^2) so we do bs * input^2\n- For decode, it's just O(seq_len)\nTotal context tokens: \n- Total tokens processed across batch * input representing the cache usage\nTime of kernel\n\nTested on both prefill/decode the decode time \n\n## Usage\n\n### Using the PyPI Package\n```bash\n# Profile a model and generate benchmark data\nllm-execution-time-predictor profile <model_name> --tp_size <tp_size>\n\n# Train models from benchmark data\nllm-execution-time-predictor train_models <config_name> <benchmark_file> [--predictor-file <output_file>]\n\n# Make predictions using trained models\nllm-execution-time-predictor predict <predictor_file> <config_name> --mode <prefill/decode> --bs <batch_size> --input-len <input_length>\n\n# View trained models and make interactive predictions (CLI)\nllm-execution-time-predictor view [--predictor-file <predictor_file>]\n\n# Launch web-based viewer with interactive plots\nllm-execution-time-predictor webview [--predictor-file <predictor_file>] [--host <host>] [--port <port>]\n```\n\n### Using from Source\n```bash\n# Profile a model and generate benchmark data\npython llm_execution_time_predictor/llm_forward_predictor_cli.py profile <model_name> --tp_size <tp_size>\n\n# Train models from benchmark data\npython llm_execution_time_predictor/llm_forward_predictor_cli.py train_models <config_name> <benchmark_file> [--predictor-file <output_file>]\n\n# Make predictions using trained models\npython llm_execution_time_predictor/llm_forward_predictor_cli.py predict <predictor_file> <config_name> --mode <prefill/decode> --bs <batch_size> --input-len <input_length>\n\n# View trained models and make interactive predictions (CLI)\npython llm_execution_time_predictor/llm_forward_predictor_cli.py view [--predictor-file <predictor_file>]\n\n# Launch web-based viewer with interactive plots\npython llm_execution_time_predictor/llm_forward_predictor_cli.py webview [--predictor-file <predictor_file>] [--host <host>] [--port <port>]\n```\n\nThe trained predictor file format:\n```json\n{\n    \"config_name\": {\n        \"prefill\": {\n            \"weights\": [0.1234, 0.5678, 0.9012, 0.3456],\n            \"bias\": 0.0123,\n            \"model_type\": \"linear\"\n        },\n        \"decode\": {\n            \"weights\": [0.2345, 0.6789, 0.0123, 0.4567],\n            \"bias\": 0.0456,\n            
\"model_type\": \"linear\"\n        }\n    }\n}\n```\n\nFeature order: `[num_new_tokens, prod_ext_ctx, num_context_tokens, batch_size]`\n\n## Webviewer\n![Web Viewer](webview_demo.png)\n\n## Quickstart workflow\n\n### Using PyPI Package\n```bash\nllm-execution-time-predictor profile Qwen/Qwen3-4B --tp_size 1\nllm-execution-time-predictor train_models tp1_config benchmark_data_Qwen_Qwen3-4B_TP_1_PP_1.json --predictor-file trained_predictors.json\nllm-execution-time-predictor predict trained_predictors.json tp1_config --mode decode --bs 8 --input-len 1024\nllm-execution-time-predictor webview --predictor-file trained_predictors.json\n```\n\n### Using from Source\n```bash\npython llm_execution_time_predictor/llm_forward_predictor_cli.py profile Qwen/Qwen3-4B --tp_size 1\npython llm_execution_time_predictor/llm_forward_predictor_cli.py train_models tp1_config benchmark_data_Qwen_Qwen3-4B_TP_1_PP_1.json --predictor-file trained_predictors.json\npython llm_execution_time_predictor/llm_forward_predictor_cli.py predict trained_predictors.json tp1_config --mode decode --bs 8 --input-len 1024\npython llm_execution_time_predictor/llm_forward_predictor_cli.py webview --predictor-file trained_predictors.json\n```\n\n# TODO\n1. Fix vLLM force one batch\nwith vllm backend, currently vLLM might run more than 1 batch making some of the profiling innacurate skewing the model. Currently no good solution for this. \n\n# Ack\nCo-contributors: [Dongming Li](https://github.com/dongmingli-Ben) and [Zijian He](https://github.com/jiange91)\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "LLM batch inference latency predictor and profiler CLI tool",
    "version": "0.1.2",
    "project_urls": {
        "Bug Tracker": "https://github.com/vikranth22446/llm_execution_time_predictor/issues",
        "Documentation": "https://github.com/vikranth22446/llm_execution_time_predictor#readme",
        "Homepage": "https://github.com/vikranth22446/llm_execution_time_predictor",
        "Repository": "https://github.com/vikranth22446/llm_execution_time_predictor"
    },
    "split_keywords": [
        "llm",
        " inference",
        " latency",
        " prediction",
        " profiling",
        " machine-learning"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "766740e98704e15292575c24c2fb64e28b5a083d202c0591ebdcc377c5d7c8d8",
                "md5": "ccd2b0371f81bfa15a1390a03bff8354",
                "sha256": "cb3022dbb6ea78691ac72f8054fa0d745cc7a1444705bbde3b986ff5ac084685"
            },
            "downloads": -1,
            "filename": "llm_execution_time_predictor-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ccd2b0371f81bfa15a1390a03bff8354",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 29392,
            "upload_time": "2025-07-29T03:16:26",
            "upload_time_iso_8601": "2025-07-29T03:16:26.310988Z",
            "url": "https://files.pythonhosted.org/packages/76/67/40e98704e15292575c24c2fb64e28b5a083d202c0591ebdcc377c5d7c8d8/llm_execution_time_predictor-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "dd28bcf2dcec22512a2c3a32761a587c0ab992d239660b51496074073ce50df1",
                "md5": "fb32675a0ad1b92ab08bf15102670995",
                "sha256": "08c0b74c5d7b05d5c02144983caa32a473cc5ec965e1cbe87a3232694a559c23"
            },
            "downloads": -1,
            "filename": "llm_execution_time_predictor-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "fb32675a0ad1b92ab08bf15102670995",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 26524,
            "upload_time": "2025-07-29T03:16:27",
            "upload_time_iso_8601": "2025-07-29T03:16:27.444346Z",
            "url": "https://files.pythonhosted.org/packages/dd/28/bcf2dcec22512a2c3a32761a587c0ab992d239660b51496074073ce50df1/llm_execution_time_predictor-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-29 03:16:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "vikranth22446",
    "github_project": "llm_execution_time_predictor",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "sglang",
            "specs": []
        },
        {
            "name": "fire",
            "specs": []
        },
        {
            "name": "gradio",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "matplotlib",
            "specs": []
        },
        {
            "name": "scikit-learn",
            "specs": []
        },
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "plotly",
            "specs": []
        }
    ],
    "lcname": "llm-execution-time-predictor"
}
        