# LLM Execution Time Predictor
A small utility to help train a regression model that predicts prefill/decode execution times.
Given the batch size and input length, prefill/decode execution times are highly predictable.
This can be plugged into a simulator for faster experiments.
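As a sketch of that simulator use case, the snippet below replaces a real forward pass with predicted per-step times to estimate end-to-end request latency; the function name and the numbers are placeholders for illustration, not part of this package.

```python
def estimate_request_latency_ms(prefill_ms: float, decode_step_ms: float,
                                num_output_tokens: int) -> float:
    """Total latency = one prefill step plus one predicted decode step per generated token."""
    return prefill_ms + decode_step_ms * num_output_tokens


# Placeholder predictions, e.g. from a trained predictor for a given bs/input_len.
print(estimate_request_latency_ms(prefill_ms=40.0, decode_step_ms=9.0, num_output_tokens=256))
```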
A more sophisticated version exists at https://github.com/microsoft/vidur, but it trains a model for every component of the forward pass. This utility instead profiles the full model forward pass as a single unit to simplify research.
The tool https://modal.com/llm-almanac/advisor is a nice visualizer, but it doesn't let you train a local version or specify an exact batch size and input length.
## Installation
### Option 1: Install from PyPI (Recommended)
```bash
pip install llm_execution_time_predictor
```
### Option 2: Install from Source
```bash
pip install -r requirements.txt
```
## Features used for prefill/decode time prediction
A very small set of features is used to train the predictor:
- Num new tokens: total tokens processed/generated.
  - For decode, this is the batch size; for prefill, it is the full input chunk.
- Prod ext ctx: represents the cost of attention.
  - For prefill, attention is O(seq_len^2), so this is `bs * input_len^2`.
  - For decode, it is just O(seq_len).
- Total context tokens: total tokens processed across the batch (`bs * input_len`), representing KV-cache usage.

The regression target is the measured kernel (forward-pass) time. This has been tested on both prefill and decode. A sketch of how these features can be computed is shown after this list.
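Here is a minimal sketch of assembling these features for a uniform batch (every request has the same input length); the helper name `build_features` is hypothetical and the exact formulas are my reading of the list above, not necessarily what the package computes internally. The feature order matches the one documented further below.

```python
def build_features(mode: str, batch_size: int, input_len: int) -> list[int]:
    """Return [num_new_tokens, prod_ext_ctx, num_context_tokens, batch_size]
    for a uniform batch where every request has the same input length."""
    if mode == "prefill":
        num_new_tokens = batch_size * input_len       # full input chunk per request
        prod_ext_ctx = batch_size * input_len ** 2    # O(seq_len^2) attention cost
    elif mode == "decode":
        num_new_tokens = batch_size                   # one new token per request
        prod_ext_ctx = batch_size * input_len         # O(seq_len) attention cost
    else:
        raise ValueError(f"unknown mode: {mode}")
    num_context_tokens = batch_size * input_len       # rough KV-cache usage
    return [num_new_tokens, prod_ext_ctx, num_context_tokens, batch_size]


print(build_features("decode", batch_size=8, input_len=1024))
# -> [8, 8192, 8192, 8]
```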
## Usage
### Using the PyPI Package
```bash
# Profile a model and generate benchmark data
llm-execution-time-predictor profile <model_name> --tp_size <tp_size>
# Train models from benchmark data
llm-execution-time-predictor train_models <config_name> <benchmark_file> [--predictor-file <output_file>]
# Make predictions using trained models
llm-execution-time-predictor predict <predictor_file> <config_name> --mode <prefill/decode> --bs <batch_size> --input-len <input_length>
# View trained models and make interactive predictions (CLI)
llm-execution-time-predictor view [--predictor-file <predictor_file>]
# Launch web-based viewer with interactive plots
llm-execution-time-predictor webview [--predictor-file <predictor_file>] [--host <host>] [--port <port>]
```
### Using from Source
```bash
# Profile a model and generate benchmark data
python llm_execution_time_predictor/llm_forward_predictor_cli.py profile <model_name> --tp_size <tp_size>
# Train models from benchmark data
python llm_execution_time_predictor/llm_forward_predictor_cli.py train_models <config_name> <benchmark_file> [--predictor-file <output_file>]
# Make predictions using trained models
python llm_execution_time_predictor/llm_forward_predictor_cli.py predict <predictor_file> <config_name> --mode <prefill/decode> --bs <batch_size> --input-len <input_length>
# View trained models and make interactive predictions (CLI)
python llm_execution_time_predictor/llm_forward_predictor_cli.py view [--predictor-file <predictor_file>]
# Launch web-based viewer with interactive plots
python llm_execution_time_predictor/llm_forward_predictor_cli.py webview [--predictor-file <predictor_file>] [--host <host>] [--port <port>]
```
The trained predictor file format:
```json
{
"config_name": {
"prefill": {
"weights": [0.1234, 0.5678, 0.9012, 0.3456],
"bias": 0.0123,
"model_type": "linear"
},
"decode": {
"weights": [0.2345, 0.6789, 0.0123, 0.4567],
"bias": 0.0456,
"model_type": "linear"
}
}
}
```
Feature order: `[num_new_tokens, prod_ext_ctx, num_context_tokens, batch_size]`
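Assuming the `linear` model type simply applies these weights to the features in the order above, a prediction can be reproduced directly from the JSON file. The helper below is a sketch, not part of the package's API; the file and config names are taken from the quickstart.

```python
import json


def predict_latency(predictor_file: str, config_name: str, mode: str,
                    features: list[float]) -> float:
    """Sketch: dot(weights, features) + bias for a 'linear' predictor entry."""
    with open(predictor_file) as f:
        predictors = json.load(f)
    entry = predictors[config_name][mode]
    assert entry["model_type"] == "linear"
    return sum(w * x for w, x in zip(entry["weights"], features)) + entry["bias"]


# Feature order: [num_new_tokens, prod_ext_ctx, num_context_tokens, batch_size]
# Decode step with bs=8 and input_len=1024, matching the quickstart's predict call.
features = [8, 8 * 1024, 8 * 1024, 8]
print(predict_latency("trained_predictors.json", "tp1_config", "decode", features))
```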
## Webviewer
The `webview` command launches a web-based viewer with interactive plots for the trained predictors.
## Quickstart workflow
### Using PyPI Package
```bash
llm-execution-time-predictor profile Qwen/Qwen3-4B --tp_size 1
llm-execution-time-predictor train_models tp1_config benchmark_data_Qwen_Qwen3-4B_TP_1_PP_1.json --predictor-file trained_predictors.json
llm-execution-time-predictor predict trained_predictors.json tp1_config --mode decode --bs 8 --input-len 1024
llm-execution-time-predictor webview --predictor-file trained_predictors.json
```
### Using from Source
```bash
python llm_execution_time_predictor/llm_forward_predictor_cli.py profile Qwen/Qwen3-4B --tp_size 1
python llm_execution_time_predictor/llm_forward_predictor_cli.py train_models tp1_config benchmark_data_Qwen_Qwen3-4B_TP_1_PP_1.json --predictor-file trained_predictors.json
python llm_execution_time_predictor/llm_forward_predictor_cli.py predict trained_predictors.json tp1_config --mode decode --bs 8 --input-len 1024
python llm_execution_time_predictor/llm_forward_predictor_cli.py webview --predictor-file trained_predictors.json
```
# TODO
1. Fix vLLM forcing a single batch
With the vLLM backend, vLLM may currently run more than one batch, which makes some of the profiling inaccurate and skews the model. There is currently no good solution for this.
# Ack
Co-contributors: [Dongming Li](https://github.com/dongmingli-Ben) and [Zijian He](https://github.com/jiange91)