| Field | Value |
|---|---|
| Name | nvidia-vlmeval |
| Version | 25.7.1 |
| home_page | None |
| Summary | OpenCompass VLM Evaluation Kit - packaged by NVIDIA |
| upload_time | 2025-08-05 08:39:28 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.7.0 |
| license | None |
| keywords | ai, nlp, in-context learning |
| requirements | No requirements were recorded. |
# NVIDIA Eval Factory
The goal of NVIDIA Eval Factory is to advance and refine state-of-the-art methodologies for model evaluation and to deliver them as modular evaluation packages (evaluation containers and pip wheels) that teams can use as standardized building blocks.
# Quick start guide
NVIDIA Eval Factory provides evaluation clients that are purpose-built to evaluate model endpoints using our Standard API.
## Launching an evaluation for a VLM
1. Install the package
```bash
pip install nvidia-vlmeval
```
2. (Optional) Set a token for your API endpoint if it's protected
```bash
export MY_API_KEY="your_api_key_here"
```
3. List the available evaluations:
```bash
$ eval-factory ls
Available tasks:
* ai2d_judge (in vlmevalkit)
* chartqa (in vlmevalkit)
* mathvista-mini (in vlmevalkit)
* mmmu_judge (in vlmevalkit)
* ocrbench (in vlmevalkit)
* slidevqa (in vlmevalkit)
...
```
4. Run the evaluation of your choice:
```bash
eval-factory run_eval \
--eval_type ocrbench \
--model_id microsoft/phi-4-multimodal-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type vlm \
--api_key_name MY_API_KEY \
--output_dir /workspace/results
```
5. Gather the results (an optional JSON-conversion sketch follows this list)
```bash
cat /workspace/results/results.yml
```
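The results are written as YAML. If JSON is easier to post-process, the file can be converted on the fly; the one-liner below is only a sketch and assumes PyYAML is importable in the environment:
```bash
# Pretty-print results.yml as JSON (assumes the PyYAML package is available)
python -c "import json, yaml; print(json.dumps(yaml.safe_load(open('/workspace/results/results.yml')), indent=2))"
```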
# Command-Line Tool
Each package ships with a set of command-line tools designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for `vlmevalkit`:
## Commands
### 1. **List Evaluation Types**
```bash
eval-factory ls
```
Displays the evaluation types available within the harness.
### 2. **Run an evaluation**
The `eval-factory run_eval` command executes the evaluation process. Below are the flags and their descriptions:
### Required flags
* `--eval_type <string>`
The type of evaluation to perform.
* `--model_id <string>`
The name or identifier of the model to evaluate.
* `--model_url <url>`
The API endpoint where the model is accessible.
* `--model_type <string>`
The type of the model to evaluate; currently one of "chat", "completions", or "vlm".
* `--output_dir <directory>`
The directory to use as the working directory for the evaluation. The results, including the `results.yml` output file, will be saved here.
### Optional flags
* `--api_key_name <string>`
The name of the environment variable that stores the Bearer token for the API, if authentication is required.
* `--run_config <path>`
Specifies the path to a YAML file containing the evaluation definition.
### Example
```bash
eval-factory run_eval \
--eval_type ocrbench \
--model_id my_model \
--model_type vlm \
--model_url http://localhost:8000/v1/chat/completions \
--output_dir ./evaluation_results
```
If the model API requires authentication, set the API key in an environment variable and reference it using the `--api_key_name` flag:
```bash
export MY_API_KEY="your_api_key_here"
eval-factory run_eval \
--eval_type ocrbench \
--model_id my_model \
--model_type vlm \
--model_url http://localhost:8000/v1/chat/completions \
--api_key_name MY_API_KEY \
--output_dir ./evaluation_results
```
# Configuring evaluations via YAML
Evaluations in NVIDIA Eval Factory are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API, which ensures consistency across evaluations.
Example of a YAML config:
```yaml
config:
type: ocrbench
params:
parallelism: 50
limit_samples: 20
target:
api_endpoint:
model_id: microsoft/phi-4-multimodal-instruct
type: vlm
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key: NVIDIA_API_KEY
```
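To launch an evaluation from a config like the one above, save it to a file and pass its path via `--run_config`. The filename below is only illustrative, and any required flags not covered by the YAML, such as `--output_dir`, can still be supplied on the command line:
```bash
# Illustrative filename; save the YAML shown above as ocrbench_config.yml first
eval-factory run_eval \
  --run_config ocrbench_config.yml \
  --output_dir ./evaluation_results
```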
Overrides are applied with the following priority, from highest to lowest (see the sketch after this list):
1. command line arguments
2. user config (as seen above)
3. task defaults (defined per task type)
4. framework defaults
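For example, when the config above is passed via `--run_config` and an endpoint is also given on the command line, the command-line value is expected to win, since CLI arguments sit at the top of this list:
```bash
# --model_url should override the url set in ocrbench_config.yml (illustrative values)
eval-factory run_eval \
  --run_config ocrbench_config.yml \
  --model_url http://localhost:8000/v1/chat/completions \
  --output_dir ./evaluation_results
```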
The `--dry_run` option prints the final run configuration and command without executing the evaluation.
### Example:
```bash
eval-factory run_eval \
--eval_type ocrbench \
--model_id my_model \
--model_type vlm \
--model_url http://localhost:8000/v1/chat/completions \
--output_dir .evaluation_results \
--dry_run
```
Output:
```bash
Rendered config:
command: "cat > {{config.output_dir}}/vlmeval_config.json << 'EOF'\n{\n \"model\"\
: {\n \"{{target.api_endpoint.model_id.split('/')[-1]}}\": {\n \"class\"\
: \"CustomOAIEndpoint\",\n \"model\": \"{{target.api_endpoint.model_id}}\"\
,\n \"api_base\": \"{{target.api_endpoint.url}}\",\n \"api_key_var_name\"\
: \"{{target.api_endpoint.api_key}}\",\n \"max_tokens\": {{config.params.max_new_tokens}},\n\
\ \"temperature\": {{config.params.temperature}},{% if config.params.top_p\
\ is not none %}\n \"top_p\": {{config.params.top_p}},{% endif %}\n \"\
retry\": {{config.params.max_retries}},\n \"timeout\": {{config.params.request_timeout}}{%\
\ if config.params.extra.wait is defined %},\n \"wait\": {{config.params.extra.wait}}{%\
\ endif %}{% if config.params.extra.img_size is defined %},\n \"img_size\"\
: {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail\
\ is defined %},\n \"img_detail\": \"{{config.params.extra.img_detail}}\"{%\
\ endif %}{% if config.params.extra.system_prompt is defined %},\n \"system_prompt\"\
: \"{{config.params.extra.system_prompt}}\"{% endif %}{% if config.params.extra.verbose\
\ is defined %},\n \"verbose\": {{config.params.extra.verbose}}{% endif %}\n\
\ }\n },\n \"data\": {\n \"{{config.params.extra.dataset.name}}\": {\n \
\ \"class\": \"{{config.params.extra.dataset.class}}\",\n \"dataset\":\
\ \"{{config.params.extra.dataset.name}}\",\n \"model\": \"{{target.api_endpoint.model_id}}\"\
\n }\n }\n}\nEOF\npython -m vlmeval.run \\\n --config {{config.output_dir}}/vlmeval_config.json\
\ \\\n --work-dir {{config.output_dir}} \\\n --api-nproc {{config.params.parallelism}}\
\ \\\n {%- if config.params.extra.judge is defined %}\n --judge {{config.params.extra.judge.model}}\
\ \\\n --judge-args '{{config.params.extra.judge.args}}' \\\n {%- endif %}\n \
\ {% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{%\
\ endif %}\n"
framework_name: vlmevalkit
pkg_name: vlmeval
config:
output_dir: .evaluation_results
params:
limit_samples: null
max_new_tokens: 2048
max_retries: 5
parallelism: 4
task: null
temperature: 0.0
request_timeout: 60
top_p: null
extra:
dataset:
name: OCRBench
class: OCRBench
supported_endpoint_types:
- vlm
type: ocrbench
target:
api_endpoint:
api_key: null
model_id: my_model
stream: null
type: vlm
url: http://localhost:8000/v1/chat/completions
Rendered command:
cat > .evaluation_results/vlmeval_config.json << 'EOF'
{
"model": {
"my_model": {
"class": "CustomOAIEndpoint",
"model": "my_model",
"api_base": "http://localhost:8000/v1/chat/completions",
"api_key_var_name": "None",
"max_tokens": 2048,
"temperature": 0.0,
"retry": 5,
"timeout": 60
}
},
"data": {
"OCRBench": {
"class": "OCRBench",
"dataset": "OCRBench",
"model": "my_model"
}
}
}
EOF
python -m vlmeval.run \
--config .evaluation_results/vlmeval_config.json \
--work-dir .evaluation_results \
--api-nproc 4 \
```
# FAQ
## Deploying a model as an endpoint
NVIDIA Eval Factory utilizes a client-server communication architecture to interact with the model. As a prerequisite, the **model must be deployed as an endpoint with a NIM-compatible API**.
Users have the flexibility to deploy their model using their own infrastructure and tooling.
Servers with APIs that conform to the OpenAI/NIM API standard are expected to work seamlessly out of the box.
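As a quick sanity check before launching an evaluation, you can send a minimal OpenAI-style chat-completions request to the endpoint. The host, port, model name, and image URL below are placeholders rather than values prescribed by the package:
```bash
# Illustrative smoke test against an OpenAI/NIM-compatible VLM endpoint (placeholder values)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MY_API_KEY" \
  -d '{
        "model": "my_model",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "What text appears in this image?"},
              {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}}
            ]
          }
        ],
        "max_tokens": 64
      }'
```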