| Field | Value |
|---|---|
| Name | nvidia-vlmeval |
| Version | 25.7.1 |
| home_page | None |
| Summary | OpenCompass VLM Evaluation Kit - packaged by NVIDIA |
| upload_time | 2025-08-05 08:39:28 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.7.0 |
| license | None |
| keywords | ai, nlp, in-context learning |
| requirements | No requirements were recorded. |
# NVIDIA Eval Factory
The goal of NVIDIA Eval Factory is to advance and refine state-of-the-art methodologies for model evaluation and to deliver them as modular evaluation packages (evaluation containers and pip wheels) that teams can use as standardized building blocks.
# Quick start guide
NVIDIA Eval Factory provides evaluation clients that are purpose-built to evaluate model endpoints using our Standard API.
## Launching an evaluation for a VLM
1. Install the package
```bash
pip install nvidia-vlmeval
```
2. (Optional) Set a token for your API endpoint if it's protected
```bash
export MY_API_KEY="your_api_key_here"
```
3. List the available evaluations:
```bash
$ eval-factory ls
Available tasks:
* ai2d_judge (in vlmevalkit)
* chartqa (in vlmevalkit)
* mathvista-mini (in vlmevalkit)
* mmmu_judge (in vlmevalkit)
* ocrbench (in vlmevalkit)
* slidevqa (in vlmevalkit)
...
```
4. Run the evaluation of your choice:
```bash
eval-factory run_eval \
--eval_type ocrbench \
--model_id microsoft/phi-4-multimodal-instruct \
--model_url https://integrate.api.nvidia.com/v1/chat/completions \
--model_type vlm \
--api_key_name MY_API_KEY \
--output_dir /workspace/results
```
5. Gather the results (an optional JSON-conversion sketch follows this list)
```bash
cat /workspace/results/results.yml
```
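The results are written as YAML. If JSON is easier to post-process, the file can be converted on the fly; the one-liner below is only a sketch and assumes PyYAML is importable in the environment:
```bash
# Pretty-print results.yml as JSON (assumes the PyYAML package is available)
python -c "import json, yaml; print(json.dumps(yaml.safe_load(open('/workspace/results/results.yml')), indent=2))"
```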
# Command-Line Tool
Each package ships with a set of command-line tools designed to simplify the execution of evaluation tasks. Below are the available commands and their usage for `vlmevalkit`:
## Commands
### 1. **List Evaluation Types**
```bash
eval-factory ls
```
Displays the evaluation types available within the harness.
### 2. **Run an evaluation**
The `eval-factory run_eval` command executes the evaluation process. Below are the flags and their descriptions:
### Required flags
* `--eval_type <string>`
The type of evaluation to perform.
* `--model_id <string>`
The name or identifier of the model to evaluate.
* `--model_url <url>`
The API endpoint where the model is accessible.
* `--model_type <string>`
The type of the model to evaluate; currently one of "chat", "completions", or "vlm".
* `--output_dir <directory>`
The directory to use as the working directory for the evaluation. The results, including the `results.yml` output file, will be saved here.
### Optional flags
* `--api_key_name <string>`
The name of the environment variable that stores the Bearer token for the API, if authentication is required.
* `--run_config <path>`
Specifies the path to a YAML file containing the evaluation definition.
### Example
```bash
eval-factory run_eval \
--eval_type ocrbench \
--model_id my_model \
--model_type vlm \
--model_url http://localhost:8000/v1/chat/completions \
--output_dir ./evaluation_results
```
If the model API requires authentication, set the API key in an environment variable and reference it using the `--api_key_name` flag:
```bash
export MY_API_KEY="your_api_key_here"
eval-factory run_eval \
--eval_type ocrbench \
--model_id my_model \
--model_type vlm \
--model_url http://localhost:8000/v1/chat/completions \
--api_key_name MY_API_KEY \
--output_dir ./evaluation_results
```
# Configuring evaluations via YAML
Evaluations in NVIDIA Eval Factory are configured using YAML files that define the parameters and settings required for the evaluation process. These configuration files follow a standard API, which ensures consistency across evaluations.
Example of a YAML config:
```yaml
config:
type: ocrbench
params:
parallelism: 50
limit_samples: 20
target:
api_endpoint:
model_id: microsoft/phi-4-multimodal-instruct
type: vlm
url: https://integrate.api.nvidia.com/v1/chat/completions
api_key: NVIDIA_API_KEY
```
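To launch an evaluation from a config like the one above, save it to a file and pass its path via `--run_config`. The filename below is only illustrative, and any required flags not covered by the YAML, such as `--output_dir`, can still be supplied on the command line:
```bash
# Illustrative filename; save the YAML shown above as ocrbench_config.yml first
eval-factory run_eval \
  --run_config ocrbench_config.yml \
  --output_dir ./evaluation_results
```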
Overrides are applied with the following priority, from highest to lowest (see the sketch after this list):
1. command line arguments
2. user config (as seen above)
3. task defaults (defined per task type)
4. framework defaults
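For example, when the config above is passed via `--run_config` and an endpoint is also given on the command line, the command-line value is expected to win, since CLI arguments sit at the top of this list:
```bash
# --model_url should override the url set in ocrbench_config.yml (illustrative values)
eval-factory run_eval \
  --run_config ocrbench_config.yml \
  --model_url http://localhost:8000/v1/chat/completions \
  --output_dir ./evaluation_results
```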
The `--dry_run` option prints the final run configuration and command without executing the evaluation.
### Example:
```bash
eval-factory run_eval \
--eval_type ocrbench \
--model_id my_model \
--model_type vlm \
--model_url http://localhost:8000/v1/chat/completions \
--output_dir .evaluation_results \
--dry_run
```
Output:
```bash
Rendered config:
command: "cat > {{config.output_dir}}/vlmeval_config.json << 'EOF'\n{\n \"model\"\
: {\n \"{{target.api_endpoint.model_id.split('/')[-1]}}\": {\n \"class\"\
: \"CustomOAIEndpoint\",\n \"model\": \"{{target.api_endpoint.model_id}}\"\
,\n \"api_base\": \"{{target.api_endpoint.url}}\",\n \"api_key_var_name\"\
: \"{{target.api_endpoint.api_key}}\",\n \"max_tokens\": {{config.params.max_new_tokens}},\n\
\ \"temperature\": {{config.params.temperature}},{% if config.params.top_p\
\ is not none %}\n \"top_p\": {{config.params.top_p}},{% endif %}\n \"\
retry\": {{config.params.max_retries}},\n \"timeout\": {{config.params.request_timeout}}{%\
\ if config.params.extra.wait is defined %},\n \"wait\": {{config.params.extra.wait}}{%\
\ endif %}{% if config.params.extra.img_size is defined %},\n \"img_size\"\
: {{config.params.extra.img_size}}{% endif %}{% if config.params.extra.img_detail\
\ is defined %},\n \"img_detail\": \"{{config.params.extra.img_detail}}\"{%\
\ endif %}{% if config.params.extra.system_prompt is defined %},\n \"system_prompt\"\
: \"{{config.params.extra.system_prompt}}\"{% endif %}{% if config.params.extra.verbose\
\ is defined %},\n \"verbose\": {{config.params.extra.verbose}}{% endif %}\n\
\ }\n },\n \"data\": {\n \"{{config.params.extra.dataset.name}}\": {\n \
\ \"class\": \"{{config.params.extra.dataset.class}}\",\n \"dataset\":\
\ \"{{config.params.extra.dataset.name}}\",\n \"model\": \"{{target.api_endpoint.model_id}}\"\
\n }\n }\n}\nEOF\npython -m vlmeval.run \\\n --config {{config.output_dir}}/vlmeval_config.json\
\ \\\n --work-dir {{config.output_dir}} \\\n --api-nproc {{config.params.parallelism}}\
\ \\\n {%- if config.params.extra.judge is defined %}\n --judge {{config.params.extra.judge.model}}\
\ \\\n --judge-args '{{config.params.extra.judge.args}}' \\\n {%- endif %}\n \
\ {% if config.params.limit_samples is not none %}--first-n {{config.params.limit_samples}}{%\
\ endif %}\n"
framework_name: vlmevalkit
pkg_name: vlmeval
config:
output_dir: .evaluation_results
params:
limit_samples: null
max_new_tokens: 2048
max_retries: 5
parallelism: 4
task: null
temperature: 0.0
request_timeout: 60
top_p: null
extra:
dataset:
name: OCRBench
class: OCRBench
supported_endpoint_types:
- vlm
type: ocrbench
target:
api_endpoint:
api_key: null
model_id: my_model
stream: null
type: vlm
url: http://localhost:8000/v1/chat/completions
Rendered command:
cat > .evaluation_results/vlmeval_config.json << 'EOF'
{
"model": {
"my_model": {
"class": "CustomOAIEndpoint",
"model": "my_model",
"api_base": "http://localhost:8000/v1/chat/completions",
"api_key_var_name": "None",
"max_tokens": 2048,
"temperature": 0.0,
"retry": 5,
"timeout": 60
}
},
"data": {
"OCRBench": {
"class": "OCRBench",
"dataset": "OCRBench",
"model": "my_model"
}
}
}
EOF
python -m vlmeval.run \
--config .evaluation_results/vlmeval_config.json \
--work-dir .evaluation_results \
--api-nproc 4 \
```
# FAQ
## Deploying a model as an endpoint
NVIDIA Eval Factory utilizes a client-server communication architecture to interact with the model. As a prerequisite, the **model must be deployed as an endpoint with a NIM-compatible API**.
Users have the flexibility to deploy their model using their own infrastructure and tooling.
Servers with APIs that conform to the OpenAI/NIM API standard are expected to work seamlessly out of the box.
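As a quick sanity check before launching an evaluation, you can send a minimal OpenAI-style chat-completions request to the endpoint. The host, port, model name, and image URL below are placeholders rather than values prescribed by the package:
```bash
# Illustrative smoke test against an OpenAI/NIM-compatible VLM endpoint (placeholder values)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MY_API_KEY" \
  -d '{
        "model": "my_model",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "What text appears in this image?"},
              {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}}
            ]
          }
        ],
        "max_tokens": 64
      }'
```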