multimedeval

Name: multimedeval
Version: 1.0.0
Summary: A Python tool to evaluate the performance of VLM on the medical domain.
Upload time: 2025-07-23 14:44:40
Author: Corentin Royer
Requires Python: <3.13,>=3.9
License: MIT
Keywords: evaluation, medical, vlm
            # MultiMedEval

MultiMedEval is a library for evaluating the performance of Vision-Language Models (VLMs) on medical-domain tasks. The goal is to provide a set of benchmarks with a unified evaluation scheme to facilitate the development and comparison of medical VLMs.
We include 24 tasks covering 10 different imaging modalities, plus some text-only tasks.

![tests workflow](https://github.com/corentin-ryr/MultiMedEval/actions/workflows/python-tests.yml/badge.svg) ![PyPI - Version](https://img.shields.io/pypi/v/multimedeval) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/multimedeval) ![GitHub License](https://img.shields.io/github/license/corentin-ryr/MultiMedEval)

## Tasks

<details>
  <summary>Question Answering</summary>

| Task     | Description                                            | Modality         | Size |
| -------- | ------------------------------------------------------ | ---------------- | ---- |
| MedQA    | Multiple choice questions on general medical knowledge | General medicine | 1273 |
| PubMedQA | Yes/no/maybe questions based on PubMed paper abstracts | General medicine | 500  |
| MedMCQA  | Multiple choice questions on general medical knowledge | General medicine | 4183 |

</details>

<br/>

<details>
  <summary>Visual Question Answering</summary>

| Task     | Description                              | Modality  | Size |
| -------- | ---------------------------------------- | --------- | ---- |
| VQA-RAD  | Open ended questions on radiology images | X-ray     | 451  |
| Path-VQA | Open ended questions on pathology images | Pathology | 6719 |
| SLAKE    | Open ended questions on radiology images | X-ray     | 1061 |

</details>

<br/>

<details>
  <summary>Report Comparison</summary>

| Task                       | Description                                                                       | Modality    | Size  |
| -------------------------- | --------------------------------------------------------------------------------- | ----------- | ----- |
| MIMIC-CXR-ReportGeneration | Generation of findings sections of radiology reports based on the radiology images | Chest X-ray | 2347  |
| MIMIC-III                  | Summarization of radiology reports                                                | Text        | 13054 |

</details>

<br/>

<details>
  <summary>Natural Language Inference</summary>

| Task   | Description                                      | Modality         | Size |
| ------ | ------------------------------------------------ | ---------------- | ---- |
| MedNLI | Natural Language Inference on medical sentences. | General medicine | 1422 |

</details>

<br/>

<details>
  <summary>Image Classification</summary>

| Task                          | Description                                                                                                   | Modality      | Size  |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------- | ------------- | ----- |
| MIMIC-CXR-ImageClassification | Classification of radiology images into 5 diseases                                                            | Chest X-ray   | 5159  |
| VinDr-Mammo                   | Classification of mammography images into 5 BIRADS levels                                                     | Mammography   | 429   |
| Pad-UFES-20                   | Classification of skin lesion images into 7 diseases                                                          | Dermatology   | 2298  |
| CBIS-DDSM-Mass                | Classification of masses in mammography images into "benign", "malignant" or "benign without callback"        | Mammography   | 378   |
| CBIS-DDSM-Calcification       | Classification of calcification in mammography images into "benign", "malignant" or "benign without callback" | Mammography   | 326   |
| MNIST-Oct                     | Image classification of optical coherence tomography scans of the retina                                       | OCT           | 1000  |
| MNIST-Path                    | Image classification of pathology images                                                                        | Pathology     | 7180  |
| MNIST-Blood                   | Image classification of blood cells seen through a microscope                                                  | Microscopy    | 3421  |
| MNIST-Breast                  | Image classification of mammography images                                                                     | Mammography   | 156   |
| MNIST-Derma                   | Image classification of skin defect images                                                                     | Dermatology   | 2005  |
| MNIST-OrganC                  | Image classification of abdominal CT scans                                                                     | CT            | 8216  |
| MNIST-OrganS                  | Image classification of abdominal CT scans                                                                     | CT            | 8827  |
| MNIST-Pneumonia               | Image classification of chest X-rays                                                                           | X-ray         | 624   |
| MNIST-Retina                  | Image classification of the retina taken with a fundus camera                                                  | Fundus camera | 400   |
| MNIST-Tissue                  | Image classification of kidney cortex seen through a microscope                                                | Microscopy    | 12820 |

</details>

<br/>

<p align="center">
    <img src="figures/sankey.png" alt="Sankey graph">
    <br>
    <em>Representation of the modalities, tasks and datasets in MultiMedEval</em>
</p>

## Setup

To install the library, use `pip`:

```console
pip install multimedeval
```

To run the benchmark on your model, you first need to create an instance of the `MultiMedEval` class.

```python
from multimedeval import MultiMedEval, SetupParams, EvalParams
from multimedeval.utils import BatcherInput, BatcherOutput

engine = MultiMedEval()
```

You then need to call the `setup` function of the `engine`. This will download the datasets if needed and prepare them for evaluation. You can specify where to store the data and which datasets you want to download.

```python
setupParams = SetupParams(medqa_dir="data/")
tasksReady = engine.setup(setup_params=setupParams)
```

Here we initialize the `SetupParams` dataclass with only the path for the MedQA dataset. If you do not pass a directory for some of the datasets, they will be skipped during evaluation. During the setup process, the script needs a PhysioNet username and password to download "VinDr-Mammo", "MIMIC-CXR" and "MIMIC-III". You also need to set up Kaggle on your machine before running the setup, as "CBIS-DDSM" is hosted on Kaggle. At the end of the setup process, you will see a summary of which tasks are ready and which failed, and the function will return that summary as a dictionary.

## Usage

### Implement the Batcher

The user must implement one Callable: `batcher`. It takes a batch of inputs and must return the corresponding answers.
The batch is a list of inputs.
Each input is an instance of the `BatcherInput` dataclass, containing the following fields:

- `conversation`: a prompt in the form of a Hugging Face-style conversation between a user and an assistant.
- `images`: a list of Pillow images. The number of images matches the number of `<img>` tokens in the prompt, and the images appear in the same order.
- `segmentation_masks`: (optional) a list of segmentation masks. Their number matches the number of `<seg>` tokens in the prompt, and they appear in the same order.

```python
[
    BatcherInput(
        conversation = 
          [
              {"role": "user", "content": "This is a question with an image <img>."},
              {"role": "assistant", "content": "This is the answer."},
              {"role": "user", "content": "This is a question with an image <img>."},
          ],
        images = [PIL.Image(), PIL.Image()],
        segmentation_masks = [PIL.Image(), PIL.Image()]
    ),
    BatcherInput(
        conversation =
          [
              {"role": "user", "content": "This is a question without images."},
              {"role": "assistant", "content": "This is the answer."},
              {"role": "user", "content": "This is a question without images."},
          ],
        images = [],
        segmentation_masks = []
    ),

]
```

Here is an example of a `batcher` without any logic:

```python
from typing import List

def batcher(prompts: List[BatcherInput]) -> List[BatcherOutput]:
    # Return one constant answer per input in the batch.
    return [BatcherOutput(text="Answer") for _ in prompts]
```

A function is the simplest example of a Callable, but the batcher can also be implemented as a callable class (i.e. a class implementing the `__call__` method). Doing it this way allows you to initialize the model in the class's `__init__` method. We give an example for the Mistral model (a language-only model).

```python
from typing import List

from transformers import AutoModelForCausalLM, AutoTokenizer

from multimedeval.utils import BatcherInput, BatcherOutput


class batcherMistral:
    def __init__(self) -> None:
        self.model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
        self.tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def __call__(self, prompts: List[BatcherInput]) -> List[BatcherOutput]:
        # Convert each conversation to a prompt string, then tokenize the whole batch.
        model_inputs = [self.tokenizer.apply_chat_template(messages.conversation, tokenize=False) for messages in prompts]
        model_inputs = self.tokenizer(model_inputs, padding="max_length", truncation=True, max_length=1024, return_tensors="pt")

        generated_ids = self.model.generate(**model_inputs, max_new_tokens=200, do_sample=True, pad_token_id=self.tokenizer.pad_token_id)

        # Keep only the newly generated tokens (drop the prompt).
        generated_ids = generated_ids[:, model_inputs["input_ids"].shape[1]:]

        answers = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        return [BatcherOutput(text=answer) for answer in answers]
```

### Run the benchmark

To run the benchmark, call the `eval` method of the `MultiMedEval` class with the list of tasks to benchmark, the batcher to evaluate, and the evaluation parameters. If the list is empty, all tasks will be benchmarked.

```python
evalParams = EvalParams(batch_size=128)
results = engine.eval(["MedQA", "VQA-RAD"], batcher, eval_params=evalParams)
```
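
Putting these pieces together, a minimal end-to-end script could look like the sketch below. It reuses the trivial batcher from above, evaluates only MedQA, and uses a placeholder data directory; a real evaluation would plug in an actual model and more tasks.

```python
from typing import List

from multimedeval import MultiMedEval, SetupParams, EvalParams
from multimedeval.utils import BatcherInput, BatcherOutput


def batcher(prompts: List[BatcherInput]) -> List[BatcherOutput]:
    # Trivial batcher that returns a constant answer for every input.
    return [BatcherOutput(text="Answer") for _ in prompts]


engine = MultiMedEval()
# "data/" is a placeholder path; only MedQA is set up here, other tasks are skipped.
engine.setup(setup_params=SetupParams(medqa_dir="data/"))

results = engine.eval(["MedQA"], batcher, eval_params=EvalParams(batch_size=32))
print(results)
```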

## MultiMedEval parameters

The `SetupParams` class takes a path for each dataset, along with a few other options (see the example after the list):

- medqa_dir: used as `cache_dir` in Hugging Face's `load_dataset`
- pubmedqa_dir: used as `cache_dir` in Hugging Face's `load_dataset`
- medmcqa_dir: used as `cache_dir` in Hugging Face's `load_dataset`
- vqa_rad_dir: used as `cache_dir` in Hugging Face's `load_dataset`
- path_vqa_dir: used as `cache_dir` in Hugging Face's `load_dataset`
- slake_dir: the dataset is currently hosted on Google Drive, which can be an issue on some systems.
- mimic_iii_dir: path for the (PhysioNet) MIMIC-III dataset.
- mednli_dir: used as `cache_dir` in Hugging Face's `load_dataset`
- mimic_cxr_dir: path for the (PhysioNet) MIMIC-CXR dataset.
- vindr_mammo_dir: path for the (PhysioNet) VinDr-Mammo dataset.
- pad_ufes_20_dir: path for the Pad-UFES-20 dataset.
- cbis_ddsm_dir: the dataset is hosted on Kaggle. Kaggle must be set up on the system (see [this](https://www.kaggle.com/docs/api#getting-started-installation-&-authentication)).
- mnist_oct_dir
- mnist_path_dir
- mnist_blood_dir
- mnist_breast_dir
- mnist_derma_dir
- mnist_organc_dir
- mnist_organs_dir
- mnist_pneumonia_dir
- mnist_retina_dir
- mnist_tissue_dir
- chexbert_dir: path for the CheXbert model checkpoint
- physionet_username: PhysioNet username, used to download MIMIC and VinDr-Mammo
- physionet_password: password for the PhysioNet account
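
As an illustration, a setup covering a few datasets alongside the PhysioNet-hosted ones might look like the following sketch. All directory paths are placeholders, and the credentials must belong to a PhysioNet account with access to the corresponding datasets.

```python
from multimedeval import MultiMedEval, SetupParams

engine = MultiMedEval()

setup_params = SetupParams(
    medqa_dir="data/medqa",              # placeholder paths
    vqa_rad_dir="data/vqa_rad",
    mimic_cxr_dir="data/mimic_cxr",
    vindr_mammo_dir="data/vindr_mammo",
    chexbert_dir="data/chexbert",        # CheXbert checkpoint
    physionet_username="your_username",  # placeholder credentials
    physionet_password="your_password",
)
tasks_ready = engine.setup(setup_params=setup_params)
```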

The `EvalParams` class takes the following arguments (see the example after the list):

- batch_size: The size of the batches sent to the user's batcher Callable.
- run_name: The name to use for the folder where the output will be stored.
- fewshot: A boolean indicating whether the evaluation is few-shot.
- num_workers: The number of workers for the dataloader.
- device: The device to run the evaluation on.
- tensorBoardWriter: The TensorBoard writer to use for logging.
- tensorboardStep: The global step used when logging to TensorBoard.
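
A hedged example of configuring the evaluation, assuming the field names listed above and using PyTorch's `SummaryWriter` for TensorBoard logging (the exact writer and device types expected by the library are assumptions here):

```python
from torch.utils.tensorboard import SummaryWriter  # assumes PyTorch is installed

from multimedeval import EvalParams

eval_params = EvalParams(
    batch_size=64,
    run_name="mistral_run",   # hypothetical run name
    fewshot=False,
    num_workers=4,
    device="cuda",            # or "cpu"; a torch.device may also be accepted
    tensorBoardWriter=SummaryWriter(log_dir="runs/mistral_run"),
    tensorboardStep=0,
)
```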

## Additional tasks

To add a new task to the list of already implemented ones, create a folder named `MultiMedEvalAdditionalDatasets` and a subfolder with the name of your dataset.

Inside your dataset folder, create a `json` file following this template for a VQA dataset:

```json
{
  "taskType": "VQA",
  "modality": "Radiology",
  "samples": [
    {
      "question": "Question 1",
      "answer": "Answer 1",
      "images": ["image1.png", "image2.png"]
    },
    { "question": "Question 2", "answer": "Answer 2", "images": ["image1.png"] }
  ]
}
```

And for a QA dataset:

```json
{
  "taskType": "QA",
  "modality": "Pathology",
  "samples": [
    {
      "question": "Question 1",
      "answer": "Answer 1",
      "options": ["Option 1", "Option 2"],
      "images": ["image1.png", "image2.png"]
    },
    {
      "question": "Question 2",
      "answer": "Answer 2",
      "options": ["Option 1", "Option 2"],
      "images": ["image1.png"]
    }
  ]
}
```

Note that in both cases the `images` key is optional. If the `taskType` is VQA, the computed metrics are BLEU-1, accuracy for closed and open questions, recall, recall for open questions, and F1. For the QA `taskType`, the tool reports accuracy (computed by comparing the answer to every option using BLEU).
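
As a convenience, a small script can generate the expected folder layout and JSON file. The dataset name and the JSON file name below are assumptions made for illustration.

```python
import json
from pathlib import Path

# Hypothetical dataset "MyVQADataset"; the JSON file name is an assumption.
dataset_dir = Path("MultiMedEvalAdditionalDatasets") / "MyVQADataset"
dataset_dir.mkdir(parents=True, exist_ok=True)

task = {
    "taskType": "VQA",
    "modality": "Radiology",
    "samples": [
        {"question": "Question 1", "answer": "Answer 1", "images": ["image1.png"]},
        {"question": "Question 2", "answer": "Answer 2"},
    ],
}

(dataset_dir / "dataset.json").write_text(json.dumps(task, indent=2))
```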

## Reference

```
@misc{royer2024multimedeval,
      title={MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models},
      author={Corentin Royer and Bjoern Menze and Anjany Sekuboyina},
      year={2024},
      eprint={2402.09262},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```


            
