<div align="center">
<a href="https://github.com/jbdel/RadEval">
<img src="https://github.com/jbdel/RadEval/raw/libra_run/RadEval_banner.png" alt="RadEval" width="100%" style="border-radius: 16px;">
</a>
</div>
<div align="center">

**All-in-one metrics for evaluating AI-generated radiology text**

</div>
<!--- BADGES: START --->
[PyPI](https://pypi.org/project/RadEval/)
[Model: RadEvalModernBERT](https://huggingface.co/IAMJB/RadEvalModernBERT)
[Video demo](https://justin13601.github.io/files/radeval.mp4)
[Demo Space](https://huggingface.co/spaces/X-iZhang/RadEval)
[License](https://github.com/jbdel/RadEval/main/LICENSE)
<!--- BADGES: END --->
## 📖 Table of Contents
- [🌟 Overview](#-overview)
  - [❓ Why RadEval](#-why-radeval)
  - [✨ Key Features](#-key-features)
- [⚙️ Installation](#️-installation)
- [🚀 Quick Start](#-quick-start)
- [📊 Evaluation Metrics](#-evaluation-metrics)
- [🔧 Configuration Options](#-configuration-options)
- [📁 File Format Suggestion](#-file-format-suggestion)
- [🧪 Hypothesis Testing (Significance Evaluation)](#-hypothesis-testing-significance-evaluation)
- [🧠 RadEval Expert Dataset](#-radeval-expert-dataset)
- [🚦 Performance Tips](#-performance-tips)
- [📚 Citation](#-citation)
## 🌟 Overview
**RadEval** is a comprehensive evaluation framework specifically designed for assessing the quality of AI-generated radiology text. It provides a unified interface to multiple state-of-the-art evaluation metrics, enabling researchers and practitioners to thoroughly evaluate their radiology text generation models.
### ❓ Why RadEval
> [!TIP]
> - **Domain-Specific**: Tailored for radiology text evaluation with medical knowledge integration
> - **Multi-Metric**: Supports 11+ different evaluation metrics in one framework
> - **Easy to Use**: Simple API with flexible configuration options
> - **Comprehensive**: From traditional n-gram metrics to advanced LLM-based evaluations
> - **Research-Ready**: Built for reproducible evaluation in radiology AI research
### ✨ Key Features
> [!NOTE]
> - **Multiple Evaluation Perspectives**: Lexical, semantic, clinical, and temporal evaluations
> - **Statistical Testing**: Built-in hypothesis testing for system comparison
> - **Batch Processing**: Efficient evaluation of large datasets
> - **Flexible Configuration**: Enable/disable specific metrics based on your needs
> - **Detailed Results**: Comprehensive output with metric explanations
> - **File Format Support**: Direct evaluation from common file formats (.tok, .txt, .json)
## ⚙️ Installation
RadEval supports Python **3.10+** and can be installed via PyPI or from source.
### Option 1: Install via PyPI (Recommended)
```bash
pip install RadEval
```
> [!TIP]
> We recommend using a virtual environment to avoid dependency conflicts, especially since some metrics require loading large inference models.
### Option 2: Install from GitHub (Latest Development Version)
Install the most up-to-date version directly from GitHub:
```bash
pip install git+https://github.com/jbdel/RadEval.git
```
> This is useful if you want the latest features or bug fixes before the next PyPI release.
### Option 3: Install in Development Mode (Recommended for Contributors)
```bash
# Clone the repository
git clone https://github.com/jbdel/RadEval.git
cd RadEval
# Create and activate a conda environment
conda create -n RadEval python=3.10 -y
conda activate RadEval
# Install in development (editable) mode
pip install -e .
```
> This setup allows you to modify the source code and reflect changes immediately without reinstallation.
## 🚀 Quick Start
### Example 1: Basic Evaluation
Evaluate a few reports using selected metrics:
```python
from RadEval import RadEval
import json
refs = [
"No definite acute cardiopulmonary process.Enlarged cardiac silhouette could be accentuated by patient's positioning.",
"Increased mild pulmonary edema and left basal atelectasis.",
]
hyps = [
"Relatively lower lung volumes with no focal airspace consolidation appreciated.",
"No pleural effusions or pneumothoraces.",
]
evaluator = RadEval(
do_radgraph=True,
do_bleu=True
)
results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
```
<details>
<summary> Output </summary>

```json
{
"radgraph_simple": 0.5,
"radgraph_partial": 0.5,
"radgraph_complete": 0.5,
"bleu": 0.5852363407461811
}
```
</details>
### Example 2: Comprehensive Evaluation
Set `do_details=True` to enable per-metric detailed outputs, including entity-level comparisons and score-specific breakdowns when supported.
```python
from RadEval import RadEval
import json
evaluator = RadEval(
do_srr_bert=True,
do_rouge=True,
do_details=True
)
refs = [
"No definite acute cardiopulmonary process.Enlarged cardiac silhouette could be accentuated by patient's positioning.",
"Increased mild pulmonary edema and left basal atelectasis.",
]
hyps = [
"Relatively lower lung volumes with no focal airspace consolidation appreciated.",
"No pleural effusions or pneumothoraces.",
]
results = evaluator(refs=refs, hyps=hyps)
print(json.dumps(results, indent=2))
```
<details>
<summary> Output </summary>

```json
{
"rouge": {
"rouge1": {
"mean_score": 0.04,
"sample_scores": [
0.08,
0.0
]
},
"rouge2": {
"mean_score": 0.0,
"sample_scores": [
0.0,
0.0
]
},
"rougeL": {
"mean_score": 0.04,
"sample_scores": [
0.08,
0.0
]
}
},
"srr_bert": {
"srr_bert_weighted_f1": 0.16666666666666666,
"srr_bert_weighted_precision": 0.125,
"srr_bert_weighted_recall": 0.25,
"label_scores": {
"Edema (Present)": {
"f1-score": 0.0,
"precision": 0.0,
"recall": 0.0,
"support": 1.0
},
"Atelectasis (Present)": {
"f1-score": 0.0,
"precision": 0.0,
"recall": 0.0,
"support": 1.0
},
"Cardiomegaly (Uncertain)": {
"f1-score": 0.0,
"precision": 0.0,
"recall": 0.0,
"support": 1.0
},
"No Finding": {
"f1-score": 0.6666666666666666,
"precision": 0.5,
"recall": 1.0,
"support": 1.0
}
}
}
}
```
</details>
### Example 3: Quick Hypothesis Testing
Compare two systems statistically to validate improvements:
```python
from RadEval import RadEval, compare_systems
# Define systems to compare
systems = {
'baseline': [
"No acute findings.",
"Mild heart enlargement."
],
'improved': [
"No acute cardiopulmonary process.",
"Mild cardiomegaly with clear lung fields."
]
}
# Reference ground truth
references = [
"No acute cardiopulmonary process.",
"Mild cardiomegaly with clear lung fields."
]
# Initialise evaluators only for selected metrics
bleu_evaluator = RadEval(do_bleu=True)
rouge_evaluator = RadEval(do_rouge=True)
# Wrap metrics into callable functions
metrics = {
'bleu': lambda hyps, refs: bleu_evaluator(refs, hyps)['bleu'],
'rouge1': lambda hyps, refs: rouge_evaluator(refs, hyps)['rouge1'],
}
# Run statistical test
signatures, scores = compare_systems(
systems=systems,
metrics=metrics,
references=references,
n_samples=50, # Number of bootstrap samples
print_results=True # Print significance table
)
```
<details>
<summary> Output </summary>
<pre lang="md">
================================================================================
PAIRED SIGNIFICANCE TEST RESULTS
================================================================================
System                        bleu      rouge1
----------------------------------------------------------------------
Baseline: baseline          0.0000      0.3968
----------------------------------------------------------------------
improved                    1.0000      1.0000
                        (p=0.4800)  (p=0.4600)
----------------------------------------------------------------------
- Significance level: 0.05
- '*' indicates significant difference (p < significance level)
- Null hypothesis: systems are essentially the same
- Significant results suggest systems are meaningfully different
METRIC SIGNATURES:
- bleu: bleu|ar:50|seed:12345
- rouge1: rouge1|ar:50|seed:12345
</pre>
</details>
### Example 4: File-based Evaluation
Recommended for batch evaluation of large sets of generated reports.
```python
import json
from RadEval import RadEval
def evaluate_from_files():
def read_reports(filepath):
with open(filepath, 'r') as f:
return [line.strip() for line in f if line.strip()]
refs = read_reports('ground_truth.tok')
hyps = read_reports('model_predictions.tok')
evaluator = RadEval(
do_radgraph=True,
do_bleu=True,
do_bertscore=True,
do_chexbert=True
)
results = evaluator(refs=refs, hyps=hyps)
with open('evaluation_results.json', 'w') as f:
json.dump(results, f, indent=2)
return results
```
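Continuing the snippet above, a call like the following writes `evaluation_results.json` and returns the score dictionary, assuming both `.tok` files exist alongside the script:

```python
results = evaluate_from_files()
print(json.dumps(results, indent=2))
```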
## 📊 Evaluation Metrics
RadEval currently supports the following evaluation metrics:
| Category | Metric | Description | Best For |
|----------|--------|-------------|----------|
| **Lexical** | BLEU | N-gram overlap measurement | Surface-level similarity |
| | ROUGE | Recall-oriented evaluation | Content coverage |
| **Semantic** | BERTScore | BERT-based semantic similarity | Semantic meaning preservation |
| | RadEval BERTScore | Domain-adapted ModernBertModel evaluation | Medical text semantics |
| **Clinical** | CheXbert | Clinical finding classification | Medical accuracy |
| | RadGraph | Knowledge graph-based evaluation | Clinical relationship accuracy |
| | RaTEScore | Entity-level assessments | Medical synonyms |
| **Specialized** | RadCLIQ | Composite multiple metrics | Clinical relevance |
| | SRR-BERT | Structured report evaluation | Report structure quality |
| | Temporal F1 | Time-sensitive evaluation | Temporal consistency |
| | GREEN | LLM-based metric | Overall radiology report quality |
## 🔧 Configuration Options
### RadEval Constructor Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `do_radgraph` | bool | False | Enable RadGraph evaluation |
| `do_green` | bool | False | Enable GREEN metric |
| `do_bleu` | bool | False | Enable BLEU evaluation |
| `do_rouge` | bool | False | Enable ROUGE metrics |
| `do_bertscore` | bool | False | Enable BERTScore |
| `do_srr_bert` | bool | False | Enable SRR-BERT |
| `do_chexbert` | bool | False | Enable CheXbert classification |
| `do_temporal` | bool | False | Enable temporal evaluation |
| `do_ratescore` | bool | False | Enable RaTEScore |
| `do_radcliq` | bool | False | Enable RadCLIQ |
| `do_radeval_bertsore` | bool | False | Enable RadEval BERTScore |
| `do_details` | bool | False | Include detailed metrics |
### Example Configurations
```python
# Lightweight evaluation (fast)
light_evaluator = RadEval(
do_bleu=True,
do_rouge=True
)
# Medical focus (clinical accuracy)
medical_evaluator = RadEval(
do_radgraph=True,
do_chexbert=True,
do_green=True
)
# Comprehensive evaluation (all metrics)
full_evaluator = RadEval(
do_radgraph=True,
do_green=True,
do_bleu=True,
do_rouge=True,
do_bertscore=True,
do_srr_bert=True,
do_chexbert=True,
do_temporal=True,
do_ratescore=True,
do_radcliq=True,
do_radeval_bertsore=True,
do_details=False # Optional: return detailed metric breakdowns
)
```
## 📁 File Format Suggestion
To ensure efficient evaluation, we recommend formatting your data in one of the following ways:
### 📄 Text Files (.tok, .txt)
Each line contains one report
```
No acute cardiopulmonary process.
Mild cardiomegaly noted.
Normal chest radiograph.
```
Use two separate files:
> - `ground_truth.tok` – reference reports
> - `model_predictions.tok` – generated reports
### 🧾 JSON Files
```json
{
"references": [
"No acute cardiopulmonary process.",
"Mild cardiomegaly noted."
],
"hypotheses": [
"Normal chest X-ray.",
"Enlarged heart observed."
]
}
```
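If your data follows this JSON layout, a minimal loading sketch might look like the following (the file name `reports.json` and the key names simply mirror the example above):

```python
import json

from RadEval import RadEval

# Load references and hypotheses from the JSON layout shown above
with open("reports.json", "r") as f:
    data = json.load(f)

evaluator = RadEval(do_bleu=True, do_rouge=True)
results = evaluator(refs=data["references"], hyps=data["hypotheses"])
print(json.dumps(results, indent=2))
```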
### 🐍 Python Lists
```python
refs = ["Report 1", "Report 2"]
hyps = ["Generated 1", "Generated 2"]
```
> [!TIP]
> File-based input is recommended for batch evaluation and reproducibility in research workflows.
## 🧪 Hypothesis Testing (Significance Evaluation)
RadEval supports **paired significance testing** to statistically compare different radiology report generation systems using **Approximate Randomization (AR)**.
This allows you to determine whether an observed improvement in metric scores is **statistically significant**, rather than due to chance.
### 📌 Key Features
- **Paired comparison** of any number of systems against a baseline
- **Statistical rigor** using Approximate Randomization (AR) testing
- **All built-in metrics** supported (BLEU, ROUGE, BERTScore, RadGraph, CheXbert, etc.)
- **Custom metrics** integration for domain-specific evaluation
- **P-values** and significance markers (`*`) for easy interpretation
### 🧮 Statistical Background
The hypothesis testing uses **Approximate Randomization** to determine if observed metric differences are statistically significant:
1. **Null Hypothesis (H₀)**: The two systems perform equally well
2. **Test Statistic**: Difference in metric scores between systems
3. **Randomization**: Shuffle system assignments and recalculate differences
4. **P-value**: Proportion of random shuffles with differences ≥ the observed difference
5. **Significance**: If p < 0.05, reject H₀ (systems are significantly different); see the sketch below
> [!NOTE]
> **Why AR testing?**
> Unlike parametric tests, AR makes no assumptions about score distributions, making it ideal for evaluation metrics that may not follow normal distributions.
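The sketch below illustrates the AR procedure for a single paired metric. It is a simplified, standalone illustration of the idea rather than RadEval's internal implementation; the function name `ar_test` and its defaults are ours.

```python
import random

def ar_test(hyps_a, hyps_b, refs, metric, n_samples=1000, seed=12345):
    """Illustrative Approximate Randomization test for one paired metric."""
    rng = random.Random(seed)
    observed = abs(metric(hyps_a, refs) - metric(hyps_b, refs))
    count = 0
    for _ in range(n_samples):
        # Randomly swap the two systems' outputs for each report
        shuf_a, shuf_b = [], []
        for a, b in zip(hyps_a, hyps_b):
            if rng.random() < 0.5:
                shuf_a.append(b)
                shuf_b.append(a)
            else:
                shuf_a.append(a)
                shuf_b.append(b)
        # Count shuffles whose score gap is at least as large as the observed one
        if abs(metric(shuf_a, refs) - metric(shuf_b, refs)) >= observed:
            count += 1
    return count / n_samples  # p-value
```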
### 👀 Understanding the Results
**Interpreting P-values:**
- **p < 0.05**: Statistically significant difference (marked with `*`)
- **p ≥ 0.05**: No significant evidence of difference
- **Lower p-values**: Stronger evidence of real differences
**Practical Significance:**
- Look for consistent improvements across multiple metrics
- Consider domain relevance (e.g., RadGraph for clinical accuracy)
- Balance statistical and clinical significance
### 🖇️ Example: Compare RadEval Default Metrics and a Custom Metric
#### Step 1: Initialize packages and dataset
```python
from RadEval import RadEval, compare_systems
# Reference ground truth reports
references = [
"No acute cardiopulmonary process.",
"No radiographic findings to suggest pneumonia.",
"Mild cardiomegaly with clear lung fields.",
"Small pleural effusion on the right side.",
"Status post cardiac surgery with stable appearance.",
]
# Three systems: baseline, improved, and poor
systems = {
'baseline': [
"No acute findings.",
"No pneumonia.",
"Mild cardiomegaly, clear lungs.",
"Small right pleural effusion.",
"Post-cardiac surgery, stable."
],
'improved': [
"No acute cardiopulmonary process.",
"No radiographic findings suggesting pneumonia.",
"Mild cardiomegaly with clear lung fields bilaterally.",
"Small pleural effusion present on the right side.",
"Status post cardiac surgery with stable appearance."
],
'poor': [
"Normal.",
"OK.",
"Heart big.",
"Some fluid.",
"Surgery done."
]
}
```
#### Step 2: Define Evaluation Metrics and Parameters
We define each evaluation metric using a dedicated RadEval instance (configured to compute one specific score), and also include a simple custom metric: average word count. All metrics are wrapped into a unified `metrics` dictionary for flexible evaluation and comparison.
```python
# Initialise each evaluator with the corresponding metric
bleu_evaluator = RadEval(do_bleu=True)
rouge_evaluator = RadEval(do_rouge=True)
bertscore_evaluator = RadEval(do_bertscore=True)
radgraph_evaluator = RadEval(do_radgraph=True)
chexbert_evaluator = RadEval(do_chexbert=True)
# Define a custom metric: average word count of generated reports
def word_count_metric(hyps, refs):
return sum(len(report.split()) for report in hyps) / len(hyps)
# Wrap metrics into a unified dictionary of callables
metrics = {
'bleu': lambda hyps, refs: bleu_evaluator(refs, hyps)['bleu'],
'rouge1': lambda hyps, refs: rouge_evaluator(refs, hyps)['rouge1'],
'rouge2': lambda hyps, refs: rouge_evaluator(refs, hyps)['rouge2'],
'rougeL': lambda hyps, refs: rouge_evaluator(refs, hyps)['rougeL'],
'bertscore': lambda hyps, refs: bertscore_evaluator(refs, hyps)['bertscore'],
'radgraph': lambda hyps, refs: radgraph_evaluator(refs, hyps)['radgraph_partial'],
'chexbert': lambda hyps, refs: chexbert_evaluator(refs, hyps)['chexbert-5_macro avg_f1-score'],
'word_count': word_count_metric # ← example of a simple custom-defined metric
}
```
> [!TIP]
> - Each metric function takes (hyps, refs) as input and returns a single float score.
> - This modular design allows you to flexibly plug in or remove metrics without changing the core logic of RadEval or compare_systems.
> - For advanced use cases, you may define your own `RadEval(do_xxx=True)` variant or custom metrics and include them seamlessly here.
#### Step 3: Run Significance Testing
Use `compare_systems` to evaluate all defined systems against the reference reports using the metrics specified above. This step performs randomization-based significance testing to assess whether differences between systems are statistically meaningful.
```python
print("Running significance tests...")
signatures, scores = compare_systems(
systems=systems,
metrics=metrics,
references=references,
n_samples=50, # Number of randomization samples
significance_level=0.05, # Alpha level for significance testing
print_results=True # Print formatted results table
)
```
<details>
<summary> Output </summary>
<pre lang="md">
Running tests...
================================================================================
PAIRED SIGNIFICANCE TEST RESULTS
================================================================================
System                         bleu       rouge1       rouge2       rougeL    bertscore     radgraph     chexbert   word_count
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Baseline: baseline           0.0000       0.6652       0.3133       0.6288       0.6881       0.5538       1.0000       3.2000
----------------------------------------------------------------------------------------------------------------------------------------------------------------
improved                     0.6874       0.9531       0.8690       0.9531       0.9642       0.9818       1.0000       6.2000
                        (p=0.0000)*   (p=0.0800)   (p=0.1200)   (p=0.0600)  (p=0.0400)*   (p=0.1200)   (p=1.0000)   (p=0.0600)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
poor                         0.0000       0.0444       0.0000       0.0444       0.1276       0.0000       0.8000       1.6000
                         (p=0.4000)  (p=0.0400)*   (p=0.0600)   (p=0.1200)  (p=0.0400)*  (p=0.0200)*   (p=1.0000)  (p=0.0400)*
----------------------------------------------------------------------------------------------------------------------------------------------------------------
- Significance level: 0.05
- '*' indicates significant difference (p < significance level)
- Null hypothesis: systems are essentially the same
- Significant results suggest systems are meaningfully different
METRIC SIGNATURES:
- bleu: bleu|ar:50|seed:12345
- rouge1: rouge1|ar:50|seed:12345
- rouge2: rouge2|ar:50|seed:12345
- rougeL: rougeL|ar:50|seed:12345
- bertscore: bertscore|ar:50|seed:12345
- radgraph: radgraph|ar:50|seed:12345
- chexbert: chexbert|ar:50|seed:12345
- word_count: word_count|ar:50|seed:12345
</pre>
</details>
> [!TIP]
> - The output includes mean scores for each metric and system, along with p-values comparing each system to the baseline.
> - Statistically significant improvements (or declines) are marked with an asterisk `*` if p < 0.05.
> - `signatures` stores each metric configuration (e.g. random seed, sample size), and `scores` contains raw score values per system for further analysis or plotting.
#### Step 4: Summarise Significant Findings
```python
# Significance testing
print("\nSignificant differences (p < 0.05):")
baseline_name = list(systems.keys())[0] # Assume first one is the baseline
for system_name in systems.keys():
if system_name == baseline_name:
continue
significant_metrics = []
for metric_name in metrics.keys():
pvalue_key = f"{metric_name}_pvalue"
if pvalue_key in scores[system_name]:
p_val = scores[system_name][pvalue_key]
if p_val < 0.05:
significant_metrics.append(metric_name)
if significant_metrics:
print(f" {system_name} vs {baseline_name}: {', '.join(significant_metrics)}")
else:
print(f" {system_name} vs {baseline_name}: No significant differences")
```
<details>
<summary> Output </summary>
<pre lang="md">
Significant differences (p < 0.05):
improved vs baseline: bleu, bertscore
poor vs baseline: rouge1, bertscore, radgraph, word_count
</pre>
</details>
> [!TIP]
> This makes it easy to:
> - Verify whether model improvements are meaningful
> - Test new metrics or design your own
> - Report statistically sound results in your paper
## 🧠 RadEval Expert Dataset
To support reliable benchmarking, we introduce the **RadEval Expert Dataset**, a carefully curated evaluation set annotated by board-certified radiologists. This dataset consists of realistic radiology reports and challenging model generations, enabling nuanced evaluation across clinical accuracy, temporal consistency, and language quality. It serves as a gold standard to validate automatic metrics and model performance under expert review.
## 🚦 Performance Tips
1. **Start Small**: Test with a few examples before full evaluation
2. **Select Metrics**: Only enable metrics you actually need
3. **Batch Processing**: Process large datasets in smaller chunks (see the sketch below)
4. **GPU Usage**: Ensure CUDA is available for faster computation
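For tip 3, one simple way to split a large evaluation into chunks is sketched below; `evaluate_in_chunks`, the chunk size, and the metric choice are illustrative, and the per-chunk results still need to be aggregated afterwards (e.g. averaged, weighted by chunk size).

```python
from RadEval import RadEval

def evaluate_in_chunks(refs, hyps, chunk_size=500):
    """Evaluate a large corpus in smaller chunks to limit memory pressure."""
    evaluator = RadEval(do_bleu=True, do_rouge=True)  # enable only what you need
    chunk_results = []
    for start in range(0, len(refs), chunk_size):
        end = start + chunk_size
        chunk_results.append(evaluator(refs=refs[start:end], hyps=hyps[start:end]))
    return chunk_results
```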
## 📚 Citation
If you use RadEval in your research, please cite:
```BibTeX
@software{radeval2025,
author = {Jean-Benoit Delbrouck and Justin Xu and Xi Zhang},
title = {RadEval: A framework for radiology text evaluation},
year = {2025},
publisher = {GitHub},
howpublished = {\url{https://github.com/jbdel/RadEval}},
}
```
### 📦 Codebase Contributors
<table>
<tbody>
<tr>
<td align="center">
<a href="https://jbdel.github.io/">
<img src="https://aimi.stanford.edu/sites/g/files/sbiybj20451/files/styles/medium_square/public/media/image/image5_0.png?h=f4e62a0a&itok=euaj9VoF"
width="100" height="100"
style="object-fit: cover; border-radius: 20%;" alt="Jean-Benoit Delbrouck"/>
<br />
<sub><b>Jean-Benoit Delbrouck</b></sub>
</a>
</td>
<td align="center">
<a href="https://justin13601.github.io/">
<img src="https://justin13601.github.io/images/pfp2.JPG"
width="100" height="100"
style="object-fit: cover; border-radius: 20%;" alt="Justin Xu"/>
<br />
<sub><b>Justin Xu</b></sub>
</a>
</td>
<td align="center">
<a href="https://x-izhang.github.io/">
<img src="https://x-izhang.github.io/author/xi-zhang/avatar_hu13660783057866068725.jpg"
width="100" height="100"
style="object-fit: cover; border-radius: 20%;" alt="Xi Zhang"/>
<br />
<sub><b>Xi Zhang</b></sub>
</a>
</td>
</tr>
</tbody>
</table>
## 🙏 Acknowledgments
This project would not be possible without the foundational work of the radiology AI community.
We extend our gratitude to the authors and maintainers of the following open-source projects and metrics:
- 🧠 **CheXbert**, **RadGraph**, and **CheXpert** from Stanford AIMI for their powerful labelers and benchmarks.
- 📐 **BERTScore** and **BLEU/ROUGE** for general-purpose NLP evaluation.
- 🏥 **RadCliQ** and **RaTE Score** for clinically grounded evaluation of radiology reports.
- 🧪 **SRR-BERT** for structured report understanding in radiology.
- 🔍 Researchers contributing to temporal and factual consistency metrics in medical imaging.
Special thanks to:
- All contributors to open datasets such as **MIMIC-CXR**, which make reproducible research possible.
- Our collaborators for their support and inspiration throughout development.
We aim to build on these contributions and promote accessible, fair, and robust evaluation of AI-generated radiology text.
---
<div align="center">
<p>⭐ If you find RadEval useful, please give us a star! ⭐</p>
<p>Made with ❤️ for the radiology AI research community</p>
</div>