parxyval

Name	parxyval
Version	0.1.0
home_page	None
Summary	An evaluation framework for document parsing.
upload_time	2025-10-06 10:46:31
maintainer	None
docs_url	None
author	Alessio Vertemati
requires_python	<4.0,>=3.12
license	None
keywords	parxy evaluation evaluation-framework convert document document-parsing parser-benchmark pdf docx html markdown layout model rag llm document-performance-monitoring document-ai text-extraction data-quality rag parsing-tools

![pypi](https://img.shields.io/pypi/v/parxyval.svg)
![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json) [![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)

# Parxyval


Parxyval – The Developer's Knight of the Parsing Table.

An evaluation framework for document parsing, inspired by the quest for the Holy Grail. 🏰⚔️

In a world of imperfect parsers, Parxyval helps you measure, compare, and discover which tool truly preserves the meaning of your documents. Benchmark precision, recall, structure, and reliability across multiple parsing services — and let your pipeline find its Grail.


**Requirements**

- Python 3.12 or above.
- A Hugging Face account for downloading datasets that require a login.


**Next steps**

- [Getting started](#getting-started)
  - [Available commands](#command-reference)
- [Benchmarks](#benchmarks)
- [Supported evaluations](#evaluations)



## Getting Started

Parxyval is an evaluation framework for unstructured document processing. It offers a CLI to benchmark PDF parsing solutions.

Parxyval is available on PyPI. You can try it using `uv`:

```bash
uvx parxyval --help
```

Follow these steps to get started:



1. **Download the Dataset**
```bash
# Download sample documents from the DocLayNet dataset
parxyval download --limit 100 --include-pdf
```

The ground truth is stored in `./data/doclaynet/json`, while PDF files are stored in `./data/doclaynet/pdf`.


2. **Parse Documents**
```bash
# Parse PDFs using your chosen driver (default: pymupdf)
parxyval parse --driver pymupdf

# You can customize the input and output locations:
# --input data/doclaynet/pdf --output data/doclaynet/processed
```

Parxyval supports all drivers available in [Parxy](https://github.com/OneOffTech/parxy).

PDF files are read from `./data/doclaynet/pdf` and the parser output is written to `./data/doclaynet/processed/{driver}`, e.g. `./data/doclaynet/processed/pymupdf`.

3. **Evaluate Results**

```bash
# Run evaluation with selected metrics
parxyval evaluate --metric sequence_matcher --metric bleu_score --input ./data/doclaynet/processed/pymupdf
```
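
The three steps above can also be scripted. The following is a minimal sketch, not part of Parxyval: `pipeline_commands` is a hypothetical helper that only assembles the command lines from the flags documented in this README, so you can inspect them before running anything. It assumes the default folder layout under `data/doclaynet` and leaves the positional `driver` argument of `parxyval evaluate` at its default.

```python
import shlex
from pathlib import Path


def pipeline_commands(driver: str = "pymupdf", limit: int = 100,
                      root: str = "data/doclaynet") -> list[list[str]]:
    """Build argv lists for the download, parse, and evaluate steps.

    Hypothetical helper: it only uses flags listed in this README.
    """
    base = Path(root)
    processed = base / "processed"
    return [
        # Step 1: download ground truth and PDFs
        ["parxyval", "download", "--limit", str(limit),
         "--output", base.as_posix(), "--include-pdf"],
        # Step 2: parse the PDFs with the chosen driver
        ["parxyval", "parse", "--driver", driver, "--limit", str(limit),
         "--input", (base / "pdf").as_posix(),
         "--output", processed.as_posix()],
        # Step 3: evaluate the output written to processed/{driver}
        # (the positional `driver` argument is not passed here)
        ["parxyval", "evaluate",
         "--metric", "sequence_matcher", "--metric", "bleu_score",
         "--input", (processed / driver).as_posix()],
    ]


if __name__ == "__main__":
    for argv in pipeline_commands():
        # Print each command; to execute it instead (requires parxyval
        # on PATH), use subprocess.run(argv, check=True).
        print(shlex.join(argv))
```

Building the commands as lists keeps paths with spaces safe if you later hand them to `subprocess.run`.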

### Command Reference

#### `parxyval download`
Download documents from the DocLayNet dataset.

Options:
- `--limit, -l`: Number of entries to download (default: 100)
- `--skip, -s`: Skip specified number of entries
- `--output, -o`: Output folder path (default: data/doclaynet)
- `--include-pdf`: Download PDF files (default: False)

#### `parxyval parse`
Parse PDF documents using the specified driver.

Options:
- `--driver, -d`: Parser driver to use (default: pymupdf)
- `--limit, -l`: Maximum documents to process (default: 100)
- `--skip, -s`: Skip specified number of documents
- `--input, -i`: Input folder with PDFs (default: data/doclaynet/pdf)
- `--output, -o`: Output folder for results (default: data/doclaynet/processed)

#### `parxyval evaluate`
Evaluate parsing results against ground truth.

Arguments:
- `driver`: Parser driver to evaluate (default: pymupdf)

Options:
- `--metric, -m`: Metrics to use (can be specified multiple times)
- `--golden, -g`: Ground truth folder (default: data/doclaynet/json)
- `--input, -i`: Parsed documents folder (default: data/doclaynet/processed/pymupdf)
- `--output, -o`: Results output folder (default: data/doclaynet/results)


## Benchmarks

Parxyval supports various benchmarks for the evaluation of document processing services.

- [DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet-v1.2): Evaluate text and layout using the DocLayNet v1.2 dataset.


_Datasets we are evaluating to support:_

- [DP-Bench: Document Parsing Benchmark](https://huggingface.co/datasets/upstage/dp-bench)
- [OmniDocBench](https://huggingface.co/datasets/opendatalab/OmniDocBench)

## Evaluations

Parxyval provides a comprehensive suite of text evaluation metrics to assess the quality of PDF parsing results. Each metric focuses on different aspects of text similarity and accuracy:

### Text Similarity Metrics

- **Sequence Matcher**: Measures the similarity between two texts using Python's difflib sequence matcher. Ideal for detecting overall textual similarities and differences.

- **Jaccard Similarity**: Computes the similarity between page contents by measuring the intersection over union of their token sets. Perfect for assessing vocabulary overlap between parsed and reference texts.

- **Edit Distance**: Calculates the normalized Levenshtein distance between texts, measuring the minimum number of single-character edits required to change one text into another. Useful for identifying character-level parsing accuracy.
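
To make the three text similarity metrics concrete, here is a minimal standard-library sketch. It is an illustration, not Parxyval's implementation: the function names are hypothetical, tokens are assumed to be whitespace-separated words, and the edit distance is assumed to be normalized by the length of the longer text. Parxyval may tokenize and normalize differently.

```python
from difflib import SequenceMatcher


def sequence_ratio(parsed: str, reference: str) -> float:
    """Similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, parsed, reference).ratio()


def jaccard(parsed: str, reference: str) -> float:
    """Intersection over union of the two token sets."""
    a, b = set(parsed.split()), set(reference.split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(parsed: str, reference: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(parsed), len(reference))
    return levenshtein(parsed, reference) / longest if longest else 0.0
```

Note that the first two functions return a similarity (higher is better), while the normalized edit distance is a distance (lower is better).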

### Natural Language Processing Metrics

- **BLEU Score**: A precision-based metric that compares n-grams between the parsed and reference texts. Particularly effective for evaluating the preservation of word sequences and phrases.

- **METEOR Score**: An advanced metric that considers stemming, synonymy, and paraphrasing. Provides a more nuanced evaluation of semantic similarity between parsed and reference texts.
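
The n-gram comparison behind BLEU can be sketched as follows. This is a deliberately simplified illustration, not the full BLEU score and not Parxyval's code: it computes the clipped precision for a single n-gram order and omits the brevity penalty and the averaging over several orders. The function names and the whitespace tokenization are assumptions.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(parsed: str, reference: str, n: int = 2) -> float:
    """Share of parsed n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference, so repeating a correct phrase does not inflate the score.
    """
    cand = ngrams(parsed.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0  # parsed text is shorter than n tokens
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total
```

For example, comparing `"the cat sat on the mat"` against the reference `"the cat sat on a mat"` with `n=2` matches three of the five parsed bigrams.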

### Information Retrieval Metrics

- **Precision**: Measures the accuracy of the parsed text by calculating the proportion of correctly parsed tokens relative to all tokens in the parsed text.

- **Recall**: Evaluates completeness by calculating the proportion of reference tokens that were correctly captured in the parsed text.

- **F1 Score**: The harmonic mean of precision and recall, providing a balanced measure of parsing accuracy.
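
As an illustration of these three definitions, the sketch below counts overlapping tokens as multisets. It is not Parxyval's implementation: `token_prf` is a hypothetical name, and whitespace tokenization is an assumption.

```python
from collections import Counter


def token_prf(parsed: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over whitespace-separated tokens."""
    parsed_tokens = Counter(parsed.split())
    reference_tokens = Counter(reference.split())
    # Multiset intersection: a token counts as correct at most as many
    # times as it appears in the reference.
    overlap = sum((parsed_tokens & reference_tokens).values())
    precision = overlap / sum(parsed_tokens.values()) if parsed_tokens else 0.0
    recall = overlap / sum(reference_tokens.values()) if reference_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With parsed text `"a b c d"` and reference `"a b e"`, two tokens overlap, giving a precision of 2/4, a recall of 2/3, and an F1 of 4/7.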

All metrics are computed page by page and then averaged across the entire document, so each score reflects per-page parsing quality as well as the document as a whole.
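
The page-wise averaging can be sketched like this. It is a minimal illustration under stated assumptions: `document_score` is a hypothetical helper, pages are plain strings aligned by index, and a mismatch in page count raises an error. How Parxyval actually handles missing or extra pages is not described here.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable


def document_score(parsed_pages: list[str], reference_pages: list[str],
                   metric: Callable[[str, str], float]) -> float:
    """Apply a metric to each page pair and average over the document."""
    if len(parsed_pages) != len(reference_pages):
        # Assumption: pages must align one-to-one.
        raise ValueError("parsed and reference page counts differ")
    if not parsed_pages:
        return 0.0
    return mean(metric(p, r) for p, r in zip(parsed_pages, reference_pages))


def sequence_ratio(parsed: str, reference: str) -> float:
    """Example page-level metric based on difflib."""
    return SequenceMatcher(None, parsed, reference).ratio()


if __name__ == "__main__":
    parsed = ["Title page", "Body text with a typo"]
    reference = ["Title page", "Body text without a typo"]
    print(document_score(parsed, reference, sequence_ratio))
```

Passing the metric as a function keeps the averaging logic independent of which metric is being evaluated.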


## Security Vulnerabilities

Please review our [security policy](./.github/SECURITY.md) on how to report security vulnerabilities.


## Supporters

The project is provided and supported by OneOff-Tech (UG) and Alessio Vertemati.

<p align="left"><a href="https://oneofftech.de" target="_blank"><img src="https://raw.githubusercontent.com/OneOffTech/.github/main/art/oneofftech-logo.svg" width="200"></a></p>


## Licence and Copyright

Parxyval is licensed under the [MIT licence](./LICENCE).

- Copyright (c) 2025-present Alessio Vertemati, @avvertix
- Copyright (c) 2025-present Oneoff-tech UG, www.oneofftech.de
- All contributors

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "parxyval",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.12",
    "maintainer_email": null,
    "keywords": "parxy, evaluation, evaluation-framework, convert, document, document-parsing, parser-benchmark, pdf, docx, html, markdown, layout model, rag, llm, document-performance-monitoring, document-ai, text-extraction, data-quality, rag, parsing-tools",
    "author": "Alessio Vertemati",
    "author_email": "Alessio Vertemati <alessio@oneofftech.xyz>",
    "download_url": "https://files.pythonhosted.org/packages/05/db/2e0f5532765661bd360a5a36394f70f823d261eb01b20ba800f666c97fd2/parxyval-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "An evaluation framework for document parsing.",
    "version": "0.1.0",
    "project_urls": {
        "homepage": "https://github.com/OneOffTech/parxyval",
        "issues": "https://github.com/OneOffTech/parxyval/issues",
        "repository": "https://github.com/OneOffTech/parxyval"
    },
    "split_keywords": [
        "parxy",
        " evaluation",
        " evaluation-framework",
        " convert",
        " document",
        " document-parsing",
        " parser-benchmark",
        " pdf",
        " docx",
        " html",
        " markdown",
        " layout model",
        " rag",
        " llm",
        " document-performance-monitoring",
        " document-ai",
        " text-extraction",
        " data-quality",
        " rag",
        " parsing-tools"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d40f953bcdf19f1e438dabd5c8c5eebeaeb1c33d25780b6a62d3a52bdb5f0a8d",
                "md5": "acf8eff96941a0bca2ac752f2a104dd1",
                "sha256": "baf0a25f5767401f60d74e54aaeaabe2f0cd7015eb09c3c10a110dba8b1f961a"
            },
            "downloads": -1,
            "filename": "parxyval-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "acf8eff96941a0bca2ac752f2a104dd1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.12",
            "size": 20777,
            "upload_time": "2025-10-06T10:46:30",
            "upload_time_iso_8601": "2025-10-06T10:46:30.347875Z",
            "url": "https://files.pythonhosted.org/packages/d4/0f/953bcdf19f1e438dabd5c8c5eebeaeb1c33d25780b6a62d3a52bdb5f0a8d/parxyval-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "05db2e0f5532765661bd360a5a36394f70f823d261eb01b20ba800f666c97fd2",
                "md5": "4a129aab62e46636bc9a644b650df159",
                "sha256": "103cb5cf921d66f6db3400935e58217923b175a19e308e4c3d93767cb8fe3b05"
            },
            "downloads": -1,
            "filename": "parxyval-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "4a129aab62e46636bc9a644b650df159",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.12",
            "size": 14826,
            "upload_time": "2025-10-06T10:46:31",
            "upload_time_iso_8601": "2025-10-06T10:46:31.491638Z",
            "url": "https://files.pythonhosted.org/packages/05/db/2e0f5532765661bd360a5a36394f70f823d261eb01b20ba800f666c97fd2/parxyval-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-06 10:46:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "OneOffTech",
    "github_project": "parxyval",
    "github_not_found": true,
    "lcname": "parxyval"
}
