<p align="center">
<a href="https://github.com/docling-project/docling-sdg">
<img loading="lazy" alt="Docling" src="https://github.com/docling-project/docling-sdg/raw/main/docs/assets/docling-sdg-pic.png" width="40%"/>
</a>
</p>
# Docling SDG
Docling for Synthetic Data Generation (SDG) provides a set of tools to create artificial data from documents, leveraging generative AI and Docling's parsing capabilities.
## Features
* 🧬 Generation of question-answering pairs from passages of [multiple document formats][supported_formats], including PDF, HTML, and DOCX, leveraging Docling's parsing capabilities
* ⚖️ LLM-as-a-judge evaluation for high-quality question-answering pairs
* 💻 Simple and convenient CLI
### Coming soon
* 📝 Integrations with Llama Stack and vLLM
* 📝 SDG on tabular data
* 📝 Documentation
## Installation
To use Docling SDG, simply install `docling-sdg` with your package manager of choice, e.g., pip:
```bash
pip install docling-sdg
```
Alternatively, you can clone this repository and use [uv](https://docs.astral.sh/uv) to
create a virtual environment, install the packages, and run the project commands.
```bash
git clone git@github.com:docling-project/docling-sdg.git
cd docling-sdg
uv sync
```
## Getting started
You can create synthetically generated questions and answers from relevant parts of one or several documents.
These question-answer pairs can be used in AI applications, such as evaluating a RAG application or generating
ground truth to train a language model.
### Sample
Generating and judging data with LLMs can be computationally intensive. Since document collections may be large,
you may want to chunk the documents into passages, filter them based on length and content criteria, and sample
a subset of them to obtain a manageable dataset.
```python
from docling_sdg.qa.sample import PassageSampler

# Chunk the source document into passages, filter them, and sample a subset.
source = "https://en.wikipedia.org/wiki/Duck"
passage_sampler = PassageSampler()
print(passage_sampler.sample(source))
```
By default, the results will be exported to the file `docling_sdg_sample.jsonl`. Every line represents a document passage.
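Since the export is plain JSONL, it can be inspected with standard tooling. Here is a minimal sketch (the field names are whatever the sampler writes; none are assumed here):

```python
import json
from pathlib import Path

# Inspect the sampled passages: each line of the JSONL export is one passage record.
with Path("docling_sdg_sample.jsonl").open(encoding="utf-8") as fp:
    first_passage = json.loads(next(fp))

# Show which fields are available in a passage record.
print(sorted(first_passage.keys()))
```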
### Generate
For each passage created in the previous step, we can prompt an LLM and generate three questions of the following
types: _simple fact_, _summary_, and _reasoning_.
The `GenerateOptions` class controls which model provider is used for Q&A generation by setting the `provider` attribute, as shown below. Three options are available:

* `LlmProvider.WATSONX` for [watsonx.ai](https://www.ibm.com/products/watsonx-ai); you will need to provide a watsonx.ai project ID and an API key
* `LlmProvider.OPENAI` for OpenAI; you will need to provide an OpenAI API key
* `LlmProvider.OPENAI_LIKE` for any model provider with OpenAI-compatible APIs; if no API key is needed (such as when running against `ollama` locally), set `api_key` to any string, e.g., `"fake"` (see the sketch after the example below)
```python
import os
from pathlib import Path

from docling_sdg.qa.base import GenerateOptions, LlmProvider
from docling_sdg.qa.generate import Generator

# Configure the LLM provider (watsonx.ai in this example) from environment variables.
options = GenerateOptions(
    provider=LlmProvider.WATSONX,
    project_id=os.environ.get("WATSONX_PROJECT_ID"),
    api_key=os.environ.get("WATSONX_APIKEY"),
    url=os.environ.get("WATSONX_URL"),
)

# Generate question-answer-context items from the sampled passages.
generator = Generator(generate_options=options)
print(generator.generate_from_sample(Path("docling_sdg_sample.jsonl")))
```
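For the `LlmProvider.OPENAI_LIKE` case, the configuration mirrors the example above. The following is a hedged sketch rather than a verified configuration: the endpoint URL and the placeholder key are assumptions, and additional options (such as a model name) may be required depending on your server.

```python
import os
from pathlib import Path

from docling_sdg.qa.base import GenerateOptions, LlmProvider
from docling_sdg.qa.generate import Generator

# Hypothetical local setup: an OpenAI-compatible server (e.g. Ollama) that needs no real key.
# The default endpoint URL below is an assumption; point it at your own server.
options = GenerateOptions(
    provider=LlmProvider.OPENAI_LIKE,
    api_key="fake",  # any string works when the server does not check API keys
    url=os.environ.get("OPENAI_LIKE_URL", "http://localhost:11434/v1"),
)

generator = Generator(generate_options=options)
print(generator.generate_from_sample(Path("docling_sdg_sample.jsonl")))
```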
By default, the results will be exported to the file `docling_sdg_generated_qac.jsonl`. Every line represents a generated
question-answer-context item with additional information like the question type.
### Critique
Some applications require a minimum level of quality in the generated data. The last step consists of using an LLM to judge
the generated data and provide both qualitative and quantitative evaluations of the question-answer-context items. Using
those evaluations, we can filter the generated dataset down to the required quality level.
```python
import os
from pathlib import Path

from docling_sdg.qa.base import CritiqueOptions, LlmProvider
from docling_sdg.qa.critique import Judge

# Configure the judging LLM (watsonx.ai in this example) from environment variables.
options = CritiqueOptions(
    provider=LlmProvider.WATSONX,
    project_id=os.environ.get("WATSONX_PROJECT_ID"),
    api_key=os.environ.get("WATSONX_APIKEY"),
    url=os.environ.get("WATSONX_URL"),
)

# Evaluate each generated question-answer-context item.
judge = Judge(critique_options=options)
print(judge.critique(Path("docling_sdg_generated_qac.jsonl")))
```
By default, the results will be exported to the file `docling_sdg_critiqued_qac.jsonl`. The file content is similar to
the one created in the [Generate](#generate) step, but it additionally contains the critique evaluation along several dimensions, such as
_question-to-context groundedness_, _question feasibility_, or _context usefulness_.
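Because the critiqued items are also plain JSONL, you can load them and filter on the evaluation fields. Here is a minimal inspection sketch (the names of the evaluation fields are not assumed; the snippet only lists the keys of the first record so you can decide how to filter):

```python
import json
from pathlib import Path

# Load the first critiqued question-answer-context item and list its fields,
# including the critique evaluations added by the Judge.
with Path("docling_sdg_critiqued_qac.jsonl").open(encoding="utf-8") as fp:
    first_item = json.loads(next(fp))

print(sorted(first_item.keys()))
```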
## CLI
Docling SDG has a built-in CLI to run the three steps of question-answering data generation.
```bash
docling-sdg qa sample https://en.wikipedia.org/wiki/Duck
docling-sdg qa generate docling_sdg_sample.jsonl
docling-sdg qa critique docling_sdg_generated_qac.jsonl
```
Use the `--help` argument to find out more about optional parameters. For instance:
```bash
docling-sdg qa generate --help
```
## Get help and support
Please feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).
## Technical report
For more details on Docling SDG's inner workings, check out the paper [Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems](https://aclanthology.org/2025.coling-industry.4.pdf), as well as the [Docling Technical Report](https://arxiv.org/abs/2408.09869).
## Contributing
Please read [Contributing to Docling SDG](https://github.com/docling-project/docling-sdg/blob/main/CONTRIBUTING.md) for details.
## References
If you use Docling SDG in your projects, please consider citing the following:
```bibtex
@inproceedings{teixeira-de-lima-etal-2025-know,
  title={Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems},
  author={Rafael Teixeira de Lima and Shubham Gupta and Cesar Berrospi and Lokesh Mishra and Michele Dolfi and Peter Staar and Panagiotis Vagenas},
  year={2025},
  month={jan},
  booktitle={Proceedings of the 31st International Conference on Computational Linguistics: Industry Track},
  publisher={Association for Computational Linguistics},
  url={https://aclanthology.org/2025.coling-industry.4/}
}
```
## License
The Docling SDG codebase is licensed under the MIT License.
For individual model usage, please refer to the model licenses found in the original packages.
## LF AI & Data
Docling is hosted as a project in the [LF AI & Data Foundation](https://lfaidata.foundation/projects/).
### IBM ❤️ Open Source AI
The project was started by the AI for knowledge team at IBM Research Zurich.