> [!IMPORTANT]
> # 👉 Now part of [Docling](https://github.com/DS4SD/docling)!
<p align="center">
<a href="https://github.com/DS4SD/quackling">
<img loading="lazy" alt="Quackling" src="https://raw.githubusercontent.com/DS4SD/quackling/main/resources/logo.jpeg" width="150" />
</a>
</p>
# Quackling
[![PyPI version](https://img.shields.io/pypi/v/quackling)](https://pypi.org/project/quackling/)
![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/DS4SD/quackling)](https://opensource.org/licenses/MIT)
Easily build document-native generative AI applications, such as RAG, leveraging [Docling](https://github.com/DS4SD/docling)'s efficient PDF extraction and rich data model — while still using your favorite framework, [🦙 LlamaIndex](https://docs.llamaindex.ai/en/stable/) or [🦜🔗 LangChain](https://python.langchain.com/).
## Features
- 🧠 Enables rich gen AI applications by providing capabilities at the native document level — not just plain text / Markdown!
- ⚡️ Leverages Docling's conversion quality and speed.
- ⚙️ Plug-and-play integration with LlamaIndex and LangChain for building powerful applications like RAG.
<p align="center">
<a href="https://raw.githubusercontent.com/DS4SD/quackling/main/resources/doc_native_rag.png">
<img loading="lazy" alt="Doc-native RAG" src="https://raw.githubusercontent.com/DS4SD/quackling/main/resources/doc_native_rag.png" width="350" />
</a>
</p>
## Installation
To use Quackling, simply install the `quackling` package with your package manager, e.g. pip:
```sh
pip install quackling
```
## Usage
Quackling offers core capabilities (`quackling.core`) as well as framework integration components (`quackling.llama_index` and `quackling.langchain`). Examples of both are shown below.
### Basic RAG
Here is a basic RAG pipeline using LlamaIndex:
> [!NOTE]
> To run the example as is, first `pip install llama-index-embeddings-huggingface llama-index-llms-huggingface-api`
> in addition to `quackling`.
> Otherwise, you can set `EMBED_MODEL` & `LLM` as desired, e.g. using
> [local models](https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local).
```python
import os
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
from quackling.llama_index.node_parsers import HierarchicalJSONNodeParser
from quackling.llama_index.readers import DoclingPDFReader
DOCS = ["https://arxiv.org/pdf/2206.01062"]
QUESTION = "How many pages were human annotated?"
EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
LLM = HuggingFaceInferenceAPI(
token=os.getenv("HF_TOKEN"),
model_name="mistralai/Mistral-7B-Instruct-v0.3",
)
index = VectorStoreIndex.from_documents(
documents=DoclingPDFReader(parse_type=DoclingPDFReader.ParseType.JSON).load_data(DOCS),
embed_model=EMBED_MODEL,
transformations=[HierarchicalJSONNodeParser()],
)
query_engine = index.as_query_engine(llm=LLM)
result = query_engine.query(QUESTION)
print(result.response)
# > 80K pages were human annotated
```
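As the note above mentions, `EMBED_MODEL` and `LLM` can be swapped for local alternatives. A minimal sketch of such a swap, assuming the `llama-index-llms-ollama` integration is installed (`pip install llama-index-llms-ollama`) and an Ollama server is running locally — the model name here is an illustrative assumption:

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Local embedding model (downloaded from the Hugging Face Hub on first use).
EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Local LLM served by Ollama; "mistral" is a placeholder for any pulled model.
LLM = Ollama(model="mistral", request_timeout=120.0)
```

The rest of the pipeline stays unchanged; only the two model objects differ.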
### Chunking
You can also use Quackling standalone in any pipeline.
For instance, to split a document into chunks based on document structure, returning pointers
to the Docling document's nodes:
```python
from docling.document_converter import DocumentConverter
from quackling.core.chunkers import HierarchicalChunker
doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2408.09869").output
chunks = list(HierarchicalChunker().chunk(doc))
# > [
# > ChunkWithMetadata(
# > path='$.main-text[4]',
# > text='Docling Technical Report\n[...]',
# > page=1,
# > bbox=[117.56, 439.85, 494.07, 482.42]
# > ),
# > [...]
# > ]
```
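The chunks above carry both text and provenance (JSON path, page, bounding box). As an illustration of consuming them in a framework-agnostic way, here is a small sketch that flattens chunks into plain `(text, metadata)` records suitable for any vector store — `Chunk` is a hypothetical stand-in mirroring the fields printed above, not the actual Quackling class:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical stand-in mirroring the fields shown in the example output.
    path: str
    text: str
    page: int
    bbox: list[float]


def to_records(chunks: list[Chunk]) -> list[tuple[str, dict]]:
    """Flatten chunks into (text, metadata) pairs for a generic vector store."""
    return [
        (c.text, {"path": c.path, "page": c.page, "bbox": c.bbox})
        for c in chunks
    ]


records = to_records([
    Chunk(
        path="$.main-text[4]",
        text="Docling Technical Report",
        page=1,
        bbox=[117.56, 439.85, 494.07, 482.42],
    ),
])
# Each record keeps the provenance metadata alongside the embeddable text.
print(records[0])
```

Keeping the document pointers in the metadata is what lets retrieval results be traced back to the exact page and region of the source PDF.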
## More examples
### LlamaIndex
- [Milvus basic RAG (dense embeddings)](examples/llama_index/basic_pipeline.ipynb)
- [Milvus hybrid RAG (dense & sparse embeddings combined e.g. via RRF) & reranker model usage](examples/llama_index/hybrid_pipeline.ipynb)
- [Milvus RAG also fetching native document metadata for search results](examples/llama_index/native_nodes.ipynb)
- [Local node transformations (e.g. embeddings)](examples/llama_index/node_transformations.ipynb)
- ...
### LangChain
- [Milvus basic RAG (dense embeddings)](examples/langchain/basic_pipeline.ipynb)
## Contributing
Please read [Contributing to Quackling](./CONTRIBUTING.md) for details.
## References
If you use Quackling in your projects, please consider citing the following:
```bib
@techreport{Docling,
author = "Deep Search Team",
month = 8,
title = "Docling Technical Report",
url = "https://arxiv.org/abs/2408.09869",
eprint = "2408.09869",
doi = "10.48550/arXiv.2408.09869",
version = "1.0.0",
year = 2024
}
```
## License
The Quackling codebase is under MIT license.
For individual component usage, please refer to the component licenses found in the original packages.