> [!IMPORTANT]
> # 👉 Now part of [Docling](https://github.com/DS4SD/docling)!
<p align="center">
<a href="https://github.com/DS4SD/quackling">
<img loading="lazy" alt="Quackling" src="https://raw.githubusercontent.com/DS4SD/quackling/main/resources/logo.jpeg" width="150" />
</a>
</p>
# Quackling
[![PyPI version](https://img.shields.io/pypi/v/quackling)](https://pypi.org/project/quackling/)
![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
[![Poetry](https://img.shields.io/endpoint?url=https://python-poetry.org/badge/v0.json)](https://python-poetry.org/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/DS4SD/quackling)](https://opensource.org/licenses/MIT)
Easily build document-native generative AI applications, such as RAG, leveraging [Docling](https://github.com/DS4SD/docling)'s efficient PDF extraction and rich data model — while still using your favorite framework, [🦙 LlamaIndex](https://docs.llamaindex.ai/en/stable/) or [🦜🔗 LangChain](https://python.langchain.com/).
## Features
- 🧠 Enables rich gen AI applications by providing capabilities at the native document level — not just plain text / Markdown!
- ⚡️ Leverages Docling's conversion quality and speed.
- ⚙️ Plug-and-play integration with LlamaIndex and LangChain for building powerful applications like RAG.
<p align="center">
<a href="https://raw.githubusercontent.com/DS4SD/quackling/main/resources/doc_native_rag.png">
<img loading="lazy" alt="Doc-native RAG" src="https://raw.githubusercontent.com/DS4SD/quackling/main/resources/doc_native_rag.png" width="350" />
</a>
</p>
## Installation
To use Quackling, simply install the `quackling` package with your package manager, e.g. pip:
```sh
pip install quackling
```
## Usage
Quackling offers core capabilities (`quackling.core`) as well as framework integration components (`quackling.llama_index` and `quackling.langchain`). Examples of both are shown below.
### Basic RAG
Here is a basic RAG pipeline using LlamaIndex:
> [!NOTE]
> To run the example as is, first `pip install llama-index-embeddings-huggingface llama-index-llms-huggingface-api`
> in addition to `quackling`.
> Otherwise, you can set `EMBED_MODEL` & `LLM` as desired, e.g. using
> [local models](https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local).
```python
import os
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
from quackling.llama_index.node_parsers import HierarchicalJSONNodeParser
from quackling.llama_index.readers import DoclingPDFReader
DOCS = ["https://arxiv.org/pdf/2206.01062"]
QUESTION = "How many pages were human annotated?"
EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
LLM = HuggingFaceInferenceAPI(
token=os.getenv("HF_TOKEN"),
model_name="mistralai/Mistral-7B-Instruct-v0.3",
)
index = VectorStoreIndex.from_documents(
documents=DoclingPDFReader(parse_type=DoclingPDFReader.ParseType.JSON).load_data(DOCS),
embed_model=EMBED_MODEL,
transformations=[HierarchicalJSONNodeParser()],
)
query_engine = index.as_query_engine(llm=LLM)
result = query_engine.query(QUESTION)
print(result.response)
# > 80K pages were human annotated
```
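As the note above mentions, `EMBED_MODEL` and `LLM` can be swapped for local alternatives. A minimal sketch of such a swap, assuming the `llama-index-llms-ollama` integration is installed (`pip install llama-index-llms-ollama`) and an Ollama server is running locally — the model name here is an illustrative assumption:

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Local embedding model (downloaded from the Hugging Face Hub on first use).
EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Local LLM served by Ollama; "mistral" is a placeholder for any pulled model.
LLM = Ollama(model="mistral", request_timeout=120.0)
```

The rest of the pipeline stays unchanged; only the two model objects differ.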
### Chunking
You can also use Quackling standalone in any pipeline.
For instance, to split a document into chunks based on document structure, returning pointers
to the Docling document's nodes:
```python
from docling.document_converter import DocumentConverter
from quackling.core.chunkers import HierarchicalChunker
doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2408.09869").output
chunks = list(HierarchicalChunker().chunk(doc))
# > [
# > ChunkWithMetadata(
# > path='$.main-text[4]',
# > text='Docling Technical Report\n[...]',
# > page=1,
# > bbox=[117.56, 439.85, 494.07, 482.42]
# > ),
# > [...]
# > ]
```
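The chunks above carry both text and provenance (JSON path, page, bounding box). As an illustration of consuming them in a framework-agnostic way, here is a small sketch that flattens chunks into plain `(text, metadata)` records suitable for any vector store — `Chunk` is a hypothetical stand-in mirroring the fields printed above, not the actual Quackling class:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical stand-in mirroring the fields shown in the example output.
    path: str
    text: str
    page: int
    bbox: list[float]


def to_records(chunks: list[Chunk]) -> list[tuple[str, dict]]:
    """Flatten chunks into (text, metadata) pairs for a generic vector store."""
    return [
        (c.text, {"path": c.path, "page": c.page, "bbox": c.bbox})
        for c in chunks
    ]


records = to_records([
    Chunk(
        path="$.main-text[4]",
        text="Docling Technical Report",
        page=1,
        bbox=[117.56, 439.85, 494.07, 482.42],
    ),
])
# Each record keeps the provenance metadata alongside the embeddable text.
print(records[0])
```

Keeping the document pointers in the metadata is what lets retrieval results be traced back to the exact page and region of the source PDF.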
## More examples
### LlamaIndex
- [Milvus basic RAG (dense embeddings)](examples/llama_index/basic_pipeline.ipynb)
- [Milvus hybrid RAG (dense & sparse embeddings combined e.g. via RRF) & reranker model usage](examples/llama_index/hybrid_pipeline.ipynb)
- [Milvus RAG also fetching native document metadata for search results](examples/llama_index/native_nodes.ipynb)
- [Local node transformations (e.g. embeddings)](examples/llama_index/node_transformations.ipynb)
- ...
### LangChain
- [Milvus basic RAG (dense embeddings)](examples/langchain/basic_pipeline.ipynb)
## Contributing
Please read [Contributing to Quackling](./CONTRIBUTING.md) for details.
## References
If you use Quackling in your projects, please consider citing the following:
```bib
@techreport{Docling,
author = "Deep Search Team",
month = 8,
title = "Docling Technical Report",
url = "https://arxiv.org/abs/2408.09869",
eprint = "2408.09869",
doi = "10.48550/arXiv.2408.09869",
version = "1.0.0",
year = 2024
}
```
## License
The Quackling codebase is under MIT license.
For individual component usage, please refer to the component licenses found in the original packages.