sparrow-parse

Name	sparrow-parse
Version	0.5.0
home_page	https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse
Summary	Sparrow Parse is a Python package (part of Sparrow) for parsing and extracting information from documents.
upload_time	2025-01-09 12:28:45
maintainer	None
docs_url	None
author	Andrej Baranovskij
requires_python	>=3.10
license	None
keywords	llm vllm ocr vision
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# Sparrow Parse

## Description

This module implements the Sparrow Parse [library](https://pypi.org/project/sparrow-parse/), which provides helper methods for data pre-processing, parsing and information extraction. The library relies on Visual LLM functionality and Table Transformers, and is part of Sparrow. See the main [README](https://github.com/katanaml/sparrow).

## Install

```
pip install sparrow-parse
```

## Parsing and extraction

### Sparrow Parse VL (vision-language model) extractor with local MLX or Hugging Face Cloud GPU infra

```
# run locally: python -m sparrow_parse.extractors.vllm_extractor

from sparrow_parse.vllm.inference_factory import InferenceFactory
from sparrow_parse.extractors.vllm_extractor import VLLMExtractor

extractor = VLLMExtractor()

config = {
    "method": "mlx",  # Could be 'huggingface', 'mlx' or 'local_gpu'
    "model_name": "mlx-community/Qwen2-VL-72B-Instruct-4bit",
}

# Use the factory to get the correct instance
factory = InferenceFactory(config)
model_inference_instance = factory.get_inference_instance()

input_data = [
    {
        "file_path": "/Users/andrejb/Work/katana-git/sparrow/sparrow-ml/llm/data/bonds_table.jpg",
        "text_input": "retrieve all data. return response in JSON format"
    }
]

# Now you can run inference without knowing which implementation is used
results_array, num_pages = extractor.run_inference(model_inference_instance, input_data, tables_only=False,
                                 generic_query=False,
                                 debug_dir=None,
                                 debug=True,
                                 mode=None)

for i, result in enumerate(results_array):
    print(f"Result for page {i + 1}:", result)
print(f"Number of pages: {num_pages}")
```

Use `tables_only=True` if you want to extract only tables.

Use `mode="static"` if you want to simulate the LLM call without executing the LLM backend.

The `run_inference` method returns the results and the number of pages processed.
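
The prompt in the example above asks for a JSON response, so each entry in `results_array` is presumably a JSON string. Here is a minimal sketch of decoding the per-page results under that assumption. The helper name `parse_page_results` is hypothetical and not part of Sparrow Parse, and the sample values are stand-ins rather than real model output.

```python
import json


def parse_page_results(results_array):
    """Decode each page result; keep the raw value when it is not valid JSON."""
    parsed = []
    for page_number, raw in enumerate(results_array, start=1):
        try:
            parsed.append({"page": page_number, "data": json.loads(raw), "error": None})
        except (json.JSONDecodeError, TypeError) as exc:
            # Model output is not guaranteed to be well-formed JSON
            parsed.append({"page": page_number, "data": raw, "error": str(exc)})
    return parsed


# Stand-in values for what run_inference might return (assumed format)
sample_results = ['{"a": 1}', "not json"]
for page in parse_page_results(sample_results):
    status = "ok" if page["error"] is None else "failed"
    print(f"Page {page['page']}: {status}")
```

Keeping the raw value alongside the error makes it easier to inspect pages where the model returned something other than JSON.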

To run with the Hugging Face backend, use these config values:

```
import os

config = {
    "method": "huggingface",  # Could be 'huggingface', 'mlx' or 'local_gpu'
    "hf_space": "katanaml/sparrow-qwen2-vl-7b",
    "hf_token": os.getenv('HF_TOKEN'),
}
```

Note: the GPU backend `katanaml/sparrow-qwen2-vl-7b` is private. To run with the config above, you need to create your own backend on a Hugging Face Space using the [code](https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse/sparrow_parse/vllm/infra/qwen2_vl_7b) from Sparrow Parse.
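
The two config dictionaries shown above can be produced by one small helper, so that a missing `HF_TOKEN` is caught before the factory is called. This is only a sketch: `build_config` is a hypothetical name, not part of Sparrow Parse, and it covers only the `mlx` and `huggingface` keys documented here. The keys for `local_gpu` are not shown in this README, so the sketch rejects that method.

```python
import os


def build_config(method, model_name=None, hf_space=None):
    """Return a config dict for the chosen backend (hypothetical helper)."""
    if method == "huggingface":
        token = os.getenv("HF_TOKEN")
        if not token:
            # Fail early instead of sending an empty token to the backend
            raise RuntimeError("HF_TOKEN is not set")
        return {"method": method, "hf_space": hf_space, "hf_token": token}
    if method == "mlx":
        return {"method": method, "model_name": model_name}
    raise ValueError(f"Unsupported method in this sketch: {method}")


config = build_config("mlx", model_name="mlx-community/Qwen2-VL-72B-Instruct-4bit")
print(config)
```

The resulting dictionary can then be passed to `InferenceFactory(config)` as in the first example.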

## PDF pre-processing

```
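from sparrow_parse.extractor.pdf_optimizer import PDFOptimizer

pdf_optimizer = PDFOptimizer()

file_path = "/data/invoice_1.pdf"  # PDF to split
output_directory = None            # set to a directory only for debugging
convert_to_images = False          # False: split into PDF files

num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(file_path,
                                                                     output_directory,
                                                                     convert_to_images)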
```

Example:

*file_path* - path to the source PDF, for example `/data/invoice_1.pdf`

*output_directory* - leave as `None` in normal use; set it to a directory for debug purposes only

*convert_to_images* - default `False`, which splits the document into PDF files
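
One possible way to connect the split pages to the extractor is to build one `input_data` entry per page. This is a sketch under two unconfirmed assumptions: that `output_files` is a list of file paths, and that `run_inference` accepts one entry per page. The helper `build_input_data` and the sample paths are hypothetical.

```python
def build_input_data(output_files, text_input):
    """Create one inference request per split page (hypothetical helper)."""
    return [{"file_path": str(path), "text_input": text_input} for path in output_files]


# Stand-in paths; in practice these would come from split_pdf_to_pages
pages = ["/tmp/page_1.jpg", "/tmp/page_2.jpg"]
input_data = build_input_data(pages, "retrieve all data. return response in JSON format")
print(len(input_data))
```

Each entry uses the same `file_path` and `text_input` keys as the `input_data` example in the extraction section.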

## Library build

Create Python virtual environment

```
python -m venv .env_sparrow_parse
```

Install Python libraries

```
pip install -r requirements.txt
```

Build package

```
pip install setuptools wheel
python setup.py sdist bdist_wheel
```

Upload to PyPI

```
pip install twine
twine upload dist/*
```

## Commercial usage

Sparrow is available under the GPL 3.0 license, promoting freedom to use, modify, and distribute the software while ensuring any modifications remain open source under the same license. This aligns with our commitment to supporting the open-source community and fostering collaboration.

Additionally, we recognize the diverse needs of organizations, including small to medium-sized enterprises (SMEs). Therefore, Sparrow is also offered for free commercial use to organizations with gross revenue below $5 million USD in the past 12 months, enabling them to leverage Sparrow without the financial burden often associated with high-quality software solutions.

For businesses that exceed this revenue threshold or require usage terms not accommodated by the GPL 3.0 license—such as integrating Sparrow into proprietary software without the obligation to disclose source code modifications—we offer dual licensing options. Dual licensing allows Sparrow to be used under a separate proprietary license, offering greater flexibility for commercial applications and proprietary integrations. This model supports both the project's sustainability and the business's needs for confidentiality and customization.

If your organization is seeking to utilize Sparrow under a proprietary license, or if you are interested in custom workflows, consulting services, or dedicated support and maintenance options, please contact us at abaranovskis@redsamuraiconsulting.com. We're here to provide tailored solutions that meet your unique requirements, ensuring you can maximize the benefits of Sparrow for your projects and workflows.

## Author

[Katana ML](https://katanaml.io), [Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)

## License

Licensed under the GPL 3.0. Copyright 2020-2024 Katana ML, Andrej Baranovskij. [Copy of the license](https://github.com/katanaml/sparrow/blob/main/LICENSE).

Raw data

{
    "_id": null,
    "home_page": "https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse",
    "name": "sparrow-parse",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "llm, vllm, ocr, vision",
    "author": "Andrej Baranovskij",
    "author_email": "andrejus.baranovskis@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/2b/e6/408eb4afe69144fa8ee3acb11f116516194ae1a785b9641c26c9fa1998fb/sparrow-parse-0.5.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Sparrow Parse is a Python package (part of Sparrow) for parsing and extracting information from documents.",
    "version": "0.5.0",
    "project_urls": {
        "Homepage": "https://github.com/katanaml/sparrow/tree/main/sparrow-data/parse",
        "Repository": "https://github.com/katanaml/sparrow"
    },
    "split_keywords": [
        "llm",
        " vllm",
        " ocr",
        " vision"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9acba248706420e5f83724d3e7edd9903ca9446341ebf1c39db4e498ccc3e0ca",
                "md5": "ab3d456f2d2cea0d8316c13657c0c09e",
                "sha256": "531691fa613f62a80d140725be5444aba97c3c92525fa0fbcc40eab5aba6d224"
            },
            "downloads": -1,
            "filename": "sparrow_parse-0.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ab3d456f2d2cea0d8316c13657c0c09e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 16208,
            "upload_time": "2025-01-09T12:28:43",
            "upload_time_iso_8601": "2025-01-09T12:28:43.778594Z",
            "url": "https://files.pythonhosted.org/packages/9a/cb/a248706420e5f83724d3e7edd9903ca9446341ebf1c39db4e498ccc3e0ca/sparrow_parse-0.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2be6408eb4afe69144fa8ee3acb11f116516194ae1a785b9641c26c9fa1998fb",
                "md5": "029a2f5bc22c43b84311d2ec4e9d8d47",
                "sha256": "692087777dc7b6c5a1d281195e053fb601496aabc864c5fcb9bdae3f5157c8c7"
            },
            "downloads": -1,
            "filename": "sparrow-parse-0.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "029a2f5bc22c43b84311d2ec4e9d8d47",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 14669,
            "upload_time": "2025-01-09T12:28:45",
            "upload_time_iso_8601": "2025-01-09T12:28:45.956872Z",
            "url": "https://files.pythonhosted.org/packages/2b/e6/408eb4afe69144fa8ee3acb11f116516194ae1a785b9641c26c9fa1998fb/sparrow-parse-0.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-09 12:28:45",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "katanaml",
    "github_project": "sparrow",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "sparrow-parse"
}
