<h3 align="center">
<img
src="https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/img/unstructured_logo.png"
height="200"
>
</h3>
<h3 align="center">
<p>Open-Source Pre-Processing Tools for Unstructured Data</p>
</h3>
The `unstructured-inference` repo contains hosted model inference code for layout parsing models.
These models are invoked via API as part of the partitioning bricks in the `unstructured` package.
## Installation
### Package
Run `pip install unstructured-inference`.
### Detectron2
[Detectron2](https://github.com/facebookresearch/detectron2) is required for most inference tasks
but is not automatically installed with this package.
For macOS and Linux, build from source with:
```shell
pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'
```
Other install options can be found in the
[Detectron2 installation guide](https://detectron2.readthedocs.io/en/latest/tutorials/install.html).
Windows is not officially supported by Detectron2, but some users are able to install it anyway.
See discussion [here](https://layout-parser.github.io/tutorials/installation#for-windows-users) for
tips on installing Detectron2 on Windows.
### Repository
To install the repository for development, clone the repo and run `make install` to install dependencies.
Run `make help` for a full list of install options.
## Getting Started
To get started with the layout parsing model, use the following Python snippet:
```python
from unstructured_inference.inference.layout import DocumentLayout
layout = DocumentLayout.from_file("sample-docs/loremipsum.pdf")
print(layout.pages[0].elements)
```
Once the model has detected the layout and OCR'd the document, the text extracted from the first
page of the sample document will be displayed.
You can convert a given element to a `dict` by running the `.to_dict()` method.
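For instance, `.to_dict()` makes it straightforward to serialize a page's elements to JSON. The sketch below uses a stand-in element class (hypothetical, purely to keep the example self-contained) in place of real model-detected elements, which would come from `layout.pages[n].elements`:

```python
import json

class StubElement:
    """Stand-in for a detected layout element (hypothetical); real elements
    come from layout.pages[n].elements and expose the same .to_dict() method."""

    def __init__(self, type_, text):
        self.type = type_
        self.text = text

    def to_dict(self):
        return {"type": self.type, "text": self.text}

elements = [
    StubElement("Title", "Lorem Ipsum"),
    StubElement("Text", "Lorem ipsum dolor sit amet, consectetur adipiscing elit."),
]

# Convert each element to a dict, then dump the whole page as JSON.
page_json = json.dumps([el.to_dict() for el in elements], indent=2)
print(page_json)
```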
To build the Docker container, run `make docker-build`. Note that Apple hardware with an M1 chip
has trouble building `Detectron2` in Docker; for best results, build the container on Linux. To
run the API locally, use `make start-app-local`; you can stop the API with `make stop-app-local`.
The API will run at `http://localhost:5000`.
You can then `POST` a PDF file to the API endpoint to see its layout with the command:
```shell
curl -X 'POST' 'http://localhost:5000/layout/pdf' -F 'file=@<your_pdf_file>' | jq -C . | less -R
```
You can also choose which element types are returned from PDF parsing by passing a list of
types to the `include_elems` parameter. For example, to return only `Text` elements and
`Title` elements:
```shell
curl -X 'POST' 'http://localhost:5000/layout/pdf' \
-F 'file=@<your_pdf_file>' \
-F include_elems=Text \
-F include_elems=Title \
| jq -C | less -R
```
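The same request can be issued from Python without third-party dependencies. The sketch below assembles the multipart body by hand with the standard library, including the repeated `include_elems` fields; the file bytes and endpoint here are placeholders, and actually sending the request assumes the API is running locally:

```python
import uuid

def build_multipart(fields, file_field, filename, file_bytes):
    """Assemble a multipart/form-data body by hand (stdlib only)."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields; a name may repeat (e.g. include_elems).
    for name, value in fields:
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The file part carries its own filename and content type.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f"Content-Type: application/pdf\r\n\r\n".encode() + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

body, content_type = build_multipart(
    fields=[("include_elems", "Text"), ("include_elems", "Title")],
    file_field="file",
    filename="sample.pdf",
    file_bytes=b"%PDF-1.4 ...",  # placeholder; read a real PDF's bytes in practice
)

# To send it (assuming the API is up at localhost:5000):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:5000/layout/pdf", data=body,
#     headers={"Content-Type": content_type}, method="POST",
# )
# print(urllib.request.urlopen(req).read().decode())
```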
If you are using an Apple M1 chip, use `make run-app-dev` instead of `make start-app-local` to
start the API with hot reloading. The API will run at `http://localhost:8000`.
View the Swagger documentation at `http://localhost:5000/docs`.
## YoloX model
To use the YoloX model, the endpoints are:
```
http://localhost:8000/layout_v1/pdf
http://localhost:8000/layout_v1/image
```
For example:
```shell
curl -X 'POST' 'http://localhost:8000/layout/yolox/image' \
-F 'file=@sample-docs/test-image.jpg' \
| jq -C | less -R
curl -X 'POST' 'http://localhost:8000/layout/yolox/pdf' \
-F 'file=@sample-docs/loremipsum.pdf' \
| jq -C | less -R
```
If your PDF file doesn't have embedded text, you can force the use of OCR with
the parameter `force_ocr=true`:
```shell
curl -X 'POST' 'http://localhost:8000/layout/yolox/pdf' \
-F 'file=@sample-docs/loremipsum.pdf' \
-F force_ocr=true \
| jq -C | less -R
```
or locally, in Python:
```python
layout = yolox_local_inference(filename, type="pdf")
```
## Security Policy
See our [security policy](https://github.com/Unstructured-IO/unstructured-inference/security/policy) for
information on how to report security vulnerabilities.
## Learn more
| Section | Description |
|-|-|
| [Unstructured Community Github](https://github.com/Unstructured-IO/community) | Information about Unstructured.io community projects |
| [Unstructured Github](https://github.com/Unstructured-IO) | Unstructured.io open source repositories |
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |