<h3 align="center">
<img
src="https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/img/unstructured_logo.png"
height="200"
>
</h3>
<h3 align="center">
<p>Open-Source Pre-Processing Tools for Unstructured Data</p>
</h3>
The `unstructured-inference` repo contains hosted model inference code for layout parsing models.
These models are invoked via API as part of the partitioning bricks in the `unstructured` package.
## Installation
### Package
Run `pip install unstructured-inference`.
### Detectron2
[Detectron2](https://github.com/facebookresearch/detectron2) is required for using models from the [layoutparser model zoo](#using-models-from-the-layoutparser-model-zoo)
but is not automatically installed with this package.
On macOS and Linux, build from source with:
```shell
pip install 'git+https://github.com/facebookresearch/detectron2.git@57bdb21249d5418c130d54e2ebdc94dda7a4c01a'
```
Other install options can be found in the
[Detectron2 installation guide](https://detectron2.readthedocs.io/en/latest/tutorials/install.html).
Windows is not officially supported by Detectron2, but some users are able to install it anyway.
See discussion [here](https://layout-parser.github.io/tutorials/installation#for-windows-users) for
tips on installing Detectron2 on Windows.
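Because Detectron2 is an optional dependency, it can help to check whether it is importable before choosing a layoutparser-based model. The following is a minimal sketch that uses only the standard library; the helper names `is_installed` and `has_detectron2` are hypothetical and not part of this package.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def has_detectron2() -> bool:
    """Hypothetical helper: report whether Detectron2 is available."""
    return is_installed("detectron2")


if __name__ == "__main__":
    print("Detectron2 available:", has_detectron2())
```

If the check returns `False`, a detection model that does not depend on Detectron2 can be used instead.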
### Repository
To install the repository for development, clone the repo and run `make install` to install dependencies.
Run `make help` for a full list of install options.
## Getting Started
To get started with the layout parsing model, run the following code:
```python
from unstructured_inference.inference.layout import DocumentLayout
layout = DocumentLayout.from_file("sample-docs/loremipsum.pdf")
print(layout.pages[0].elements)
```
Once the model has detected the layout and OCR'd the document, the text extracted from the first
page of the sample document will be displayed.
You can convert a given element to a `dict` by running the `.to_dict()` method.
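To serialize every element on a page, `.to_dict()` can be applied to each element in turn. The sketch below assumes only that each element exposes `.to_dict()` as described above; the helper name `elements_to_json` is hypothetical, and `default=str` is a defensive assumption in case a dictionary holds values that are not directly JSON-serializable.

```python
import json
from typing import Any, Iterable


def elements_to_json(elements: Iterable[Any]) -> str:
    """Serialize elements to a JSON string by calling `.to_dict()` on each one."""
    # default=str is a fallback for values json cannot encode natively.
    return json.dumps([element.to_dict() for element in elements], indent=2, default=str)


# Example usage with the layout from the snippet above (requires the package
# and the sample document):
# print(elements_to_json(layout.pages[0].elements))
```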
## Models
The inference pipeline operates by finding text elements in a document page using a detection model, then extracting the contents of the elements using direct extraction (if available), OCR, and optionally table inference models.
We offer several detection models including [Detectron2](https://github.com/facebookresearch/detectron2) and [YOLOX](https://github.com/Megvii-BaseDetection/YOLOX).
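The pipeline described above can be pictured as detection followed by per-element extraction with a fallback. The sketch below is illustrative only and does not use the library's classes: `Region`, `extract_page`, and the callables passed in are hypothetical stand-ins for the detection model, direct extraction, OCR, and the optional table model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in page coordinates


@dataclass
class Region:
    """Hypothetical stand-in for a detected element; not the library's class."""
    bbox: BBox
    kind: str
    text: Optional[str] = None
    table_html: Optional[str] = None


def extract_page(
    detect: Callable[[], List[Region]],
    direct_text: Callable[[BBox], Optional[str]],
    ocr: Callable[[BBox], str],
    table_model: Optional[Callable[[BBox], str]] = None,
) -> List[Region]:
    """Detect regions, then fill each one: embedded text first, OCR as fallback."""
    regions = detect()
    for region in regions:
        embedded = direct_text(region.bbox)
        region.text = embedded if embedded else ocr(region.bbox)
        # Table inference is optional and only applies to table regions.
        if table_model is not None and region.kind == "Table":
            region.table_html = table_model(region.bbox)
    return regions
```

Passing the stages as callables keeps the sketch independent of any particular detection or OCR backend.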
### Using a non-default model
When doing inference, an alternate model can be used by passing the model object to the ingestion method via the `detection_model` parameter. The `get_model` function can be used to construct one of our out-of-the-box models from a keyword, e.g.:
```python
from unstructured_inference.models.base import get_model
from unstructured_inference.inference.layout import DocumentLayout
model = get_model("yolox")
layout = DocumentLayout.from_file("sample-docs/layout-parser-paper.pdf", detection_model=model)
```
### Using models from the layoutparser model zoo
The `UnstructuredDetectronModel` class in `unstructured_inference.models.detectron2` uses the `faster_rcnn_R_50_FPN_3x` model pretrained on DocLayNet by default. By passing different construction parameters, any model in the `layoutparser` [model zoo](https://layout-parser.readthedocs.io/en/latest/notes/modelzoo.html) can be used. `UnstructuredDetectronModel` is a light wrapper around the `layoutparser` `Detectron2LayoutModel` object and accepts the same arguments. See the [layoutparser documentation](https://layout-parser.readthedocs.io/en/latest/api_doc/models.html#layoutparser.models.Detectron2LayoutModel) for details.
### Using your own model
Any detection model can be used in the `unstructured_inference` pipeline by wrapping it in a subclass of `UnstructuredObjectDetectionModel`. To integrate with the `DocumentLayout` class, the subclass must have a `predict` method that accepts a `PIL.Image.Image` and returns a list of `LayoutElement` objects, and an `initialize` method that loads the model and prepares it for inference.
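The shape of such a wrapper can be sketched without the library installed. In the example below, `StubDetectionModel` and `StubLayoutElement` are hypothetical stand-ins for `UnstructuredObjectDetectionModel` and `LayoutElement`, and their fields are assumptions. A real integration would subclass the library's class and return its element type.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List


@dataclass
class StubLayoutElement:
    """Stand-in for LayoutElement; the field names here are assumptions."""
    x1: float
    y1: float
    x2: float
    y2: float
    type: str


class StubDetectionModel(ABC):
    """Stand-in base class showing only the two methods the text requires."""

    @abstractmethod
    def initialize(self, **kwargs: Any) -> None:
        """Load the model and prepare it for inference."""

    @abstractmethod
    def predict(self, image: Any) -> List[StubLayoutElement]:
        """Return detected elements for one page image."""


class WholePageModel(StubDetectionModel):
    """Toy detector that reports the entire page as a single region."""

    def initialize(self, **kwargs: Any) -> None:
        # A real model would load weights here; this one only stores a label.
        self.label = kwargs.get("label", "Text")

    def predict(self, image: Any) -> List[StubLayoutElement]:
        # PIL images expose `.size` as (width, height).
        width, height = image.size
        return [StubLayoutElement(0, 0, width, height, self.label)]
```

Keeping model loading in `initialize` rather than in the constructor means a model object can be created cheaply and loaded only when inference is needed.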
## Security Policy
See our [security policy](https://github.com/Unstructured-IO/unstructured-inference/security/policy) for
information on how to report security vulnerabilities.
## Learn more
| Section | Description |
|-|-|
| [Unstructured Community Github](https://github.com/Unstructured-IO/community) | Information about Unstructured.io community projects |
| [Unstructured Github](https://github.com/Unstructured-IO) | Unstructured.io open source repositories |
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |
Raw data
{
"_id": null,
"home_page": "https://github.com/Unstructured-IO/unstructured-inference",
"name": "unstructured-inference",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7.0",
"maintainer_email": null,
"keywords": "NLP PDF HTML CV XML parsing preprocessing",
"author": "Unstructured Technologies",
"author_email": "devops@unstructuredai.io",
"download_url": "https://files.pythonhosted.org/packages/ba/dc/273b0b4f325962ea9649d28414088d8eae177882586a638ff80a9846f14f/unstructured_inference-0.8.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "A library for performing inference using trained models.",
"version": "0.8.1",
"project_urls": {
"Homepage": "https://github.com/Unstructured-IO/unstructured-inference"
},
"split_keywords": [
"nlp",
"pdf",
"html",
"cv",
"xml",
"parsing",
"preprocessing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "d5f166af1f6feb219917ad10ac8815ce9c10441d5dc63ba4d660fc4f54dce37e",
"md5": "2d105f39eda26a86c869f7a3d0327376",
"sha256": "1f22fd25906ab8ecc7ea69c3aa9dcfb585ae51ba5d5770fc7c151b43851e9f9a"
},
"downloads": -1,
"filename": "unstructured_inference-0.8.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2d105f39eda26a86c869f7a3d0327376",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7.0",
"size": 48365,
"upload_time": "2024-10-25T18:57:11",
"upload_time_iso_8601": "2024-10-25T18:57:11.484262Z",
"url": "https://files.pythonhosted.org/packages/d5/f1/66af1f6feb219917ad10ac8815ce9c10441d5dc63ba4d660fc4f54dce37e/unstructured_inference-0.8.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "badc273b0b4f325962ea9649d28414088d8eae177882586a638ff80a9846f14f",
"md5": "42514d09661db0efe3ee0875101cf830",
"sha256": "a73ffdc89a6e55315ad9700878a9c18faf845989cf065ca69216e0893051be8d"
},
"downloads": -1,
"filename": "unstructured_inference-0.8.1.tar.gz",
"has_sig": false,
"md5_digest": "42514d09661db0efe3ee0875101cf830",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7.0",
"size": 44555,
"upload_time": "2024-10-25T18:57:13",
"upload_time_iso_8601": "2024-10-25T18:57:13.032080Z",
"url": "https://files.pythonhosted.org/packages/ba/dc/273b0b4f325962ea9649d28414088d8eae177882586a638ff80a9846f14f/unstructured_inference-0.8.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-25 18:57:13",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Unstructured-IO",
"github_project": "unstructured-inference",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "unstructured-inference"
}