<h3 align="center">
<img
src="https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/img/unstructured_logo.png"
height="200"
>
</h3>
<h3 align="center">
<p>Open-Source Pre-Processing Tools for Unstructured Data</p>
</h3>
The `unstructured-inference` repo contains hosted model inference code for layout parsing models.
These models are invoked via API as part of the partitioning bricks in the `unstructured` package.
## Installation
### Package
Run `pip install unstructured-inference`.
### Detectron2
[Detectron2](https://github.com/facebookresearch/detectron2) is required for most inference tasks
but is not automatically installed with this package.
For macOS and Linux, build from source with:
```shell
pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'
```
Other install options can be found in the
[Detectron2 installation guide](https://detectron2.readthedocs.io/en/latest/tutorials/install.html).
Windows is not officially supported by Detectron2, but some users are able to install it anyway.
See discussion [here](https://layout-parser.github.io/tutorials/installation#for-windows-users) for
tips on installing Detectron2 on Windows.
### Repository
To install the repository for development, clone the repo and run `make install` to install dependencies.
Run `make help` for a full list of install options.
## Getting Started
To get started with the layout parsing model, use the following Python snippet:
```python
from unstructured_inference.inference.layout import DocumentLayout
layout = DocumentLayout.from_file("sample-docs/loremipsum.pdf")
print(layout.pages[0].elements)
```
Once the model has detected the layout and OCR'd the document, the text extracted from the first
page of the sample document will be displayed.
You can convert a given element to a `dict` by running the `.to_dict()` method.
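For instance, `.to_dict()` makes it straightforward to serialize a page's elements to JSON. The sketch below uses a stand-in element class (hypothetical, purely to keep the example self-contained) in place of real model-detected elements, which would come from `layout.pages[n].elements`:

```python
import json

class StubElement:
    """Stand-in for a detected layout element (hypothetical); real elements
    come from layout.pages[n].elements and expose the same .to_dict() method."""

    def __init__(self, type_, text):
        self.type = type_
        self.text = text

    def to_dict(self):
        return {"type": self.type, "text": self.text}

elements = [
    StubElement("Title", "Lorem Ipsum"),
    StubElement("Text", "Lorem ipsum dolor sit amet, consectetur adipiscing elit."),
]

# Convert each element to a dict, then dump the whole page as JSON.
page_json = json.dumps([el.to_dict() for el in elements], indent=2)
print(page_json)
```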
To build the Docker container, run `make docker-build`. Note that Apple hardware with an M1 chip
has trouble building `Detectron2` in Docker; for best results, build the container on Linux. To
run the API locally, use `make start-app-local`; you can stop the API with `make stop-app-local`.
The API will run at `http://localhost:5000`.
You can then `POST` a PDF file to the API endpoint to see its layout with the command:
```shell
curl -X 'POST' 'http://localhost:5000/layout/pdf' -F 'file=@<your_pdf_file>' | jq -C . | less -R
```
You can also choose which element types are returned from PDF parsing by passing a list of
types to the `include_elems` parameter. For example, to return only `Text` elements and
`Title` elements:
```shell
curl -X 'POST' 'http://localhost:5000/layout/pdf' \
-F 'file=@<your_pdf_file>' \
-F include_elems=Text \
-F include_elems=Title \
| jq -C | less -R
```
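The same request can be issued from Python without third-party dependencies. The sketch below assembles the multipart body by hand with the standard library, including the repeated `include_elems` fields; the file bytes and endpoint here are placeholders, and actually sending the request assumes the API is running locally:

```python
import uuid

def build_multipart(fields, file_field, filename, file_bytes):
    """Assemble a multipart/form-data body by hand (stdlib only)."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields; a name may repeat (e.g. include_elems).
    for name, value in fields:
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The file part carries its own filename and content type.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f"Content-Type: application/pdf\r\n\r\n".encode() + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

body, content_type = build_multipart(
    fields=[("include_elems", "Text"), ("include_elems", "Title")],
    file_field="file",
    filename="sample.pdf",
    file_bytes=b"%PDF-1.4 ...",  # placeholder; read a real PDF's bytes in practice
)

# To send it (assuming the API is up at localhost:5000):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:5000/layout/pdf", data=body,
#     headers={"Content-Type": content_type}, method="POST",
# )
# print(urllib.request.urlopen(req).read().decode())
```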
If you are using an Apple M1 chip, use `make run-app-dev` instead of `make start-app-local` to
start the API with hot reloading. The API will run at `http://localhost:8000`.
View the Swagger documentation at `http://localhost:5000/docs`.
## YoloX model
To use the YoloX model, the endpoints are:
```
http://localhost:8000/layout_v1/pdf
http://localhost:8000/layout_v1/image
```
For example:
```shell
curl -X 'POST' 'http://localhost:8000/layout/yolox/image' \
-F 'file=@sample-docs/test-image.jpg' \
| jq -C | less -R
curl -X 'POST' 'http://localhost:8000/layout/yolox/pdf' \
-F 'file=@sample-docs/loremipsum.pdf' \
| jq -C | less -R
```
If your PDF file doesn't have embedded text, you can force the use of OCR with
the parameter `force_ocr=true`:
```shell
curl -X 'POST' 'http://localhost:8000/layout/yolox/pdf' \
-F 'file=@sample-docs/loremipsum.pdf' \
-F force_ocr=true \
| jq -C | less -R
```
or locally, in Python:
```python
layout = yolox_local_inference(filename, type="pdf")
```
## Security Policy
See our [security policy](https://github.com/Unstructured-IO/unstructured-inference/security/policy) for
information on how to report security vulnerabilities.
## Learn more
| Section | Description |
|-|-|
| [Unstructured Community Github](https://github.com/Unstructured-IO/community) | Information about Unstructured.io community projects |
| [Unstructured Github](https://github.com/Unstructured-IO) | Unstructured.io open source repositories |
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |