<h3 align="center">
<img
src="https://raw.githubusercontent.com/Unstructured-IO/unstructured-api-tools/main/img/unstructured_logo.png"
height="200"
>
</h3>
<h3 align="center">
<p>Open-Source Pre-Processing Tools for Unstructured Data</p>
</h3>
The `unstructured_api_tools` library includes utilities for converting pipeline notebooks into
REST API applications. `unstructured_api_tools` is intended for use in conjunction with
pipeline repos. See [`pipeline-sec-filings`](https://github.com/Unstructured-IO/pipeline-sec-filings)
for an example of a repo that uses `unstructured_api_tools`.
## Installation
To install the library, run `pip install unstructured_api_tools`.
## Developer Quick Start
* Using `pyenv` to manage virtualenvs is recommended
    * Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.
        * `brew install pyenv-virtualenv`
        * `pyenv install 3.8.15`
    * Linux instructions are available [here](https://github.com/Unstructured-IO/community#linux).
* Create a virtualenv to work in and activate it, e.g. for one named `unstructured_api_tools`:

    `pyenv virtualenv 3.8.15 unstructured_api_tools` <br />
    `pyenv activate unstructured_api_tools`

* Run `make install-project-local`
## Usage
Use the CLI command to convert pipeline notebooks to scripts, for example:
```bash
unstructured_api_tools convert-pipeline-notebooks \
  --input-directory pipeline-family-sec-filings/pipeline-notebooks \
  --output-directory pipeline-family-sec-filings/prepline_sec_filings/api \
  --pipeline-family sec-filings \
  --semver 0.2.1
```
If you omit the `--pipeline-family` and `--semver` arguments, those values are parsed from
`preprocessing-pipeline-family.yaml`. You can point to that file explicitly with
`--config-filename` or the `PIPELINE_FAMILY_CONFIG` environment variable. If neither is specified,
the tool falls back to the `preprocessing-pipeline-family.yaml` file in the current working
directory.
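The lookup described above can be sketched in Python. This is an illustration only, not the library's implementation: `resolve_config_path` and `DEFAULT_CONFIG` are hypothetical names, and the sketch assumes that an explicit `--config-filename` value takes precedence over the environment variable, which the text above does not state.

```python
import os
from pathlib import Path
from typing import Optional

# File name taken from the README; the constant name itself is hypothetical.
DEFAULT_CONFIG = "preprocessing-pipeline-family.yaml"


def resolve_config_path(config_filename: Optional[str] = None) -> Path:
    """Pick the config file: explicit argument, then env var, then cwd default.

    Assumption: the explicit argument wins over PIPELINE_FAMILY_CONFIG.
    """
    if config_filename:
        return Path(config_filename)
    env_value = os.environ.get("PIPELINE_FAMILY_CONFIG")
    if env_value:
        return Path(env_value)
    return Path.cwd() / DEFAULT_CONFIG
```

Calling `resolve_config_path()` with no argument and no environment variable set returns the default file name joined to the current working directory.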
The API file undergoes `black`, `flake8` and `mypy` checks after being generated. If you want
`flake8` to ignore specific errors, you can specify them through the CLI with
`--flake8-ignore F401, E402`.
See the [`flake8` docs](https://flake8.pycqa.org/en/latest/user/error-codes.html#error-violation-codes)
for a full list of error codes.
### Conversion from `pipeline_api` to FastAPI
The command described in [**Usage**](#usage) generates a FastAPI route for each `pipeline_api`
function defined in the notebook. The signature of the `pipeline_api` function determines which
parameters the generated route accepts.

Currently, only plain text file uploads are supported, so the first argument must always be
`text`. Support for multiple files and binary files is coming soon!

In addition, any number of string array parameters may be specified. Any kwarg beginning with
`m_` indicates a multi-value string parameter that is accepted by the generated API.
For example, in a notebook containing:

    def pipeline_api(text, m_subject=[], m_name=[]):

`text` represents the content of a file posted to the API, and the `m_subject` and `m_name`
keyword args represent optional parameters that may be posted to the API as well, each accepting
multiple string values. As the request below shows, the `m_` prefix is dropped in the form field
names. A `curl` request against such an API could look like this:

    curl -X 'POST' \
      'https://<hostname>/<pipeline-family-name>/<pipeline-family-version>/<api-name>' \
      -H 'accept: application/json' \
      -H 'Content-Type: multipart/form-data' \
      -F 'file=@file-to-process.txt' \
      -F 'subject=art' \
      -F 'subject=history' \
      -F 'subject=math' \
      -F 'name=feynman'
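To illustrate the naming convention, the following sketch lists the form field names implied by a `pipeline_api` signature by stripping the `m_` prefix. The helper `multi_value_fields` is hypothetical and is not part of `unstructured_api_tools`; the generated API code may derive these names differently. The function body is a placeholder.

```python
import inspect


def pipeline_api(text, m_subject=[], m_name=[]):
    """Placeholder pipeline function mirroring the signature shown above."""
    return {"subject": m_subject, "name": m_name, "length": len(text)}


def multi_value_fields(func):
    """Return form field names implied by `m_`-prefixed keyword arguments.

    Hypothetical helper for illustration only.
    """
    params = inspect.signature(func).parameters
    return [name[len("m_"):] for name in params if name.startswith("m_")]


print(multi_value_fields(pipeline_api))  # ['subject', 'name']
```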
In addition, you can let the caller choose the response type if `pipeline_api` supports both
"application/json" and "text/csv" as return types.

For example, in a notebook containing a kwarg `response_type`:

    def pipeline_api(text, response_type="text/csv", m_subject=[], m_name=[]):

The consumer of the API may then request either content type with the standard HTTP `Accept`
header, e.g. `Accept: application/json` or `Accept: text/csv`.
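As a minimal sketch of what such a function body might look like, the example below branches on `response_type` and returns either CSV text or JSON-serializable data. The row layout (`line`, `text`) and the choice to return a list of dicts for JSON are assumptions made for illustration; the source does not specify what a real pipeline should return.

```python
import csv
import io


def pipeline_api(text, response_type="text/csv", m_subject=[], m_name=[]):
    """Toy pipeline: one row per input line, rendered as CSV or as a list."""
    rows = [{"line": i, "text": line} for i, line in enumerate(text.splitlines())]
    if response_type == "text/csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["line", "text"])
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    # Any other value falls back to JSON-serializable data in this sketch.
    return rows
```

With this sketch, `pipeline_api("a\nb", response_type="application/json")` returns a list of two dicts, while the default call returns a CSV string with a header row.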
## Security Policy
See our [security policy](https://github.com/Unstructured-IO/unstructured-api-tools/security/policy) for
information on how to report security vulnerabilities.
## Learn more
| Section | Description |
|-|-|
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |
Raw data
{
"_id": null,
"home_page": "https://github.com/Unstructured-IO/unstructured-api-tools",
"name": "unstructured-api-tools",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8.0",
"maintainer_email": "",
"keywords": "NLP PDF HTML CV XML parsing preprocessing",
"author": "Unstructured Technologies",
"author_email": "mrobinson@unstructuredai.io",
"download_url": "https://files.pythonhosted.org/packages/e2/36/7bd2e0ccb2a988b0e8b595b3e8559f859f053b9b21657ee61c74a617c567/unstructured_api_tools-0.10.11.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "A library that prepares raw documents for downstream ML tasks.",
"version": "0.10.11",
"project_urls": {
"Homepage": "https://github.com/Unstructured-IO/unstructured-api-tools"
},
"split_keywords": [
"nlp",
"pdf",
"html",
"cv",
"xml",
"parsing",
"preprocessing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "75b5b95a391d63290ce7429b0a2aded9ea7402e6263ef9bec5ae45f67ba5eadd",
"md5": "80a36382996456318946a64bf8604796",
"sha256": "76926c571b751b16a6cce2275dc36c48bd5c922be1d96dad796a28ad53e91694"
},
"downloads": -1,
"filename": "unstructured_api_tools-0.10.11-py3-none-any.whl",
"has_sig": false,
"md5_digest": "80a36382996456318946a64bf8604796",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8.0",
"size": 22458,
"upload_time": "2023-08-14T21:53:11",
"upload_time_iso_8601": "2023-08-14T21:53:11.026842Z",
"url": "https://files.pythonhosted.org/packages/75/b5/b95a391d63290ce7429b0a2aded9ea7402e6263ef9bec5ae45f67ba5eadd/unstructured_api_tools-0.10.11-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e2367bd2e0ccb2a988b0e8b595b3e8559f859f053b9b21657ee61c74a617c567",
"md5": "d60df4d2d7611dd1322e079c8b7f29bd",
"sha256": "27d354d24a7c5615dfadaf5813ffc47c35e31a642a1bc8ff40389ac05bca39c0"
},
"downloads": -1,
"filename": "unstructured_api_tools-0.10.11.tar.gz",
"has_sig": false,
"md5_digest": "d60df4d2d7611dd1322e079c8b7f29bd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8.0",
"size": 21763,
"upload_time": "2023-08-14T21:53:12",
"upload_time_iso_8601": "2023-08-14T21:53:12.210390Z",
"url": "https://files.pythonhosted.org/packages/e2/36/7bd2e0ccb2a988b0e8b595b3e8559f859f053b9b21657ee61c74a617c567/unstructured_api_tools-0.10.11.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-08-14 21:53:12",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Unstructured-IO",
"github_project": "unstructured-api-tools",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "unstructured-api-tools"
}