unstructured-api-tools


| Field | Value |
|-|-|
| Name | unstructured-api-tools |
| Version | 0.10.11 |
| home_page | https://github.com/Unstructured-IO/unstructured-api-tools |
| Summary | A library that prepares raw documents for downstream ML tasks. |
| upload_time | 2023-08-14 21:53:12 |
| docs_url | None |
| author | Unstructured Technologies |
| requires_python | >=3.8.0 |
| license | Apache-2.0 |
| keywords | nlp pdf html cv xml parsing preprocessing |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls | test coverage |

<h3 align="center">
  <img
    src="https://raw.githubusercontent.com/Unstructured-IO/unstructured-api-tools/main/img/unstructured_logo.png"
    height="200"
  >
</h3>

<h3 align="center">
  <p>Open-Source Pre-Processing Tools for Unstructured Data</p>
</h3>


The `unstructured_api_tools` library includes utilities for converting pipeline notebooks into
REST API applications. `unstructured_api_tools` is intended for use in conjunction with
pipeline repos. See [`pipeline-sec-filings`](https://github.com/Unstructured-IO/pipeline-sec-filings)
for an example of a repo that uses `unstructured_api_tools`.

## Installation

To install the library, run `pip install unstructured_api_tools`.

## Developer Quick Start

* Using `pyenv` to manage virtualenvs is recommended.
  * Mac install instructions (see [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions):
    * `brew install pyenv-virtualenv`
    * `pyenv install 3.8.15`
  * Linux instructions are available [here](https://github.com/Unstructured-IO/community#linux).

* Create a virtualenv to work in and activate it, e.g. for one named `unstructured_api_tools`:

	`pyenv virtualenv 3.8.15 unstructured_api_tools` <br />
	`pyenv activate unstructured_api_tools`

* Run `make install-project-local`

## Usage

Use the CLI command to convert pipeline notebooks to scripts, for example:

```bash
unstructured_api_tools convert-pipeline-notebooks \
  --input-directory pipeline-family-sec-filings/pipeline-notebooks \
  --output-directory pipeline-family-sec-filings/prepline_sec_filings/api \
  --pipeline-family sec-filings \
  --semver 0.2.1
```

If you do not provide the `--pipeline-family` and `--semver` arguments, those values are parsed from
`preprocessing-pipeline-family.yaml`. You can specify the path to that file explicitly with
`--config-filename` or the `PIPELINE_FAMILY_CONFIG` environment variable. If neither is specified,
the tool falls back to the `preprocessing-pipeline-family.yaml` file in the current working
directory.
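
The lookup order described above can be sketched in a few lines of Python. This is a minimal illustration, not the library's actual implementation: the function name `resolve_config_filename` is hypothetical, and checking the explicit argument before the environment variable is an assumption, since the text does not say which of the two takes precedence.

```python
import os
from typing import Optional

DEFAULT_CONFIG_FILENAME = "preprocessing-pipeline-family.yaml"


def resolve_config_filename(config_filename: Optional[str] = None) -> str:
    """Return the config path to read (hypothetical helper, not the library's API)."""
    # 1. An explicit value, as passed with --config-filename, is used as-is.
    #    (Assumption: the flag is checked before the environment variable.)
    if config_filename:
        return config_filename
    # 2. Otherwise use the PIPELINE_FAMILY_CONFIG environment variable, if set.
    env_value = os.environ.get("PIPELINE_FAMILY_CONFIG")
    if env_value:
        return env_value
    # 3. Fall back to the default file in the current working directory.
    return os.path.join(os.getcwd(), DEFAULT_CONFIG_FILENAME)
```

Calling `resolve_config_filename()` with no argument and no environment variable set returns the path of `preprocessing-pipeline-family.yaml` under the current working directory.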

The API file undergoes `black`, `flake8` and `mypy` checks after being generated. If you want
`flake8` to ignore specific errors, you can specify them through the CLI with
`--flake8-ignore F401, E402`.
See the [`flake8` docs](https://flake8.pycqa.org/en/latest/user/error-codes.html#error-violation-codes)
for a full list of error codes.

### Conversion from `pipeline_api` to FastAPI

The command described in [**Usage**](#usage) generates a FastAPI route for each `pipeline_api`
function defined in a notebook. The signature of the `pipeline_api` function determines which
parameters the generated route accepts.

Currently, only plain text file uploads are supported, so the first argument must always be
`text`. Support for multiple files and binary files is coming soon!

In addition, any number of string array parameters may be specified. Any kwarg beginning with
`m_` indicates a multi-value string parameter that is accepted by the FastAPI API.

For example, in a notebook containing:

    def pipeline_api(text, m_subject=[], m_name=[]):

`text` represents the content of a file posted to the FastAPI API, and the `m_subject` and `m_name`
keyword args represent optional parameters that may be posted to the API as well, both allowing
multiple string parameters. A `curl` request against such an API could look like this:

    curl -X 'POST' \
      'https://<hostname>/<pipeline-family-name>/<pipeline-family-version>/<api-name>' \
      -H 'accept: application/json' \
      -H 'Content-Type: multipart/form-data' \
      -F 'file=@file-to-process.txt' \
      -F 'subject=art' \
      -F 'subject=history' \
      -F 'subject=math' \
      -F 'name=feynman'
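
The relationship between the function signature and the form fields in the request above can be made concrete with a small sketch. It is not the code the tool generates: `describe_form_fields` is a hypothetical helper, the body of `pipeline_api` is a placeholder, and the field names (`file` for the upload, `m_` prefix dropped for multi-value parameters) are inferred from the `curl` example.

```python
import inspect
from typing import Dict


def describe_form_fields(func) -> Dict[str, str]:
    """Map a pipeline_api-style signature to form field names (hypothetical helper)."""
    fields = {}
    for name in inspect.signature(func).parameters:
        if name == "text":
            # The uploaded file's content arrives as `text`; the request sends it as `file`.
            fields["file"] = "file upload"
        elif name.startswith("m_"):
            # Multi-value string parameter: the form field drops the `m_` prefix.
            fields[name[len("m_"):]] = "repeatable string"
    return fields


def pipeline_api(text, m_subject=[], m_name=[]):
    # Placeholder body for illustration only.
    return {"subjects": m_subject, "names": m_name}


print(describe_form_fields(pipeline_api))
# {'file': 'file upload', 'subject': 'repeatable string', 'name': 'repeatable string'}
```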

In addition, you can specify the response type if `pipeline_api` can support both "application/json"
and "text/csv" as return types.

For example, in a notebook containing a kwarg `response_type`:

    def pipeline_api(text, response_type="text/csv", m_subject=[], m_name=[]):

The consumer of the API may then request either content type with the usual HTTP `Accept` header,
e.g. `Accept: application/json` or `Accept: text/csv`.
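
As a rough illustration of how an `Accept` header could be turned into a `response_type` value, the sketch below picks the first supported media type listed in the header. It assumes nothing about how the generated API does this internally: `pick_response_type` and `SUPPORTED_TYPES` are hypothetical names, and quality weights such as `;q=0.8` are deliberately ignored to keep the example short.

```python
from typing import Sequence

SUPPORTED_TYPES = ("application/json", "text/csv")


def pick_response_type(accept_header: str, supported: Sequence[str] = SUPPORTED_TYPES) -> str:
    """Return the first supported media type named in an Accept header (simplified)."""
    for item in accept_header.split(","):
        # Drop parameters such as ";q=0.8"; this sketch ignores quality weights.
        media_type = item.split(";")[0].strip().lower()
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            # A wildcard accepts anything, so use the first supported type.
            return supported[0]
    raise ValueError(f"No supported media type in Accept header: {accept_header!r}")
```

For example, `pick_response_type("text/csv")` returns `"text/csv"`, and the result could then be passed to `pipeline_api` as its `response_type` argument.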

## Security Policy

See our [security policy](https://github.com/Unstructured-IO/unstructured-api-tools/security/policy) for
information on how to report security vulnerabilities.

## Learn more

| Section | Description |
|-|-|
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Unstructured-IO/unstructured-api-tools",
    "name": "unstructured-api-tools",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8.0",
    "maintainer_email": "",
    "keywords": "NLP PDF HTML CV XML parsing preprocessing",
    "author": "Unstructured Technologies",
    "author_email": "mrobinson@unstructuredai.io",
    "download_url": "https://files.pythonhosted.org/packages/e2/36/7bd2e0ccb2a988b0e8b595b3e8559f859f053b9b21657ee61c74a617c567/unstructured_api_tools-0.10.11.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "A library that prepares raw documents for downstream ML tasks.",
    "version": "0.10.11",
    "project_urls": {
        "Homepage": "https://github.com/Unstructured-IO/unstructured-api-tools"
    },
    "split_keywords": [
        "nlp",
        "pdf",
        "html",
        "cv",
        "xml",
        "parsing",
        "preprocessing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "75b5b95a391d63290ce7429b0a2aded9ea7402e6263ef9bec5ae45f67ba5eadd",
                "md5": "80a36382996456318946a64bf8604796",
                "sha256": "76926c571b751b16a6cce2275dc36c48bd5c922be1d96dad796a28ad53e91694"
            },
            "downloads": -1,
            "filename": "unstructured_api_tools-0.10.11-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "80a36382996456318946a64bf8604796",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8.0",
            "size": 22458,
            "upload_time": "2023-08-14T21:53:11",
            "upload_time_iso_8601": "2023-08-14T21:53:11.026842Z",
            "url": "https://files.pythonhosted.org/packages/75/b5/b95a391d63290ce7429b0a2aded9ea7402e6263ef9bec5ae45f67ba5eadd/unstructured_api_tools-0.10.11-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e2367bd2e0ccb2a988b0e8b595b3e8559f859f053b9b21657ee61c74a617c567",
                "md5": "d60df4d2d7611dd1322e079c8b7f29bd",
                "sha256": "27d354d24a7c5615dfadaf5813ffc47c35e31a642a1bc8ff40389ac05bca39c0"
            },
            "downloads": -1,
            "filename": "unstructured_api_tools-0.10.11.tar.gz",
            "has_sig": false,
            "md5_digest": "d60df4d2d7611dd1322e079c8b7f29bd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8.0",
            "size": 21763,
            "upload_time": "2023-08-14T21:53:12",
            "upload_time_iso_8601": "2023-08-14T21:53:12.210390Z",
            "url": "https://files.pythonhosted.org/packages/e2/36/7bd2e0ccb2a988b0e8b595b3e8559f859f053b9b21657ee61c74a617c567/unstructured_api_tools-0.10.11.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-14 21:53:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Unstructured-IO",
    "github_project": "unstructured-api-tools",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "lcname": "unstructured-api-tools"
}
        