indexify-extractor-sdk


Name: indexify-extractor-sdk
Version: 0.0.92
home_page: None
Summary: Indexify Extractor SDK to build new extractors for extraction from unstructured data
upload_time: 2024-08-28 05:50:22
maintainer: None
docs_url: None
author: Diptanu Gon Choudhury
requires_python: <4.0,>=3.9
license: None
requirements: No requirements were recorded.
# Indexify Extractor SDK

[![PyPI version](https://badge.fury.io/py/indexify-extractor-sdk.svg)](https://badge.fury.io/py/indexify-extractor-sdk)

The Indexify Extractor SDK is for developing new extractors that extract
information from any unstructured data source.

A few extractors are already available at https://github.com/tensorlakeai/indexify.
If none of them fits your use case, use this SDK to build one.

## Install the SDK

Install the SDK from PyPI:

```bash
virtualenv ve
source ve/bin/activate
pip install indexify-extractor-sdk
```
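
To confirm the install, a quick check like the following may help. It is a minimal sketch using only the standard library; the helper names `python_supported` and `installed_sdk_version` are illustrative (not part of the SDK), and the version bounds mirror the package's declared `requires_python` of `<4.0,>=3.9`.

```python
import sys
from importlib import metadata
from typing import Optional, Tuple


def python_supported(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if (major, minor) satisfies >=3.9,<4.0."""
    return (3, 9) <= version < (4, 0)


def installed_sdk_version(dist_name: str = "indexify-extractor-sdk") -> Optional[str]:
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_supported())
    print("SDK version:", installed_sdk_version() or "not installed")
```

Run it inside the activated virtualenv so that the check sees the same environment as `pip`.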

## Implement an extractor

There are two ways to implement an extractor. If you don't need setup/teardown
or other additional functionality, use the decorator:

```python
import json
from typing import List

from indexify_extractor_sdk import Content, Feature, extractor

@extractor()
def my_extractor(content: Content, params: dict) -> List[Content]:
    return [
        Content.from_text(
            text="Hello World",
            features=[
                Feature.embedding(values=[1, 2, 3]),
                Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
            ],
            labels={"url": "test.com"},
        ),
        Content.from_text(
            text="Pipe Baz",
            features=[Feature.embedding(values=[1, 2, 3])],
            labels={"url": "test.com"},
        ),
    ]
```

Note: `@extractor()` accepts many parameters; see the documentation for
details.
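
For intuition, the decorator pattern can be pictured as wrapping a plain function in an object that carries metadata. The sketch below is not the SDK's implementation: `simple_extractor`, `WrappedExtractor`, and their parameters are hypothetical stand-ins, and plain strings take the place of `Content`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class WrappedExtractor:
    """Hypothetical stand-in: a function plus metadata a framework might need."""
    func: Callable[[str, Dict], List[str]]
    name: str
    input_mime_types: List[str] = field(default_factory=lambda: ["text/plain"])

    def extract(self, content: str, params: Dict) -> List[str]:
        # Delegate to the wrapped function.
        return self.func(content, params)


def simple_extractor(name: str = "", input_mime_types: Optional[List[str]] = None):
    """Hypothetical decorator factory, called with parentheses like `@extractor()`."""
    def decorate(func: Callable[[str, Dict], List[str]]) -> WrappedExtractor:
        return WrappedExtractor(
            func=func,
            name=name or func.__name__,
            input_mime_types=input_mime_types or ["text/plain"],
        )
    return decorate


@simple_extractor()
def split_sentences(content: str, params: Dict) -> List[str]:
    # Split on periods and drop empty fragments.
    return [s.strip() for s in content.split(".") if s.strip()]


result = split_sentences.extract("Hello world. Pipe baz.", {})
print(result)
```

The point of the pattern is that the function body stays free of framework details, while defaults such as the name come from the function itself.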

For more advanced use cases, subclass `Extractor`:

```python
import json
from typing import List

from indexify_extractor_sdk import Content, Extractor, Feature
from pydantic import BaseModel

class InputParams(BaseModel):
    pass

class MyExtractor(Extractor):
    input_mime_types = ["text/plain", "application/pdf", "image/jpeg"]

    def __init__(self):
        super().__init__()

    def extract(self, content: Content, params: InputParams) -> List[Content]:
        return [
            Content.from_text(
                text="Hello World",
                features=[
                    Feature.embedding(values=[1, 2, 3]),
                    Feature.metadata(json.loads('{"a": 1, "b": "foo"}')),
                ],
                labels={"url": "test.com"},
            ),
            Content.from_text(
                text="Pipe Baz",
                features=[Feature.embedding(values=[1, 2, 3])],
                labels={"url": "test.com"},
            ),
        ]

    def sample_input(self) -> Content:
        return Content.from_text("hello world")

```
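
The class form pays off when an extractor needs one-time setup. The following is a minimal, SDK-free sketch of that idea: `KeywordExtractorSketch` is a hypothetical stand-in that does not subclass `Extractor`, takes plain text, and returns dictionaries instead of `Content` objects.

```python
import re
from typing import Dict, List


class KeywordExtractorSketch:
    """Hypothetical stand-in for an Extractor subclass (does not use the SDK)."""

    input_mime_types = ["text/plain"]

    def __init__(self, keywords: List[str]):
        # One-time setup: compile the pattern once instead of on every call.
        escaped = "|".join(re.escape(k) for k in keywords)
        self._pattern = re.compile(rf"\b({escaped})\b", re.IGNORECASE)
        self.calls = 0

    def extract(self, text: str, params: Dict) -> List[Dict]:
        # Each call reuses the state prepared in __init__.
        self.calls += 1
        return [
            {"text": m.group(0), "labels": {"offset": str(m.start())}}
            for m in self._pattern.finditer(text)
        ]


ex = KeywordExtractorSketch(["indexify", "extractor"])
first = ex.extract("An extractor built for Indexify.", {})
second = ex.extract("No matches here.", {})
print(first, second, ex.calls)
```

In a real extractor the setup step would more plausibly load a model or open a client, which is exactly the kind of work worth doing once per instance rather than once per call.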

## Test the extractor

You can run the extractor locally with the command-line tool bundled with the
SDK, passing it some arbitrary text or a file:

```bash
indexify-extractor local my_extractor:MyExtractor --text "hello"
```
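
The `my_extractor:MyExtractor` argument reads as a module name followed by a class name. The sketch below shows one way such a `module:attribute` string could be resolved with `importlib`. It is an assumption about the convention, not the CLI's actual code, and `load_target` is a hypothetical helper.

```python
import importlib
from typing import Any


def load_target(spec: str) -> Any:
    """Resolve a 'module:attribute' string to the named object."""
    module_name, sep, attr_name = spec.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Demonstrated with a standard-library class so the snippet runs anywhere.
decoder_cls = load_target("json:JSONDecoder")
print(decoder_cls.__name__)
```

Under this reading, `my_extractor` must be importable from the directory where the command runs, which usually means a `my_extractor.py` file in the current working directory.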

## Deploy the extractor

Once you are ready to deploy the new extractor and build pipelines with it,
package the extractor, deploy as many copies as you want, and point them at
the Indexify server. The server exposes two addresses: one from which your
extractor receives extraction tasks, and another to which your extractor writes
the extracted content.

```bash
indexify-extractor join-server --coordinator-addr localhost:8950 --ingestion-addr localhost:8900
```
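
When launching several copies from a script, it can help to validate both addresses before starting the process. The sketch below assumes each address has the form `host:port` and that the ingestion address is `localhost:8900`; `parse_addr` and `join_server_command` are hypothetical helpers, and only the two flags from the command above are used.

```python
import shlex
from typing import List, Tuple


def parse_addr(addr: str) -> Tuple[str, int]:
    """Split 'host:port' and validate the port range."""
    host, sep, port_text = addr.rpartition(":")
    if not sep or not host or not port_text.isdigit():
        raise ValueError(f"expected 'host:port', got {addr!r}")
    port = int(port_text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return host, port


def join_server_command(coordinator_addr: str, ingestion_addr: str) -> List[str]:
    """Build the argument list for the join-server command."""
    for addr in (coordinator_addr, ingestion_addr):
        parse_addr(addr)  # raises ValueError on malformed input
    return [
        "indexify-extractor", "join-server",
        "--coordinator-addr", coordinator_addr,
        "--ingestion-addr", ingestion_addr,
    ]


cmd = join_server_command("localhost:8950", "localhost:8900")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` once per copy of the extractor you want to start.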


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "indexify-extractor-sdk",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": null,
    "author": "Diptanu Gon Choudhury",
    "author_email": "diptanu@tensorlake.ai",
    "download_url": "https://files.pythonhosted.org/packages/8a/eb/30938c69108035827cb42444adf5ff4af2edaba7120fd4669764c1f34458/indexify_extractor_sdk-0.0.92.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Indexify Extractor SDK to build new extractors for extraction from unstructured data",
    "version": "0.0.92",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "20c57ad8f07df37460077fff79c4974331fa88ed8742adfacecc74d4b58e2079",
                "md5": "5e4934410f7f166d8f0723050ab44570",
                "sha256": "4c72adaa43cd30edae806499cf7c1845b8ea40a90a1165f7d0ead7d1706e42d9"
            },
            "downloads": -1,
            "filename": "indexify_extractor_sdk-0.0.92-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5e4934410f7f166d8f0723050ab44570",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 61713,
            "upload_time": "2024-08-28T05:50:20",
            "upload_time_iso_8601": "2024-08-28T05:50:20.378265Z",
            "url": "https://files.pythonhosted.org/packages/20/c5/7ad8f07df37460077fff79c4974331fa88ed8742adfacecc74d4b58e2079/indexify_extractor_sdk-0.0.92-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8aeb30938c69108035827cb42444adf5ff4af2edaba7120fd4669764c1f34458",
                "md5": "7489f4a58cc779e56a6212f7b4b73008",
                "sha256": "ed569429ecd95902fb77393e6c7f121a4e3ab40a5d018796f7472f9e82ec26b8"
            },
            "downloads": -1,
            "filename": "indexify_extractor_sdk-0.0.92.tar.gz",
            "has_sig": false,
            "md5_digest": "7489f4a58cc779e56a6212f7b4b73008",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 49393,
            "upload_time": "2024-08-28T05:50:22",
            "upload_time_iso_8601": "2024-08-28T05:50:22.177880Z",
            "url": "https://files.pythonhosted.org/packages/8a/eb/30938c69108035827cb42444adf5ff4af2edaba7120fd4669764c1f34458/indexify_extractor_sdk-0.0.92.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-28 05:50:22",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "indexify-extractor-sdk"
}
        