llama-index-readers-preprocess


| Field | Value |
| --- | --- |
| Name | llama-index-readers-preprocess |
| Version | 0.2.0 |
| Summary | llama-index readers preprocess integration |
| Upload time | 2024-08-22 06:50:57 |
| Maintainer | preprocess |
| Author | Your Name |
| Requires Python | <4.0,>=3.8.1 |
| License | MIT |
| Keywords | chunk, chunking, documents, preprocess |

# Preprocess Loader

```bash
pip install llama-index-readers-preprocess
```

[Preprocess](https://preprocess.co) is an API service that splits any kind of document into optimal chunks of text for use in language model tasks.
Given input documents, `Preprocess` splits them into chunks of text that respect the layout and semantics of the original document.
The split takes into account sections, paragraphs, lists, images, data tables, text tables, and slides, and follows the content semantics for long texts.
Supported formats are PDFs, Microsoft Office documents (Word, PowerPoint, Excel), OpenOffice documents (ods, odt, odp), HTML content (web pages, articles, emails), and plain text.

This loader integrates with the `Preprocess` API library to convert and chunk documents, or to load already chunked files, inside LlamaIndex.

## Requirements

Install the Python `Preprocess` library if it is not already present:

```bash
pip install pypreprocess
```

## Usage

To use this loader, you need a `Preprocess` API key.
Pass it as `api_key` when initializing `PreprocessReader`. If you don't have one yet, request it at [support@preprocess.co](mailto:support@preprocess.co). Without an API key, the loader raises an error.
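
To keep the key out of source code, one option is to read it from an environment variable. This is a minimal sketch: the variable name `PREPROCESS_API_KEY` and the `load_api_key` helper are assumptions made for illustration, not part of the loader.

```python
import os


def load_api_key(env_var: str = "PREPROCESS_API_KEY") -> str:
    """Return the API key stored in `env_var`, or fail early with a clear message.

    The variable name is an assumption made for this sketch; the loader itself
    only expects the key as the `api_key` argument.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Preprocess API key."
        )
    return api_key
```

The returned value can then be passed as `PreprocessReader(api_key=load_api_key(), filepath="valid/path/to/file")`.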

To chunk a file, pass a valid filepath and the reader will start converting and chunking it.
`Preprocess` chunks your files with its own internal `Splitter`. For this reason, do not parse the documents into nodes with a `Splitter`, and do not include a `Splitter` among the transformations of your `IngestionPipeline`.
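
Since the reader needs a valid filepath, it may be worth checking the path locally before calling the API. The sketch below is one way to do that. The `check_filepath` helper is hypothetical, and the extension set is an assumption derived from the formats listed above, not an authoritative list.

```python
from pathlib import Path

# Extensions are an assumption derived from the formats listed above;
# check the Preprocess documentation for the authoritative list.
ASSUMED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx",
    ".ods", ".odt", ".odp",
    ".html", ".txt",
}


def check_filepath(filepath: str) -> Path:
    """Return the path if it points to an existing file with a known extension."""
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"No file found at {path}")
    if path.suffix.lower() not in ASSUMED_EXTENSIONS:
        raise ValueError(f"Unexpected file type: {path.suffix or '(none)'}")
    return path
```

Calling `check_filepath("valid/path/to/file.pdf")` before building the reader surfaces a typo in the path without spending an API call.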

If you want to handle the nodes directly:

```python
from llama_index.core import VectorStoreIndex

from llama_index.readers.preprocess import PreprocessReader

# pass a filepath and get the chunks as nodes
loader = PreprocessReader(
    api_key="your-api-key", filepath="valid/path/to/file"
)
nodes = loader.get_nodes()

# import the nodes in a Vector Store with your configuration
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()
```
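
Before indexing, it can help to look at what came back. The sketch below assumes each chunk exposes `get_content()`, as LlamaIndex nodes and documents do; `preview_chunks` is a hypothetical helper, not part of the loader.

```python
from typing import Any, Iterable, List


def preview_chunks(chunks: Iterable[Any], width: int = 60) -> List[str]:
    """Build one summary line per chunk: index, length, and a short preview."""
    lines = []
    for i, chunk in enumerate(chunks):
        text = chunk.get_content()
        # collapse whitespace so the preview fits on one line, then truncate
        preview = " ".join(text.split())[:width]
        lines.append(f"[{i}] {len(text)} chars: {preview}")
    return lines
```

For example, `print("\n".join(preview_chunks(nodes)))` gives a quick overview of chunk sizes and contents.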

By default, `load_data()` returns one document per chunk. Do not apply any further splitting to these documents.

```python
from llama_index.core import VectorStoreIndex

from llama_index.readers.preprocess import PreprocessReader

# pass a filepath and get the chunks as documents
loader = PreprocessReader(
    api_key="your-api-key", filepath="valid/path/to/file"
)
documents = loader.load_data()

# don't apply any Splitter parser to the documents;
# if you have an ingestion pipeline, don't include a Splitter in its transformations
# import the documents in a Vector Store; if you set the service_context parameter,
# remember to avoid including a splitter
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
```

If you want only the extracted text, so that you can handle it with a custom pipeline, set `return_whole_document=True`:

```python
from llama_index.readers.preprocess import PreprocessReader

# pass a filepath and get the extracted text as a whole document
loader = PreprocessReader(
    api_key="your-api-key", filepath="valid/path/to/file"
)
document = loader.load_data(return_whole_document=True)
```
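
As an illustration of a custom pipeline, the sketch below splits the extracted text into paragraphs on blank lines. It assumes that `load_data(return_whole_document=True)` returns a list holding a single document whose text is available via `.text`. `split_paragraphs` is a hypothetical helper and a deliberately simple stand-in for your own logic.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split text on one or more blank lines and drop empty pieces."""
    pieces = re.split(r"\n\s*\n", text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

Under the assumption above, usage would look like `paragraphs = split_paragraphs(document[0].text)`.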

To load files that have already been chunked, pass the `process_id` to the reader:

```python
from llama_index.readers.preprocess import PreprocessReader

# pass a process_id obtained from a previous instance and get the chunks as one string inside a Document
loader = PreprocessReader(api_key="your-api-key", process_id="your-process-id")
```
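
One way to reuse earlier results is to remember which `process_id` belongs to which file. The sketch below keeps that mapping in a small JSON file and builds the keyword arguments for `PreprocessReader` accordingly. The cache file name and all three helpers are hypothetical. How you obtain the `process_id` after a run is outside this sketch.

```python
import json
from pathlib import Path
from typing import Dict

CACHE_FILE = Path("preprocess_ids.json")  # hypothetical location


def load_cache(cache_file: Path = CACHE_FILE) -> Dict[str, str]:
    """Read the filepath -> process_id mapping, or start empty."""
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    return {}


def remember(filepath: str, process_id: str, cache_file: Path = CACHE_FILE) -> None:
    """Store the process_id obtained for a file so later runs can reuse it."""
    cache = load_cache(cache_file)
    cache[filepath] = process_id
    cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")


def reader_kwargs(
    api_key: str, filepath: str, cache_file: Path = CACHE_FILE
) -> Dict[str, str]:
    """Prefer a cached process_id; fall back to chunking the file again."""
    process_id = load_cache(cache_file).get(filepath)
    if process_id:
        return {"api_key": api_key, "process_id": process_id}
    return {"api_key": api_key, "filepath": filepath}
```

The result can be unpacked into the reader, for example `PreprocessReader(**reader_kwargs("your-api-key", "valid/path/to/file"))`, so a file is only converted and chunked the first time it is seen.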

This loader is designed to be used as a way to load data into [LlamaIndex](https://github.com/run-llama/llama_index/).

## Other info

`PreprocessReader` is based on the [pypreprocess](https://github.com/preprocess-co/pypreprocess) library from Preprocess.
For more information or other integration needs, please check the [documentation](https://github.com/preprocess-co/pypreprocess).

            
