# Preprocess SDK ![V1.4.3](https://img.shields.io/badge/Version-1.4.3-333.svg?labelColor=eee) ![MIT License](https://img.shields.io/badge/License-MIT-333.svg?labelColor=eee)
[Preprocess](https://preprocess.co) is an API service that splits any kind of document into optimal chunks of text for use in language model tasks.
Given a document as input, `Preprocess` splits it into chunks of text that respect the layout and semantics of the original document.
We split the content by taking into account sections, paragraphs, lists, images, data tables, text tables, and slides, following the content semantics for long texts.
We support:
- PDFs
- Microsoft Office documents (Word, PowerPoint, Excel)
- OpenOffice documents (ods, odt, odp)
- HTML content (web pages, articles, emails)
- Plain text
## Installation
Install the Python `Preprocess` library if it is not already present:
```bash
pip install pypreprocess
```
Alternatively, if you want to add it as a dependency with Poetry:
```bash
poetry add pypreprocess
poetry install
```
**You need a `Preprocess API Key` to use the SDK. To get one, please reach out to [support@preprocess.co](mailto:support@preprocess.co) and ask for an API key.**
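The examples below use a `YOUR_API_KEY` placeholder. A common pattern (not specific to this SDK) is to keep the key out of your source code by reading it from an environment variable; the variable name here is an arbitrary choice:
```python
import os
from pypreprocess import Preprocess

# Assumes you exported PREPROCESS_API_KEY in your shell (hypothetical name)
api_key = os.environ["PREPROCESS_API_KEY"]
preprocess = Preprocess(api_key=api_key, filepath="path/for/file")
```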
## Getting started
Get the chunks from a file for use in your language model tasks.
```python
from pypreprocess import Preprocess

# Initialize the SDK with a file
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")

# Chunk the file and wait for the process to finish
preprocess.chunk()
preprocess.wait()

# Get the result and use the chunks
result = preprocess.result()
for chunk in result.data['chunks']:
    print(chunk)
```
## Initialize a connection
You can initialize the SDK in three ways.
1- Passing a local `filepath`
_Use this when you want to initialize the SDK to chunk a local file._
```python
from pypreprocess import Preprocess
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")
```
2- Passing a `process_id`
_When a chunking process starts, `Preprocess` generates a `process_id`; you can use it to instantiate the SDK._
```python
from pypreprocess import Preprocess
preprocess = Preprocess(api_key=YOUR_API_KEY, process_id="id_of_the_process")
```
3- Passing a `PreprocessResponse` object
_When you need to store the result of a chunking process permanently, you can later load it into the SDK via a `PreprocessResponse` object._
```python
import json
from pypreprocess import Preprocess, PreprocessResponse

# The string below is a placeholder for the JSON you stored from a previous chunk() call
response = PreprocessResponse(**json.loads("The JSON result from calling chunk before."))
preprocess = Preprocess(api_key=YOUR_API_KEY, process=response)
```
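As a sketch of how this fits together: assuming you saved the JSON from an earlier chunking run to a file (for example the string returned by `to_json()`, listed under *Other methods* below; the file name is illustrative), you could reload it like this:
```python
import json
from pypreprocess import Preprocess, PreprocessResponse

# Assumption: chunk_result.json holds the JSON saved from a previous run
with open("chunk_result.json") as f:
    response = PreprocessResponse(**json.load(f))

preprocess = Preprocess(api_key=YOUR_API_KEY, process=response)
```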
## Chunking options
We support a few options you can configure to get the best result for your ingestion pipeline.
> **Preprocess tries to output chunks of fewer than 512 tokens. Longer chunks can occasionally be produced to preserve content integrity. We are currently working on letting you set an arbitrary chunk token length; stay tuned.**
| Parameter | Type | Default | Description |
| :-------- | :------- | :------- | :------------------------- |
| `merge` | `bool` | `False` | If `True`, small paragraphs will be merged to maximize chunk length. |
| `repeat_title` | `bool` | `False` | If `True`, each chunk will start with the title of the section in which it is contained. |
| `repeat_table_header` | `bool` | `False` | If `True`, each chunk will start with the header of the table in which it is contained. |
| `table_output_format` | `enum ['text', 'markdown', 'html']` | `'text'` | Return tables in the format you need for your ingestion pipelines. |
| `keep_header` | `bool` | `True` | If set to `False`, the content of the headers will be removed. Headers may include page numbers, document titles, section titles, paragraph titles, and fixed layout elements. |
| `smart_header` | `bool` | `True` | If set to `True`, only relevant titles will be included in the chunks, while other header information will be removed. Relevant titles are those that belong in the body of the page as titles. If set to `False`, only the `keep_header` parameter is considered. If `keep_header` is `False`, the `smart_header` parameter is ignored. |
| `keep_footer` | `bool` | `False` | If set to `True`, the content of the footers will be included in the chunks. Footers may include page numbers, footnotes, and fixed layout elements. |
| `image_text` | `bool` | `False` | If set to `True`, the text contained in the images will be added to the chunks. |
You can pass any of these parameters during SDK initialization:
```python
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file", merge=True, repeat_title=True, ...)
```
Or set them later with the `set_options` method, using a `dict`:
```python
preprocess.set_options({"merge": True, "repeat_title": True, ...})
```
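For illustration, here is a `set_options` call touching every documented parameter from the table above; the values are arbitrary choices, not recommendations:
```python
# Example values only; each key comes from the table of chunking options
preprocess.set_options({
    "merge": True,                      # merge small paragraphs
    "repeat_title": True,               # prefix chunks with their section title
    "repeat_table_header": True,        # prefix table chunks with the table header
    "table_output_format": "markdown",  # 'text', 'markdown', or 'html'
    "keep_header": True,                # keep header content
    "smart_header": True,               # keep only relevant titles
    "keep_footer": False,               # drop footer content
    "image_text": False,                # skip text found in images
})
```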
## Chunk
After initializing the SDK with a `filepath`, call the `chunk()` method.
```python
from pypreprocess import Preprocess
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")
response = preprocess.chunk()
```
The response contains the `process_id` and details about whether the API call succeeded.
## Getting the results
The conversion and chunking process may take a while.
You can use the built-in `wait()` method to wait for the process to finish and get the result.
```python
result = preprocess.wait()
print(result.data['chunks'])
```
In a more complex scenario, you can store the `process_id` after initiating the chunking process and then use it in a different flow.
```python
# Start the chunking process
from pypreprocess import Preprocess
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")
preprocess.chunk()
process_id = preprocess.get_process_id()

# In a different flow
from pypreprocess import Preprocess
preprocess = Preprocess(api_key=YOUR_API_KEY, process_id=process_id)
result = preprocess.wait()
print(result.data['chunks'])
```
If you want to implement different logic for getting the result, you can use the `result()` method and check whether the status is `FINISHED`.
```python
result = preprocess.result()
if result.data['process']['status'] == "FINISHED":
print(result.data['chunks'])
```
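For example, a minimal polling loop (the 5-second interval is an arbitrary choice) could look like this:
```python
import time

# Poll until the process reaches FINISHED; the sleep interval is arbitrary
while True:
    result = preprocess.result()
    if result.data['process']['status'] == "FINISHED":
        break
    time.sleep(5)

print(result.data['chunks'])
```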
## Other methods
Here is a list of other SDK methods that may help you; a combined sketch follows the list.
- `set_filepath(path)` sets the file path after initializing the object.
- `set_process_id(id)` sets the `process_id` parameter by id.
- `set_process(PreprocessResponse)` sets the `process_id` parameter from a `PreprocessResponse` object.
- `set_options(dict)` sets the parameters for configuring chunking options.
- `to_json()` returns a JSON string representing the current object.
- `get_process_id()` returns the current `process_id`.
- `get_filepath()` returns the filepath.
- `get_options()` returns the current chunking options.
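As a quick tour, here is a sketch combining several of these methods; the values shown in the comments are illustrative:
```python
from pypreprocess import Preprocess

preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")
preprocess.set_options({"merge": True})   # configure chunking options

print(preprocess.get_filepath())          # -> "path/for/file"
print(preprocess.get_options())           # current chunking options

preprocess.chunk()
print(preprocess.get_process_id())        # process_id of the started job
print(preprocess.to_json())               # JSON string representing the object
```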