<h1 align="center">
<img width="1100" height="259" alt="Group 39884" src="https://github.com/user-attachments/assets/ac9adfc2-53cb-427e-ad6a-91394cdee961" />
</h1>
<p align="center">Get high quality data from Documents fast, and deploy scalable serverless Data Processor APIs</p>
<div align="center">
[PyPI](https://pypi.org/project/tensorlake/) · [License](LICENSE) · [Documentation](https://docs.tensorlake.ai) · [Slack](https://join.slack.com/t/tensorlakecloud/shared_invite/zt-32fq4nmib-gO0OM5RIar3zLOBm~ZGqKg)
TensorLake transforms unstructured documents into AI-ready data through Document Ingestion APIs and enables building scalable data processing pipelines with a serverless workflow runtime.

</div>
## Features
- **Document Ingestion** - Parse documents (PDFs, DOCX, spreadsheets, presentations, images, and raw text) into markdown, or extract structured data with schemas. This is powered by Tensorlake's state-of-the-art Layout Detection and Table Recognition models.
- **Serverless Workflows** - Build and deploy data ingestion and orchestration APIs using Durable Functions in Python that scale automatically on fully managed infrastructure. Workflow requests automatically resume from failure checkpoints, and deployments scale to zero when idle.
---
## Document Ingestion Quickstart
### Installation
Install the SDK and get an API Key.
```bash
pip install tensorlake
```
Sign up at [cloud.tensorlake.ai](https://cloud.tensorlake.ai/) and get your API key.
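If you prefer not to hard-code the key, you can export it as an environment variable (the same `TENSORLAKE_API_KEY` variable used for workflow deployment below) and read it in Python. A minimal sketch; here the key is passed explicitly rather than assuming the SDK picks it up automatically:
```python
import os

from tensorlake.documentai import DocumentAI

# Assumes you ran: export TENSORLAKE_API_KEY="your-api-key"
doc_ai = DocumentAI(api_key=os.environ["TENSORLAKE_API_KEY"])
```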
### Parse Documents
```python
from tensorlake.documentai import DocumentAI, ParseStatus

doc_ai = DocumentAI(api_key="your-api-key")

# Upload and parse document
file_id = doc_ai.upload("/path/to/document.pdf")

# Get parse ID
parse_id = doc_ai.parse(file_id)

# Wait for completion and get results
result = doc_ai.wait_for_completion(parse_id)

if result.status == ParseStatus.SUCCESSFUL:
    for chunk in result.chunks:
        print(chunk.content)  # Clean markdown output
```
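The markdown chunks are ready for downstream use. For example, a small sketch (using only the `chunks` and `content` fields shown above) that joins them into a single markdown file for indexing:
```python
# Sketch: write all chunks to one markdown file for downstream indexing.
if result.status == ParseStatus.SUCCESSFUL:
    markdown = "\n\n".join(chunk.content for chunk in result.chunks)
    with open("document.md", "w", encoding="utf-8") as f:
        f.write(markdown)
```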
### Customize Parsing
Various aspects of document parsing, such as strikethrough detection, table output mode, and figure and table summarization, can be customized. The API is [documented here](https://docs.tensorlake.ai/document-ingestion/parsing/read#options-for-parsing-documents).
```python
from tensorlake.documentai import DocumentAI, ParsingOptions, EnrichmentOptions, ParseStatus, ChunkingStrategy, TableOutputMode

doc_ai = DocumentAI(api_key="your-api-key")

# The upload step can be skipped if you pass a pre-signed URL or an
# HTTPS-accessible file instead of a local path.
file_id = doc_ai.upload("/path/to/document.pdf")

# Configure parsing options
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.SECTION,
    table_output_mode=TableOutputMode.HTML,
    signature_detection=True
)

# Configure enrichment options
enrichment_options = EnrichmentOptions(
    figure_summarization=True,
    table_summarization=True
)

# Parse and wait for completion
result = doc_ai.parse_and_wait(
    file_id,
    parsing_options=parsing_options,
    enrichment_options=enrichment_options
)

if result.status == ParseStatus.SUCCESSFUL:
    for chunk in result.chunks:
        print(chunk.content)
```
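As the comment above notes, the upload step is optional for files that are already reachable over HTTPS. A sketch of the same call with a URL in place of the uploaded file ID (the URL here is hypothetical):
```python
# Sketch: parse a file that is already HTTPS-accessible (hypothetical URL).
result = doc_ai.parse_and_wait(
    "https://example.com/document.pdf",
    parsing_options=parsing_options,
    enrichment_options=enrichment_options,
)
```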
### Structured Extraction
Extract specific data fields from documents using JSON schemas or Pydantic models:
#### Using Pydantic Models
```python
from tensorlake.documentai import DocumentAI, StructuredExtractionOptions, ParseStatus
from pydantic import BaseModel, Field

# Define a Pydantic model
class InvoiceData(BaseModel):
    invoice_number: str = Field(description="Invoice number")
    total_amount: float = Field(description="Total amount due")
    due_date: str = Field(description="Payment due date")
    vendor_name: str = Field(description="Vendor company name")

doc_ai = DocumentAI(api_key="your-api-key")

# Pass an HTTPS-accessible file directly (no need to upload to Tensorlake)
file_id = "https://...."  # publicly available URL of the invoice data file

# Configure structured extraction using the Pydantic model
structured_extraction_options = StructuredExtractionOptions(
    schema_name="Invoice Data",
    json_schema=InvoiceData  # A Pydantic model can be passed directly
)

# Parse and wait for completion
result = doc_ai.parse_and_wait(
    file_id,
    structured_extraction_options=[structured_extraction_options]
)

if result.status == ParseStatus.SUCCESSFUL:
    print(result.structured_data)
```
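Since `parse_and_wait` accepts `parsing_options`, `enrichment_options`, and `structured_extraction_options` together, a single parse can return both markdown chunks and structured data. A sketch combining the options shown earlier in this README (assumes the imports from the previous snippets):
```python
# Sketch: one parse returning both markdown chunks and structured data.
result = doc_ai.parse_and_wait(
    file_id,
    parsing_options=ParsingOptions(chunking_strategy=ChunkingStrategy.SECTION),
    structured_extraction_options=[structured_extraction_options],
)

if result.status == ParseStatus.SUCCESSFUL:
    print(result.structured_data)  # Extracted fields
    for chunk in result.chunks:
        print(chunk.content)       # Markdown chunks
```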
#### Using JSON Schema
```python
# Define JSON schema directly
invoice_schema = {
    "title": "InvoiceData",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Invoice number"},
        "total_amount": {"type": "number", "description": "Total amount due"},
        "due_date": {"type": "string", "description": "Payment due date"},
        "vendor_name": {"type": "string", "description": "Vendor company name"}
    }
}

structured_extraction_options = StructuredExtractionOptions(
    schema_name="Invoice Data",
    json_schema=invoice_schema
)
```
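If you already have a Pydantic model, you don't need to write the JSON schema by hand: Pydantic v2's `model_json_schema()` emits an equivalent dict, so the following sketch produces options interchangeable with the ones above:
```python
# Sketch: derive the JSON schema from the Pydantic model (pydantic v2 API).
invoice_schema = InvoiceData.model_json_schema()

structured_extraction_options = StructuredExtractionOptions(
    schema_name="Invoice Data",
    json_schema=invoice_schema,
)
```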
Structured Extraction is guided by the provided schema; both Pydantic models and JSON Schema are supported. All the levers for structured extraction are documented [here](https://docs.tensorlake.ai/document-ingestion/parsing/structured-extraction).
### Learn More
* [Document Parsing Guide](https://docs.tensorlake.ai/document-ingestion/parsing/read)
* [Structured Output Guide](https://docs.tensorlake.ai/document-ingestion/parsing/structured-extraction)
* [Page Classification](https://docs.tensorlake.ai/document-ingestion/parsing/page-classification)
* [Signature Detection](https://docs.tensorlake.ai/document-ingestion/parsing/signature)
## Data Workflows
Workflows enable building and deploying workflow APIs, which are exposed as HTTP endpoints. Functions in a workflow can do anything from calling a web service to loading a model onto a GPU to run inference.
### Workflows Quickstart
Define a workflow by implementing its data transformation steps as Python functions decorated with `@tensorlake_function()`.
Connect the outputs of a function to the inputs of another function using edges in a `Graph` object, which represents the full workflow.
The example below creates a workflow with the following steps:
1. Generate a sequence of numbers from 0 to the supplied value.
2. Compute the square of each number.
3. Sum all the squares.
4. Send the sum to a web service.
```python
import urllib.request
from typing import List

import click  # Used for pretty printing to the console.

from tensorlake import Graph, RemoteGraph, tensorlake_function

# Define a function for each workflow step.

# 1. Generate a sequence of numbers from 0 to the supplied value.
@tensorlake_function()
def generate_sequence(last_sequence_number: int) -> List[int]:
    # This function implements a map operation because it returns a list.
    return [i for i in range(last_sequence_number + 1)]

# 2. Compute the square of each number.
@tensorlake_function()
def squared(number: int) -> int:
    # This function transforms each element of the sequence because it accepts
    # only a single int as a parameter.
    return number * number

# 3. Sum all the squares.
@tensorlake_function(accumulate=int)
def sum_all(current_sum: int, number: int) -> int:
    # This function implements a reduce operation.
    # It is called once per element of the sequence. The returned value is passed
    # to the next call in the `current_sum` parameter. The first call gets
    # `current_sum` = int(), which is 0. The return value of the last call is
    # the result of the reduce operation.
    return current_sum + number

# 4. Send the sum to a web service.
@tensorlake_function()
def send_to_web_service(value: int) -> str:
    # This function accepts the sum from the previous step and sends it to a web service.
    url = f"https://example.com/?number={value}"
    req = urllib.request.Request(url, method="GET")
    with urllib.request.urlopen(req) as response:
        return response.read().decode("utf-8")

# Define the full workflow using a Graph object.
g = Graph(
    name="example_workflow",
    start_node=generate_sequence,
    description="Example workflow",
)
g.add_edge(generate_sequence, squared)
g.add_edge(squared, sum_all)
g.add_edge(sum_all, send_to_web_service)

# Invoke the workflow for the sequence [0..200].
def run_workflow(g: Graph) -> None:
    invocation_id: str = g.run(last_sequence_number=200, block_until_done=True)

    # Get the output of the workflow (i.e. of its last step).
    last_step_output: str = g.output(invocation_id, "send_to_web_service")
    click.secho("Web service response:", fg="green", bold=True)
    click.echo(last_step_output[0])
    click.echo()

    # Get the sum.
    sum_output: str = g.output(invocation_id, "sum_all")
    click.secho("Sum:", fg="green", bold=True)
    click.echo(sum_output[0])
    click.echo()
```
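As a quick sanity check, for `last_sequence_number=200` the reduce step computes 0² + 1² + ... + 200²; by the closed form n(n+1)(2n+1)/6, that is 200 · 201 · 401 / 6 = 2,686,700, the sum you should see when you run the workflow.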
#### Running locally
The workflow code is available at [examples/readme_example.py](examples/readme_example.py).
The following code was added there to create the workflow and run it locally on your computer:
```python
run_workflow(g)
```
Run the workflow locally:
```bash
python examples/readme_example.py
```
In the console output you can see that the workflow computed the sum and got a response from the web service.
Running a workflow locally is convenient during development: there's no need to wait for the workflow to get deployed to see how it works.
#### Running on Tensorlake Cloud
To run the workflow on Tensorlake Cloud, it first needs to be deployed there.
1. Set `TENSORLAKE_API_KEY` environment variable in your shell session:
```bash
export TENSORLAKE_API_KEY="Paste your API key here"
```
2. Deploy the workflow to Tensorlake Cloud:
```bash
tensorlake deploy examples/readme_example.py
```
3. The following code was added to the workflow file to run it on Tensorlake Cloud:
```python
from tensorlake import RemoteGraph
cloud_workflow = RemoteGraph.by_name("example_workflow")
run_workflow(cloud_workflow)
```
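Note that `RemoteGraph` exposes the same `run` and `output` interface as a local `Graph`, which is why the `run_workflow` helper defined earlier works unchanged against the cloud deployment.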
4. Run the workflow on Tensorlake Cloud:
```bash
python examples/readme_example.py
```
## Learn more about workflows
* [Serverless Workflows Documentation](https://docs.tensorlake.ai/workflows/quickstart)
* [Key programming concepts in Tensorlake Workflows](https://docs.tensorlake.ai/workflows/compute)
* [Dependencies and container images in Tensorlake Workflows](https://docs.tensorlake.ai/workflows/images)
* [Open Source Workflow Compute Engine](https://docs.tensorlake.ai/opensource/indexify)