llama-extract

- Name: llama-extract
- Version: 0.1.1
- Summary: Structured data extraction from files.
- Author: Logan Markewich
- Requires Python: <4.0,>=3.9
- License: MIT
- Upload time: 2025-01-29 21:36:03
# LlamaExtract

> **⚠️ EXPERIMENTAL**
> This library is under active development with frequent breaking changes. APIs and functionality may change significantly between versions. If you're interested in being an early adopter, please contact us at [support@llamaindex.ai](mailto:support@llamaindex.ai) or join our [Discord](https://discord.com/invite/eN6D2HQ4aX).

LlamaExtract provides a simple API for extracting structured data from unstructured documents such as PDFs, text files, and images (image support is upcoming).

## Quick Start

```python
from llama_extract import LlamaExtract
from pydantic import BaseModel, Field

# Initialize client
extractor = LlamaExtract()


# Define schema using Pydantic
class Resume(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    skills: list[str] = Field(description="Technical skills and technologies")


# Create extraction agent
agent = extractor.create_agent(name="resume-parser", data_schema=Resume)

# Extract data from document
result = agent.extract("resume.pdf")
print(result.data)
```
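
LlamaExtract is a hosted service, so the client needs credentials. Below is a minimal sketch of explicit configuration, assuming the same convention as other LlamaCloud clients (an `api_key` argument with a `LLAMA_CLOUD_API_KEY` environment-variable fallback); verify against your installed version:

```python
import os

from llama_extract import LlamaExtract

# Assumption: api_key mirrors other LlamaCloud clients; if the argument
# differs in your version, exporting LLAMA_CLOUD_API_KEY should also work.
extractor = LlamaExtract(api_key=os.environ["LLAMA_CLOUD_API_KEY"])
```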

## Core Concepts

- **Extraction Agents**: Reusable extractors configured with a specific schema and extraction settings.
- **Data Schema**: Structure definition for the data you want to extract.
- **Extraction Jobs**: Asynchronous extraction tasks that can be monitored.

## Defining Schemas

Schemas can be defined using either Pydantic models or JSON Schema:

### Using Pydantic (Recommended)

```python
from pydantic import BaseModel, Field
from typing import List, Optional


class Experience(BaseModel):
    company: str = Field(description="Company name")
    title: str = Field(description="Job title")
    start_date: Optional[str] = Field(description="Start date of employment")
    end_date: Optional[str] = Field(description="End date of employment")


class Resume(BaseModel):
    name: str = Field(description="Candidate name")
    experience: List[Experience] = Field(description="Work history")
```

### Using JSON Schema

```python
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Candidate name"},
        "experience": {
            "type": "array",
            "description": "Work history",
            "items": {
                "type": "object",
                "properties": {
                    "company": {
                        "type": "string",
                        "description": "Company name",
                    },
                    "title": {"type": "string", "description": "Job title"},
                    "start_date": {
                        "anyOf": [{"type": "string"}, {"type": "null"}],
                        "description": "Start date of employment",
                    },
                    "end_date": {
                        "anyOf": [{"type": "string"}, {"type": "null"}],
                        "description": "End date of employment",
                    },
                },
            },
        },
    },
}

agent = extractor.create_agent(name="resume-parser", data_schema=schema)
```

### Important restrictions on JSON/Pydantic Schema

*LlamaExtract only supports a subset of the JSON Schema specification.* While limited, it should be sufficient for a wide variety of use cases.

- All fields are required by default. Nullable fields must be explicitly marked as such, using `"anyOf"` with a `"null"` type (see the `"start_date"` field above).
- The root node must be of type `"object"`.
- Schema nesting is limited to 5 levels.
- Only key names/titles, types, and descriptions are used; fields for formatting, default values, etc. are not supported.
- There are further restrictions on the number of keys, overall schema size, etc. that you may hit for complex extraction use cases. In such cases, consider restructuring your extraction workflow to fit within these constraints, e.g. by extracting a subset of fields and merging the results afterwards.
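
Pydantic optional fields already compile to the required `anyOf`/`"null"` pattern. A quick way to inspect the schema an agent will receive (a sketch assuming Pydantic v2, whose `model_json_schema()` emits the `anyOf` form; the `Event` model is purely illustrative):

```python
import json
from typing import Optional

from pydantic import BaseModel, Field


class Event(BaseModel):
    title: str = Field(description="Event title")
    venue: Optional[str] = Field(None, description="Venue, if stated")


# "venue" serializes as {"anyOf": [{"type": "string"}, {"type": "null"}], ...},
# matching the nullable-field rule above
print(json.dumps(Event.model_json_schema(), indent=2))
```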

## Other Extraction APIs

### Batch Processing

Process multiple files asynchronously:

```python
# Queue multiple files for extraction
jobs = await agent.queue_extraction(["resume1.pdf", "resume2.pdf"])

# Check job status
for job in jobs:
    status = agent.get_extraction_job(job.id).status
    print(f"Job {job.id}: {status}")

# Get results when complete
results = [agent.get_extraction_run_for_job(job.id) for job in jobs]
```
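
`queue_extraction` is a coroutine, so the snippet above needs a running event loop (e.g. a notebook cell, or an `asyncio.run` wrapper). Here is a minimal sketch using only the calls shown above; it assumes the accessor methods stay synchronous, as in the example:

```python
import asyncio


async def run_batch(files: list[str]) -> list:
    # Only queue_extraction needs the event loop here
    jobs = await agent.queue_extraction(files)
    for job in jobs:
        print(f"Queued job {job.id}")
    # Fetch the runs once the jobs report completion (see the status check above)
    return [agent.get_extraction_run_for_job(job.id) for job in jobs]


results = asyncio.run(run_batch(["resume1.pdf", "resume2.pdf"]))
```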

### Updating Schemas

An agent's schema can be updated after creation:

```python
# Update schema
agent.data_schema = new_schema

# Save changes
agent.save()
```
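
For example, extending the Quick Start schema and persisting the change (a sketch that assumes `data_schema` accepts a Pydantic model, just as `create_agent` does; `ResumeV2` is illustrative):

```python
from pydantic import BaseModel, Field


class ResumeV2(BaseModel):
    name: str = Field(description="Full name of candidate")
    email: str = Field(description="Email address")
    years_of_experience: int = Field(description="Total years of professional experience")


agent.data_schema = ResumeV2  # used for new extractions immediately
agent.save()  # persists the updated schema
```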

### Managing Agents

```python
# List all agents
agents = extractor.list_agents()

# Get specific agent
agent = extractor.get_agent(name="resume-parser")

# Delete agent
extractor.delete_agent(agent.id)
```
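
A common pattern is get-or-create, so repeated runs reuse the same agent. This sketch assumes `get_agent` raises when no agent matches the name; check the concrete exception type in your version before narrowing the `except`:

```python
def get_or_create_agent(extractor, name, schema):
    try:
        return extractor.get_agent(name=name)
    except Exception:  # assumption: get_agent raises for a missing agent
        return extractor.create_agent(name=name, data_schema=schema)


agent = get_or_create_agent(extractor, "resume-parser", Resume)
```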

## Installation

```bash
pip install llama-extract==0.1.1
```

## Tips & Best Practices

1. **Schema Design**:
   - Try to limit schema nesting to 3-4 levels.
   - Make fields optional when data might not always be present; required fields can force the model to hallucinate values that are absent from the document.
   - When you want to extract a variable number of entities, use an `array` type (see the sketch after this list). Note that you cannot use an `array` type for the root node.
   - Use descriptive field names and detailed descriptions. Use descriptions to pass formatting instructions or few-shot examples.
   - Start simple and iteratively build up your schema as requirements become clear.

2. **Running Extractions**:
   - Note that assigning a new schema to `agent.data_schema` is not persisted until you call `agent.save()`, but it is used immediately for running extractions.
   - Check job status before accessing results. Any extraction error should be available in the `job.error` or `extraction_run.error` fields for debugging.
   - Consider async operations (`queue_extraction`) for large-scale extraction once you have finalized your schema.
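
As referenced in the schema-design tips, here is a minimal sketch of extracting a variable number of entities: the repeating element goes in an `array` field, wrapped in an `object` root because the root itself cannot be an array. The invoice fields are illustrative, not part of the library:

```python
from typing import List

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    description: str = Field(description="What was billed")
    amount: float = Field(description="Line total, in the invoice currency")


class Invoice(BaseModel):
    # object root wrapping the variable-length array of entities
    vendor: str = Field(description="Vendor name")
    line_items: List[LineItem] = Field(description="All billed line items")


invoice_agent = extractor.create_agent(name="invoice-parser", data_schema=Invoice)
```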

## Additional Resources

- [Example Notebook](examples/resume_screening.ipynb) - Detailed walkthrough of resume parsing
- [Discord Community](https://discord.com/invite/eN6D2HQ4aX) - Get help and share feedback


            
