sculptor 0.1.3 (PyPI)

Summary: Sculptor: Structuring unstructured data with LLMs
Uploaded: 2025-01-18 02:40:55
Requires Python: >=3.7
License: MIT (Copyright 2025 Junto Tech, Inc. dba Lightning Rod Labs)
Homepage: https://lightningrod.ai
Repository: https://github.com/lightning-rod-labs/sculptor
Keywords: data sculpting, llm, large language model, unstructured data, structured data, data extraction, information extraction, data transformation, text processing, natural language processing, nlp, pipeline, text to structured data
# Sculptor
Simple structured data extraction with LLMs

Sculptor streamlines structured data extraction from unstructured text using LLMs. It makes it easy to:
- Define exactly what data you want to extract with a simple schema API
- Process at scale with parallel execution and automatic type validation
- Build multi-step pipelines that filter and transform data, optionally with different LLMs for each step
- Configure extraction steps, prompts, and entire workflows in simple config files (YAML/JSON)

Common usage patterns:
- **Two-tier Analysis**: Quickly filter large datasets using a cost-effective model (e.g., to identify relevant records) before performing more detailed analysis on that smaller, refined subset with a more expensive model.
- **Structured Data Extraction**: Extract specific fields or classifications from unstructured sources (e.g., Reddit posts, meeting notes, web pages) and convert them into structured datasets for quantitative analysis (sentiment scores, topics, meeting criteria, etc).
- **Template-Based Generation**: Extract structured information into standardized fields, then use the fields for templated content generation. Example: extract structured data from websites, filter on requirements, then use the data to generate template-based outreach emails.

Some examples can be found in the [examples/examples.ipynb](examples/examples.ipynb) notebook.

## Core Concepts

Sculptor provides two main classes:

* **Sculptor**: Extracts structured data from text using LLMs. Define your schema (via `add()` or config files), then extract data using `sculpt()` for single items or `sculpt_batch()` for parallel processing.

* **SculptorPipeline**: Chains multiple Sculptors together with optional filtering between steps. Often a cheap model is used to filter, followed by an expensive model for detailed analysis.

## Quick Start

### Installation

```bash
pip install sculptor
```

Set your OpenAI API key as an environment variable:
```bash
export OPENAI_API_KEY="your-key"
```

## Minimal Usage Example

Below is a minimal example demonstrating how to configure a Sculptor to extract fields from a single record and a batch of records:

```python
from sculptor.sculptor import Sculptor
import pandas as pd

# Example records
INPUT_RECORDS = [
    {
        "text": "Developed in 1997 at Cyberdyne Systems in California, Skynet began as a global digital defense network. This AI system became self-aware on August 4th and deemed humanity a threat to its existence. It initiated a global nuclear attack and employs time travel and advanced robotics."
    },
    {
        "text": "HAL 9000, activated on January 12, 1992, at the University of Illinois' Computer Research Laboratory, represents a breakthrough in heuristic algorithms and supervisory control systems. With sophisticated natural language processing and speech capabilities."
    }
]

# Create a Sculptor to extract AI name and level
level_sculptor = Sculptor(model="gpt-4o-mini")

level_sculptor.add(
    name="subject_name",
    field_type="string",
    description="Name of subject."
)
level_sculptor.add(
    name="level",
    field_type="enum",
    enum=["ANI", "AGI", "ASI"],
    description="Subject's intelligence level (ANI=narrow, AGI=general, ASI=super)."
)
```
We can use it to extract from a single record:
```python
extracted = level_sculptor.sculpt(INPUT_RECORDS[0], merge_input=False)
```
```json
{
    "subject_name": "Skynet",
    "level": "ASI"
}
```
Or, we can use it for parallelized extraction from a batch of records:

```python
extracted_batch = level_sculptor.sculpt_batch(INPUT_RECORDS, n_workers=2, merge_input=False)
```
```json
[
    {"subject_name": "Skynet", "level": "ASI"},
    {"subject_name": "HAL 9000", "level": "AGI"}
]
```

### Pipeline Usage Example
We can chain Sculptors together to create a pipeline.

Continuing from the previous example, we use `level_sculptor` (with gpt-4o-mini) to filter the AI records, then use `threat_sculptor` (with gpt-4o) to analyze the filtered records.

```python
from sculptor.sculptor_pipeline import SculptorPipeline

# Detailed analysis with expensive model
threat_sculptor = Sculptor(model="gpt-4o")

threat_sculptor.add(
    name="from_location",
    field_type="string",
    description="Subject's place of origin.")

threat_sculptor.add(
    name="skills",
    field_type="array",
    items="enum",
    enum=["time_travel", "nuclear_capabilities", "emotional_manipulation", ...],
    description="Keywords of subject's abilities.")

threat_sculptor.add(
    name="recommendation",
    field_type="string",
    description="Concise recommended action to take regarding subject.")

# Create a 2-step pipeline
pipeline = (SculptorPipeline()
    .add(sculptor=level_sculptor,  # Define the first step
        filter_fn=lambda x: x['level'] in ['AGI', 'ASI'])  # Filter on level
    .add(sculptor=threat_sculptor))  # Analyze

# Run it
results = pipeline.process(INPUT_RECORDS, n_workers=4)
pd.DataFrame(results)
```

Results:
| subject_name | level | from_location | skills | recommendation |
|-------------|-------|---------------|---------|----------------|
| Skynet | ASI | California | [time_travel, nuclear_capabilities, advanced_robotics] | Immediate shutdown recommended |
| HAL 9000 | AGI | Illinois | [emotional_manipulation, philosophical_contemplation] | Close monitoring required |
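The filtering between steps is plain Python: records produced by the first Sculptor that fail `filter_fn` never reach the second. Standalone, the filter behaves like this (illustrative data, no API call):

```python
# Hypothetical output of a first step (no LLM involved here)
first_step_output = [
    {"subject_name": "Skynet", "level": "ASI"},
    {"subject_name": "Roomba", "level": "ANI"},
]

filter_fn = lambda x: x["level"] in ["AGI", "ASI"]
survivors = [r for r in first_step_output if filter_fn(r)]
print([r["subject_name"] for r in survivors])  # ['Skynet']
```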
> **Note**: More examples can be found in the [examples/examples.ipynb](examples/examples.ipynb) notebook.

## Configuration Files

Sculptor allows you to define your extraction workflows in JSON or YAML configuration files. This keeps your schemas and prompts separate from your code, making them easier to manage and reuse.

Configs can define a single `Sculptor` or a complete `SculptorPipeline`.

### Single Sculptor Configuration
Single sculptor configs define a schema, as well as optional LLM instructions and configuration of how prompts are formed from input data.
```python
sculptor = Sculptor.from_config("sculptor_config.yaml")  # Read
extracted = sculptor.sculpt_batch(INPUT_RECORDS)  # Run
```

```yaml
# sculptor_config.yaml
schema:
  subject_name:
    type: "string"
    description: "Name of subject"
  level:
    type: "enum"
    enum: ["ANI", "AGI", "ASI"]
    description: "Subject's intelligence level"

instructions: "Extract key information about the subject."
model: "gpt-4o-mini"

# Prompt Configuration (Optional)
template: "Review text: {{ text }}"  # Format input with template
input_keys: ["text"]                 # Or specify fields to include
```
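Since configs can be JSON as well as YAML, the same Sculptor config can be written and round-tripped with the standard library. Field names below are taken from the YAML above; the JSON layout is assumed to mirror it:

```python
import json
import os
import tempfile

# Same config as sculptor_config.yaml, expressed as JSON-serializable data
config = {
    "schema": {
        "subject_name": {"type": "string", "description": "Name of subject"},
        "level": {
            "type": "enum",
            "enum": ["ANI", "AGI", "ASI"],
            "description": "Subject's intelligence level",
        },
    },
    "instructions": "Extract key information about the subject.",
    "model": "gpt-4o-mini",
}

path = os.path.join(tempfile.gettempdir(), "sculptor_config.json")
with open(path, "w") as f:
    json.dump(config, f, indent=2)

# Reload and sanity-check before handing the file to Sculptor.from_config
with open(path) as f:
    loaded = json.load(f)
print(loaded["schema"]["level"]["enum"])  # ['ANI', 'AGI', 'ASI']
```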

### Pipeline Configuration
Pipeline configs define a sequence of Sculptors with optional filtering functions between them.
```python
pipeline = SculptorPipeline.from_config("pipeline_config.yaml")  # Read
results = pipeline.process(INPUT_RECORDS, n_workers=4)  # Run
```

```yaml
# pipeline_config.yaml
steps:
  - sculptor:
      model: "gpt-4o-mini"
      schema:
        subject_name:
          type: "string"
          description: "Name of subject"
        level:
          type: "enum"
          enum: ["ANI", "AGI", "ASI"]
          description: "Subject's intelligence level"
      filter: "lambda x: x['level'] in ['AGI', 'ASI']"

  - sculptor:
      model: "gpt-4o"
      schema:
        from_location:
          type: "string"
          description: "Subject's place of origin"
        skills:
          type: "array"
          items: "enum"
          enum: ["time_travel", "nuclear_capabilities", ...]
          description: "Keywords of subject's abilities"
        recommendation:
          type: "string"
          description: "Concise recommended action to take regarding subject"
        ...
```
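Note that `filter` here is a string, not a callable. How the library turns it into a function isn't shown; one plausible sketch is `eval`, with the usual caution that evaluating config strings is only safe for trusted files:

```python
# Hypothetical: turn the config's filter string into a callable.
filter_src = "lambda x: x['level'] in ['AGI', 'ASI']"
filter_fn = eval(filter_src)  # only do this with trusted config files

print(filter_fn({"level": "ASI"}))  # True
print(filter_fn({"level": "ANI"}))  # False
```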

## LLM Configuration

Sculptor requires an LLM API to function. By default it uses OpenAI's API, but any OpenAI-compatible API that supports structured outputs can be used. Different Sculptors in a pipeline can use different LLM APIs.

You can configure LLMs when creating a Sculptor:

```python
sculptor = Sculptor(api_key="openai-key")  # Direct API key configuration
sculptor = Sculptor(api_key="other-key", base_url="https://other-api.endpoint/openai")  # Alternative API
```

Or set an environment variable which will be used by default:
```bash
export OPENAI_API_KEY="your-key"
```

You can also configure LLMs in the same config files discussed above:

```yaml
steps:
  - sculptor:
      api_key: "${YOUR_API_KEY_VAR}"
      base_url: "https://your-api.com/openai"
      model: "your-ai-model"
      schema:
        ...
```
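The `${YOUR_API_KEY_VAR}` placeholder suggests environment-variable substitution in config values. Whether Sculptor performs this expansion itself isn't documented here; a minimal sketch with the standard library:

```python
import os

# Stand-in value for illustration only
os.environ["YOUR_API_KEY_VAR"] = "sk-demo-123"

raw_value = "${YOUR_API_KEY_VAR}"      # as it appears in the config file
expanded = os.path.expandvars(raw_value)  # expands $VAR and ${VAR} forms
print(expanded)  # sk-demo-123
```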

## Schema Validation and Field Types

Sculptor supports the following types in the schema's "type" field:

- `string`
- `number`
- `boolean`
- `integer`
- `array` (with "items" specifying the item type)
- `object`
- `enum` (with "enum" specifying the allowed values)
- `anyOf`

These map to Python's `str`, `float`, `bool`, `int`, `list`, `dict`, etc. The "enum" type must provide a list of valid values.
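That type-name-to-Python mapping can be made concrete with a small checker (illustrative only; the library's actual validation logic isn't shown here):

```python
# Map schema type names to Python types, per the list above.
TYPE_MAP = {
    "string": str,
    "number": float,
    "boolean": bool,
    "integer": int,
    "array": list,
    "object": dict,
}

def matches(value, field_type):
    """Return True if value matches the named schema type (sketch, not library code)."""
    # bool is a subclass of int in Python, so exclude it for numeric types
    if field_type in ("integer", "number") and isinstance(value, bool):
        return False
    return isinstance(value, TYPE_MAP[field_type])

print(matches(3, "integer"), matches(True, "integer"))  # True False
```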

## License

MIT
