# data-tamer
Lightweight Python wrappers (built on LiteLLM) for transforming data with structured outputs, compact prompts for lower token usage, and batching utilities. Strict structured outputs are supported via Pydantic models or JSON Schema.
## Install
Install from PyPI via pip or UV:
```shell
pip install data-tamer
# or with UV
uv add data-tamer
```
Basic usage mirrors the original TypeScript data-tamer API and its prompt-compaction behavior:
```python
from pydantic import BaseModel
import os
from data_tamer import transform_object, transform_batch


class Person(BaseModel):
    name: str
    age: int | None

# Choose a LiteLLM model id; set provider API keys via env (e.g., OPENAI_API_KEY, OPENROUTER_API_KEY)
model = os.environ.get("LITELLM_MODEL", "gpt-4o-mini")

# Single transform from guidance only
single = transform_object(
    model=model,
    schema=Person,
    prompt_context={
        "instructions": "Extract name and age. Use null when unknown.",
    },
)
print(single["data"])  # -> Person(name=..., age=...)

# Batch transform from compact prompt
inputs = [
    "Jane Doe, 29",
    "Mr. Smith, unknown age",
    {"text": "Alice, 41"},
]

results = transform_batch(
    model=model,
    schema=Person,
    items=inputs,
    batch_size=2,
    prompt_context={
        "instructions": "Extract name and age. Use null when unknown.",
    },
)
print(results)  # list of Person-like dicts
```
Streaming structured output is supported via `data_tamer.stream_transform_object` (LiteLLM streaming under the hood).
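The exact streaming signature isn't shown above, but conceptually a structured stream is consumed by accumulating text chunks and parsing the completed JSON at the end. A minimal illustration with a stand-in chunk source (no data-tamer or network calls; the generator below is hypothetical):

```python
import json

def fake_chunk_stream():
    # Stand-in for an LLM token stream; stream_transform_object would yield
    # text chunks like these from LiteLLM's streaming API.
    for chunk in ['{"name": "Ja', 'ne Doe", ', '"age": 29}']:
        yield chunk

buffer = []
for chunk in fake_chunk_stream():
    buffer.append(chunk)  # partial text can be shown to the user as it arrives

final = json.loads("".join(buffer))  # parse the finished object once the stream ends
print(final)  # {'name': 'Jane Doe', 'age': 29}
```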
### Async batching
For higher throughput, use the async variant with concurrency:
```python
import asyncio
from pydantic import BaseModel
import os
from data_tamer import async_transform_batch


class Person(BaseModel):
    name: str
    age: int | None


async def main():
    model = os.environ.get("LITELLM_MODEL", "gpt-4o-mini")
    inputs = [f"User {i}, {20 + (i % 40)}" for i in range(100)]
    results = await async_transform_batch(
        model=model,
        schema=Person,
        items=inputs,
        batch_size=10,
        concurrency=5,
        prompt_context={"instructions": "Extract name and age"},
    )
    print(len(results))


asyncio.run(main())
```
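`concurrency=5` caps how many batch requests are in flight at once. A plausible mechanism (an assumption for illustration, not taken from the library's source) is an `asyncio.Semaphore` around each batch call:

```python
import asyncio

async def process_batch(batch: list) -> list:
    # Stand-in for one LLM request over a batch of items.
    await asyncio.sleep(0)
    return [{"text": item} for item in batch]

async def bounded_gather(batches: list, concurrency: int) -> list:
    sem = asyncio.Semaphore(concurrency)  # at most `concurrency` batches in flight

    async def run(batch):
        async with sem:
            return await process_batch(batch)

    # gather preserves batch order, so results stay aligned with inputs
    per_batch = await asyncio.gather(*(run(b) for b in batches))
    return [row for rows in per_batch for row in rows]  # flatten to one list

batches = [["a", "b"], ["c"], ["d", "e"]]
rows = asyncio.run(bounded_gather(batches, concurrency=2))
print(len(rows))  # 5
```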
## Prompt Compaction
The prompt builder:
- De-duplicates schema guidance and uses short, strict JSON directions.
- Truncates per-item input via `char_limit_per_item`.
- Supports optional `system`, `instructions`, and few-shot `examples`.
- Items are raw inputs (strings or objects). Place guidance/instructions in `prompt_context.system`/`prompt_context.instructions`.
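The per-item truncation can be pictured with a small helper. This is a sketch: `char_limit_per_item` is the documented parameter, but the helper name and serialization details are assumptions.

```python
import json

def compact_items(items, char_limit_per_item: int) -> list:
    """Serialize each item compactly and hard-truncate it (hypothetical helper)."""
    rendered = []
    for item in items:
        # Strings pass through; objects are serialized without extra whitespace.
        text = item if isinstance(item, str) else json.dumps(item, separators=(",", ":"))
        rendered.append(text[:char_limit_per_item])  # cut each item to the limit
    return rendered

lines = compact_items(["Jane Doe, 29", {"text": "Alice, 41"}], char_limit_per_item=10)
print(lines)  # ['Jane Doe, ', '{"text":"A']
```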
## API
- `transform_object(model, schema, items|prompt_context, ...)`
  - Generates a single structured object. If `items` are provided, a compact prompt is built; otherwise use `prompt_context` with instructions.
  - `schema` can be a Pydantic model class or a JSON Schema dict. When supported by the provider, LiteLLM enforces structured output. We also parse JSON and, for dict schemas, validate locally via `jsonschema` as a fallback.
- `stream_transform_object(...)`
  - Streams text chunks and allows awaiting the final parsed object.
- `transform_batch(model, schema, items, batch_size=..., concurrency=...)`
  - Splits inputs into batches, builds compact prompts, and parses array outputs. Uses threads when `concurrency > 1`.
- `async_transform_batch(...)`
  - Async variant with concurrency control via asyncio.
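The batch-splitting and thread fan-out described above can be sketched as follows. The chunking and thread-pool details are assumptions, and `call_llm` is a local stand-in for the real structured-output request:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(items: list, batch_size: int) -> list:
    # Split inputs into fixed-size batches, preserving order.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def call_llm(batch: list) -> list:
    # Stand-in for one LLM request that returns an array of parsed objects.
    return [{"name": s.split(",")[0]} for s in batch]

def transform_batch_sketch(items, batch_size=2, concurrency=1):
    batches = chunk(items, batch_size)
    if concurrency > 1:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            per_batch = list(pool.map(call_llm, batches))  # map preserves batch order
    else:
        per_batch = [call_llm(b) for b in batches]
    return [row for rows in per_batch for row in rows]  # flatten batch results

out = transform_batch_sketch(["Jane, 29", "Bob, 31", "Alice, 41"],
                             batch_size=2, concurrency=2)
print(out)  # [{'name': 'Jane'}, {'name': 'Bob'}, {'name': 'Alice'}]
```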
## Notes
- Providers (LiteLLM): pass a model id string (e.g., `gpt-4o-mini`, `openrouter/google/gemini-2.5-flash-lite`) and set the corresponding API key in env (`OPENAI_API_KEY`, `OPENROUTER_API_KEY`, etc.).
- Structured outputs:
  - Pydantic: pass a `BaseModel` subclass as `schema`. LiteLLM will request structured responses when supported; we parse JSON regardless.
  - JSON Schema: pass a dict; we set LiteLLM `response_format={"type":"json_schema",...}` and also validate locally with `jsonschema`.
  - Helpers: `pydantic_json_schema`, `pydantic_array_json_schema` generate dict schemas from Pydantic models.
- OpenRouter: set `OPENROUTER_API_KEY` and pick an OpenRouter model id via `LITELLM_MODEL`, e.g., `openrouter/google/gemini-2.5-flash-lite`.
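The schema helpers presumably build on Pydantic's own schema export. The equivalent with plain Pydantic v2 (no data-tamer required; `model_json_schema` is standard Pydantic API) looks like:

```python
from typing import Optional
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: Optional[int] = None  # optional field: excluded from "required"

# Pydantic v2's built-in JSON Schema export; pydantic_json_schema likely
# produces a dict of this shape.
schema = Person.model_json_schema()
print(sorted(schema["properties"]))  # ['age', 'name']
print(schema["required"])            # ['name']
```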
## Examples
- `examples/generate_object_example.py` — basic structured generation
- `examples/transform_batch_example.py` — batching with compact prompts
- `examples/jsonschema_example.py` — JSON Schema with validation
- `examples/legacy_contacts.py` — real-world cleanup with OpenRouter (default Gemini model)