| Field | Value |
|---|---|
| Name | tweaktune |
| Version | 0.0.1a12 |
| Summary | A Python package for synthesizing datasets for training and fine-tuning AI models. |
| Repository | https://github.com/qooba/tweaktune |
| Upload time | 2025-10-17 20:23:26 |
| Requires Python | >=3.8 |
| License | MIT OR Apache-2.0 |
| Keywords | llm, ai, machine-learning |
| Requirements | None recorded |
| CI | GitHub Actions |
# tweaktune
**tweaktune** is a Rust-powered, Python-facing library designed to **synthesize datasets** for **training and fine-tuning AI models**, especially **LMs** (Language Models).
It allows you to easily build data pipelines, generate new examples using LLM APIs, and create structured datasets from a variety of sources.
---
## Features
- **Flexible Data Sources**:
Supports datasets from:
- Parquet files
- CSV files
- JSONL files
- Arrow datasets
- OpenAPI specifications (for function calling datasets)
- Lists of tools (Python functions for function calling datasets)
- Pydantic models (for structured output datasets)
- **LLM Integration**:
Connects to any LLM API to generate synthetic text or structured JSON.
- **Dynamic Prompting**:
Supports **Jinja templates** for highly customizable prompts.
- **Parallel Processing**:
Configure **multiple workers** to run your pipeline steps in parallel.
- **Easy Pipeline Building**:
Compose steps like sampling, generating, writing, or debugging into a seamless pipeline.
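The dynamic prompting above relies on standard Jinja semantics, so a template can index into sampled rows and access their fields. As a minimal sketch of how a template like the one in the Quick Example resolves dataset fields (using the `jinja2` package directly, outside of tweaktune; the `Row` class here is a mock standing in for a sampled dataset record):

```python
from jinja2 import Template

# A template indexing into a sampled dataset row, as in the Quick Example:
# "article" is a list produced by a sample() step; [0] is the first sampled row.
tpl = Template("FRAGMENT TEKSTU:\n\n{{article[0].text}}")

# Mock of a dataset row with a .text attribute (an assumption for illustration).
class Row:
    def __init__(self, text):
        self.text = text

rendered = tpl.render(article=[Row("Przykładowy artykuł o Warszawie.")])
print(rendered)
```

Inside a pipeline, the `article` variable would be populated by a preceding `sample()` step rather than passed to `render()` by hand.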
---
## Quick Example
Here's how you can build a dataset from a Parquet file and synthesize new data using an LLM API:
```python
from tweaktune import Pipeline
import os
# Prompt in Polish (translation: "Based on the text fragment below, describe
# the persona associated with it. Invent a fictional first and last name for
# the person. Write two sentences about this person; return the description
# in JSON format and add nothing else: {"persona": "person description"}").
persona_template = """
Na podstawie poniższego fragmentu tekstu opisz personę która jest z nim związana.
Dla opisywanej osoby wymyśl fikcyjne imię i nazwisko.
Napisz dwa zdania na temat tej osoby, opis zwróć w formacie json, nie dodawaj nic więcej:
{"persona":"opis osoby"}

---
FRAGMENT TEKSTU:

{{article[0].text}}
"""

url = "http://localhost:8000/"
api_key = os.environ["API_KEY"]
model = "model"

p = Pipeline()\
    .with_workers(5)\
    .with_parquet_dataset("web_articles", "../../datasets/articles.pq")\
    .with_llm_api("bielik", url, api_key, model)\
    .with_template("persona", persona_template)\
    .with_template("output", """{"persona": {{persona|jstr}} }""")\
    .iter_range(10000)\
    .sample(dataset="web_articles", size=1, output="article")\
    .generate_json(template="persona", llm="bielik", output="persona", json_path="persona")\
    .generate_json(template="persona", llm="bielik", output="persona", json_path="persona")\
    .write_jsonl(path="../../datasets/personas.jsonl", template="output")\
    .run()
```
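The `output` template above wraps the generated text in a JSON object before each line is written. As a rough sketch of what that record construction amounts to in plain Python (assuming the `jstr` filter renders a value as a JSON string literal, which is an inference from the template, not documented behavior):

```python
import json

# Hypothetical generated persona text (stands in for the LLM output).
persona = 'Jan Kowalski. Pasjonat historii, który pisze "blog" o Warszawie.'

# The template {"persona": {{persona|jstr}} } presumably renders the value
# as an escaped JSON string; json.dumps produces the same escaping.
line = '{"persona": %s}' % json.dumps(persona, ensure_ascii=False)

# Each pipeline iteration would append one such line to personas.jsonl;
# the line round-trips as valid JSON.
record = json.loads(line)
print(record["persona"])
```

Note that without JSON escaping, a persona containing quotes (as above) would produce an invalid JSONL line, which is presumably what `jstr` guards against.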
---
## Pipeline Steps
You can easily chain together multiple steps:
- `sample()` – sample items from a dataset
- `read()` – read entire dataset
- `generate_text()` – generate text using an LLM
- `generate_json()` – generate JSON output and extract a specific field
- `write_jsonl()` – write output to a JSONL file
- `write_csv()` – write output to a CSV file
- `print()` – print outputs
- `debug()` – enable detailed debugging
- `log()` – set log level
- `python step` – add custom Python-defined step classes
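The `python step` hook lets you drop custom logic into the chain. The exact interface is not shown here, so the following is only a hypothetical sketch of what such a step class could look like; the class name and the `process(context)` method are assumptions for illustration, not the library's documented API:

```python
# Hypothetical custom step: annotate each pipeline context with a word count.
# The protocol (a class exposing process(context) -> context) is assumed;
# consult the tweaktune docs for the real step interface.
class WordCountStep:
    def __init__(self, source_key: str, target_key: str):
        self.source_key = source_key
        self.target_key = target_key

    def process(self, context: dict) -> dict:
        text = context.get(self.source_key, "")
        context[self.target_key] = len(text.split())
        return context

# Standalone use of the step logic, outside any pipeline:
step = WordCountStep("persona", "persona_words")
ctx = step.process({"persona": "Two short sentences about a person."})
print(ctx["persona_words"])
```

A step like this would slot between `generate_json()` and `write_jsonl()` so the count is available to the output template.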
---
## Why tweaktune?
- Build synthetic datasets faster for fine-tuning models.
- Automate text, JSON, or structured data generation.
- Stay flexible: plug your own LLM API or use existing OpenAI-compatible ones.
- Rust speed, Python usability.
---
## 📦 Installation
```bash
pip install tweaktune
```
## 🤝 Contributing
We welcome contributions! Feel free to open issues, suggest features, or create pull requests.
Please note that by contributing to this project, you agree to the terms of the [Contributor License Agreement (CLA)](CLA.md).