| Field | Value |
|---|---|
| Name | tweaktune |
| Version | 0.0.1a12 |
| Summary | A Python package for synthesizing datasets for training and fine-tuning AI models. |
| Repository | https://github.com/qooba/tweaktune |
| Upload time | 2025-10-17 20:23:26 |
| Requires Python | >=3.8 |
| License | MIT OR Apache-2.0 |
| Keywords | llm, ai, machine-learning |
| Requirements | None recorded |
| CI | GitHub Actions |
# tweaktune
**tweaktune** is a Rust-powered, Python-facing library designed to **synthesize datasets** for **training and fine-tuning AI models**, especially **LMs** (Language Models).
It allows you to easily build data pipelines, generate new examples using LLM APIs, and create structured datasets from a variety of sources.
---
## Features
- **Flexible Data Sources**:
Supports datasets from:
- Parquet files
- CSV files
- JSONL files
- Arrow datasets
- OpenAPI specifications (for function calling datasets)
- Lists of tools (Python functions for function calling datasets)
- Pydantic models (for structured output datasets)
- **LLM Integration**:
Connects to any LLM API to generate synthetic text or structured JSON.
- **Dynamic Prompting**:
Supports **Jinja templates** for highly customizable prompts.
- **Parallel Processing**:
Configure **multiple workers** to run your pipeline steps in parallel.
- **Easy Pipeline Building**:
Compose steps like sampling, generating, writing, or debugging into a seamless pipeline.
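The dynamic prompting above relies on standard Jinja semantics, so a template can index into sampled rows and access their fields. As a minimal sketch of how a template like the one in the Quick Example resolves dataset fields (using the `jinja2` package directly, outside of tweaktune; the `Row` class here is a mock standing in for a sampled dataset record):

```python
from jinja2 import Template

# A template indexing into a sampled dataset row, as in the Quick Example:
# "article" is a list produced by a sample() step; [0] is the first sampled row.
tpl = Template("FRAGMENT TEKSTU:\n\n{{article[0].text}}")

# Mock of a dataset row with a .text attribute (an assumption for illustration).
class Row:
    def __init__(self, text):
        self.text = text

rendered = tpl.render(article=[Row("Przykładowy artykuł o Warszawie.")])
print(rendered)
```

Inside a pipeline, the `article` variable would be populated by a preceding `sample()` step rather than passed to `render()` by hand.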
---
## Quick Example
Here's how you can build a dataset from a Parquet file and synthesize new data using an LLM API:
```python
from tweaktune import Pipeline
import os
# Prompt in Polish (translation: "Based on the text fragment below, describe
# the persona associated with it. Invent a fictional first and last name for
# the person. Write two sentences about this person; return the description
# in JSON format and add nothing else: {"persona": "person description"}").
persona_template = """
Na podstawie poniższego fragmentu tekstu opisz personę która jest z nim związana.
Dla opisywanej osoby wymyśl fikcyjne imię i nazwisko.
Napisz dwa zdania na temat tej osoby, opis zwróć w formacie json, nie dodawaj nic więcej:
{"persona":"opis osoby"}

---
FRAGMENT TEKSTU:

{{article[0].text}}
"""

url = "http://localhost:8000/"
api_key = os.environ["API_KEY"]
model = "model"

p = Pipeline()\
    .with_workers(5)\
    .with_parquet_dataset("web_articles", "../../datasets/articles.pq")\
    .with_llm_api("bielik", url, api_key, model)\
    .with_template("persona", persona_template)\
    .with_template("output", """{"persona": {{persona|jstr}} }""")\
    .iter_range(10000)\
    .sample(dataset="web_articles", size=1, output="article")\
    .generate_json(template="persona", llm="bielik", output="persona", json_path="persona")\
    .generate_json(template="persona", llm="bielik", output="persona", json_path="persona")\
    .write_jsonl(path="../../datasets/personas.jsonl", template="output")\
    .run()
```
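The `output` template above wraps the generated text in a JSON object before each line is written. As a rough sketch of what that record construction amounts to in plain Python (assuming the `jstr` filter renders a value as a JSON string literal, which is an inference from the template, not documented behavior):

```python
import json

# Hypothetical generated persona text (stands in for the LLM output).
persona = 'Jan Kowalski. Pasjonat historii, który pisze "blog" o Warszawie.'

# The template {"persona": {{persona|jstr}} } presumably renders the value
# as an escaped JSON string; json.dumps produces the same escaping.
line = '{"persona": %s}' % json.dumps(persona, ensure_ascii=False)

# Each pipeline iteration would append one such line to personas.jsonl;
# the line round-trips as valid JSON.
record = json.loads(line)
print(record["persona"])
```

Note that without JSON escaping, a persona containing quotes (as above) would produce an invalid JSONL line, which is presumably what `jstr` guards against.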
---
## Pipeline Steps
You can easily chain together multiple steps:
- `sample()` – sample items from a dataset
- `read()` – read entire dataset
- `generate_text()` – generate text using an LLM
- `generate_json()` – generate JSON output and extract a specific field
- `write_jsonl()` – write output to a JSONL file
- `write_csv()` – write output to a CSV file
- `print()` – print outputs
- `debug()` – enable detailed debugging
- `log()` – set log level
- `python step` – add custom Python-defined step classes
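The `python step` hook lets you drop custom logic into the chain. The exact interface is not shown here, so the following is only a hypothetical sketch of what such a step class could look like; the class name and the `process(context)` method are assumptions for illustration, not the library's documented API:

```python
# Hypothetical custom step: annotate each pipeline context with a word count.
# The protocol (a class exposing process(context) -> context) is assumed;
# consult the tweaktune docs for the real step interface.
class WordCountStep:
    def __init__(self, source_key: str, target_key: str):
        self.source_key = source_key
        self.target_key = target_key

    def process(self, context: dict) -> dict:
        text = context.get(self.source_key, "")
        context[self.target_key] = len(text.split())
        return context

# Standalone use of the step logic, outside any pipeline:
step = WordCountStep("persona", "persona_words")
ctx = step.process({"persona": "Two short sentences about a person."})
print(ctx["persona_words"])
```

A step like this would slot between `generate_json()` and `write_jsonl()` so the count is available to the output template.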
---
## Why tweaktune?
- Build synthetic datasets faster for fine-tuning models.
- Automate text, JSON, or structured data generation.
- Stay flexible: plug your own LLM API or use existing OpenAI-compatible ones.
- Rust speed, Python usability.
---
## 📦 Installation
```bash
pip install tweaktune
```
## 🤝 Contributing
We welcome contributions! Feel free to open issues, suggest features, or create pull requests.
Please note that by contributing to this project, you agree to the terms of the [Contributor License Agreement (CLA)](CLA.md).