synthetic-data-kit


Namesynthetic-data-kit JSON
Version 0.0.4 PyPI version JSON
download
home_pageNone
SummaryTool for generating high quality Synthetic datasets
upload_time2025-07-15 20:12:51
maintainerNone
docs_urlNone
authorNone
requires_python>=3.8
licenseMIT
keywords ai chain-of-thought dataset-generation fine-tuning llama llm machine-learning nlp reasoning synthetic-data tool-use
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Synthetic Data Kit

Tool for generating high-quality synthetic datasets to fine-tune LLMs.

Generate Reasoning Traces, QA Pairs, save them to a fine-tuning format with a simple CLI.

> [Checkout our guide on using the tool to unlock task-specific reasoning in Llama-3 family](https://github.com/meta-llama/synthetic-data-kit/tree/main/use-cases/adding_reasoning_to_llama_3)

# What does Synthetic Data Kit offer? 

Fine-Tuning Large Language Models is easy. There are many mature tools that you can use to fine-tune Llama model family using various post-training techniques.

### Why target data preparation?

Multiple tools support standardized formats. However, most of the times your dataset is not structured in "user", "assistant" threads or in a certain format that plays well with a fine-tuning packages. 

This toolkit simplifies the journey of:

- Using a LLM (vLLM or any local/external API endpoint) to generate examples
- Modular 4 command flow
- Converting your existing files to fine-tuning friendly formats
- Creating synthetic datasets
- Supporting various formats of post-training fine-tuning

# How does Synthetic Data Kit offer it? 

The tool is designed to follow a simple CLI structure with 4 commands:

- `ingest` various file formats
- `create` your fine-tuning format: `QA` pairs, `QA` pairs with CoT, `summary` format
- `curate`: Using Llama as a judge to curate high quality examples. 
- `save-as`: After that you can simply save these to a format that your fine-tuning workflow requires.

You can override any parameter or detail by either using the CLI or overriding the default YAML config.


### Installation

#### From PyPI

```bash
# Create a new environment

conda create -n synthetic-data python=3.10 

conda activate synthetic-data

pip install synthetic-data-kit
```

#### (Alternatively) From Source

```bash
git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .
```

To get an overview of commands type: 

`synthetic-data-kit --help`

### 1. Tool Setup

- The tool expects respective files to be put in named folders.

```bash
# Create directory structure
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}
```

- You also need a LLM backend that you will utilize for generating your dataset, if using vLLM:

```bash
# Start vLLM server
# Note you will need to grab your HF Authentication from: https://huggingface.co/settings/tokens
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
```

### 2. Usage

The flow follows 4 simple steps: `ingest`, `create`, `curate`, `save-as`, please paste your file into the respective folder:

```bash
# Check if your backend is running
synthetic-data-kit system-check

# Parse a document to text
synthetic-data-kit ingest docs/report.pdf
# This will save file to data/output/report.txt

# Generate QA pairs (default)
synthetic-data-kit create data/output/report.txt --type qa

OR 

# Generate Chain of Thought (CoT) reasoning examples
synthetic-data-kit create data/output/report.txt --type cot

# Both of these will save file to data/generated/report_qa_pairs.json

# Filter content based on quality
synthetic-data-kit curate data/generated/report_qa_pairs.json

# Convert to alpaca fine-tuning format and save as HF arrow file
synthetic-data-kit save-as data/cleaned/report_cleaned.json --format alpaca --storage hf
```
## Configuration

The toolkit uses a YAML configuration file (default: `configs/config.yaml`).

Note, this can be overridden via either CLI arguments OR passing a custom YAML file

```yaml
# Example configuration using vLLM
llm:
  provider: "vllm"

vllm:
  api_base: "http://localhost:8000/v1"
  model: "meta-llama/Llama-3.3-70B-Instruct"

generation:
  temperature: 0.7
  chunk_size: 4000
  num_pairs: 25

curate:
  threshold: 7.0
  batch_size: 8
```

or using an API endpoint:

```yaml
# Example configuration using the llama API
llm:
  provider: "api-endpoint"

api-endpoint:
  api_base: "https://api.llama.com/v1"
  api_key: "llama-api-key"
  model: "Llama-4-Maverick-17B-128E-Instruct-FP8"
```

### Customizing Configuration

Create a overriding configuration file and use it with the `-c` flag:

```bash
synthetic-data-kit -c my_config.yaml ingest docs/paper.pdf
```

## Examples

### Processing a PDF Document

```bash
# Ingest PDF
synthetic-data-kit ingest research_paper.pdf

# Generate QA pairs
synthetic-data-kit create data/output/research_paper.txt -n 30 --threshold 8.0

# Curate data
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5

# Save in OpenAI fine-tuning format (JSON)
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft

# Save in OpenAI fine-tuning format (HF dataset)
synthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft --storage hf
```

### Processing a YouTube Video

```bash
# Extract transcript
synthetic-data-kit ingest "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

# Generate QA pairs with specific model
synthetic-data-kit create data/output/youtube_dQw4w9WgXcQ.txt
```

### Processing Multiple Files

```bash
# Bash script to process multiple files
for file in data/pdf/*.pdf; do
  filename=$(basename "$file" .pdf)
  
  synthetic-data-kit ingest "$file"
  synthetic-data-kit create "data/output/${filename}.txt" -n 20
  synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5
  synthetic-data-kit save-as "data/cleaned/${filename}_cleaned.json" -f chatml
done
```

## Advanced Usage

### Custom Prompt Templates

Edit the `prompts` section in your configuration file to customize generation behavior:

```yaml
prompts:
  qa_generation: |
    You are creating question-answer pairs for fine-tuning a legal assistant.
    Focus on technical legal concepts, precedents, and statutory interpretation.
    
    Below is a chunk of text about: {summary}...
    
    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.
    
    Return ONLY valid JSON formatted as:
    [
      {
        "question": "Detailed legal question?",
        "answer": "Precise legal answer."
      },
      ...
    ]
    
    Text:
    ---
    {text}
    ---
```

### Mental Model:

```mermaid
graph LR
    SDK --> SystemCheck[system-check]
    SDK[synthetic-data-kit] --> Ingest[ingest]
    SDK --> Create[create]
    SDK --> Curate[curate]
    SDK --> SaveAs[save-as]
    
    Ingest --> PDFFile[PDF File]
    Ingest --> HTMLFile[HTML File]
    Ingest --> YouTubeURL[File Format]

    
    Create --> CoT[CoT]
    Create --> QA[QA Pairs]
    Create --> Summary[Summary]
    
    Curate --> Filter[Filter by Quality]
    
    SaveAs --> JSONL[JSONL Format]
    SaveAs --> Alpaca[Alpaca Format]
    SaveAs --> FT[Fine-Tuning Format]
    SaveAs --> ChatML[ChatML Format]
```

## Troubleshooting FAQs:

### vLLM Server Issues

- Ensure vLLM is installed: `pip install vllm`
- Start server with: `vllm serve <model_name> --port 8000`
- Check connection: `synthetic-data-kit system-check`

### Memory Issues

If you encounter CUDA out of memory errors:
- Use a smaller model
- Reduce batch size in config
- Start vLLM with `--gpu-memory-utilization 0.85`

### JSON Parsing Issues

If you encounter issues with the `curate` command:
- Use the `-v` flag to enable verbose output
- Set smaller batch sizes in your config.yaml
- Ensure the LLM model supports proper JSON output
- Install json5 for enhanced JSON parsing: `pip install json5`

### Parser Errors

- Ensure required dependencies are installed for specific parsers:
  - PDF: `pip install pdfminer.six`
  - HTML: `pip install beautifulsoup4`
  - YouTube: `pip install pytubefix youtube-transcript-api`
  - DOCX: `pip install python-docx`
  - PPTX: `pip install python-pptx`

## License

Read more about the [License](./LICENSE)

## Contributing

Contributions are welcome! [Read our contributing guide](./CONTRIBUTING.md)

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "synthetic-data-kit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Sanyam Bhutani <sanyambhutani@meta.com>, Hamid Shojanazeri <hamidnazeri@meta.com>",
    "keywords": "ai, chain-of-thought, dataset-generation, fine-tuning, llama, llm, machine-learning, nlp, reasoning, synthetic-data, tool-use",
    "author": null,
    "author_email": "Sanyam Bhutani <sanyambhutani@meta.com>, Hamid Shojanazeri <hamidnazeri@meta.com>",
    "download_url": "https://files.pythonhosted.org/packages/a8/8b/a733421ef6fb8c3feb7cf318a83e9189417afe27c411bb672f1733dd523e/synthetic_data_kit-0.0.4.tar.gz",
    "platform": null,
    "description": "# Synthetic Data Kit\n\nTool for generating high-quality synthetic datasets to fine-tune LLMs.\n\nGenerate Reasoning Traces, QA Pairs, save them to a fine-tuning format with a simple CLI.\n\n> [Checkout our guide on using the tool to unlock task-specific reasoning in Llama-3 family](https://github.com/meta-llama/synthetic-data-kit/tree/main/use-cases/adding_reasoning_to_llama_3)\n\n# What does Synthetic Data Kit offer? \n\nFine-Tuning Large Language Models is easy. There are many mature tools that you can use to fine-tune Llama model family using various post-training techniques.\n\n### Why target data preparation?\n\nMultiple tools support standardized formats. However, most of the times your dataset is not structured in \"user\", \"assistant\" threads or in a certain format that plays well with a fine-tuning packages. \n\nThis toolkit simplifies the journey of:\n\n- Using a LLM (vLLM or any local/external API endpoint) to generate examples\n- Modular 4 command flow\n- Converting your existing files to fine-tuning friendly formats\n- Creating synthetic datasets\n- Supporting various formats of post-training fine-tuning\n\n# How does Synthetic Data Kit offer it? \n\nThe tool is designed to follow a simple CLI structure with 4 commands:\n\n- `ingest` various file formats\n- `create` your fine-tuning format: `QA` pairs, `QA` pairs with CoT, `summary` format\n- `curate`: Using Llama as a judge to curate high quality examples. \n- `save-as`: After that you can simply save these to a format that your fine-tuning workflow requires.\n\nYou can override any parameter or detail by either using the CLI or overriding the default YAML config.\n\n\n### Installation\n\n#### From PyPI\n\n```bash\n# Create a new environment\n\nconda create -n synthetic-data python=3.10 \n\nconda activate synthetic-data\n\npip install synthetic-data-kit\n```\n\n#### (Alternatively) From Source\n\n```bash\ngit clone https://github.com/meta-llama/synthetic-data-kit.git\ncd synthetic-data-kit\npip install -e .\n```\n\nTo get an overview of commands type: \n\n`synthetic-data-kit --help`\n\n### 1. Tool Setup\n\n- The tool expects respective files to be put in named folders.\n\n```bash\n# Create directory structure\nmkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}\n```\n\n- You also need a LLM backend that you will utilize for generating your dataset, if using vLLM:\n\n```bash\n# Start vLLM server\n# Note you will need to grab your HF Authentication from: https://huggingface.co/settings/tokens\nvllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000\n```\n\n### 2. Usage\n\nThe flow follows 4 simple steps: `ingest`, `create`, `curate`, `save-as`, please paste your file into the respective folder:\n\n```bash\n# Check if your backend is running\nsynthetic-data-kit system-check\n\n# Parse a document to text\nsynthetic-data-kit ingest docs/report.pdf\n# This will save file to data/output/report.txt\n\n# Generate QA pairs (default)\nsynthetic-data-kit create data/output/report.txt --type qa\n\nOR \n\n# Generate Chain of Thought (CoT) reasoning examples\nsynthetic-data-kit create data/output/report.txt --type cot\n\n# Both of these will save file to data/generated/report_qa_pairs.json\n\n# Filter content based on quality\nsynthetic-data-kit curate data/generated/report_qa_pairs.json\n\n# Convert to alpaca fine-tuning format and save as HF arrow file\nsynthetic-data-kit save-as data/cleaned/report_cleaned.json --format alpaca --storage hf\n```\n## Configuration\n\nThe toolkit uses a YAML configuration file (default: `configs/config.yaml`).\n\nNote, this can be overridden via either CLI arguments OR passing a custom YAML file\n\n```yaml\n# Example configuration using vLLM\nllm:\n  provider: \"vllm\"\n\nvllm:\n  api_base: \"http://localhost:8000/v1\"\n  model: \"meta-llama/Llama-3.3-70B-Instruct\"\n\ngeneration:\n  temperature: 0.7\n  chunk_size: 4000\n  num_pairs: 25\n\ncurate:\n  threshold: 7.0\n  batch_size: 8\n```\n\nor using an API endpoint:\n\n```yaml\n# Example configuration using the llama API\nllm:\n  provider: \"api-endpoint\"\n\napi-endpoint:\n  api_base: \"https://api.llama.com/v1\"\n  api_key: \"llama-api-key\"\n  model: \"Llama-4-Maverick-17B-128E-Instruct-FP8\"\n```\n\n### Customizing Configuration\n\nCreate a overriding configuration file and use it with the `-c` flag:\n\n```bash\nsynthetic-data-kit -c my_config.yaml ingest docs/paper.pdf\n```\n\n## Examples\n\n### Processing a PDF Document\n\n```bash\n# Ingest PDF\nsynthetic-data-kit ingest research_paper.pdf\n\n# Generate QA pairs\nsynthetic-data-kit create data/output/research_paper.txt -n 30 --threshold 8.0\n\n# Curate data\nsynthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5\n\n# Save in OpenAI fine-tuning format (JSON)\nsynthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft\n\n# Save in OpenAI fine-tuning format (HF dataset)\nsynthetic-data-kit save-as data/cleaned/research_paper_cleaned.json -f ft --storage hf\n```\n\n### Processing a YouTube Video\n\n```bash\n# Extract transcript\nsynthetic-data-kit ingest \"https://www.youtube.com/watch?v=dQw4w9WgXcQ\"\n\n# Generate QA pairs with specific model\nsynthetic-data-kit create data/output/youtube_dQw4w9WgXcQ.txt\n```\n\n### Processing Multiple Files\n\n```bash\n# Bash script to process multiple files\nfor file in data/pdf/*.pdf; do\n  filename=$(basename \"$file\" .pdf)\n  \n  synthetic-data-kit ingest \"$file\"\n  synthetic-data-kit create \"data/output/${filename}.txt\" -n 20\n  synthetic-data-kit curate \"data/generated/${filename}_qa_pairs.json\" -t 7.5\n  synthetic-data-kit save-as \"data/cleaned/${filename}_cleaned.json\" -f chatml\ndone\n```\n\n## Advanced Usage\n\n### Custom Prompt Templates\n\nEdit the `prompts` section in your configuration file to customize generation behavior:\n\n```yaml\nprompts:\n  qa_generation: |\n    You are creating question-answer pairs for fine-tuning a legal assistant.\n    Focus on technical legal concepts, precedents, and statutory interpretation.\n    \n    Below is a chunk of text about: {summary}...\n    \n    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.\n    \n    Return ONLY valid JSON formatted as:\n    [\n      {\n        \"question\": \"Detailed legal question?\",\n        \"answer\": \"Precise legal answer.\"\n      },\n      ...\n    ]\n    \n    Text:\n    ---\n    {text}\n    ---\n```\n\n### Mental Model:\n\n```mermaid\ngraph LR\n    SDK --> SystemCheck[system-check]\n    SDK[synthetic-data-kit] --> Ingest[ingest]\n    SDK --> Create[create]\n    SDK --> Curate[curate]\n    SDK --> SaveAs[save-as]\n    \n    Ingest --> PDFFile[PDF File]\n    Ingest --> HTMLFile[HTML File]\n    Ingest --> YouTubeURL[File Format]\n\n    \n    Create --> CoT[CoT]\n    Create --> QA[QA Pairs]\n    Create --> Summary[Summary]\n    \n    Curate --> Filter[Filter by Quality]\n    \n    SaveAs --> JSONL[JSONL Format]\n    SaveAs --> Alpaca[Alpaca Format]\n    SaveAs --> FT[Fine-Tuning Format]\n    SaveAs --> ChatML[ChatML Format]\n```\n\n## Troubleshooting FAQs:\n\n### vLLM Server Issues\n\n- Ensure vLLM is installed: `pip install vllm`\n- Start server with: `vllm serve <model_name> --port 8000`\n- Check connection: `synthetic-data-kit system-check`\n\n### Memory Issues\n\nIf you encounter CUDA out of memory errors:\n- Use a smaller model\n- Reduce batch size in config\n- Start vLLM with `--gpu-memory-utilization 0.85`\n\n### JSON Parsing Issues\n\nIf you encounter issues with the `curate` command:\n- Use the `-v` flag to enable verbose output\n- Set smaller batch sizes in your config.yaml\n- Ensure the LLM model supports proper JSON output\n- Install json5 for enhanced JSON parsing: `pip install json5`\n\n### Parser Errors\n\n- Ensure required dependencies are installed for specific parsers:\n  - PDF: `pip install pdfminer.six`\n  - HTML: `pip install beautifulsoup4`\n  - YouTube: `pip install pytubefix youtube-transcript-api`\n  - DOCX: `pip install python-docx`\n  - PPTX: `pip install python-pptx`\n\n## License\n\nRead more about the [License](./LICENSE)\n\n## Contributing\n\nContributions are welcome! [Read our contributing guide](./CONTRIBUTING.md)\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Tool for generating high quality Synthetic datasets",
    "version": "0.0.4",
    "project_urls": {
        "Bug Tracker": "https://github.com/meta-llama/synthetic-data-kit/issues",
        "Documentation": "https://github.com/meta-llama/synthetic-data-kit#readme",
        "Getting Started": "https://github.com/meta-llama/synthetic-data-kit/blob/main/getting-started/README.md",
        "Homepage": "https://github.com/meta-llama/synthetic-data-kit"
    },
    "split_keywords": [
        "ai",
        " chain-of-thought",
        " dataset-generation",
        " fine-tuning",
        " llama",
        " llm",
        " machine-learning",
        " nlp",
        " reasoning",
        " synthetic-data",
        " tool-use"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b228f4d9500a5c4f72b611ce90cf9ab6d9b7dab7a0b143cec7be636093388828",
                "md5": "52777200fbcb13483e6f7e1165b62f12",
                "sha256": "2dcbed7dbec057e205aae132a46b0aba97689705cdae1f1e64bda2e96974f5c2"
            },
            "downloads": -1,
            "filename": "synthetic_data_kit-0.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "52777200fbcb13483e6f7e1165b62f12",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 70713,
            "upload_time": "2025-07-15T20:12:50",
            "upload_time_iso_8601": "2025-07-15T20:12:50.058115Z",
            "url": "https://files.pythonhosted.org/packages/b2/28/f4d9500a5c4f72b611ce90cf9ab6d9b7dab7a0b143cec7be636093388828/synthetic_data_kit-0.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a88ba733421ef6fb8c3feb7cf318a83e9189417afe27c411bb672f1733dd523e",
                "md5": "5c7181a76b1c0f261a629ec816d0f197",
                "sha256": "88c3f17e2a3a4c1f01f15c5f0d7e545112edee8937df64abc44e3ebad5ac4399"
            },
            "downloads": -1,
            "filename": "synthetic_data_kit-0.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "5c7181a76b1c0f261a629ec816d0f197",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 53695,
            "upload_time": "2025-07-15T20:12:51",
            "upload_time_iso_8601": "2025-07-15T20:12:51.337169Z",
            "url": "https://files.pythonhosted.org/packages/a8/8b/a733421ef6fb8c3feb7cf318a83e9189417afe27c411bb672f1733dd523e/synthetic_data_kit-0.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-15 20:12:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "meta-llama",
    "github_project": "synthetic-data-kit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "synthetic-data-kit"
}
        
Elapsed time: 1.98332s