# Synthetic Data Kit
Tool for generating high-quality synthetic datasets to fine-tune LLMs.
Generate reasoning traces and QA pairs, and save them to a fine-tuning format, all with a simple CLI.
> [Check out our guide on using the tool to unlock task-specific reasoning in the Llama-3 family](https://github.com/meta-llama/synthetic-data-kit/tree/main/use-cases/adding_reasoning_to_llama_3)
# What does Synthetic Data Kit offer?
Fine-tuning large language models is easy: there are many mature tools you can use to fine-tune the Llama model family with various post-training techniques.
### Why target data preparation?
Multiple tools support standardized formats. However, most of the time your dataset is not structured in "user"/"assistant" threads or in a format that plays well with fine-tuning packages.
This toolkit simplifies the journey of:
- Using an LLM (vLLM or any local/external API endpoint) to generate examples
- Offering a modular 4-command flow
- Converting your existing files to fine-tuning-friendly formats
- Creating synthetic datasets
- Supporting various post-training fine-tuning formats
# How does Synthetic Data Kit offer it?
The tool is designed to follow a simple CLI structure with 4 commands:
- `ingest` various file formats
- `create` your fine-tuning format: `QA` pairs, `QA` pairs with CoT, or `summary` format
- `curate` high-quality examples using Llama as a judge
- `save-as` the result in whatever format your fine-tuning workflow requires
You can override any parameter either with CLI flags or by overriding the default YAML config.
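For example, you can point the CLI at a custom config and still tweak individual parameters inline (the flags below mirror the single-file examples later in this README):

```bash
# Load a custom config, then override the number of pairs and the curation threshold per command
synthetic-data-kit -c my_config.yaml create data/parsed/report.txt -n 30
synthetic-data-kit -c my_config.yaml curate data/generated/report_qa_pairs.json -t 8.5
```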
### Installation
#### From PyPI
```bash
# Create a new environment
conda create -n synthetic-data python=3.10
conda activate synthetic-data
pip install synthetic-data-kit
```
#### (Alternatively) From Source
```bash
git clone https://github.com/meta-llama/synthetic-data-kit.git
cd synthetic-data-kit
pip install -e .
```
To get an overview of the commands, type:
`synthetic-data-kit --help`
### 1. Tool Setup
- The tool can process both individual files and entire directories.
```bash
# Create directory structure for the 4-stage pipeline
mkdir -p data/{input,parsed,generated,curated,final}
# Or use the legacy structure (still supported)
mkdir -p data/{pdf,html,youtube,docx,ppt,txt,output,generated,cleaned,final}
```
- You also need an LLM backend to generate your dataset. If using vLLM:
```bash
# Start vLLM server
# Note: you will need your HF authentication token from https://huggingface.co/settings/tokens
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
```
### 2. Usage
The flow follows 4 simple steps: `ingest`, `create`, `curate`, `save-as`. You can process individual files or entire directories:
```bash
# Check if your backend is running
synthetic-data-kit system-check
# SINGLE FILE PROCESSING (Original approach)
# Parse a document to text
synthetic-data-kit ingest docs/report.pdf
# This saves file to data/parsed/report.txt
# Generate QA pairs (default)
synthetic-data-kit create data/parsed/report.txt --type qa
# OR
# Generate Chain of Thought (CoT) reasoning examples
synthetic-data-kit create data/parsed/report.txt --type cot
# Both of these save the file to data/generated/report_qa_pairs.json
# Filter content based on quality
synthetic-data-kit curate data/generated/report_qa_pairs.json
# Convert to alpaca fine-tuning format and save as HF arrow file
synthetic-data-kit save-as data/curated/report_cleaned.json --format alpaca --storage hf
```
### 2.1 Batch Directory Processing (New)
Process entire directories of files with a single command:
```bash
# Parse all documents in a directory
synthetic-data-kit ingest ./documents/
# Processes all .pdf, .html, .docx, .pptx, .txt files
# Saves parsed text files to data/parsed/
# Generate QA pairs for all text files
synthetic-data-kit create ./data/parsed/ --type qa
# Processes all .txt files in the directory
# Saves QA pairs to data/generated/
# Curate all generated files
synthetic-data-kit curate ./data/generated/ --threshold 8.0
# Processes all .json files in the directory
# Saves curated files to data/curated/
# Convert all curated files to training format
synthetic-data-kit save-as ./data/curated/ --format alpaca
# Processes all .json files in the directory
# Saves final files to data/final/
```
### 2.2 Preview Mode
Use `--preview` to see what files would be processed without actually processing them:
```bash
# Preview files before processing
synthetic-data-kit ingest ./documents --preview
# Shows: directory stats, file counts by extension, list of files
synthetic-data-kit create ./data/parsed --preview
# Shows: .txt files that would be processed
```
## Configuration
The toolkit uses a YAML configuration file (default: `configs/config.yaml`).
Note: this can be overridden via CLI arguments or by passing a custom YAML file.
```yaml
# Example configuration using vLLM
llm:
  provider: "vllm"

vllm:
  api_base: "http://localhost:8000/v1"
  model: "meta-llama/Llama-3.3-70B-Instruct"
  sleep_time: 0.1

generation:
  temperature: 0.7
  chunk_size: 4000
  num_pairs: 25
  max_context_length: 8000

curate:
  threshold: 7.0
  batch_size: 8
```
or using an API endpoint:
```yaml
# Example configuration using the llama API
llm:
  provider: "api-endpoint"

api-endpoint:
  api_base: "https://api.llama.com/v1"
  api_key: "llama-api-key"
  model: "Llama-4-Maverick-17B-128E-Instruct-FP8"
  sleep_time: 0.5
```
### Customizing Configuration
Create an overriding configuration file and use it with the `-c` flag:
```bash
synthetic-data-kit -c my_config.yaml ingest docs/paper.pdf
```
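A minimal override file only needs the keys you want to change. For instance (illustrative values; this assumes unspecified keys fall back to the defaults shown above):

```yaml
# my_config.yaml - partial override of the default configuration
generation:
  num_pairs: 40
  temperature: 0.5

curate:
  threshold: 8.0
```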
## Examples
### Processing a Single PDF Document
```bash
# Ingest PDF
synthetic-data-kit ingest research_paper.pdf
# Generate QA pairs
synthetic-data-kit create data/parsed/research_paper.txt -n 30
# Curate data
synthetic-data-kit curate data/generated/research_paper_qa_pairs.json -t 8.5
# Save in OpenAI fine-tuning format (JSON)
synthetic-data-kit save-as data/curated/research_paper_cleaned.json -f ft
# Save in OpenAI fine-tuning format (HF dataset)
synthetic-data-kit save-as data/curated/research_paper_cleaned.json -f ft --storage hf
```
### Processing Multiple Documents (Directory)
```bash
# Process all research papers in a directory
synthetic-data-kit ingest ./research_papers/
# Generate QA pairs for all parsed documents
synthetic-data-kit create ./data/parsed/ --type qa -n 30
# Curate all generated files
synthetic-data-kit curate ./data/generated/ -t 8.5
# Save all curated files in OpenAI fine-tuning format
synthetic-data-kit save-as ./data/curated/ -f ft --storage hf
```
### Preview Before Processing
```bash
# See what files would be processed
synthetic-data-kit ingest ./research_papers --preview
# Output:
# Directory: ./research_papers
# Total files: 15
# Supported files: 12
# Extensions: .pdf (8), .docx (3), .txt (1)
# Files: paper1.pdf, paper2.pdf, ...
# Preview with verbose output
synthetic-data-kit create ./data/parsed --preview --verbose
```
### Processing a YouTube Video
```bash
# Extract transcript
synthetic-data-kit ingest "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# Generate QA pairs from the transcript
synthetic-data-kit create data/parsed/youtube_dQw4w9WgXcQ.txt
```
### Processing Multiple Files
```bash
# NEW: Process entire directories (recommended)
synthetic-data-kit ingest ./data/input/
synthetic-data-kit create ./data/parsed/ --type qa -n 20
synthetic-data-kit curate ./data/generated/ -t 7.5
synthetic-data-kit save-as ./data/curated/ -f chatml
# LEGACY: Bash script to process multiple files (still supported)
for file in data/pdf/*.pdf; do
filename=$(basename "$file" .pdf)
synthetic-data-kit ingest "$file"
synthetic-data-kit create "data/parsed/${filename}.txt" -n 20
synthetic-data-kit curate "data/generated/${filename}_qa_pairs.json" -t 7.5
synthetic-data-kit save-as "data/curated/${filename}_cleaned.json" -f chatml
done
```
## Document Processing & Chunking
### How Chunking Works
The Synthetic Data Kit automatically handles documents of any size using an intelligent processing strategy:
- **Small documents** (< 8000 characters): Processed in a single API call for maximum context and quality
- **Large documents** (≥ 8000 characters): Automatically split into chunks with overlap to maintain context
### Controlling Chunking Behavior
You can customize chunking with CLI flags or config settings for both single files and directories:
```bash
# Single file with custom chunking
synthetic-data-kit create document.txt --type qa --chunk-size 2000 --chunk-overlap 100
# Directory processing with custom chunking
synthetic-data-kit create ./data/parsed/ --type cot --num-pairs 50 --chunk-size 6000 --verbose
# Preview directory processing with chunking details
synthetic-data-kit create ./data/parsed/ --preview --verbose
```
### Chunking Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--chunk-size` | 4000 | Size of text chunks in characters |
| `--chunk-overlap` | 200 | Overlap between chunks to preserve context |
| `--verbose` | false | Show chunking details and progress |
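As a rough illustration of how these defaults interact (assuming each new chunk starts `chunk-size` minus `chunk-overlap` characters after the previous one), a 10,000-character file would be split into about three chunks:

```bash
# Illustrative only - actual chunk boundaries are determined by the tool
# (long_report.txt is a hypothetical parsed file)
# chunk 1: characters 0-4000
# chunk 2: characters 3800-7800 (200-character overlap with chunk 1)
# chunk 3: characters 7600-10000
synthetic-data-kit create data/parsed/long_report.txt --type qa --chunk-size 4000 --chunk-overlap 200 --verbose
```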
### Understanding Chunking Output
When using `--verbose`, you'll see chunking information for both single files and directories:
```bash
# Single file verbose output
synthetic-data-kit create large_document.txt --type qa --num-pairs 20 --verbose
# Directory verbose output
synthetic-data-kit create ./data/parsed/ --type qa --num-pairs 20 --verbose
```
Output:
```
# Single file output
Generating QA pairs...
Document split into 8 chunks
Using batch size of 32
Processing 8 chunks to generate QA pairs...
Generated 3 pairs from chunk 1 (total: 3/20)
Generated 2 pairs from chunk 2 (total: 5/20)
...
Reached target of 20 pairs. Stopping processing.
Generated 20 QA pairs total (requested: 20)
# Directory output
Processing directory: ./data/parsed/
Supported files: 5 (.txt files)
Progress: ████████████████████████████████████████ 100% (5/5 files)
✓ document1.txt: Generated 20 QA pairs
✓ document2.txt: Generated 18 QA pairs
✗ document3.txt: Failed - Invalid format
✓ document4.txt: Generated 20 QA pairs
✓ document5.txt: Generated 15 QA pairs
Processing Summary:
Total files: 5
Successful: 4
Failed: 1
Total pairs generated: 73
```
### Chunking Logic
Both QA and CoT generation use the same chunking logic for files and directories:
```bash
# Single file processing
synthetic-data-kit create document.txt --type qa --num-pairs 100 --chunk-size 3000
synthetic-data-kit create document.txt --type cot --num-pairs 20 --chunk-size 3000
# Directory processing
synthetic-data-kit create ./data/parsed/ --type qa --num-pairs 100 --chunk-size 3000
synthetic-data-kit create ./data/parsed/ --type cot --num-pairs 20 --chunk-size 3000
```
## Advanced Usage
### Custom Prompt Templates
Edit the `prompts` section in your configuration file to customize generation behavior:
```yaml
prompts:
  qa_generation: |
    You are creating question-answer pairs for fine-tuning a legal assistant.
    Focus on technical legal concepts, precedents, and statutory interpretation.

    Below is a chunk of text about: {summary}...

    Create {num_pairs} high-quality question-answer pairs based ONLY on this text.

    Return ONLY valid JSON formatted as:
    [
      {
        "question": "Detailed legal question?",
        "answer": "Precise legal answer."
      },
      ...
    ]

    Text:
    ---
    {text}
    ---
```
### Mental Model:
```mermaid
graph LR
    SDK[synthetic-data-kit] --> SystemCheck[system-check]
    SDK --> Ingest[ingest]
    SDK --> Create[create]
    SDK --> Curate[curate]
    SDK --> SaveAs[save-as]

    Ingest --> PDFFile[PDF File]
    Ingest --> HTMLFile[HTML File]
    Ingest --> YouTubeURL[YouTube URL]

    Create --> CoT[CoT]
    Create --> QA[QA Pairs]
    Create --> Summary[Summary]

    Curate --> Filter[Filter by Quality]

    SaveAs --> JSONL[JSONL Format]
    SaveAs --> Alpaca[Alpaca Format]
    SaveAs --> FT[Fine-Tuning Format]
    SaveAs --> ChatML[ChatML Format]
```
## Troubleshooting FAQs:
### vLLM Server Issues
- Ensure vLLM is installed: `pip install vllm`
- Start server with: `vllm serve <model_name> --port 8000`
- Check connection: `synthetic-data-kit system-check`
### Memory Issues
If you encounter CUDA out of memory errors:
- Use a smaller model
- Reduce batch size in config
- Start vLLM with `--gpu-memory-utilization 0.85`
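For example, combining the setup command from earlier with the memory cap:

```bash
# Start vLLM with a GPU memory utilization cap (same model as in the setup step)
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000 --gpu-memory-utilization 0.85
```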
### JSON Parsing Issues
If you encounter issues with the `curate` command:
- Use the `-v` flag to enable verbose output
- Set smaller batch sizes in your config.yaml
- Ensure the LLM model supports proper JSON output
- Install json5 for enhanced JSON parsing: `pip install json5`
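A typical retry might look like this (assuming `-v` is accepted alongside the `curate` arguments, like the other per-command flags):

```bash
# Optional: json5 improves tolerance for slightly malformed JSON
pip install json5
# Re-run curation with verbose output to see where parsing fails
synthetic-data-kit curate data/generated/report_qa_pairs.json -v
```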
### Parser Errors
- Ensure required dependencies are installed for specific parsers:
- PDF: `pip install pdfminer.six`
- HTML: `pip install beautifulsoup4`
- YouTube: `pip install pytubefix youtube-transcript-api`
- DOCX: `pip install python-docx`
- PPTX: `pip install python-pptx`
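To cover every parser at once, the dependencies listed above can be installed in a single command:

```bash
pip install pdfminer.six beautifulsoup4 pytubefix youtube-transcript-api python-docx python-pptx
```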
## License
Read more about the [License](./LICENSE)
## Contributing
Contributions are welcome! [Read our contributing guide](./CONTRIBUTING.md)