# Synda
> [!WARNING]
> This project is in an early stage of development and should not be used in production environments.

> [!NOTE]
> PRs are more than welcome. Check the roadmap if you want to contribute, or open a discussion to submit a use case.
Synda (*synthetic data*) is a package that allows you to create synthetic data generation pipelines.
It is opinionated and fast by design, with plans to become highly configurable in the future.
## Installation
Synda requires Python 3.10 or higher.
You can install Synda using pipx:
```bash
pipx install synda
```
## Usage
1. Create a YAML configuration file (e.g., `config.yaml`) that defines your pipeline:
```yaml
input:
  type: csv
  properties:
    path: tests/stubs/simple_pipeline/source.csv
    target_column: content
    separator: "\t"

pipeline:
  - type: split
    method: chunk
    name: chunk_faq
    parameters:
      size: 500
      # overlap: 20

  - type: split
    method: separator
    name: sentence_chunk_faq
    parameters:
      separator: .
      keep_separator: true

  - type: generation
    method: llm
    parameters:
      provider: openai
      model: gpt-4o-mini
      template: |
        Ask a question regarding the sentence about the content.
        content: {chunk_faq}
        sentence: {sentence_chunk_faq}

        Instructions :
        1. Use english only
        2. Keep it short

        question:

  - type: clean
    method: deduplicate-tf-idf
    parameters:
      strategy: fuzzy
      similarity_threshold: 0.9
      keep: first

  - type: ablation
    method: llm-judge-binary
    parameters:
      provider: openai
      model: gpt-4o-mini
      consensus: all # any, majority
      criteria:
        - Is the question written in english?
        - Is the question consistent?

output:
  type: csv
  properties:
    path: tests/stubs/simple_pipeline/output.csv
    separator: "\t"
```
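The `input` section above expects a tab-separated CSV whose `content` column holds the text to process. An illustrative `source.csv` might look like this (these rows are made up for illustration; the actual stub file in the repository may differ):

```csv
content
Synda is a package for building synthetic data generation pipelines. It is opinionated and fast by design.
Each pipeline step transforms the output of the previous one.
```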
2. Add a model provider:
```bash
synda provider add openai --api-key [YOUR_API_KEY]
```
3. Generate some synthetic data:
```bash
synda generate config.yaml
```
## Pipeline Structure
The Synda pipeline consists of three main parts:
- **Input**: Data source configuration
- **Pipeline**: Sequence of transformation and generation steps
- **Output**: Configuration for the generated data output
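Put together, every configuration follows the same top-level shape. A minimal skeleton, with placeholder values derived from the full example above:

```yaml
input:       # where the source data comes from
  type: csv
  properties:
    path: path/to/source.csv
    target_column: content

pipeline:    # ordered list of steps, each with a type, method, and parameters
  - type: split
    method: chunk
    parameters:
      size: 500

output:      # where the generated data is written
  type: csv
  properties:
    path: path/to/output.csv
```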
### Available Pipeline Steps
Currently, Synda supports five pipeline step types (most of them shown in the example above):
- **split**: Breaks data down (`method: chunk` or `method: separator`)
- **generation**: Generates content using LLMs (`method: llm`)
- **clean**: Removes duplicate data (`method: deduplicate-tf-idf`)
- **ablation**: Filters data based on defined criteria (`method: llm-judge-binary`)
- **metadata**: Adds metadata to text (`method: word-position`)
More steps will be added in future releases.
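The `metadata` step does not appear in the example pipeline above. Based on its listed method, its entry would follow the same shape as the other steps; whether `word-position` takes additional parameters is not documented here, so none are shown:

```yaml
pipeline:
  - type: metadata
    method: word-position
```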
## Roadmap
The following features are planned for future releases.
### Core
- [x] Implement a Proof of Concept
- [x] Implement a common interface (Node) for input and output of each step
- [x] Add SQLite support
- [x] Add setter command for provider variable (openai, etc.)
- [x] Store each execution and step in DB
- [x] Add "split" -> "separator" step
- [x] Add named step
- [x] Store each Node in DB
- [x] Add "clean" -> "deduplicate" step
- [x] Allow injecting params from distant step into prompt
- [x] Add Ollama with structured generation output
- [x] Retry a failed run
- [ ] Add vLLM with structured generation output
- [ ] Batch processing logic (via parameters) for LLM steps
- [ ] Move input into pipeline (step type: 'load')
- [ ] Move output into pipeline (step type: 'export')
- [ ] Allow pausing and resuming pipelines
- [ ] Trace each piece of synthetic data with its history
- [ ] Enable caching of each step's output
- [ ] Implement custom scriptable steps for developers
- [ ] Use Ray for large workloads
- [ ] Add a programmatic API
### Steps
- [x] input/output: .xls format
- [ ] input/output: Hugging Face datasets
- [ ] chunk: Semantic chunks
- [ ] clean: embedding deduplication
- [ ] ablation: LLMs as juries
- [ ] masking: NER (GliNER)
- [ ] masking: Regexp
- [ ] masking: PII
- [ ] metadata: Word position
- [ ] metadata: Regexp
### Ideas
- [ ] translations (SeamlessM4T)
- [ ] speech-to-text
- [ ] text-to-speech
- [ ] metadata extraction
- [ ] tSNE / PCA
- [ ] custom steps?
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.