# Synda
> [!WARNING]
> This project is in an early stage of development and should not be used in production environments.

> [!NOTE]
> PRs are more than welcome. Check the roadmap if you want to contribute, or open a discussion to submit a use case.
Synda (*synthetic data*) is a package that allows you to create synthetic data generation pipelines.
It is opinionated and fast by design, with plans to become highly configurable in the future.
## Installation
Synda requires Python 3.10 or higher.
You can install Synda using pipx:
```bash
pipx install synda
```
## Usage
1. Create a YAML configuration file (e.g., `config.yaml`) that defines your pipeline:
```yaml
input:
  type: csv
  properties:
    path: tests/stubs/simple_pipeline/source.csv
    target_column: content
    separator: "\t"

pipeline:
  - type: split
    method: chunk
    name: chunk_faq
    parameters:
      size: 500
      # overlap: 20

  - type: split
    method: separator
    name: sentence_chunk_faq
    parameters:
      separator: .
      keep_separator: true

  - type: generation
    method: llm
    parameters:
      provider: openai
      model: gpt-4o-mini
      template: |
        Ask a question regarding the sentence about the content.
        content: {chunk_faq}
        sentence: {sentence_chunk_faq}

        Instructions :
        1. Use english only
        2. Keep it short

        question:

  - type: clean
    method: deduplicate-tf-idf
    parameters:
      strategy: fuzzy
      similarity_threshold: 0.9
      keep: first

  - type: ablation
    method: llm-judge-binary
    parameters:
      provider: openai
      model: gpt-4o-mini
      consensus: all # any, majority
      criteria:
        - Is the question written in english?
        - Is the question consistent?

output:
  type: csv
  properties:
    path: tests/stubs/simple_pipeline/output.csv
    separator: "\t"
```
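The `input` section above expects a tab-separated CSV whose `content` column holds the text to process. An illustrative `source.csv` might look like this (these rows are made up for illustration; the actual stub file in the repository may differ):

```csv
content
Synda is a package for building synthetic data generation pipelines. It is opinionated and fast by design.
Each pipeline step transforms the output of the previous one.
```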
2. Add a model provider:
```bash
synda provider add openai --api-key [YOUR_API_KEY]
```
3. Generate some synthetic data:
```bash
synda generate config.yaml
```
## Pipeline Structure
The Synda pipeline consists of three main parts:
- **Input**: Data source configuration
- **Pipeline**: Sequence of transformation and generation steps
- **Output**: Configuration for the generated data output
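Put together, every configuration follows the same top-level shape. A minimal skeleton, with placeholder values derived from the full example above:

```yaml
input:       # where the source data comes from
  type: csv
  properties:
    path: path/to/source.csv
    target_column: content

pipeline:    # ordered list of steps, each with a type, method, and parameters
  - type: split
    method: chunk
    parameters:
      size: 500

output:      # where the generated data is written
  type: csv
  properties:
    path: path/to/output.csv
```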
### Available Pipeline Steps
Currently, Synda supports five pipeline step types (most of them shown in the example above):
- **split**: Breaks data down (`method: chunk` or `method: separator`)
- **generation**: Generates content using LLMs (`method: llm`)
- **clean**: Removes duplicate data (`method: deduplicate-tf-idf`)
- **ablation**: Filters data based on defined criteria (`method: llm-judge-binary`)
- **metadata**: Adds metadata to text (`method: word-position`)
More steps will be added in future releases.
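The `metadata` step does not appear in the example pipeline above. Based on its listed method, its entry would follow the same shape as the other steps; whether `word-position` takes additional parameters is not documented here, so none are shown:

```yaml
pipeline:
  - type: metadata
    method: word-position
```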
## Roadmap
The following features are planned for future releases.
### Core
- [x] Implement a Proof of Concept
- [x] Implement a common interface (Node) for input and output of each step
- [x] Add SQLite support
- [x] Add setter command for provider variable (openai, etc.)
- [x] Store each execution and step in DB
- [x] Add "split" -> "separator" step
- [x] Add named step
- [x] Store each Node in DB
- [x] Add "clean" -> "deduplicate" step
- [x] Allow injecting params from distant step into prompt
- [x] Add Ollama with structured generation output
- [x] Retry a failed run
- [ ] Add vLLM with structured generation output
- [ ] Batch processing logic (via parameters) for LLM steps
- [ ] Move input into pipeline (step type: 'load')
- [ ] Move output into pipeline (step type: 'export')
- [ ] Allow pausing and resuming pipelines
- [ ] Trace each piece of synthetic data with its history
- [ ] Enable caching of each step's output
- [ ] Implement custom scriptable steps for developers
- [ ] Use Ray for large workloads
- [ ] Add a programmatic API
### Steps
- [x] input/output: .xls format
- [ ] input/output: Hugging Face datasets
- [ ] chunk: Semantic chunks
- [ ] clean: embedding deduplication
- [ ] ablation: LLMs as juries
- [ ] masking: NER (GliNER)
- [ ] masking: Regexp
- [ ] masking: PII
- [ ] metadata: Word position
- [ ] metadata: Regexp
### Ideas
- [ ] translations (SeamlessM4T)
- [ ] speech-to-text
- [ ] text-to-speech
- [ ] metadata extraction
- [ ] tSNE / PCA
- [ ] custom steps?
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.