opendatagen


- Name: opendatagen
- Version: 0.0.33
- Home page: https://github.com/thoddnn/open-datagen
- Summary: Data preparation system to build controllable AI system
- Upload time: 2024-02-20 17:15:13
- Author: Thomas DORDONNE
- Docs URL: None
- Requirements: No requirements were recorded.
- Travis CI: No Travis.
- Coveralls test coverage: No coveralls.
- No value recorded: maintainer, requires_python, license, keywords, bugtrack_url

# ⬜️ Open Datagen ⬜️

**Open Datagen** is a data preparation tool designed for building controllable AI systems.

It offers improvements for:

**RAG**: Generate large Q&A datasets to improve your Retrieval strategies.

**Evals**: Create unique, “unseen” datasets to robustly test your models and avoid overfitting.

**Fine-Tuning**: Produce large, low-bias, and high-quality datasets to get better models after the fine-tuning process.

**Guardrails**: Generate red-teaming datasets to strengthen the security and robustness of your Generative AI applications against attacks.

## Additional Features

- Use external sources to generate high-quality synthetic data (local files, Hugging Face datasets, and the Internet)

- Data anonymization 

- Open-source model support + local inference

- Decontamination

- Tree of thought 

- (SOON) No-code dataset generation

- (SOON) Multimodality 
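
To make the decontamination item above concrete: the usual goal is to drop generated rows that overlap too heavily with a reference set, such as an evaluation benchmark. The sketch below is a generic word n-gram overlap filter and is not opendatagen's implementation; the function names, the 5-word window, and the 0.5 threshold are all illustrative assumptions.

```python
def word_ngrams(text, n=5):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(candidate, references, n=5, threshold=0.5):
    """True if at least `threshold` of the candidate's n-grams appear in the references."""
    candidate_grams = word_ngrams(candidate, n)
    if not candidate_grams:
        return False  # Text shorter than n words: nothing to compare.
    reference_grams = set()
    for reference in references:
        reference_grams |= word_ngrams(reference, n)
    overlap = len(candidate_grams & reference_grams) / len(candidate_grams)
    return overlap >= threshold


def decontaminate(candidates, references, n=5, threshold=0.5):
    """Keep only the candidates that are not flagged as contaminated."""
    return [c for c in candidates if not is_contaminated(c, references, n, threshold)]
```

A lower threshold filters more aggressively; a larger `n` makes matches stricter, since longer word sequences must coincide.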

## Installation

```bash
pip install --upgrade opendatagen
```

### Setting up your API keys

```bash
export OPENAI_API_KEY='your_openai_api_key'  # using openai>=1.2
export MISTRAL_API_KEY='your_mistral_api_key'
export TOGETHER_API_KEY='your_together_api_key'
export ANYSCALE_API_KEY='your_anyscale_api_key'
export ELEVENLABS_API_KEY='your_elevenlabs_api_key'
export SERPLY_API_KEY='your_serply_api_key'  # Google Search API
```
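
Before starting a generation run, it can help to confirm which of these variables are actually visible to Python. The sketch below uses only the standard library. The helper name `missing_api_keys` is hypothetical and not part of opendatagen, and which keys you really need depends on the providers your template uses.

```python
import os

# Environment variable names taken from the export commands above.
API_KEY_NAMES = [
    "OPENAI_API_KEY",
    "MISTRAL_API_KEY",
    "TOGETHER_API_KEY",
    "ANYSCALE_API_KEY",
    "ELEVENLABS_API_KEY",
    "SERPLY_API_KEY",
]


def missing_api_keys(names=API_KEY_NAMES, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All listed API keys are set.")
```

Passing a plain dictionary as `environ` makes the check easy to test without touching the real environment.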

## Usage

Example: Generate a low-bias FAQ dataset based on Wikipedia content.

```python
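from opendatagen.template import TemplateManager
from opendatagen.data_generator import DataGenerator

# Output file for the generated dataset, and the template to load from the JSON file.
output_path = "opendatagen.csv"
template_name = "opendatagen"

manager = TemplateManager(template_file_path="faq_wikipedia.json")
template = manager.get_template(template_name=template_name)

# Generate only if the template lookup returned something.
if template:
    generator = DataGenerator(template=template)

    # generate_data returns two values: the generated data and the decontaminated data.
    data, data_decontaminated = generator.generate_data(
        output_path=output_path,
        output_decontaminated_path=None,
    )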
```

The `faq_wikipedia.json` template file is available [here](opendatagen/examples/faq_wikipedia.json).
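
Once a run finishes, the file written to `output_path` can be inspected with the standard library. This is a minimal sketch that assumes the output is a regular CSV file with a header row; the column names depend on your template, so none are hard-coded. The helper name `preview_csv` is hypothetical and not part of opendatagen.

```python
import csv
from itertools import islice


def preview_csv(path, limit=5):
    """Return up to `limit` rows of a CSV file as dictionaries keyed by the header."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return list(islice(reader, limit))
```

For the example above, `preview_csv("opendatagen.csv")` would return the first five rows, which is usually enough to check that the columns look as expected.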

## Contribution

We welcome contributions to Open Datagen! Whether you want to fix bugs, add templates or new features, or improve the documentation, your help is greatly appreciated.

## Acknowledgements

We would like to express our gratitude to the following open-source projects and individuals who have inspired and helped us:

- **Textbooks are all you need** ([Read the paper](https://arxiv.org/abs/2306.11644)) 

- **Evol-Instruct Paper** ([Read the paper](https://arxiv.org/abs/2306.08568)) by [WizardLM_AI](https://twitter.com/WizardLM_AI)

- **Textbook Generation** by [VikParuchuri](https://github.com/VikParuchuri/textbook_quality)

## Connect

If you need help with your Generative AI strategy, implementation, or infrastructure, reach out on:

- LinkedIn: [@Thomas](https://linkedin.com/in/thomasdordonne)
- Twitter: [@thoddnn](https://twitter.com/thoddnn)

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/thoddnn/open-datagen",
    "name": "opendatagen",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "Thomas DORDONNE",
    "author_email": "dordonne.thomas@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/d6/c6/c262ce217c8881b0cda0e40b39b3afe9937cb39897392dc05d8b5d89f24e/opendatagen-0.0.33.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Data preparation system to build controllable AI system",
    "version": "0.0.33",
    "project_urls": {
        "Homepage": "https://github.com/thoddnn/open-datagen"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "464dbe8a1ffa2e5021fc6f74d8c847f8fa9d1b4d2f65cd484d7ecba37b1a05e4",
                "md5": "c3f567a88dbdac12c773bae29f112253",
                "sha256": "19abc69339fa58faf199cb9aa658b7e87066ac92993b3b4d2b11372ebc26ac7e"
            },
            "downloads": -1,
            "filename": "opendatagen-0.0.33-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c3f567a88dbdac12c773bae29f112253",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 49048,
            "upload_time": "2024-02-20T17:15:08",
            "upload_time_iso_8601": "2024-02-20T17:15:08.915181Z",
            "url": "https://files.pythonhosted.org/packages/46/4d/be8a1ffa2e5021fc6f74d8c847f8fa9d1b4d2f65cd484d7ecba37b1a05e4/opendatagen-0.0.33-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d6c6c262ce217c8881b0cda0e40b39b3afe9937cb39897392dc05d8b5d89f24e",
                "md5": "05d71955854e7c54b29a87dfd62d08d1",
                "sha256": "71440f45d36632eef3eb8789bd8b510c079c8292c45efab13dd20235578e83ed"
            },
            "downloads": -1,
            "filename": "opendatagen-0.0.33.tar.gz",
            "has_sig": false,
            "md5_digest": "05d71955854e7c54b29a87dfd62d08d1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 29467,
            "upload_time": "2024-02-20T17:15:13",
            "upload_time_iso_8601": "2024-02-20T17:15:13.998326Z",
            "url": "https://files.pythonhosted.org/packages/d6/c6/c262ce217c8881b0cda0e40b39b3afe9937cb39897392dc05d8b5d89f24e/opendatagen-0.0.33.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-20 17:15:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "thoddnn",
    "github_project": "open-datagen",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "opendatagen"
}
        