# ALEA Data Generator
[![PyPI version](https://badge.fury.io/py/alea-data-generator.svg)](https://badge.fury.io/py/alea-data-generator)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python Versions](https://img.shields.io/pypi/pyversions/alea-data-generator.svg)](https://pypi.org/project/alea-data-generator/)
This is a basic synthetic data generation and perturbation library designed by the ALEA Institute
to support the creation and augmentation of data without relying on "tainted" LLMs.
Data generation techniques in this library:
* do not require the use of any LLM or external data source
* can be used with [KL3M](https://kl3m.ai), our Fairly Trained LLM
## Supported Patterns
The following data generation patterns are supported:
* [x] Simple string templates with sampled values (e.g., `This Agreement, by and between <|company:a|> and <|company:b|>, is made as of <|date|>.`)
  - [x] Faker integration for common data types (e.g., names, addresses, dates, etc.)
* [x] Large templates with sampled values (e.g., `jinja2` templates in files)
* [ ] Common document types (e.g., emails, contracts, memos, etc. using templates)
* [ ] Data perturbation (e.g., realistic errors introduced by humans, OCR, or other automated systems)
  - [x] Skipping, doubling, or transposing/swapping characters
  - [x] Skipping, doubling, or transposing/swapping tokens
  - [x] QWERTY and mobile keyboard mistakes (off-by-one key, shift errors, etc.)
  - [ ] Homophones (e.g., `their` vs. `there`)
  - [ ] Synonyms (e.g., `big` vs. `large`)
  - [ ] Negation/antonyms (e.g., `big` vs. `small`)
  - [ ] Capitalization errors (e.g., `big` vs. `Big`)
  - [ ] Punctuation errors (e.g., `big` vs. `big.`)
  - [ ] OCR-like errors (e.g., misreading characters, smudges, etc.)
* [ ] Representation conversion (e.g., `429` to `four hundred twenty-nine` or `four twenty-nine`)
* [ ] Format conversion (e.g., Markdown <-> HTML variants)
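To make the string-template pattern concrete, here is a minimal standalone sketch of how `<|kind:label|>` placeholders could be filled with sampled values. It does not use this library's API: `SAMPLERS`, `sample_date`, and `fill_template` are hypothetical names, the sample pools are made up, and the rule that a labeled placeholder reuses its value within one template is an assumption about the intended semantics.

```python
import random
import re


def sample_date(rng: random.Random) -> str:
    """Return a made-up date string such as 'June 14, 2011'."""
    month = rng.choice(["January", "June", "October"])
    return f"{month} {rng.randint(1, 28)}, {rng.randint(2000, 2024)}"


# Hypothetical sample pools; a real setup might draw these values from Faker.
SAMPLERS = {
    "company": lambda rng: rng.choice(["Acme LLC", "Globex Inc.", "Initech Corp."]),
    "date": sample_date,
}

# Matches <|kind|> or <|kind:label|>.
PLACEHOLDER = re.compile(r"<\|(\w+)(?::(\w+))?\|>")


def fill_template(template: str, rng: random.Random) -> str:
    """Replace each placeholder in the template with a sampled value.

    Assumption: a labeled placeholder such as <|company:a|> keeps the same
    value everywhere it appears; unlabeled placeholders are sampled each time.
    An unknown kind raises KeyError.
    """
    cache = {}

    def substitute(match: re.Match) -> str:
        kind, label = match.group(1), match.group(2)
        if label is None:
            return SAMPLERS[kind](rng)
        key = (kind, label)
        if key not in cache:
            cache[key] = SAMPLERS[kind](rng)
        return cache[key]

    return PLACEHOLDER.sub(substitute, template)


template = (
    "This Agreement, by and between <|company:a|> and <|company:b|>, "
    "is made as of <|date|>."
)
print(fill_template(template, random.Random(0)))
```

Passing a seeded `random.Random` keeps the output reproducible. Note that in this sketch two different labels can still draw the same value from a small pool.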
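The character-level perturbations listed above (skipping, doubling, and transposing) can be sketched with plain string slicing. This is an illustration under stated assumptions, not this library's implementation: the function names `skip_char`, `double_char`, `swap_chars`, and `perturb` are hypothetical, and the position is picked uniformly at random.

```python
import random


def skip_char(text: str, i: int) -> str:
    """Drop the character at index i."""
    return text[:i] + text[i + 1:]


def double_char(text: str, i: int) -> str:
    """Repeat the character at index i."""
    return text[:i + 1] + text[i] + text[i + 1:]


def swap_chars(text: str, i: int) -> str:
    """Transpose the characters at indices i and i + 1."""
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def perturb(text: str, rng: random.Random) -> str:
    """Apply one randomly chosen character-level error at a random position."""
    if len(text) < 2:
        return text
    operation = rng.choice([skip_char, double_char, swap_chars])
    # Stop one short of the end so swap_chars always has a right-hand neighbor.
    i = rng.randrange(len(text) - 1)
    return operation(text, i)


print(skip_char("agreement", 2))    # ageement
print(double_char("agreement", 2))  # agrreement
print(swap_chars("agreement", 2))   # agerement
print(perturb("agreement", random.Random(0)))
```

The same three operations carry over to token-level perturbation if the text is first split into a list of words and the slicing is applied to that list.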
## Future Roadmap
* Document image generation for document/OCR models
## License
The ALEA Data Generator library is released under the MIT License. See the [LICENSE](LICENSE) file for details.
Some of the data generation techniques used in this library may also retrieve data from external sources,
which have their own licensing terms. These terms are documented in the `alea-data-resources` repository:
* [alea-data-resources](https://github.com/alea-institute/alea-data-resources)
See, e.g., the CMU Pronouncing Dictionary (`cmudict`), which is used in tasks like homophonic errors:
* [cmudict metadata](https://github.com/alea-institute/alea-data-resources/blob/v0.1.0/alea_data_resources/sources/cmudict.py#L10)
## Support
If you encounter any issues or have questions about using the ALEA Data Generator library, please [open an issue](https://github.com/alea-institute/alea-data-generator/issues) on GitHub.
## Learn More
To learn more about ALEA and its software and research projects like KL3M and leeky, visit the [ALEA website](https://aleainstitute.ai/).
Raw data
{
"_id": null,
"home_page": "https://aleainstitute.ai/",
"name": "alea-data-generator",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0.0,>=3.10",
"maintainer_email": null,
"keywords": "alea, synthetic, data, kl3m",
"author": "ALEA Institute",
"author_email": "hello@aleainstitute.ai",
"download_url": "https://files.pythonhosted.org/packages/8c/3e/71b9f44c312fccabf782c6e358aa33059dd9a0473fe4add55d7f48a7ec69/alea_data_generator-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "ALEA low-level data generation techniques (procedural, KL3M)",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://aleainstitute.ai/",
"Repository": "https://github.com/alea-institute/alea-data-generator"
},
"split_keywords": [
"alea",
" synthetic",
" data",
" kl3m"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a0e927abcc363c016913e4e8581d0a78059868142a52fe35496c3fd490028468",
"md5": "95fbbaaa4fd8ec1275e08717f31d14d8",
"sha256": "2cd0eed5971f14312887c569ca1ed5a39a0573933511ca5dcd68d012c834660a"
},
"downloads": -1,
"filename": "alea_data_generator-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "95fbbaaa4fd8ec1275e08717f31d14d8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0.0,>=3.10",
"size": 31472,
"upload_time": "2024-09-15T18:18:43",
"upload_time_iso_8601": "2024-09-15T18:18:43.447288Z",
"url": "https://files.pythonhosted.org/packages/a0/e9/27abcc363c016913e4e8581d0a78059868142a52fe35496c3fd490028468/alea_data_generator-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "8c3e71b9f44c312fccabf782c6e358aa33059dd9a0473fe4add55d7f48a7ec69",
"md5": "481ac9f03b65fe15a3c3b3a56c9f1157",
"sha256": "8a7b84fbe537b583474cdd15cf48f7c9f2c95b8a3214ade2c349b1a2e724ecf5"
},
"downloads": -1,
"filename": "alea_data_generator-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "481ac9f03b65fe15a3c3b3a56c9f1157",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0.0,>=3.10",
"size": 21385,
"upload_time": "2024-09-15T18:18:45",
"upload_time_iso_8601": "2024-09-15T18:18:45.234795Z",
"url": "https://files.pythonhosted.org/packages/8c/3e/71b9f44c312fccabf782c6e358aa33059dd9a0473fe4add55d7f48a7ec69/alea_data_generator-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-15 18:18:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "alea-institute",
"github_project": "alea-data-generator",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "alea-data-generator"
}