# Introduction
The goal of this tool is to apply Large Language Model operations to a monolingual corpus to generate a parallel corpus.
Use cases:
* Asking a model to translate, summarize, or paraphrase the original sentences so that you can benchmark its performance
* Generating corpora from a monolingual corpus, for example a translated corpus
* Testing a prompt over a list of sentences for evaluation while developing prompts for your application
You provide an input file and a prompt, and the tool generates a target corpus.

# Quick start
For example, to use OpenAI ChatGPT to translate a file:
```shell
llm-to-corpus samples/eng.txt samples/fra.txt "translate to French"
```
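To inspect the result as a parallel corpus, the source and target files can be joined side by side. The following is a minimal sketch, assuming the tool writes exactly one output line per input line (the text above does not state this explicitly). `parallel_tsv` is a hypothetical helper name, not part of `llm-to-corpus`; `paste` is the standard POSIX utility.

```shell
# Hypothetical helper: join a source file and a target file into
# tab-separated pairs, one pair per line.
# Assumes both files hold one segment per line and are line-aligned.
parallel_tsv() {
    src="$1"
    tgt="$2"
    paste "$src" "$tgt"
}

# Example usage with the Quick start files (run once both files exist):
# parallel_tsv samples/eng.txt samples/fra.txt > eng-fra.tsv
```

If the segments themselves contain tab characters, the resulting columns become ambiguous, so a different delimiter (`paste -d`) may be a better fit in that case.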
To see the available models and options:
```shell
llm-to-corpus --help
```
# Usage
## Evaluation with ChatGPT
Translate the Flores200 corpus to evaluate the quality of the Catalan translation, then score the output against the reference with sacrebleu:
```shell
llm-to-corpus samples/flores200.eng chatgpt.txt "Translate to Catalan the following text:"
```
```shell
pip install sacrebleu
```
```shell
sacrebleu samples/flores200.cat -i chatgpt.txt -m bleu chrf --format text
```
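sacrebleu expects the hypothesis and reference files to have the same number of lines, so a quick check before scoring can catch a truncated or misaligned output. This is a hedged sketch: `same_line_count` is a hypothetical helper, not something provided by `llm-to-corpus` or sacrebleu.

```shell
# Hypothetical helper: succeed only if two files have the same number of lines.
same_line_count() {
    # Arithmetic expansion strips the padding some wc implementations add.
    hyp_lines=$(($(wc -l < "$1")))
    ref_lines=$(($(wc -l < "$2")))
    if [ "$hyp_lines" -eq "$ref_lines" ]; then
        return 0
    fi
    echo "line count mismatch: $1 has $hyp_lines, $2 has $ref_lines" >&2
    return 1
}

# Example usage: score only when the files are aligned.
# same_line_count chatgpt.txt samples/flores200.cat && \
#     sacrebleu samples/flores200.cat -i chatgpt.txt -m bleu chrf --format text
```

Note that `wc -l` counts newline characters, so a file whose last line lacks a trailing newline is reported as one line shorter.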
## Evaluation with Bloom
Translate the Flores200 corpus to evaluate the quality of the Catalan translation, then score the output against the reference with sacrebleu:
```shell
llm-to-corpus samples/flores200.eng bloom.txt "Translate to Catalan the following text:" --model mt0-xxl-mt
```
```shell
pip install sacrebleu
```
```shell
sacrebleu samples/flores200.cat -i bloom.txt -m bleu chrf --format text
```