llm-to-corpus


* Name: llm-to-corpus
* Version: 0.0.3
* Summary: Large language model to corpus
* Home page: https://github.com/jordimas/llm-to-corpus/
* Author: Jordi Mas
* License: MIT
* Upload time: 2023-07-07 11:22:11
* Requirements: none recorded

[![PyPI version](https://img.shields.io/pypi/v/llm-to-corpus.svg?logo=pypi&logoColor=FFE873)](https://pypi.org/project/llm-to-corpus/)

# Introduction

The goal of this tool is to apply large language model (LLM) operations to a monolingual corpus in order to generate a parallel corpus.

Use cases:

* Asking a model to translate, summarize, or paraphrase the original sentences so that its performance can be benchmarked.
* Generating corpora from a monolingual corpus, for example a translated corpus.
* Testing a prompt over a list of sentences while developing prompts for your application, so that you can evaluate the results.

You provide an input file and a prompt, and the tool generates a target corpus:
![Alt text](docs/flow.svg?raw=true "Sample of the flow")
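
The flow above can be pictured with a minimal Python sketch. It is not the tool's actual implementation: the function names, the way the prompt and the sentence are joined, and the `fake_model` stand-in are all assumptions made for illustration, and a real run would call an LLM instead.

```python
from typing import Callable, Iterable, List


def build_request(prompt: str, sentence: str) -> str:
    """Combine the instruction and one source sentence (assumed format)."""
    return f"{prompt} {sentence.strip()}"


def generate_target_corpus(
    sentences: Iterable[str],
    prompt: str,
    model: Callable[[str], str],
) -> List[str]:
    """Return one model output per input sentence, in the same order."""
    targets = []
    for sentence in sentences:
        reply = model(build_request(prompt, sentence))
        # Collapse whitespace so every output stays on a single line and
        # source and target remain aligned line by line.
        targets.append(" ".join(reply.split()))
    return targets


def fake_model(request: str) -> str:
    """Hypothetical stand-in for an LLM call: upper-cases the request."""
    return request.upper()


source = ["Hello world.\n", "How are you?\n"]
target = generate_target_corpus(source, "translate to French:", fake_model)
for line in target:
    print(line)
```

The point of the sketch is the shape of the data: one input line produces exactly one output line, which is what makes the result usable as a parallel corpus.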

# Quick start

For example, to use OpenAI ChatGPT to translate a file:

```shell
llm-to-corpus samples/eng.txt samples/fra.txt "translate to French"
```

To see the available models and options:

```shell
llm-to-corpus --help
```

# Usage

## Evaluation with ChatGPT

Translate the Flores200 corpus into Catalan with ChatGPT, then install sacreBLEU and score the output against the Catalan reference using BLEU and chrF:

```shell
llm-to-corpus samples/flores200.eng chatgpt.txt "Translate to Catalan the following text:"
```

```shell
pip install sacrebleu
```

```shell
sacrebleu samples/flores200.cat -i chatgpt.txt -m bleu chrf --format text
```
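
sacreBLEU compares the hypothesis and reference files line by line, so the two files need the same number of lines. A model that returns an empty or multi-line answer can break that alignment. The following sketch shows one way to check this before scoring; the helper names are hypothetical, and small temporary files stand in for the real corpora.

```python
import tempfile
from pathlib import Path
from typing import List


def read_lines(path: Path) -> List[str]:
    """Read a UTF-8 text file and return its lines without line endings."""
    return path.read_text(encoding="utf-8").splitlines()


def find_alignment_problems(reference: Path, hypothesis: Path) -> List[str]:
    """List reasons why the two files may not be scored line by line."""
    ref_lines = read_lines(reference)
    hyp_lines = read_lines(hypothesis)
    problems = []
    if len(ref_lines) != len(hyp_lines):
        problems.append(
            f"line count differs: reference={len(ref_lines)}, "
            f"hypothesis={len(hyp_lines)}"
        )
    # An empty hypothesis line usually means the model returned nothing.
    for number, line in enumerate(hyp_lines, start=1):
        if not line.strip():
            problems.append(f"hypothesis line {number} is empty")
    return problems


# Demonstration with throwaway files instead of the real corpora.
with tempfile.TemporaryDirectory() as tmp:
    reference = Path(tmp) / "reference.txt"
    hypothesis = Path(tmp) / "hypothesis.txt"
    reference.write_text("Bon dia.\nCom estàs?\n", encoding="utf-8")
    hypothesis.write_text("Bon dia.\n", encoding="utf-8")
    problems = find_alignment_problems(reference, hypothesis)

print(problems)
```

An empty list means the files line up and can be passed to `sacrebleu` as shown above.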



## Evaluation with Bloom

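Translate the Flores200 corpus into Catalan with the `mt0-xxl-mt` model, selected with `--model`, then install sacreBLEU and score the output against the Catalan reference using BLEU and chrF: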

```shell
llm-to-corpus samples/flores200.eng bloom.txt "Translate to Catalan the following text:" --model mt0-xxl-mt
```

```shell
pip install sacrebleu
```

```shell
sacrebleu samples/flores200.cat -i bloom.txt -m bleu chrf --format text
```




            
