[![PyPI version](https://badge.fury.io/py/gpt_translate.svg)](https://badge.fury.io/py/gpt_translate)
[![Weave](https://raw.githubusercontent.com/wandb/weave/master/docs/static/img/logo.svg)](https://wandb.ai/capecape/gpt-translate/weave/)
# gpt_translate: Translating MD files with GPT-4
This tool translates Markdown files without breaking the structure of the document. It is powered by OpenAI models and has multiple parsing and formatting options. The provided default example is the one we use to translate our documentation website [docs.wandb.ai](https://docs.wandb.ai) to [Japanese](https://docs.wandb.ai/ja/) and [Korean](https://docs.wandb.ai/ko/).
![](assets/screenshot.png)
> You can click [here](https://wandb.ai/capecape/gpt-translate/r/call/a18deff9-a963-4ad6-b5d6-b0ae63580575) to see the output of the translation on the screenshot above.
## Installation
We have a stable version on PyPI, so you can install it with pip:
```bash
$ pip install gpt-translate
```
or, to get the latest version, install from a clone of the repo:
```bash
$ git clone https://github.com/tcapelle/gpt_translate
$ cd gpt_translate
$ pip install .
```
Export your OpenAI API key:
```bash
export OPENAI_API_KEY=aa-proj-bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
```
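If you drive the CLI from a script, it can help to check that the key is visible before starting a long translation run. The following is a minimal sketch, not part of `gpt_translate`; the helper name `openai_key_is_set` is hypothetical.

```python
import os


def openai_key_is_set(env=None):
    """Return True when OPENAI_API_KEY is present and non-empty.

    `env` defaults to the process environment; passing a dict makes the
    check easy to try out. This helper is illustrative only.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


# Example: warn early instead of failing in the middle of a run.
if not openai_key_is_set():
    print("OPENAI_API_KEY is not set; export it before running gpt_translate.")
```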
## Usage
The library provides a set of commands that you can run from the CLI. All the commands start with `gpt_translate.`:
- `gpt_translate.file`: Translate a single file
- `gpt_translate.folder`: Translate a folder recursively
- `gpt_translate.files`: Translate a list of files; accepts a `.txt` file listing the files as input.
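As a rough sketch of how such a `.txt` list could be produced, the snippet below collects every Markdown file under a folder. It assumes the list holds one path per line and that paths include the input folder prefix; neither is confirmed here, so check `gpt_translate.files --help` for the expected format. The function name `write_file_list` is hypothetical.

```python
from pathlib import Path


def write_file_list(input_folder, list_path, pattern="*.md"):
    """Write all files under `input_folder` matching `pattern` to `list_path`.

    Assumption: one path per line, written with forward slashes.
    Returns the list of paths that were written.
    """
    files = sorted(
        p.as_posix() for p in Path(input_folder).rglob(pattern) if p.is_file()
    )
    text = "\n".join(files) + ("\n" if files else "")
    Path(list_path).write_text(text, encoding="utf-8")
    return files


# Example: write_file_list("docs", "list.txt")
```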
We use `gpt-4o` by default. You can change this in `configs/config.yaml`. The default values are:
```yaml
# Logs:
debug: false # Debug mode
weave_project: "gpt-translate" # Weave project
silence_openai: true # Silence OpenAI logger
# Translation:
language: "ja" # Language to translate to
config_folder: "./configs" # Config folder, where the prompts and dictionaries are
replace: true # Replace existing file
remove_comments: true # Remove comments
do_translate_header_description: true # Translate the header description
max_openai_concurrent_calls: 7 # Max number of concurrent calls to OpenAI
# Files:
input_file: "docs/intro.md" # File to translate
out_file: "intro_ja.md" # File to save the translated file to
input_folder: null # Folder to translate
out_folder: null # Folder to save the translated files to
limit: null # Limit number of files to translate
# Model:
model: "gpt-4o"
temperature: 1.0
max_tokens: 4096
```
You can override the arguments at runtime or by creating another `config.yaml` file. You can also use the `--config_path` flag to specify a different config file.
- The `--config_folder` argument points to the folder that holds the prompts and dictionaries; the actual `config.yaml` can live somewhere else. Maybe I need better naming here =P.
- You can add new languages by providing the language translation dictionaries in `configs/language_dicts`
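To give an intuition for what a setting like `max_openai_concurrent_calls` controls, here is a generic sketch of capping concurrent calls with `asyncio.Semaphore`. It is not the library's implementation; `run_with_limit` and `fake_translate` are hypothetical names, and the fake worker stands in for a real API call.

```python
import asyncio


async def run_with_limit(items, worker, max_concurrent=7):
    """Run `worker(item)` for every item, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as `items`.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_translate(text):
    await asyncio.sleep(0.01)  # stands in for a network call
    return text.upper()


results = asyncio.run(run_with_limit(["a", "b", "c"], fake_translate, max_concurrent=2))
print(results)  # ['A', 'B', 'C']
```

A lower limit is gentler on rate limits, while a higher one finishes large folders faster.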
## Examples
1. To translate a single file:
```bash
$ gpt_translate.file \
--input_file README.md \
--out_file README_es_.md \
--language es \
--config_folder ./configs
```
2. Translate a list of files from `list.txt`:
```bash
$ gpt_translate.files \
--input_file list.txt \
--input_folder docs \
--out_folder docs_ja \
--language ja \
--config_folder ./configs
```
Note that here we need to pass both an input and an output folder. The input folder is used to compute each file's relative path, so that the same folder structure can be recreated in the output folder. This is typically what you want for documentation websites that are organized in folders like `./docs`.
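The relative-path mirroring described above can be sketched with `pathlib`. This is an illustration of the idea under stated assumptions, not the library's actual code; the function name `mirrored_output_path` is hypothetical.

```python
from pathlib import PurePosixPath


def mirrored_output_path(input_file, input_folder, out_folder):
    """Map a file under `input_folder` to the same relative path under `out_folder`.

    Raises ValueError if `input_file` is not inside `input_folder`.
    """
    relative = PurePosixPath(input_file).relative_to(PurePosixPath(input_folder))
    return PurePosixPath(out_folder) / relative


print(mirrored_output_path("docs/guides/intro.md", "docs", "docs_ja"))
# prints: docs_ja/guides/intro.md
```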
3. Translate a full folder recursively:
```bash
$ gpt_translate.folder \
--input_folder docs \
--out_folder docs_ja \
--language ja \
--config_folder ./configs
```
If you don't know what to do, you can always pass `--help` to any of the commands, for example:
```bash
$ gpt_translate.folder --help
```
## Weave Tracing
The library does a lot, so keeping track of every interaction is necessary. We added [W&B Weave](https://wandb.me/weave) support to trace every call to the model and the underlying processing steps.
You can pass a project name to the CLI to trace the calls:
```bash
$ gpt_translate.folder \
--input_folder docs \
--out_folder docs_ja \
--language ja \
--weave_project gpt-translate \
--config_folder ./configs
```
![Weave Tracing](./assets/weave.gif)
## Evaluation
Once the translation is done, you can evaluate the quality of the translation by running:
```bash
$ gpt_translate.eval \
--eval_dataset "Translation-ja:latest"
```
You can iterate on the translation prompts and dictionaries to improve the quality of the translation.
![Weave Evaluation](./assets/compare_eval.png)
The evaluation config is stored in `configs/eval_config.yaml` and shares many similarities with the translation config. The `configs/evaluation_prompt.txt` file contains the prompt used by the LLM Judge to evaluate the translation quality. Feel free to play with it to find better ways to evaluate the quality of the translation according to your needs.
> Whenever you run `gpt_translate.files` or `gpt_translate.folder`, it automatically creates a new Weave Dataset with the name in the format `Translation-{language}:{timestamp}`.
![Weave Dataset](./assets/translation_ds.png)
## GitHub Action
We supply an [action.yml file](action.yml) to use this library in a GitHub Action. It has not been tested much, but it should work.
- You will need to set up your [Weights & Biases](https://wandb.ai/site) API key as a secret in your GitHub repository as `WANDB_API_KEY`.
An example workflow is shown in https://github.com/tcapelle/dummy_docs, along with the [corresponding workflow file](https://github.com/tcapelle/dummy_docs/blob/main/.github/workflows/main.yml).
## Troubleshooting
If you run into any issues, you can always pass the `--debug` flag to get more information about what is happening:
```bash
$ gpt_translate.folder ... --debug
```
This gives you very verbose output (calls to models, inputs and outputs, etc.).