| Field | Value |
|-------|-------|
| Name | dataset-translator |
| Version | 0.1.2 |
| home_page | None |
| Summary | ⚡️ Efficient dataset translation using Google Translate's API |
| upload_time | 2025-02-02 15:11:37 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.11 |
| license | MIT |
| keywords | dataset, translate, hf, google, api |
# `dataset-translator`
[![PyPI version](https://badge.fury.io/py/dataset-translator.svg?icon=si%3Apython)](https://pypi.org/project/dataset-translator/)
[![GitHub issues](https://img.shields.io/github/issues/ivanvmoreno/dataset-translator)](https://github.com/ivanvmoreno/dataset-translator/issues)
![License](https://img.shields.io/github/license/ivanvmoreno/dataset-translator)
A robust CLI tool for translating text columns in datasets using Google Translate, with support for protected words, retries, and checkpoint recovery.
## Features
- **⚡️ Asynchronous**
  - Leverages Python’s asyncio for concurrent translation of text batches.
- **📦 Batch Processing**
  - Translates texts in batches to improve API efficiency.
- **💾 Checkpointing**
  - Saves completed translations periodically to prevent data loss during long-running tasks. Supports resuming from the last checkpoint.
- **🔄 Retry Mechanism**
  - Automatically retries failed translation batches with exponential backoff.
- **🛡️ Protected Words**
  - Preserves specific terms/phrases from being translated.
- **🚑 Failure Handling**
  - Supports re-processing of previously failed translations using a dedicated "only-failed" mode.
- **🌐 Proxy Support**
  - Supports HTTP/HTTPS proxies for network requests.
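The batching, bounded concurrency, and retry-with-backoff features can be pictured with a minimal sketch. This is not the tool's actual implementation: `translate_batch` is a hypothetical stand-in for a real translation request, and the function names, the `fail_rate` knob, and the delay values are all assumptions made for illustration.

```python
import asyncio
import random


async def translate_batch(texts, fail_rate=0.0):
    """Hypothetical stand-in for a real translation request."""
    await asyncio.sleep(0)  # yield control, as a network call would
    if random.random() < fail_rate:
        raise RuntimeError("simulated translation failure")
    return [f"<{t}>" for t in texts]


async def translate_with_retry(texts, max_retries=3, base_delay=0.01, fail_rate=0.0):
    """Retry one batch, doubling the wait after each failed attempt."""
    for attempt in range(max_retries + 1):
        try:
            return await translate_batch(texts, fail_rate)
        except RuntimeError:
            if attempt == max_retries:
                return None  # give up: the caller records the batch as failed
            await asyncio.sleep(base_delay * 2 ** attempt)


async def translate_all(texts, batch_size=2, max_concurrency=2, **kwargs):
    """Split texts into batches and translate them with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def worker(batch):
        async with semaphore:  # at most max_concurrency batches in flight
            return await translate_with_retry(batch, **kwargs)

    # gather returns results in the same order as the input batches
    return await asyncio.gather(*(worker(b) for b in batches))


results = asyncio.run(translate_all(["a", "b", "c", "d", "e"]))
print(results)
```

A batch that still fails after its last retry comes back as `None` here, which is one simple way to mark it for a later "only-failed" pass.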
## Installation
```bash
pip install dataset-translator
```
## Usage
```bash
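# Translate the instruction and output columns from English (en) to Basque (eu),
# writing the results to ./output
dataset-translator <path_to_dataset> ./output en eu \
  -c instruction -c output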
```
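When the tool is driven from a script, the same invocation can be assembled programmatically. The sketch below only builds the argument list for the CLI; it does not call any Python API of the package. The helper name `build_command`, its parameter names, the dataset path, and the option values are hypothetical, while the flags themselves are the ones listed under Key Options.

```python
import shlex


def build_command(dataset, output_dir, source, target, columns, **options):
    """Assemble the argument list for the documented CLI invocation."""
    cmd = ["dataset-translator", dataset, output_dir, source, target]
    for column in columns:
        cmd += ["-c", column]  # the flag is repeated once per column
    for name, value in options.items():
        # batch_size=20 becomes "--batch-size 20"
        cmd += [f"--{name.replace('_', '-')}", str(value)]
    return cmd


cmd = build_command(
    "data/train.parquet",  # hypothetical dataset path
    "./output", "en", "eu",
    columns=["instruction", "output"],
    batch_size=20,       # arbitrary example value
    max_concurrency=5,   # arbitrary example value
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with paths or column names that contain spaces.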
### Key Options
| Option | Description |
|--------|-------------|
| `--columns \| -c` | Columns to translate. Repeat the flag to pass several columns. Required unless using `--only-failed`. |
| `--protected-words \| -p` | Protected words, given as a comma-separated list or as `@file.txt` to read them from a file. |
| `--file-format \| -f` | File format to use: `csv`, `parquet`, or `auto` (automatic detection; default: `auto`). |
| `--batch-size \| -b` | Number of texts per translation request (default: `1`). |
| `--max-concurrency` | Maximum concurrent translation requests (default: `1`). |
| `--checkpoint-step` | Number of successful translations between checkpoints (default: `500`). |
| `--max-retries` | Maximum retry attempts per batch before marking as failed (default: `3`). |
| `--max-failure-cycles` | Number of full retry cycles for previously failed translations (default: `3`). |
| `--only-failed` | Process only previously failed translations from the checkpoint directory (default: `False`). |
| `--proxy` | HTTP/HTTPS proxy URL, including the protocol (e.g., `http://<proxy_host>:<proxy_port>`). |
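One common way to keep protected words out of a translator's reach is to swap them for placeholders before the request and restore them afterwards. The sketch below illustrates that idea together with parsing of the two `--protected-words` input forms. It is an assumption about how such a feature can work, not a description of this tool's internals: the function names, the `__N__` placeholder format, and the one-term-per-line file layout are all hypothetical.

```python
import re
from pathlib import Path


def load_protected_words(spec):
    """Parse a comma-separated list, or read terms from '@file' (assumed one per line)."""
    if spec.startswith("@"):
        lines = Path(spec[1:]).read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    return [word.strip() for word in spec.split(",") if word.strip()]


def mask(text, words):
    """Replace each protected term with a numbered placeholder."""
    # Longest terms first so that "New York" wins over "York".
    ordered = sorted(words, key=len, reverse=True)
    if not ordered:
        return text, []
    pattern = re.compile("|".join(re.escape(w) for w in ordered))
    found = []

    def substitute(match):
        found.append(match.group(0))
        return f"__{len(found) - 1}__"

    return pattern.sub(substitute, text), found


def unmask(text, found):
    """Put the original terms back in place of their placeholders."""
    return re.sub(r"__(\d+)__", lambda m: found[int(m.group(1))], text)


words = load_protected_words("PyTorch, Hugging Face")
masked, found = mask("Train it with PyTorch on Hugging Face.", words)
restored = unmask(masked, found)
print(masked)
print(restored)
```

A real translation service may alter or reorder placeholders, so a production version would need to verify that every placeholder survived the round trip.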
### Supported Languages
The following languages are supported by the service at `translate.googleapis.com` without restrictions or a subscription:
| Code | Language |
|----------|--------------------------|
| af | Afrikaans |
| sq | Albanian |
| am | Amharic |
| ar | Arabic |
| hy | Armenian |
| as | Assamese |
| ay | Aymara |
| az | Azerbaijani |
| bm | Bambara |
| eu | Basque |
| be | Belarusian |
| bn | Bengali |
| bho | Bhojpuri |
| bs | Bosnian |
| bg | Bulgarian |
| ca | Catalan |
| ceb | Cebuano |
| ny | Chichewa |
| zh-CN | Chinese (Simplified) |
| zh-TW | Chinese (Traditional) |
| co | Corsican |
| hr | Croatian |
| cs | Czech |
| da | Danish |
| fa-AF | Dari |
| dv | Dhivehi |
| doi | Dogri |
| nl | Dutch |
| en | English |
| eo | Esperanto |
| et | Estonian |
| ee | Ewe |
| tl | Filipino |
| fi | Finnish |
| fr | French |
| fy | Frisian |
| gl | Galician |
| ka | Georgian |
| de | German |
| el | Greek |
| gn | Guarani |
| gu | Gujarati |
| ht | Haitian Creole |
| ha | Hausa |
| haw | Hawaiian |
| iw | Hebrew |
| hi | Hindi |
| hmn | Hmong |
| hu | Hungarian |
| is | Icelandic |
| ig | Igbo |
| ilo | Ilocano |
| id | Indonesian |
| ga | Irish |
| it | Italian |
| ja | Japanese |
| jw | Javanese |
| kn | Kannada |
| kk | Kazakh |
| km | Khmer |
| rw | Kinyarwanda |
| gom | Konkani |
| ko | Korean |
| kri | Krio |
| ku | Kurdish (Kurmanji) |
| ckb | Kurdish (Sorani) |
| ky | Kyrgyz |
| lo | Lao |
| la | Latin |
| lv | Latvian |
| ln | Lingala |
| lt | Lithuanian |
| lg | Luganda |
| lb | Luxembourgish |
| mk | Macedonian |
| mai | Maithili |
| mg | Malagasy |
| ms | Malay |
| ms-Arab | Malay (Jawi) |
| ml | Malayalam |
| mt | Maltese |
| mi | Maori |
| mr | Marathi |
| mni-Mtei | Meiteilon (Manipuri) |
| lus | Mizo |
| mn | Mongolian |
| my | Myanmar (Burmese) |
| ne | Nepali |
| bm-Nkoo | NKo |
| no | Norwegian |
| or | Odia (Oriya) |
| om | Oromo |
| ps | Pashto |
| fa | Persian |
| pl | Polish |
| pt | Portuguese (Brazil) |
| pt-PT | Portuguese (Portugal) |
| pa | Punjabi (Gurmukhi) |
| pa-Arab | Punjabi (Shahmukhi) |
| qu | Quechua |
| ro | Romanian |
| ru | Russian |
| sm | Samoan |
| sa | Sanskrit |
| gd | Scots Gaelic |
| nso | Sepedi |
| sr | Serbian |
| st | Sesotho |
| sn | Shona |
| sd | Sindhi |
| si | Sinhala |
| sk | Slovak |
| sl | Slovenian |
| so | Somali |
| es | Spanish |
| su | Sundanese |
| sw | Swahili |
| sv | Swedish |
| tg | Tajik |
| ta | Tamil |
| tt | Tatar |
| te | Telugu |
| th | Thai |
| ti | Tigrinya |
| ts | Tsonga |
| tr | Turkish |
| tk | Turkmen |
| ak | Twi |
| uk | Ukrainian |
| ur | Urdu |
| ug | Uyghur |
| uz | Uzbek |
| vi | Vietnamese |
| cy | Welsh |
| xh | Xhosa |
| yi | Yiddish |
| yo | Yoruba |
| zu | Zulu |
[Source](https://github.com/ssut/py-googletrans/issues/408#issuecomment-2246262832)
Raw data
{
"_id": null,
"home_page": null,
"name": "dataset-translator",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": null,
"keywords": "dataset, translate, hf, google, api",
"author": null,
"author_email": "Iv\u00e1n Moreno <ivan@ivan.build>",
"download_url": "https://files.pythonhosted.org/packages/8c/af/1bf0ea1aefc44aad5635d58a69f2a212e08e1a72414adeea70a4abe2c3a4/dataset_translator-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "\u26a1\ufe0f Efficient dataset translation using Google Translate's API",
"version": "0.1.2",
"project_urls": {
"Issues": "https://github.com/ivanvmoreno/dataset-translator/issues",
"Repository": "https://github.com/ivanvmoreno/dataset-translator"
},
"split_keywords": [
"dataset",
" translate",
" hf",
" google",
" api"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "f76beca8cee2ab351fa14946b76bea26a4e2170a312ff2fc15c1061ac0df0dd1",
"md5": "478e1c25a60e2deaefa2c44237d42d2d",
"sha256": "284362d0808b02f69baa6675c4cd6d010407ec0d74f2e529cea2d162c5b92e36"
},
"downloads": -1,
"filename": "dataset_translator-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "478e1c25a60e2deaefa2c44237d42d2d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11",
"size": 8945,
"upload_time": "2025-02-02T15:11:35",
"upload_time_iso_8601": "2025-02-02T15:11:35.915653Z",
"url": "https://files.pythonhosted.org/packages/f7/6b/eca8cee2ab351fa14946b76bea26a4e2170a312ff2fc15c1061ac0df0dd1/dataset_translator-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8caf1bf0ea1aefc44aad5635d58a69f2a212e08e1a72414adeea70a4abe2c3a4",
"md5": "0ddd3ecf0ca37bc64dd3486f365acf42",
"sha256": "161641173eeca31235f9336b4a7f383471b5fd39bafa1966e5a0ffa245876c37"
},
"downloads": -1,
"filename": "dataset_translator-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "0ddd3ecf0ca37bc64dd3486f365acf42",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11",
"size": 9060,
"upload_time": "2025-02-02T15:11:37",
"upload_time_iso_8601": "2025-02-02T15:11:37.683577Z",
"url": "https://files.pythonhosted.org/packages/8c/af/1bf0ea1aefc44aad5635d58a69f2a212e08e1a72414adeea70a4abe2c3a4/dataset_translator-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-02 15:11:37",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ivanvmoreno",
"github_project": "dataset-translator",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "dataset-translator"
}