# `dataset-translator`

| Field | Value |
|-------|-------|
| Name | dataset-translator |
| Version | 0.1.2 |
| Summary | ⚡️ Efficient dataset translation using Google Translate's API |
| Upload time | 2025-02-02 15:11:37 |
| Requires Python | >=3.11 |
| License | MIT |
| Keywords | dataset, translate, hf, google, api |

[![PyPI version](https://badge.fury.io/py/dataset-translator.svg?icon=si%3Apython)](https://pypi.org/project/dataset-translator/)
[![GitHub issues](https://img.shields.io/github/issues/ivanvmoreno/dataset-translator)](https://github.com/ivanvmoreno/dataset-translator/issues)
![License](https://img.shields.io/github/license/ivanvmoreno/dataset-translator)

A robust CLI tool for translating text columns in datasets using Google Translate, with support for protected words, retries, and checkpoint recovery.

## Features

- **⚡️ Asynchronous**
  - Leverages Python’s asyncio for concurrent translation of text batches.
- **📦 Batch Processing**
  - Translates texts in batches to improve API efficiency.
- **💾 Checkpointing**
  - Saves completed translations periodically to prevent data loss during long-running tasks. Supports resuming from the last checkpoint.
- **🔄 Retry Mechanism**
  - Automatically retries failed translation batches with exponential backoff.
- **🛡️ Protected Words**
  - Preserves specific terms/phrases from being translated.
- **🚑 Failure Handling**
  - Supports re-processing of previously failed translations using a dedicated "only-failed" mode.
- **🌐 Proxy Support**
  - Supports HTTP/HTTPS proxies for network requests.

## Installation

```bash
pip install dataset-translator
```

## Usage

```bash
dataset-translator <path_to_dataset> ./output en eu \
  -c instruction -c output
```

### Key Options

| Option | Description |
|--------|-------------|
| `--columns \| -c` | Columns to translate. Repeat the flag once per column. Required unless using `--only-failed`. |
| `--protected-words \| -p` | Comma-separated list or `@file.txt` of protected words. |
| `--file-format \| -f` | File format to use: `csv`, `parquet`, or `auto` (automatic detection; default: `auto`). |
| `--batch-size \| -b` | Number of texts per translation request (default: `1`). |
| `--max-concurrency` | Maximum concurrent translation requests (default: `1`). |
| `--checkpoint-step` | Number of successful translations between checkpoints (default: `500`). |
| `--max-retries` | Maximum retry attempts per batch before marking as failed (default: `3`). |
| `--max-failure-cycles` | Number of full retry cycles for previously failed translations (default: `3`). |
| `--only-failed` | Process only previously failed translations from the checkpoint directory (default: `False`). |
| `--proxy` | HTTP/HTTPS proxy URL, including the protocol (e.g., `http://<proxy_host>:<proxy_port>`). |

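One way to read the `--protected-words` option is as a two-step process: parse the argument, then shield the listed terms from the translator. The Python sketch below illustrates that idea only and is not taken from the tool's source. The function names, the `__PW0__` placeholder format, and the assumption that an `@file.txt` holds one term per line are all hypothetical, and a translation service is not guaranteed to leave such placeholders untouched.

```python
import re
from pathlib import Path


def parse_protected_words(value: str) -> list[str]:
    """Accept 'a,b,c' or '@path/to/file.txt' (assumed: one term per line)."""
    if value.startswith("@"):
        raw = Path(value[1:]).read_text(encoding="utf-8").splitlines()
    else:
        raw = value.split(",")
    return [word.strip() for word in raw if word.strip()]


def mask(text: str, words: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each protected term with a placeholder token."""
    mapping: dict[str, str] = {}
    # Longest terms first, so a long phrase wins over a shorter term inside it.
    for index, word in enumerate(sorted(words, key=len, reverse=True)):
        token = f"__PW{index}__"
        masked, count = re.subn(re.escape(word), token, text)
        if count:
            mapping[token] = word
            text = masked
    return text, mapping


def unmask(text: str, mapping: dict[str, str]) -> str:
    """Restore the original terms after translation."""
    for token, word in mapping.items():
        text = text.replace(token, word)
    return text
```

A caller would run `mask` before sending text to the translator and `unmask` on the result.
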
### Supported Languages

The service at `translate.googleapis.com` supports the following languages without restrictions or a subscription:

| Code     | Language                 |
|----------|--------------------------|
| af       | Afrikaans                |
| sq       | Albanian                 |
| am       | Amharic                  |
| ar       | Arabic                   |
| hy       | Armenian                 |
| as       | Assamese                 |
| ay       | Aymara                   |
| az       | Azerbaijani              |
| bm       | Bambara                  |
| eu       | Basque                   |
| be       | Belarusian               |
| bn       | Bengali                  |
| bho      | Bhojpuri                 |
| bs       | Bosnian                  |
| bg       | Bulgarian                |
| ca       | Catalan                  |
| ceb      | Cebuano                  |
| ny       | Chichewa                 |
| zh-CN    | Chinese (Simplified)     |
| zh-TW    | Chinese (Traditional)    |
| co       | Corsican                 |
| hr       | Croatian                 |
| cs       | Czech                    |
| da       | Danish                   |
| fa-AF    | Dari                     |
| dv       | Dhivehi                  |
| doi      | Dogri                    |
| nl       | Dutch                    |
| en       | English                  |
| eo       | Esperanto                |
| et       | Estonian                 |
| ee       | Ewe                      |
| tl       | Filipino                 |
| fi       | Finnish                  |
| fr       | French                   |
| fy       | Frisian                  |
| gl       | Galician                 |
| ka       | Georgian                 |
| de       | German                   |
| el       | Greek                    |
| gn       | Guarani                  |
| gu       | Gujarati                 |
| ht       | Haitian Creole           |
| ha       | Hausa                    |
| haw      | Hawaiian                 |
| iw       | Hebrew                   |
| hi       | Hindi                    |
| hmn      | Hmong                    |
| hu       | Hungarian                |
| is       | Icelandic                |
| ig       | Igbo                     |
| ilo      | Ilocano                  |
| id       | Indonesian               |
| ga       | Irish                    |
| it       | Italian                  |
| ja       | Japanese                 |
| jw       | Javanese                 |
| kn       | Kannada                  |
| kk       | Kazakh                   |
| km       | Khmer                    |
| rw       | Kinyarwanda              |
| gom      | Konkani                  |
| ko       | Korean                   |
| kri      | Krio                     |
| ku       | Kurdish (Kurmanji)       |
| ckb      | Kurdish (Sorani)         |
| ky       | Kyrgyz                   |
| lo       | Lao                      |
| la       | Latin                    |
| lv       | Latvian                  |
| ln       | Lingala                  |
| lt       | Lithuanian               |
| lg       | Luganda                  |
| lb       | Luxembourgish            |
| mk       | Macedonian               |
| mai      | Maithili                 |
| mg       | Malagasy                 |
| ms       | Malay                    |
| ms-Arab  | Malay (Jawi)             |
| ml       | Malayalam                |
| mt       | Maltese                  |
| mi       | Maori                    |
| mr       | Marathi                  |
| mni-Mtei | Meiteilon (Manipuri)     |
| lus      | Mizo                     |
| mn       | Mongolian                |
| my       | Myanmar (Burmese)        |
| ne       | Nepali                   |
| bm-Nkoo  | NKo                      |
| no       | Norwegian                |
| or       | Odia (Oriya)             |
| om       | Oromo                    |
| ps       | Pashto                   |
| fa       | Persian                  |
| pl       | Polish                   |
| pt       | Portuguese (Brazil)      |
| pt-PT    | Portuguese (Portugal)    |
| pa       | Punjabi (Gurmukhi)       |
| pa-Arab  | Punjabi (Shahmukhi)      |
| qu       | Quechua                  |
| ro       | Romanian                 |
| ru       | Russian                  |
| sm       | Samoan                   |
| sa       | Sanskrit                 |
| gd       | Scots Gaelic             |
| nso      | Sepedi                   |
| sr       | Serbian                  |
| st       | Sesotho                  |
| sn       | Shona                    |
| sd       | Sindhi                   |
| si       | Sinhala                  |
| sk       | Slovak                   |
| sl       | Slovenian                |
| so       | Somali                   |
| es       | Spanish                  |
| su       | Sundanese                |
| sw       | Swahili                  |
| sv       | Swedish                  |
| tg       | Tajik                    |
| ta       | Tamil                    |
| tt       | Tatar                    |
| te       | Telugu                   |
| th       | Thai                     |
| ti       | Tigrinya                 |
| ts       | Tsonga                   |
| tr       | Turkish                  |
| tk       | Turkmen                  |
| ak       | Twi                      |
| uk       | Ukrainian                |
| ur       | Urdu                     |
| ug       | Uyghur                   |
| uz       | Uzbek                    |
| vi       | Vietnamese               |
| cy       | Welsh                    |
| xh       | Xhosa                    |
| yi       | Yiddish                  |
| yo       | Yoruba                   |
| zu       | Zulu                     |

[Source](https://github.com/ssut/py-googletrans/issues/408#issuecomment-2246262832)

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "dataset-translator",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "dataset, translate, hf, google, api",
    "author": null,
    "author_email": "Iv\u00e1n Moreno <ivan@ivan.build>",
    "download_url": "https://files.pythonhosted.org/packages/8c/af/1bf0ea1aefc44aad5635d58a69f2a212e08e1a72414adeea70a4abe2c3a4/dataset_translator-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "\u26a1\ufe0f Efficient dataset translation using Google Translate's API",
    "version": "0.1.2",
    "project_urls": {
        "Issues": "https://github.com/ivanvmoreno/dataset-translator/issues",
        "Repository": "https://github.com/ivanvmoreno/dataset-translator"
    },
    "split_keywords": [
        "dataset",
        " translate",
        " hf",
        " google",
        " api"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f76beca8cee2ab351fa14946b76bea26a4e2170a312ff2fc15c1061ac0df0dd1",
                "md5": "478e1c25a60e2deaefa2c44237d42d2d",
                "sha256": "284362d0808b02f69baa6675c4cd6d010407ec0d74f2e529cea2d162c5b92e36"
            },
            "downloads": -1,
            "filename": "dataset_translator-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "478e1c25a60e2deaefa2c44237d42d2d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 8945,
            "upload_time": "2025-02-02T15:11:35",
            "upload_time_iso_8601": "2025-02-02T15:11:35.915653Z",
            "url": "https://files.pythonhosted.org/packages/f7/6b/eca8cee2ab351fa14946b76bea26a4e2170a312ff2fc15c1061ac0df0dd1/dataset_translator-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8caf1bf0ea1aefc44aad5635d58a69f2a212e08e1a72414adeea70a4abe2c3a4",
                "md5": "0ddd3ecf0ca37bc64dd3486f365acf42",
                "sha256": "161641173eeca31235f9336b4a7f383471b5fd39bafa1966e5a0ffa245876c37"
            },
            "downloads": -1,
            "filename": "dataset_translator-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "0ddd3ecf0ca37bc64dd3486f365acf42",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 9060,
            "upload_time": "2025-02-02T15:11:37",
            "upload_time_iso_8601": "2025-02-02T15:11:37.683577Z",
            "url": "https://files.pythonhosted.org/packages/8c/af/1bf0ea1aefc44aad5635d58a69f2a212e08e1a72414adeea70a4abe2c3a4/dataset_translator-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-02 15:11:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ivanvmoreno",
    "github_project": "dataset-translator",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "dataset-translator"
}
        