| Field | Value |
|-------|-------|
| Name | fastlangid |
| Version | 1.0.11 |
| Home page | https://github.com/currentsapi/fastlangid |
| Summary | Language detection for news powered by fasttext |
| Upload time | 2022-12-06 00:04:51 |
| Author | Ray |
| License | Apache License 2.0 |
| Keywords | language detection library |
| Requirements | No requirements were recorded. |

# fastlangid

[![codecov](https://codecov.io/gh/currentsapi/fastlangid/branch/master/graph/badge.svg)](https://codecov.io/gh/currentsapi/fastlangid)  [![PyPI version](https://badge.fury.io/py/fastlangid.svg)](https://badge.fury.io/py/fastlangid)


The only language identification library that covers Cantonese (廣東話) as well as traditional and simplified Chinese.


## Why and who is this package for?

This is a language identification library focused on providing higher accuracy for Japanese, Korean, and Chinese than the original fastText model (lid.176.ftz). The package also identifies Cantonese and distinguishes simplified from traditional Chinese.

The first-stage model is the fastText language identification model itself, so its F1 score is the same as that model's:

|         Model         |  F1@1  |
|-----------------------|--------|
| lid.176.ftz           | 0.977  |

We achieve higher accuracy by adding a second language identification model that handles low-confidence predictions for Japanese, Korean, and Chinese. The table below shows F1 (k=1) and accuracy scores for identifying these 3 languages. (We have updated the validation corpus, which is now much harder than in the first revision: it contains shorter text and more recent news text.)


|   2nd-Stage Model     |  F1@1  |  Acc@1  |
|-----------------------|--------|--------|
| version 1.0.0         | 0.826  | 0.744  |
| master                | 0.801  | 0.894  |
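
The two-stage idea described above can be sketched roughly as follows. This is only an illustration of the control flow, not fastlangid's actual implementation: the `first_stage` and `second_stage` callables, the `CJK_CODES` set, and the `0.9` threshold are all hypothetical stand-ins chosen for the example.

```python
from typing import Callable, Tuple

# Hypothetical: a predictor maps text to (language_code, confidence).
Predictor = Callable[[str], Tuple[str, float]]

# Hypothetical set of first-stage labels that the second stage refines.
CJK_CODES = {"ja", "ko", "zh"}


def two_stage_predict(
    text: str,
    first_stage: Predictor,
    second_stage: Predictor,
    threshold: float = 0.9,  # assumed value, not taken from the library
    force_second: bool = False,
) -> str:
    """Return a language code, deferring to the second stage when needed."""
    if force_second:
        return second_stage(text)[0]
    code, confidence = first_stage(text)
    # Only low-confidence Japanese/Korean/Chinese guesses are re-checked.
    if code in CJK_CODES and confidence < threshold:
        return second_stage(text)[0]
    return code


# Toy stand-ins for the two models, used only to exercise the control flow.
def fake_first(text: str) -> Tuple[str, float]:
    return ("zh", 0.55) if any(ord(c) > 0x2E80 for c in text) else ("en", 0.99)


def fake_second(text: str) -> Tuple[str, float]:
    return ("zh-hant", 0.97)


print(two_stage_predict("This is a test", fake_first, fake_second))
print(two_stage_predict("中文繁體", fake_first, fake_second))
```

With these toy predictors, the English sentence is answered by the first stage alone, while the low-confidence Chinese guess is handed to the second stage.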

The master version is also trained to identify Cantonese (zh-yue), using text from the Mozilla Common Voice corpus. The model is currently sensitive to non-Cantonese text mixed into a sentence, so please use it with care.

To use Cantonese prediction, it is recommended to force inference with the second-stage model:

```python
from fastlangid.langid import LID
lang_code = LID().predict('平嘢有冇好嘢?', force_second=True)
```


For more details on edge cases, please refer to [fasttext_issues.py](tests/fasttext_issues.py).


The training data for the supplementary model was drawn from the Common Crawl corpus and an internal language dataset from [Currents API](https://currentsapi.services/en).

We wish to improve Cantonese support in the future. Feel free to contact us if you can provide a relevant corpus.


## Install


```bash
$ pip install fastlangid
```

## Example

A single function call handles either one sentence or a list of sentences:

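```python
from fastlangid.langid import LID

# Load the bundled models once and reuse the instance.
langid = LID()
# Pass a single string to identify one sentence.
result = langid.predict('This is a test')
print(result)
```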


```python
from fastlangid.langid import LID

langid = LID()
# Traditional Chinese, simplified Chinese, English, Indonesian, and French samples.
examples = [
  '中文繁體',
  '中文简体',
  'Lorem Ipsum is simply dummy text of the printing and typesetting industry',
  'Lorem Ipsum adalah text contoh digunakan didalam industri pencetakan dan typesetting',
  'Le Lorem Ipsum est simplement du faux texte employé dans la composition et la mise en page avant impression'
]
# Pass a list to identify several sentences in one call.
results = langid.predict(examples)
print(results)
```
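
To make a batch result easier to read, each input can be paired with its predicted code. The sketch below assumes that `predict` returns one language code per input string, in input order, when it is given a list. `group_by_language` is a hypothetical helper, not part of fastlangid, and it takes the predict function as an argument so that it can be tried without loading the model.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def group_by_language(
    texts: List[str],
    predict: Callable[[List[str]], List[str]],
) -> Dict[str, List[str]]:
    """Group texts by the language code predicted for each one."""
    codes = predict(texts)  # assumed: one code per text, same order
    if len(codes) != len(texts):
        raise ValueError("predict must return one code per input text")
    groups: Dict[str, List[str]] = defaultdict(list)
    for text, code in zip(texts, codes):
        groups[code].append(text)
    return dict(groups)


# Stand-in predictor used only to demonstrate the helper.
def fake_predict(texts: List[str]) -> List[str]:
    return ["en" if t.isascii() else "zh-hant" for t in texts]


print(group_by_language(["This is a test", "中文繁體", "Hello"], fake_predict))
```

With the real model, the call would look like `group_by_language(examples, LID().predict)`, provided the assumption about the return value holds.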



## Supported Languages

fastlangid supports 177 languages. Their ISO codes are listed below.

```
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk
ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga
gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km
kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms
mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu
rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr
tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh-hans zh-hant zh-yue
```

## Caveats

Bag-of-words methods do not work well for short-text classification, as described in [this article by Apple](https://machinelearning.apple.com/research/language-identification-from-very-short-strings). It is therefore recommended that the input text contain more than 5 characters or words.
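
One way to follow this recommendation is to filter out very short inputs before calling the model. The sketch below is only an assumption about how such a guard might look: `is_long_enough` and its `min_chars` default of 5 are hypothetical and not part of fastlangid, and reading the recommendation as a count of non-whitespace characters is an interpretation.

```python
def is_long_enough(text: str, min_chars: int = 5) -> bool:
    """Return True if text has more than min_chars non-whitespace characters."""
    visible = sum(1 for ch in text if not ch.isspace())
    return visible > min_chars


candidates = ["ok", "中文", "This is a test", "平嘢有冇好嘢?"]
# Keep only inputs long enough to be worth sending to the model.
usable = [t for t in candidates if is_long_enough(t)]
print(usable)
```

Counting characters rather than words keeps the check meaningful for Chinese and Japanese text, which is usually written without spaces.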

Cantonese identification is trained on everyday conversational text, which may not be representative of article-style text. As a result, the model may confuse Cantonese with traditional Chinese (zh-hant), since the two share the same characters.


## Reference

### Enriching Word Vectors with Subword Information

[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)

```
@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}
```

### Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)

```
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```

### FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: Compressing text classification models*](https://arxiv.org/abs/1612.03651)

```
@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}
```

## License

[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fcurrentsapi%2Ffastlangid.svg?type=large)](https://app.fossa.com/projects/git%2Bgithub.com%2Fcurrentsapi%2Ffastlangid?ref=badge_large)



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/currentsapi/fastlangid",
    "name": "fastlangid",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "language detection library",
    "author": "Ray",
    "author_email": "ray@currentsapi.services",
    "download_url": "https://files.pythonhosted.org/packages/e1/b7/1b17848fb522872f0923967d5c113338e23ba2be9d3c311d5215b88045ab/fastlangid-1.0.11.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "Language detection for news powered by fasttext",
    "version": "1.0.11",
    "split_keywords": [
        "language",
        "detection",
        "library"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "b8702b758eb6ef4236b90a057527fbd7",
                "sha256": "feb53f9415f6556f5c29b98a6792f697d3780a970393a7ec6e44689257069992"
            },
            "downloads": -1,
            "filename": "fastlangid-1.0.11-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b8702b758eb6ef4236b90a057527fbd7",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 1156395,
            "upload_time": "2022-12-06T00:04:47",
            "upload_time_iso_8601": "2022-12-06T00:04:47.974473Z",
            "url": "https://files.pythonhosted.org/packages/c4/aa/11ed99e3592830829f7f8626738d27300f0939a1363864b1f8172bbf7b6f/fastlangid-1.0.11-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "d7c6fc104a6769d085e0d6fcc662fe5b",
                "sha256": "e19923245943714809e1ed283ae2fbc1223f64a0afee3e02ad66004edc119f50"
            },
            "downloads": -1,
            "filename": "fastlangid-1.0.11.tar.gz",
            "has_sig": false,
            "md5_digest": "d7c6fc104a6769d085e0d6fcc662fe5b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 1157848,
            "upload_time": "2022-12-06T00:04:51",
            "upload_time_iso_8601": "2022-12-06T00:04:51.807445Z",
            "url": "https://files.pythonhosted.org/packages/e1/b7/1b17848fb522872f0923967d5c113338e23ba2be9d3c311d5215b88045ab/fastlangid-1.0.11.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-06 00:04:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "currentsapi",
    "github_project": "fastlangid",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "lcname": "fastlangid"
}
        