langid-pyc


| Field | Value |
| --- | --- |
| Name | langid-pyc |
| Version | 0.1.0 |
| Summary | Written in C drop-in replacement of the language identification tool langid.py |
| Upload time | 2024-04-11 19:32:43 |
| Requires Python | >=3.8 |
| License | langid.pyc - Language Identifier BSD 3-Clause License Modifications (fork): Copyright (c) 2024, Aleksandr Lukoianov <liablefish@gmail.com>. Original code: Copyright (c) 2014 Marco Lui <saffsd@gmail.com>. Based on research by Marco Lui and Tim Baldwin. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
| Keywords | language detection, language identification, langid, langid.py |
| Requirements | No requirements were recorded. |
            
# langid.pyc
A modified version of `langid.c` with Python bindings: a drop-in replacement for `langid.py` that offers the same features but runs about 200 times faster.

## Installation
```bash
pip install langid-pyc
```

## Usage
### Basic
```python
from langid_pyc import (
    classify,
    rank,
)

classify("This is English text")
# ('en', 0.9999999239251556)

rank("This is English text")
# [('en', 0.9999999239251556),
#  ('la', 5.0319768731501096e-08),
#  ('br', 1.2684715402216825e-08),
#  ...]
```
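
Because `classify` returns a `(language, probability)` pair, a caller can discard low-confidence guesses. The sketch below is illustrative only: `confident_language`, `min_prob`, and the stand-in `fake_classify` are hypothetical names, not part of `langid_pyc`. In real use you would pass `langid_pyc.classify` instead of the stub.

```python
from typing import Callable, Optional, Tuple

# Any callable shaped like `classify`: text -> (language, probability).
Classifier = Callable[[str], Tuple[str, float]]


def confident_language(
    text: str, classify_fn: Classifier, min_prob: float = 0.9
) -> Optional[str]:
    """Return the predicted language, or None if its probability is below min_prob."""
    lang, prob = classify_fn(text)
    return lang if prob >= min_prob else None


def fake_classify(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for langid_pyc.classify so the sketch runs on its own."""
    if text.isascii():
        return ("en", 0.9999999239251556)
    return ("ru", 0.55)  # made-up low-confidence result


print(confident_language("This is English text", fake_classify))  # en
print(confident_language("А это текст", fake_classify))  # None
```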
### Language set constraint
```python
from langid_pyc import (
    classify,
    nb_classes,
    set_languages,
)

nb_classes()
# ['af',
#  'am',
#  'an',
#  ...]

len(nb_classes())
# 97

set_languages(["en", "ru"])
nb_classes()
# ['en', 'ru']

classify("This is English text")
# ('en', 1.0)

classify("А это текст на русском")
# ('ru', 1.0)

set_languages() # reset languages
len(nb_classes())
# 97
```
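
The `1.0` probabilities above are consistent with scores being normalized over the active language set only. That is an assumption about how `set_languages` behaves, not something this README states. Under that assumption, the effect can be imitated by hand on `rank`-style output; `renormalize` is a hypothetical helper, and the input values are the three scores shown in the Basic example.

```python
from typing import Iterable, List, Tuple


def renormalize(
    ranked: Iterable[Tuple[str, float]], allowed: Iterable[str]
) -> List[Tuple[str, float]]:
    """Keep only the allowed languages and rescale their probabilities to sum to 1."""
    allowed_set = set(allowed)
    kept = [(lang, prob) for lang, prob in ranked if lang in allowed_set]
    total = sum(prob for _, prob in kept)
    if total == 0:
        raise ValueError("none of the allowed languages has a non-zero probability")
    rescaled = [(lang, prob / total) for lang, prob in kept]
    # Highest probability first, mirroring the order of `rank` output.
    return sorted(rescaled, key=lambda item: item[1], reverse=True)


ranked = [
    ("en", 0.9999999239251556),
    ("la", 5.0319768731501096e-08),
    ("br", 1.2684715402216825e-08),
]
restricted = renormalize(ranked, ["en", "br"])
print(restricted)
```

With only two candidates left, nearly all of the probability mass falls on the dominant language, which is why a value that rounds to `1.0` is plausible.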
### `LanguageIdentifier` class
```python
from langid_pyc import LanguageIdentifier

identifier = LanguageIdentifier.from_modelpath("ldpy3.pmodel")  # default model

len(identifier.nb_classes)
# 97

identifier.classify("This is English text")
# ('en', 0.9999999239251556)

# identifier.rank(...)
# identifier.set_languages(...)
```

## How to build?
Install the required `protobuf-c` packages (the command below is for Debian/Ubuntu):
```bash
apt install protobuf-c-compiler libprotobuf-c-dev
```

Install the Python development requirements:
```bash
pip install -r requirements.txt
```

Run the build:
```bash
make build
```

See [Makefile](Makefile) for more details.

## How to add a new model?
Train a new model using the `langid.py` package. The training script writes the model file as shown [here](https://github.com/saffsd/langid.py/blob/master/langid/train/train.py#L283):
```python
# Excerpt from langid.py's train.py (Python 2 syntax)
# output the model
output_path = os.path.join(model_dir, 'your_new_model.model')
model = nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output
string = base64.b64encode(bz2.compress(cPickle.dumps(model)))
with open(output_path, 'w') as f:
    f.write(string)
print "wrote model to %s (%d bytes)" % (output_path, len(string))
```
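
The excerpt above is Python 2 code. For orientation, the same encoding (pickle, then bz2, then base64) and its inverse can be sketched in Python 3. The five tuple fields here are tiny placeholder values rather than a trained model, and `encode_model` / `decode_model` are hypothetical helper names, not functions from `langid.py` or `langid_pyc`.

```python
import base64
import bz2
import pickle


def encode_model(model: tuple) -> bytes:
    """Serialize a model tuple the way the excerpt does: pickle -> bz2 -> base64."""
    return base64.b64encode(bz2.compress(pickle.dumps(model)))


def decode_model(data: bytes) -> tuple:
    """Invert encode_model: base64 -> bz2 -> unpickle."""
    return pickle.loads(bz2.decompress(base64.b64decode(data)))


# Placeholder stand-ins for (nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output).
toy_model = ([0.1, 0.2], [0.5, 0.5], ["en", "ru"], [0, 1], {0: (1,)})

encoded = encode_model(toy_model)
assert decode_model(encoded) == toy_model
print("round trip ok, %d bytes" % len(encoded))
```

Since the file is a pickle underneath, only decode model files from sources you trust.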

Move `your_new_model.model` to the `models` directory and run:
```bash
make your_new_model.model
```

Now you have a `your_new_model.pmodel` file in the root directory, which can be passed to `LanguageIdentifier.from_modelpath`:

```python
from langid_pyc import LanguageIdentifier

your_new_identifier = LanguageIdentifier.from_modelpath("your_new_model.pmodel")
```

## Benchmark
The benchmark was run on a Mac with an M2 Max chip and 32 GB of RAM, using Python 3.8.18; the results can be found [here](benchmark/benchmark.html).

TL;DR `langid.pyc` is ~200x faster than `langid.py` and ~1-1.5x faster than `pycld2`, especially on long texts.
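
A comparison of this kind can be approximated with a small timing harness. The sketch below is not the script behind `benchmark/benchmark.html`; `throughput` and `stub_classify` are hypothetical names, and you would pass `langid_pyc.classify`, `langid.classify`, or another callable in place of the stub.

```python
import time
from typing import Callable, Sequence


def throughput(
    classify_fn: Callable[[str], object], texts: Sequence[str], repeats: int = 3
) -> float:
    """Return the best observed rate, in texts per second, over `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            classify_fn(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very coarse clock.
    return len(texts) / max(best, 1e-9)


def stub_classify(text: str):
    """Hypothetical stand-in; replace with the classifier you want to measure."""
    return ("en", 1.0)


texts = ["This is English text"] * 10_000
rate = throughput(stub_classify, texts)
print(f"{rate:,.0f} texts/second")
```

Taking the best of several passes reduces noise from other processes; dividing two such rates gives a speedup factor comparable to the figures quoted above.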

# Original README

===================
``langid.c`` readme
===================

Introduction
------------
`langid.c` is an experimental implementation of the language identifier
described by [1] in pure C. It is largely based on the design of
`langid.py`[2], and uses `langid.py` to train models. 

Planned features
----------------
See TODO

Speed
-----

Initial comparisons against Google's cld2[3] suggest that `langid.c` is about
twice as fast.

    (langid.c) @mlui langid.c git:[master] wc -l wikifiles 
    28600 wikifiles
    (langid.c) @mlui langid.c git:[master] time cat wikifiles | ./compact_lang_det_batch > xxx
    cat wikifiles  0.00s user 0.00s system 0% cpu 7.989 total
    ./compact_lang_det_batch > xxx  7.77s user 0.60s system 98% cpu 8.479 total
    (langid.c) @mlui langid.c git:[master] time cat wikifiles | ./langidOs -b > xxx           
    cat wikifiles  0.00s user 0.00s system 0% cpu 3.577 total
    ./langidOs -b > xxx  3.44s user 0.24s system 97% cpu 3.759 total

    (langid.c) @mlui langid.c git:[master] wc -l rcv2files 
    20000 rcv2files
    (langid.c) @mlui langid.c git:[master] time cat rcv2files | ./langidO2 -b > xxx     
    cat rcv2files  0.00s user 0.00s system 0% cpu 31.702 total
    ./langidO2 -b > xxx  8.23s user 0.54s system 22% cpu 38.644 total
    (langid.c) @mlui langid.c git:[master] time cat rcv2files | ./compact_lang_det_batch > xxx 
    cat rcv2files  0.00s user 0.00s system 0% cpu 18.343 total
    ./compact_lang_det_batch > xxx  18.14s user 0.53s system 97% cpu 19.155 total
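
The "about twice as fast" figure can be checked against the CPU times (user + system) in the transcript above. The numbers below are copied from those `time` lines; the helper name `cpu_seconds` is only for illustration.

```python
def cpu_seconds(user: float, system: float) -> float:
    """Total CPU time reported by `time` for one process."""
    return user + system


timings = {
    "wikifiles": {
        "cld2": cpu_seconds(7.77, 0.60),
        "langid.c": cpu_seconds(3.44, 0.24),
    },
    "rcv2files": {
        "cld2": cpu_seconds(18.14, 0.53),
        "langid.c": cpu_seconds(8.23, 0.54),
    },
}

speedups = {name: t["cld2"] / t["langid.c"] for name, t in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: langid.c uses {ratio:.2f}x less CPU time than cld2")
# wikifiles: langid.c uses 2.27x less CPU time than cld2
# rcv2files: langid.c uses 2.13x less CPU time than cld2
```

Note that the wall-clock totals for `rcv2files` point the other way (38.644 s for `langid.c` at 22% CPU versus 19.155 s for cld2), which likely means that run spent most of its time waiting on I/O rather than computing.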


Model Training
--------------

Google's protocol buffers [4] are used to transfer models between languages. The
Python program `ldpy2ldc.py` can convert a model produced by langid.py [2] into
the protocol-buffer format, and also into the C source format used to compile a
built-in model directly into the executable.

Dependencies
------------
Protocol buffers [4]
protobuf-c [5]

Contact
-------
Marco Lui <saffsd@gmail.com>

References
----------
[1] http://aclweb.org/anthology-new/I/I11/I11-1062.pdf
[2] https://github.com/saffsd/langid.py
[3] https://code.google.com/p/cld2/
[4] https://github.com/google/protobuf/
[5] https://github.com/protobuf-c/protobuf-c

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "langid-pyc",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "language detection, language identification, langid, langid.py",
    "author": null,
    "author_email": "Aleksandr Lukoianov <liablefish@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/b7/e3/c5f36fc463e2eec50c054eb452f943d2e45e068420dc240a359c18c30168/langid-pyc-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "summary": "Written in C drop-in replacement of the language identification tool langid.py",
    "version": "0.1.0",
    "project_urls": {
        "Homepage": "https://github.com/LiableFish/langid.pyc"
    },
    "split_keywords": [
        "language detection",
        " language identification",
        " langid",
        " langid.py"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a5fcfc2e885e35478281cd8a35fa4c0979907ca6305eaa2e4141c7f03bb1ec0f",
                "md5": "2cb6058d0bba37f8bfe448f04c6202a4",
                "sha256": "c94aef19dd11b93d0b8140828b351ef8902e977dc8a9c68da8c417a99d2eed6e"
            },
            "downloads": -1,
            "filename": "langid_pyc-0.1.0-cp38-cp38-macosx_14_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "2cb6058d0bba37f8bfe448f04c6202a4",
            "packagetype": "bdist_wheel",
            "python_version": "cp38",
            "requires_python": ">=3.8",
            "size": 1722466,
            "upload_time": "2024-04-11T19:32:41",
            "upload_time_iso_8601": "2024-04-11T19:32:41.011226Z",
            "url": "https://files.pythonhosted.org/packages/a5/fc/fc2e885e35478281cd8a35fa4c0979907ca6305eaa2e4141c7f03bb1ec0f/langid_pyc-0.1.0-cp38-cp38-macosx_14_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b7e3c5f36fc463e2eec50c054eb452f943d2e45e068420dc240a359c18c30168",
                "md5": "f49f1119dfb6d1b843f68f7423dd6f2b",
                "sha256": "5581908cb83cdcc1c7e0eece336d9da7f563bc0a7a4b52d870609e312b45a745"
            },
            "downloads": -1,
            "filename": "langid-pyc-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "f49f1119dfb6d1b843f68f7423dd6f2b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 4469832,
            "upload_time": "2024-04-11T19:32:43",
            "upload_time_iso_8601": "2024-04-11T19:32:43.691057Z",
            "url": "https://files.pythonhosted.org/packages/b7/e3/c5f36fc463e2eec50c054eb452f943d2e45e068420dc240a359c18c30168/langid-pyc-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-11 19:32:43",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "LiableFish",
    "github_project": "langid.pyc",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "langid-pyc"
}
        