fastspell


| Field | Value |
| --- | --- |
| Name | fastspell |
| Version | 0.6.1 |
| Summary | Targeted language identifier, based on FastText and Hunspell. |
| upload_time | 2023-01-25 10:28:33 |
| docs_url | None |
| requires_python | >=3.8 |
| requirements | No requirements were recorded. |

# FastSpell

Targeted language identifier, based on FastText and Hunspell.

## How it works 

FastSpell will try to determine the language of a sentence by using **[FastText](https://fasttext.cc/)**.

If the detected language is very similar to the target language (e.g. FastText detected Spanish while the targeted language is Galician), extra checks are performed with **[Hunspell](http://hunspell.github.io/)** to determine the language more precisely.
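
The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration and not FastSpell's actual implementation: the `predict` callable stands in for FastText, the `error_rate` callable stands in for a Hunspell check, and `identify`, `toy_predict`, `toy_error_rate`, `SIMILAR` and the tiny word lists are all hypothetical names and data invented for this example. Ties between candidates are not handled here.

```python
from typing import Callable, Dict, List

# Hypothetical mapping in the spirit of similar.yaml: target -> confusable languages.
SIMILAR: Dict[str, List[str]] = {"gl": ["es", "pt", "gl"]}

# Toy word lists standing in for real spelling dictionaries (assumption).
TOY_WORDS = {
    "es": {"la", "casa", "es", "bonita"},
    "pt": {"a", "casa", "bonita"},
    "gl": {"a", "casa", "é", "bonita", "xanela"},
}


def toy_predict(sentence: str) -> str:
    """Pretend first-stage classifier that always answers Spanish."""
    return "es"


def toy_error_rate(sentence: str, lang: str) -> float:
    """Fraction of words that the toy dictionary for `lang` does not know."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    unknown = [w for w in words if w not in TOY_WORDS[lang]]
    return len(unknown) / len(words)


def identify(
    sentence: str,
    target: str,
    predict: Callable[[str], str],
    error_rate: Callable[[str, str], float],
    similar: Dict[str, List[str]] = SIMILAR,
) -> str:
    """Coarse prediction first, then a spell-check refinement if needed."""
    predicted = predict(sentence)
    candidates = similar.get(target, [])
    if predicted not in candidates:
        # Not confusable with the target: keep the first-stage prediction.
        return predicted
    # Confusable: pick the candidate whose dictionary rejects the fewest words.
    return min(candidates, key=lambda lang: error_rate(sentence, lang))


result = identify("a xanela é bonita", "gl", toy_predict, toy_error_rate)
```

With the toy data, the first stage answers `es`, which is in the Galician group, so the second stage compares the three candidates and settles on the one with the fewest unknown words.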


## Requirements & Installation

**FastSpell** can be installed from PyPI:
```
pip install fastspell
```
or directly from source:
```
pip install .
```
Note that **Hunspell** requires `python-dev` and `libhunspell-dev`:

```
sudo apt-get install python-dev libhunspell-dev
```

Before running FastSpell for any of the languages listed as [similar](https://github.com/mbanon/fastspell/blob/main/fastspell/config/similar.yaml), you must have all the [needed Hunspell dictionaries](https://github.com/mbanon/fastspell/blob/main/fastspell/config/hunspell.yaml) for that language.
For further explanation of how configuration works, [see below](#configuration).
To download all the files needed for the default configuration, run the `fastspell-download` command without arguments:
```
fastspell-download
```

### Conda
Alternatively, you can install the conda package:
```
conda install -c conda-forge -c bitextor fastspell
```

### RedHat installation
On RedHat and its derivatives, install Hunspell by running:
```
sudo dnf install hunspell hunspell-devel
```

If `pip install hunspell` fails with the error `/usr/bin/ld: cannot find -lhunspell`, you will probably need to add a symlink in `/usr/lib64` or in another library path in your environment (such as `/home/user/.local/lib`):
```
sudo ln -s /usr/lib64/libhunspell-1.7.so /usr/lib64/libhunspell.so
```

## Configuration

A few configuration files are provided under the `fastspell/config` directory.
If you need to change the default configuration, you can provide the path to your own config directory with `-c`/`--config` or with the environment variable `FASTSPELL_CONFIG`.
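
One way to picture how a config directory might be chosen is sketched below. The precedence shown (command-line value first, then `FASTSPELL_CONFIG`, then a default directory) is an assumption made for illustration and is not confirmed by this documentation; `resolve_config_dir` and its parameters are hypothetical names.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_dir(
    cli_value: Optional[str],
    default_dir: str,
    environ: Mapping[str, str] = os.environ,
) -> Path:
    """Pick a config directory (assumed order: CLI value, env var, default)."""
    if cli_value:
        return Path(cli_value)
    env_value = environ.get("FASTSPELL_CONFIG")
    if env_value:
        return Path(env_value)
    return Path(default_dir)


# Example with an explicit environment mapping instead of the real one.
chosen = resolve_config_dir(None, "fastspell/config", {"FASTSPELL_CONFIG": "/etc/fastspell"})
```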

#### similar.yaml

This dictionary-like file stores the groups of similar languages. These are the languages that will be "double-checked" with Hunspell after being identified with FastText. For example, see the line `gl: [es, pt, gl]`. This means that when the targeted language is Galician and FastText identifies a given sentence as Spanish, Portuguese or Galician, extra checks are performed with Hunspell to decide which of the three similar languages best fits the sentence.

Please note that you need Hunspell dictionaries for all the languages in this file (if you use the `fastspell-download` command, there is nothing else to do). This file can be modified to remove a language you are not interested in, or a language for which you don't have Hunspell dictionaries, or to add new similar or target languages.

#### hunspell.yaml

This file stores the names of the dictionaries. Every language listed as similar must have an entry in this file for FastSpell to work properly.
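
The requirement above can be checked with a few lines of Python. This is only a sketch: the two dictionaries below are hand-written stand-ins for the parsed contents of `similar.yaml` and of the `hunspell_codes` section, and `missing_dictionary_codes` is a hypothetical helper. Only `gl: [es, pt, gl]` and `ca: ca_ES` come from this documentation; the other values are assumed for the example.

```python
from typing import Dict, List


def missing_dictionary_codes(
    similar: Dict[str, List[str]],
    hunspell_codes: Dict[str, str],
) -> List[str]:
    """Return the similar languages that have no entry in hunspell_codes."""
    needed = set()
    for group in similar.values():
        needed.update(group)
    return sorted(needed - set(hunspell_codes))


# Stand-ins for the parsed YAML files (values other than ca_ES are assumptions).
similar = {"gl": ["es", "pt", "gl"]}
hunspell_codes = {"ca": "ca_ES", "es": "es_ES", "gl": "gl_ES"}

missing = missing_dictionary_codes(similar, hunspell_codes)
```

Here `missing` lists Portuguese, because `pt` appears in the Galician group but has no dictionary name configured.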

For example, the first entry in `hunspell_codes` is `ca: ca_ES`, and the dictionary path is `~/.local/share/fastspell/`. That means that the Hunspell files for Catalan are `~/.local/share/fastspell/ca_ES.dic` and `~/.local/share/fastspell/ca_ES.aff`.

By default `dicpath` is empty, which means FastSpell will look in these directories for the dictionaries:
```
~/.local/share/fastspell
~/.local/share/hunspell
$VIRTUAL_ENV/share/hunspell
/usr/share/hunspell
```
To use a custom path, set it in `dicpath`; it will then be the first directory searched.
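
The search order described above can be illustrated with a short sketch. It is not FastSpell's own lookup code: `candidate_dirs` and `find_dictionary` are hypothetical helpers, and requiring both the `.dic` and the `.aff` file to be present in the same directory is an assumption of this example.

```python
import os
from pathlib import Path
from typing import Iterable, List, Optional


def candidate_dirs(dicpath: str = "") -> List[Path]:
    """Directories to search, with a custom dicpath (if any) placed first."""
    dirs: List[Path] = []
    if dicpath:
        dirs.append(Path(os.path.expanduser(dicpath)))
    dirs.append(Path(os.path.expanduser("~/.local/share/fastspell")))
    dirs.append(Path(os.path.expanduser("~/.local/share/hunspell")))
    venv = os.environ.get("VIRTUAL_ENV")
    if venv:
        # Only relevant when a virtual environment is active.
        dirs.append(Path(venv) / "share" / "hunspell")
    dirs.append(Path("/usr/share/hunspell"))
    return dirs


def find_dictionary(code: str, dirs: Iterable[Path]) -> Optional[Path]:
    """Return the first directory holding both <code>.dic and <code>.aff."""
    for directory in dirs:
        dic_file = directory / f"{code}.dic"
        aff_file = directory / f"{code}.aff"
        if dic_file.is_file() and aff_file.is_file():
            return directory
    return None


# Where would the Catalan dictionary be picked up from on this machine?
location = find_dictionary("ca_ES", candidate_dirs())
```

`location` is `None` when no directory in the list contains the pair of files.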


## Usage

### Module:
To use **FastSpell** as a Python module, install it and import it:
```
from fastspell import FastSpell
```
Build a FastSpell object, like:
```
fsobj = FastSpell.FastSpell("en", mode="cons")
```
(Learn more about modes in the [Aggressive vs Conservative](#aggressive-vs-conservative) section below.)

Then call the `getlang` method with the sentences you want to identify, for example:
```
fsobj.getlang("Hello, world")
#'en'
fsobj.getlang("Hola, mundo")
#'es'

```
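
For more than a couple of sentences, the calls can be wrapped in a small helper. The sketch below only relies on the `getlang` method shown above; `tag_sentences` is a hypothetical helper, and `StubIdentifier` is a stand-in invented so the example runs without FastSpell, its model or its dictionaries installed.

```python
from typing import Iterable, List, Tuple


def tag_sentences(identifier, sentences: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each sentence with the tag returned by identifier.getlang()."""
    return [(sentence, identifier.getlang(sentence)) for sentence in sentences]


class StubIdentifier:
    """Hypothetical stand-in exposing the same getlang() method."""

    def getlang(self, sentence: str) -> str:
        return "es" if "hola" in sentence.lower() else "en"


# With FastSpell installed, the stub could be replaced by the real object:
#   from fastspell import FastSpell
#   identifier = FastSpell.FastSpell("en", mode="cons")
identifier = StubIdentifier()
tagged = tag_sentences(identifier, ["Hello, world", "Hola, mundo"])
```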

### CLI:
```
usage: fastspell [-h] [--aggr] [--cons] [--hbs] [-q] [--debug]
                 [--logfile LOGFILE] [-v]
                 lang [input] [output]

positional arguments:
  lang
  input              Input sentences. (default: <_io.TextIOWrapper
                     name='<stdin>' encoding='UTF-8'>)
  output             Output of the language identification. (default:
                     <_io.TextIOWrapper name='<stdout>' mode='w'
                     encoding='UTF-8'>)

optional arguments:
  -h, --help         show this help message and exit
  --aggr             Aggressive strategy (more positives) (default: False)
  --cons             Conservative strategy (less positives) (default: False)
  --hbs              Return all Serbo-Croatian variants as 'hbs' (default:
                     False)

Logging:
  -q, --quiet        Silent logging mode (default: False)
  --debug            Debug logging mode (default: False)
  --logfile LOGFILE  Store log to a file (default: <_io.TextIOWrapper
                     name='<stderr>' mode='w' encoding='UTF-8'>)
  -v, --version      show version of this script and exit
```

## Aggressive vs Conservative

FastSpell comes in two flavours: Aggressive and Conservative.

The **Aggressive** mode is less hesitant to tag a sentence with the target language and never reports doubt. The **Conservative** mode, on the other hand, is more reluctant to tag a sentence with the target language and uses the `unk` (unknown) tag in case of doubt (for example, when there is a tie between the target language and another language).
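
The difference between the two modes in the tie case can be sketched as below. This is an illustration under stated assumptions, not the library's code: `resolve` is a hypothetical function, the per-language spelling error counts are made-up inputs, the mode string `"aggr"` is assumed by analogy with `"cons"`, and the behaviour for ties that do not involve the target language is an arbitrary choice of this example.

```python
from typing import Dict


def resolve(target: str, errors: Dict[str, int], mode: str) -> str:
    """Choose a tag from per-language error counts (lower is better)."""
    if mode not in ("aggr", "cons"):
        raise ValueError(f"unknown mode: {mode!r}")
    best = min(errors.values())
    winners = sorted(lang for lang, count in errors.items() if count == best)
    if len(winners) == 1:
        return winners[0]
    if target in winners:
        # Tie involving the target: aggressive keeps it, conservative gives up.
        return target if mode == "aggr" else "unk"
    # Tie between non-target languages: arbitrary pick for this sketch.
    return winners[0]


tie = {"gl": 1, "es": 1, "pt": 3}
aggressive_tag = resolve("gl", tie, "aggr")
conservative_tag = resolve("gl", tie, "cons")
```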

## Benchmark 

![comparative.png](comparative.png)


## Usage example

Input text:
```
19-01-2011 47 comentarios 7o Xornadas de Xardinería de Galicia (RE)PLANTEAR
• Proceso de valoración de idoneidade: entrevistas psicosociais e visita domiciliaria e aplicación de test psicolóxicos, se é o caso.
- Chrome e Firefox en MacOS non son compatibles (unicamente Safari é compatible con MacOS), pero invocarase PSAL ao intentar empregar Chrome ou Firefox.
Mago da luz / Maga da luz
Celebrada a homenaxe a Xosé Manuel Seivane Rivas
A instalación eléctrica en teletraballo
Saltar á navegación Navegación INICIO
Julio Freire, competidor da FGA, invitado polo Kennel club de Inglaterra, para participar nos Crufts 2014 (Birmingham, 6 - 9 de marzo).
25 de xullo - Truong Tan Sang toma posesión como presidente de Vietnam
Quen pode solicitar o dito financiamento?
```
Command (`$L` holds the target language code):
```
fastspell $L --aggr inputtext
fastspell $L --cons inputtext
```
Aggressive output:
```
19-01-2011 47 comentarios 7o Xornadas de Xardinería de Galicia (RE)PLANTEAR     gl
• Proceso de valoración de idoneidade: entrevistas psicosociais e visita domiciliaria e aplicación de test psicolóxicos, se é o caso.   gl
- Chrome e Firefox en MacOS non son compatibles (unicamente Safari é compatible con MacOS), pero invocarase PSAL ao intentar empregar Chrome ou Firefox.        gl
Mago da luz / Maga da luz       gl
Celebrada a homenaxe a Xosé Manuel Seivane Rivas        gl
A instalación eléctrica en teletraballo gl
Saltar á navegación Navegación INICIO   gl
Julio Freire, competidor da FGA, invitado polo Kennel club de Inglaterra, para participar nos Crufts 2014 (Birmingham, 6 - 9 de marzo). es
25 de xullo - Truong Tan Sang toma posesión como presidente de Vietnam  gl
Quen pode solicitar o dito financiamento?       gl
```

Conservative output:
```
19-01-2011 47 comentarios 7o Xornadas de Xardinería de Galicia (RE)PLANTEAR     unk
• Proceso de valoración de idoneidade: entrevistas psicosociais e visita domiciliaria e aplicación de test psicolóxicos, se é o caso.   gl
- Chrome e Firefox en MacOS non son compatibles (unicamente Safari é compatible con MacOS), pero invocarase PSAL ao intentar empregar Chrome ou Firefox.        gl
Mago da luz / Maga da luz       unk
Celebrada a homenaxe a Xosé Manuel Seivane Rivas        gl
A instalación eléctrica en teletraballo unk
Saltar á navegación Navegación INICIO   gl
Julio Freire, competidor da FGA, invitado polo Kennel club de Inglaterra, para participar nos Crufts 2014 (Birmingham, 6 - 9 de marzo). es
25 de xullo - Truong Tan Sang toma posesión como presidente de Vietnam  gl
Quen pode solicitar o dito financiamento?       gl
```
Getting stats:
```
cat inputtext | fastspell $L --aggr | cut -f2 | sort | uniq -c | sort -nr
cat inputtext | fastspell $L --cons | cut -f2 | sort | uniq -c | sort -nr
```
Aggressive:
```
9 gl
1 es
```
Conservative:
```
6 gl
3 unk
1 es
```
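
The same counts can be produced in Python instead of the shell pipeline. This sketch assumes, as `cut -f2` does, that each output line holds the sentence and its tag separated by a tab; `tag_counts` is a hypothetical helper and the sample lines are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tag_counts(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count the second tab-separated field, most frequent tag first."""
    tags = (
        line.rstrip("\n").split("\t")[1]
        for line in lines
        if "\t" in line  # skip lines that carry no tag column
    )
    return Counter(tags).most_common()


sample = [
    "Quen pode solicitar o dito financiamento?\tgl\n",
    "Mago da luz / Maga da luz\tunk\n",
    "Saltar á navegación Navegación INICIO\tgl\n",
]
counts = tag_counts(sample)
```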


            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "fastspell",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Marta Ba\u00f1on <mbanon@prompsit.com>, Jaume Zaragoza <jzaragoza@prompsit.com>",
    "keywords": "",
    "author": "",
    "author_email": "Prompsit Language Engineering <info@prompsit.com>",
    "download_url": "https://files.pythonhosted.org/packages/81/af/eebba902a9938ae4d393089582bcdac6a0696519d61939db5cc5da9e4f2d/fastspell-0.6.1.tar.gz",
    "platform": null,
    "description": "# FastSpell\n\nTargetted language identifier, based on FastText and Hunspell.\n\n## How it works \n\nFastSpell will try to determine the language of a sentence by using **[FastText](https://fasttext.cc/)**.\n\nIf the language detected is very similar to the target language (i.e. FastText detected Spanish, while the targetted language is Galician), extra checks are performed with **[Hunspell](http://hunspell.github.io/)** to determine the language more precisely.\n\n\n## Requirements & Installation\n\n**FastSpell** can be installed from PyPI\n```\npip install fastspell\n```\nor directly from source:\n```\npip install .\n```\nNote that **Hunspell** requires `python-dev` and `libhunspell-dev`:\n\n```\nsudo apt-get install python-dev libhunspell-dev\n```\n\nBefore running FastSpell for any of the languages listed as [similar](https://github.com/mbanon/fastspell/blob/main/fastspell/config/similar.yaml), you must have all the [needed Hunspell dictionaries](https://github.com/mbanon/fastspell/blob/main/fastspell/config/hunspell.yaml) for that language.\nFor further explanation about how configuration works, [see below](#configuration).\nYou can use the `fastspell-download` command to download all the needed files for the default configuration, just run it without arguments:\n```\nfastspell-download\n```\n\n### Conda\nAlso, you can install the conda package:\n```\nconda install -c conda-forge -c bitextor fastspell\n```\n\n### RedHat installation\nFor RedHat and its derivatives\n```\nsudo dnf install hunspell hunspell-devel\n```\nmust be ran to install Hunspell.\n\nIf you found an installation error during `pip install hunspell` that says `/usr/bin/ld: cannot find -lhunspell`, you'll probably need to add a symlink to `/usr/lib64` or other path in your environment (like `/home/user/.local/lib`).\n```\nsudo ln -s /usr/lib64/libhunspell-1.7.so /usr/lib64/libhunspell.so\n```\n\n## Configuration\n\nA few configuration files are provided under the 
`fastspell/config` directory.\nIf you need to change default configuration, you can provide the path to your config directory with `-c`/`--config` or with the environment variable `FASTSPELL_CONFIG`.\n\n#### similar.yaml\n\nIn this dictionary-like file, similar languages are stored. These are the languages that are going to be \"double-checked\" with Hunspell after being identified with FastText. For example, see the line `gl: [es, pt, gl] `. This means that, when the targetted language is Galician, and FastText identifies a given sentence as Spanish, Portuguese or Galician, extra checks will be performed with Hunspell to confirm which of the three similar languages is more suitable for the sentence.\n\nPlease note that you need Hunspell dictionaries for all the languages in this file (if you use the `fastspell-download` command, there is nothing else to do). This file can be modified to remove a language you are not interested in, or a language for which you don't have Hunspell dictionaries, or to add new similar or target languages.\n\n#### hunspell.yaml\n\nIn this file, the names of the dictionaries are stored. All similar languages must be in this list in order to properly work.\n\nFor example, the first entry in the `hunspell_codes` is ` ca: ca_ES`, and the dictionary path is `~/.local/share/fastspell/`. 
That means that the Hunspell files for Catalan are  `~/.local/share/fastspell/ca_ES.dic` and `~/.local/share/fastspell/ca_ES.aff`.\n\nBy default `dicpath` is empty, which means FastSpell will look in these directories for the dictionaries:\n```\n~/.local/share/fastspell\n~/.local/share/hunspell\n$VIRTUAL_ENV/share/hunspell\n/usr/share/hunspell\n```\nTo use a custom path, put it in `dicpath` and will be the first one to search.\n\n\n## Usage\n\n### Module:\nIn order to use **FastSpell** as a Python module, just install and import it :\n```\nfrom fastspell import FastSpell\n```\nBuild a FastSpell object, like:\n```\nfsobj = FastSpell.FastSpell(\"en\", mode=\"cons\")\n```\n(learn more about modes in the section below)\n\nAnd then use the `getlang` function with the sentences you want to identify, for example:\n```\nfsobj.getlang(\"Hello, world\")\n#'en'\nfsobj.getlang(\"Hola, mundo\")\n#'es'\n\n```\n\n### CLI:\n```\niusage: fastspell [-h] [--aggr] [--cons] [--hbs] [-q] [--debug]\n                 [--logfile LOGFILE] [-v]\n                 lang [input] [output]\n\npositional arguments:\n  lang\n  input              Input sentences. (default: <_io.TextIOWrapper\n                     name='<stdin>' encoding='UTF-8'>)\n  output             Output of the language identification. 
(default:\n                     <_io.TextIOWrapper name='<stdout>' mode='w'\n                     encoding='UTF-8'>)\n\noptional arguments:\n  -h, --help         show this help message and exit\n  --aggr             Aggressive strategy (more positives) (default: False)\n  --cons             Conservative strategy (less positives) (default: False)\n  --hbs              Return all Serbo-Croatian variants as 'hbs' (default:\n                     False)\n\nLogging:\n  -q, --quiet        Silent logging mode (default: False)\n  --debug            Debug logging mode (default: False)\n  --logfile LOGFILE  Store log to a file (default: <_io.TextIOWrapper\n                     name='<stderr>' mode='w' encoding='UTF-8'>)\n  -v, --version      show version of this script and exit\n```\n\n## Aggressive vs Conservative\n\nFastSpell comes in two flavours: Aggressive and Conservative.\n\nThe **Aggressive** mode is less hesitant to tag a sentence with the target language, and never has doubts. The **Conservative** version, on the other hand, is more reluctant to tag a sentence with the target language and will use the `unk`(unknown) tag in case of doubt (when there is a tie between the target language and other language, for example)\n\n## Benchmark \n\n![comparative.png](comparative.png)\n\n\n## Usage example\n\nInput text:\n```\n19-01-2011 47 comentarios 7o Xornadas de Xardiner\u00eda de Galicia (RE)PLANTEAR\n\u2022 Proceso de valoraci\u00f3n de idoneidade: entrevistas psicosociais e visita domiciliaria e aplicaci\u00f3n de test psicol\u00f3xicos, se \u00e9 o caso.\n- Chrome e Firefox en MacOS non son compatibles (unicamente Safari \u00e9 compatible con MacOS), pero invocarase PSAL ao intentar empregar Chrome ou Firefox.\nMago da luz / Maga da luz\nCelebrada a homenaxe a Xos\u00e9 Manuel Seivane Rivas\nA instalaci\u00f3n el\u00e9ctrica en teletraballo\nSaltar \u00e1 navegaci\u00f3n Navegaci\u00f3n INICIO\nJulio Freire, competidor da FGA, invitado polo Kennel club de Inglaterra, 
para participar nos Crufts 2014 (Birmingham, 6 - 9 de marzo).\n25 de xullo - Truong Tan Sang toma posesi\u00f3n como presidente de Vietnam\nQuen pode solicitar o dito financiamento?\n```\nCommand:\n```\nfastspell $L --aggr inputtext\nfastspell $L --cons inputtext\n```\nAggressive output:\n```\n19-01-2011 47 comentarios 7o Xornadas de Xardiner\u00eda de Galicia (RE)PLANTEAR     gl\n\u2022 Proceso de valoraci\u00f3n de idoneidade: entrevistas psicosociais e visita domiciliaria e aplicaci\u00f3n de test psicol\u00f3xicos, se \u00e9 o caso.   gl\n- Chrome e Firefox en MacOS non son compatibles (unicamente Safari \u00e9 compatible con MacOS), pero invocarase PSAL ao intentar empregar Chrome ou Firefox.        gl\nMago da luz / Maga da luz       gl\nCelebrada a homenaxe a Xos\u00e9 Manuel Seivane Rivas        gl\nA instalaci\u00f3n el\u00e9ctrica en teletraballo gl\nSaltar \u00e1 navegaci\u00f3n Navegaci\u00f3n INICIO   gl\nJulio Freire, competidor da FGA, invitado polo Kennel club de Inglaterra, para participar nos Crufts 2014 (Birmingham, 6 - 9 de marzo). es\n25 de xullo - Truong Tan Sang toma posesi\u00f3n como presidente de Vietnam  gl\nQuen pode solicitar o dito financiamento?       gl\n```\n\nConservative output:\n```\n19-01-2011 47 comentarios 7o Xornadas de Xardiner\u00eda de Galicia (RE)PLANTEAR     unk\n\u2022 Proceso de valoraci\u00f3n de idoneidade: entrevistas psicosociais e visita domiciliaria e aplicaci\u00f3n de test psicol\u00f3xicos, se \u00e9 o caso.   gl\n- Chrome e Firefox en MacOS non son compatibles (unicamente Safari \u00e9 compatible con MacOS), pero invocarase PSAL ao intentar empregar Chrome ou Firefox.        
gl\nMago da luz / Maga da luz       unk\nCelebrada a homenaxe a Xos\u00e9 Manuel Seivane Rivas        gl\nA instalaci\u00f3n el\u00e9ctrica en teletraballo unk\nSaltar \u00e1 navegaci\u00f3n Navegaci\u00f3n INICIO   gl\nJulio Freire, competidor da FGA, invitado polo Kennel club de Inglaterra, para participar nos Crufts 2014 (Birmingham, 6 - 9 de marzo). es\n25 de xullo - Truong Tan Sang toma posesi\u00f3n como presidente de Vietnam  gl\nQuen pode solicitar o dito financiamento?       gl\n```\nGetting stats:\n```\ncat inputtext | fastspell $L --aggr | cut -f2 | sort | uniq -c | sort -nr\ncat inputtext | fastspell $L --cons | cut -f2 | sort | uniq -c | sort -nr\n```\nAggressive:\n```\n9 gl\n1 es\n```\nConservative:\n```\n6 gl\n3 unk\n1 es\n```\n\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Targetted language identifier, based on FastText and Hunspell.",
    "version": "0.6.1",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c89242dbdf199a7725769210bec05a00676b91a99fcf8984dcd2c49660be1ad4",
                "md5": "02813b480e2ac11b1dec3d9f5e0fbbca",
                "sha256": "ea3b7987a0086d1abe038549f04711b242991b1a04edc068b9a0cbbbe7b8574a"
            },
            "downloads": -1,
            "filename": "fastspell-0.6.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "02813b480e2ac11b1dec3d9f5e0fbbca",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 13407,
            "upload_time": "2023-01-25T10:28:32",
            "upload_time_iso_8601": "2023-01-25T10:28:32.371769Z",
            "url": "https://files.pythonhosted.org/packages/c8/92/42dbdf199a7725769210bec05a00676b91a99fcf8984dcd2c49660be1ad4/fastspell-0.6.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "81afeebba902a9938ae4d393089582bcdac6a0696519d61939db5cc5da9e4f2d",
                "md5": "bcd7e98059c6353bb9e8cf138db69b04",
                "sha256": "b1c64752d8a0f8684bc78ac0f812fc9697fc1e88dc3df82f8fcfeb9ba8a8f648"
            },
            "downloads": -1,
            "filename": "fastspell-0.6.1.tar.gz",
            "has_sig": false,
            "md5_digest": "bcd7e98059c6353bb9e8cf138db69b04",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 14935,
            "upload_time": "2023-01-25T10:28:33",
            "upload_time_iso_8601": "2023-01-25T10:28:33.922402Z",
            "url": "https://files.pythonhosted.org/packages/81/af/eebba902a9938ae4d393089582bcdac6a0696519d61939db5cc5da9e4f2d/fastspell-0.6.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-25 10:28:33",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "fastspell"
}
        