laser-encoders

Name: laser-encoders
Version: 0.0.2
Summary: LASER (Language-Agnostic SEntence Representations) is a toolkit to calculate multilingual sentence embeddings and to use them for document classification, bitext filtering and mining
Upload time: 2024-05-02 20:31:01
Author: Facebook AI Research
Requires Python: >=3.8
# LASER encoders

LASER Language-Agnostic SEntence Representations Toolkit

`laser_encoders` is the official Python package for the Facebook [LASER](https://github.com/facebookresearch/LASER) library. It provides a simple and convenient way to compute multilingual sentence embeddings with the LASER toolkit from Python. These embeddings can be used for various natural language processing tasks, including document classification, bitext filtering, and mining.

## Dependencies

- Python `>= 3.8`
- [PyTorch `>= 1.10.0`](http://pytorch.org/)
- sacremoses `>=0.1.0`
- sentencepiece `>=0.1.99`
- numpy `>=1.21.3`
- fairseq `>=0.12.2`

You can find the full list of requirements [here](https://github.com/facebookresearch/LASER/blob/main/pyproject.toml).

## Installation

You can install the `laser_encoders` package from PyPI:

```sh
pip install laser_encoders
```

Alternatively, you can install it from a local clone of this repository, in editable mode:
```sh
pip install -e .
```

## Usage

Here's a simple example of how to obtain embeddings for sentences using the `LaserEncoderPipeline`:

>**Note:** By default, the models will be downloaded to the `~/.cache/laser_encoders` directory. To specify a different download location, you can provide the argument `model_dir=path/to/model/directory`.

```py
from laser_encoders import LaserEncoderPipeline

# Initialize the LASER encoder pipeline
encoder = LaserEncoderPipeline(lang="igbo")

# Encode sentences into embeddings
embeddings = encoder.encode_sentences(["nnọọ, kedu ka ị mere"])
# If you want the output embeddings to be L2-normalized, set normalize_embeddings to True
normalized_embeddings = encoder.encode_sentences(["nnọọ, kedu ka ị mere"], normalize_embeddings=True)

```
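For reference, L2 normalization simply rescales each embedding vector to unit length. A minimal numpy sketch (independent of this package, using a toy 2-D array in place of real LASER output) of what `normalize_embeddings=True` produces:

```python
import numpy as np

def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    """Rescale each row vector to unit L2 norm."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

# Toy stand-in for encoder output (real LASER embeddings are 1024-dimensional)
emb = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = l2_normalize(emb)
print(unit)  # rows [0.6, 0.8] and [1.0, 0.0], each with unit norm
```

Normalized embeddings are convenient when comparing sentences by cosine similarity, since the dot product of two unit vectors is their cosine.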

If you prefer more control over the tokenization and encoding process, you can initialize the tokenizer and encoder separately:
```py
from laser_encoders import initialize_encoder, initialize_tokenizer

# Initialize the LASER tokenizer
tokenizer = initialize_tokenizer(lang="igbo")
tokenized_sentence = tokenizer.tokenize("nnọọ, kedu ka ị mere")

# Initialize the LASER sentence encoder
encoder = initialize_encoder(lang="igbo")

# Encode tokenized sentences into embeddings
embeddings = encoder.encode_sentences([tokenized_sentence])
```
>By default, the `spm` flag is set to `True` when initializing the encoder, ensuring the accompanying spm model is downloaded.

**Supported Languages:** You can specify any language from the [FLORES200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) dataset. This includes both languages identified by their full codes (like "ibo_Latn") and simpler alternatives (like "igbo").
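Conceptually, this lookup is just an alias table from plain language names to FLORES-200 codes. A hypothetical sketch (the names `ALIASES` and `resolve_lang` are illustrative, not the package's actual internals):

```python
# Hypothetical alias table mapping plain names to FLORES-200 codes
ALIASES = {"igbo": "ibo_Latn", "english": "eng_Latn"}

def resolve_lang(lang: str) -> str:
    """Return a FLORES-200 code, accepting either a plain name or a code."""
    return ALIASES.get(lang.lower(), lang)

print(resolve_lang("igbo"))      # ibo_Latn
print(resolve_lang("ibo_Latn"))  # ibo_Latn
```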

## Downloading the pre-trained models

If you prefer to download the models individually, you can use the following command:

```sh
python -m laser_encoders.download_models --lang=your_preferred_language  # e.g., --lang="igbo"
```

By default, the downloaded models will be stored in the `~/.cache/laser_encoders` directory. To specify a different download location, utilize the following command:

```sh
python -m laser_encoders.download_models --model-dir=path/to/model/directory
```

> For a comprehensive list of available arguments, you can use the `--help` flag with the `download_models` script.

Once you have successfully downloaded the models, you can utilize the `SentenceEncoder` to tokenize and encode your text in your desired language. Here's an example of how you can achieve this:

```py
from laser_encoders.models import SentenceEncoder
from pathlib import Path

encoder = SentenceEncoder(model_path="path/to/downloaded/model", spm_model=Path("path/to/spm_model"), spm_vocab="path/to/cvocab")
embeddings = encoder("This is a test sentence.")
```
If you want to perform tokenization separately, you can do it as follows:
```py
from laser_encoders.laser_tokenizer import LaserTokenizer

tokenizer = LaserTokenizer(spm_model=Path("path/to/spm_model"))

tokenized_sentence = tokenizer.tokenize("This is a test sentence.")

encoder = SentenceEncoder(model_path="path/to/downloaded/model", spm_vocab="path/to/cvocab")
embeddings = encoder.encode_sentences([tokenized_sentence])
```

For tokenizing a file instead of a string, you can use the following:

```py
tokenized_sentence = tokenizer.tokenize_file(inp_fname=Path("path/to/input_file.txt"), out_fname=Path("path/to/output_file.txt"))
```

### Now you can use these embeddings for downstream tasks

For more advanced usage and options, please refer to the official LASER repository documentation.
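As an illustration of a typical downstream use such as bitext mining, candidate sentence pairs can be scored by cosine similarity between their embeddings. A self-contained numpy sketch, using random vectors as stand-ins for real LASER output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for embeddings of a sentence, its translation, and an unrelated sentence
rng = np.random.default_rng(0)
src = rng.standard_normal(1024)                  # LASER embeddings are 1024-dimensional
tgt = src + 0.1 * rng.standard_normal(1024)      # perturbed copy: a "close" embedding
other = rng.standard_normal(1024)                # unrelated embedding

print(cosine_similarity(src, tgt) > cosine_similarity(src, other))  # True
```

In mining, pairs whose similarity exceeds a threshold (or passes a margin-based criterion) are retained as probable translations.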

## LASER Versions and Associated Packages

For users familiar with the earlier version of LASER, you might have encountered the [`laserembeddings`](https://pypi.org/project/laserembeddings/) package. This package primarily dealt with LASER-1 model embeddings.

For the latest LASER-2 and LASER-3 models, use the newly introduced `laser_encoders` package, which offers better performance and support for a wider range of languages.


## Contributing

We welcome contributions from the developer community to enhance and improve `laser_encoders`. If you'd like to contribute, you can:

1. Submit bug reports or feature requests through GitHub issues.
2. Fork the repository, make changes, and submit pull requests for review.

Please follow our [Contribution Guidelines](https://github.com/facebookresearch/LASER/blob/main/CONTRIBUTING.md) to ensure a smooth process.

### Code of Conduct

We expect all contributors to adhere to our [Code of Conduct](https://github.com/facebookresearch/LASER/blob/main/CODE_OF_CONDUCT.md).

### Contributors

The following people have contributed to this project:

- [Victor Joseph](https://github.com/CaptainVee)
- [Paul Okewunmi](https://github.com/Paulooh007)
- [Siddharth Singh Rana](https://github.com/NIXBLACK11)
- [David Dale](https://github.com/avidale/)
- [Holger Schwenk](https://github.com/hoschwenk)
- [Kevin Heffernan](https://github.com/heffernankevin)

### License

This package is released under the [LASER](https://github.com/facebookresearch/LASER/blob/main/LICENSE) BSD License.



            
