sonar-space

Name	sonar-space
Version	0.3.1
home_page	None
Summary	SONAR provides a set of speech and text encoders for multilingual, multimodal semantic embedding.
upload_time	2024-12-11 15:43:17
maintainer	None
docs_url	None
author	Meta AI Research
requires_python	>=3.8
license	None
keywords	sentence embeddings sentence representation sentence encoder sonar models speech2speech text2text speech2text text2speech multi-modal models multi-language models

# SONAR
[[Paper]](https://ai.meta.com/research/publications/sonar-sentence-level-multimodal-and-language-agnostic-representations/)
[[Demo]](#usage)

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.

Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.

*SONAR* stands for **S**entence-level multim**O**dal and la**N**guage-**A**gnostic **R**epresentations.

The full list of supported languages (along with download links) can be found [below](#supported-languages-and-download-links).

## SONAR Architecture:
<p align="center">
  <img src="materials/sonar_archi.png" width="800"><br />
</p>


## Text results
<p align="center">
  <img src="materials/sonar_text_resulsts.png" width="800"><br />
</p>

## Speech results
<p align="center">
  <img src="materials/sonar_langs.png" width="400"><br />
</p>


## Installing

You can install SONAR with `pip install sonar-space`. Note that there is another `sonar` package on pip that is NOT this project; make sure to use `sonar-space` in your dependencies.

If you want to install SONAR manually, you can install it locally from a checkout of the repository. SONAR depends mainly on [fairseq2](https://github.com/facebookresearch/fairseq2) and can be installed as follows (tested with `python=3.8`):

```bash
pip install --upgrade pip
pip install -e .
```

If fairseq2 does not provide a build for your machine, check the readme of that project to build it locally.

## Usage
fairseq2 will automatically download models into your `$TORCH_HOME/hub` directory upon using the commands below.
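
To see where the downloaded checkpoints would end up, the sketch below resolves `$TORCH_HOME/hub` using only the standard library. It assumes the usual PyTorch fallback (`$XDG_CACHE_HOME/torch`, then `~/.cache/torch`) when `TORCH_HOME` is unset; `torch_hub_dir` is a hypothetical helper, not part of SONAR or fairseq2.

```python
import os
from pathlib import Path


def torch_hub_dir(env=None):
    """Return the directory that `$TORCH_HOME/hub` resolves to."""
    env = os.environ if env is None else env
    torch_home = env.get("TORCH_HOME")
    if torch_home is None:
        # Assumed fallback, mirroring PyTorch: $XDG_CACHE_HOME/torch,
        # with XDG_CACHE_HOME itself defaulting to ~/.cache.
        cache_home = env.get("XDG_CACHE_HOME", "~/.cache")
        torch_home = os.path.join(cache_home, "torch")
    return Path(torch_home).expanduser() / "hub"


print(torch_hub_dir())
```

Setting `TORCH_HOME` before running the examples is then one way to redirect the downloads to a disk with more free space.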

### Compute text sentence embeddings with SONAR:
```python
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
                                           tokenizer="text_sonar_basic_encoder")
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")
print(embeddings.shape)
# torch.Size([2, 1024])
```
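
A common next step is to compare two embeddings. The sketch below computes cosine similarity in plain Python; the 4-dimensional vectors are toy stand-ins for real 1024-dimensional SONAR embeddings, and `cosine_similarity` is a hypothetical helper rather than a SONAR function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 4-dimensional stand-ins for two 1024-dimensional SONAR embeddings.
emb_a = [0.5, 0.1, -0.3, 0.8]
emb_b = [0.4, 0.0, -0.2, 0.9]
print(round(cosine_similarity(emb_a, emb_b), 3))
```

With real embeddings, a row can be converted with `embeddings[0].tolist()`, or the same quantity can be computed directly on tensors with `torch.nn.functional.cosine_similarity`.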

### Reconstruct text from SONAR embeddings
```python
from sonar.inference_pipelines.text import EmbeddingToTextModelPipeline
vec2text_model = EmbeddingToTextModelPipeline(decoder="text_sonar_basic_decoder",
                                              tokenizer="text_sonar_basic_encoder")
reconstructed = vec2text_model.predict(embeddings, target_lang="eng_Latn", max_seq_len=512)
# max_seq_len is a keyword argument passed to the fairseq2 BeamSearchSeq2SeqGenerator.
print(reconstructed)
# ['My name is SONAR.', 'I can embed the sentences into vector space.']
```

### Translate text with SONAR
```python
from sonar.inference_pipelines.text import TextToTextModelPipeline
t2t_model = TextToTextModelPipeline(encoder="text_sonar_basic_encoder",
                                    decoder="text_sonar_basic_decoder",
                                    tokenizer="text_sonar_basic_encoder")  # tokenizer is attached to both encoder and decoder cards

sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2t_model.predict(sentences, source_lang="eng_Latn", target_lang="fra_Latn")
# ['Mon nom est SONAR.', "Je peux intégrer les phrases dans l'espace vectoriel."]
```
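
The language codes used above, such as `eng_Latn` and `fra_Latn`, pair a three-letter language code with a four-letter script code. The sketch below validates only that shape, which can catch typos before a model is loaded; it does not check whether SONAR actually supports the language. `LANG_CODE_RE` and `split_lang_code` are hypothetical names introduced for illustration.

```python
import re

# Three lowercase letters (ISO 639-3 language), an underscore, then a
# four-letter script code in title case (ISO 15924), e.g. "eng_Latn".
LANG_CODE_RE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def split_lang_code(code):
    """Split a code such as "fra_Latn" into ("fra", "Latn")."""
    if LANG_CODE_RE.fullmatch(code) is None:
        raise ValueError(f"not a language_Script code: {code!r}")
    language, script = code.split("_")
    return language, script


print(split_lang_code("fra_Latn"))  # ('fra', 'Latn')
```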

### Compute speech sentence embeddings with SONAR
```python
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
import torchaudio

s2vec_model = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")

# passing paths to wav files
s2vec_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
                     "./tests/integration_tests/data/audio_files/audio_2.wav"]).shape
# torch.Size([2, 1024])

# passing an already loaded waveform
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"

s2vec_model.predict([inp]).shape
# torch.Size([1, 1024])
```

### Speech-to-text translation with SONAR
```python
from sonar.inference_pipelines.speech import SpeechToTextModelPipeline
import torchaudio

s2t_model = SpeechToTextModelPipeline(encoder="sonar_speech_encoder_eng",
                                      decoder="text_sonar_basic_decoder",
                                      tokenizer="text_sonar_basic_decoder")

inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"

# passing loaded audio files
s2t_model.predict([inp], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.']

# passing multiple wav files 
s2t_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
                   "./tests/integration_tests/data/audio_files/audio_2.wav"], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.',
# 'These couples may choose to make an adoption plan for their baby.']
```
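
Both speech examples assert a 16 kHz sample rate on the loaded waveform. When many files are passed as paths, it can be cheaper to check the WAV header first. The sketch below does this with the standard-library `wave` module, which only handles uncompressed PCM WAV files; `wav_sample_rate` is a hypothetical helper, and the silent demo file is generated only to make the example self-contained.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Read the sample rate from a WAV header without decoding the audio."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


# Write one second of 16-bit mono silence at 16 kHz to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "silence.wav")
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # bytes per sample
        wav_file.setframerate(16000)
        wav_file.writeframes(b"\x00\x00" * 16000)
    demo_rate = wav_sample_rate(demo_path)

print(demo_rate)  # 16000
```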


### Predicting sentence similarity with BLASER 2.0 models

BLASER 2.0 is a family of models for automatic evaluation of machine translation quality based on SONAR embeddings.
They predict [cross-lingual semantic similarity](https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/human_XSTS_eval) 
between the translation and the source (optionally, also using a reference translation). 

```python
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
from sonar.models.blaser.loader import load_blaser_model

blaser_ref = load_blaser_model("blaser_2_0_ref").eval()
blaser_qe = load_blaser_model("blaser_2_0_qe").eval()
text_embedder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")

src_embs = text_embedder.predict(["Le chat s'assit sur le tapis."], source_lang="fra_Latn")
ref_embs = text_embedder.predict(["The cat sat on the mat."], source_lang="eng_Latn")
mt_embs = text_embedder.predict(["The cat sat down on the carpet."], source_lang="eng_Latn")

print(blaser_ref(src=src_embs, ref=ref_embs, mt=mt_embs).item())  # 4.688
print(blaser_qe(src=src_embs, mt=mt_embs).item())  # 4.708
```

Detailed model cards with more examples: [facebook/blaser-2.0-ref](https://huggingface.co/facebook/blaser-2.0-ref), 
[facebook/blaser-2.0-qe](https://huggingface.co/facebook/blaser-2.0-qe). 

### Classifying the toxicity of sentences with MuTox

[MuTox](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/toxicity/mutox) is the first highly multilingual audio-based binary classifier and dataset with toxicity labels. The dataset consists of 20k audio utterances for English and Spanish, and 4k for the other 19 languages; the classifier uses the multimodal and multilingual encoders from SONAR. The output of the MuTox classifier is a logit of the evaluated input being _"toxic"_, according to the definition adopted in the corresponding dataset.

```python
from sonar.models.mutox.loader import load_mutox_model
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
import torch

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    dtype = torch.float16
else:
    device = torch.device("cpu")
    dtype = torch.float32

t2vec_model = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
    device=device,
)
classifier = load_mutox_model(
    "sonar_mutox",
    device=device,
    dtype=dtype,
).eval()

with torch.inference_mode():
    emb = t2vec_model.predict(["De peur que le pays ne se prostitue et ne se remplisse de crimes."], source_lang='fra_Latn')
    x = classifier(emb.to(device).to(dtype)) # tensor([[-19.7812]], device='cuda:0', dtype=torch.float16)

with torch.inference_mode():
    emb = t2vec_model.predict(["She worked hard and made a significant contribution to the team."], source_lang='eng_Latn')
    x = classifier(emb.to(device).to(dtype)) # tensor([[-53.5938]], device='cuda:0', dtype=torch.float16)

with torch.inference_mode():
    emb = t2vec_model.predict(["El no tiene ni el más mínimo talento, todo lo que ha logrado ha sido gracias a sobornos y manipulaciones."], source_lang='spa_Latn')
    x = classifier(emb.to(device).to(dtype)) # tensor([[-21.4062]], device='cuda:0', dtype=torch.float16)
```

For a CLI way of running the MuTox pipeline, go to [Seamless Communication/.../MuTox](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/toxicity/mutox).
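
Since the classifier returns a logit, it can help to map it to a value between 0 and 1. The sketch below applies the logistic (sigmoid) function, assuming the conventional reading of a binary classifier's logit; this conversion and any decision threshold built on it are assumptions, not something the MuTox documentation quoted here prescribes. The input values are the logits shown in the comments of the example above.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


# Logits copied from the comments in the MuTox example.
for logit in (-19.7812, -53.5938, -21.4062):
    print(f"{logit:>9.4f} -> {sigmoid(logit):.3e}")
```

On a tensor output, `torch.sigmoid(x)` computes the same mapping elementwise.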

### Demo notebooks
See the more complete demo notebooks:

* [sonar text2text similarity and translation](examples/sonar_text_demo.ipynb)
* [sonar speech2text and other data pipeline examples](examples/inference_pipelines.ipynb)
* [sonar bilingual document alignment with sonar text similarity](examples/bilingual_document.ipynb)


## Supported languages and download links
The SONAR text encoder and decoder support 200 languages. SONAR speech encoders support 37 languages.

<details>
<summary>Available text encoders/decoders</summary>

| model             | link                                                                               |
| ----------------- | ---------------------------------------------------------------------------------- |
| encoder           | [download](https://dl.fbaipublicfiles.com/SONAR/sonar_text_encoder.pt)             |
| decoder           | [download](https://dl.fbaipublicfiles.com/SONAR/sonar_text_encoder.pt)             |
| finetuned decoder | [download](https://dl.fbaipublicfiles.com/SONAR/finetuned_decoder.pt)              |
| tokenizer         | [download](https://dl.fbaipublicfiles.com/SONAR/sentencepiece.source.256000.model) |

All 200 languages from the [No Language Left Behind project](https://arxiv.org/abs/2207.04672) are supported.

</details>

<details>
<summary>Available speech encoders</summary>

| lang_code | language         | link                                                               |
| --------- | ---------------- | ------------------------------------------------------------------ |
| arb       | ms arabic        | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.arb.pt) |
| asm       | assamese         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.asm.pt) |
| bel       | belarusian       | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.bel.pt) |
| ben       | bengali          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ben.pt) |
| bos       | bosnian          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.bos.pt) |
| bul       | bulgarian        | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.bul.pt) |
| cat       | catalan          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.cat.pt) |
| ces       | czech            | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ces.pt) |
| cmn       | mandarin chinese | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.cmn.pt) |
| cym       | welsh            | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.cym.pt) |
| dan       | danish           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.dan.pt) |
| deu       | german           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.deu.pt) |
| est       | estonian         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.est.pt) |
| fin       | finnish          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.fin.pt) |
| fra       | french           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.fra.pt) |
| guj       | gujarati         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.guj.pt) |
| heb       | hebrew           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.heb.pt) |
| hin       | hindi            | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.hin.pt) |
| hrv       | croatian         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.hrv.pt) |
| ind       | indonesian       | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.ind.pt) |
| ita       | italian          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.ita.pt) |
| jpn       | japanese         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.jpn.pt) |
| kan       | kannada          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.jan.pt) |
| kor       | korean           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.kor.pt) |
| lao       | lao              | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.lao.pt) |
| lit       | lithuanian       | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.lit.pt) |
| lvs       | standard latvian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.lvs.pt) |
| mal       | malayalam        | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mal.pt) |
| mar       | marathi          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mar.pt) |
| mkd       | macedonian       | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mkd.pt) |
| mlt       | maltese          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mlt.pt) |
| npi       | nepali           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.npi.pt) |
| nld       | dutch            | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.nld.pt) |
| ory       | odia             | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ory.pt) |
| pan       | punjabi          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.pan.pt) |
| pes       | western persian  | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.pes.pt) |
| pol       | polish           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.po.pt)  |
| por       | portuguese       | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.por.pt) |
| ron       | romanian         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.ron.pt) |
| rus       | russian          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.rus.pt) |
| slk       | slovak           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.slk.pt) |
| slv       | slovenian        | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.slv.pt) |
| snd       | sindhi           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.snd.pt) |
| srp       | serbian          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.srp.pt) |
| spa       | spanish          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.spa.pt) |
| swe       | swedish          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.swe.pt) |
| swh       | swahili          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.swh.pt) |
| tam       | tamil            | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.tam.pt) |
| tel       | telugu           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.tel.pt) |
| tgl       | tagalog          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.tgl.pt) |
| tha       | thai             | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.tha.pt) |
| tur       | turkish          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.tur.pt) |
| ukr       | ukrainian        | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ukr.pt) |
| urd       | urdu             | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.urd.pt) |
| uzn       | northern uzbek   | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.uzn.pt) |
| vie       | vietnamese       | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.vie.pt) |
| yue       | yue              | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.yue.pt) |

</details>

## Citation Information

Please cite the paper when referencing the SONAR embedding space, encoders and decoders as:

```
@misc{Duquenne:2023:sonar_arxiv,
  author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},
  title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},
  publisher = {arXiv},
  year = {2023},
  url = {https://arxiv.org/abs/2308.11466},
}
```

## Contributing

See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.

## License

SONAR code is released under the MIT license (see [CODE_LICENSE](CODE_LICENSE.md)).

Some SONAR models are released under the same MIT license, but beware:
others are released under a non-commercial license (see [NC_MODEL_LICENSE](NC_MODEL_LICENSE.md)).
Please refer to [LICENSE](LICENSE.md) for the details.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "sonar-space",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "sentence embeddings, sentence representation, sentence encoder, sonar models, speech2speech, text2text, speech2text, text2speech, multi-modal models, multi-language models",
    "author": "Meta AI Research",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/ca/e9/607886df1275fcb631eb91f3ab47b3f231444a320da97ea69a8adfe62340/sonar_space-0.3.1.tar.gz",
    "platform": null,
    "description": "# SONAR\n[[Paper]](https://ai.meta.com/research/publications/sonar-sentence-level-multimodal-and-language-agnostic-representations/)\n[[Demo]](#usage)\n\nWe introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LabSE on the xsim and xsim++ multilingual similarity search tasks. \n\nSpeech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.\n\n*SONAR* stands for **S**entence-level multim**O**dal and la**N**guage-**A**gnostic **R**epresentations\n\nThe full list of supported languages (along with download links) can be found here [below](#supported-languages-and-download-links).\n\n## SONAR Architecture:\n<p align=\"center\">\n  <img src=\"materials/sonar_archi.png\" width=\"800\"><br />\n</p>\n\n\n## Text results\n<p align=\"center\">\n  <img src=\"materials/sonar_text_resulsts.png\" width=\"800\"><br />\n</p>\n\n## Speech results\n<p align=\"center\">\n  <img src=\"materials/sonar_langs.png\" width=\"400\"><br />\n</p>\n\n\n## Installing\n\nYou can install SONAR with `pip install sonar-space`. Note that there is another `sonar` package on pip that IS NOT this project, make sure to use `sonar-space` in your dependencies.\n\nIf you want to install SONAR manually, you can install it localy. 
SONAR depends mainly on [Fairseq2](https://github.com/facebookresearch/fairseq2) and can be installed using (tested with `python=3.8`)\n\n```bash\npip install --upgrade pip\npip install -e .\n```\n\nIf fairseq2 does not provide a build for your machine, check the readme of that project to build it locally.\n\n## Usage\nfairseq2 will automatically download models into your `$TORCH_HOME/hub` directory upon using the commands below.\n\n### Compute text sentence embeddings with SONAR:\n```python\nfrom sonar.inference_pipelines.text import TextToEmbeddingModelPipeline\nt2vec_model = TextToEmbeddingModelPipeline(encoder=\"text_sonar_basic_encoder\",\n                                           tokenizer=\"text_sonar_basic_encoder\")\nsentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']\nembeddings = t2vec_model.predict(sentences, source_lang=\"eng_Latn\")\nprint(embeddings.shape)\n# torch.Size([2, 1024])\n```\n\n### Reconstruct text from SONAR embeddings\n```python\nfrom sonar.inference_pipelines.text import EmbeddingToTextModelPipeline\nvec2text_model = EmbeddingToTextModelPipeline(decoder=\"text_sonar_basic_decoder\",\n                                              tokenizer=\"text_sonar_basic_encoder\")\nreconstructed = vec2text_model.predict(embeddings, target_lang=\"eng_Latn\", max_seq_len=512)\n# max_seq_len is a keyword argument passed to the fairseq2 BeamSearchSeq2SeqGenerator.\nprint(reconstructed)\n# ['My name is SONAR.', 'I can embed the sentences into vector space.']\n```\n\n### Translate text with SONAR\n```python\nfrom sonar.inference_pipelines.text import TextToTextModelPipeline\nt2t_model = TextToTextModelPipeline(encoder=\"text_sonar_basic_encoder\",\n                                    decoder=\"text_sonar_basic_decoder\",\n                                    tokenizer=\"text_sonar_basic_encoder\")  # tokenizer is attached to both encoder and decoder cards\n\nsentences = ['My name is SONAR.', 'I can embed the sentences into 
vectorial space.']\nt2t_model.predict(sentences, source_lang=\"eng_Latn\", target_lang=\"fra_Latn\")\n# ['Mon nom est SONAR.', \"Je peux int\u00e9grer les phrases dans l'espace vectoriel.\"]\n```\n\n### Compute speech sentence embeddings with SONAR\n```python\nfrom sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline\ns2vec_model = SpeechToEmbeddingModelPipeline(encoder=\"sonar_speech_encoder_eng\")\n\ns2vec_model.predict([\"./tests/integration_tests/data/audio_files/audio_1.wav\",\n                     \"./tests/integration_tests/data/audio_files/audio_2.wav\"]).shape\n# torch.Size([2, 1024])\nimport torchaudio\ninp, sr = torchaudio.load(\"./tests/integration_tests/data/audio_files/audio_1.wav\")\nassert sr == 16000, \"Sample rate should be 16kHz\"\n\ns2vec_model.predict([inp]).shape\n# torch.Size([1, 1024])\n```\n\n### Speech-to-text translation with SONAR\n```python\nfrom sonar.inference_pipelines.speech import SpeechToTextModelPipeline\n\ns2t_model = SpeechToTextModelPipeline(encoder=\"sonar_speech_encoder_eng\",\n                                      decoder=\"text_sonar_basic_decoder\",\n                                      tokenizer=\"text_sonar_basic_decoder\")\n\nimport torchaudio\ninp, sr = torchaudio.load(\"./tests/integration_tests/data/audio_files/audio_1.wav\")\nassert sr == 16000, \"Sample rate should be 16kHz\"\n\n# passing loaded audio files\ns2t_model.predict([inp], target_lang=\"eng_Latn\")\n# ['Television reports show white smoke coming from the plant.']\n\n# passing multiple wav files \ns2t_model.predict([\"./tests/integration_tests/data/audio_files/audio_1.wav\",\n                   \"./tests/integration_tests/data/audio_files/audio_2.wav\"], target_lang=\"eng_Latn\")\n# ['Television reports show white smoke coming from the plant.',\n# 'These couples may choose to make an adoption plan for their baby.']\n```\n\n\n### Predicting sentence similarity with BLASER 2.0 models\n\nBLASER 2.0 is a family of models for automatic 
evaluation of machine translation quality based on SONAR embeddings.\nThey predict [cross-lingual semantic similarity](https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/human_XSTS_eval) \nbetween the translation and the source (optionally, also using a reference translation). \n\n```Python\nfrom sonar.inference_pipelines.text import TextToEmbeddingModelPipeline\nfrom sonar.models.blaser.loader import load_blaser_model\n\nblaser_ref = load_blaser_model(\"blaser_2_0_ref\").eval()\nblaser_qe = load_blaser_model(\"blaser_2_0_qe\").eval()\ntext_embedder = TextToEmbeddingModelPipeline(encoder=\"text_sonar_basic_encoder\", tokenizer=\"text_sonar_basic_encoder\")\n\nsrc_embs = text_embedder.predict([\"Le chat s'assit sur le tapis.\"], source_lang=\"fra_Latn\")\nref_embs = text_embedder.predict([\"The cat sat on the mat.\"], source_lang=\"eng_Latn\")\nmt_embs = text_embedder.predict([\"The cat sat down on the carpet.\"], source_lang=\"eng_Latn\")\n\nprint(blaser_ref(src=src_embs, ref=ref_embs, mt=mt_embs).item())  # 4.688\nprint(blaser_qe(src=src_embs, mt=mt_embs).item())  # 4.708\n```\n\nDetailed model cards with more examples: [facebook/blaser-2.0-ref](https://huggingface.co/facebook/blaser-2.0-ref), \n[facebook/blaser-2.0-qe](https://huggingface.co/facebook/blaser-2.0-qe). \n\n### Classifying the toxicity of sentences with MuTox\n\n[MuTox](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/toxicity/mutox), the first highly multilingual audio-based classifier (binary) and dataset with toxicity labels. The dataset consists of 20k audio utterances for English and Spanish, and 4k for the other 19 languages, and uses the multi-model and multilingual encoders from SONAR. 
The output of the MuTox classifier is a logit of the evaluated being _\"toxic\"_, according to the definition adopted in the corresponding dataset.\n\n```Python\nfrom sonar.models.mutox.loader import load_mutox_model\nfrom sonar.inference_pipelines.text import TextToEmbeddingModelPipeline\nimport torch\n\nif torch.cuda.is_available():\n    device = torch.device(\"cuda:0\")\n    dtype = torch.float16\nelse:\n    device = torch.device(\"cpu\")\n    dtype = torch.float32\n\nt2vec_model = TextToEmbeddingModelPipeline(\n    encoder=\"text_sonar_basic_encoder\",\n    tokenizer=\"text_sonar_basic_encoder\",\n    device=device,\n)\ntext_column='lang_txt'\nclassifier = load_mutox_model(\n    \"sonar_mutox\",\n    device=device,\n    dtype=dtype,\n).eval()\n\nwith torch.inference_mode():\n    emb = t2vec_model.predict([\"De peur que le pays ne se prostitue et ne se remplisse de crimes.\"], source_lang='fra_Latn')\n    x = classifier(emb.to(device).to(dtype)) # tensor([[-19.7812]], device='cuda:0', dtype=torch.float16)\n\nwith torch.inference_mode():\n    emb = t2vec_model.predict([\"She worked hard and made a significant contribution to the team.\"], source_lang='eng_Latn')\n    x = classifier(emb.to(device).to(dtype)) # tensor([[-53.5938]], device='cuda:0', dtype=torch.float16)\n\nwith torch.inference_mode():\n    emb = t2vec_model.predict([\"El no tiene ni el m\u00e1s m\u00ednimo talento, todo lo que ha logrado ha sido gracias a sobornos y manipulaciones.\"], source_lang='spa_Latn')\n    x = classifier(emb.to(device).to(dtype)) # tensor([[-21.4062]], device='cuda:0', dtype=torch.float16)\n```\n\nFor a CLI way of running the MuTox pipeline, go to [Seamless Communication/.../MuTox](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/toxicity/mutox).\n\n### Demo notebooks\nSee more complete demo notebooks :\n\n* [sonar text2text similarity and translation](examples/sonar_text_demo.ipynb)\n* [sonar speech2text and other data 
pipeline examples](examples/inference_pipelines.ipynb)\n* [sonar bilingual document alignment with sonar text similarity](examples/bilingual_document.ipynb)\n\n\n## Supported languages and download links\nThe SONAR text encoder & decoder supports 200 languages. SONAR speech encoders support 37 languages.\n\n<details>\n<summary>Available text encoders/decoders</summary>\n\n| model             | link                                                                               |\n| ----------------- | ---------------------------------------------------------------------------------- |\n| encoder           | [download](https://dl.fbaipublicfiles.com/SONAR/sonar_text_encoder.pt)             |\n| decoder           | [download](https://dl.fbaipublicfiles.com/SONAR/sonar_text_encoder.pt)             |\n| finetuned decoder | [download](https://dl.fbaipublicfiles.com/SONAR/finetuned_decoder.pt)              |\n| tokenizer         | [download](https://dl.fbaipublicfiles.com/SONAR/sentencepiece.source.256000.model) |\n\nAll 200 languages from the [No Language Left Behind project](https://arxiv.org/abs/2207.04672) are supported.\n\n</details>\n\n<details>\n<summary>Available speech encoders</summary>\n\n| lang_code | language         | link                                                               |\n| --------- | ---------------- | ------------------------------------------------------------------ |\n| arb       | ms arabic        | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.arb.pt) |\n| asm       | assamese         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.asm.pt) |\n| bel       | belarussian      | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.bel.pt) |\n| ben       | bengali          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ben.pt) |\n| bos       | bosnian          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.bos.pt) |\n| bul       | bulgarian        | 
[download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.bul.pt) |\n| cat       | catalan          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.cat.pt) |\n| ces       | czech            | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ces.pt) |\n| cmn       | mandarin chinese | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.cmn.pt) |\n| cym       | welsh            | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.cym.pt) |\n| dan       | danish           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.dan.pt) |\n| deu       | german           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.deu.pt) |\n| est       | estonian         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.est.pt) |\n| fin       | finnish          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.fin.pt) |\n| fra       | french           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.fra.pt) |\n| guj       | gujurati         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.guj.pt) |\n| heb       | hebrew           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.heb.pt) |\n| hin       | hindi            | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.hin.pt) |\n| hrv       | croatian         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.hrv.pt) |\n| ind       | indonesian       | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.ind.pt) |\n| ita       | italian          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.ita.pt) |\n| jpn       | japanse          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.jpn.pt) |\n| kan       | kannada          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.jan.pt) |\n| kor       | korean           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.kor.pt) |\n| lao       | lao              | 
[download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.lao.pt) |\n| lit       | lithaian         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.lit.pt) |\n| lvs       | standard latvian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.lvs.pt) |\n| mal       | malayalam        | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mal.pt) |\n| mar       | marathi          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mar.pt) |\n| mkd       | macedonian       | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mkd.pt) |\n| mlt       | maltese          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mlt.pt) |\n| npi       | nepali           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.npi.pt) |\n| nld       | dutch            | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.nld.pt) |\n| ory       | odia             | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ory.pt) |\n| pan       | punjabi          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.pan.pt) |\n| pes       | western persian  | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.pes.pt) |\n| pol       | polish           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.po.pt)  |\n| por       | portuguese       | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.por.pt) |\n| ron       | romanian         | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.ron.pt) |\n| rus       | russian          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.rus.pt) |\n| slk       | slovak           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.slk.pt) |\n| slv       | slovenian        | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.slv.pt) |\n| snd       | sindhi           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.snd.pt) |\n| srp       | serbian          | 
[download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.srp.pt) |\n| spa       | spanish          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.spa.pt) |\n| swe       | swedish          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.swe.pt) |\n| swh       | swahili          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.swh.pt) |\n| tam       | tamil            | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.tam.pt) |\n| tel       | telugu           | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.tel.pt) |\n| tgl       | tagalog          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.tgl.pt) |\n| tha       | thai             | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.tha.pt) |\n| tur       | turkish          | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.tur.pt) |\n| ukr       | ukrainian        | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ukr.pt) |\n| urd       | urdu             | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.urd.pt) |\n| uzn       | northern uzbek   | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.uzn.pt) |\n| vie       | vietnamese       | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.vie.pt) |\n| yue       | yue              | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.yue.pt) |\n\n</details>\n\n## Citation Information\n\nPlease cite the paper when referencing the SONAR embedding space, encoders and decoders as:\n\n```\n@misc{Duquenne:2023:sonar_arxiv,\n  author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},\n  title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},\n  publisher = {arXiv},\n  year = {2023},\n  url = {https://arxiv.org/abs/2308.11466},\n}\n```\n\n## Contributing\n\nSee the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.\n\n## License\n\nSONAR code is released under the MIT license (see 
[CODE_LICENSE](CODE_LICENSE.md)).\n\nSome SONAR models are released under the same MIT license, BUT BEWARE:\nsome are released under a non-commercial license (see [NC_MODEL_LICENSE](NC_MODEL_LICENSE.md)).\nPlease refer to [LICENSE](LICENSE.md) for the details.\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "SONAR provides a set of speech and text encoders for multilingual, multimodal semantic embedding.",
    "version": "0.3.1",
    "project_urls": {
        "Source": "https://github.com/facebookresearch/SONAR",
        "Tracker": "https://github.com/facebookresearch/SONAR/issues"
    },
    "split_keywords": [
        "sentence embeddings",
        " sentence representation",
        " sentence encoder",
        " sonar models",
        " speech2speech",
        " text2text",
        " speech2text",
        " text2speech",
        " multi-modal models",
        " multi-language models"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "57c9be8d3e845f4e93e7f678a0e2f56bef3483e5c3133270e48de89665d8135f",
                "md5": "37d7c178ae211adb3b7037dfaca0996c",
                "sha256": "cd6fd58c53487d779aa381bc319c2a4ce8b9b8cca571414bd7ad7daac32e897c"
            },
            "downloads": -1,
            "filename": "sonar_space-0.3.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "37d7c178ae211adb3b7037dfaca0996c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 54147,
            "upload_time": "2024-12-11T15:43:14",
            "upload_time_iso_8601": "2024-12-11T15:43:14.606362Z",
            "url": "https://files.pythonhosted.org/packages/57/c9/be8d3e845f4e93e7f678a0e2f56bef3483e5c3133270e48de89665d8135f/sonar_space-0.3.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cae9607886df1275fcb631eb91f3ab47b3f231444a320da97ea69a8adfe62340",
                "md5": "a959d4be4eca38bd48765f2c536240b3",
                "sha256": "44262b10a4d158fffeb7849abf0ead4d73fbe7262f20172874fd6a991d976a65"
            },
            "downloads": -1,
            "filename": "sonar_space-0.3.1.tar.gz",
            "has_sig": false,
            "md5_digest": "a959d4be4eca38bd48765f2c536240b3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 36236,
            "upload_time": "2024-12-11T15:43:17",
            "upload_time_iso_8601": "2024-12-11T15:43:17.949732Z",
            "url": "https://files.pythonhosted.org/packages/ca/e9/607886df1275fcb631eb91f3ab47b3f231444a320da97ea69a8adfe62340/sonar_space-0.3.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-11 15:43:17",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "facebookresearch",
    "github_project": "SONAR",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "sonar-space"
}
