# SONAR
[[Paper]](https://ai.meta.com/research/publications/sonar-sentence-level-multimodal-and-language-agnostic-representations/)
[[Demo]](#usage)
We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.
Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.
*SONAR* stands for **S**entence-level multim**O**dal and la**N**guage-**A**gnostic **R**epresentations.
The full list of supported languages (along with download links) can be found [below](#supported-languages-and-download-links).
## SONAR Architecture
<p align="center">
<img src="materials/sonar_archi.png" width="800"><br />
</p>
## Text results
<p align="center">
<img src="materials/sonar_text_resulsts.png" width="800"><br />
</p>
## Speech results
<p align="center">
<img src="materials/sonar_langs.png" width="400"><br />
</p>
## Installing
You can install SONAR with `pip install sonar-space`. Note that there is another `sonar` package on pip that IS NOT this project; make sure to use `sonar-space` in your dependencies.
If you want to install SONAR manually, you can install it locally from a checkout of the source. SONAR depends mainly on [fairseq2](https://github.com/facebookresearch/fairseq2) and can be installed with the commands below (tested with `python=3.8`):
```bash
pip install --upgrade pip
pip install -e .
```
If fairseq2 does not provide a build for your machine, check the readme of that project to build it locally.
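Because the similarly named `sonar` distribution is a different project, it can be worth checking which of the two is actually present in an environment. The sketch below relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is illustrative and not part of SONAR.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report on both names: only `sonar-space` is this project.
for name in ("sonar-space", "sonar"):
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```

If `sonar` shows a version and `sonar-space` does not, the wrong package was probably installed.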
## Usage
fairseq2 will automatically download models into your `$TORCH_HOME/hub` directory upon using the commands below.
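To see where the downloaded checkpoints are likely to land, the following sketch resolves `$TORCH_HOME/hub` using PyTorch's documented default (`$XDG_CACHE_HOME/torch`, with `XDG_CACHE_HOME` falling back to `~/.cache`). The function name `expected_hub_dir` is hypothetical, and fairseq2 may honor additional settings of its own that this sketch ignores.

```python
import os
from pathlib import Path
from typing import Mapping


def expected_hub_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Resolve `$TORCH_HOME/hub`, falling back to PyTorch's default cache root."""
    torch_home = env.get("TORCH_HOME")
    if torch_home is None:
        # PyTorch default: $XDG_CACHE_HOME/torch, where XDG_CACHE_HOME defaults to ~/.cache
        cache_root = env.get("XDG_CACHE_HOME", str(Path.home() / ".cache"))
        torch_home = str(Path(cache_root) / "torch")
    return Path(torch_home).expanduser() / "hub"


print(expected_hub_dir())
```

Setting `TORCH_HOME` before running the examples is one way to move the downloads to a disk with more space.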
### Compute text sentence embeddings with SONAR
```python
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
tokenizer="text_sonar_basic_encoder")
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")
print(embeddings.shape)
# torch.Size([2, 1024])
```
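Since every sentence is mapped to a vector of the same size, a common way to compare two sentences is the cosine similarity of their embeddings, and a nearest-neighbour lookup over those similarities is the general idea behind similarity search. The sketch below uses short toy vectors in plain Python as stand-ins for the 1024-dimensional embeddings; with real SONAR outputs one would more likely call `torch.nn.functional.cosine_similarity` on the returned tensor. The helper names and the choice of cosine similarity are assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest(query: Sequence[float], candidates: Sequence[Sequence[float]]) -> int:
    """Index of the candidate most similar to the query."""
    return max(range(len(candidates)),
               key=lambda i: cosine_similarity(query, candidates[i]))


# Toy 3-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.1, 1.9], [-1.0, 0.0, -1.0]]
print(nearest(query, candidates))
```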
### Reconstruct text from SONAR embeddings
```python
from sonar.inference_pipelines.text import EmbeddingToTextModelPipeline
vec2text_model = EmbeddingToTextModelPipeline(decoder="text_sonar_basic_decoder",
tokenizer="text_sonar_basic_encoder")
reconstructed = vec2text_model.predict(embeddings, target_lang="eng_Latn", max_seq_len=512)
# max_seq_len is a keyword argument passed to the fairseq2 BeamSearchSeq2SeqGenerator.
print(reconstructed)
# ['My name is SONAR.', 'I can embed the sentences into vector space.']
```
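Reconstruction is not guaranteed to be verbatim: in the output above, "vectorial space" came back as "vector space". One rough way to quantify such drift is a character-level similarity ratio from the standard library's `difflib`. This is an illustrative check, not an evaluation metric used by the SONAR authors, and the function name `reconstruction_ratios` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def reconstruction_ratios(originals: Sequence[str],
                          reconstructed: Sequence[str]) -> List[float]:
    """Character-level similarity (0.0 to 1.0) for each original/reconstruction pair."""
    if len(originals) != len(reconstructed):
        raise ValueError("both lists must have the same length")
    return [SequenceMatcher(None, src, out).ratio()
            for src, out in zip(originals, reconstructed)]


# Sentences copied from the examples above.
originals = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
outputs = ['My name is SONAR.', 'I can embed the sentences into vector space.']

for src, ratio in zip(originals, reconstruction_ratios(originals, outputs)):
    print(f"{ratio:.3f}  {src}")
```

A ratio of 1.0 means the sentence was reproduced exactly; lower values flag pairs worth inspecting by hand.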
### Translate text with SONAR
```python
from sonar.inference_pipelines.text import TextToTextModelPipeline
t2t_model = TextToTextModelPipeline(encoder="text_sonar_basic_encoder",
decoder="text_sonar_basic_decoder",
tokenizer="text_sonar_basic_encoder") # tokenizer is attached to both encoder and decoder cards
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2t_model.predict(sentences, source_lang="eng_Latn", target_lang="fra_Latn")
# ['Mon nom est SONAR.', "Je peux intégrer les phrases dans l'espace vectoriel."]
```
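The `source_lang` and `target_lang` values used in these examples (`eng_Latn`, `fra_Latn`) combine a three-letter language code with a four-letter script code. A small shape check can catch typos such as `en` or `eng_latn` before a model is loaded. The regular expression and the helper name `looks_like_lang_code` are assumptions made for illustration; passing the check does not mean the language is actually supported.

```python
import re

# Three lowercase letters, an underscore, then a capitalised four-letter script tag.
_LANG_CODE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_lang_code(code: str) -> bool:
    """True if `code` has the `xxx_Xxxx` shape used in the examples above."""
    return _LANG_CODE.fullmatch(code) is not None


for code in ("eng_Latn", "fra_Latn", "en", "eng_latn"):
    print(code, looks_like_lang_code(code))
```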
### Compute speech sentence embeddings with SONAR
```python
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
s2vec_model = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")
s2vec_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
"./tests/integration_tests/data/audio_files/audio_2.wav"]).shape
# torch.Size([2, 1024])
import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"
s2vec_model.predict([inp]).shape
# torch.Size([1, 1024])
```
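The example above asserts a 16 kHz sample rate after loading the audio with `torchaudio`. When many files are involved, the rate can also be read from the WAV header without decoding the audio. The sketch below does this with the standard library's `wave` module, which only handles uncompressed PCM WAV files; the helper names and the generated silent test file are illustrative and not part of SONAR.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16000  # rate asserted in the example above


def wav_sample_rate(path: str) -> int:
    """Read the sample rate from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def write_silence(path: str, sample_rate: int, n_frames: int = 160) -> None:
    """Write a short mono 16-bit silent WAV file (used here only as demo input)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    write_silence(demo_path, EXPECTED_SAMPLE_RATE)
    rate = wav_sample_rate(demo_path)
    print(f"{rate} Hz, matches expected rate: {rate == EXPECTED_SAMPLE_RATE}")
```

Files that report a different rate would need resampling before being passed to the speech pipeline.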
### Speech-to-text translation with SONAR
```python
from sonar.inference_pipelines.speech import SpeechToTextModelPipeline
s2t_model = SpeechToTextModelPipeline(encoder="sonar_speech_encoder_eng",
decoder="text_sonar_basic_decoder",
tokenizer="text_sonar_basic_decoder")
import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"
# passing loaded audio files
s2t_model.predict([inp], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.']
# passing multiple wav files
s2t_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
"./tests/integration_tests/data/audio_files/audio_2.wav"], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.',
# 'These couples may choose to make an adoption plan for their baby.']
```
### Predicting sentence similarity with BLASER 2.0 models
BLASER 2.0 is a family of models for automatic evaluation of machine translation quality based on SONAR embeddings.
They predict [cross-lingual semantic similarity](https://github.com/facebookresearch/fairseq/tree/nllb/examples/nllb/human_XSTS_eval)
between the translation and the source (optionally, also using a reference translation).
```python
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
from sonar.models.blaser.loader import load_blaser_model
blaser_ref = load_blaser_model("blaser_2_0_ref").eval()
blaser_qe = load_blaser_model("blaser_2_0_qe").eval()
text_embedder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")
src_embs = text_embedder.predict(["Le chat s'assit sur le tapis."], source_lang="fra_Latn")
ref_embs = text_embedder.predict(["The cat sat on the mat."], source_lang="eng_Latn")
mt_embs = text_embedder.predict(["The cat sat down on the carpet."], source_lang="eng_Latn")
print(blaser_ref(src=src_embs, ref=ref_embs, mt=mt_embs).item()) # 4.688
print(blaser_qe(src=src_embs, mt=mt_embs).item()) # 4.708
```
Detailed model cards with more examples: [facebook/blaser-2.0-ref](https://huggingface.co/facebook/blaser-2.0-ref),
[facebook/blaser-2.0-qe](https://huggingface.co/facebook/blaser-2.0-qe).
### Classifying the toxicity of sentences with MuTox
[MuTox](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/toxicity/mutox) is the first highly multilingual audio-based binary classifier and dataset with toxicity labels. The dataset consists of 20k audio utterances for English and Spanish, and 4k for the other 19 languages; the classifier uses the multimodal and multilingual encoders from SONAR. The output of the MuTox classifier is a logit for the evaluated input being _"toxic"_, according to the definition adopted in the corresponding dataset.
```python
from sonar.models.mutox.loader import load_mutox_model
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
import torch

# Use half precision on GPU, full precision on CPU.
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    dtype = torch.float16
else:
    device = torch.device("cpu")
    dtype = torch.float32

t2vec_model = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
    device=device,
)
text_column='lang_txt'
classifier = load_mutox_model(
    "sonar_mutox",
    device=device,
    dtype=dtype,
).eval()

with torch.inference_mode():
    emb = t2vec_model.predict(["De peur que le pays ne se prostitue et ne se remplisse de crimes."], source_lang='fra_Latn')
    x = classifier(emb.to(device).to(dtype)) # tensor([[-19.7812]], device='cuda:0', dtype=torch.float16)

with torch.inference_mode():
    emb = t2vec_model.predict(["She worked hard and made a significant contribution to the team."], source_lang='eng_Latn')
    x = classifier(emb.to(device).to(dtype)) # tensor([[-53.5938]], device='cuda:0', dtype=torch.float16)

with torch.inference_mode():
    emb = t2vec_model.predict(["El no tiene ni el más mínimo talento, todo lo que ha logrado ha sido gracias a sobornos y manipulaciones."], source_lang='spa_Latn')
    x = classifier(emb.to(device).to(dtype)) # tensor([[-21.4062]], device='cuda:0', dtype=torch.float16)
```
For a CLI way of running the MuTox pipeline, go to [Seamless Communication/.../MuTox](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/toxicity/mutox).
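Because the classifier returns a logit rather than a probability, a sigmoid can be applied when a value between 0 and 1 is easier to work with. The sketch below does this in plain Python for the three logits shown in the example above. Reading the logit through a sigmoid is the usual convention for a binary classifier and is assumed here; the `0.5` decision threshold and the helper names are placeholders, not values recommended by the MuTox authors.

```python
import math


def sigmoid(logit: float) -> float:
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)  # safe: logit < 0, so this cannot overflow
    return exp_logit / (1.0 + exp_logit)


def is_toxic(logit: float, threshold: float = 0.5) -> bool:
    """Threshold the sigmoid output; the default threshold is a placeholder."""
    return sigmoid(logit) >= threshold


# Logits copied from the comments in the example above.
for logit in (-19.7812, -53.5938, -21.4062):
    print(f"{logit:9.4f} -> p={sigmoid(logit):.3e}, toxic={is_toxic(logit)}")
```

With real model outputs, `torch.sigmoid(x)` performs the same conversion directly on the tensor.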
### Demo notebooks
See the more complete demo notebooks:
* [sonar text2text similarity and translation](examples/sonar_text_demo.ipynb)
* [sonar speech2text and other data pipeline examples](examples/inference_pipelines.ipynb)
* [sonar bilingual document alignment with sonar text similarity](examples/bilingual_document.ipynb)
## Supported languages and download links
The SONAR text encoder and decoder support 200 languages. SONAR speech encoders support 37 languages.
<details>
<summary>Available text encoders/decoders</summary>
| model | link |
| ----------------- | ---------------------------------------------------------------------------------- |
| encoder | [download](https://dl.fbaipublicfiles.com/SONAR/sonar_text_encoder.pt) |
| decoder | [download](https://dl.fbaipublicfiles.com/SONAR/sonar_text_encoder.pt) |
| finetuned decoder | [download](https://dl.fbaipublicfiles.com/SONAR/finetuned_decoder.pt) |
| tokenizer | [download](https://dl.fbaipublicfiles.com/SONAR/sentencepiece.source.256000.model) |
All 200 languages from the [No Language Left Behind project](https://arxiv.org/abs/2207.04672) are supported.
</details>
<details>
<summary>Available speech encoders</summary>
| lang_code | language | link |
| --------- | ---------------- | ------------------------------------------------------------------ |
| arb | modern standard arabic | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.arb.pt) |
| asm | assamese | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.asm.pt) |
| bel | belarusian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.bel.pt) |
| ben | bengali | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ben.pt) |
| bos | bosnian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.bos.pt) |
| bul | bulgarian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.bul.pt) |
| cat | catalan | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.cat.pt) |
| ces | czech | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ces.pt) |
| cmn | mandarin chinese | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.cmn.pt) |
| cym | welsh | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.cym.pt) |
| dan | danish | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.dan.pt) |
| deu | german | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.deu.pt) |
| est | estonian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.est.pt) |
| fin | finnish | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.fin.pt) |
| fra | french | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.fra.pt) |
| guj | gujarati | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.guj.pt) |
| heb | hebrew | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.heb.pt) |
| hin | hindi | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.hin.pt) |
| hrv | croatian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.hrv.pt) |
| ind | indonesian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.ind.pt) |
| ita | italian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.ita.pt) |
| jpn | japanese | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.jpn.pt) |
| kan | kannada | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.jan.pt) |
| kor | korean | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.kor.pt) |
| lao | lao | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.lao.pt) |
| lit | lithuanian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.lit.pt) |
| lvs | standard latvian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.lvs.pt) |
| mal | malayalam | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mal.pt) |
| mar | marathi | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mar.pt) |
| mkd | macedonian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mkd.pt) |
| mlt | maltese | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.mlt.pt) |
| npi | nepali | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.npi.pt) |
| nld | dutch | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.nld.pt) |
| ory | odia | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ory.pt) |
| pan | punjabi | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.pan.pt) |
| pes | western persian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.pes.pt) |
| pol | polish | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.po.pt) |
| por | portuguese | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.por.pt) |
| ron | romanian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.ron.pt) |
| rus | russian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.rus.pt) |
| slk | slovak | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.slk.pt) |
| slv | slovenian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.slv.pt) |
| snd | sindhi | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.snd.pt) |
| srp | serbian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.srp.pt) |
| spa | spanish | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.spa.pt) |
| swe | swedish | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.swe.pt) |
| swh | swahili | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.swh.pt) |
| tam | tamil | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.tam.pt) |
| tel | telugu | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.tel.pt) |
| tgl | tagalog | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.tgl.pt) |
| tha | thai | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.tha.pt) |
| tur | turkish | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.tur.pt) |
| ukr | ukrainian | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.ukr.pt) |
| urd | urdu | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.urd.pt) |
| uzn | northern uzbek | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v3ap.uzn.pt) |
| vie | vietnamese | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.vie.pt) |
| yue | yue | [download](https://dl.fbaipublicfiles.com/SONAR/spenc.v5ap.yue.pt) |
</details>
## Citation Information
Please cite the paper when referencing the SONAR embedding space, encoders and decoders as:
```
@misc{Duquenne:2023:sonar_arxiv,
author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},
title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},
publisher = {arXiv},
year = {2023},
url = {https://arxiv.org/abs/2308.11466},
}
```
## Contributing
See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.
## License
SONAR code is released under the MIT license (see [CODE_LICENSE](CODE_LICENSE.md)).
Some SONAR models are released under the same MIT license, BUT BEWARE:
others are released under a non-commercial license (see [NC_MODEL_LICENSE](NC_MODEL_LICENSE.md)).
Please refer to [LICENSE](LICENSE.md) for the details.