# mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
[![arXiv](https://img.shields.io/badge/arXiv-2305.13684-b31b1b.svg)](https://arxiv.org/abs/2305.13684)
mplm-sim is a language similarity tool providing:
- `Loader`: Accessing high-quality language similarity results directly.
- `Executor`: Obtaining similarity results from scratch.
## Quickstart
Clone the repository, or install the package from PyPI:
`pip install mplm_sim`
or install it directly from GitHub with pip:
`pip install --upgrade git+https://github.com/cisnlp/mPLM-Sim.git#egg=mplm_sim`
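To confirm that either install command worked, one option is to check whether Python can locate the package. This is a minimal sketch that uses only the standard library; it assumes the importable module is named `mplm_sim`, as in the import statements in this README.

```python
import importlib.util

# find_spec returns None when the top-level package cannot be located
# in the current environment; it does not import the package itself.
spec = importlib.util.find_spec("mplm_sim")
installed = spec is not None

print("mplm_sim is installed" if installed else "mplm_sim is not installed")
```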
## Loader
```python
from mplm_sim import Loader
# Load precomputed results for a given model_name and corpus_name
loader = Loader.from_pretrained(model_name='cis-lmu/glot500-base', corpus_name='flores200')
# Or load results from your own similarity file
# loader = Loader.from_tsv('your_similarity_file.tsv')
# Get the similarity for a language pair, identified either by
# ISO 639-3 code plus script (iso3_script)
sim = loader.get_sim('eng_Latn', 'cmn_Hani')
# or by language name
sim = loader.get_sim('English', 'Chinese')
```
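A common use of pairwise scores is to shortlist source languages for cross-lingual transfer to a given target language. The sketch below is one way to do that, not part of the package: `rank_sources` and `dummy_get_sim` are hypothetical helpers, the scores in `_DUMMY_SCORES` are placeholders rather than real mPLM-Sim output, and it assumes that the similarity function returns a number where larger means more similar.

```python
from typing import Callable, Iterable, List, Tuple


def rank_sources(
    get_sim: Callable[[str, str], float],
    target: str,
    candidates: Iterable[str],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k candidates most similar to target, best first."""
    # Score every candidate against the target, skipping the target itself.
    scored = [(lang, get_sim(lang, target)) for lang in candidates if lang != target]
    # Assumption: a larger score means the two languages are more similar.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder scores so the sketch runs offline. These are NOT real results.
_DUMMY_SCORES = {
    ("deu_Latn", "nld_Latn"): 0.9,
    ("eng_Latn", "nld_Latn"): 0.8,
    ("cmn_Hani", "nld_Latn"): 0.3,
}


def dummy_get_sim(source: str, target: str) -> float:
    """Hypothetical stand-in for a similarity lookup such as loader.get_sim."""
    return _DUMMY_SCORES[(source, target)]


ranked = rank_sources(
    dummy_get_sim,
    target="nld_Latn",
    candidates=["eng_Latn", "deu_Latn", "cmn_Hani", "nld_Latn"],
    top_k=2,
)
print(ranked)
```

With the package installed, `loader.get_sim` could be passed in place of `dummy_get_sim`, provided it returns a comparable numeric score for each pair.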
## Executor
```python
from mplm_sim import Executor
# model_name: any text or speech language model supported by Hugging Face
# corpus_name: name to use for this corpus when saving results
# corpus_path: path to the multi-parallel corpora; see corpora_demo for the file format
# corpus_type: 'text' or 'speech'
executor = Executor(model_name='cis-lmu/glot500-base', corpus_name='own',
                    corpus_path='corpora/own', corpus_type='text')
# Compute the similarity results
executor.run()
```
## Citation
```
@article{DBLP:journals/corr/abs-2305-13684,
author = {Peiqin Lin and
Chengzhi Hu and
Zheyu Zhang and
Andr{\'{e}} F. T. Martins and
Hinrich Sch{\"{u}}tze},
title = {mPLM-Sim: Unveiling Better Cross-Lingual Similarity and Transfer in
Multilingual Pretrained Language Models},
journal = {CoRR},
volume = {abs/2305.13684},
year = {2023},
url = {https://doi.org/10.48550/arXiv.2305.13684},
doi = {10.48550/ARXIV.2305.13684},
eprinttype = {arXiv},
eprint = {2305.13684},
timestamp = {Mon, 05 Jun 2023 15:42:15 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2305-13684.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/cisnlp/mplm_sim",
"name": "mplm-sim",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "mplm_sim",
"author": "Peiqin Lin, Chengzhi Hu",
"author_email": "lpq29743@gmail.com, Chengzhi.Hu@campus.lmu.de",
"download_url": "https://files.pythonhosted.org/packages/43/d9/ddc3fca946d7beec4e7a554c71a703192c7f7bc616f3e921c4ac748c0bff/mplm_sim-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache Software License 2.0",
"summary": "mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/cisnlp/mplm_sim"
},
"split_keywords": [
"mplm_sim"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "48aeb658e7624a475148d0cb1dba809fee8ca7fe5e881c2ee4a8b893ba0a1b20",
"md5": "4b4f330aadd664345d5067af4bf6b217",
"sha256": "8c222f2cc84892a04afb8f53132f0afb4b57006117396e71ffbe589bac404f31"
},
"downloads": -1,
"filename": "mplm_sim-0.1.0-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "4b4f330aadd664345d5067af4bf6b217",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.6",
"size": 9839,
"upload_time": "2024-01-19T13:26:58",
"upload_time_iso_8601": "2024-01-19T13:26:58.053933Z",
"url": "https://files.pythonhosted.org/packages/48/ae/b658e7624a475148d0cb1dba809fee8ca7fe5e881c2ee4a8b893ba0a1b20/mplm_sim-0.1.0-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "43d9ddc3fca946d7beec4e7a554c71a703192c7f7bc616f3e921c4ac748c0bff",
"md5": "291db56c68e5494e4d4a9f471ecd3ace",
"sha256": "2dd40f40c27ace8f745d0ab1d59b1e28eca8fda2cb67667c2b5284c7449ea776"
},
"downloads": -1,
"filename": "mplm_sim-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "291db56c68e5494e4d4a9f471ecd3ace",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 9906,
"upload_time": "2024-01-19T13:26:59",
"upload_time_iso_8601": "2024-01-19T13:26:59.954584Z",
"url": "https://files.pythonhosted.org/packages/43/d9/ddc3fca946d7beec4e7a554c71a703192c7f7bc616f3e921c4ac748c0bff/mplm_sim-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-19 13:26:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "cisnlp",
"github_project": "mplm_sim",
"github_not_found": true,
"lcname": "mplm-sim"
}