simalign
========

- Name: simalign
- Version: 0.4
- Summary: Word Alignments using Pretrained Language Models
- Homepage: https://github.com/cisnlp/simalign
- Author: Masoud Jalili Sabet, Philipp Dufter
- Requires Python: >=3.6.0
- Keywords: NLP, deep learning, transformer, pytorch, BERT, word alignment
- Uploaded: 2023-11-07 21:44:36
SimAlign: Similarity Based Word Aligner
==============

<p align="center">
    <br>
    <img alt="Alignment Example" src="https://raw.githubusercontent.com/cisnlp/simalign/master/assets/example.png" width="300"/>
    <br>
</p>

SimAlign is a high-quality word alignment tool that uses static and contextualized embeddings and **does not require parallel training data**.

The following table shows how it compares to popular statistical alignment models:

|            | ENG-CES | ENG-DEU | ENG-FAS | ENG-FRA | ENG-HIN | ENG-RON |
| ---------- | ------- | ------- | ------- | ------- | ------- | ------- |
| fast-align | .78     | .71     | .46     | .84     | .38     | .68     |
| eflomal    | .85     | .77     | .63     | .93     | .52     | .72     |
| mBERT-Argmax | .87     | .81     | .67     | .94     | .55     | .65     |

Shown are F1 scores, taking the maximum across the subword and word levels. For more details, see the [paper](https://arxiv.org/pdf/2004.08728.pdf).


Installation and Usage
--------

Tested with Python 3.7, Transformers 3.1.0, and Torch 1.5.0. NetworkX 2.4 is optional (required only for the Match algorithm).
For the full list of dependencies, see `setup.py`.
For installing Transformers, see [their repo](https://github.com/huggingface/transformers#installation).

Download the repo for use, or install it from PyPI:

`pip install simalign`

or install directly with pip from GitHub:

`pip install --upgrade git+https://github.com/cisnlp/simalign.git#egg=simalign`


An example of using our code:
```python
from simalign import SentenceAligner

# Make an instance of our model.
# You can specify the embedding model and all alignment settings in the constructor.
myaligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

# The source and target sentences should be tokenized to words.
src_sentence = ["This", "is", "a", "test", "."]
trg_sentence = ["Das", "ist", "ein", "Test", "."]

# The output is a dictionary keyed by matching method.
# Each method maps to a list of pairs giving the indices of aligned words (zero-indexed).
alignments = myaligner.get_word_aligns(src_sentence, trg_sentence)

for matching_method in alignments:
    print(matching_method, ":", alignments[matching_method])

# Expected output:
# mwmf (Match): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
# inter (ArgMax): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
# itermax (IterMax): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
```
For more examples of how to use our code, see `scripts/align_example.py`.
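The index pairs above can be serialized in the common Pharaoh format (space-separated `src-trg` tokens), which is the same notation the evaluation section below uses for sure edges. A minimal sketch in plain Python (the helper name `to_pharaoh` is ours, not part of the simalign API):

```python
def to_pharaoh(pairs):
    """Serialize zero-indexed (src, trg) pairs as space-separated 'i-j' tokens."""
    return " ".join(f"{s}-{t}" for s, t in sorted(pairs))

print(to_pharaoh([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]))
# 0-0 1-1 2-2 3-3 4-4
```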

Demo
--------

An online demo is available [here](https://simalign.cis.lmu.de/).


Gold Standards
--------
Links to the gold standards used in the paper:


| Language Pair | Citation | Type | Link |
| ------------- | ------------- | ------------- | ------------- |
| ENG-CES | Mareček et al. 2008 | Gold Alignment | http://ufal.mff.cuni.cz/czech-english-manual-word-alignment |
| ENG-DEU | EuroParl-based | Gold Alignment | www-i6.informatik.rwth-aachen.de/goldAlignment/ |
| ENG-FAS | Tavakoli et al. 2014 | Gold Alignment | http://eceold.ut.ac.ir/en/node/940 |
| ENG-FRA | WPT2003, Och et al. 2000 | Gold Alignment | http://web.eecs.umich.edu/~mihalcea/wpt/ |
| ENG-HIN | WPT2005 | Gold Alignment | http://web.eecs.umich.edu/~mihalcea/wpt05/ |
| ENG-RON | WPT2005, Mihalcea et al. 2003 | Gold Alignment | http://web.eecs.umich.edu/~mihalcea/wpt05/ |
        
        
Evaluation Script
--------
To evaluate the output alignments, use `scripts/calc_align_score.py`.

The gold alignment file should use the same format as SimAlign's outputs.
In the gold standard, sure edges have a '-' between the source and target indices, and possible edges have a 'p' between the indices.
For sample parallel sentences and their gold alignments for ENG-DEU, see `samples`.
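To illustrate the standard metrics behind such a script: with predicted alignment set A, sure set S, and possible set P (where S ⊆ P), precision = |A∩P|/|A|, recall = |A∩S|/|S|, and the Alignment Error Rate is AER = 1 − (|A∩S| + |A∩P|)/(|A| + |S|) (Och & Ney). A minimal, self-contained sketch (not the actual `calc_align_score.py`; function names are ours) that parses the '-'/'p' edge notation and computes these:

```python
def parse_gold(line):
    """Parse gold edges like '0-0 1-1 2p3' into (sure, possible) sets.

    Sure edges use '-', possible edges use 'p'; the possible set
    includes the sure set by convention."""
    sure, possible = set(), set()
    for tok in line.split():
        if "-" in tok:
            i, j = map(int, tok.split("-"))
            sure.add((i, j))
        else:
            i, j = map(int, tok.split("p"))
            possible.add((i, j))
    return sure, possible | sure

def align_scores(pred, sure, possible):
    """Precision, recall, and Alignment Error Rate for one sentence pair."""
    a_p = len(pred & possible)
    a_s = len(pred & sure)
    prec = a_p / len(pred)
    rec = a_s / len(sure)
    aer = 1 - (a_s + a_p) / (len(pred) + len(sure))
    return prec, rec, aer

sure, possible = parse_gold("0-0 1-1 2p3")
pred = {(0, 0), (1, 1), (2, 3)}
prec, rec, aer = align_scores(pred, sure, possible)
```

In this example the prediction matches both sure edges and the one possible edge, so precision and recall are 1.0 and AER is 0.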


Publication
--------

If you use the code, please cite:

```
@inproceedings{jalili-sabet-etal-2020-simalign,
    title = "{S}im{A}lign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings",
    author = {Jalili Sabet, Masoud  and
      Dufter, Philipp  and
      Yvon, Fran{\c{c}}ois  and
      Sch{\"u}tze, Hinrich},
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.147",
    pages = "1627--1643",
}
```

Feedback
--------

Feedback and contributions are more than welcome! Just reach out to @masoudjs or @pdufter.


FAQ
--------

##### Do I need parallel data to train the system?

No, no parallel training data is required.

##### Which languages can be aligned?

This depends on the underlying pretrained multilingual language model used. For example, if mBERT is used, it covers 104 languages as listed [here](https://github.com/google-research/bert/blob/master/multilingual.md).

##### Do I need GPUs for running this?

Each alignment simply requires a single forward pass in the pretrained language model. While this is certainly 
faster on GPU, it runs fine on CPU. On one GPU (GeForce GTX 1080 Ti) it takes around 15-20 seconds to align 500 parallel sentences.



License
-------

Copyright (C) 2020, Masoud Jalili Sabet, Philipp Dufter

A full copy of the license can be found in LICENSE.

            
