crosslingual-coreference

Name: crosslingual-coreference
Version: 0.3.1
Home page: https://github.com/pandora-intelligence/crosslingual-coreference
Summary: A multi-lingual approach to AllenNLP CoReference Resolution, along with a wrapper for spaCy.
Upload time: 2023-06-19 13:32:34
Author: David Berenstein
Requires Python: >=3.8,<3.12
License: MIT
Keywords: allennlp, spacy, nlp
# Crosslingual Coreference
Coreference resolution is powerful, but the data required to train a model is scarce. In our case, the available training data for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore builds on the assumption that a model trained on English data with cross-lingual embeddings should also work for languages with similar sentence structures.

[![Current Release Version](https://img.shields.io/github/release/pandora-intelligence/crosslingual-coreference.svg?style=flat-square&logo=github)](https://github.com/pandora-intelligence/crosslingual-coreference/releases)
[![pypi Version](https://img.shields.io/pypi/v/crosslingual-coreference.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/crosslingual-coreference/)
[![PyPi downloads](https://static.pepy.tech/personalized-badge/crosslingual-coreference?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads)](https://pypi.org/project/crosslingual-coreference/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)

# Install

```bash
pip install crosslingual-coreference
```
# Quickstart
```python
from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
print(predictor.pipe([text])[0]["resolved_text"])
# Note you can also get 'cluster_heads' and 'clusters'
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
```
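Conceptually, the resolved text is produced by replacing each later mention in a cluster with the cluster's head mention. A minimal plain-Python sketch of that idea (illustrative only; `resolve` and the toy tokens/clusters are made up for this example and are not part of the package):

```python
def resolve(tokens, clusters):
    """Replace every later mention in a cluster with the cluster's first mention.
    Spans are inclusive [start, end] token offsets; the first span in each
    cluster is treated as the head mention."""
    replacements = {}
    for cluster in clusters:
        head_start, head_end = cluster[0]
        head = tokens[head_start:head_end + 1]
        for start, end in cluster[1:]:
            replacements[start] = (end, head)

    out, i = [], 0
    while i < len(tokens):
        if i in replacements:
            end, head = replacements[i]
            out.extend(head)  # substitute the head mention for this span
            i = end + 1
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

tokens = "Momofuku Ando made noodles and he sold them".split()
clusters = [[[0, 1], [5, 5]], [[3, 3], [7, 7]]]
print(resolve(tokens, clusters))
# Momofuku Ando made noodles and Momofuku Ando sold noodles
```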
![](https://raw.githubusercontent.com/Pandora-Intelligence/crosslingual-coreference/master/img/example_en.png)

## Models
As of now, there are four models available: "spanbert", "info_xlm", "xlm_roberta" and "minilm", which score 83, 77, 74 and 74 on the OntoNotes Release 5.0 English data, respectively.
- The "minilm" model offers the best quality/speed trade-off for both multi-lingual and English texts.
- The "info_xlm" model produces the best quality for multi-lingual texts.
- The AllenNLP "spanbert" model produces the best quality for English texts.

## Chunking/batching to resolve memory OOM errors

```python
from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)
```
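The `chunk_size` and `chunk_overlap` parameters split long documents into overlapping token chunks so that each forward pass fits in memory. A rough sketch of how such overlapping splitting works, in plain Python (illustrative only; `chunk_with_overlap` is a hypothetical helper, not the library's actual implementation):

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into chunks of at most chunk_size tokens,
    where consecutive chunks share `overlap` tokens."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap  # advance by this many tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already reaches the end
    return chunks

print(chunk_with_overlap(list(range(10)), chunk_size=4, overlap=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap lets mentions near a chunk boundary be seen with some context from the previous chunk.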

## Use spaCy pipeline
```python
import spacy

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
print(doc._.cluster_heads)
# Output
#
# {Momofuku Ando: [5, 6],
# instant noodles: [11, 12],
# Osaka: [14, 14],
# Nissin: [21, 21],
# Many students: [26, 27]}
```
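The cluster indices above are inclusive `[start, end]` token offsets into the `Doc`. A small plain-Python sketch (using a hand-made token list rather than spaCy's tokenizer, so the offsets differ from the output above) shows how such a span maps back to text:

```python
def span_text(tokens, span):
    """Join the tokens covered by an inclusive [start, end] span."""
    start, end = span
    return " ".join(tokens[start:end + 1])  # +1 because the end index is inclusive

tokens = ["Do", "not", "forget", "about", "Momofuku", "Ando", "!"]
print(span_text(tokens, [4, 5]))
# Momofuku Ando
```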
### Visualize spaCy pipeline
This only works with spaCy >= 3.3.
```python
import spacy
from spacy.tokens import Span
from spacy import displacy

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

nlp = spacy.load("nl_core_news_sm")
nlp.add_pipe("xx_coref", config={"model_name": "minilm"})
doc = nlp(text)
spans = []
for idx, cluster in enumerate(doc._.coref_clusters):
    for start, end in cluster:
        # cluster spans use inclusive token offsets; Span expects an exclusive end
        spans.append(Span(doc, start, end + 1, label=str(idx)))

doc.spans["custom"] = spans

displacy.render(doc, style="span", options={"spans_key": "custom"})
```

## More Examples
![](https://raw.githubusercontent.com/Pandora-Intelligence/crosslingual-coreference/master/img/example_total.png)

            
