zensols.spanmatch

- Name: zensols.spanmatch
- Version: 0.0.1
- Home page: https://github.com/plandes/spanmatch
- Summary: An API to match spans of semantically similar text across documents.
- Upload time: 2023-06-30 17:06:21
- Author: Paul Landes
- Keywords: tooling

# Unsupervised Position-Based Semantic Matching

[![PyPI][pypi-badge]][pypi-link]
[![Python 3.9][python39-badge]][python39-link]
[![Python 3.10][python310-badge]][python310-link]
[![Build Status][build-badge]][build-link]

An API to match spans of semantically similar text across documents.  Each
match is a span of text in a source document and another span of text in a
target document that are both tied together.

<!-- markdown-toc start - Don't edit this section. Run M-x markdown-toc-refresh-toc -->
## Table of Contents

- [Introduction](#introduction)
- [Documentation](#documentation)
- [Usage](#usage)
- [Obtaining](#obtaining)
- [Citation](#citation)
- [Changelog](#changelog)
- [License](#license)

<!-- markdown-toc end -->


## Introduction

Spans are formed by a weighted combination of the semantic similarity of each
document's text and the token position.  Hyperparameters control which takes
precedence (semantic similarity, or token position for longer contiguous token
spans).

This is done using position embeddings on a third axis (see Figure 1).  The
figure shows the word embeddings (blue) moving from cluster 1 to cluster 2, the
cluster spans of the discharge summaries (orange) and the note antecedent
(green), and arrows connecting the tokens to word points.

![Figure 1](./doc/pos-emb.png)

*Figure 1*

For more information, see "Hybrid Semantic Positional Token Clustering" in our
paper [Hospital Discharge Summarization Data Provenance].
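
The idea of adding token position as an extra axis can be illustrated with a
small standalone sketch.  This is not the library's implementation: the function
`add_position_axis`, the toy two-dimensional "embeddings" and the normalization
of positions to the range 0 to 1 are all assumptions made for illustration.  It
only shows how scaling the position axis changes the distance between tokens
that are semantically identical but far apart in the text.

```python
import math
from typing import List, Sequence


def add_position_axis(embeddings: Sequence[Sequence[float]],
                      position_scale: float) -> List[List[float]]:
    """Append each token's normalized position (0..1), multiplied by
    ``position_scale``, as one extra axis on its embedding (hypothetical)."""
    denom = max(len(embeddings) - 1, 1)
    return [list(vec) + [position_scale * (i / denom)]
            for i, vec in enumerate(embeddings)]


# toy 2-D "word embeddings" for five tokens; tokens 0 and 4 have the same
# meaning but sit at opposite ends of the text
tokens = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.0, 0.0]]

for scale in (0.0, 1.0, 2.5):
    points = add_position_axis(tokens, scale)
    adjacent = math.dist(points[0], points[1])  # neighboring tokens
    distant = math.dist(points[0], points[4])   # same meaning, far apart
    print(f"scale={scale}: adjacent={adjacent:.3f}, distant={distant:.3f}")
```

With a scale of zero only meaning matters, so the two identical tokens coincide.
As the scale grows, the position axis pushes them apart, and a clustering
method run on these points would tend to group contiguous tokens instead.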


## Documentation

See the [full documentation](https://plandes.github.io/spanmatch/index.html).
The [API reference](https://plandes.github.io/spanmatch/api.html) is also
available.


## Usage

```python
from zensols.cli import CliHarness
from zensols.nlp import FeatureDocument, FeatureDocumentParser
from zensols.spanmatch import Match, MatchResult, Matcher, ApplicationFactory

SOURCE = """\
Johannes Gutenberg (1398 – 1468) was a German goldsmith and publisher who
introduced printing to Europe. His introduction of mechanical movable type
printing to Europe started the Printing Revolution and is widely regarded as the
most important event of the modern period. It played a key role in the
scientific revolution and laid the basis for the modern knowledge-based economy
and the spread of learning to the masses.

Gutenberg many contributions to printing are: the invention of a process for
mass-producing movable type, the use of oil-based ink for printing books,
adjustable molds, and the use of a wooden printing press. His truly epochal
invention was the combination of these elements into a practical system that
allowed the mass production of printed books and was economically viable for
printers and readers alike.
"""

SUMMARY = """\
The German Johannes Gutenberg introduced printing in Europe. His invention had a
decisive contribution in spread of mass-learning and in building the basis of
the modern society.
"""

harness: CliHarness = ApplicationFactory.create_harness()
doc_parser: FeatureDocumentParser = harness['spanmatch_doc_parser']
matcher: Matcher = harness['spanmatch_matcher']
source: FeatureDocument = doc_parser(SOURCE)
summary: FeatureDocument = doc_parser(SUMMARY)
# shorten source doc span length by scaling up positional importance
matcher.hyp.source_position_scale = 2.5
# elongate summary doc span length by scaling up positional importance
matcher.hyp.target_position_scale = 0.9
res: MatchResult = matcher(source, summary)
match: Match
for match in res.matches[:5]:
    match.write(include_flow=False)
```

Output:

```abnf
2023-06-11 08:22:38,392 24 matches found
source (0, 55):
    Johannes Gutenberg (1398 – 1468) was a German goldsmith
target (4, 29):
    German Johannes Gutenberg
source (524, 631):
    type, the use of oil-based ink for printing books, adjustable molds, and the use
    of a wooden printing press
target (4, 59):
    German Johannes Gutenberg introduced printing in Europe
source (301, 421):
    scientific revolution and laid the basis for the modern knowledge-based economy
    and the spread of learning to the masses
target (106, 177):
    spread of mass-learning and in building the basis of the modern society
source (516, 585):
    movable type, the use of oil-based ink for printing books, adjustable
target (116, 169):
    mass-learning and in building the basis of the modern
source (168, 199):
    started the Printing Revolution
target (106, 145):
    spread of mass-learning and in building
```
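
The numbers in parentheses appear to be character offsets into the parsed text,
with an exclusive end.  The sketch below checks that reading against the first
few matches using plain string slicing.  It uses only the opening line of each
document from the example above, and `span_text` is a hypothetical helper, not
part of the `zensols.spanmatch` API.

```python
from typing import Tuple

# opening lines of SOURCE and SUMMARY from the usage example
source_head = ("Johannes Gutenberg (1398 – 1468) was a German goldsmith "
               "and publisher who")
summary_head = "The German Johannes Gutenberg introduced printing in Europe."


def span_text(text: str, span: Tuple[int, int]) -> str:
    """Return the text covered by a (begin, end) character span, treating
    the end offset as exclusive (an assumption based on the output)."""
    begin, end = span
    return text[begin:end]


print(span_text(source_head, (0, 55)))
print(span_text(summary_head, (4, 29)))
print(span_text(summary_head, (4, 59)))
```

Each slice reproduces the text printed under the corresponding `source` or
`target` line in the output.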


## Obtaining

The easiest way to install the command line program is via the `pip` installer:
```bash
pip3 install --use-deprecated=legacy-resolver zensols.spanmatch
```

Binaries are also available on [pypi].
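
After installing, one way to confirm that the package is importable without
loading its models is to look up its module spec.  This is a generic Python
sketch using only the standard library; `is_installed` is a hypothetical helper
and not something the package provides.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if ``module_name`` can be located for import."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # raised when a parent package (here ``zensols``) is missing
        return False


if __name__ == "__main__":
    print("zensols.spanmatch installed:", is_installed("zensols.spanmatch"))
```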


## Citation

If you use this project in your research, please use the following BibTeX entry:

```bibtex
@inproceedings{landes-etal-2023-dsprov,
    title = "{{Hospital Discharge Summarization Data Provenance}}",
    author = "Landes, Paul  and
      Chaise, Aaron J.  and
      Patel, Kunal P. and
      Huang, Sean S.  and
      Di Eugenio, Barbara",
    booktitle = "Proceedings of the 21st {{Workshop}} on {{Biomedical Language Processing}}",
    month = jul,
    year = "2023",
    day = 9,
    address = "Toronto, Canada",
    publisher = "{{Association for Computational Linguistics}}"
}
```


## Changelog

An extensive changelog is available [here](CHANGELOG.md).


## License

[MIT License](LICENSE.md)

Copyright (c) 2023 Paul Landes


<!-- links -->
[pypi]: https://pypi.org/project/zensols.spanmatch/
[pypi-link]: https://pypi.python.org/pypi/zensols.spanmatch
[pypi-badge]: https://img.shields.io/pypi/v/zensols.spanmatch.svg
[python39-badge]: https://img.shields.io/badge/python-3.9-blue.svg
[python39-link]: https://www.python.org/downloads/release/python-390
[python310-badge]: https://img.shields.io/badge/python-3.10-blue.svg
[python310-link]: https://www.python.org/downloads/release/python-310
[build-badge]: https://github.com/plandes/spanmatch/workflows/CI/badge.svg
[build-link]: https://github.com/plandes/spanmatch/actions

[Hospital Discharge Summarization Data Provenance]: https://example.com

            
