vulgata-spacy

Name	vulgata-spacy JSON
Version	0.0.1 JSON
	download
home_page
Summary	A library for finding Vulgate references in Medieval Latin texts.
upload_time	2022-12-28 15:08:28
maintainer
docs_url	None
author	WJB Mattingly
requires_python
license
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            ![vulgata spacy logo](images/logo.png)

# Install

```python
pip install vulgata-spacy
```

# About

Vulgata spaCy is a library built upon [spaCy](www.spacy.io) to automate the identification and extraction of potential Biblical quotes in medieval Latin texts. The main pipeline leverages [Bloom embeddings](https://explosion.ai/blog/bloom-embeddings) which were trained on the entire Patrologia Latina (PL) after substantial cleaning. The PL text was left unlemmatized. Unfortunately, the data licensing prevents its distribution. These bloom embeddings were loaded into a spaCy pipeline as floret embeddings. This allows for the pipeline to be quite small, while still being able to capture deep semantic and syntactic meaning.

The pipeline contains several different components. First, the pipeline contains an EntityRuler whose patterns can identify direct quotation or partial quotation of Scripture with or without punctuation marks.

A second component is a quote detection machine learning model that is designed to identify quotes within context. Quotation marks and other indicators were removed before training so that the model would learn the features and context of Latin quotations.

Both of these components flag potential or likely Scriptural quotes. But being able to detect a Scriptural reference is, however, only part of the challenge. The second step in this process is linking that quote to a specific verse of the Vulgate. This problem is not as straightforward as it might appear and is compounded in several ways. First, Latin texts do not have standard spelling (even across modern editions); second, modern Latin texts do not agree on punctuation; third, many version of the Bible circulated in the Middle Ages that did not conform to the standard Vulgate, meaning words could appear out of order or synonyms used. These variant readings are often collectively known as Vetus Latina material. Therefore, matching "in initio fect Deus terras et caelum" to Genesis 1:1 ("in principio creavit Deus terram et caelum") is not as simple as fuzzy-string matching. In order to correctly link data, semantic and syntactic meaning must be retained.

To overcome this, Vulgata SpaCy contains the entire Vulgate as raw text, cleaned text, and partial text. These were embedded with the pipeline's vectors. An annoy index was then created. The pipeline allows for users to create their own index and query it with a new text.


# Usage

```python
#import vulgata-spacy
from vulgata_spacy import vulgata_spacy

nlp = vulgata_spacy.VulgataSpaCy()
doc = nlp.create_doc("Deus enim spiritus est. Denique: Nemo scit, inquit, quae sunt in Deo, nisi spiritus qui in ipso est (I Cor. II, 11). Legimus quidem: Quis cognovit sensum Domini, aut quis consiliarius ejus fuit (Rom. XI, 34)?")

doc = nlp.annoy_matcher(style="sent", max_distance=.50)

for ent in doc.ents:
    print(ent.text)
    if ent._.scripture != True:
        for scripture in ent._.scripture:
            print(scripture)
        print()
```
## Expected Output

```
Deus enim spiritus est.
{'score': 0.25021496415138245, 'book': '3Rg', 'chapter': 18, 'verse': 27, 'latin': 'Cumque esset jam meridies, illudebat illis Elias, dicens: Clamate voce majore: deus enim est, et forsitan loquitur, aut in diversorio est, aut in itinere, aut certe dormit, ut excitetur.', 'matcher': 'annoy'}

Denique: Nemo scit, inquit, quae sunt in Deo, nisi spiritus qui in ipso est (I Cor.
{'score': 0.38035616278648376, 'book': '1Cor', 'chapter': 2, 'verse': 11, 'latin': 'Quis enim hominum scit quae sunt hominis, nisi spiritus hominis, qui in ipso est? ita et quae Dei sunt, nemo cognovit, nisi Spiritus Dei.', 'matcher': 'annoy'}

Legimus quidem: Quis cognovit sensum Domini, aut quis consiliarius ejus fuit (Rom.
{'score': 0.20919208228588104, 'book': 'Rom', 'chapter': 11, 'verse': 34, 'latin': 'Quis enim cognovit sensum Domini? aut quis consiliarius ejus fuit?', 'matcher': 'annoy'}
```

# Visualize the Document in Streamlit
```python
import streamlit as st
from vulgata_spacy import vulgata_spacy

nlp = vulgata_spacy.VulgataSpaCy()
doc = nlp.create_doc("Deus enim spiritus est. Denique: Nemo scit, inquit, quae sunt in Deo, nisi spiritus qui in ipso est (I Cor. II, 11). Legimus quidem: Quis cognovit sensum Domini, aut quis consiliarius ejus fuit (Rom. XI, 34)?")
doc = nlp.annoy_matcher(style="sent", max_distance=.50)
nlp.visualize_doc()
```

Once created, run the following command in your CLI

```
streamlit run [file]
```

## Expected Output
![streamlit demo](images/streamlit-demo.png)

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "vulgata-spacy",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "WJB Mattingly",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/6c/1d/584370f0a2679719b836355aae67b0da02152b3a3a0fc4e2e64c786afcfd/vulgata-spacy-0.0.1.tar.gz",
    "platform": null,
    "description": "![vulgata spacy logo](images/logo.png)\n\n# Install\n\n```python\npip install vulgata-spacy\n```\n\n# About\n\nVulgata spaCy is a library built upon [spaCy](www.spacy.io) to automate the identification and extraction of potential Biblical quotes in medieval Latin texts. The main pipeline leverages [Bloom embeddings](https://explosion.ai/blog/bloom-embeddings) which were trained on the entire Patrologia Latina (PL) after substantial cleaning. The PL text was left unlemmatized. Unfortunately, the data licensing prevents its distribution. These bloom embeddings were loaded into a spaCy pipeline as floret embeddings. This allows for the pipeline to be quite small, while still being able to capture deep semantic and syntactic meaning.\n\nThe pipeline contains several different components. First, the pipeline contains an EntityRuler whose patterns can identify direct quotation or partial quotation of Scripture with or without punctuation marks.\n\nA second component is a quote detection machine learning model that is designed to identify quotes within context. Quotation marks and other indicators were removed before training so that the model would learn the features and context of Latin quotations.\n\nBoth of these components flag potential or likely Scriptural quotes. But being able to detect a Scriptural reference is, however, only part of the challenge. The second step in this process is linking that quote to a specific verse of the Vulgate. This problem is not as straightforward as it might appear and is compounded in several ways. First, Latin texts do not have standard spelling (even across modern editions); second, modern Latin texts do not agree on punctuation; third, many version of the Bible circulated in the Middle Ages that did not conform to the standard Vulgate, meaning words could appear out of order or synonyms used. These variant readings are often collectively known as Vetus Latina material. Therefore, matching \"in initio fect Deus terras et caelum\" to Genesis 1:1 (\"in principio creavit Deus terram et caelum\") is not as simple as fuzzy-string matching. In order to correctly link data, semantic and syntactic meaning must be retained.\n\nTo overcome this, Vulgata SpaCy contains the entire Vulgate as raw text, cleaned text, and partial text. These were embedded with the pipeline's vectors. An annoy index was then created. The pipeline allows for users to create their own index and query it with a new text.\n\n\n# Usage\n\n```python\n#import vulgata-spacy\nfrom vulgata_spacy import vulgata_spacy\n\nnlp = vulgata_spacy.VulgataSpaCy()\ndoc = nlp.create_doc(\"Deus enim spiritus est. Denique: Nemo scit, inquit, quae sunt in Deo, nisi spiritus qui in ipso est (I Cor. II, 11). Legimus quidem: Quis cognovit sensum Domini, aut quis consiliarius ejus fuit (Rom. XI, 34)?\")\n\ndoc = nlp.annoy_matcher(style=\"sent\", max_distance=.50)\n\nfor ent in doc.ents:\n    print(ent.text)\n    if ent._.scripture != True:\n        for scripture in ent._.scripture:\n            print(scripture)\n        print()\n```\n## Expected Output\n\n```\nDeus enim spiritus est.\n{'score': 0.25021496415138245, 'book': '3Rg', 'chapter': 18, 'verse': 27, 'latin': 'Cumque esset jam meridies, illudebat illis Elias, dicens: Clamate voce majore: deus enim est, et forsitan loquitur, aut in diversorio est, aut in itinere, aut certe dormit, ut excitetur.', 'matcher': 'annoy'}\n\nDenique: Nemo scit, inquit, quae sunt in Deo, nisi spiritus qui in ipso est (I Cor.\n{'score': 0.38035616278648376, 'book': '1Cor', 'chapter': 2, 'verse': 11, 'latin': 'Quis enim hominum scit quae sunt hominis, nisi spiritus hominis, qui in ipso est? ita et quae Dei sunt, nemo cognovit, nisi Spiritus Dei.', 'matcher': 'annoy'}\n\nLegimus quidem: Quis cognovit sensum Domini, aut quis consiliarius ejus fuit (Rom.\n{'score': 0.20919208228588104, 'book': 'Rom', 'chapter': 11, 'verse': 34, 'latin': 'Quis enim cognovit sensum Domini? aut quis consiliarius ejus fuit?', 'matcher': 'annoy'}\n```\n\n# Visualize the Document in Streamlit\n```python\nimport streamlit as st\nfrom vulgata_spacy import vulgata_spacy\n\nnlp = vulgata_spacy.VulgataSpaCy()\ndoc = nlp.create_doc(\"Deus enim spiritus est. Denique: Nemo scit, inquit, quae sunt in Deo, nisi spiritus qui in ipso est (I Cor. II, 11). Legimus quidem: Quis cognovit sensum Domini, aut quis consiliarius ejus fuit (Rom. XI, 34)?\")\ndoc = nlp.annoy_matcher(style=\"sent\", max_distance=.50)\nnlp.visualize_doc()\n```\n\nOnce created, run the following command in your CLI\n\n```\nstreamlit run [file]\n```\n\n## Expected Output\n![streamlit demo](images/streamlit-demo.png)\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "A library for finding Vulgate references in Medieval Latin texts.",
    "version": "0.0.1",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "22289bdd87079e311b756c461a9b8c9d",
                "sha256": "5fd76ecc959362aed0fad47d1900ef4956c0072dcb416fab29ebd5c5cc4df3ec"
            },
            "downloads": -1,
            "filename": "vulgata_spacy-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "22289bdd87079e311b756c461a9b8c9d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 62386802,
            "upload_time": "2022-12-28T15:08:17",
            "upload_time_iso_8601": "2022-12-28T15:08:17.398919Z",
            "url": "https://files.pythonhosted.org/packages/8a/1c/c590f24a7366500e7950110ecd1a31d87d4e90db8bf9dfa9f710e27a5652/vulgata_spacy-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "7641705298bf94ba1f8872d63931252f",
                "sha256": "3ee8ba3a406e012406f0d78c5ab0796fd54b025fe35f3ed8e3aa498453ee9fc3"
            },
            "downloads": -1,
            "filename": "vulgata-spacy-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "7641705298bf94ba1f8872d63931252f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 62388275,
            "upload_time": "2022-12-28T15:08:28",
            "upload_time_iso_8601": "2022-12-28T15:08:28.342858Z",
            "url": "https://files.pythonhosted.org/packages/6c/1d/584370f0a2679719b836355aae67b0da02152b3a3a0fc4e2e64c786afcfd/vulgata-spacy-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-28 15:08:28",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "vulgata-spacy"
}

WJB Mattingly