# TEI parser
This is a parser written in Python 3 that takes TEI-XML documents as input and writes them to a [Neo4j Graph Database](https://neo4j.com).
It makes use of the following existing libraries:
- [Beautiful Soup 4](https://beautiful-soup-4.readthedocs.io/en/latest/), an easy-to-use XML parser.
- [spaCy](https://spacy.io). Currently we use the German language package `de_core_news_sm` to parse the text.
- [Py2neo v4](https://py2neo.org/v4/), which is a library for working with the Neo4j database.
## Installation
```bash
$ virtualenv venv
$ source venv/bin/activate
$ pip install -e TEIParse
$ python -m spacy download de_core_news_sm
$ pip install ../semper-backend # for the GraphUtils class
```
## Synopsis
```python
from py2neo import Graph
from tei2neo import parse
from semper_backend.utils import GraphUtils

# connect to the Neo4j database
graph = Graph(host="localhost", user="neo4j", password="password")

# path to the TEI-XML document to import
# (assumed here to be the file that is queried further below)
file = '20_MS_221_1.xml'

# parse the document, starting at the <TEI> root element
doc, status, soup = parse(
    filename=file,
    start_with_tag='TEI',
    idno='20-MS-221'
)

# write the parsed document to Neo4j in a single transaction
tx = graph.begin()
doc.save(tx)
tx.commit()

ut = GraphUtils(graph)
paras = ut.paragraphs_for_filename('20_MS_221_1.xml')

# create unhyphenated tokens
for para in paras:
    tokens = ut.tokens_in_paragraph(para)
    ut.create_unhyphenated(tokens)

# show hyphenated text, marking line breaks with ' | '
for token in ut.tokens_in_paragraph(paras[5], concatenated=0):
    if 'lb' in token.labels:
        print(' | ', end='')
    print(token.get('string', ''), end='')
    print(token.get('whitespace', ''), end='')

# show concatenated (unhyphenated) version of the text
for token in ut.tokens_in_paragraph(paras[5], concatenated=1):
    if 'lb' in token.labels:
        print(' ', end='')
    print(token.get('string', ''), end='')
    print(token.get('whitespace', ''), end='')
```
# How the parser works
A TEI document can be structured in various ways, and many elements behave very similarly. This parser therefore expects certain elements and treats each of them in a specific manner.
## Elements that affect all following elements
### handShift
A `handShift` element **affects all elements that follow it**, until another `handShift` element is encountered.
**Example**
From here on, everything is written in «Latein» script with a pencil (medium=Blei):
```
<handShift new="#hWH" medium="Blei" script="Latein"/>
```
Now we switch to «Kurrent» script and use black ink (STinte):
```
<handShift new="#hGS" medium="STinte" script="Kurrent"/>
```
**Appearance in Neo4j**
As we have seen, a `handShift` element contains three attributes:
- new="#hWH"
- medium="Blei"
- script="Latein"
These attributes are passed on to all Token elements that follow the `handShift`. Attributes set earlier are not deleted: if only the medium changes from «Blei» to «STinte», all other attributes keep their previous values.
The `handShift` element will _not_ appear as a node in Neo4j.
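
The propagation rule described above can be illustrated with a minimal sketch. It is not the tei2neo implementation: it uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup, and it assumes hypothetical `<w>` elements as stand-ins for the tokens that the parser derives from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <w> stands in for a token; the second handShift
# changes only the medium.
SAMPLE = """<p>
  <handShift new="#hWH" medium="Blei" script="Latein"/>
  <w>Roma</w>
  <handShift medium="STinte"/>
  <w>aeterna</w>
</p>"""


def propagate_hand(root):
    """Return (text, attributes) pairs for every <w> in document order."""
    current = {}   # attributes inherited from the handShift elements seen so far
    tokens = []
    for elem in root.iter():   # iter() walks the tree in document order
        if elem.tag == "handShift":
            # overwrite only the attributes given; the others stay as they were
            current.update(elem.attrib)
        elif elem.tag == "w":
            tokens.append((elem.text, dict(current)))
    return tokens


tokens = propagate_hand(ET.fromstring(SAMPLE))
for text, attrs in tokens:
    print(text, attrs)
```

The second token keeps `new` and `script` from the first `handShift` and only receives the new `medium`. No entry is recorded for the `handShift` elements themselves, mirroring the fact that they do not become nodes.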
### delSpan
A `delSpan` element works much like a `handShift` element, as it alters the appearance of all the following text until it reaches its `spanTo` target:
```
<delSpan spanTo="#A20_MS_215_12_3"/>
... (a lot of XML code here)
<anchor xml:id="A20_MS_215_12_3"/>
```
**Appearance in Neo4j**
- Both the `delSpan` and the `anchor` appear as additional nodes.
- All elements between the `delSpan` and the `anchor` element receive an additional `delSpan` label.
- A `delSpan` attribute is added to each of these elements; its value is equal to the `xml:id` attribute of the anchor.
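
The three rules above can be sketched as follows. This is an illustration under assumptions, not the code tei2neo runs: it uses `xml.etree.ElementTree` from the standard library, hypothetical `<w>` elements in place of real tokens, and plain dictionaries in place of Neo4j nodes.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared xml: prefix to this namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample; the ids are taken from the example above
SAMPLE = """<p>
  <w>kept</w>
  <delSpan spanTo="#A20_MS_215_12_3"/>
  <w>struck</w>
  <w>out</w>
  <anchor xml:id="A20_MS_215_12_3"/>
  <w>again</w>
</p>"""


def mark_deleted(root):
    """Return one dict per child element with its labels and properties."""
    open_target = None   # xml:id of the anchor that closes the current delSpan
    result = []
    for elem in root.iter():
        if elem is root:
            continue
        labels, props = [elem.tag], {}
        if elem.tag == "delSpan":
            open_target = elem.get("spanTo").lstrip("#")
        elif elem.tag == "anchor" and elem.get(XML_ID) == open_target:
            open_target = None   # the span ends at its anchor
        elif open_target is not None:
            labels.append("delSpan")          # additional label
            props["delSpan"] = open_target    # value equals the anchor's xml:id
        result.append({"tag": elem.tag, "text": elem.text,
                       "labels": labels, "props": props})
    return result


marked = mark_deleted(ET.fromstring(SAMPLE))
for row in marked:
    print(row)
```

Both the `delSpan` and the `anchor` stay in the result as entries of their own, while only the elements between them carry the extra label and the `delSpan` property.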
Raw data
{
"_id": null,
"home_page": "https://sissource.ethz.ch/sis/semper-tei",
"name": "tei2neo",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.5",
"maintainer_email": null,
"keywords": null,
"author": "Swen Vermeul \u2022 ID SIS \u2022 ETH Z\u00fcrich",
"author_email": "swen@ethz.ch",
"download_url": "https://files.pythonhosted.org/packages/21/30/31b1bb14d035fc816230273ee498a652a8b052fe2203a1c77165a8532bf5/tei2neo-0.6.1.tar.gz",
"platform": null,
"description": "# TEI parser\n\nThis is a parser written in Python 3 that takes TEI-XML Documents as an inpput and writes them in a [Neo4j Graph Database](https://neo4j.com).\n\nIt makes use of the following existing libraries:\n\n- [Beautiful Soup 4](https://beautiful-soup-4.readthedocs.io/en/latest/) An easy-to-use XML parser\n- [Spacy](https://spacy.io). Currently we use the german language package `de_core_news_sm` to parse the text.\n- [Py2neo v4](https://py2neo.org/v4/) whih is a library to work with the Neo4j database.\n\n## Installation\n\n```bash\n$ virtualenv venv\n$ source venv/bin/activate\n$ pip install -e TEIParse\n$ python -m spacy download de_core_news_sm\n$ pip install ../semper-backend # for the GraphUtils class\n```\n\n## Synopsis\n\n```\nfrom tei2neo import parse\nfrom semper_backend.utils import GraphUtils\ngraph = Graph(host=\"localhost\", user=\"neo4j\", password=\"password\")\ndoc, status, soup = parse(\n\tfilename=file,\n\tstart_with_tag='TEI',\n\tidno='20-MS-221'\n)\ntx = graph.begin()\ndoc.save(tx)\ntx.commit()\n\nut = GraphUtils(graph)\nparas = ut.paragraphs_for_filename('20_MS_221_1.xml')\n\n# create unhyphened tokens\nfor para in paras:\n tokens = ut.tokens_in_paragraph(para)\n ut.create_unhyphenated(tokens)\n\n# show hyphened text\nfor token in ut.tokens_in_paragraph(paras[5], concatenated=0):\n if 'lb' in token.labels:\n print(' | ', end='')\n print(token.get('string',''), end='')\n print(token.get('whitespace', ''), end='')\n\n# show concatenated (non-hyphened) version of the text\nfor token in ut.tokens_in_paragraph(paras[5], concatenated=1):\n if 'lb' in token.labels:\n print(' ', end='')\n\n print(token.get('string',''), end='')\n print(token.get('whitespace', ''), end='')\n```\n\n# How the parser works\n\nA TEI document can be constructed in various ways and there are many elements that work very similarly. 
Likewise, this parser expects certain elements and treats them in a specific manner.\n\n## Elements that affect all following elements\n\n### handShift\n\nA `handShift` element **affects all elements that are below**, until another `handShift` element is encountered.\n\n**Example**\n\nFrom now on everything is written in \u00abLatein\u00bb and a pencil is being used (medium=Blei):\n\n```\n<handShift new=\"#hWH\" medium=\"Blei\" script=\"Latein\"/>\n```\n\nNow we switch to \u00abKurrent\u00bb script and use black ink (STinte):\n\n```\n<handShift new=\"#hGS\" medium=\"STinte\" script=\"Kurrent\"/>\n```\n\n**Appearance in Neo4j**\n\nAs we have seen, a `handShift` element contains three attributes:\n\n- new=\"#hWH\"\n- medium=\"Blei\"\n- script=\"Latein\"\n\nThese attributes are passed to all Token elements that follow after a `handShift` occurs. Previous attributes are not deleted, i.e. if only the medium changes from \u00abBlei\u00bb to \u00abSTinte\u00bb, all other attributes stay the same.\nThe `handShift` element will _not_ appear as a node in Neo4j.\n\n### delSpan\n\nA `delSpan` element works much like a `handShift` element, as it alters the appearance of all the following text until it reaches its `spanTo` target:\n\n```\n<delSpan spanTo=\"#A20_MS_215_12_3\"/>\n... (a lot of XML code here)\n<anchor xml:id=\"A20_MS_215_12_3\"/>\n```\n\n**Appearance in Neo4j**\n\n- both the `delSpan` and the `anchor` appear as additional nodes.\n- all elements between the `delSpan` and the `anchor` element receive an additional `delSpan` label\n- a `delSpan` attribute is added to every element, the value is equal to the `xml:id` attribute of the anchor.\n",
"bugtrack_url": null,
"license": "BSD",
"summary": "TEI (Text Encoding Initiative) parser to extract information and store it in Neo4j database",
"version": "0.6.1",
"project_urls": {
"Homepage": "https://sissource.ethz.ch/sis/semper-tei"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "262e978e5a9d4e163407be1fdc0972242dfb36c130a87701bc4dc39f7e63a04b",
"md5": "3b57f3ec33409bbc917d10e962d4c469",
"sha256": "acb71dcb6f47c66ccb631607ec493801dad917a33cdf89657738d2ae719c3c9c"
},
"downloads": -1,
"filename": "tei2neo-0.6.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3b57f3ec33409bbc917d10e962d4c469",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5",
"size": 23890,
"upload_time": "2024-10-09T09:30:18",
"upload_time_iso_8601": "2024-10-09T09:30:18.982329Z",
"url": "https://files.pythonhosted.org/packages/26/2e/978e5a9d4e163407be1fdc0972242dfb36c130a87701bc4dc39f7e63a04b/tei2neo-0.6.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "213031b1bb14d035fc816230273ee498a652a8b052fe2203a1c77165a8532bf5",
"md5": "da0f9d4eb7e2584a3cffa2d1edcb2f9d",
"sha256": "9e98c049772e70c6c63cfd0ac4d3800a756708cf96390be520da82d981641530"
},
"downloads": -1,
"filename": "tei2neo-0.6.1.tar.gz",
"has_sig": false,
"md5_digest": "da0f9d4eb7e2584a3cffa2d1edcb2f9d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5",
"size": 26511,
"upload_time": "2024-10-09T09:30:20",
"upload_time_iso_8601": "2024-10-09T09:30:20.177371Z",
"url": "https://files.pythonhosted.org/packages/21/30/31b1bb14d035fc816230273ee498a652a8b052fe2203a1c77165a8532bf5/tei2neo-0.6.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-09 09:30:20",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "tei2neo"
}