txt2tei

Name	txt2tei
Version	1.0.7
home_page	https://github.com/fsanzl/txt2tei
Summary	An aid to encoding plays as XML-TEI
upload_time	2024-05-14 15:54:25
maintainer	None
docs_url	None
author	Fernando Sanz-Lázaro
requires_python	>=3.5
license	LGPL

[![License: GPL](https://img.shields.io/github/license/fsanzl/txt2tei)](https://opensource.org/licenses/GPL-3.0)
<!--- [![Version: 1.0.7](https://img.shields.io/github/v/release/fsanzl/txt2tei?include_prereleases)](https://pypi.org/project/txt2tei/)
# [![Python versions: 3.5, 3.6, 3.7, 3.8, 3.9](https://img.shields.io/pypi/pyversions/txt2tei)](https://pypi.org/project/txt2tei/) -->

<h2 align="center">TXT2TEI</h2>
<h3 align="center">An aid to encoding plays as XML-TEI</h3>


*txt2tei* is a Python script that encodes Spanish Siglo de Oro plays as XML-TEI files. It takes a minimally annotated tabular TXT file that resembles the printed layout and uses a reduced set of simple tags. The script speeds up TEI encoding by automating the annotation, so that the source TXT files need only an (almost) visual edit.

These scripts are part of the research project [Sound and Meaning in Spanish Golden Age Literature](https://soundandmeaning.univie.ac.at/).

## Requirements

The programme requires the following libraries:

* BeautifulSoup 4
* pandas
* datetime
* unidecode
* lxml >= 4.9.2
* chardet

txt2tei relies on lxml, whereas tei2txt relies on BeautifulSoup 4. The two may be unified in the future.


## Installation

Download the Python scripts and the files sexos.csv and authors.xml into the same directory. You can also install the programme as a pip package, in which case you do not need to save the data files in your working directory.

```bash
pip install txt2tei
```

## Use

Edit the personalised section of txt2tei.py to suit your needs and run the following command:

```bash
./txt2tei.py tabularfile.txt
```

If installed with pip, the syntax differs:
```bash
txt2tei tabularfile.txt
```

Additionally, the script tei2txt.py performs the inverse operation:
```bash
./tei2txt.py xmlteifile.xml
```

## Description

The tabular file must be encoded as UTF-8 text with LF line terminators (Unix style). The script exits with an error if it is given a text with CRLF terminators (DOS style) or any other encoding. This should not be a problem, as most text editors let you change the encoding and the line terminators; alternatively, simple one-line tools can convert from one format to another. In any case, the content must observe the following conventions:
```
<x>Comment
<a>Author's name (a single word, e.g. Calderón, Lope, Tirso...)
<t>Title
<g>Genre
<s>Subgenre
<o>Source*URL
<f>Date
<x> The tag el marks the dramatis personae of  <castList>
<el>DRAMATIS PERSONAE (optional)|CHARACTER 1, a character*CHARACTER 2, another character*CHARACTER 3, a third one*CHARACTER 4, and just one more
<j>Act
<sc>Scene
<i>Stage direction
<x>Comment
CHARACTER ONE
<x>A tabulator marks the speeches
        Verse 1,
        verse 2.
CHARACTER 2
        <i>Internal stage direction
        Verse 3
        verse 4 (beginning)
CHARACTER 1
                Verse 4 (middle)
<x>An additional tab marks the continuation of a shared verse
CHARACTER 3
                        Verse 4 (end)
    verse 5,
    verse 6.
  ...
MUSIC
<e>Echo
CHARACTER 4
        <i>Reads:
<p>Prose
MULTIPLE CHARACTERS #character1 #character2
<x> Instead of letting the programme guess the characters in a collective speech, they can be indicated here explicitly
```
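
If an editor is not at hand, the encoding requirement stated above (UTF-8, LF terminators) can be met with a few lines of Python. The following is only a minimal sketch that uses the standard library; the function names `normalise_text` and `normalise_file` are hypothetical and are not part of txt2tei.

```python
import codecs


def normalise_text(raw):
    """Return UTF-8 bytes without a BOM and with LF line terminators."""
    # Drop a UTF-8 byte-order mark if the file starts with one.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Decoding fails early (UnicodeDecodeError) if the input is not UTF-8.
    text = raw.decode("utf-8")
    # Convert CRLF first, then any remaining lone CR, to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


def normalise_file(source, target):
    """Read `source` as bytes and write the normalised copy to `target`."""
    with open(source, "rb") as handle:
        data = handle.read()
    with open(target, "wb") as handle:
        handle.write(normalise_text(data))
```

Calling `normalise_file("play_dos.txt", "play_unix.txt")` (both file names are placeholders) would leave the original untouched and write a copy that satisfies the requirement, provided the input was already valid UTF-8.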

To assign sexes to the characters, there is a CSV file in the following format:

```csv
NAME,MALE,True
```

The first field is the name, the second is the sex, and the third states whether the entry was manually checked. This can be done with the provided script makelist.py.
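
The three-field layout can be read with the standard `csv` module. This is a minimal sketch of one way to load such a file, not the way txt2tei itself does it; the function name `read_sexes` and the choice of a dictionary as the result are assumptions made for illustration.

```python
import csv
import io


def read_sexes(handle):
    """Map each character name to a (sex, manually_checked) pair."""
    table = {}
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        name, sex, checked = (field.strip() for field in row)
        # The third field is the literal text "True" or "False".
        table[name] = (sex, checked == "True")
    return table


# The sample line from the format description above.
sample = io.StringIO("NAME,MALE,True\n")
print(read_sexes(sample))  # {'NAME': ('MALE', True)}
```

With a real file, the same function would be called as `read_sexes(open("sexos.csv", encoding="utf-8", newline=""))`.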

## Known issues

The programme only recognises "Calderón", "Lope" and "Calderón (atri.)" as authors. Adding new authors is trivial: they just need to be added to the dictionary authors.

Lope's ids are placeholders. Proper numbers should be given.

Most important: the programme is aimed at Spanish 17th-century plays. Its language conventions (e.g., how the sex of collective characters is determined, or the fact that 'Y' in a shared speech is parsed as 'AND') and its assumptions about structure (versified plays) may need some tinkering before it can be applied to other kinds of plays.
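
To make the 'Y' convention concrete, here is a minimal sketch of how a collective speaker label could be split on the Spanish conjunction. It is an illustration only, not the code txt2tei uses; the function name `split_speakers` and the sample labels are hypothetical.

```python
import re


def split_speakers(label):
    """Split a collective speaker label on the Spanish conjunction 'Y'."""
    # 'Y' counts as a separator only when surrounded by whitespace,
    # so names that merely contain the letter are left intact.
    parts = re.split(r"\s+Y\s+", label.strip())
    return [part for part in parts if part]


print(split_speakers("DON JUAN Y DOÑA ANA"))  # ['DON JUAN', 'DOÑA ANA']
print(split_speakers("ROMEO AND JULIET"))     # ['ROMEO AND JULIET']
```

The second call shows why the convention is language-specific: a label that uses another conjunction is treated as a single character, so plays in other languages would need a different separator or the explicit # notation described in the conventions.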


## Contributions

Feel free to contribute using the [GitHub Issue Tracker](https://github.com/fsanzl/txt2tei/issues) for feedback, suggestions, or bug reports.

## Changelog


### 1.0.6-2

- Added chardet to dependencies
- Solved deprecated 'rU'

### 1.0.6

- Solved empty date crash. 
- Handling BOM and Hasefroch line terminators
- Changelog markdown syntax

### 1.0.5

- Solved pronouns-related crash

## How to cite *txt2tei*

Authors of scientific papers including results generated using *txt2tei* are encouraged to cite the following paper.

```bibtex
@article{SanzLazaroF_RHD2023, 
    author    = {Sanz-Lázaro, Fernando},
    title     = {Del fonema al verso: una caja de herramientas digitales de escansión teatral},
    volume    = {8},
    date  = {2023},
    journal   = {Revista de Humanidades Digitales},
    pages = {74-89},
    doi = {10.5944/rhd.vol.8.2023.37830}
}
```

## Copyright

Copyright (C) 2022  Fernando Sanz-Lázaro <<fsanzl@gmail.com>>

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program.  If not, see <<https://www.gnu.org/licenses/>>.
