auchann

Name	auchann
Version	0.2.0
home_page	https://github.com/UUDigitalHumanitieslab/auchann
Summary	The AuChAnn (Automatic CHAT Annotation) package can generate CHAT annotations based on transcript-correction pairs of utterances.
upload_time	2023-08-21 09:26:55
maintainer
docs_url	None
author	Digital Humanities Lab, Utrecht University
requires_python	>=3.7
license	BSD-3 Clause
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# AuChAnn

[![Actions Status](https://github.com/UUDigitalHumanitieslab/auchann/workflows/Unit%20tests/badge.svg)](https://github.com/UUDigitalHumanitieslab/auchann/actions)

[pypi auchann](https://pypi.org/project/auchann)

AuChAnn is a Python package that provides Automatic CHAT Annotation based on a transcript string and an interpretation (or 'corrected') string. For example, when given:
Transcript:      'Ik wilt nu eh na huis'
Correction:      'Ik wil nu naar huis'

AuChAnn produces:
CHAT-Annotation: 'ik wilt [: wil] nu &-eh na(ar) [* s:r:prep] huis'

CHAT is an annotation convention that was developed for the CHILDES corpus (MacWhinney, 2000) and is used by many linguists to annotate speech. For more information on CHAT, you can read the manual: https://talkbank.org/manuals/CHAT.html.

AuChAnn was specifically developed to enrich linguistic data that consists of a transcript and a linguist's interpretation, for use with SASTA (https://github.com/UUDigitalHumanitieslab/sasta).

## Getting Started

You can install AuChAnn using pip:

```bash
pip install auchann
```

Optionally, you can also install [Sastadev](https://github.com/UUDigitalHumanitieslab/sastadev),
which is used for detecting inflection errors, through the `NL` extra:

```bash
pip install auchann[NL]
```

Once installed, the program can be run interactively from the console using the command `auchann`.

## Import as Library

To use AuChAnn in your own Python applications, import the `align_words` function from the `auchann.align_words` module, as shown below. This function provides the main functionality of the package.

```python
from auchann.align_words import align_words

transcript = input("Transcript: ")
correction = input("Correction: ")
alignment = align_words(transcript, correction)
print(alignment)
```

### Settings

Various settings can be adjusted. Default values are used for every unchanged property.

```python
from auchann.align_words import align_words, AlignmentSettings
import editdistance

settings = AlignmentSettings()

# Return the edit distance between the original and correction
settings.calc_distance = lambda original, correction: editdistance.distance(original, correction)

# Return an override of the distance and the error type;
# if the error type is None, the returned distance is ignored.
# The default method detects inflection errors.
settings.detect_error = lambda original, correction: (1, "m") if original == "geloopt" and correction == "liep" else (0, None)

# Alternatively, Sastadev (installed with `pip install auchann[NL]`) contains
# a helper function for Dutch which detects inflection errors.
# Assigning it here replaces the lambda set above.
from sastadev.deregularise import detect_error
settings.detect_error = detect_error

# How many words could be split from one?
# e.g. das -> da(t) (i)s requires a lookahead of 2
# hoest -> hoe (i)s (he)t requires a lookahead of 3
settings.lookahead = 5

# Allow detection of replacements within a group,
# e.g. swapping articles; such a replacement is then
# marked with the specified key

# EXAMPLE:
# Transcript: de huis
# Correction: het huis
# de [: het] [* s:r:gc:art] huis
settings.replacements = {
    's:r:gc:art': ['de', 'het', 'een'],
    's:r:gc:pro': ['dit', 'dat', 'deze'],
    's:r:prep': ['aan', 'uit']
}

# Other lists to adjust
settings.fillers = ['eh', 'hm', 'uh']
settings.fragments = ['ba', 'to', 'mu']

# Example usage
transcript = input("Transcript: ")
correction = input("Correction: ")
alignment = align_words(transcript, correction, settings)
print(alignment)
```

## How it Works

The `align_words` function scans the transcript and correction and determines for each token whether a correction token is copied exactly from the transcript, replaces a token from the transcript, is inserted, or whether a transcript token has been omitted. Based on which of these operations has occurred, the function adds the appropriate CHAT annotation to the output string.
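The four operations can be illustrated with Python's standard `difflib` module. This is only a sketch of the idea, not AuChAnn's own alignment: `difflib.SequenceMatcher` is swapped in for the package's algorithm, and `classify_operations` is a hypothetical helper name that is not part of the package.

```python
from difflib import SequenceMatcher

# Map difflib's opcode tags onto the four operations named above.
OPERATIONS = {
    "equal": "COPY",
    "replace": "REPLACE",
    "delete": "REMOVE",
    "insert": "INSERT",
}


def classify_operations(transcript, correction):
    """Return (operation, transcript_tokens, correction_tokens) tuples.

    Hypothetical helper: tokens are split on whitespace and compared
    as whole words, so adjacent tokens may be grouped in one tuple.
    """
    source = transcript.lower().split()
    target = correction.lower().split()
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    return [
        (OPERATIONS[tag], source[i1:i2], target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


for operation in classify_operations("de huis is groot", "het huis is heel groot"):
    print(operation)
```

Unlike this sketch, which only compares whole tokens for equality, AuChAnn also looks at how similar two different words are, as described next.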

The algorithm uses edit distance to establish which words are replacements of each other, i.e. it links a transcript token to a correction token. Words with the lowest available edit distance are matched together, and based on this match the operations COPY and REPLACE are determined. If two candidates have the same edit distance to a token, word position is used to determine the match. The operations REMOVE and INSERT are established if no suitable match can be found for a transcript and correction token respectively.
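The matching step can be sketched as follows. This is a simplified illustration under stated assumptions rather than the package's actual code: `edit_distance` is a plain Levenshtein implementation, `best_match` is a hypothetical helper, and the tie-break on word position is modelled as the absolute difference between token indices. Deciding that no match is suitable (which leads to REMOVE or INSERT) is left out.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one dynamic-programming row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def best_match(token: str, position: int, candidates: Sequence[str]) -> int:
    """Return the index of the candidate with the lowest edit distance.

    When several candidates share that distance, the one whose index
    is closest to `position` wins (assumed tie-break rule).
    """
    return min(
        range(len(candidates)),
        key=lambda index: (
            edit_distance(token, candidates[index]),
            abs(index - position),
        ),
    )


correction = "ik wil nu naar huis".split()
# "wilt" is the second transcript token, so its position is 1
print(correction[best_match("wilt", 1, correction)])
```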

In addition to establishing these four operations, the function detects several other properties of the transcript and correction which can be expressed in CHAT. For example, it determines whether a word is a filler or fragment, whether a conjugation error has occurred, or if a pronoun, preposition, or article has been used incorrectly.
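As a rough illustration of these extra checks, the sketch below marks fillers and labels replacements that stay within one word group. The word lists are copied from the settings example earlier in this README. The function names are hypothetical and the logic is an assumption about how such a check could look, not the package's implementation.

```python
from typing import Dict, List, Optional

# Example lists, copied from the settings example in this README
REPLACEMENTS = {
    "s:r:gc:art": ["de", "het", "een"],
    "s:r:gc:pro": ["dit", "dat", "deze"],
    "s:r:prep": ["aan", "uit"],
}
FILLERS = ["eh", "hm", "uh"]


def replacement_key(original: str, correction: str,
                    replacements: Dict[str, List[str]]) -> Optional[str]:
    """Return the error key if both words belong to the same group, else None."""
    if original == correction:
        return None
    for key, words in replacements.items():
        if original in words and correction in words:
            return key
    return None


def annotate_replacement(original: str, correction: str,
                         replacements: Dict[str, List[str]] = REPLACEMENTS) -> str:
    """Render a replacement in CHAT notation, adding an error code if one applies."""
    annotation = f"{original} [: {correction}]"
    key = replacement_key(original, correction, replacements)
    return f"{annotation} [* {key}]" if key else annotation


def annotate_filler(token: str, fillers: List[str] = FILLERS) -> str:
    """Prefix known fillers with the CHAT filler marker `&-`."""
    return f"&-{token}" if token in fillers else token


print(annotate_replacement("de", "het"))
print(annotate_replacement("wilt", "wil"))
print(annotate_filler("eh"))
```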

## Development

To install the requirements:

```bash
pip install -r requirements.txt
```

To run the AuChAnn command-line function from the console:

```bash
python -m auchann
```

### Run Tests

```bash
pip install pytest
pytest
```

### Upload to PyPI

```bash
pip install pip-tools twine
python setup.py sdist
twine upload dist/*.tar.gz
```

## Acknowledgments

The research for this software was made possible by the CLARIAH-PLUS project financed by NWO (Grant 184.034.023).

## References

MacWhinney, B. (2000). The CHILDES Project: Tools for Analyzing Talk. 3rd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

