# Zensols Natural Language Parsing
[![PyPI][pypi-badge]][pypi-link]
[![Python 3.10][python310-badge]][python310-link]
[![Python 3.11][python311-badge]][python311-link]
[![Build Status][build-badge]][build-link]
From the paper [DeepZensols: A Deep Learning Natural Language Processing
Framework for Experimentation and Reproducibility]. This framework wraps the
[spaCy] framework and creates lightweight features in a class [hierarchy] that
reflects the structure of natural language. The motivation is to generate
features from the parsed text in an object-oriented fashion that is fast and
easy to pickle.
Other features include:
* [Parse and normalize] a stream of tokens with stop word removal, punctuation
  filters, up/down casing, Porter stemming and [others].
* [Detached features] that are safe and easy to pickle to disk.
* Configuration-driven parsing and token normalization using [configuration
  factories].
* Pretty-print functionality for easy natural language feature selection.
* A comprehensive [scoring module] including the following scoring methods:
* [Rouge]
* [Bleu]
* [SemEval-2013 Task 9.1]
* [Levenshtein distance]
* Exact match
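As an illustration of the simplest of these metrics, Levenshtein distance can
be computed with a short dynamic program. This is a generic sketch of the
metric itself, not the library's implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of single-character insertions,
    deletions and substitutions needed to turn ``a`` into ``b``."""
    # prev[j] holds the distance from the empty prefix of ``a`` to b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein('kitten', 'sitting'))  # -> 3
```

The scoring module wraps metrics like this behind a uniform API so that
several methods can be computed over the same sentence pairs at once.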
## Documentation
* [Framework documentation]
* [Natural Language Parsing]
* [List Token Normalizers and Mappers]
## Obtaining / Installing
The easiest way to install the command line program is via the `pip`
installer. Since the package needs at least one spaCy model, the second
command downloads the smallest one.
```bash
pip3 install --use-deprecated=legacy-resolver zensols.nlp
python -m spacy download en_core_web_sm
```
Binaries are also available on [pypi].
## Usage
A parser using the default configuration can be obtained by:
```python
from zensols.nlp import FeatureDocumentParser
parser: FeatureDocumentParser = FeatureDocumentParser.default_instance()
doc = parser('Obama was the 44th president of the United States.')
for tok in doc.tokens:
    print(tok.norm, tok.pos_, tok.tag_)
print(doc.entities)
>>>
Obama PROPN NNP
was AUX VBD
the DET DT
44th ADJ JJ
president NOUN NN
of ADP IN
the United States DET DT
. PUNCT .
(<Obama>, <44th>, <the United States>)
```
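The detached features produced by the parser are designed to serialize
cheaply, which is what makes parsed output easy to cache to disk. The round
trip can be sketched with the standard `pickle` module and a hypothetical
stand-in dataclass (not the library's own feature classes):

```python
import pickle
from dataclasses import dataclass


# Hypothetical stand-in for a detached token feature set; the library's
# classes similarly carry plain attributes such as norm, pos_ and tag_.
@dataclass
class TokenFeatures:
    norm: str
    pos: str
    tag: str


feats = [TokenFeatures('Obama', 'PROPN', 'NNP'),
         TokenFeatures('was', 'AUX', 'VBD')]
data = pickle.dumps(feats)       # serialize to bytes (e.g. for a disk cache)
restored = pickle.loads(data)    # round trip back to feature objects
assert restored == feats
```

Because the features are plain data detached from spaCy's parse objects, no
heavyweight parser state is dragged along in the serialized form.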
However, minimal effort is needed to configure the parser using a [resource library]:
```python
from io import StringIO
from zensols.config import ImportIniConfig, ImportConfigFactory
from zensols.nlp import FeatureDocument, FeatureDocumentParser
CONFIG = """
# import the `zensols.nlp` library
[import]
config_file = resource(zensols.nlp): resources/obj.conf
# override the parser to keep only the ent_ and tag_ features
[doc_parser]
token_feature_ids = set: ent_, tag_
"""
if __name__ == '__main__':
    fac = ImportConfigFactory(ImportIniConfig(StringIO(CONFIG)))
    doc_parser: FeatureDocumentParser = fac('doc_parser')
    sent = 'He was George Washington and first president of the United States.'
    doc: FeatureDocument = doc_parser(sent)
    for tok in doc.tokens:
        tok.write()
```
This example uses a [resource library] to source the configuration from this
package, so minimal configuration is necessary. More advanced configuration
[examples] are also available.
See the [feature documents] for more information.
## Scoring
Certain scores in the [scoring module] need additional Python packages. These
are installed with:
```bash
pip install -r src/python/requirements-score.txt
```
## Attribution
This project, or example code, uses:
* [spaCy] for natural language parsing
* [msgpack] and [smart-open] for Python disk serialization
* [nltk] for the [porter stemmer] functionality
## Citation
If you use this project in your research please use the following BibTeX entry:
```bibtex
@inproceedings{landes-etal-2023-deepzensols,
title = "{D}eep{Z}ensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility",
author = "Landes, Paul and
Di Eugenio, Barbara and
Caragea, Cornelia",
editor = "Tan, Liling and
Milajevs, Dmitrijs and
Chauhan, Geeticka and
Gwinnup, Jeremy and
Rippeth, Elijah",
booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Empirical Methods in Natural Language Processing",
url = "https://aclanthology.org/2023.nlposs-1.16",
pages = "141--146"
}
```
## Changelog
An extensive changelog is available [here](CHANGELOG.md).
## Community
Please star this repository and let me know how and where you use this API.
Contributions as pull requests, feedback, and any other input are welcome.
## License
[MIT License](LICENSE.md)
Copyright (c) 2020 - 2023 Paul Landes
<!-- links -->
[pypi]: https://pypi.org/project/zensols.nlp/
[pypi-link]: https://pypi.python.org/pypi/zensols.nlp
[pypi-badge]: https://img.shields.io/pypi/v/zensols.nlp.svg
[python310-badge]: https://img.shields.io/badge/python-3.10-blue.svg
[python310-link]: https://www.python.org/downloads/release/python-3100
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110
[build-badge]: https://github.com/plandes/nlparse/workflows/CI/badge.svg
[build-link]: https://github.com/plandes/nlparse/actions
[DeepZensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility]: https://aclanthology.org/2023.nlposs-1.16.pdf
[examples]: https://github.com/plandes/nlparse/tree/master/example/config
[hierarchy]: https://plandes.github.io/nlparse/api/zensols.nlp.html#zensols.nlp.container.FeatureDocument
[Parse and normalize]: https://plandes.github.io/nlparse/doc/parse.html
[others]: https://plandes.github.io/nlparse/doc/normalizers.html
[Detached features]: https://plandes.github.io/nlparse/doc/parse.html#detached-features
[full documentation]: https://plandes.github.io/nlparse/
[Framework documentation]: https://plandes.github.io/nlparse/api.html
[Natural Language Parsing]: https://plandes.github.io/nlparse/doc/parse.html
[List Token Normalizers and Mappers]: https://plandes.github.io/nlparse/doc/normalizers.html
[resource library]: https://plandes.github.io/util/doc/config.html#resource-libraries
[spaCy]: https://spacy.io
[nltk]: https://www.nltk.org
[smart-open]: https://pypi.org/project/smart-open/
[msgpack]: https://msgpack.org
[porter stemmer]: https://tartarus.org/martin/PorterStemmer/
[configuration factories]: https://plandes.github.io/util/doc/config.html#configuration-factory
[feature documents]: https://plandes.github.io/nlparse/doc/feature-doc.html
[scoring module]: https://plandes.github.io/nlparse/api/zensols.nlp.html#zensols-nlp-score
[Rouge]: https://aclanthology.org/W04-1013
[Bleu]: https://aclanthology.org/P02-1040
[SemEval-2013 Task 9.1]: https://web.archive.org/web/20150131105418/https://www.cs.york.ac.uk/semeval-2013/task9/data/uploads/semeval_2013-task-9_1-evaluation-metrics.pdf
[Levenshtein distance]: https://en.wikipedia.org/wiki/Levenshtein_distance