wordseg

Name	wordseg
Version	0.0.5
home_page
Summary	Word segmentation models
upload_time	2023-03-11 21:58:09
maintainer
docs_url	None
author
requires_python	>=3.8
license	MIT License
keywords	computational linguistics natural language processing nlp word segmentation linguistics corpora speech language
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# wordseg

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4077433.svg)](https://doi.org/10.5281/zenodo.4077433)
[![PyPI version](https://badge.fury.io/py/wordseg.svg)](https://pypi.org/project/wordseg)
[![Supported Python versions](https://img.shields.io/pypi/pyversions/wordseg.svg)](https://pypi.org/project/wordseg)
[![CircleCI](https://circleci.com/gh/jacksonllee/wordseg/tree/main.svg?style=shield)](https://circleci.com/gh/jacksonllee/wordseg/tree/main)

`wordseg` is a Python package of word segmentation models.

Table of contents:

* [Installation](https://github.com/jacksonllee/wordseg#installation)
* [Usage](https://github.com/jacksonllee/wordseg#usage)
* [License](https://github.com/jacksonllee/wordseg#license)
* [Changelog](https://github.com/jacksonllee/wordseg#changelog)
* [Contributing](https://github.com/jacksonllee/wordseg/blob/main/CONTRIBUTING.md)
* [Citation](https://github.com/jacksonllee/wordseg#citation)

## Installation

`wordseg` is available through pip:

```bash
pip install wordseg
```

To install `wordseg` from the GitHub source:

```bash
git clone https://github.com/jacksonllee/wordseg.git
cd wordseg
pip install -e ".[dev]"
```

## Usage

`wordseg` implements each word segmentation model as a Python class.
A model instance has the following methods
(in the style of the scikit-learn API for machine learning):

* `fit`: Train the model with segmented sentences.
* `predict`: Predict the segmented sentences from unsegmented sentences.
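
To make the `fit`/`predict` contract concrete, here is a minimal sketch of a class with the same shape. `CharacterSegmenter` is a hypothetical baseline written for illustration and is not part of `wordseg`; the `n_sents_seen_` attribute and having `fit` return `self` follow scikit-learn conventions and are assumptions, not documented behavior of this package.

```python
from typing import Iterable, Iterator, List, Sequence


class CharacterSegmenter:
    """Hypothetical baseline (not in wordseg): every character is its own word."""

    def fit(self, sents: Iterable[Sequence[str]]) -> "CharacterSegmenter":
        # This baseline learns nothing; it only counts the training sentences.
        # A real model would collect vocabulary or statistics here.
        self.n_sents_seen_ = sum(1 for _ in sents)
        return self

    def predict(self, sent_strs: Iterable[str]) -> Iterator[List[str]]:
        # Yield lazily: one segmented sentence per unsegmented input string.
        for sent_str in sent_strs:
            yield list(sent_str)


model = CharacterSegmenter()
model.fit([("this", "is", "a", "sentence")])
print(list(model.predict(["thisis"])))
# [['t', 'h', 'i', 's', 'i', 's']]
```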

The implemented model classes are as follows:

* `RandomSegmenter`:
  A word boundary is predicted independently at each potential boundary
  position with a given probability. No training is required.
* `LongestStringMatching`:
  This model builds predicted words by moving
  from left to right along an unsegmented sentence and
  taking the longest matching word at each position, constrained by a maximum word length parameter.
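
The two strategies described above can be sketched as standalone functions. This is an illustration of the general techniques under stated assumptions, not the package's actual implementation: the function names, the `prob` and `rng` parameters, and the fallback to a single character when nothing in the vocabulary matches are all choices made for this sketch.

```python
import random
from typing import List, Set


def longest_match_segment(
    sentence: str, vocabulary: Set[str], max_word_length: int
) -> List[str]:
    """Greedy left-to-right longest matching against a known vocabulary."""
    words = []
    start = 0
    while start < len(sentence):
        # Try the longest allowed candidate first, then shrink it.
        longest = min(max_word_length, len(sentence) - start)
        for length in range(longest, 0, -1):
            candidate = sentence[start:start + length]
            # Assumption: a single character is always accepted as a fallback.
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                start += length
                break
    return words


def random_segment(sentence: str, prob: float, rng: random.Random) -> List[str]:
    """Place a boundary between adjacent characters independently with probability `prob`."""
    if not sentence:
        return []
    words = [sentence[0]]
    for char in sentence[1:]:
        if rng.random() < prob:
            words.append(char)  # boundary: start a new word
        else:
            words[-1] += char  # no boundary: extend the current word
    return words


training = [
    ("this", "is", "a", "sentence"),
    ("that", "is", "not", "a", "sentence"),
]
vocabulary = {word for sent in training for word in sent}

print(longest_match_segment("thatisadog", vocabulary, max_word_length=4))
# ['that', 'is', 'a', 'd', 'o', 'g']

# The random output depends on the seed, so no expected result is shown.
print(random_segment("thatisadog", prob=0.3, rng=random.Random(0)))
```

Note that with `max_word_length=4` the training word `sentence` (8 characters) can never be matched in this sketch, which shows how the length cap limits the candidates considered.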

Sample code snippet:

```python
from wordseg import LongestStringMatching

# Initialize a model.
model = LongestStringMatching(max_word_length=4)

# Train the model.
# `fit` takes an iterable of segmented sentences, each a tuple or list of strings.
model.fit(
  [
    ("this", "is", "a", "sentence"),
    ("that", "is", "not", "a", "sentence"),
  ]
)

# Make some predictions; `predict` gives a generator, which is materialized by list() here.
list(model.predict(["thatisadog", "thisisnotacat"]))
# [['that', 'is', 'a', 'd', 'o', 'g'], ['this', 'is', 'not', 'a', 'c', 'a', 't']]
# We can't get 'dog' and 'cat' because they aren't in the training data.
```

## License

MIT License. Please see [`LICENSE.txt`](https://github.com/jacksonllee/wordseg/blob/main/LICENSE.txt).

## Changelog

Please see [`CHANGELOG.md`](https://github.com/jacksonllee/wordseg/blob/main/CHANGELOG.md).

## Contributing

Please see [`CONTRIBUTING.md`](https://github.com/jacksonllee/wordseg/blob/main/CONTRIBUTING.md).

## Citation

Lee, Jackson L. 2023. wordseg: Word segmentation models in Python. https://doi.org/10.5281/zenodo.4077433

```bibtex
@software{leengrams,
  author       = {Jackson L. Lee},
  title        = {wordseg: Word segmentation models in Python},
  year         = 2023,
  doi          = {10.5281/zenodo.4077433},
  url          = {https://doi.org/10.5281/zenodo.4077433}
}
```

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "wordseg",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "computational linguistics,natural language processing,NLP,word segmentation,linguistics,corpora,speech,language",
    "author": "",
    "author_email": "\"Jackson L. Lee\" <jacksonlunlee@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/52/f8/fdf2a02790257d75017a36018ea7ce7f8530beaecec6dc97ba7c46769dc5/wordseg-0.0.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "Word segmentation models",
    "version": "0.0.5",
    "split_keywords": [
        "computational linguistics",
        "natural language processing",
        "nlp",
        "word segmentation",
        "linguistics",
        "corpora",
        "speech",
        "language"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5ef436a9e82678df2f11fdb5803c7463854b63b4151fff729aba807e4d7f665c",
                "md5": "0a7301c64a87fa9a5c5fde842661ba36",
                "sha256": "7c607f58e040c19d187e2886b093b06470ba016454377e19f63e1b866774b3d6"
            },
            "downloads": -1,
            "filename": "wordseg-0.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0a7301c64a87fa9a5c5fde842661ba36",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 6181,
            "upload_time": "2023-03-11T21:57:47",
            "upload_time_iso_8601": "2023-03-11T21:57:47.399477Z",
            "url": "https://files.pythonhosted.org/packages/5e/f4/36a9e82678df2f11fdb5803c7463854b63b4151fff729aba807e4d7f665c/wordseg-0.0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "52f8fdf2a02790257d75017a36018ea7ce7f8530beaecec6dc97ba7c46769dc5",
                "md5": "4fd9b2abf4721d0a47a56b1d2d6c42d1",
                "sha256": "0ba2b87bcfa801508e8dbeb71c62d3056fc462b2610fd5d883680f636204700c"
            },
            "downloads": -1,
            "filename": "wordseg-0.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "4fd9b2abf4721d0a47a56b1d2d6c42d1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 6231,
            "upload_time": "2023-03-11T21:58:09",
            "upload_time_iso_8601": "2023-03-11T21:58:09.821878Z",
            "url": "https://files.pythonhosted.org/packages/52/f8/fdf2a02790257d75017a36018ea7ce7f8530beaecec6dc97ba7c46769dc5/wordseg-0.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-03-11 21:58:09",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "wordseg"
}