# wordseg
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4077433.svg)](https://doi.org/10.5281/zenodo.4077433)
[![PyPI version](https://badge.fury.io/py/wordseg.svg)](https://pypi.org/project/wordseg)
[![Supported Python versions](https://img.shields.io/pypi/pyversions/wordseg.svg)](https://pypi.org/project/wordseg)
[![CircleCI](https://circleci.com/gh/jacksonllee/wordseg/tree/main.svg?style=shield)](https://circleci.com/gh/jacksonllee/wordseg/tree/main)
`wordseg` is a Python package of word segmentation models.
Table of contents:
* [Installation](https://github.com/jacksonllee/wordseg#installation)
* [Usage](https://github.com/jacksonllee/wordseg#usage)
* [License](https://github.com/jacksonllee/wordseg#license)
* [Changelog](https://github.com/jacksonllee/wordseg#changelog)
* [Contributing](https://github.com/jacksonllee/wordseg/blob/main/CONTRIBUTING.md)
* [Citation](https://github.com/jacksonllee/wordseg#citation)
## Installation
`wordseg` is available through pip:
```bash
pip install wordseg
```
To install `wordseg` from the GitHub source:
```bash
git clone https://github.com/jacksonllee/wordseg.git
cd wordseg
pip install -e ".[dev]"
```
## Usage
`wordseg` implements each word segmentation model as a Python class.
A model instance has the following methods
(modeled on the scikit-learn API for machine learning):
* `fit`: Train the model with segmented sentences.
* `predict`: Predict the segmented sentences from unsegmented sentences.
The implemented model classes are as follows:
* `RandomSegmenter`:
At each potential word boundary, a segmentation is predicted
independently at random with a given probability. No training is required.
* `LongestStringMatching`:
This model builds predicted words by scanning an unsegmented sentence
from left to right and, at each position, taking the longest word
seen in training, up to a maximum word length parameter.
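
To make the `RandomSegmenter` idea concrete, here is a minimal plain-Python sketch of boundary-level random segmentation. It is an illustration only and not the package's implementation: the function name `random_segment` and its `prob` and `seed` parameters are hypothetical names chosen for this example.

```python
import random


def random_segment(sentence, prob=0.5, seed=None):
    """Split `sentence` at each internal boundary with probability `prob`.

    Hypothetical helper for illustration; not part of the wordseg API.
    """
    rng = random.Random(seed)  # a local generator keeps the sketch reproducible
    words = []
    current = ""
    for i, char in enumerate(sentence):
        current += char
        is_last = i == len(sentence) - 1
        # Each boundary between two characters is decided independently.
        if not is_last and rng.random() < prob:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words
```

With `prob=0.0` the sentence comes back as a single word, and with `prob=1.0` every character becomes its own word. Whatever the probability, joining the output always reproduces the input string.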
Sample code snippet:
```python
from wordseg import LongestStringMatching
# Initialize a model.
model = LongestStringMatching(max_word_length=4)
# Train the model.
# `fit` takes an iterable of segmented sentences (a tuple or list of strings).
model.fit(
[
("this", "is", "a", "sentence"),
("that", "is", "not", "a", "sentence"),
]
)
# Make some predictions; `predict` gives a generator, which is materialized by list() here.
list(model.predict(["thatisadog", "thisisnotacat"]))
# [['that', 'is', 'a', 'd', 'o', 'g'], ['this', 'is', 'not', 'a', 'c', 'a', 't']]
# We can't get 'dog' and 'cat' because they aren't in the training data.
```
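
The left-to-right longest-match strategy can also be sketched without the package. The following is a simplified illustration under stated assumptions, not the code that `LongestStringMatching` actually runs: the function name `longest_match_segment`, the plain `set` vocabulary, and the single-character fallback are all choices made for this example.

```python
def longest_match_segment(sentence, vocabulary, max_word_length):
    """Greedy left-to-right segmentation by longest known word.

    Hypothetical helper for illustration; not part of the wordseg API.
    """
    if max_word_length < 1:
        raise ValueError("max_word_length must be at least 1")
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_length, len(sentence) - i)
        # Try the longest candidate first and shrink until a known word is found.
        for length in range(longest, 0, -1):
            candidate = sentence[i:i + length]
            # A single character is always accepted as the fallback.
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                i += length
                break
    return words


# Build a vocabulary from segmented training sentences,
# keeping only words that fit within the maximum word length.
training = [
    ("this", "is", "a", "sentence"),
    ("that", "is", "not", "a", "sentence"),
]
max_len = 4
vocab = {word for sent in training for word in sent if len(word) <= max_len}

print(longest_match_segment("thatisadog", vocab, max_len))
# ['that', 'is', 'a', 'd', 'o', 'g']
```

In this sketch, unknown material such as `dog` falls apart into single characters because no longer candidate matches the vocabulary, which mirrors the behavior shown in the sample output above.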
## License
MIT License. Please see [`LICENSE.txt`](https://github.com/jacksonllee/wordseg/blob/main/LICENSE.txt).
## Changelog
Please see [`CHANGELOG.md`](https://github.com/jacksonllee/wordseg/blob/main/CHANGELOG.md).
## Contributing
Please see [`CONTRIBUTING.md`](https://github.com/jacksonllee/wordseg/blob/main/CONTRIBUTING.md).
## Citation
Lee, Jackson L. 2023. wordseg: Word segmentation models in Python. https://doi.org/10.5281/zenodo.4077433
```bibtex
@software{leengrams,
author = {Jackson L. Lee},
title = {wordseg: Word segmentation models in Python},
year = 2023,
doi = {10.5281/zenodo.4077433},
url = {https://doi.org/10.5281/zenodo.4077433}
}
```