## Introduction
`shorttext` is a Python package that facilitates supervised and unsupervised
learning for short text categorization. Because short texts contain few words and
carry little information on their own, an intermediate
representation of the texts and documents is needed before they are passed to
any classification algorithm. This package supports various types
of these representations, including topic modeling and word-embedding algorithms.
The package `shorttext` runs on Python 3.8, 3.9, 3.10, and 3.11.
Characteristics:
- example data provided (including subject keywords and NIH RePORT);
- text preprocessing;
- pre-trained word-embedding support;
- `gensim` topic models (LDA, LSI, Random Projections) and autoencoder;
- topic model representation supported for supervised learning using `scikit-learn`;
- cosine distance classification;
- neural network classification (including ConvNet and C-LSTM);
- maximum entropy classification;
- metrics of differences between phrases, including the soft Jaccard score (using Damerau-Levenshtein distance) and Word Mover's Distance (WMD);
- character-level sequence-to-sequence (seq2seq) learning;
- spell correction;
- API for word-embedding algorithms with one-time loading; and
- sentence encodings and similarities based on BERT.
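To illustrate why an intermediate representation helps, here is a minimal sketch of cosine-similarity classification over averaged word vectors. It does not use the `shorttext` API: the names `TOY_VECTORS`, `embed`, `train`, and `classify` are hypothetical, and the three-dimensional vectors are made-up values standing in for a pre-trained word-embedding model.

```python
import math
from collections import defaultdict

# Toy 3-dimensional "embeddings"; hypothetical values for illustration only.
TOY_VECTORS = {
    "python": [0.9, 0.1, 0.0],
    "code": [0.8, 0.2, 0.1],
    "gene": [0.1, 0.9, 0.0],
    "protein": [0.0, 0.8, 0.2],
}


def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train(labelled_texts):
    """Build one centroid vector per class label from (label, text) pairs."""
    grouped = defaultdict(list)
    for label, text in labelled_texts:
        grouped[label].append(embed(text))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(text, centroids):
    """Return the label whose centroid is most similar to the text."""
    vector = embed(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


centroids = train([("programming", "python code"), ("biology", "gene protein")])
print(classify("protein gene", centroids))
```

With real data, the toy dictionary would be replaced by a pre-trained embedding model, and the package's own classifiers would take the place of these helper functions.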
## Documentation
Documentation and tutorials for `shorttext` can be found here: [http://shorttext.rtfd.io/](http://shorttext.rtfd.io/).
See the [tutorial](http://shorttext.readthedocs.io/en/latest/tutorial.html) for how to use the package, and the [FAQ](https://shorttext.readthedocs.io/en/latest/faq.html).
## Installation
To install the package, run `pip` in a console:
```
pip install shorttext
```
or, if you want the most recent development version on GitHub, type
```
pip install git+https://github.com/stephenhky/PyShortTextCategorization@master
```
Developers are advised to make sure that `Keras` (>=2) is installed. Users are advised to install the backend `Tensorflow` (preferred) or `Theano` in advance. It is also desirable to have `Cython` installed beforehand.
See [installation guide](https://shorttext.readthedocs.io/en/latest/install.html) for more details.
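As a quick sanity check after installation, the following sketch reports which of the packages mentioned above are present in the current environment. It relies only on the standard-library `importlib.metadata` module; the helper name `installed_version` is hypothetical and not part of `shorttext`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names taken from the installation notes above.
for name in ("shorttext", "tensorflow", "keras", "Cython"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```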
## Issues
To report any issues, go to the [Issues](https://github.com/stephenhky/PyShortTextCategorization/issues) tab of the GitHub page and start a thread.
Developers are welcome to submit pull requests
to fix any errors.
## Contributors
If you would like to contribute, feel free to submit pull requests. You can contact me in advance by e-mail or through the
[Issues](https://github.com/stephenhky/PyShortTextCategorization/issues) page.
## Useful Links
* Documentation: [http://shorttext.readthedocs.io](http://shorttext.readthedocs.io/)
* Github: [https://github.com/stephenhky/PyShortTextCategorization](https://github.com/stephenhky/PyShortTextCategorization)
* PyPI: [https://pypi.org/project/shorttext/](https://pypi.org/project/shorttext/)
* "Package shorttext 1.0.0 released," [Medium](https://medium.com/@stephenhky/package-shorttext-1-0-0-released-ca3cb24d0ff3)
* "Python Package for Short Text Mining", [WordPress](https://datawarrior.wordpress.com/2016/12/22/python-package-for-short-text-mining/)
* "Document-Term Matrix: Text Mining in R and Python," [WordPress](https://datawarrior.wordpress.com/2018/01/22/document-term-matrix-text-mining-in-r-and-python/)
* An [earlier version](https://github.com/stephenhky/PyShortTextCategorization/tree/b298d3ce7d06a9b4e0f7d32f27bab66064ba7afa) of this repository is a demonstration of the following blog post: [Short Text Categorization using Deep Neural Networks and Word-Embedding Models](https://datawarrior.wordpress.com/2016/10/12/short-text-categorization-using-deep-neural-networks-and-word-embedding-models/)
## News
* 07/12/2024: `shorttext` 2.0.0 released.
* 12/21/2023: `shorttext` 1.6.1 released.
* 08/26/2023: `shorttext` 1.6.0 released.
* 06/19/2023: `shorttext` 1.5.9 released.
* 09/23/2022: `shorttext` 1.5.8 released.
* 09/22/2022: `shorttext` 1.5.7 released.
* 08/29/2022: `shorttext` 1.5.6 released.
* 05/28/2022: `shorttext` 1.5.5 released.
* 12/15/2021: `shorttext` 1.5.4 released.
* 07/11/2021: `shorttext` 1.5.3 released.
* 07/06/2021: `shorttext` 1.5.2 released.
* 04/10/2021: `shorttext` 1.5.1 released.
* 04/09/2021: `shorttext` 1.5.0 released.
* 02/11/2021: `shorttext` 1.4.8 released.
* 01/11/2021: `shorttext` 1.4.7 released.
* 01/03/2021: `shorttext` 1.4.6 released.
* 12/28/2020: `shorttext` 1.4.5 released.
* 12/24/2020: `shorttext` 1.4.4 released.
* 11/10/2020: `shorttext` 1.4.3 released.
* 10/18/2020: `shorttext` 1.4.2 released.
* 09/23/2020: `shorttext` 1.4.1 released.
* 09/02/2020: `shorttext` 1.4.0 released.
* 07/23/2020: `shorttext` 1.3.0 released.
* 06/05/2020: `shorttext` 1.2.6 released.
* 05/20/2020: `shorttext` 1.2.5 released.
* 05/13/2020: `shorttext` 1.2.4 released.
* 04/28/2020: `shorttext` 1.2.3 released.
* 04/07/2020: `shorttext` 1.2.2 released.
* 03/23/2020: `shorttext` 1.2.1 released.
* 03/21/2020: `shorttext` 1.2.0 released.
* 12/01/2019: `shorttext` 1.1.6 released.
* 09/24/2019: `shorttext` 1.1.5 released.
* 07/20/2019: `shorttext` 1.1.4 released.
* 07/07/2019: `shorttext` 1.1.3 released.
* 06/05/2019: `shorttext` 1.1.2 released.
* 04/23/2019: `shorttext` 1.1.1 released.
* 03/03/2019: `shorttext` 1.1.0 released.
* 02/14/2019: `shorttext` 1.0.8 released.
* 01/30/2019: `shorttext` 1.0.7 released.
* 01/29/2019: `shorttext` 1.0.6 released.
* 01/13/2019: `shorttext` 1.0.5 released.
* 10/03/2018: `shorttext` 1.0.4 released.
* 08/06/2018: `shorttext` 1.0.3 released.
* 07/24/2018: `shorttext` 1.0.2 released.
* 07/17/2018: `shorttext` 1.0.1 released.
* 07/14/2018: `shorttext` 1.0.0 released.
* 06/18/2018: `shorttext` 0.7.2 released.
* 05/30/2018: `shorttext` 0.7.1 released.
* 05/17/2018: `shorttext` 0.7.0 released.
* 02/27/2018: `shorttext` 0.6.0 released.
* 01/19/2018: `shorttext` 0.5.11 released.
* 01/15/2018: `shorttext` 0.5.10 released.
* 12/14/2017: `shorttext` 0.5.9 released.
* 11/08/2017: `shorttext` 0.5.8 released.
* 10/27/2017: `shorttext` 0.5.7 released.
* 10/17/2017: `shorttext` 0.5.6 released.
* 09/28/2017: `shorttext` 0.5.5 released.
* 09/08/2017: `shorttext` 0.5.4 released.
* 09/02/2017: end of GSoC project. ([Report](https://rare-technologies.com/chinmayas-gsoc-2017-summary-integration-with-sklearn-keras-and-implementing-fasttext/))
* 08/22/2017: `shorttext` 0.5.1 released.
* 07/28/2017: `shorttext` 0.4.1 released.
* 07/26/2017: `shorttext` 0.4.0 released.
* 06/16/2017: `shorttext` 0.3.8 released.
* 06/12/2017: `shorttext` 0.3.7 released.
* 06/02/2017: `shorttext` 0.3.6 released.
* 05/30/2017: GSoC project ([Chinmaya Pancholi](https://rare-technologies.com/google-summer-of-code-2017-week-1-on-integrating-gensim-with-scikit-learn-and-keras/), with [gensim](https://radimrehurek.com/gensim/))
* 05/16/2017: `shorttext` 0.3.5 released.
* 04/27/2017: `shorttext` 0.3.4 released.
* 04/19/2017: `shorttext` 0.3.3 released.
* 03/28/2017: `shorttext` 0.3.2 released.
* 03/14/2017: `shorttext` 0.3.1 released.
* 02/23/2017: `shorttext` 0.2.1 released.
* 12/21/2016: `shorttext` 0.2.0 released.
* 11/25/2016: `shorttext` 0.1.2 released.
* 11/21/2016: `shorttext` 0.1.1 released.
## Possible Future Updates
- [ ] Splitting components into separate packages;
- [ ] More available corpora.
Raw data
{
"_id": null,
"home_page": "https://github.com/stephenhky/PyShortTextCategorization",
"name": "shorttext",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "shorttext natural language processing text mining",
"author": "Kwan Yuet Stephen Ho",
"author_email": "stephenhky@yahoo.com.hk",
"download_url": "https://files.pythonhosted.org/packages/11/f6/15ea7e5298092c67abda44f37a9b8b2f3d625a4c04efc7807fa80bb95f29/shorttext-2.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Short Text Mining",
"version": "2.0.0",
"project_urls": {
"Homepage": "https://github.com/stephenhky/PyShortTextCategorization"
},
"split_keywords": [
"shorttext",
"natural",
"language",
"processing",
"text",
"mining"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "11f615ea7e5298092c67abda44f37a9b8b2f3d625a4c04efc7807fa80bb95f29",
"md5": "bf2cf106ca48b8715a320fd82f7daea4",
"sha256": "90c00b7cf301d855b484e0112cf7389b41bb1699ec662d3791b24ecf5c8dbe2f"
},
"downloads": -1,
"filename": "shorttext-2.0.0.tar.gz",
"has_sig": false,
"md5_digest": "bf2cf106ca48b8715a320fd82f7daea4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 266627,
"upload_time": "2024-07-13T19:07:50",
"upload_time_iso_8601": "2024-07-13T19:07:50.192281Z",
"url": "https://files.pythonhosted.org/packages/11/f6/15ea7e5298092c67abda44f37a9b8b2f3d625a4c04efc7807fa80bb95f29/shorttext-2.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-13 19:07:50",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "stephenhky",
"github_project": "PyShortTextCategorization",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"circle": true,
"requirements": [
{
"name": "Cython",
"specs": [
[
">=",
"3.0.0"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.23.3"
]
]
},
{
"name": "scipy",
"specs": [
[
"<",
"1.13.0"
],
[
">=",
"1.10.0"
]
]
},
{
"name": "joblib",
"specs": [
[
">=",
"1.3.0"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
">=",
"1.2.0"
]
]
},
{
"name": "tensorflow",
"specs": [
[
">=",
"2.13.0"
]
]
},
{
"name": "keras",
"specs": [
[
">=",
"2.13.0"
]
]
},
{
"name": "gensim",
"specs": [
[
">=",
"4.0.0"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.2.0"
]
]
},
{
"name": "snowballstemmer",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "transformers",
"specs": [
[
">=",
"4.39.0"
]
]
},
{
"name": "torch",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "python-Levenshtein",
"specs": [
[
">=",
"0.21.0"
]
]
},
{
"name": "numba",
"specs": [
[
">=",
"0.57.0"
]
]
}
],
"test_requirements": [
{
"name": "simplerepresentations",
"specs": [
[
">=",
"0.0.4"
]
]
}
],
"lcname": "shorttext"
}
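The `requirements` entries in the raw metadata above pair each package name with a list of `[operator, version]` specs. As a rough sketch, they can be converted into pip-style requirement strings as follows; the helper name `to_requirement` is hypothetical, and only two entries are copied from the metadata for brevity.

```python
def to_requirement(entry):
    """Join a metadata entry into a pip-style string such as 'scipy<1.13.0,>=1.10.0'."""
    specs = ",".join(operator + ver for operator, ver in entry["specs"])
    return entry["name"] + specs


# Two entries copied from the "requirements" list in the raw metadata.
requirements = [
    {"name": "scipy", "specs": [["<", "1.13.0"], [">=", "1.10.0"]]},
    {"name": "gensim", "specs": [[">=", "4.0.0"]]},
]

for entry in requirements:
    print(to_requirement(entry))
```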