| Name | bitermplus |
| Version | 0.8.0 |
| home_page | None |
| Summary | Biterm Topic Model with sklearn-compatible API |
| upload_time | 2025-09-13 17:25:30 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | MIT License
Copyright (c) 2021 Maksim Terpilowski
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
| keywords |
topic-modeling
machine-learning
nlp
biterm
sklearn
text-mining
unsupervised-learning
|
# Biterm Topic Model


**Bitermplus** is a high-performance implementation of the [Biterm Topic Model](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.4032&rep=rep1&type=pdf) for short text analysis, originally developed by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Built on a cythonized version of [BTM](https://github.com/xiaohuiyan/BTM), it features OpenMP parallelization and a modern scikit-learn compatible API for seamless integration into ML workflows.
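
To make the term concrete: a biterm is an unordered pair of words that co-occur in the same short text. The sketch below illustrates the idea only, treating the whole document as one context window. The helper name `extract_biterms` is hypothetical and is not part of the bitermplus API.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs (biterms) from one short document."""
    # Sort each pair so ("a", "b") and ("b", "a") count as the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


doc = "deep learning neural networks".split()
for biterm in extract_biterms(doc):
    print(biterm)
```

A document with n tokens yields n(n-1)/2 biterms, which is why the approach suits short texts.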
## Key Features
- **Scikit-learn Compatible API** — Familiar `fit()`, `transform()`, and `fit_transform()` methods for easy adoption
- **ML Pipeline Integration** — Seamless compatibility with sklearn workflows, cross-validation, and grid search
- **High-Performance Computing** — Cythonized implementation with OpenMP parallel processing for speed
- **Advanced Inference Methods** — Multiple approaches including sum of biterms, sum of words, and mixed inference
- **Comprehensive Model Evaluation** — Built-in perplexity, semantic coherence, and entropy metrics
- **Intuitive Topic Interpretation** — Simple extraction of topic keywords and document-topic assignments
- **Flexible Text Preprocessing** — Customizable vectorization pipeline with sklearn CountVectorizer integration
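
As an illustration of the "sum of biterms" inference mentioned above, the following sketch computes P(z|d) = Σ_b P(z|b) · P(b|d), with P(z|b) proportional to P(z) · P(w_i|z) · P(w_j|z), as described in the original BTM paper. It is a plain-Python sketch under that assumption, not the library's Cython implementation; the function name and the toy probabilities are made up.

```python
from collections import Counter


def infer_doc_topics(doc_biterms, p_z, p_wz):
    """Sum-of-biterms inference: P(z|d) = sum_b P(z|b) * P(b|d).

    doc_biterms: list of (word_id, word_id) pairs from one document
    p_z:  topic probabilities P(z), length T
    p_wz: p_wz[z][w] = P(w|z)
    """
    n_topics = len(p_z)
    counts = Counter(doc_biterms)
    total = sum(counts.values())
    p_zd = [0.0] * n_topics
    for (wi, wj), n in counts.items():
        # P(z|b) is proportional to P(z) * P(wi|z) * P(wj|z)
        joint = [p_z[z] * p_wz[z][wi] * p_wz[z][wj] for z in range(n_topics)]
        norm = sum(joint)
        for z in range(n_topics):
            # P(b|d) is the empirical frequency of the biterm in the document
            p_zd[z] += (joint[z] / norm) * (n / total)
    return p_zd


# Toy model (made-up numbers): 2 topics over a 3-word vocabulary
p_z = [0.5, 0.5]
p_wz = [[0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7]]
print(infer_doc_topics([(0, 1), (0, 0)], p_z, p_wz))
```

The returned list sums to 1 for any non-empty document, so it can be read directly as a topic distribution.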
## Donate
If you find this package useful, please consider donating any amount of money. This will help me spend more time on supporting open-source software.
<a href="https://www.buymeacoffee.com/maximtrp" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" style="height: 60px !important;width: 217px !important;" ></a>
## Requirements
- **Python** ≥ 3.8
- **NumPy** ≥ 1.19.0 — Numerical computing foundation
- **Pandas** ≥ 1.2.0 — Data manipulation and analysis
- **SciPy** ≥ 1.6.0 — Scientific computing library
- **scikit-learn** ≥ 1.0.0 — Machine learning utilities and API compatibility
- **tqdm** ≥ 4.50.0 — Progress bars for model training
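
If you want to check an existing environment against these minimums before installing, a small script along the following lines may help. It is a sketch that uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper names and the naive version parsing are assumptions made for illustration.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the list above
MINIMUMS = {
    "numpy": "1.19.0",
    "pandas": "1.2.0",
    "scipy": "1.6.0",
    "scikit-learn": "1.0.0",
    "tqdm": "4.50.0",
}


def as_tuple(version_string):
    """Naively parse '1.19.0' into (1, 19, 0); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version_string.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '1.0' compares equal to '1.0.0'
    return tuple(parts + [0] * (3 - len(parts)))


def check_requirements(minimums=MINIMUMS):
    """Return {package: status string} for each requirement."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        if as_tuple(installed) >= as_tuple(minimum):
            report[name] = installed + " (ok)"
        else:
            report[name] = installed + " (older than " + minimum + ")"
    return report


for name, status in check_requirements().items():
    print(f"{name}: {status}")
```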
## Installation
### Standard Installation
Install the latest stable release from PyPI:
```bash
pip install bitermplus
```
### Development Version
Install the latest development version directly from the repository:
```bash
pip install git+https://github.com/maximtrp/bitermplus.git
```
### Platform-Specific Setup
**Linux/Ubuntu:** Ensure Python development headers are installed:
```bash
sudo apt-get install python3.x-dev # where x is your Python minor version
```
**Windows:** No additional setup required with standard Python installations.
**macOS:** Install OpenMP support for parallel processing:
```bash
# Install Xcode Command Line Tools (Homebrew itself must already be installed)
xcode-select --install
# Install OpenMP library
brew install libomp
pip install bitermplus
```
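If you encounter OpenMP compilation errors, point the compiler at libomp explicitly. The paths below assume Homebrew's Apple Silicon prefix (`/opt/homebrew`); on Intel Macs Homebrew uses `/usr/local` instead: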
```bash
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"
pip install bitermplus
```
## Quick Start
### Sklearn-style API (Recommended)
```python
import bitermplus as btm
# Sample documents
texts = [
    "machine learning algorithms are powerful",
    "deep learning neural networks process data",
    "natural language processing understands text",
]
# Create and train model
model = btm.BTMClassifier(n_topics=2, random_state=42)
doc_topics = model.fit_transform(texts)
# Get topic keywords
topic_words = model.get_topic_words(n_words=5)
print("Topic 0:", topic_words[0])
print("Topic 1:", topic_words[1])
# Evaluate model
coherence_score = model.score(texts)
print(f"Coherence: {coherence_score:.3f}")
```
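
A common next step is to pick the most likely topic for each document. The sketch below assumes that `fit_transform` returns a documents-by-topics matrix of probabilities P(z|d); it runs on a hard-coded stand-in with made-up values so that it works without a trained model. The helper `dominant_topics` is hypothetical, not part of bitermplus.

```python
def dominant_topics(doc_topics):
    """Return (topic_index, probability) of the most likely topic per document."""
    result = []
    for row in doc_topics:
        probs = list(row)
        # Index of the largest probability; ties resolve to the lowest index
        best = max(range(len(probs)), key=probs.__getitem__)
        result.append((best, probs[best]))
    return result


# Stand-in for the matrix returned by fit_transform (values are made up)
example = [[0.9, 0.1], [0.35, 0.65], [0.5, 0.5]]
for doc_id, (topic, prob) in enumerate(dominant_topics(example)):
    print(f"doc {doc_id}: topic {topic} (p={prob:.2f})")
```

Because the function only iterates over rows, it should accept a NumPy array as well as nested lists.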
### Traditional API
```python
import bitermplus as btm
import numpy as np
import pandas as pd
# Importing data
df = pd.read_csv(
    'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()
# Preprocessing
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)
# Initializing and running model
model = btm.BTM(
    X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)
# Metrics
coherence = model.coherence_
perplexity = model.perplexity_
```
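
For readers unfamiliar with the perplexity metric, the sketch below shows the standard definition, exp(-Σ n_dw · log P(w|d) / Σ n_dw) with P(w|d) = Σ_z P(w|z) · P(z|d). It is written in plain Python for clarity, and the library's own computation may differ in details. The argument names are chosen for illustration.

```python
import math


def perplexity(p_wz, p_zd, n_dw):
    """Standard topic-model perplexity (lower is better).

    p_wz[z][w] = P(w|z), p_zd[d][z] = P(z|d),
    n_dw[d][w] = count of word w in document d.
    """
    log_likelihood = 0.0
    n_words = 0
    for d, counts in enumerate(n_dw):
        for w, n in enumerate(counts):
            if n == 0:
                continue
            # P(w|d) marginalizes over topics
            p_wd = sum(p_wz[z][w] * p_zd[d][z] for z in range(len(p_wz)))
            log_likelihood += n * math.log(p_wd)
            n_words += n
    return math.exp(-log_likelihood / n_words)


# Sanity check: one topic that is uniform over 4 words
print(perplexity([[0.25] * 4], [[1.0]], [[1, 2, 0, 3]]))
```

A model that spreads probability uniformly over V words has a perplexity of V, so the sanity check above should print a value of approximately 4.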
### Visualization
Visualize your topic modeling results with [tmplot](https://github.com/maximtrp/tmplot):
```bash
pip install tmplot
```
```python
import tmplot as tmp
# Generate interactive topic visualization
tmp.report(model=model, docs=texts)
```

## Documentation
**[Sklearn-style API Guide](https://bitermplus.readthedocs.io/en/latest/sklearn_api.html)**
Complete guide to the modern sklearn-compatible interface with examples and best practices
**[Traditional API Tutorial](https://bitermplus.readthedocs.io/en/latest/tutorial.html)**
In-depth tutorial covering advanced topic modeling techniques and model evaluation
**[API Reference](https://bitermplus.readthedocs.io/en/latest/bitermplus.html)**
Comprehensive documentation of all functions, classes, and parameters
## Migration from v0.7.0 to v0.8.0
The traditional API remains fully compatible. The new sklearn-style API provides a simpler alternative:
### Old approach (still works)
```python
import bitermplus as btm

# Multi-step manual process; `texts` is a list of document strings
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)
model = btm.BTM(X, vocabulary, seed=42, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=100)
p_zd = model.transform(docs_vec)
```
### New approach (recommended)
```python
# Two calls with automatic preprocessing; `texts` is a list of document strings
model = btm.BTMClassifier(n_topics=8, random_state=42, max_iter=100)
p_zd = model.fit_transform(texts)
```
### Migration Benefits
- **Streamlined Workflow** — Direct text input with automatic preprocessing eliminates manual steps
- **Enhanced ML Integration** — Native support for sklearn pipelines, cross-validation, and hyperparameter tuning
- **Improved Developer Experience** — Clear parameter validation with informative error messages
- **Advanced Model Evaluation** — Built-in scoring methods and intuitive topic interpretation tools
- **Backward Compatibility:** All existing code using the traditional API will continue to work without modifications.
Raw data
{
"_id": null,
"home_page": null,
"name": "bitermplus",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "Maksim Terpilovskii <maximtrp@gmail.com>",
"keywords": "topic-modeling, machine-learning, nlp, biterm, sklearn, text-mining, unsupervised-learning",
"author": null,
"author_email": "Maksim Terpilovskii <maximtrp@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/c9/5b/59c9739ff2219d1cae36b7b7420ee61c86acf8c85a0169cf3cec8f4a6e24/bitermplus-0.8.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License\n \n Copyright (c) 2021 Maksim Terpilowski\n \n Permission is hereby granted, free of charge, to any person obtaining a copy\n of this software and associated documentation files (the \"Software\"), to deal\n in the Software without restriction, including without limitation the rights\n to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n copies of the Software, and to permit persons to whom the Software is\n furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in all\n copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n SOFTWARE.\n ",
"summary": "Biterm Topic Model with sklearn-compatible API",
"version": "0.8.0",
"project_urls": {
"Bug Tracker": "https://github.com/maximtrp/bitermplus/issues",
"Documentation": "https://bitermplus.readthedocs.io/",
"Homepage": "https://github.com/maximtrp/bitermplus",
"Repository": "https://github.com/maximtrp/bitermplus.git"
},
"split_keywords": [
"topic-modeling",
" machine-learning",
" nlp",
" biterm",
" sklearn",
" text-mining",
" unsupervised-learning"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "c95b59c9739ff2219d1cae36b7b7420ee61c86acf8c85a0169cf3cec8f4a6e24",
"md5": "a4a489b439c70f0db714a6927fb6c7f9",
"sha256": "0be58792e40ff8865dd33c1fd0595089a099d3ac5e4ee269a30cd7ac677973d9"
},
"downloads": -1,
"filename": "bitermplus-0.8.0.tar.gz",
"has_sig": false,
"md5_digest": "a4a489b439c70f0db714a6927fb6c7f9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 354116,
"upload_time": "2025-09-13T17:25:30",
"upload_time_iso_8601": "2025-09-13T17:25:30.119496Z",
"url": "https://files.pythonhosted.org/packages/c9/5b/59c9739ff2219d1cae36b7b7420ee61c86acf8c85a0169cf3cec8f4a6e24/bitermplus-0.8.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-13 17:25:30",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "maximtrp",
"github_project": "bitermplus",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "bitermplus"
}