bitermplus

- **Name:** bitermplus
- **Version:** 0.8.0
- **Summary:** Biterm Topic Model with sklearn-compatible API
- **Upload time:** 2025-09-13 17:25:30
- **Requires Python:** >=3.8
- **Author:** Maksim Terpilovskii
- **License:** MIT License, Copyright (c) 2021 Maksim Terpilowski
- **Keywords:** topic-modeling, machine-learning, nlp, biterm, sklearn, text-mining, unsupervised-learning
- **Homepage:** https://github.com/maximtrp/bitermplus
- **Documentation:** https://bitermplus.readthedocs.io/
- **Bug tracker:** https://github.com/maximtrp/bitermplus/issues

# Biterm Topic Model

![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/maximtrp/bitermplus/package-test.yml)
[![Documentation Status](https://readthedocs.org/projects/bitermplus/badge/?version=latest)](https://bitermplus.readthedocs.io/en/latest/?badge=latest)
![Codacy grade](https://img.shields.io/codacy/grade/192b6a75449040ff868932a15ca28ce9)
[![Issues](https://img.shields.io/github/issues/maximtrp/bitermplus.svg)](https://github.com/maximtrp/bitermplus/issues)
[![Downloads](https://static.pepy.tech/badge/bitermplus)](https://pepy.tech/project/bitermplus)
![PyPI](https://img.shields.io/pypi/v/bitermplus)

**Bitermplus** is a high-performance implementation of the [Biterm Topic Model](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.402.4032&rep=rep1&type=pdf) for short text analysis, originally developed by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Built on a cythonized version of [BTM](https://github.com/xiaohuiyan/BTM), it features OpenMP parallelization and a modern scikit-learn compatible API for seamless integration into ML workflows.

## Key Features

- **Scikit-learn Compatible API** — Familiar `fit()`, `transform()`, and `fit_transform()` methods for easy adoption
- **ML Pipeline Integration** — Seamless compatibility with sklearn workflows, cross-validation, and grid search
- **High-Performance Computing** — Cythonized implementation with OpenMP parallel processing for speed
- **Advanced Inference Methods** — Multiple approaches including sum of biterms, sum of words, and mixed inference
- **Comprehensive Model Evaluation** — Built-in perplexity, semantic coherence, and entropy metrics
- **Intuitive Topic Interpretation** — Simple extraction of topic keywords and document-topic assignments
- **Flexible Text Preprocessing** — Customizable vectorization pipeline with sklearn CountVectorizer integration
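
To make the "sum of biterms" inference mentioned above more concrete, here is a minimal pure-Python sketch of the rule from the original BTM paper, P(z|d) = Σ_b P(z|b) · P(b|d), where P(z|b) is proportional to θ_z · φ_{w_i|z} · φ_{w_j|z}. It is an illustration only and not the library's Cython implementation: `infer_doc_topics`, the toy `theta` and `phi` values, and the uniform fallback for documents with fewer than two words are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def infer_doc_topics(doc, theta, phi):
    """Sum-of-biterms inference: P(z|d) = sum_b P(z|b) * P(b|d).

    doc   -- list of word ids
    theta -- topic probabilities, theta[z] = P(z)
    phi   -- topic-word probabilities, phi[z][w] = P(w|z), assumed positive
    """
    n_topics = len(theta)
    # Count unordered word pairs (biterms) in the document.
    biterms = Counter(tuple(sorted(pair)) for pair in combinations(doc, 2))
    total = sum(biterms.values())
    if total == 0:
        # Assumption for this sketch: no biterms means a uniform distribution.
        return [1.0 / n_topics] * n_topics

    p_zd = [0.0] * n_topics
    for (wi, wj), count in biterms.items():
        # Unnormalized P(z|b) for every topic z.
        weights = [theta[z] * phi[z][wi] * phi[z][wj] for z in range(n_topics)]
        norm = sum(weights)
        for z in range(n_topics):
            # P(b|d) is the relative frequency of the biterm in the document.
            p_zd[z] += (weights[z] / norm) * (count / total)
    return p_zd


# Toy model: 2 topics over a 3-word vocabulary (made-up numbers).
theta = [0.5, 0.5]
phi = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
print(infer_doc_topics([0, 1], theta, phi))
```

Because each biterm's topic posterior is weighted by how often the biterm occurs in the document, the result is a proper distribution that sums to 1.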

## Donate

If you find this package useful, please consider donating any amount of money. This will help me spend more time on supporting open-source software.

<a href="https://www.buymeacoffee.com/maximtrp" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" style="height: 60px !important;width: 217px !important;" ></a>

## Requirements

- **Python** ≥ 3.8
- **NumPy** ≥ 1.19.0 — Numerical computing foundation
- **Pandas** ≥ 1.2.0 — Data manipulation and analysis
- **SciPy** ≥ 1.6.0 — Scientific computing library
- **scikit-learn** ≥ 1.0.0 — Machine learning utilities and API compatibility
- **tqdm** ≥ 4.50.0 — Progress bars for model training

## Installation

### Standard Installation

Install the latest stable release from PyPI:

```bash
pip install bitermplus
```

### Development Version

Install the latest development version directly from the repository:

```bash
pip install git+https://github.com/maximtrp/bitermplus.git
```

### Platform-Specific Setup

**Linux/Ubuntu:** Ensure Python development headers are installed:

```bash
sudo apt-get install python3.x-dev  # where x is your Python minor version
```

**Windows:** No additional setup required with standard Python installations.

**macOS:** Install OpenMP support for parallel processing:

```bash
# Install Xcode Command Line Tools (Homebrew itself must already be installed)
xcode-select --install
# Install OpenMP library
brew install libomp
pip install bitermplus
```

If you encounter OpenMP compilation errors, configure the environment:

```bash
# Resolve the libomp prefix via Homebrew (/opt/homebrew on Apple Silicon, /usr/local on Intel)
export LDFLAGS="-L$(brew --prefix libomp)/lib"
export CPPFLAGS="-I$(brew --prefix libomp)/include"
pip install bitermplus
```
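
After installing, one quick sanity check is to ask Python which version of the package it can see. The sketch below uses only the standard library (`importlib.metadata`, available since Python 3.8); `installed_version` is a hypothetical helper name introduced for this example, not part of bitermplus.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version if the install succeeded, otherwise a short notice.
print("bitermplus:", installed_version("bitermplus") or "not installed")
```

This only confirms that the distribution is registered in the current environment; it does not verify that the compiled extension was built with OpenMP support.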

## Quick Start

### Sklearn-style API (Recommended)

```python
import bitermplus as btm

# Sample documents
texts = [
    "machine learning algorithms are powerful",
    "deep learning neural networks process data",
    "natural language processing understands text"
]

# Create and train model
model = btm.BTMClassifier(n_topics=2, random_state=42)
doc_topics = model.fit_transform(texts)

# Get topic keywords
topic_words = model.get_topic_words(n_words=5)
print("Topic 0:", topic_words[0])
print("Topic 1:", topic_words[1])

# Evaluate model
coherence_score = model.score(texts)
print(f"Coherence: {coherence_score:.3f}")
```

### Traditional API

```python
import bitermplus as btm
import numpy as np
import pandas as pd

# Importing data
df = pd.read_csv(
    'dataset/SearchSnippets.txt.gz', header=None, names=['texts'])
texts = df['texts'].str.strip().tolist()

# Preprocessing
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

# Initializing and running model
model = btm.BTM(
    X, vocabulary, seed=12321, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=20)
p_zd = model.transform(docs_vec)

# Metrics
coherence = model.coherence_
perplexity = model.perplexity_
```
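
For readers new to the model, the preprocessing step above turns each document into biterms: unordered pairs of words that co-occur in the same short text. The following pure-Python sketch illustrates the idea on lists of word ids. It is not the code behind `btm.get_biterms`; `extract_biterms` and its `window` argument (following the windowing used in the original BTM reference code) are assumptions made for this illustration.

```python
def extract_biterms(doc_ids, window=15):
    """Return unordered word-id pairs that co-occur within `window` positions."""
    biterms = []
    for i, wi in enumerate(doc_ids):
        # Pair the word at position i with the words that follow it in the window.
        for wj in doc_ids[i + 1 : i + window]:
            # Store each pair with the smaller id first so order does not matter.
            biterms.append((min(wi, wj), max(wi, wj)))
    return biterms


# Toy vocabulary and one vectorized document (made-up ids).
vocab = {"deep": 0, "learning": 1, "networks": 2}
doc = [vocab["deep"], vocab["learning"], vocab["networks"]]
print(extract_biterms(doc))
```

A document with n words yields at most n·(n−1)/2 biterms, which is why the model works well on short texts where per-document word counts are too sparse for conventional topic models.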

### Visualization

Visualize your topic modeling results with [tmplot](https://github.com/maximtrp/tmplot):

```bash
pip install tmplot
```

```python
import tmplot as tmp

# Generate interactive topic visualization
tmp.report(model=model, docs=texts)
```

![Topic Modeling Visualization](images/topics_terms_plots.png)

## Documentation

**[Sklearn-style API Guide](https://bitermplus.readthedocs.io/en/latest/sklearn_api.html)**
Complete guide to the modern sklearn-compatible interface with examples and best practices

**[Traditional API Tutorial](https://bitermplus.readthedocs.io/en/latest/tutorial.html)**
In-depth tutorial covering advanced topic modeling techniques and model evaluation

**[API Reference](https://bitermplus.readthedocs.io/en/latest/bitermplus.html)**
Comprehensive documentation of all functions, classes, and parameters

## Migration from v0.7.0 to v0.8.0

The traditional API remains fully compatible. The new sklearn-style API provides a simpler alternative:

### Old approach (still works)

```python
import bitermplus as btm

# Multi-step manual process (texts is a list of strings)
X, vocabulary, vocab_dict = btm.get_words_freqs(texts)
docs_vec = btm.get_vectorized_docs(texts, vocabulary)
biterms = btm.get_biterms(docs_vec)

model = btm.BTM(X, vocabulary, seed=42, T=8, M=20, alpha=50/8, beta=0.01)
model.fit(biterms, iterations=100)
p_zd = model.transform(docs_vec)
```

### New approach (recommended)

```python
import bitermplus as btm

# Two calls replace the manual pipeline; preprocessing is automatic
model = btm.BTMClassifier(n_topics=8, random_state=42, max_iter=100)
p_zd = model.fit_transform(texts)
```

### Migration Benefits

- **Streamlined Workflow** — Direct text input with automatic preprocessing eliminates manual steps
- **Enhanced ML Integration** — Native support for sklearn pipelines, cross-validation, and hyperparameter tuning
- **Improved Developer Experience** — Clear parameter validation with informative error messages
- **Advanced Model Evaluation** — Built-in scoring methods and intuitive topic interpretation tools
- **Backward Compatibility:** All existing code using the traditional API will continue to work without modifications.

            
