Name | ctm-pytorch-advi |
Version | 0.1.3 |
home_page | None |
Summary | Correlated Topic Model (CTM) with ADVI in PyTorch: training, inference, and utilities |
upload_time | 2025-08-10 19:04:58 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.10 |
license | MIT |
### Correlated Topic Models in PyTorch (ADVI)
An end-to-end, clean implementation of the Correlated Topic Model (CTM) with Automatic Differentiation Variational Inference (ADVI) in PyTorch. This repo includes dataset preprocessing, training, evaluation, TensorBoard logging, and utilities to export topics and compute topic coherence.
CTM extends LDA by replacing the Dirichlet prior over document-topic proportions with a logistic-normal prior with full covariance, capturing correlations between topics.
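Written out, the difference between the two models is confined to the prior on the per-document topic proportions. The summary below uses the symbols from the Model and Objective section (`eta`, `theta`, `beta`, `mu`, `Sigma`). The LDA hyperparameter `\alpha` and the per-token symbols `z_{d,n}` and `w_{d,n}` are notation introduced here for illustration only.

```latex
\begin{align*}
\text{LDA:}\quad  & \theta_d \sim \mathrm{Dirichlet}(\alpha) \\
\text{CTM:}\quad  & \eta_d \sim \mathcal{N}(\mu, \Sigma), \qquad \theta_d = \mathrm{softmax}(\eta_d) \\
\text{Both:}\quad & z_{d,n} \mid \theta_d \sim \mathrm{Categorical}(\theta_d), \qquad
                    w_{d,n} \mid z_{d,n}, \beta \sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{align*}
```

The off-diagonal entries of `\Sigma` let the presence of one topic raise or lower the expected weight of another. A Dirichlet prior has no comparable free parameters.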
### Highlights
- Full-covariance logistic-normal prior parameterized via a learned Cholesky factor
- Mean-field Gaussian per-document variational posterior trained with ADVI
- Mini-batch ELBO with Monte Carlo (MC) estimates of the collapsed word likelihood (per-token topic assignments are marginalized out)
- Optional symmetric Dirichlet prior on topic-word distributions `beta`
- TensorBoard logging and optional metrics plot export
- Reproducible training with saved configs and exact vocabulary for deterministic inference
### Project Structure
```
src/ctm/
    __init__.py
    config.py         # TrainConfig dataclass (CLI surface)
    data.py           # 20NG loader + vectorization + DataLoaders
    model.py          # CTM module + ELBO
    train.py          # training loop, logging, checkpointing
    infer.py          # top-words, coherence, perplexity
    utils.py          # math and evaluation helpers
src/scripts/
    export_topics.py  # export top words to CSV from a checkpoint
```
### Requirements
- Python >= 3.10
- Key dependencies (see `pyproject.toml`):
```
torch
numpy
scipy
scikit-learn
tqdm
tyro
rich
tensorboard
matplotlib
spacy
```
If you enable lemmatization, install a spaCy model:
```bash
python -m spacy download en_core_web_sm
```
### Install
Using uv (recommended):
```bash
uv venv
uv sync
```
Or using pip:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
### Dataset
Training uses the 20 Newsgroups dataset as loaded by scikit-learn. Text is vectorized via `CountVectorizer` with an n-gram range of `(1, 3)` (unigrams through trigrams), English stopwords, token pattern `(?u)\b[a-zA-Z]{3,}\b` (alphabetic tokens of at least three letters), and configurable `max_df`, `min_df`, and `vocab_size`. spaCy lemmatization can optionally be enabled. A validation split is drawn from the training set.
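To make the tokenization rules concrete, the following standard-library sketch mimics what those settings do to a single string: lowercase the text, keep alphabetic tokens of at least three letters, drop stopwords, then form 1- to 3-grams. It is an illustration only, not the package's code. `TOY_STOPWORDS`, `tokenize`, and `ngrams` are hypothetical names standing in for scikit-learn's English stopword list and for `CountVectorizer`'s internal analyzer.

```python
import re

# Token pattern quoted in the README: alphabetic runs of length >= 3.
TOKEN_PATTERN = re.compile(r"(?u)\b[a-zA-Z]{3,}\b")

# Hypothetical two-word stand-in for scikit-learn's English stopword list.
TOY_STOPWORDS = {"for", "and"}


def tokenize(text: str) -> list[str]:
    """Lowercase, extract tokens with the pattern, then drop stopwords."""
    return [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in TOY_STOPWORDS]


def ngrams(tokens: list[str], n_min: int = 1, n_max: int = 3) -> list[str]:
    """Join every contiguous window of n_min..n_max tokens with a space."""
    return [
        " ".join(tokens[i : i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


tokens = tokenize("GPU drivers for X11 and 3D cards, v2 is ok")
terms = ngrams(tokens)
print(tokens)  # ['gpu', 'drivers', 'cards']
print(terms)
```

Tokens such as `x11`, `3d`, `v2`, `is`, and `ok` never reach the vocabulary because they contain digits or are shorter than three letters.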
### Quickstart
Train a CTM with 50 topics and a 5k vocabulary:
```bash
uv run python -m ctm.train --num-topics 50 --vocab-size 5000 --epochs 50 --batch-size 128 --lr 1e-2
```
After training, export top words and evaluate metrics:
```bash
uv run python -m ctm.infer --checkpoint runs/ctm/ctm_k50_v5000_e50_b128/ctm.pt --topn 12
```
Export topics to CSV:
```bash
uv run python src/scripts/export_topics.py --checkpoint runs/ctm/ctm_k50_v5000_e50_b128/ctm.pt --topn 15 --out topics.csv
```
### CLI Usage
Training (`ctm.train`) uses `tyro` to expose the `TrainConfig` dataclass as CLI flags. The defaults are shown below:
```bash
uv run python -m ctm.train \
  --num-topics 80 \
  --vocab-size 10000 \
  --max-df 0.95 \
  --min-df 5 \
  --remove-headers True \
  --remove-footers True \
  --remove-quotes True \
  --batch-size 128 \
  --epochs 50 \
  --lr 0.01 \
  --beta-dirichlet-alpha 0.05 \
  --mc-samples 5 \
  --seed 42 \
  --log-every 50 \
  --ckpt-dir runs/ctm \
  --device cuda \
  --val-split 0.1 \
  --use-tensorboard True \
  --plot-metrics False \
  --tensorboard-subdir tb \
  --use-lemmatization True \
  --spacy-model en_core_web_sm
```
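Because the flags are generated from a dataclass, the command above corresponds to a configuration object roughly like the one sketched here. The field names are inferred from the flag names (hyphens replaced by underscores) and the types are guessed from the default values, so treat this as an assumption about `config.py` rather than a copy of it.

```python
from dataclasses import asdict, dataclass


@dataclass
class TrainConfig:
    # Defaults copied from the README's CLI listing; types are assumptions.
    num_topics: int = 80
    vocab_size: int = 10000
    max_df: float = 0.95
    min_df: int = 5
    remove_headers: bool = True
    remove_footers: bool = True
    remove_quotes: bool = True
    batch_size: int = 128
    epochs: int = 50
    lr: float = 0.01
    beta_dirichlet_alpha: float = 0.05
    mc_samples: int = 5
    seed: int = 42
    log_every: int = 50
    ckpt_dir: str = "runs/ctm"
    device: str = "cuda"
    val_split: float = 0.1
    use_tensorboard: bool = True
    plot_metrics: bool = False
    tensorboard_subdir: str = "tb"
    use_lemmatization: bool = True
    spacy_model: str = "en_core_web_sm"


# With tyro installed, `tyro.cli(TrainConfig)` would build this object from sys.argv.
# Here the Quickstart overrides are applied directly instead.
cfg = TrainConfig(num_topics=50, vocab_size=5000)
print(asdict(cfg))
```

Keeping every option in one dataclass also makes it straightforward to serialize the full configuration next to the checkpoint.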
Inference (`ctm.infer`) options:
```bash
uv run python -m ctm.infer \
  --checkpoint runs/ctm/ctm_k80_v10000_e50_b128/ctm.pt \
  --topn 10 \
  --mc-samples 32 \
  --device cuda \
  --batch-size 256 \
  --coherence-metric npmi \
  --penalize-zero-npmi True \
  --fold-in-val True \
  --fold-in-steps 150 \
  --fold-in-lr 0.05
```
Notes:
- Set `--device cpu` if you do not have a CUDA GPU.
- Inference loads the exact vocabulary saved during training for consistent evaluation.
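For readers unfamiliar with the `npmi` coherence metric, here is a minimal standard-library sketch of document-level NPMI for one topic's top words. It is not the package's implementation. In particular, treating never-co-occurring pairs as -1 when penalized and as 0 otherwise is only a guess at what `--penalize-zero-npmi` controls, and the function name `npmi_coherence` is hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, docs, penalize_zero=True):
    """Average NPMI over all pairs of top words, using document co-occurrence."""
    doc_sets = [set(d) for d in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w_i, w_j in combinations(top_words, 2):
        p_i, p_j, p_ij = prob(w_i), prob(w_j), prob(w_i, w_j)
        if p_ij == 0.0:
            # Assumed convention: -1 if penalizing pairs that never co-occur, else 0.
            scores.append(-1.0 if penalize_zero else 0.0)
        elif p_ij == 1.0:
            # Both words occur in every document; NPMI is 0/0, set to its limit of 1.
            scores.append(1.0)
        else:
            pmi = math.log(p_ij / (p_i * p_j))
            scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)


docs = [["gpu", "driver"], ["gpu", "driver"], ["hockey", "team"], ["hockey", "game"]]
print(npmi_coherence(["gpu", "driver"], docs))
print(npmi_coherence(["gpu", "hockey"], docs))
print(npmi_coherence(["gpu", "hockey"], docs, penalize_zero=False))
```

NPMI ranges from -1 (never together) to 1 (always together), so averaging it over a topic's top-word pairs gives a score that is comparable across topics.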
### Outputs
For a run with `K=80`, `V=10000`, `epochs=50`, `batch_size=128`, outputs are placed under:
```
runs/ctm/ctm_k80_v10000_e50_b128/
├── config.json # full TrainConfig used
├── ctm.pt # checkpoint: model_state, m_all, logvar_all, vocab, N_train, N_val, cfg
├── tb/ # TensorBoard events (if enabled)
├── metrics.png # optional plot (if plot_metrics=True)
└── top_words.txt # written by ctm.infer
```
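The directory name encodes the four hyperparameters listed above. A small helper along these lines reproduces the pattern seen in the README's paths. `run_dir` is a hypothetical name, and the real naming code in `train.py` may differ.

```python
from pathlib import Path


def run_dir(ckpt_dir: str, num_topics: int, vocab_size: int, epochs: int, batch_size: int) -> Path:
    """Build the run directory following the ctm_k{K}_v{V}_e{epochs}_b{batch} pattern."""
    name = f"ctm_k{num_topics}_v{vocab_size}_e{epochs}_b{batch_size}"
    return Path(ckpt_dir) / name


d = run_dir("runs/ctm", num_topics=80, vocab_size=10000, epochs=50, batch_size=128)
print((d / "ctm.pt").as_posix())  # runs/ctm/ctm_k80_v10000_e50_b128/ctm.pt
```

This is handy when scripting a sweep and locating each run's `ctm.pt` for `ctm.infer`.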
### Model and Objective (brief)
- Document-topic logits: `eta_d ~ N(mu, Sigma)`, with `Sigma = L L^T`. The Cholesky factor `L` is built from an unconstrained matrix `L_raw` as `L = tril(L_raw)`, with softplus applied to the diagonal so that it stays positive.
- Topic proportions: `theta = softmax(eta)`.
- Likelihood: words drawn from the mixture `p(v | eta, beta) = sum_k theta_k beta_{k,v}`.
- Per-document variational posterior: `q(eta_d) = N(m_d, diag(exp(logvar_d)))`.
- The expected log-likelihood term of the ELBO is estimated with Monte Carlo samples; the global prior term includes an optional symmetric Dirichlet prior on `beta`.
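The pieces above can be traced numerically in a few lines. The sketch below uses only the standard library instead of PyTorch so that it runs anywhere. It builds `L` from an unconstrained matrix, forms `Sigma = L L^T`, and computes a Monte Carlo estimate of the expected log-likelihood for one toy document. All function names and toy values are hypothetical. The KL term against the prior and the Dirichlet term on `beta` are omitted, and nothing here is taken from `model.py`.

```python
import math
import random


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)); always positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cholesky_from_raw(l_raw):
    # L = tril(L_raw) with softplus on the diagonal, so Sigma = L L^T is positive definite.
    k = len(l_raw)
    return [
        [softplus(l_raw[i][j]) if i == j else (l_raw[i][j] if j < i else 0.0) for j in range(k)]
        for i in range(k)
    ]


def softmax(eta):
    top = max(eta)  # subtract the max for numerical stability
    exps = [math.exp(e - top) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def mc_expected_log_lik(counts, m_d, logvar_d, beta, num_samples, rng):
    # Monte Carlo estimate of E_q[ sum_v n_v * log(sum_k theta_k * beta[k][v]) ].
    acc = 0.0
    for _ in range(num_samples):
        # Reparameterized draw from q(eta_d) = N(m_d, diag(exp(logvar_d))).
        eta = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(m_d, logvar_d)]
        theta = softmax(eta)
        for v, n_v in enumerate(counts):
            if n_v:
                acc += n_v * math.log(sum(t * row[v] for t, row in zip(theta, beta)))
    return acc / num_samples


L = cholesky_from_raw([[0.0, 9.0], [0.5, -1.0]])  # the 9.0 above the diagonal is discarded
Sigma = [[sum(L[i][t] * L[j][t] for t in range(2)) for j in range(2)] for i in range(2)]
beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]  # K=2 topics over V=3 words; rows sum to 1
ll = mc_expected_log_lik([2, 0, 1], [0.0, 0.0], [-2.0, -2.0], beta, num_samples=5, rng=random.Random(0))
print(L, Sigma, ll)
```

In a PyTorch version the same reparameterized draw lets gradients flow back to `m_d` and `logvar_d`, which is what makes the estimate usable for ADVI.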
### TensorBoard
Enable with `--use-tensorboard True` and then run:
```bash
tensorboard --logdir runs/ctm/ctm_k80_v10000_e50_b128/tb
```
### Reproducibility
- Seeds are set for Python, NumPy, and PyTorch (`--seed`).
- Training saves the exact vectorizer vocabulary to the checkpoint; inference reconstructs data using it to ensure alignment.
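Why the saved vocabulary matters can be seen in a toy bag-of-words example: column `v` of the count matrix has to refer to the same term at inference time as it did when `beta` was learned. The helpers below are a hypothetical standard-library sketch of that idea, not the package's data pipeline.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct training token to a column index, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}


def to_counts(doc, vocab):
    """Bag-of-words row under a fixed vocabulary; out-of-vocabulary tokens are dropped."""
    row = [0] * len(vocab)
    for tok, n in Counter(doc).items():
        if tok in vocab:
            row[vocab[tok]] = n
    return row


train_docs = [["gpu", "driver", "gpu"], ["hockey", "team"]]
vocab = build_vocab(train_docs)  # this mapping is what would be stored in the checkpoint

# At inference time, reuse the stored mapping instead of refitting on new text.
new_row = to_counts(["team", "gpu", "playoffs", "gpu"], vocab)
print(vocab)    # {'driver': 0, 'gpu': 1, 'hockey': 2, 'team': 3}
print(new_row)  # [0, 2, 0, 1]
```

Refitting a vocabulary on the new documents would reorder or rename columns, and the learned topic-word matrix would silently be applied to the wrong terms.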
### FAQ / Troubleshooting
- 20 Newsgroups download fails: ensure internet access; scikit-learn caches the dataset after the first successful download.
- CUDA not available: pass `--device cpu`, or check that your PyTorch build detects CUDA (`torch.cuda.is_available()` should return `True`).
- spaCy errors: install the model `en_core_web_sm` or disable lemmatization with `--use-lemmatization False`.
### License
MIT
### References
- Blei, D. M., & Lafferty, J. D. (2006). Correlated Topic Models. In Advances in Neural Information Processing Systems 18.
Raw data
{
"_id": null,
"home_page": null,
"name": "ctm-pytorch-advi",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/46/c6/25d3c3aab0d80daf5b256c4ff0171ac51ce2665e421f9b5b55418a8cef75/ctm_pytorch_advi-0.1.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Correlated Topic Model (CTM) with ADVI in PyTorch: training, inference, and utilities",
"version": "0.1.3",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "468122581f11bdd1c312b82a4ab1340946199e78e5dd79119e04b6f68af586c6",
"md5": "77a652f3b30e5397cacff802cb32c78b",
"sha256": "91fea765940d4e9eb88d1af5cbb076a79a9e317d4477ef01bfd9b5c62794d827"
},
"downloads": -1,
"filename": "ctm_pytorch_advi-0.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "77a652f3b30e5397cacff802cb32c78b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 19920,
"upload_time": "2025-08-10T19:04:51",
"upload_time_iso_8601": "2025-08-10T19:04:51.887047Z",
"url": "https://files.pythonhosted.org/packages/46/81/22581f11bdd1c312b82a4ab1340946199e78e5dd79119e04b6f68af586c6/ctm_pytorch_advi-0.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "46c625d3c3aab0d80daf5b256c4ff0171ac51ce2665e421f9b5b55418a8cef75",
"md5": "6a1f3bf71048e368615ecee4a7b88731",
"sha256": "217d7d6cdcfad3f0fc41ad30001547bc6e32b6c1d58a2f258ad751ddc48c5621"
},
"downloads": -1,
"filename": "ctm_pytorch_advi-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "6a1f3bf71048e368615ecee4a7b88731",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 15293192,
"upload_time": "2025-08-10T19:04:58",
"upload_time_iso_8601": "2025-08-10T19:04:58.761165Z",
"url": "https://files.pythonhosted.org/packages/46/c6/25d3c3aab0d80daf5b256c4ff0171ac51ce2665e421f9b5b55418a8cef75/ctm_pytorch_advi-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-10 19:04:58",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "ctm-pytorch-advi"
}