seededPF


- Name: seededPF
- Version: 0.1.1
- Summary: SeededPF is a seed guided topic model based on Poisson factorization.
- Upload time: 2025-08-04 20:40:29
- Author: Bernd Prostmaier
- Requires Python: <3.12,>=3.10
- License: MIT
- Keywords: nlp, topicmodel, topic-modeling, textanalysis, text mining
- Requirements: matplotlib, numpy, pandas, scikit_learn, scipy, seaborn, tensorflow, tensorflow_probability

<h1 align="center">SeededPF</h1>

<div align="center">
    <a href="https://pypi.org/project/seededpf">
        <img alt="PyPI Version" src="https://img.shields.io/pypi/v/seededpf?color=blue">
    </a>
    <a href="https://www.python.org/downloads/">
        <img alt="Python Version" src="https://img.shields.io/pypi/pyversions/seededpf">
    </a>
    <a href="https://github.com/BPro2410/Seeded-Poisson-Factorization/blob/main/LICENSE">
        <img alt="License" src="https://img.shields.io/pypi/l/seededpf?color=Black">
    </a>
</div>

## What is seededPF
`seededPF` is an easy-to-use implementation of the Seeded Poisson Factorization (SPF) topic model, introduced in [this research paper](https://www.sciencedirect.com/science/article/pii/S095070512501161X). SPF provides a guided topic modeling approach that lets users pre-specify topics of interest by providing sets of seed words. Built on Poisson factorization, it uses variational inference for efficient and scalable estimation.

<p>
    <div align="center">
        <img src="https://raw.githubusercontent.com/BPro2410/Seeded-Poisson-Factorization/refs/heads/main/seededpf/spf_graphical.PNG" width="50%" alt/>
    </div>
</p>

Traditional unsupervised topic models (like LDA) often struggle to align with predefined conceptual domains and typically require significant post-processing efforts, such as topic merging or manual labeling, to ensure topic coherence. `seededPF` overcomes this limitation by enabling the pre-specification of topics, which leads to improved topic interpretability and reduces the need for manual post-processing. Additionally, it supports the estimation of unsupervised topics when no seed words are provided.

Consider using `seededPF` if:
- You need to fit a topic model with a specific topic schema.
- You wish to estimate a topic model that is partially or fully unsupervised (i.e., providing no seed words means fitting a standard Poisson factorization topic model without predefined topics).
- You require a fast and scalable topic modeling solution.

`seededPF` offers a high-performance, scalable interface for guided topic modeling, providing a reliable alternative to [keyATM](https://keyatm.github.io/keyATM/index.html) and [SeededLDA](https://github.com/koheiw/seededlda), while minimizing the need for manual intervention and enhancing topic interpretability.
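
To make the idea of seed-guided Poisson factorization concrete, the following NumPy sketch simulates word counts in which seed words receive an extra intensity for their topic. It is an illustration under assumptions, not the package's code: the additive form (a neutral intensity plus a seed-only boost), the gamma hyperparameters, and all variable names are chosen here for illustration. The paper linked above gives the actual model specification.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["phone", "app", "battery", "laptop", "keyboard", "screen"]
seeds = {"smartphone": {"phone", "app"}, "pc": {"laptop", "keyboard"}}
topics = list(seeds)
n_docs = 5

# 1.0 where a word is a seed word of a topic, 0.0 elsewhere (topics x words)
seed_mask = np.array([[float(w in seeds[t]) for w in vocab] for t in topics])

# Hypothetical gamma priors; the shapes and scales are illustrative assumptions
theta = rng.gamma(shape=0.3, scale=1.0, size=(n_docs, len(topics)))           # document-topic intensities
beta_tilde = rng.gamma(shape=0.3, scale=1.0, size=(len(topics), len(vocab)))  # neutral topic-term intensities
beta_star = rng.gamma(shape=1.0, scale=1.0, size=seed_mask.shape) * seed_mask # boost for seed words only

# Poisson rate of each (document, word) count
rates = theta @ (beta_tilde + beta_star)
counts = rng.poisson(rates)

print(counts)
```

Because `beta_star` is zero for every non-seed word, only the seed words of a topic start with a higher expected intensity, which is what ties a topic to its intended meaning.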


## Installation


`seededPF` works with **Python 3.10** or **Python 3.11**. The main dependencies are TensorFlow 2.18 and TensorFlow Probability 0.25.

> If you have a GPU available, _adjust the dependencies to enable GPU acceleration_.
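
Since the package declares `requires_python` as `<3.12,>=3.10`, it can help to check the interpreter before installing. The helper below is a small sketch; the function name `is_supported` is hypothetical and not part of `seededPF`.

```python
import sys

def is_supported(version_info=sys.version_info):
    """Return True if the interpreter satisfies '>=3.10,<3.12'."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 12)

if not is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
          "is outside the supported range (3.10 or 3.11).")
```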

### Via pip

The easiest way to install `seededPF` is via `pip`.

```{bash}
pip install seededpf
```

### From source

The package can also be installed from [GitHub](https://github.com/BPro2410/Seeded-Poisson-Factorization). Clone the repository and create a virtual environment with Python 3.10 or Python 3.11. Inside the virtual environment, use `pip` to install the required packages:

```{bash}
(venv)$ pip install -r requirements.txt
```


# Training the Seeded Poisson Factorization model

`seededPF` is an easy-to-use library for topic modeling. The essential steps are:
1. Imports and data preparation
2. Initialization
3. Reading documents
4. Training the model
5. Post-hoc analysis

The following minimal example is available on [GitHub](https://github.com/BPro2410/Seeded-Poisson-Factorization/blob/main/minimal_example.ipynb).

## Step 1: Imports and data preparation

Once installed, import the `SPF` class from the `seededPF` library. Only two things are required to fit the SPF topic model:
1. Text documents
2. A dictionary that maps each topic to be estimated to its seed words (i.e., keywords).

```python
# Imports
from seededpf import SPF
from sklearn.feature_extraction.text import CountVectorizer

# Example documents - customer reviews about either smartphones or computers
documents = [
    "My smartphone's battery life is fantastic, lasts all day!",
    "The camera on my phone is incredible, takes crystal-clear photos.",
    "Love the smooth performance, but it overheats with heavy apps.",
    "This phone charges super fast, very convenient.",
    "Upgraded my PC and it boots in seconds!",
    "Great for gaming, but gets hot after long sessions.",
    "My computer sometimes freezes, but a restart fixes it.",
    "Best laptop I’ve owned, powerful and reliable!"
]

# Define topic-specific seed words
smartphone = {"smartphone", "iphone", "phone", "touch", "app"}
pc = {"laptop", "keyboard", "desktop", "pc"}

keywords = {"smartphone": smartphone, "pc": pc}
```
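
Seed words can only guide a topic if they actually occur as tokens in the corpus. The sketch below checks this before fitting. It uses only the standard library and mimics the default token pattern of scikit-learn's `CountVectorizer` (tokens of two or more word characters). The helper name `seed_coverage` is hypothetical, and how `seededPF` itself treats seed words that are missing from the vocabulary is not covered here.

```python
import re

# Same pattern as CountVectorizer's default: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def seed_coverage(documents, keywords):
    """Split each topic's seed words into those found in the corpus and those missing."""
    vocabulary = set()
    for doc in documents:
        vocabulary.update(TOKEN_PATTERN.findall(doc.lower()))
    return {
        topic: {"found": sorted(seeds & vocabulary), "missing": sorted(seeds - vocabulary)}
        for topic, seeds in keywords.items()
    }

documents = [
    "My smartphone's battery life is fantastic, lasts all day!",
    "Love the smooth performance, but it overheats with heavy apps.",
    "Upgraded my PC and it boots in seconds!",
    "Best laptop I've owned, powerful and reliable!",
]
keywords = {
    "smartphone": {"smartphone", "iphone", "phone", "touch", "app"},
    "pc": {"laptop", "keyboard", "desktop", "pc"},
}

coverage = seed_coverage(documents, keywords)
for topic, result in coverage.items():
    print(topic, result)
```

Note that the token `apps` does not match the seed word `app`: without stemming or lemmatization, seed words need to match the tokens exactly as the vectorizer produces them.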

## Step 2: Initialization

Now that we have both the documents and the pre-specification of topics to be estimated, we can initialize the SPF topic model.

```python
spf = SPF(keywords = keywords, residual_topics = 0) # Fits 2 seeded topics and 0 unsupervised topics
```

## Step 3: Reading documents

We tokenize the documents and create all data required for model training automatically.

```python
spf.read_docs(documents, 
            count_vectorizer=CountVectorizer(stop_words="english", min_df = 0), 
            batch_size = 1024)
```

## Step 4: Training the model
For model training, we have to set the learning rate and the number of epochs.

```python
spf.model_train(lr = 0.1, epochs = 150)
```
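
The roles of the learning rate and the number of epochs can be illustrated with a deliberately tiny stand-in problem: fitting a single Poisson rate by gradient descent. This is not the package's optimizer or objective (the model is fitted with variational inference); the function `fit_log_rate` and the toy data are assumptions made for illustration only.

```python
import math

word_counts = [2, 4, 3, 3]                       # toy observations of a single word
target = sum(word_counts) / len(word_counts)     # sample mean, the optimal Poisson rate

def fit_log_rate(lr, epochs, log_rate=0.0):
    """Gradient descent on the mean Poisson negative log-likelihood."""
    for _ in range(epochs):
        # Derivative of exp(log_rate) - target * log_rate with respect to log_rate
        gradient = math.exp(log_rate) - target
        log_rate -= lr * gradient
    return math.exp(log_rate)

print(fit_log_rate(lr=0.1, epochs=5))     # too few epochs: still well below the sample mean
print(fit_log_rate(lr=0.1, epochs=150))   # enough epochs: close to the sample mean
```

Too few epochs stop before convergence, while a learning rate that is too large can make the updates overshoot. Plotting the loss, as `SPF.plot_model_loss()` does for the negative ELBO, is the practical way to judge both settings.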


## Step 5: Analysis of the results

Several methods are available for analyzing the fitted topic model. The [minimal example](https://github.com/BPro2410/Seeded-Poisson-Factorization/blob/main/minimal_example.ipynb) and the [advanced example](https://github.com/BPro2410/Seeded-Poisson-Factorization/blob/main/analysis/examples/SPF_example_notebook.ipynb) demonstrate the post-hoc analysis methods.


The `seededPF` package offers several methods, including:
- `SPF.plot_model_loss()`: Checks convergence of the negative ELBO.
- `SPF.return_topics()`: Returns a tuple (categories, E_theta), with categories being the most probable topic for each document and E_theta being the approximate posterior mean estimates per document and topic.
- `SPF.calculate_topic_word_distributions()`: Returns a pandas dataframe containing the approximate topic-term mean intensities.
- `SPF.print_topics()`: Returns a dictionary with the highest intensity words per topic.
- `SPF.plot_seeded_topic_distribution()`: Plots the variational topic word distribution of all seed words belonging to the topic parameter.
- `SPF.plot_word_distribution()`: Shows the fitted variational distributions `q(\tilde{\beta})_{topic,word}` and `q(\beta^*)_{topic,word}`.
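
The relationship between the outputs described above can be sketched with plain NumPy. The arrays below are made-up numbers, not results from a fitted model, and the code does not call `seededPF`. It only shows how a most probable topic per document follows from document-topic means, and how highest-intensity words follow from topic-term means.

```python
import numpy as np

topics = ["smartphone", "pc"]
vocabulary = ["battery", "camera", "laptop", "pc", "phone"]

# Made-up posterior mean document-topic intensities (rows: documents, columns: topics)
E_theta = np.array([[0.9, 0.1],
                    [0.2, 0.7],
                    [0.6, 0.5]])

# Most probable topic per document: the column with the largest intensity
categories = [topics[k] for k in E_theta.argmax(axis=1)]

# Made-up topic-term mean intensities (rows: topics, columns: vocabulary terms)
E_beta = np.array([[0.8, 0.6, 0.1, 0.1, 1.5],
                   [0.3, 0.1, 1.2, 1.4, 0.1]])

# Two highest-intensity words per topic, strongest first
top_words = {
    topic: [vocabulary[v] for v in np.argsort(E_beta[k])[::-1][:2]]
    for k, topic in enumerate(topics)
}

print(categories)
print(top_words)
```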


# Contribution

If you encounter any bugs or would like to suggest new features for the library, please feel free to contact us or create an [issue](https://github.com/BPro2410/Seeded-Poisson-Factorization/issues).

# Citing

When citing `seededPF`, please use this BibTeX entry:

```
@article{PROSTMAIER2025114116,
    title = {Seeded Poisson Factorization: leveraging domain knowledge to fit topic models},
    journal = {Knowledge-Based Systems},
    volume = {327},
    pages = {114116},
    year = {2025},
    issn = {0950-7051},
    doi = {https://doi.org/10.1016/j.knosys.2025.114116},
    url = {https://www.sciencedirect.com/science/article/pii/S095070512501161X},
    author = {Bernd Prostmaier and Jan Vávra and Bettina Grün and Paul Hofmarcher},
    keywords = {Poisson factorization, Topic model, Variational inference, Customer feedback},
    abstract = {Topic models are widely used for discovering latent thematic structures in large text corpora, yet traditional unsupervised methods often struggle to align with pre-defined conceptual domains. This paper introduces seeded Poisson factorization (SPF), a novel approach that extends the Poisson factorization (PF) framework by incorporating domain knowledge through seed words. SPF enables a structured topic discovery by modifying the prior distribution of topic-specific term intensities, assigning higher initial rates to pre-defined seed words. The model is estimated using variational inference with stochastic gradient optimization, ensuring scalability to large datasets. We present in detail the results of applying SPF to an Amazon customer feedback dataset, leveraging pre-defined product categories as guiding structures. SPF achieves superior performance compared to alternative guided probabilistic topic models in terms of computational efficiency and classification performance. Robustness checks highlight SPF’s ability to adaptively balance domain knowledge and data-driven topic discovery, even in case of imperfect seed word selection. Further applications of SPF to four additional benchmark datasets, where the corpus varies in size and the number of topics differs, demonstrate its general superior classification performance compared to the unseeded PF model.}
}
```

# License

Code licensed under [MIT](https://github.com/BPro2410/Seeded-Poisson-Factorization/blob/main/LICENSE).


            
