semantic-components

Name: semantic-components
Version: 0.1.1
Summary: Finding semantic components in your neural representations.
Upload time: 2024-10-30 17:34:14
Requires Python: >=3.8
License: MIT License (Copyright (c) 2024 Florian Eichin)
Keywords: nlp, clustering, topic modeling, embeddings


[![PyPI - Python](https://img.shields.io/badge/python-v3.8+-blue.svg)](https://pypi.org/project/semantic-components/0.1.0/)
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/mainlp/semantic_components/blob/main/LICENSE)
[![PyPI - PyPi](https://img.shields.io/pypi/v/semantic-components)](https://pypi.org/project/semantic-components/0.1.0/)
[![arXiv](https://img.shields.io/badge/arXiv-2410.21054-b31b1b.svg)](https://arxiv.org/abs/2410.21054)


# Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics

<img src="images/tweet_decomposition.png" width="50%" height="50%" align="right" />

Semantic Component Analysis (SCA) is a powerful tool to analyse *your* text datasets. If you want to find out how it works and why it is the right tool for you, consider reading [our paper](https://arxiv.org/abs/2410.21054).

If you just want to test the method as quickly as possible, continue with the Quick Start section. For everything else, the Manual Installation section should have you covered. If you run into any problems or have suggestions, feel free to create an issue and we will try to address it in a future release.

## Quick Start

The method is available on PyPI as part of the `semantic_components` package. You can install it with

```bash
pip install semantic_components
```

Running SCA is as simple as importing the package and writing two lines of code to instantiate and fit the model:

```python
from semantic_components.sca import SCA

# fit sca model to data
sca = SCA(alpha_decomposition=0.1, mu=0.9, combine_overlap_threshold=0.5)
scores, residuals, ids = sca.fit(documents, embeddings)

# get representations and explainable transformations
representations = sca.representations  # pandas df
transformed = sca.transform(embeddings)  # equivalent to variable scores above
```

A full example, including computing the embeddings and loading the Trump dataset, can be found in `example.py`. We advise cloning this repository if you want to run this example and/or our experiments in the `experiments/` directory.
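
As a rough sketch of the embedding step (not the actual contents of `example.py`), you might compute document embeddings with `sentence-transformers`, which is one of the listed dependencies; the model name and the tiny placeholder corpus below are assumptions for illustration only:

```python
from sentence_transformers import SentenceTransformer

# Placeholder corpus of short texts; in practice, load your own dataset.
documents = ["example tweet one", "example tweet two", "example tweet three"]

# Compute dense sentence embeddings; the model name is an assumption --
# any sentence-transformers model that produces dense vectors should work.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(documents, show_progress_bar=True)

# `documents` and `embeddings` can then be passed to SCA.fit as shown above.
```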

Where applicable, the experiment scripts store their outputs in the `results/` folder. Run SCA with `save_results=True` and `verbose=True` to enable this behaviour. This generates a `reports.txt` containing information and evaluation metrics, as well as `.pkl` and `.txt` files with the representations of the semantic components found by the procedure. A minimal sketch of this configuration follows.
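
A minimal sketch, assuming `save_results` and `verbose` are constructor arguments of `SCA` and reusing the Quick Start parameters:

```python
from semantic_components.sca import SCA

# Same configuration as in the Quick Start, plus result saving and verbose
# logging; outputs such as reports.txt are then written under results/.
sca = SCA(
    alpha_decomposition=0.1,
    mu=0.9,
    combine_overlap_threshold=0.5,
    save_results=True,
    verbose=True,
)
scores, residuals, ids = sca.fit(documents, embeddings)
```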

## OCTIS Evaluation

By default, we do not install `octis`, as it requires older versions of some other packages and thus creates compatibility issues. If you want to use the OCTIS evaluation (i.e., topic coherence and diversity), install the package with the `octis` extra:
```bash
pip install semantic_components[octis]
```
For both installation variants, we recommend Python 3.10 or higher.
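
If you want to compute coherence and diversity yourself, a rough sketch using OCTIS's metric classes might look like the following; the placeholder corpus and component top-words are assumptions for illustration, and the experiment scripts in `experiments/` remain the authoritative reference for how we evaluate:

```python
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

# Placeholder corpus and component top-words; in practice these would come
# from your documents and the fitted SCA model's representations.
documents = [
    "election vote count ballot win",
    "media news fake report press",
]
component_words = [
    ["election", "vote", "count", "ballot", "win"],
    ["media", "news", "fake", "report", "press"],
]
model_output = {"topics": component_words}

# OCTIS coherence needs a tokenized reference corpus.
tokenized_docs = [doc.lower().split() for doc in documents]

coherence = Coherence(texts=tokenized_docs, topk=5, measure="c_npmi")
diversity = TopicDiversity(topk=5)

print("NPMI coherence:", coherence.score(model_output))
print("Topic diversity:", diversity.score(model_output))
```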

## Manual Installation

In order to run the code provided in this repository, a number of non-standard Python packages need to be installed. As of October 2024, Python 3.10.x with the most recent package versions should work with the implementations provided. Here is a pip install command you can use in your environment to install all of them.
```bash
pip install sentence_transformers umap-learn hdbscan jieba scikit-learn pandas octis
```

Our experiments have been run with the following versions:

```
hdbscan                  0.8.39
jieba                    0.42.1
numpy                    1.26.4
octis                    1.14.0
pandas                   2.2.3
scikit-learn             1.1.0
sentence-transformers    3.2.0
torch                    2.4.1
transformers             4.45.2
umap-learn               0.5.6
```
You can clone this repository to your machine as follows:
```bash
git clone git@github.com:eichinflo/semantic_components.git
```

If you work with conda, for example, you can run the following commands to set up an environment suited to running the code:

```bash
cd semantic_components
conda create -n sca python=3.10.15
conda activate sca
pip install sentence_transformers umap-learn hdbscan jieba scikit-learn pandas octis
```
Then you're ready to run the example script, which reproduces part of the results on the Trump dataset:
```bash
python example.py
```

## Data

All data used in this work is publicly available. The Trump dataset is available from [the Trump Twitter Archive](https://www.thetrumparchive.com/). You can download your own version as a `.csv` directly from that page and put it in the `data/` directory (to work with the experiment code, rename it to `trump_tweets.csv`). However, we also provide the version we used in this repository.
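
As a rough sketch of loading the file from the location described above (the `text` column name is an assumption about the CSV layout of the archive export, so adjust it to your download):

```python
import pandas as pd

# Load the Trump Twitter Archive export placed at the path described above;
# the "text" column name is an assumption about the CSV layout.
df = pd.read_csv("data/trump_tweets.csv")
documents = df["text"].astype(str).tolist()

print(f"Loaded {len(documents)} tweets.")
```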

In addition, we publish the Chinese News dataset, which we acquired through the Twitter API and kept updated until our academic access was revoked in April 2023. We provide it as a download [HERE](https://drive.google.com/drive/folders/19H4gjnXGviXZUS8prv3l1WngGKwOpCMP?usp=sharing).

The current version of the Hausa Tweet dataset is available at the [NaijaSenti repository](https://github.com/hausanlp/NaijaSenti/blob/main/sections/unlabeled_twitter_corpus.md).

## AI Usage Disclaimer

The code in this repository was written with the support of code completions from an AI coding assistant, namely GitHub Copilot. Completions were mostly single lines up to a few lines of code and were always checked carefully to ensure their functionality and safety. Furthermore, we did our best to avoid accepting code completions that would be incompatible with the license of our code or could be regarded as plagiarism.


## Acknowledgements

We're grateful to Kristin Shi-Kupfer and David Adelani for consulting on the Chinese and Hausa datasets, respectively. Furthermore, we would like to mention that the code of the `c-TF-IDF` representer has been largely adapted from the original [BERTopic](https://github.com/MaartenGr/BERTopic) implementation by Maarten Grootendorst, released under the MIT license.

## Citing This Work

If you're using this work for your project, please consider citing our paper:

```bibtex
@misc{eichin2024semanticcomponentanalysisdiscovering,
      title={Semantic Component Analysis: Discovering Patterns in Short Texts Beyond Topics}, 
      author={Florian Eichin and Carolin Schuster and Georg Groh and Michael A. Hedderich},
      year={2024},
      eprint={2410.21054},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.21054}, 
}
```

            
