`ConTextMining` is a package that generates interpretable topic labels from the keywords of topic models (e.g., `LDA`, `BERTopic`) through few-shot in-context learning.
[![pypi package](https://img.shields.io/badge/pypi_package-v0.0.2-brightgreen)](https://pypi.org/project/ConTextMining/) [![GitHub Source Code](https://img.shields.io/badge/github_source_code-source_code?logo=github&color=green)](https://github.com/cja5553/ConTextMining)
## Requirements
### Required packages
The following packages are required for `ConTextMining`.
- `torch` (to learn how to install, please refer to [pytorch.org/](https://pytorch.org/))
- `transformers`
- `tokenizers`
- `huggingface-hub`
- `flash_attn`
- `accelerate`
To install these packages, you can do the following:
```bash
pip install torch transformers tokenizers huggingface-hub flash_attn accelerate
```
### GPU requirements
You need at least one GPU to use `ConTextMining`.
VRAM requirements depend on factors such as the number of keywords and the number of topics you wish to label.
At least 8GB of VRAM is recommended.
### Hugging Face access token
You will need a Hugging Face access token. To obtain one:
1. Create a [Hugging Face](https://huggingface.co) account if you do not already have one.
2. Create and store a new access token. To learn more, please refer to [huggingface.co/docs/hub/en/security-tokens](https://huggingface.co/docs/hub/en/security-tokens).
3. Note: some pre-trained large language models (LLMs) are gated and may require permission. For more information, please refer to [huggingface.co/docs/hub/en/models-gated](https://huggingface.co/docs/hub/en/models-gated).
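Rather than pasting the token directly into scripts, you may prefer to read it from an environment variable. A minimal sketch (the `HF_TOKEN` variable name and the `load_hf_token` helper are illustrative conventions, not part of `ConTextMining`):

```python
import os

def load_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read a Hugging Face access token from an environment variable,
    so it never needs to be hard-coded in source files."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token
```

You can then pass `load_hf_token()` wherever an access token is expected.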
## Installation
To install in Python, simply run:
```bash
pip install ConTextMining
```
## Quick Start
Here is a quick example of how to use `ConTextMining` to conveniently generate interpretable topic labels from the keywords of topic models.
```python
from ConTextMining import get_topic_labels
# specify your huggingface access token. To learn how to obtain one, refer to huggingface.co/docs/hub/en/security-tokens
hf_access_token="<your huggingface access token>"
# specify the huggingface model id. Choose between "microsoft/Phi-3-mini-4k-instruct", "meta-llama/Meta-Llama-3.1-8B-Instruct" or "google/gemma-2-2b-it"
model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"
# specify the keywords for the few-shot learning examples
keywords_examples = [
"olympic, year, said, games, team",
"mr, bush, president, white, house",
"said, report, evidence, findings, defense",
"french, union, germany, workers, paris",
"japanese, year, tokyo, matsui, said"
]
# specify the labels corresponding, by index, to the keywords in 'keywords_examples' above
labels_examples = [
"sports",
"politics",
"research",
"france",
"japan"
]
# specify the topic modeling keywords for which you wish to generate coherent topic labels
topic_modeling_keywords ='''Topic 1: [amazing, really, place, phenomenon, pleasant],
Topic 2: [loud, awful, sunday, like, slow],
Topic 3: [spinach, carrots, green, salad, dressing],
Topic 4: [mango, strawberry, vanilla, banana, peanut],
Topic 5: [fish, roll, salmon, fresh, good]'''
print(get_topic_labels(topic_modeling_keywords=topic_modeling_keywords,
                       keywords_examples=keywords_examples,
                       labels_examples=labels_examples,
                       model_id=model_id,
                       access_token=hf_access_token))
```
You will now get interpretable topic labels for all 5 topics!
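Topic models typically return their top keywords as a list per topic, whereas `get_topic_labels` expects a single formatted string. A small helper (the `format_topics` function is illustrative, not part of the package) can join per-topic word lists into the `Topic N: [...]` string format used above:

```python
def format_topics(topics: dict) -> str:
    """Join each topic's top keywords into the
    'Topic N: [w1, w2, ...]' string format, one topic per line."""
    lines = [
        f"Topic {topic_id}: [{', '.join(words)}]"
        for topic_id, words in sorted(topics.items())
    ]
    return ",\n".join(lines)

# e.g., top words per topic as produced by an LDA or BERTopic run
topics = {
    1: ["amazing", "really", "place", "phenomenon", "pleasant"],
    2: ["loud", "awful", "sunday", "like", "slow"],
}
print(format_topics(topics))
```

The resulting string can be passed directly as `topic_modeling_keywords`.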
## Documentation
```python
ConTextMining.get_topic_labels(*, topic_modeling_keywords, keywords_examples, labels_examples, model_id, access_token)
```
- `topic_modeling_keywords` *(str, required)*: keywords from the output of a topic model (keywords representing each cluster) for `ConTextMining` to label.
- `keywords_examples` *(list, required)*: list of strings containing topic modeling keywords that serve as training examples for few-shot learning.
- `labels_examples` *(list, required)*: list of strings containing the labels corresponding, by index, to the keywords in `keywords_examples` above.
- `model_id` *(str, optional)*: Hugging Face model id of choice. Currently a choice between "microsoft/Phi-3-mini-4k-instruct", "meta-llama/Meta-Llama-3.1-8B-Instruct", or "google/gemma-2-2b-it". Defaults to "google/gemma-2-2b-it".
- `access_token` *(str, required)*: Hugging Face access token. To learn how to obtain one, refer to [huggingface.co/docs/hub/en/security-tokens](https://huggingface.co/docs/hub/en/security-tokens). Defaults to `None`.
## Citation
C. Alba, "ConText Mining: Complementing topic models with few-shot in-context learning to generate interpretable topics." Working paper.
## Questions?
Contact me at [alba@wustl.edu](mailto:alba@wustl.edu)