corpusshow


Name: corpusshow
Version: 0.1.8
Home page: https://github.com/DSDanielPark/corpus-show
Summary: Corpus-Show makes it easier and faster to visualize corpus through sentence embedding of corpus.
Upload time: 2023-03-13 08:09:03
Author: parkminwoo
Requires Python: >=3.6
Requirements: No requirements were recorded.

# Corpus-Show
[![Contributor Covenant](https://img.shields.io/badge/contributor%20covenant-v2.0%20adopted-black.svg)](code_of_conduct.md)
[![Python Version](https://img.shields.io/badge/python-3.6%2C3.7%2C3.8-black.svg)](code_of_conduct.md)
![Pypi Version](https://img.shields.io/pypi/v/corpusshow.svg)
![Code convention](https://img.shields.io/badge/code%20convention-pep8-black)

Corpus-Show helps you understand how a corpus is distributed by visualizing the embeddings produced by a Sentence Transformer. (_It is not a sophisticated package, but it makes visualization convenient._)
- Corpus-Show performs sentence embedding via Sentence Transformers, a Python framework for state-of-the-art sentence, text, and image embeddings. [[Paper]](https://arxiv.org/abs/1908.10084) [[Document]](https://www.sbert.net/) [[Huggingface model]](https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1)
- You can visualize the embedded sentences of each document generated by SentenceTransformers.
- Corpus-Show can also cluster the sentence-embedding array with [Scikit-Learn KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).
- The sentence transformer model is downloaded through the Hugging Face interface. The default model is [`paraphrase-xlm-r-multilingual-v1`](https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1), which supports multiple languages, but you can pass any custom sentence transformer model available through the Hugging Face interface. Models can also be fine-tuned via SBERT. For more models, please see [this page](https://huggingface.co/sentence-transformers).
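
The steps listed above (embed, project for plotting, cluster) can be sketched directly with the underlying libraries. This is a minimal illustration, not Corpus-Show's own implementation: the helper names `embed_sentences`, `project_2d`, and `cluster_embeddings` are hypothetical, and PCA is used here only as one possible projection.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hub ID of the default model named in this README.
DEFAULT_MODEL = "sentence-transformers/paraphrase-xlm-r-multilingual-v1"


def embed_sentences(sentences, model_name=DEFAULT_MODEL):
    """Encode sentences into an (n_sentences, dim) array.

    The model is downloaded from the Hugging Face hub on first use.
    """
    # Imported lazily so the other helpers work without sentence-transformers.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return np.asarray(model.encode(list(sentences)))


def project_2d(embeddings, random_state=0):
    """Reduce embeddings to two PCA components for a scatter plot."""
    return PCA(n_components=2, random_state=random_state).fit_transform(embeddings)


def cluster_embeddings(embeddings, num_cluster=4, random_state=0):
    """Assign each embedding to one of num_cluster k-means clusters."""
    kmeans = KMeans(n_clusters=num_cluster, n_init=10, random_state=random_state)
    return kmeans.fit_predict(embeddings)
```

Calling `embed_sentences(df["news"])` and passing the result to the other two helpers yields 2D coordinates and a cluster label per document, which is the kind of data the plots in this README are built from.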




<br>

# Installation
  ```zsh
  pip install corpusshow
  ```
  - This package may not work properly on Apple silicon (M1/M2) Macs. If you are using an M1/M2 Mac, please use the git repository as a submodule instead, which is feasible because the package has minimal encapsulation. [[issue#1]](https://github.com/DSDanielPark/corpus-show/issues/1)
<br>

# Tutorial
We provide tutorial notebooks for all the features we offer. Additional docstrings and documentation are planned for the official release (major version 1 or higher).

1. Main-tutorials: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials/corpusshow_tutorial.ipynb
2. Sub-tutorial-folder: https://github.com/DSDanielPark/corpus-show/blob/main/tutorials
 

<Br><Br><Br><Br>

# Main Feature
Given a simple dataframe and its column names as input, Corpus-Show creates simple but useful plots like the ones shown below. The examples use the [BBC sample dataset](http://mlg.ucd.ie/datasets/bbc.html), stored in [`./data/bbc_news_dataset.csv`](https://github.com/DSDanielPark/corpus-show/blob/main/data/bbc_news_dataset.csv).


|     | news  | topic |
|:---:|:----|:----:|
|0|Oil rebounds from weather effect (...)|business|
|1|Indonesia 'declines debt freeze' (...)|business|
|...|...|...|
|601|EU software patent law faces axe (...)|tech|

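
Before handing a CSV like this to the package, it can help to confirm that the text and label columns exist and that no rows are empty. The following is a small sketch under that assumption; `check_corpus_frame` is a hypothetical helper, not part of Corpus-Show, and the sample rows are shortened placeholders rather than real dataset contents.

```python
import pandas as pd


def check_corpus_frame(df, text_col="news", label_col="topic"):
    """Return a copy of df with empty or missing texts dropped.

    Raises ValueError if either expected column is absent.
    """
    missing = [col for col in (text_col, label_col) if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    cleaned = df.dropna(subset=[text_col]).copy()
    # Normalize whitespace so blank strings can be filtered out.
    cleaned[text_col] = cleaned[text_col].astype(str).str.strip()
    cleaned = cleaned[cleaned[text_col] != ""]
    return cleaned.reset_index(drop=True)


# Placeholder rows shaped like the table above.
sample = pd.DataFrame(
    {
        "news": ["Oil rebounds from weather effect", "  ", None,
                 "EU software patent law faces axe"],
        "topic": ["business", "business", "tech", "tech"],
    }
)
cleaned = check_corpus_frame(sample)
```

The cleaned frame can then be written back with `cleaned.to_csv(...)` and its path passed as the CSV argument in the example below.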
<br>

## 1. `CorpusCluster`
Contains 1 static method. You can create great pictures with:
```python
from corpusshow import CorpusCluster

# Class arguments
csv_file_path = '../data/bbc_news_dataset.csv'
sentence_transformer_model_name = 'paraphrase-xlm-r-multilingual-v1'
target_col = 'news'
num_cluster = 4

# Get class object
cc = CorpusCluster(csv_file_path, sentence_transformer_model_name, target_col, num_cluster)

# 1. quick_corpus_show method: 
# Show figures without k-means clustering
cc.quick_corpus_show('topic', 'tsne2d', False, 'fig1.png')
cc.quick_corpus_show('topic', 'tsne3d', False, 'fig2.png')
cc.quick_corpus_show('topic', 'pca2d', False, 'fig3.png')
cc.quick_corpus_show('topic', 'pca3d', False, 'fig4.png')
```
![](https://github.com/DSDanielPark/corpus-show/blob/main/tutorials/imgs/readme_fig1.png)
```python
# 2. quick_cluster_show method:
# Show figures with k-means clustering
df_returned = cc.quick_cluster_show('tsne2d', False, 'fig5.png')
df_returned = cc.quick_cluster_show('tsne3d', False, 'fig6.png')
df_returned = cc.quick_cluster_show('pca2d', False, 'fig7.png')
df_returned = cc.quick_cluster_show('pca3d', False, 'fig8.png')
```
![](https://github.com/DSDanielPark/corpus-show/blob/main/tutorials/imgs/readme_fig2.png)
- To change the design of the plot, adjust matplotlib's `rcParams` before plotting, or draw your own figure from the returned dataframe.
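
As a rough sketch of that customization, the function below redraws a cluster scatter plot from a dataframe while temporarily overriding a few `rcParams` values. The column names `x`, `y`, and `cluster` are assumptions made for illustration; inspect the dataframe actually returned by `quick_cluster_show` and substitute its real column names.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd


def replot_clusters(df, x_col="x", y_col="y", cluster_col="cluster", save_path=None):
    """Draw one scatter series per cluster with custom rcParams applied."""
    style = {"figure.figsize": (6, 4), "font.size": 11, "axes.grid": True}
    # rc_context restores the previous rcParams when the block exits.
    with plt.rc_context(style):
        fig, ax = plt.subplots()
        for cluster_id, group in df.groupby(cluster_col):
            ax.scatter(group[x_col], group[y_col], s=12, label=f"cluster {cluster_id}")
        ax.legend()
        if save_path is not None:
            fig.savefig(save_path, dpi=150)
    return fig
```

To change the style globally instead of per figure, the same dictionary can be passed to `plt.rcParams.update(...)` before calling the package's plotting methods.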

<br>


# References
[1] Scikit-Learn https://scikit-learn.org <br>
[2] Matplotlib https://matplotlib.org/ <br>
[3] Huggingface Sentence Transformer https://huggingface.co/sentence-transformers <Br>
[4] SBERT https://www.sbert.net/

<br>


### Use Case
[1] [Korean-news-topic-classification-using-KO-BERT](https://github.com/DSDanielPark/fine-tuned-korean-BERT-news-article-classifier): all plots were created through Corpus-Show and Quick-Show.


### Contacts
Maintainer: [Daniel Park, South Korea](https://github.com/DSDanielPark) 
e-mail parkminwoo1991@gmail.com

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/DSDanielPark/corpus-show",
    "name": "corpusshow",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "",
    "author": "parkminwoo",
    "author_email": "parkminwoo1991@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/20/2e/09c03f9d64b4a62e7f55f0a08410032b1e5604e3d3e527a8eef2d99f426a/corpusshow-0.1.8.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Corpus-Show makes it easier and faster to visualize corpus through sentence embedding of corpus.",
    "version": "0.1.8",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4cf6e1bbe6f983946a20a42d260314df62fcc10090af5a5cae32a9e8246301cc",
                "md5": "6f0c333733033571549375ea1f0e955b",
                "sha256": "86cff99dc7eaf095a66af247dc8a839d1c4c665377a22175007036e39ccc4fd4"
            },
            "downloads": -1,
            "filename": "corpusshow-0.1.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6f0c333733033571549375ea1f0e955b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 10632,
            "upload_time": "2023-03-13T08:08:54",
            "upload_time_iso_8601": "2023-03-13T08:08:54.520424Z",
            "url": "https://files.pythonhosted.org/packages/4c/f6/e1bbe6f983946a20a42d260314df62fcc10090af5a5cae32a9e8246301cc/corpusshow-0.1.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "202e09c03f9d64b4a62e7f55f0a08410032b1e5604e3d3e527a8eef2d99f426a",
                "md5": "dcf66cdcc88ee2caa03f030ca9b09eaf",
                "sha256": "22b67263c75fc37489da1a6947da8c10a9d55a3b2c642a6fa1cec8ff6c75b79c"
            },
            "downloads": -1,
            "filename": "corpusshow-0.1.8.tar.gz",
            "has_sig": false,
            "md5_digest": "dcf66cdcc88ee2caa03f030ca9b09eaf",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 10046,
            "upload_time": "2023-03-13T08:09:03",
            "upload_time_iso_8601": "2023-03-13T08:09:03.707603Z",
            "url": "https://files.pythonhosted.org/packages/20/2e/09c03f9d64b4a62e7f55f0a08410032b1e5604e3d3e527a8eef2d99f426a/corpusshow-0.1.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-03-13 08:09:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "DSDanielPark",
    "github_project": "corpus-show",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "corpusshow"
}
        