MoreThanSentiments


-   Name: MoreThanSentiments
-   Version: 0.2.1
-   Home page: https://github.com/jinhangjiang/morethansentiments
-   Summary: An NLP python package for computing Boilerplate score and many other text features.
-   Upload time: 2022-12-23 02:06:07
-   Author: Jinhang Jiang, Karthik Srinivasan
-   Keywords: text mining, data science, natural language processing, accounting

[![DOI](https://zenodo.org/badge/490040941.svg)](https://zenodo.org/badge/latestdoi/490040941)
[![License](https://img.shields.io/badge/License-BSD_3--Clause-green.svg)](https://opensource.org/licenses/BSD-3-Clause)
[![PyPI](https://img.shields.io/pypi/v/morethansentiments)](https://pypi.org/project/morethansentiments/)

# MoreThanSentiments
Besides sentiment scores, this Python package offers several ways of quantifying a text corpus, each based on a measure from the published literature. It currently supports the following measures:

-   Boilerplate (Lang and Stice-Lawrence, 2015)
-   Redundancy (Cazier and Pfeiffer, 2015)
-   Specificity (Hope et al., 2016)
-   Relative_prevalence (Blankespoor, 2016)

A Medium blog post is available here: [MoreThanSentiments: A Python Library for Text Quantification](https://towardsdatascience.com/morethansentiments-a-python-library-for-text-quantification-e57ff9d51cd5)

## Citation

If this package was helpful in your work, feel free to cite it as:

- Jiang, J., & Srinivasan, K. (2022). MoreThanSentiments: A text analysis package. Software Impacts, 100456. https://doi.org/10.1016/J.SIMPA.2022.100456

## Installation

The easiest way to install the toolbox is via pip (pip3 in some
distributions):

    pip install MoreThanSentiments


## Usage

#### Import the Package

    import MoreThanSentiments as mts

#### Read data from txt files

    my_dir_path = "D:/YourDataFolder"
    df = mts.read_txt_files(PATH = my_dir_path)
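
As a rough picture of what loading a folder of text files involves, here is a minimal standard-library sketch. It is not the code behind `mts.read_txt_files`; the helper name `read_txt_folder` and the `doc_id` field are hypothetical, and only the `text` field mirrors the `df.text` column used in the rest of this guide.

```python
import tempfile
from pathlib import Path


def read_txt_folder(folder):
    """Return one record per .txt file: the file stem plus its full text."""
    records = []
    # Sort the paths so the document order is reproducible across runs.
    for path in sorted(Path(folder).glob("*.txt")):
        records.append({"doc_id": path.stem,
                        "text": path.read_text(encoding="utf-8")})
    return records


# Small demonstration on a temporary folder with two files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("First filing.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Second filing.", encoding="utf-8")
    records = read_txt_folder(tmp)
```

A list of records like this can be passed to `pandas.DataFrame(records)` if a data frame is needed.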

#### Sentence Tokenization

    df['sent_tok'] = df.text.apply(mts.sent_tok)

#### Clean Data

If you want to clean on the sentence level:

    # Clean each sentence separately; every document keeps a list of cleaned sentences
    df['cleaned_data'] = df['sent_tok'].apply(
        lambda sentences: [mts.clean_data(x,
                                          lower = True,
                                          punctuations = True,
                                          number = False,
                                          unicode = True,
                                          stop_words = False) for x in sentences])

If you want to clean on the document level:

    df['cleaned_data'] = df.text.apply(mts.clean_data, args=(True, True, False, True, False))

The data cleaning function offers the following options:
-   lower: convert all words to lowercase
-   punctuations: remove all punctuation from the corpus
-   number: remove all digits from the corpus
-   unicode: remove all Unicode characters from the corpus
-   stop_words: remove the stop words from the corpus
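
To make the effect of each flag concrete, the following is a minimal sketch of a cleaner with the same five switches. It is an illustration only, not the implementation of `mts.clean_data`: the function name `clean_text_sketch`, the tiny stop-word list, and the reading of "unicode" as "non-ASCII characters" are all assumptions.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not the package's list).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def clean_text_sketch(text, lower=True, punctuations=True, number=False,
                      unicode=True, stop_words=False):
    if lower:
        text = text.lower()
    if unicode:
        # Assumption: "unicode" means dropping every non-ASCII character.
        text = text.encode("ascii", errors="ignore").decode("ascii")
    if punctuations:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if number:
        text = re.sub(r"\d+", "", text)
    if stop_words:
        text = " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
    # Collapse any whitespace left behind by the removals above.
    return " ".join(text.split())


default_clean = clean_text_sketch("Revenue grew to €12 million in 2021!")
strict_clean = clean_text_sketch("Revenue grew to €12 million in 2021!",
                                 number=True, stop_words=True)
```

With the default flags the digits survive; switching on `number` and `stop_words` leaves only the content words.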

#### Boilerplate

    df['Boilerplate'] = mts.Boilerplate(df.sent_tok, n = 4, min_doc = 5, get_ngram = False)

Parameters:
-   input_data: this function requires tokenized documents.
-   n: the size of the n-grams to use. The default is 4.
-   min_doc: when building the n-gram list, ignore the n-grams that have a document frequency strictly lower than the given threshold. The default is 5 documents. 30% of the number of documents is recommended.
-   get_ngram: if this parameter is set to "True", the function returns a dataframe with all the n-grams and their corresponding frequencies, and the "min_doc" parameter has no effect.
-   max_doc: when building the n-gram list, ignore the n-grams that have a document frequency strictly higher than the given threshold. The default is 75% of the documents. It can be a percentage or an integer.
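
The way `n`, `min_doc`, and `max_doc` interact can be sketched in plain Python. This is an illustration of the idea behind the measure (the share of a document's words that sit in sentences containing a commonly shared n-gram), not the package's own code. The function name `boilerplate_sketch`, whitespace tokenization, and the exact handling of thresholds are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All consecutive n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def boilerplate_sketch(docs, n=4, min_doc=5, max_doc=0.75):
    """docs: list of documents, each a list of sentence strings."""
    def as_count(threshold):
        # Assumption: values strictly between 0 and 1 are a share of the
        # corpus; anything else is an absolute number of documents.
        return threshold * len(docs) if 0 < threshold < 1 else threshold

    low, high = as_count(min_doc), as_count(max_doc)

    # Document frequency: in how many documents does each n-gram occur?
    doc_freq = Counter()
    for doc in docs:
        seen = set()
        for sent in doc:
            seen.update(ngrams(sent.split(), n))
        doc_freq.update(seen)

    # Keep n-grams whose document frequency lies within [low, high].
    common = {g for g, freq in doc_freq.items() if low <= freq <= high}

    scores = []
    for doc in docs:
        total = flagged = 0
        for sent in doc:
            tokens = sent.split()
            total += len(tokens)
            # A sentence counts as boilerplate if it holds any common n-gram.
            if any(g in common for g in ngrams(tokens, n)):
                flagged += len(tokens)
        scores.append(flagged / total if total else 0.0)
    return scores


docs = [
    ["this report contains forward looking statements",
     "sales rose sharply this year"],
    ["this report contains forward looking statements",
     "we opened two new stores"],
    ["costs fell during the fourth quarter"],
    ["we hired a new chief executive"],
]
scores = boilerplate_sketch(docs, n=4, min_doc=2, max_doc=0.75)
```

In this toy corpus the shared disclaimer sentence appears in two of the four documents, so it passes both thresholds and accounts for 6 of the 11 words in each of the first two documents.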

#### Redundancy

    df['Redundancy'] = mts.Redundancy(df.cleaned_data, n = 10)

Parameters:
-   input_data: this function requires tokenized documents.
-   n: number of the ngrams to use. The default is 10.
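
As an illustration of the idea, redundancy can be read as the share of n-grams in a document that occur more than once in that same document. The sketch below follows that reading; it is not the package's implementation, and the function name `redundancy_sketch`, whitespace tokenization, and counting by occurrences rather than by distinct n-grams are assumptions.

```python
from collections import Counter


def redundancy_sketch(sentences, n=10):
    """sentences: one document given as a list of sentence strings."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.split()
        # Count every n-gram occurrence inside this sentence.
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    # Occurrences that belong to n-grams seen more than once in the document.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total if total else 0.0


doc = ["we expect sales to grow",
       "we expect sales to fall",
       "costs were flat"]
score = redundancy_sketch(doc, n=3)
```

Here a small `n` of 3 keeps the example readable: four of the seven trigram occurrences belong to repeated trigrams.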

#### Specificity

    df['Specificity'] = mts.Specificity(df.text)

Parameters:
-   input_data: this function requires untokenized documents (raw text).

#### Relative_prevalence

    df['Relative_prevalence'] = mts.Relative_prevalence(df.text)

Parameters:
-   input_data: this function requires untokenized documents (raw text).
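
This guide does not spell out the formula behind Relative_prevalence. As a rough intuition only, the sketch below assumes the measure captures how much quantitative information a text carries relative to its length, and approximates it as the share of whitespace-separated tokens that contain a digit. The function name `relative_prevalence_sketch` and this definition are assumptions, not the package's code.

```python
import re


def relative_prevalence_sketch(text):
    """Share of whitespace-separated tokens that contain at least one digit."""
    tokens = text.split()
    if not tokens:
        return 0.0
    numeric = [t for t in tokens if re.search(r"\d", t)]
    return len(numeric) / len(tokens)


share = relative_prevalence_sketch("Revenue rose 12% to $4.5 million in 2021")
```

In the example sentence, three of the eight tokens (`12%`, `$4.5`, `2021`) carry numbers.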


The full example code is available here:
-   [Script](https://github.com/jinhangjiang/morethansentiments/blob/main/tests/test_code.py)
-   [Jupyter Notebook](https://github.com/jinhangjiang/morethansentiments/blob/main/Boilerplate.ipynb)


# CHANGELOG
## Version 0.2.1, 2022-12-22
- Fixed the counting bug in Specificity 
- Added max_doc parameter to Boilerplate

## Version 0.2.0, 2022-10-02

- Added the "get_ngram" feature to the Boilerplate function
- Added a percentage option for "min_doc" in Boilerplate: when the given value is between 0 and 1, it is automatically treated as a percentage of the documents

## Version 0.1.3, 2022-06-10

- Updated the usage guide
- Minor fix to the script


## Version 0.1.2, 2022-05-08

- Initial release.



            
