[![License](https://img.shields.io/badge/License-BSD_3--Clause-green.svg)](https://opensource.org/licenses/BSD-3-Clause)
[![PyPI](https://img.shields.io/pypi/v/morethansentiments)](https://pypi.org/project/morethansentiments/)
[![Code Ocean](https://codeocean.com/codeocean-assets/badge/open-in-code-ocean.svg)](https://codeocean.com/capsule/7670045/tree)
[![Downloads](https://pepy.tech/badge/morethansentiments)](https://pepy.tech/project/morethansentiments)
# MoreThanSentiments
Besides sentiment scores, this Python package offers several ways of quantifying a text corpus, based on measures proposed in published research. Currently, we support the calculation of the following measures:
- Boilerplate (Lang and Stice-Lawrence, 2015)
- Redundancy (Cazier and Pfeiffer, 2015)
- Specificity (Hope et al., 2016)
- Relative_prevalence (Blankespoor, 2016)
A Medium blog post introducing the package is available here: [MoreThanSentiments: A Python Library for Text Quantification](https://towardsdatascience.com/morethansentiments-a-python-library-for-text-quantification-e57ff9d51cd5)
## Citation
If this package was helpful in your work, please cite it as:
- Jiang, Jinhang, and Karthik Srinivasan. "MoreThanSentiments: A text analysis package." Software Impacts 15 (2023): 100456. https://doi.org/10.1016/J.SIMPA.2022.100456
## Installation
The easiest way to install the toolbox is via pip (pip3 in some
distributions):
    pip install MoreThanSentiments
## Usage
#### Import the Package
    import MoreThanSentiments as mts
#### Read data from txt files
    my_dir_path = "D:/YourDataFolder"
    df = mts.read_txt_files(PATH=my_dir_path)
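If your documents are already in memory rather than in txt files, you can build the DataFrame yourself; the only assumption, matching the examples below, is a `text` column with one document per row (a minimal sketch):

    import pandas as pd

    docs = ["First document ...", "Second document ..."]
    df = pd.DataFrame({"text": docs})  # column name 'text' matches the usage below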
#### Sentence Token
    df['sent_tok'] = df.text.apply(mts.sent_tok)
#### Clean Data
If you want to clean at the sentence level:

    # clean each tokenized sentence within each document
    df['cleaned_data'] = [
        [mts.clean_data(x,
                        lower=True,
                        punctuations=True,
                        number=False,
                        unicode=True,
                        stop_words=False) for x in sents]
        for sents in df['sent_tok']
    ]
If you want to clean at the document level:

    df['cleaned_data'] = df.text.apply(mts.clean_data, args=(True, True, False, True, False))
For the data cleaning function, we offer the following options (these are also the positional `args` above, in order; a keyword-argument sketch follows this list):
- lower: convert all words to lowercase
- punctuations: remove all punctuation from the corpus
- number: remove all digits from the corpus
- unicode: remove all unicode characters from the corpus
- stop_words: remove stopwords from the corpus
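If you prefer explicit keywords over positional `args`, the document-level call can be written as below (a sketch reusing the flag names from the sentence-level example):

    df['cleaned_data'] = df.text.apply(
        lambda doc: mts.clean_data(doc,
                                   lower=True,
                                   punctuations=True,
                                   number=False,
                                   unicode=True,
                                   stop_words=False)
    )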
#### Boilerplate
    df['Boilerplate'] = mts.Boilerplate(df.sent_tok, n=4, min_doc=5, get_ngram=False)
Parameters:
- input_data: this function requires tokenized documents.
- n: the n-gram size to use. The default is 4.
- min_doc: when building the ngram list, ignore the ngrams whose document frequency is strictly lower than the given threshold. The default is 5 documents; a threshold around 30% of the number of documents is recommended. A value between 0 and 1 is treated as a percentage.
- get_ngram: if set to True, the function returns a DataFrame with all the ngrams and their corresponding frequencies, and the "min_doc" parameter is ignored (see the sketch after this list).
- max_doc: when building the ngram list, ignore the ngrams whose document frequency is strictly higher than the given threshold. The default is 75% of the documents. It can be a percentage or an integer.
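For example, a workflow that first inspects the shared n-grams and then scores with percentage thresholds might look like the sketch below (the threshold values are illustrative, not recommendations):

    # returns a DataFrame of ngrams and their frequencies; min_doc is ignored
    ngrams = mts.Boilerplate(df.sent_tok, n=4, get_ngram=True)

    # values between 0 and 1 are treated as percentages of the document count
    df['Boilerplate'] = mts.Boilerplate(df.sent_tok, n=4, min_doc=0.3, max_doc=0.75)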
#### Redundancy
    df['Redundancy'] = mts.Redundancy(df.cleaned_data, n=10)
Parameters:
- input_data: this function requires tokenized documents.
- n: the n-gram size to use. The default is 10.
#### Specificity
    df['Specificity'] = mts.Specificity(df.text)
Parameters:
- input_data: this function requires untokenized documents.
#### Relative_prevalence
    df['Relative_prevalence'] = mts.Relative_prevalence(df.text)
Parameters:
- input_data: this function requires untokenized documents.
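Putting the steps together, a minimal end-to-end sketch (the folder path and parameter values are placeholders):

    import MoreThanSentiments as mts

    df = mts.read_txt_files(PATH="D:/YourDataFolder")
    df['sent_tok'] = df.text.apply(mts.sent_tok)
    df['cleaned_data'] = df.text.apply(mts.clean_data, args=(True, True, False, True, False))
    df['Boilerplate'] = mts.Boilerplate(df.sent_tok, n=4, min_doc=5)
    df['Redundancy'] = mts.Redundancy(df.cleaned_data, n=10)
    df['Specificity'] = mts.Specificity(df.text)
    df['Relative_prevalence'] = mts.Relative_prevalence(df.text)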
For the full example code, see:
- [Script](https://github.com/jinhangjiang/morethansentiments/blob/main/tests/test_code.py)
- [Jupyter Notebook](https://github.com/jinhangjiang/morethansentiments/blob/main/Boilerplate.ipynb)
# CHANGELOG
## Version 0.3.0, 2025-01-31
- Fixed the parameter misplacement issue in Redundancy.
- Fully upgraded the algorithm and refactored the code base, yielding a 40-50% speed boost on large datasets.
## Version 0.2.1, 2022-12-22
- Fixed the counting bug in Specificity
- Added max_doc parameter to Boilerplate
## Version 0.2.0, 2022-10-02
- Added the "get_ngram" feature to the Boilerplate function
- Added percentage as an option for "min_doc" in Boilerplate: when the given value is between 0 and 1, it is automatically treated as a percentage of the document count
## Version 0.1.3, 2022-06-10
- Updated the usage guide
- Minor fix to the script
## Version 0.1.2, 2022-05-08
- Initial release.