[![License](https://img.shields.io/badge/License-BSD_3--Clause-green.svg)](https://opensource.org/licenses/BSD-3-Clause)
[![PyPI](https://img.shields.io/pypi/v/morethansentiments)](https://pypi.org/project/morethansentiments/)
[![Code Ocean](https://codeocean.com/codeocean-assets/badge/open-in-code-ocean.svg)](https://codeocean.com/capsule/7670045/tree)
[![Downloads](https://pepy.tech/badge/morethansentiments)](https://pepy.tech/project/morethansentiments)
# MoreThanSentiments
Besides sentiment scores, this Python package offers several ways of quantifying a text corpus, based on measures proposed in published research. Currently, we support the calculation of the following measures:
- Boilerplate (Lang and Stice-Lawrence, 2015)
- Redundancy (Cazier and Pfeiffer, 2015)
- Specificity (Hope et al., 2016)
- Relative_prevalence (Blankespoor, 2016)
A Medium blog post introducing the package is available here: [MoreThanSentiments: A Python Library for Text Quantification](https://towardsdatascience.com/morethansentiments-a-python-library-for-text-quantification-e57ff9d51cd5)
## Citation
If this package was helpful in your work, please cite it as:
- Jiang, Jinhang, and Karthik Srinivasan. "MoreThanSentiments: A text analysis package." Software Impacts 15 (2023): 100456. https://doi.org/10.1016/J.SIMPA.2022.100456
## Installation
The easiest way to install the toolbox is via pip (pip3 in some
distributions):
    pip install MoreThanSentiments
## Usage
#### Import the Package
    import MoreThanSentiments as mts
#### Read data from txt files
    my_dir_path = "D:/YourDataFolder"
    df = mts.read_txt_files(PATH=my_dir_path)
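If your documents are already in memory rather than in txt files, you can build the DataFrame yourself; the only assumption, matching the examples below, is a `text` column with one document per row (a minimal sketch):

    import pandas as pd

    docs = ["First document ...", "Second document ..."]
    df = pd.DataFrame({"text": docs})  # column name 'text' matches the usage below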
#### Sentence Token
    df['sent_tok'] = df.text.apply(mts.sent_tok)
#### Clean Data
If you want to clean at the sentence level:

    # clean each tokenized sentence within each document
    df['cleaned_data'] = [
        [mts.clean_data(x,
                        lower=True,
                        punctuations=True,
                        number=False,
                        unicode=True,
                        stop_words=False) for x in sents]
        for sents in df['sent_tok']
    ]
If you want to clean at the document level:

    df['cleaned_data'] = df.text.apply(mts.clean_data, args=(True, True, False, True, False))
For the data cleaning function, we offer the following options (these are also the positional `args` above, in order; a keyword-argument sketch follows this list):
- lower: convert all words to lowercase
- punctuations: remove all punctuation from the corpus
- number: remove all digits from the corpus
- unicode: remove all unicode characters from the corpus
- stop_words: remove stopwords from the corpus
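If you prefer explicit keywords over positional `args`, the document-level call can be written as below (a sketch reusing the flag names from the sentence-level example):

    df['cleaned_data'] = df.text.apply(
        lambda doc: mts.clean_data(doc,
                                   lower=True,
                                   punctuations=True,
                                   number=False,
                                   unicode=True,
                                   stop_words=False)
    )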
#### Boilerplate
    df['Boilerplate'] = mts.Boilerplate(df.sent_tok, n=4, min_doc=5, get_ngram=False)
Parameters:
- input_data: this function requires tokenized documents.
- n: the n-gram size to use. The default is 4.
- min_doc: when building the ngram list, ignore the ngrams whose document frequency is strictly lower than the given threshold. The default is 5 documents; a threshold around 30% of the number of documents is recommended. A value between 0 and 1 is treated as a percentage.
- get_ngram: if set to True, the function returns a DataFrame with all the ngrams and their corresponding frequencies, and the "min_doc" parameter is ignored (see the sketch after this list).
- max_doc: when building the ngram list, ignore the ngrams whose document frequency is strictly higher than the given threshold. The default is 75% of the documents. It can be a percentage or an integer.
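For example, a workflow that first inspects the shared n-grams and then scores with percentage thresholds might look like the sketch below (the threshold values are illustrative, not recommendations):

    # returns a DataFrame of ngrams and their frequencies; min_doc is ignored
    ngrams = mts.Boilerplate(df.sent_tok, n=4, get_ngram=True)

    # values between 0 and 1 are treated as percentages of the document count
    df['Boilerplate'] = mts.Boilerplate(df.sent_tok, n=4, min_doc=0.3, max_doc=0.75)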
#### Redundancy
    df['Redundancy'] = mts.Redundancy(df.cleaned_data, n=10)
Parameters:
- input_data: this function requires tokenized documents.
- n: the n-gram size to use. The default is 10.
#### Specificity
    df['Specificity'] = mts.Specificity(df.text)
Parameters:
- input_data: this function requires untokenized documents.
#### Relative_prevalence
    df['Relative_prevalence'] = mts.Relative_prevalence(df.text)
Parameters:
- input_data: this function requires untokenized documents.
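Putting the steps together, a minimal end-to-end sketch (the folder path and parameter values are placeholders):

    import MoreThanSentiments as mts

    df = mts.read_txt_files(PATH="D:/YourDataFolder")
    df['sent_tok'] = df.text.apply(mts.sent_tok)
    df['cleaned_data'] = df.text.apply(mts.clean_data, args=(True, True, False, True, False))
    df['Boilerplate'] = mts.Boilerplate(df.sent_tok, n=4, min_doc=5)
    df['Redundancy'] = mts.Redundancy(df.cleaned_data, n=10)
    df['Specificity'] = mts.Specificity(df.text)
    df['Relative_prevalence'] = mts.Relative_prevalence(df.text)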
For the full example code, see:
- [Script](https://github.com/jinhangjiang/morethansentiments/blob/main/tests/test_code.py)
- [Jupyter Notebook](https://github.com/jinhangjiang/morethansentiments/blob/main/Boilerplate.ipynb)
# CHANGELOG
## Version 0.3.0, 2025-01-31
- Fixed the parameter misplacement issue in Redundancy.
- Fully upgraded the algorithm and refactored the code base, yielding a 40-50% speed boost on large datasets.
## Version 0.2.1, 2022-12-22
- Fixed the counting bug in Specificity
- Added max_doc parameter to Boilerplate
## Version 0.2.0, 2022-10-02
- Added the "get_ngram" feature to the Boilerplate function
- Added percentage as an option for "min_doc" in Boilerplate: when the given value is between 0 and 1, it is automatically treated as a percentage of the document count
## Version 0.1.3, 2022-06-10
- Updated the usage guide
- Minor fix to the script
## Version 0.1.2, 2022-05-08
- Initial release.