[![DOI](https://zenodo.org/badge/490040941.svg)](https://zenodo.org/badge/latestdoi/490040941)
[![License](https://img.shields.io/badge/License-BSD_3--Clause-green.svg)](https://opensource.org/licenses/BSD-3-Clause)
[![PyPI](https://img.shields.io/pypi/v/morethansentiments)](https://pypi.org/project/morethansentiments/)
# MoreThanSentiments
Besides sentiment scores, this Python package offers various ways of quantifying a text corpus, based on measures drawn from several published studies. Currently, we support the calculation of the following measures:
- Boilerplate (Lang and Stice-Lawrence, 2015)
- Redundancy (Cazier and Pfeiffer, 2015)
- Specificity (Hope et al., 2016)
- Relative_prevalence (Blankespoor, 2016)
A Medium blog post is available here: [MoreThanSentiments: A Python Library for Text Quantification](https://towardsdatascience.com/morethansentiments-a-python-library-for-text-quantification-e57ff9d51cd5)
## Citation
If this package was helpful in your work, feel free to cite it as:
- Jiang, J., & Srinivasan, K. (2022). MoreThanSentiments: A text analysis package. Software Impacts, 100456. https://doi.org/10.1016/J.SIMPA.2022.100456
## Installation
The easiest way to install the toolbox is via pip (pip3 in some distributions):

    pip install MoreThanSentiments
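To verify that the package is importable after installation:

    python -c "import MoreThanSentiments as mts; print(mts.__name__)"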
## Usage
#### Import the Package

    import MoreThanSentiments as mts
#### Read data from txt files

    my_dir_path = "D:/YourDataFolder"
    df = mts.read_txt_files(PATH = my_dir_path)
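`read_txt_files` returns a DataFrame with one row per `.txt` file; the steps below assume it has a `text` column. A quick sanity check (assuming a standard pandas DataFrame):

    print(df.shape)               # (number of documents, number of columns)
    print(df.columns)             # should include 'text', used by the steps below
    print(df.text.iloc[0][:200])  # peek at the first document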
#### Sentence Tokenization

    df['sent_tok'] = df.text.apply(mts.sent_tok)
#### Clean Data

If you want to clean at the sentence level:

    df['cleaned_data'] = [
        [mts.clean_data(sent,
                        lower = True,
                        punctuations = True,
                        number = False,
                        unicode = True,
                        stop_words = False) for sent in sents]
        for sents in df['sent_tok']
    ]
If you want to clean at the document level:

    # positional args map to (lower, punctuations, number, unicode, stop_words)
    df['cleaned_data'] = df.text.apply(mts.clean_data, args=(True, True, False, True, False))
For the data cleaning function, we offer the following options (an explicit keyword-argument version is sketched after this list):
- lower: convert all words to lowercase
- punctuations: remove all punctuation from the corpus
- number: remove all digits from the corpus
- unicode: remove Unicode characters (e.g., non-ASCII symbols) from the corpus
- stop_words: remove stopwords from the corpus
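Since pandas' `apply` forwards keyword arguments to the applied function, the document-level call can also be written with explicit flags. This is a minimal equivalent of the positional-args version above:

    df['cleaned_data'] = df.text.apply(
        mts.clean_data,
        lower = True,         # lowercase all words
        punctuations = True,  # strip punctuation
        number = False,       # keep digits
        unicode = True,       # strip Unicode characters
        stop_words = False,   # keep stopwords
    )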
#### Boilerplate

    df['Boilerplate'] = mts.Boilerplate(df.sent_tok, n = 4, min_doc = 5, get_ngram = False)
Parameters:
- input_data: this function requires tokenized documents.
- n: the n-gram size to use. The default is 4.
- min_doc: when building the ngram list, ignore ngrams with a document frequency strictly lower than the given threshold. The default is 5 documents. A value around 30% of the number of documents is recommended (see the sketch after this list).
- get_ngram: if set to True, the function returns a dataframe with all the ngrams and their corresponding frequencies, and the min_doc parameter has no effect.
- max_doc: when building the ngram list, ignore ngrams with a document frequency strictly higher than the given threshold. The default is 75% of the documents. It can be a percentage or an integer.
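A short sketch combining these parameters; fractional thresholds are treated as percentages of the corpus, per the parameter descriptions above and the changelog below:

    # inspect the ngram/document-frequency table first (min_doc has no effect here)
    ngrams = mts.Boilerplate(df.sent_tok, n = 4, get_ngram = True)

    # then score, keeping ngrams that appear in at least 30% and at most 75% of documents
    df['Boilerplate'] = mts.Boilerplate(df.sent_tok, n = 4, min_doc = 0.3, max_doc = 0.75)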
#### Redundancy

    df['Redundancy'] = mts.Redundancy(df.cleaned_data, n = 10)

Parameters:
- input_data: this function requires tokenized documents.
- n: the n-gram size to use. The default is 10.
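For intuition, redundancy captures within-document repetition of long n-grams. A minimal illustrative sketch of that idea (not necessarily the package's exact implementation):

    from collections import Counter

    def ngram_redundancy(tokens, n = 10):
        """Share of n-grams occurring more than once in a document (illustrative only)."""
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if not grams:
            return 0.0
        counts = Counter(grams)
        repeated = sum(c for c in counts.values() if c > 1)
        return repeated / len(grams)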
#### Specificity

    df['Specificity'] = mts.Specificity(df.text)

Parameters:
- input_data: this function requires untokenized documents.
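Specificity, following Hope et al. (2016), counts specific items such as named entities and quantities relative to document length. A rough sketch of that idea using spaCy for entity recognition; this is an illustrative assumption, not the package's internal code:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    def specificity_sketch(text):
        """Entity mentions per word -- illustrative approximation only."""
        doc = nlp(text)
        n_words = sum(1 for tok in doc if not tok.is_punct and not tok.is_space)
        return len(doc.ents) / n_words if n_words else 0.0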
#### Relative_prevalence

    df['Relative_prevalence'] = mts.Relative_prevalence(df.text)

Parameters:
- input_data: this function requires untokenized documents.
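Relative_prevalence, in the spirit of Blankespoor (2016), gauges the prevalence of hard (quantitative) information in a document. A minimal illustrative sketch based on the share of numeric tokens; the package's exact definition may differ:

    def relative_prevalence_sketch(text):
        """Share of tokens containing digits -- illustrative approximation only."""
        tokens = text.split()
        if not tokens:
            return 0.0
        numeric = sum(1 for tok in tokens if any(ch.isdigit() for ch in tok))
        return numeric / len(tokens)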
For the full code scripts, see:
- [Script](https://github.com/jinhangjiang/morethansentiments/blob/main/tests/test_code.py)
- [Jupyter Notebook](https://github.com/jinhangjiang/morethansentiments/blob/main/Boilerplate.ipynb)
# CHANGELOG
## Version 0.2.1, 2022-12-22
- Fixed the counting bug in Specificity
- Added max_doc parameter to Boilerplate
## Version 0.2.0, 2022-10-02
- Added the "get_ngram" feature to the Boilerplate function
- Added percentage support for "min_doc" in Boilerplate: when the given value is between 0 and 1, it is automatically treated as a percentage of documents
## Version 0.1.3, 2022-06-10
- Updated the usage guide
- Minor fix to the script
## Version 0.1.2, 2022-05-08
- Initial release.