[](https://pypi.python.org/pypi/arabica)
[]([https://opensource.org/licenses/MIT](https://opensource.org/license/apache-2-0/))
# Arabica
**Python package for text mining of time-series data**
Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include social media conversations, product reviews, research metadata, central bank communication, and newspaper headlines. Arabica makes exploratory analysis of these datasets simple by providing:
* **Descriptive n-gram analysis**: n-gram frequencies
* **Time-series n-gram analysis**: n-gram frequencies over a period
* **Text visualization**: n-gram heatmap, line plot, word cloud
* **Sentiment analysis**: VADER sentiment classifier
* **Financial sentiment analysis**: with FinVADER
* **Structural break identification**: Jenks Optimization Method
It automatically removes punctuation from the input text. It can also apply all, or a selected combination, of the following cleaning operations:
* Remove digits from the text
* Remove the standard list(s) of stop words
* Remove an additional, user-defined list of stop words
Arabica works with **texts** in languages written in the Latin alphabet, uses `cleantext` for punctuation cleaning, and supports stop-word removal for the languages in the `NLTK` corpus of stopwords.
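As a quick, optional check (a minimal sketch assuming `nltk` is installed and its stop-word corpus downloaded), you can list the languages covered by the NLTK corpus; how these names should be spelled in the `stopwords` argument is described in the arabica documentation:
``` python
import nltk

# One-time download of the stop-word corpus (no-op if already present).
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords

# Languages covered by the NLTK stop-word corpus, e.g. 'english', 'german', ...
print(stopwords.fileids())
```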
It reads dates in both date and datetime formats:
* **US-style**: *MM/DD/YYYY* (e.g., 2013-12-31, Feb-09-2009, 2013-12-31 11:46:17)
* **European-style**: *DD/MM/YYYY* (e.g., 2013-31-12, 09-Feb-2009, 2013-31-12 11:46:17)
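As a purely illustrative example (column names and values are hypothetical), the two conventions map to the `date_format` argument used by the functions below:
``` python
import pandas as pd

# Hypothetical input: one text column and one date column.
reviews_us = pd.DataFrame({
    "text": ["Great espresso, will buy again", "Too bitter for my taste"],
    "time": ["02/09/2021", "12/31/2021"],   # MM/DD/YYYY -> date_format='us'
})

reviews_eur = pd.DataFrame({
    "text": ["Great espresso, will buy again", "Too bitter for my taste"],
    "time": ["09/02/2021", "31/12/2021"],   # DD/MM/YYYY -> date_format='eur'
})
```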
## Installation
Arabica requires **Python 3.8 - 3.10** and the following packages: [NLTK](http://www.nltk.org) for stop-word removal,
[cleantext](https://pypi.org/project/cleantext/#description) for text cleaning, [wordcloud](https://pypi.org/project/wordcloud) for word-cloud visualization,
[plotnine](https://pypi.org/project/plotnine) for heatmaps and line graphs, [matplotlib](https://pypi.org/project/matplotlib/) for word clouds and graphical operations,
[vaderSentiment](https://pypi.org/project/vaderSentiment) for sentiment analysis, [finvader](https://pypi.org/project/finvader) for financial sentiment analysis,
and [jenkspy](https://pypi.org/project/jenkspy/) for breakpoint identification.
To install using pip, use:
`pip install arabica`
## Usage
* **Import the library**:
``` python
from arabica import arabica_freq
from arabica import cappuccino
from arabica import coffee_break
```
* **Choose a method:**
**arabica_freq** applies a selected set of cleaning operations (lower-casing and removal of numbers, common stop words, and additional stop words) and returns a dataframe with aggregated unigram, bigram, and trigram frequencies over a period.
``` python
def arabica_freq(text: str,                 # Text
                 time: str,                 # Time
                 date_format: str,          # Date format: 'eur' - European, 'us' - American
                 time_freq: str,            # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                 max_words: int,            # Maximum of most frequent n-grams displayed for each period
                 stopwords: list,           # Languages for stop words
                 stopwords_ext: list,       # Languages for extended stop words list
                 skip: list,                # Remove additional stop words
                 numbers: bool = False,     # Remove numbers
                 lower_case: bool = False   # Lowercase text
)
```
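A minimal usage sketch based on the signature above. The data is hypothetical, the text and time arguments are passed as pandas columns, and passing `None` for unused stop-word lists is an assumption worth checking against the documentation:
``` python
import pandas as pd
from arabica import arabica_freq

# Hypothetical data: newspaper headlines with European-style dates.
df = pd.DataFrame({
    "headline": ["Central bank raises interest rates",
                 "Interest rates unchanged as inflation slows"],
    "date": ["03/01/2023", "04/05/2023"],
})

# Monthly n-gram frequencies; English stop words removed, digits dropped,
# text lower-cased. Returns a dataframe of unigram, bigram, and trigram counts.
freq = arabica_freq(text=df["headline"],
                    time=df["date"],
                    date_format="eur",
                    time_freq="M",
                    max_words=2,
                    stopwords=["english"],
                    stopwords_ext=None,   # assumption: None when not used
                    skip=None,            # assumption: None when not used
                    numbers=True,
                    lower_case=True)
print(freq)
```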
**cappuccino** applies the same cleaning operations (lower-casing and removal of numbers, common stop words, and additional stop words) and provides plots for descriptive (word cloud) and time-series (heatmap, line plot) visualization.
``` python
def cappuccino(text: str,                 # Text
               time: str,                 # Time
               date_format: str,          # Date format: 'eur' - European, 'us' - American
               plot: str,                 # Chart type: 'wordcloud'/'heatmap'/'line'
               ngram: int,                # N-gram size, 1 = unigram, 2 = bigram, 3 = trigram
               time_freq: str,            # Aggregation period: 'Y'/'M', if no aggregation: 'ungroup'
               max_words: int,            # Maximum of most frequent n-grams displayed for each period
               stopwords: list,           # Languages for stop words
               stopwords_ext: list,       # Languages for extended stop words list
               skip: list,                # Remove additional stop words
               numbers: bool = False,     # Remove numbers
               lower_case: bool = False   # Lowercase text
)
```
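A minimal sketch along the same lines (hypothetical data; argument values are illustrative), drawing a monthly bigram heatmap:
``` python
import pandas as pd
from arabica import cappuccino

# Hypothetical data: product reviews with US-style dates.
df = pd.DataFrame({
    "review": ["Fast delivery and great product",
               "Poor packaging and slow delivery"],
    "date": ["01/15/2023", "02/20/2023"],
})

# Monthly bigram heatmap; English stop words removed, digits dropped.
cappuccino(text=df["review"],
           time=df["date"],
           date_format="us",
           plot="heatmap",
           ngram=2,
           time_freq="M",
           max_words=5,
           stopwords=["english"],
           stopwords_ext=None,   # assumption: None when not used
           skip=None,            # assumption: None when not used
           numbers=True,
           lower_case=True)
```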
**coffee_break** provides sentiment analysis and breakpoint identification in an aggregated time series of sentiment. The implemented models are:
* [VADER](https://ojs.aaai.org/index.php/ICWSM/article/view/14550), a lexicon- and rule-based sentiment classifier attuned specifically to the general language used in social media
* [FinVADER](https://pypi.org/project/finvader/), which improves VADER's classification accuracy on financial texts by adding two financial lexicons

Breakpoints in the time series are identified with the **Fisher-Jenks algorithm** (Jenks, 1977, Optimal Data Classification for Choropleth Maps).
``` python
def coffee_break(text: str,                 # Text
                 time: str,                 # Time
                 date_format: str,          # Date format: 'eur' - European, 'us' - American
                 model: str,                # Sentiment classifier: 'vader' - general language, 'finvader' - financial text
                 skip: list,                # Remove additional stop words
                 preprocess: bool = False,  # Clean data from numbers and punctuation
                 time_freq: str,            # Aggregation period: 'Y'/'M'
                 n_breaks: int              # Number of breakpoints: min. 2
)
```
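A minimal sketch for sentiment analysis with breakpoint identification (hypothetical data; VADER for general language and the minimum of two breakpoints):
``` python
import pandas as pd
from arabica import coffee_break

# Hypothetical data: short social-media posts with European-style dates.
df = pd.DataFrame({
    "post": ["Love the new update", "The app keeps crashing",
             "Support was quick and helpful", "Still crashing after the fix"],
    "date": ["05/01/2023", "12/02/2023", "20/03/2023", "28/04/2023"],
})

# Monthly VADER sentiment with two structural breakpoints identified
# by the Fisher-Jenks algorithm.
coffee_break(text=df["post"],
             time=df["date"],
             date_format="eur",
             model="vader",
             skip=None,            # assumption: None when not used
             preprocess=True,
             time_freq="M",
             n_breaks=2)
```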
## Documentation, examples and tutorials
* Read the [documentation](https://arabica.readthedocs.io/en/latest/index.html)
For more coding examples, see these tutorials:
**General use:**
* Sentiment Analysis and Structural Breaks in Time-Series Text Data [here](https://towardsdatascience.com/sentiment-analysis-and-structural-breaks-in-time-series-text-data-8109c712ca2?sk=ce5c69171ba026fee631d1b23520d6e3)
* Visualization Module in Arabica Speeds Up Text Data Exploration [here](https://towardsdatascience.com/visualization-module-in-arabica-speeds-up-text-data-exploration-47114ad646ce?sk=e54bc7d170ea3ecb76fb45dc869d4a44)
* Text as Time Series: Arabica 1.0 Brings New Features for Exploratory Text Data Analysis [here](https://towardsdatascience.com/text-as-time-series-arabica-1-0-brings-new-features-for-exploratory-text-data-analysis-88eaabb84deb?sk=229ec0602d0b8514f25bce501ed9ecb9)
**Applications:**
* **Business Intelligence:** Customer Satisfaction Measurement with N-gram and Sentiment Analysis [here](https://towardsdatascience.com/customer-satisfaction-measurement-with-n-gram-and-sentiment-analysis-547e291c13a6?sk=62f9decb619744c96c49735ff09653c3)
* **Research meta-data analysis:** Research Article Meta-data Description Made Quick and Easy [here](https://pub.towardsai.net/research-article-meta-data-description-made-quick-and-easy-57754e54b550?sk=82477c74a159855f211b09b53026dedc)
* **Media coverage text mining**
* **Social media analysis**
---
💬 For questions, issues, bug reports, and suggestions, please use the [issue tracker](https://github.com/PetrKorab/arabica/issues).
## Citation
Using **arabica** in a paper or thesis? Please cite this paper:
```bibtex
@article{Koráb:2024,
  author  = {Koráb, P. and Poměnková, J.},
  title   = {Arabica: A Python package for exploratory analysis of text data},
  journal = {Journal of Open Source Software},
  volume  = {9},
  number  = {97},
  pages   = {6186},
  year    = {2024},
  doi     = {10.21105/joss.06186},
}
```