textmining-module


-   **Name**: textmining-module
-   **Version**: 0.1.3
-   **Home page**: https://github.com/knowusuboaky/textmining_module
-   **Summary**: A Python Module for Comprehensive Text Mining, including Keyword Extraction and Text Analysis.
-   **Upload time**: 2024-02-19 13:21:27
-   **Author**: Kwadwo Daddy Nyame Owusu - Boakye
-   **Requires Python**: >=3.6
-   **Keywords**: text mining, clustering, correlation, similarity, keyword extraction, text analysis, scoring, bining, data processing, python

# Text Mining Module Library

## Overview

The `textmining_module` is a comprehensive Python library designed to
facilitate text mining, keyword extraction, and text analysis tasks. It
provides a suite of tools for preprocessing textual data, extracting
meaningful insights, and transforming texts into formats suitable for
various analysis and machine learning models.

## Features

-   **Text Preprocessing**: Simplify the preparation of text data with
    functions for cleaning, normalizing, and preprocessing textual
    content.
-   **Keyword Extraction**: Utilize built-in functionalities to extract
    significant keywords and phrases from large volumes of text.
-   **Text Analysis**: Leverage tools to analyze and understand the
    content, structure, and composition of your text data.

## Developer Manual for KeywordsExtractor

### Functions Map

<img src="https://github.com/knowusuboaky/textmining_module/blob/main/README_files/figure-markdown/mermaid-figure-1.png?raw=true" width="1526" height="459" alt="Optional Alt Text">


### User Manual

#### Installation

Install the package from PyPI, pinning the version documented here:

``` bash

pip install textmining_module==0.1.3
```

#### Load Package

``` python

from textmining_module import KeywordsExtractor
```

#### Base Operations

##### Extract Keywords From Dataset

``` python

keywords_df =  KeywordsExtractor(data, 
                                 text_column= 'text_column', 
                                 method= 'yake', 
                                 n=3, 
                                 stopword_language= 'english') 
```

`KeywordsExtractor` extracts keywords from text stored in a `pandas`
DataFrame. Its arguments are:

-   `data` : (`pandas` DataFrame) The dataset containing the text from
    which you want to extract keywords. It must contain the column
    named by `text_column`.
-   `text_column` : (str) The name of the column in `data` that holds
    the text used for keyword extraction.
-   `method` : (str) The keyword extraction method. The supported
    methods are:
    -   `frequency` : Extracts keywords based on word frequency,
        excluding common stopwords.
    -   `yake` : Uses YAKE (Yet Another Keyword Extractor), an
        unsupervised method that considers word frequency and
        position.
    -   `tf-idf` : Uses Term Frequency-Inverse Document Frequency,
        which highlights words that are particularly indicative of
        the text's content.
    -   `pos` : Relies on part-of-speech tagging, typically selecting
        nouns as keywords.
    -   `ner` : Uses Named Entity Recognition to extract entities
        (e.g., people, organizations) as keywords.
-   `n` : (int) The number of keywords to extract from each piece of
    text.
-   `stopword_language` : (str) The language of the stopword list used
    for filtering during keyword extraction. It matters for the methods
    that remove common words to focus on more meaningful content.
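
To make the `frequency` method concrete, here is a minimal sketch of frequency-based extraction for a single text. It is not the library's implementation: the function name `frequency_keywords`, the tiny `STOPWORDS` set and the tokenizing regular expression are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the library's list).
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "in", "was", "but"}


def frequency_keywords(text, n=3, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens in text."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    # most_common orders equal counts by first occurrence in the text.
    return [word for word, _ in counts.most_common(n)]


print(frequency_keywords("The claim was slow, and the claim agent was slow to call.", n=2))
```

Applied row by row to a DataFrame column, a function like this would yield `n` keywords per text, which is the shape of result the `n` argument describes.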

## Developer Manual for TextMiner

### Functions Map

<img src="https://github.com/knowusuboaky/textmining_module/blob/main/README_files/figure-markdown/mermaid-figure-2.png?raw=true" width="730" height="1615" alt="Optional Alt Text">



### User Manual

#### Installation

Install the package from PyPI, pinning the version documented here:

``` bash

pip install textmining_module==0.1.3
```

#### Load Package

``` python

from textmining_module import TextMiner
```

#### Base Operations

##### Prepare Text Dataset

``` python

Cleaner = TextMiner(data, comment_variable='Text_column', target_variable='Target_column',
                       truncation_words=None, truncation_mode='right',
                       preprocessing_only=True, verbose=True)

data['Cleaned_text_column'] = Cleaner.reqCleanedData()['Text_column']
```

`Text_column` may contain a translation appended to each message that we
want to remove. When `truncation_words` is set, `TextMiner` returns
preprocessed messages that are right-truncated at the truncation words
we identified.

-   Required 1st argument : (`pandas` dataframe) the dataset;
-   `comment_variable` : (str) name of the comment variable in the
    `pandas` dataframe;
-   `target_variable` : (str) name of the target variable in the
    `pandas` dataframe;
-   `truncation_words` : (str list) words at which a message is split
    and truncated to the left or right, e.g. when a French copy comes
    before or after an English message;
-   `truncation_mode` : (str) {'right' : remove the right-hand side of
    the message at the truncation word, 'left' : remove the left-hand
    side of the message at the truncation word};
-   `preprocessing_only` : (bool) if True, only clean (opt.), format,
    stratify (opt.) and truncate (opt.) the given dataset;
-   `verbose` : (bool) if True, show a progress bar.
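
The effect of `truncation_words` and `truncation_mode` can be sketched as follows. This is an illustration under assumptions, not the library's code: `truncate_message` is a hypothetical helper, and whether `TextMiner` also drops the truncation word itself is not stated in this manual (the sketch drops it).

```python
def truncate_message(message, truncation_words=None, mode="right"):
    """Cut a message at the first truncation word that occurs in it."""
    if mode not in ("right", "left"):
        raise ValueError("mode must be 'right' or 'left'")
    for word in truncation_words or []:
        index = message.find(word)
        if index == -1:
            continue  # this truncation word does not occur; try the next one
        if mode == "right":
            # Keep what precedes the truncation word.
            return message[:index].strip()
        # mode == "left": keep what follows the truncation word.
        return message[index + len(word):].strip()
    return message.strip()


message = "Great service. [FR] Excellent service."
print(truncate_message(message, ["[FR]"], mode="right"))
print(truncate_message(message, ["[FR]"], mode="left"))
```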

##### Fetch Association

Let's review how to use `TextMiner` to fetch processed keywords that are
associated with ratings. The most challenging part of most unsupervised
algorithms is finding the right hyperparameters. For `TextMiner`, pay
attention to `fpg_min_support`, `n` and `top`. Keyword extraction may
become intractable, with exponentially growing running time, if too many
n-grams are fetched at a low support. A low `fpg_min_support` means that
we tolerate keywords that appear in few observations. A low `n` with a
high `top` yields grams that are likely to be common to many messages,
which increases running time because there are too many combinations to
check. A high `n` with a low `top`, on the other hand, yields grams that
are too specific.

-   `strata_variable` : (str) name of the strata variable in the `pandas`
    dataframe, for a stratified analysis, e.g. a breakdown by LoB;
-   `req_len_complexity` : (bool) if True, include message length
    quartiles in the analysis as a new qualitative attribute;
-   `removeOutersection` : (bool) if True, exclude keywords that contain
    other fetched keywords;
-   `search_mode` : (str) {'macro' : (for each stratum) concatenate all
    rows into one chunk before extracting keywords, 'micro' : extract
    keywords row-wise};
-   `n` : (int) maximal number of grams (words excluding `stop_words`)
    that can form a `keyword`;
-   `top` : (int) how many n-grams to fetch;
-   `stop_words` : (str list) words to disregard when generating
    n-grams;
-   `fpg_min_support` : (float) minimal support for FP-Growth; try a
    higher value if FP-Growth takes too long;
-   `keep_strongest_association` : (bool) if True, filter the one-hot
    encoded data to keep the highest-supported bits before fetching
    associations.
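
To illustrate how `n`, `top` and `stop_words` interact, the sketch below enumerates candidate grams of up to `n` words and keeps the `top` most frequent ones. It is a simplified stand-in: `top_ngrams` is a hypothetical helper that ranks by raw frequency, whereas `TextMiner` relies on YAKE, which scores candidates differently.

```python
from collections import Counter


def top_ngrams(text, n=3, top=2, stop_words=()):
    """Return the `top` most frequent grams made of 1..n consecutive kept words."""
    stop = {w.lower() for w in stop_words}
    # Drop stop words first, so a gram only counts the remaining words.
    words = [w for w in text.lower().split() if w not in stop]
    counts = Counter()
    for size in range(1, n + 1):
        for start in range(len(words) - size + 1):
            counts[" ".join(words[start:start + size])] += 1
    return [gram for gram, _ in counts.most_common(top)]


print(top_ngrams("the claim was slow", n=2, top=10, stop_words=["the", "was"]))
```

The number of candidates grows with both the text length and `n`, which is one reason a permissive `top` combined with a low support threshold becomes expensive.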

``` python

path_to_stopwords = "./stop_keywords.txt" # optional
stopwords = open(path_to_stopwords, 'r').read().split('\n')

text_modeling = TextMiner(data, 
                 comment_variable='Cleaned_text_column', target_variable='Target_column', 
                 strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data
                 search_mode='micro', n=3, top=1, stop_words=None, truncation_words=None, truncation_mode='right', # YAKE
                 fpg_min_support=1E-3, keep_strongest_association=False, removeOutersection=False, # FPG
                 req_len_complexity=False, req_importance_score=False, # Random Forest
                 verbose=True, preprocessing_only=False) # class use
```

We can view the keywords associated with each (stratum category, target
category) pair. `TextMiner` may keep some rare keywords that have low
support but a high confidence score.

``` python

text_modeling.reqKeywordsByTarget()['LoB_column_category']['Target_column_category']
```
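
Support and confidence are used here in the usual association-rule sense. As a minimal sketch (the helper `support_and_confidence` and the toy rows are assumptions, not part of the library), the support of a keyword/target pair is the share of all rows that contain both, and its confidence is the share of rows containing the keyword that also carry the target:

```python
def support_and_confidence(rows, keyword, target):
    """rows is a list of (set_of_keywords, target_label) pairs."""
    total = len(rows)
    labels_with_keyword = [label for keywords, label in rows if keyword in keywords]
    both = sum(1 for label in labels_with_keyword if label == target)
    support = both / total if total else 0.0
    confidence = both / len(labels_with_keyword) if labels_with_keyword else 0.0
    return support, confidence


rows = [
    ({"late payment"}, "low"),
    ({"friendly agent"}, "high"),
    ({"friendly agent"}, "high"),
    ({"friendly agent", "quick"}, "high"),
]
# A rare keyword: low support, but every occurrence has the same target.
print(support_and_confidence(rows, "late payment", "low"))  # (0.25, 1.0)
```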

We can also request the best target for each keyword based on support
**only**.

``` python

text_modeling.bestbucket_by_s['LoB_column_category']
```

We can also request the strata of the data.

``` python

text_modeling.reqUniqueStratas()
```

We can also request the targets of the data.

``` python

text_modeling.reqUniqueTargets()
```

We can also request the keywords extracted from the data for each
stratum.

``` python

text_modeling.reqYAKEKeywords()['LoB_column_category']
```

##### Micro vs Macro

Our `text_modeling` object fetched keywords with a `micro` search, but it
could have used a `macro` search instead. In both cases, the objective
is to build a list of unique keywords and to flag, for every comment,
each keyword from that list that it contains. Let's now review the key
differences.

-   `micro` fetches `top` keywords (`n`-grams) row by row and adds the
    column `keywords_yake` to the internally managed data.
    -   faster for smaller data;
    -   better at fetching unique keywords in individual messages within
        smaller strata;
    -   tends to make clusters of highly variable size, but mostly
        considers minority data;
    -   needs a lower `top` argument as it applies per message (\<= 2).

``` python

path_to_stopwords = "./stop_keywords.txt" # optional
stopwords = open(path_to_stopwords, 'r').read().split('\n')

text_modeling_micro = TextMiner(data, 
                      comment_variable='Cleaned_text_column', target_variable='Target_column', 
                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data
                      search_mode='micro', n=3, top=1, stop_words=None, truncation_words=None, truncation_mode='right', # YAKE
                      fpg_min_support=1E-3, keep_strongest_association=False, removeOutersection=False, # FPG
                      req_len_complexity=False, req_importance_score=True) # Random Forest                      
```

OR

``` python

path_to_stopwords = "./stop_keywords.txt" # optional
stopwords = open(path_to_stopwords, 'r').read().split('\n')

text_modeling_micro = TextMiner(data, 
                      comment_variable='Cleaned_text_column', target_variable='Target_column', 
                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data
                      search_mode='micro', n=3, top=1, stop_words=stopwords, fpg_min_support=1E-3,
                      keep_strongest_association=False, removeOutersection=False) # no filter                    
```

-   `macro` fetches keywords per stratum and creates internally managed
    macro-keywords dataframe(s). It operates on text chunks.
    -   faster for bigger data;
    -   better at fetching keywords that are common to most messages
        within large strata;
    -   tends to make clusters of similar size but ignores minority data;
    -   needs a higher `top` argument as it applies to all messages
        (\>= 15);
    -   filters are recommended (use your judgment when reviewing the
        results).

``` python

path_to_stopwords = "./stop_keywords.txt" # optional
stopwords = open(path_to_stopwords, 'r').read().split('\n')

text_modeling_macro = TextMiner(data, 
                      comment_variable='Cleaned_text_column', target_variable='Target_column', 
                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data
                      search_mode='macro', n=3, top=1, stop_words=None, truncation_words=None, truncation_mode='right', # YAKE
                      fpg_min_support=1E-3, keep_strongest_association=False, removeOutersection=False, # FPG
                      req_len_complexity=False, req_importance_score=True) # Random Forest
```

OR

``` python

path_to_stopwords = "./stop_keywords.txt" # optional
stopwords = open(path_to_stopwords, 'r').read().split('\n')

text_modeling_macro = TextMiner(data, 
                      comment_variable='Cleaned_text_column', target_variable='Target_column', 
                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data
                      search_mode='macro', n=3, top=1, stop_words=stopwords, fpg_min_support=1E-3,
                      keep_strongest_association=False, removeOutersection=False) # no filter                    
```
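
The difference between the two search modes can be sketched with a toy extractor. Everything here is an assumption made for illustration: `top_words` is a plain word-frequency stand-in for YAKE, and `micro_keywords` / `macro_keywords` are hypothetical helpers that only mimic the row-wise versus per-stratum behaviour described above.

```python
from collections import Counter, defaultdict


def top_words(text, top):
    """Toy extractor: the `top` most frequent words of a text."""
    counts = Counter(text.lower().split())
    return [word for word, _ in counts.most_common(top)]


def micro_keywords(rows, top=1):
    """rows: list of (stratum, comment). One keyword list per comment."""
    return [(stratum, top_words(comment, top)) for stratum, comment in rows]


def macro_keywords(rows, top=1):
    """One keyword list per stratum, from all its comments joined in one chunk."""
    chunks = defaultdict(list)
    for stratum, comment in rows:
        chunks[stratum].append(comment)
    return {s: top_words(" ".join(c), top) for s, c in chunks.items()}


rows = [
    ("auto", "slow claim slow"),
    ("auto", "claim denied claim"),
    ("home", "roof leak roof"),
]
print(micro_keywords(rows, top=1))
print(macro_keywords(rows, top=1))
```

With `top=1`, the row-wise pass keeps `slow` for the first comment, while the per-stratum pass only keeps `claim` for the whole `auto` stratum. This mirrors the point above that `macro` tends to ignore minority data and needs a larger `top`.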

#### Advanced Operations

##### Similarity Matrix

`TextMiner` objects can compute a matrix where each element represents
the similarity score between a pair of documents, texts or keywords. It
uses the Jaccard similarity measure, calculating the intersection over
the union of the sets of words (or other tokens) for each pair.

``` python

text_modeling_similarity = TextMiner(data, 
                      comment_variable='Cleaned_text_column', target_variable='Target_column', 
                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data
                      search_mode='micro', n=3, top=1, stop_words=None, fpg_min_support=1E-3,
                      keep_strongest_association=False, removeOutersection=False) # no filter 

text_modeling_similarity.reqSimilarityMatrix()
```

We can also view the similarity matrix for a specific stratum category.

``` python

text_modeling_similarity.reqSimilarityMatrix()['LoB_column_category']
```
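
The Jaccard measure described in this section is the size of the intersection of two token sets divided by the size of their union. A minimal sketch follows; the helper names and the convention that two empty sets have similarity 1.0 are assumptions, and the structure `reqSimilarityMatrix()` actually returns is not shown here.

```python
def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


def similarity_matrix(token_sets):
    """Square matrix with the pairwise Jaccard similarity of every pair."""
    return [[jaccard(x, y) for y in token_sets] for x in token_sets]


docs = [{"slow", "claim"}, {"claim", "denied"}, {"roof"}]
for row in similarity_matrix(docs):
    print(row)
```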

##### Clusterize

`TextMiner` objects can cluster the given dataset using the fetched
keywords with the `clusterize` method. By default, it returns clusters
row- and column-wise. The `treshold` argument (note the spelling used in
the call below) is the distance tolerance, in (0, 1\], accepted when
merging clusters.

``` python

text_modeling_cluster = TextMiner(data, 
                      comment_variable='Cleaned_text_column', target_variable='Target_column', 
                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data
                      search_mode='micro', n=3, top=1, stop_words=None, fpg_min_support=1E-3,
                      keep_strongest_association=False, removeOutersection=False) # no filter

cluster_observations, cluster_keywords = text_modeling_cluster.clusterize(treshold=0.9)
```

When `clusterize` is used, it adds the row-wise clusters to the
internally managed data.

``` python

text_modeling_cluster.reqCleanedData()
```
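
How a distance tolerance drives cluster merging can be sketched with a simple single-linkage scheme over Jaccard distances. This manual does not document the algorithm behind `clusterize`, so treat the whole block as an assumption: `cluster_by_threshold` is a hypothetical helper, and it spells its argument `threshold`, unlike the library's `treshold`.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two keyword sets."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)


def cluster_by_threshold(keyword_sets, threshold=0.9):
    """Single linkage: rows whose distance is <= threshold share a cluster."""
    parent = list(range(len(keyword_sets)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(keyword_sets)):
        for j in range(i + 1, len(keyword_sets)):
            if jaccard_distance(keyword_sets[i], keyword_sets[j]) <= threshold:
                parent[find(j)] = find(i)

    # Relabel roots as consecutive cluster ids, in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(keyword_sets))]


rows = [{"slow", "claim"}, {"claim", "denied"}, {"roof", "leak"}]
print(cluster_by_threshold(rows, threshold=0.9))
print(cluster_by_threshold(rows, threshold=0.5))
```

A tolerance close to 1 merges rows that share even a single keyword, while a small tolerance only merges rows whose keyword sets are nearly identical.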



> **Be Careful**
>
> Comments used for unsupervised clustering don't always contain the
> keywords needed to produce meaningful clusters, that is, clusters
> that don't require rigorous manual verification.



##### Weighted Balanced Random Forest

`TextMiner` objects can fit a Weighted Balanced Random Forest (WBRF) on
the given dataset using the fetched keywords with the `fit` method. By
default, it uses a train-val-test split with a randomized hyperparameter
search validated by K-Fold.

-   `req_importance_score`: (bool) compute importance scores for all
    bags of relevant keywords (set when the `TextMiner` object is
    created, not in `fit`);
-   `train_ratio`: (float) ratio in (0, 1) of the data used for training
    in the train-test split;
-   `n_fold`: (int) number of folds in K-Fold hyperparameter tuning;
-   `n_round`: (int) number of rounds (new hyperparameter candidates)
    for K-Fold hyperparameter tuning;
-   `optim_metric`: (str) scikit-learn scoring metric targeted by
    `RandomizedSearchCV`.
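
The roles of `n_fold` and `n_round` can be sketched with two small helpers: one that splits sample indices into folds, and one that draws random hyperparameter candidates. Both helpers and the example search space are assumptions made for illustration; `fit` delegates this work to scikit-learn, and its internal search space is not documented here.

```python
import random


def k_fold_indices(n_samples, n_fold):
    """Yield (train_idx, val_idx) pairs; every index is validated exactly once."""
    indices = list(range(n_samples))
    folds = [indices[i::n_fold] for i in range(n_fold)]
    for k in range(n_fold):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val


def sample_candidates(space, n_round, seed=0):
    """Draw n_round random hyperparameter combinations from `space`."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_round)]


# Hypothetical search space, for illustration only.
space = {"n_estimators": [100, 200, 400], "max_depth": [3, 5, None]}
for candidate in sample_candidates(space, n_round=5):
    for train_idx, val_idx in k_fold_indices(n_samples=10, n_fold=3):
        pass  # fit on train_idx, score on val_idx, keep the best candidate
```

Each of the `n_round` candidates is evaluated on `n_fold` train/validation splits, so the tuning cost grows with the product of the two.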

``` python

text_modeling = TextMiner(data, 
                      comment_variable='Cleaned_text_column', target_variable='Target_column', 
                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data
                      search_mode='macro', n=3, top=1, stop_words=None, truncation_words=None, truncation_mode='right', # YAKE
                      fpg_min_support=1E-3, keep_strongest_association=False, removeOutersection=False, # FPG
                      req_len_complexity=False, req_importance_score=True) # Random Forest

text_modeling_fit = text_modeling.fit(n_round=5, n_fold=3, train_ratio=0.6, optim_metric='accuracy', n_jobs=1, skl_verbose=0, verbose=False)
```



> **This is important for scoring**
>
> Make sure `req_importance_score`=True in the `TextMiner`.



With the above, we can now retrieve the best set of hyperparameters:

``` python

text_modeling_fit.best_hp_by_s
```

These hyperparameters lead to the following performance on the training
data:

``` python

text_modeling_fit.train_cm_by_s
```

``` python

text_modeling_fit.train_metrics_by_s
```

``` python

for s in text_modeling_fit.reqUniqueStratas():
  print(f'Strata : {s} \n')
  print(text_modeling_fit.train_cm_by_s[s])
  print(text_modeling_fit.train_metrics_by_s[s]) # for each strata
```

And to the following performance on the test data:

``` python

text_modeling_fit.test_cm_by_s
```

``` python

text_modeling_fit.test_metrics_by_s
```

``` python

for s in text_modeling_fit.reqUniqueStratas():
  print(f'Strata : {s} \n')
  print(text_modeling_fit.test_cm_by_s[s])
  print(text_modeling_fit.test_metrics_by_s[s]) # for each strata
```

We can dig deeper by looking at the importance scores (that we
required).

**Mean Decrease in Impurity (MDI)** : After the model is trained, you
can access the `MDI` scores to understand which features had the most
substantial impact on the model's decisions. This insight is
particularly useful for feature selection, understanding the data, and
interpreting the model's behavior, allowing you to make informed
decisions about which features to keep, discard, or further investigate.

``` python

import pandas as pd

results_df = pd.DataFrame()

for s in text_modeling_fit.reqUniqueStratas():
    temp_df = pd.DataFrame(list(text_modeling_fit.mdi_importances_by_s[s].items()), columns=['Keyword', f'{s}_Importance'])
    if results_df.empty:
        results_df = temp_df
    else:
        results_df = pd.merge(results_df, temp_df, on='Keyword', how='outer')

results_df
```
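
As background for the MDI scores, the quantity accumulated for each feature is the impurity decrease produced by the splits made on that feature. The sketch below computes it for one split with Gini impurity; the helper names are assumptions, and a real forest additionally weights each decrease by the share of samples reaching the node and averages over all trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Gini decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


parent = ["low", "low", "high", "high"]
# A keyword that separates the classes perfectly removes all impurity.
print(impurity_decrease(parent, ["low", "low"], ["high", "high"]))
# A keyword unrelated to the target leaves impurity unchanged.
print(impurity_decrease(parent, ["low", "high"], ["low", "high"]))
```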

**Permutation** : Unlike `MDI (Mean Decrease in Impurity)`, which is
specific to tree-based models, permutation importance can be applied to
any model. It measures the increase in the model's prediction error
after permuting the feature's values, which breaks the relationship
between the feature and the true outcome.

``` python

results_df = pd.DataFrame()

for s in text_modeling_fit.reqUniqueStratas():
    temp_df = pd.DataFrame(list(text_modeling_fit.perm_importances_by_s[s].items()), columns=['Keyword', f'{s}_Importance'])
    if results_df.empty:
        results_df = temp_df
    else:
        results_df = pd.merge(results_df, temp_df, on='Keyword', how='outer')

results_df
```
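
The idea behind permutation importance can be sketched without any library: shuffle one feature column, re-score the model, and record how much accuracy drops. The helpers below and the toy model are assumptions made for illustration and are not how `TextMiner` computes `perm_importances_by_s` internally.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, column, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling the values of one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


# Toy data: the label equals feature 0, and feature 1 is irrelevant.
rows = [[i % 2, (i // 2) % 2] for i in range(20)]
labels = [r[0] for r in rows]
predict = lambda r: r[0]  # stand-in for a fitted model

print(permutation_importance(predict, rows, labels, column=0))
print(permutation_importance(predict, rows, labels, column=1))
```

Shuffling the irrelevant column leaves every prediction unchanged, so its importance is exactly 0, while shuffling the informative column lowers accuracy.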

We see that the permutation score is much more 'aggressive', as it leads
to smaller importance scores. A score close to 0 means that a keyword's
presence doesn't improve accuracy. A negative score means that a
keyword's presence decreases accuracy, i.e. it is a feature that should
be masked.

We can interpret both importance scores at once for a bag of keywords
found in a given comment. Let $m_k$ be the MDI score and $p_k$ be the
permutation score for keyword $k \in K$, where $K$ is the set of keywords
found in a comment. `TextMiner` computes the Harmonic Importance as

$$
h := \frac{1}{(\sum_K \text{ReLU}(m_k))^{-1} + (\sum_K \text{ReLU}(p_k))^{-1}}.
$$

Harmonic Importance uses ReLU to disregard negative importance scores.
The *Harmonic Mean* is dominated by the smaller of its two terms, so
this choice gives more weight to permutation scores, which are typically
smaller and cleaner. The resulting Harmonic score is returned with the
YAKE keywords output.

``` python

text_modeling_fit.reqYAKEKeywords()
```
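
The Harmonic Importance formula above can be evaluated by hand with a few lines of Python. This is a direct transcription of the formula, not the library's code: the function names are hypothetical, and returning 0 when either sum is zero is an assumption (the formula itself divides by zero in that case).

```python
def relu(x):
    """Rectified linear unit: negative scores count as zero."""
    return max(0.0, x)


def harmonic_importance(mdi_scores, perm_scores):
    """h = 1 / (1 / sum(ReLU(m_k)) + 1 / sum(ReLU(p_k)))."""
    m = sum(relu(v) for v in mdi_scores)
    p = sum(relu(v) for v in perm_scores)
    if m == 0.0 or p == 0.0:
        return 0.0  # assumption: an infinite reciprocal drives h to 0
    return 1.0 / (1.0 / m + 1.0 / p)


# Three keywords found in one comment: MDI sums to 0.5, permutation to 0.1.
h = harmonic_importance([0.2, -0.1, 0.3], [0.05, 0.0, 0.05])
print(h)
```

Here `h` equals 1 / (2 + 10), about 0.083, which is below the smaller of the two sums (0.1). The smaller term, typically the permutation score, therefore dominates the result.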



> **Best Scoring Method**
>
> The most suitable scoring method for assessing feature importance
> ultimately depends on the user's specific needs and context. In my
> view, all three methods (Mean Decrease in Impurity (MDI), Harmonic
> Importance, and Permutation Importance) are valid ways to evaluate
> feature significance. Each has its strengths, and any one of them can
> be a good choice depending on the requirements and goals of the
> analysis.



## Ideal Use Cases

-   **Sentiment Analysis**: Ideal for businesses looking to gauge
    customer sentiment from reviews, social media posts, or feedback
    surveys. TextMiner can help identify positive, negative, and neutral
    sentiments, enabling companies to understand customer perceptions
    and improve their products or services accordingly.

-   **Topic Modeling**: Useful for content aggregators, news agencies, or
    researchers who need to categorize large volumes of text into
    coherent topics. TextMiner can automate the discovery of prevailing
    themes in documents, making content navigation and organization more
    efficient.

-   **SEO Keyword Extraction**: Digital marketers and content creators
    can leverage TextMiner to extract relevant keywords from articles,
    blog posts, or web pages. This assists in optimizing content for
    search engines, improving visibility, and driving traffic.

-   **Document Summarization**: Beneficial for legal professionals,
    academics, or anyone who needs to digest large amounts of text.
    TextMiner can be used to generate concise summaries of lengthy
    documents, saving time and highlighting critical information.

-   **Fraud Detection**: In finance and cybersecurity, TextMiner can
    analyze communication or transaction descriptions to detect patterns
    indicative of fraudulent activity. This proactive identification
    helps mitigate risks and safeguard assets.

-   **Competitive Analysis**: Business analysts and strategists can use
    TextMiner to extract insights from competitor publications, press
    releases, or product descriptions. This enables a deeper
    understanding of market positioning, product features, and strategic
    moves.

-   **Customer Support Automation**: For businesses looking to enhance
    their customer support, TextMiner can categorize incoming queries,
    route them to the appropriate department, and even suggest automated
    responses, improving efficiency and response time.

-   **Academic Research**: Researchers can employ TextMiner to sift
    through academic papers, journals, or datasets, extracting relevant
    information, identifying research trends, and facilitating
    literature reviews.

-   **Social Media Monitoring**: Marketing teams and social media
    managers can use TextMiner to track brand mentions, analyze public
    opinion, and understand consumer trends on social media platforms,
    informing marketing strategies and engagement efforts.

-   **Language Learning Applications**: Developers of educational
    software can integrate TextMiner to analyze language usage, generate
    exercises, or provide feedback on language learning progress,
    enriching the learning experience.

The TextMiner component, with its comprehensive text analysis
capabilities, offers a powerful tool for extracting actionable insights
from textual data. Its application can significantly impact
decision-making, strategic planning, and operational efficiency across a
wide range of sectors.

## Contributing

We welcome contributions, suggestions, and feedback to make this library
even better. Feel free to fork the repository, submit pull requests, or
open issues.

## Documentation & Examples

For documentation and usage examples, visit the GitHub repository:
https://github.com/knowusuboaky/textmining_module

**Author**: Kwadwo Daddy Nyame Owusu - Boakye\
**Email**: kwadwo.owusuboakye@outlook.com\
**License**: MIT

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/knowusuboaky/textmining_module",
    "name": "textmining-module",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "text mining,clustering,correlation,similarity,keyword extraction,text analysis,scoring,bining,data processing,Python",
    "author": "Kwadwo Daddy Nyame Owusu - Boakye",
    "author_email": "kwadwo.owusuboakye@outlook.com",
    "download_url": "https://files.pythonhosted.org/packages/d5/04/cf68bbdb6e9c7fde89eb4a2a2d028c82889be910612dc2e10ba120f833d1/textmining_module-0.1.3.tar.gz",
    "platform": null,
    "description": "# Text Mining Module Library\r\n\r\n## Overview\r\n\r\nThe `textmining_module` is a comprehensive Python library designed to\r\nfacilitate text mining, keyword extraction, and text analysis tasks. It\r\nprovides a suite of tools for preprocessing textual data, extracting\r\nmeaningful insights, and transforming texts into formats suitable for\r\nvarious analysis and machine learning models.\r\n\r\n## Features\r\n\r\n-   **Text Preprocessing**: Simplify the preparation of text data with\r\n    functions for cleaning, normalizing, and preprocessing textual\r\n    content.\r\n-   **Keyword Extraction**: Utilize built-in functionalities to extract\r\n    significant keywords and phrases from large volumes of text.\r\n-   **Text Analysis**: Leverage tools to analyze and understand the\r\n    content, structure, and composition of your text data.\r\n\r\n## Developer Manual for KeywordsExtractor\r\n\r\n### Functions Map\r\n\r\n<img src=\"https://github.com/knowusuboaky/textmining_module/blob/main/README_files/figure-markdown/mermaid-figure-1.png?raw=true\" width=\"1526\" height=\"459\" alt=\"Optional Alt Text\">\r\n\r\n\r\n### User Manual\r\n\r\n#### Installation\r\n\r\nThis is the environment we need to load.\r\n\r\n``` bash\r\n\r\npip install textmining_module==0.1.3\r\n```\r\n\r\n#### Load Package\r\n\r\n``` bash\r\n\r\nfrom textmining_module import KeywordsExtractor\r\n```\r\n\r\n#### Base Operations\r\n\r\n##### Extract Keywords From Dataset\r\n\r\n``` bash\r\n\r\nkeywords_df =  KeywordsExtractor(data, \r\n                                 text_column= 'text_column', \r\n                                 method= 'yake', \r\n                                 n=3, \r\n                                 stopword_language= 'english') \r\n```\r\n\r\nThe `KeywordsExtractor` extracts keywords from textual data within a\r\n`pandas` DataFrame. 
Here's a detailed look at each of its arguments:\r\n\r\n-   `data` : The `pandas` DataFrame containing the `text data` from\r\n    which you want to extract keywords. This DataFrame should have at\r\n    least one `text_column` specified by the text_column argument.\r\n    -   `text_column` : (str) The name of the column within the data\r\n        DataFrame that contains the textual data for keyword extraction.\r\n    -   `method` : (str) Specifies the method to be used for keyword\r\n        extraction. The function supports the following methods:\r\n        -   `frequency` : Extracts keywords based on word frequency,\r\n            excluding common stopwords.\r\n        -   `yake` : Utilizes YAKE (Yet Another Keyword Extractor), an\r\n            unsupervised method that considers word frequency and\r\n            position.- - `tf-idf` : Employs Term Frequency-Inverse\r\n            Document Frequency, highlighting words that are particularly\r\n            indicative of the text's content.\r\n        -   `pos` : Focuses on part-of-speech tagging, typically\r\n            selecting nouns as keywords.\r\n        -   `ner`: Uses Named Entity Recognition to identify and extract\r\n            entities (e.g., people, organizations) as keywords.\r\n    -   `n` : (int) The number of keywords to extract from each piece of\r\n        text.\r\n    -   `stopwords_language` : (str) Indicates the language of the\r\n        stopwords to be used for filtering during the keyword extraction\r\n        process. 
This is relevant for methods that remove common words\r\n        to focus on more meaningful content.\r\n\r\n## Developer Manual for TextMiner\r\n\r\n### Functions Map\r\n\r\n<img src=\"https://github.com/knowusuboaky/textmining_module/blob/main/README_files/figure-markdown/mermaid-figure-2.png?raw=true\" width=\"730\" height=\"1615\" alt=\"Optional Alt Text\">\r\n\r\n\r\n\r\n### User Manual\r\n\r\n#### Installation\r\n\r\nThis is the environment we need to load.\r\n\r\n``` bash\r\n\r\npip install textmining_module==0.1.3\r\n```\r\n\r\n#### Load Package\r\n\r\n``` bash\r\n\r\nfrom textmining_module import TextMiner\r\n```\r\n\r\n#### Base Operations\r\n\r\n##### Prepare Text Dataset\r\n\r\n``` bash\r\n\r\nCleaner = TextMiner(data, comment_variable='Text_column', target_variable='Target_column',\r\n                       truncation_words=None, truncation_mode='right',\r\n                       preprocessing_only=True, verbose=True)\r\n\r\ndata['Cleaned_text_column'] = Cleaner.reqCleanedData()['Text_column']\r\n```\r\n\r\n`Text_column` may have translations at the end that we want to remove.\r\nWe can use `TextMiner` to obtain [preprocessed]{.coop_blue} messages\r\nthat are [right truncated]{.coop_blue} after some stop words we\r\nidentified.\r\n\r\n-   Required 1st argument : (`pandas` dataframe) of dataset;\r\n-   `comment_variable` : (str) name of the comment variable in `pandas`\r\n    dataframe;\r\n-   `target_variable`\u00c2\u00a0: (str) name of the target variable in `pandas`\r\n    dataframe;\r\n-   `truncation_words` : (str list) words where a split occur to\r\n    truncate a message to the left/right - i.e.\u00c2\u00a0if french copy\r\n    before/after an english message;\r\n-   `truncation_mode`\u00c2\u00a0: (str) {'right' : remove rhs of message at\r\n    truncation_word, 'left' : remove lhs of message at truncation_word};\r\n-   `preprocessing_only`\u00c2\u00a0: (bool) if True, only clean (opt.), format,\r\n    stratify (opt.) and truncate (opt.) 
given dataset;\r\n-   `verbose` : (bool) if True, show a progress bar.\r\n\r\n##### Fetch Association\r\n\r\nLet's review how to use `TextMiner` to fetch [processed]{.coop_blue}\r\nkeywords that are associated with ratings. The most challenging part of\r\nmost unsupervised algorithms is to find the correct hyperparameters. For\r\n`TextMiner`, pay attention to `fpg_min_support`, `n` and `top`. Keyword\r\nextraction may fail with an exponentially growing time complexity if too\r\nmany n-grams are fetched at a low support. A low `fpg_min_support` means\r\nthat we tolerate keywords that appear in a low number of observations. A\r\nlow `n` with a high `top` will lead to grams that are more likely to be\r\ncommon to many messages, hence increasing time complexity as there would\r\nbe too many permutations to check. A high `n` with a low `top`, on the\r\nother hand, will lead to grams that are too specific.\r\n\r\n-   `strata_variable` : (str) name of the strata variable in `pandas`\r\n    dataframe, for a stratified analysis - i.e.\u00c2\u00a0break down by LoB;\r\n-   `req_len_complexity` : (bool) if True, include message length\r\n    quartiles in analysis as a new qualitative attribute;\r\n-   `removeOutersection`\u00c2\u00a0: (bool) if True, exclude keywords that contain\r\n    other fetched keywords;\r\n-   `search_mode` : (str) {'macro' : (for each strata) concatenate all\r\n    rows in one chunk before extracting keywords, 'micro' : extract\r\n    keywords row-wise}\r\n-   `n` : (int) maximal number of grams (words excluding `stop_words`)\r\n    that can form a `keyword`;\r\n-   `top`\u00c2\u00a0: (int) how many n-grams to fetch;\r\n-   `stop_words` : (str list) words to disregard in generation of\r\n    n-grams;\r\n-   `fpg_min_support` : (float) minimal support for FP Growth - try\r\n    higher value if FPG takes too long;\r\n-   `keep_strongest_association` : (bool) filter One Hot Data to keep\r\n    highest supported bits before fetching association.\r\n\r\n``` 
bash\r\n\r\npath_to_stopwords = \"./stop_keywords.txt\" # optional\r\nstopwords = open(path_to_stopwords, 'r').read().split('\\n')\r\n\r\ntext_modeling = TextMiner(data, \r\n                 comment_variable='Cleaned_text_column', target_variable='Target_column', \r\n                 strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data\r\n                 search_mode='micro', n=3, top=1, stop_words=None, truncation_words=None, truncation_mode='right', # YAKE\r\n                 fpg_min_support=1E-3, keep_strongest_association=False, removeOutersection=False, # FPG\r\n                 req_len_complexity=False, req_importance_score=False, # Random Forest\r\n                 verbose=True, preprocessing_only=False) # class use\r\n```\r\n\r\nWe can view the keywords that are associated to each pair (strata -\r\nspecific category, target - specific category). `TextMiner` allows some\r\nrare keywords (may happen) that have low support but high confidence\r\nscore.\r\n\r\n``` bash\r\n\r\ntext_modeling.reqKeywordsByTarget()['LoB_column_category']['Target_column_category']\r\n```\r\n\r\nWe can also request the best target for each keyword based on support\r\n**only**.\r\n\r\n``` bash\r\n\r\ntext_modeling.bestbucket_by_s['LoB_column_category']\r\n```\r\n\r\nWe can also request the strata of the data.\r\n\r\n``` bash\r\n\r\ntext_modeling.reqUniqueStratas()\r\n```\r\n\r\nWe can also request the targets of the data.\r\n\r\n``` bash\r\n\r\ntext_modeling.reqUniqueTargets()\r\n```\r\n\r\nWe can also request the keywords extracted from the data per strata\r\n\r\n``` bash\r\n\r\ntext_modeling.reqYAKEKeywords()['LoB_column_category']\r\n```\r\n\r\n##### Micro vs Macro\r\n\r\nOur `text_modeling` object fetched keywords with `micro` search,\r\ncould've been `macro` search instead. In both cases, the objective is to\r\nbuild a list of unique keywords and show every keywords from that list\r\nthat are found in each and every given comment. 
Let's now review the key\r\ndifferences.\r\n\r\n-   `micro` fetches `top` keywords (`n`-grams) row-by-row and adds a\r\n    column `keywords_yake` to the internally managed data.\r\n    -   faster for smaller data;\r\n    -   better to fetch unique keywords in individual messages within\r\n        smaller strata;\r\n    -   tends to make clusters of high variability in size, but mostly\r\n        considers minority data;\r\n    -   needs a lower `top` argument as it is per message (\\<= 2).\r\n\r\n``` bash\r\n\r\npath_to_stopwords = \"./stop_keywords.txt\" # optional\r\nstopwords = open(path_to_stopwords, 'r').read().split('\\n')\r\n\r\ntext_modeling_micro = TextMiner(data, \r\n                      comment_variable='Cleaned_text_column', target_variable='Target_column', \r\n                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data\r\n                      search_mode='micro', n=3, top=1, stop_words=None, truncation_words=None, truncation_mode='right', # YAKE\r\n                      fpg_min_support=1E-3, keep_strongest_association=False, removeOutersection=False, # FPG\r\n                      req_len_complexity=False, req_importance_score=True) # Random Forest\r\n```\r\n\r\nOR\r\n\r\n``` bash\r\n\r\npath_to_stopwords = \"./stop_keywords.txt\" # optional\r\nstopwords = open(path_to_stopwords, 'r').read().split('\\n')\r\n\r\ntext_modeling_micro = TextMiner(data, \r\n                      comment_variable='Cleaned_text_column', target_variable='Target_column', \r\n                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data\r\n                      search_mode='micro', n=3, top=1, stop_words=stopwords, fpg_min_support=1E-3,\r\n                      keep_strongest_association=False, removeOutersection=False) # no filter\r\n```\r\n\r\n-   `macro` fetches and creates internally managed macro-keywords\r\n    (stratified) dataframe(s). 
Uses text chunks for operations.\r\n    -   faster for bigger data;\r\n    -   better to fetch keywords typically common to all messages\r\n        within large strata;\r\n    -   tends to make clusters of similar size but ignores minority data;\r\n    -   needs a higher `top` argument as it is for all messages (\\>= 15);\r\n    -   filters are recommended (apply them with critical judgment).\r\n\r\n``` bash\r\n\r\npath_to_stopwords = \"./stop_keywords.txt\" # optional\r\nstopwords = open(path_to_stopwords, 'r').read().split('\\n')\r\n\r\ntext_modeling_macro = TextMiner(data, \r\n                      comment_variable='Cleaned_text_column', target_variable='Target_column', \r\n                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data\r\n                      search_mode='macro', n=3, top=1, stop_words=None, truncation_words=None, truncation_mode='right', # YAKE\r\n                      fpg_min_support=1E-3, keep_strongest_association=False, removeOutersection=False, # FPG\r\n                      req_len_complexity=False, req_importance_score=True) # Random Forest\r\n```\r\n\r\nOR\r\n\r\n``` bash\r\n\r\npath_to_stopwords = \"./stop_keywords.txt\" # optional\r\nstopwords = open(path_to_stopwords, 'r').read().split('\\n')\r\n\r\ntext_modeling_macro = TextMiner(data, \r\n                      comment_variable='Cleaned_text_column', target_variable='Target_column', \r\n                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data\r\n                      search_mode='macro', n=3, top=1, stop_words=stopwords, fpg_min_support=1E-3,\r\n                      keep_strongest_association=False, removeOutersection=False) # no filter\r\n```\r\n\r\n#### Advanced Operations\r\n\r\n##### Similarity Matrix\r\n\r\n`TextMiner` objects can compute a matrix where each element represents\r\nthe similarity score between a pair of documents, texts or keywords. 
It\r\nuses the Jaccard similarity measure, calculating the intersection over\r\nthe union of the sets of words (or other tokens) for each pair.\r\n\r\n``` bash\r\n\r\ntext_modeling_similarity = TextMiner(data, \r\n                      comment_variable='Cleaned_text_column', target_variable='Target_column', \r\n                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data\r\n                      search_mode='micro', n=3, top=1, stop_words=None, fpg_min_support=1E-3,\r\n                      keep_strongest_association=False, removeOutersection=False) # no filter\r\n\r\ntext_modeling_similarity.reqSimilarityMatrix()\r\n```\r\n\r\nWe can also view the similarity matrix for a specific strata category.\r\n\r\n``` bash\r\n\r\ntext_modeling_similarity.reqSimilarityMatrix()['LoB_column_category']\r\n```\r\n\r\n##### Clusterize\r\n\r\n`TextMiner` objects can cluster a given data set using the fetched\r\nkeywords with the command `clusterize`. By default, it returns clusters\r\nrow- and column-wise. 
The `treshold` is the distance tolerance (in (0,\r\n1\\]) that is accepted to merge clusters.\r\n\r\n``` bash\r\n\r\ntext_modeling_cluster = TextMiner(data, \r\n                      comment_variable='Cleaned_text_column', target_variable='Target_column', \r\n                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data\r\n                      search_mode='micro', n=3, top=1, stop_words=None, fpg_min_support=1E-3,\r\n                      keep_strongest_association=False, removeOutersection=False) # no filter\r\n\r\ncluster_observations, cluster_keywords = text_modeling_cluster.clusterize(treshold=0.9)\r\n```\r\n\r\nWhen `clusterize` is used, it adds the row-wise clusters to the\r\ninternally managed data.\r\n\r\n``` bash\r\n\r\ntext_modeling_cluster.reqCleanedData()\r\n```\r\n\r\n\r\n\r\n> **Be Careful**\r\n>\r\n> Comments used for unsupervised clustering don't always contain the\r\n> keywords needed to fetch meaningful clusters, that is, clusters that\r\n> don't require rigorous manual verification.\r\n\r\n\r\n\r\n##### Weighted Balanced Random Forest\r\n\r\n`TextMiner` objects can fit a Weighted Balanced Random Forest (WBRF) on\r\na given data set using the fetched keywords with the command `fit`. 
By\r\ndefault, it uses a train-val-test split with a randomized hyperparameter\r\nsearch in a K-Fold validation process.\r\n\r\n-   `req_importance_score`: (bool) find importance score for all bags of\r\n    relevant keywords (at `TextMiner` object initialization, see\r\n    `text_modeling_cluster` at `clusterize`);\r\n-   `train_ratio`: (float) ratio in (0, 1) for train data in\r\n    train-test-split;\r\n-   `n_fold`: (int) number of folds in K-Fold hyperparameter tuning;\r\n-   `n_round`: (int) number of rounds (new hyperparameter candidates)\r\n    for K-Fold hyperparameter tuning;\r\n-   `optim_metric`: (str) scikit-learn target metric for\r\n    RandomizedSearchCV.\r\n\r\n``` bash\r\n\r\ntext_modeling = TextMiner(data, \r\n                      comment_variable='Cleaned_text_column', target_variable='Target_column', \r\n                      strata_variable='LoB_column', keywords_variable=None, clean_words=None,  # data\r\n                      search_mode='macro', n=3, top=1, stop_words=None, truncation_words=None, truncation_mode='right', # YAKE\r\n                      fpg_min_support=1E-3, keep_strongest_association=False, removeOutersection=False, # FPG\r\n                      req_len_complexity=False, req_importance_score=True) # Random Forest\r\n\r\ntext_modeling_fit = text_modeling.fit(n_round=5, n_fold=3, train_ratio=0.6, optim_metric='accuracy', n_jobs=1, skl_verbose=0, verbose=False)\r\n```\r\n\r\n\r\n\r\n> **This is important for scoring**\r\n>\r\n> Make sure `req_importance_score`=True in the `TextMiner`.\r\n\r\n\r\n\r\nWith the above, we can now find the best set of hyperparameters:\r\n\r\n``` bash\r\n\r\ntext_modeling_fit.best_hp_by_s\r\n```\r\n\r\nThis leads to the following performance in training:\r\n\r\n``` bash\r\n\r\ntext_modeling_fit.train_cm_by_s\r\n```\r\n\r\n``` bash\r\n\r\ntext_modeling_fit.train_metrics_by_s\r\n```\r\n\r\n``` bash\r\n\r\nfor s in text_modeling_fit.reqUniqueStratas():\r\n  print(f'Strata : {s} \\n')\r\n  
print(text_modeling_fit.train_cm_by_s[s])\r\n  print(text_modeling_fit.train_metrics_by_s[s]) # for each strata\r\n```\r\n\r\nAnd to the following performance in testing:\r\n\r\n``` bash\r\n\r\ntext_modeling_fit.test_cm_by_s\r\n```\r\n\r\n``` bash\r\n\r\ntext_modeling_fit.test_metrics_by_s\r\n```\r\n\r\n``` bash\r\n\r\nfor s in text_modeling_fit.reqUniqueStratas():\r\n  print(f'Strata : {s} \\n')\r\n  print(text_modeling_fit.test_cm_by_s[s])\r\n  print(text_modeling_fit.test_metrics_by_s[s]) # for each strata\r\n```\r\n\r\nWe can dig deeper by looking at the importance scores (that we\r\nrequested).\r\n\r\n**Mean Decrease in Impurity (MDI)** : After the model is trained, you\r\ncan access the `MDI` scores to understand which features had the most\r\nsubstantial impact on the model's decisions. This insight is\r\nparticularly useful for feature selection, understanding the data, and\r\ninterpreting the model's behavior, allowing you to make informed\r\ndecisions about which features to keep, discard, or further investigate.\r\n\r\n``` bash\r\n\r\nimport pandas as pd\r\n\r\n# one importance column per strata, merged on the keyword\r\nresults_df = pd.DataFrame()\r\n\r\nfor s in text_modeling_fit.reqUniqueStratas():\r\n    temp_df = pd.DataFrame(list(text_modeling_fit.mdi_importances_by_s[s].items()), columns=['Keyword', f'{s}_Importance'])\r\n    if results_df.empty:\r\n        results_df = temp_df\r\n    else:\r\n        results_df = pd.merge(results_df, temp_df, on='Keyword', how='outer')\r\n\r\nresults_df\r\n```\r\n\r\n**Permutation** : Unlike `MDI (Mean Decrease in Impurity)`, which is\r\nspecific to tree-based models, permutation importance can be applied to\r\nany model. 
It measures the increase in the model's prediction error after\r\npermuting the feature's values, which breaks the relationship between\r\nthe feature and the true outcome.\r\n\r\n``` bash\r\n\r\nimport pandas as pd\r\n\r\n# one importance column per strata, merged on the keyword\r\nresults_df = pd.DataFrame()\r\n\r\nfor s in text_modeling_fit.reqUniqueStratas():\r\n    temp_df = pd.DataFrame(list(text_modeling_fit.perm_importances_by_s[s].items()), columns=['Keyword', f'{s}_Importance'])\r\n    if results_df.empty:\r\n        results_df = temp_df\r\n    else:\r\n        results_df = pd.merge(results_df, temp_df, on='Keyword', how='outer')\r\n\r\nresults_df\r\n```\r\n\r\nWe see that the permutation score is much more 'aggressive', as it leads\r\nto smaller importance scores. A score close to 0 happens when a\r\nkeyword's presence doesn't improve accuracy. A negative score happens\r\nwhen a keyword's presence decreases accuracy, i.e. a feature that should\r\nbe masked.\r\n\r\nWe can interpret both importance scores at once, for a bag of keywords\r\nfound in a given comment. Let $m_k$ be the MDI score and $p_k$ be the\r\npermutation score for keyword $k \\in K$, where $K$ is the set of\r\nkeywords found in a comment. `TextMiner` computes the Harmonic\r\nImportance as\r\n\r\n$$\r\nh := \\frac{1}{\\left(\\sum_{k \\in K} \\text{ReLU}(m_k)\\right)^{-1} + \\left(\\sum_{k \\in K} \\text{ReLU}(p_k)\\right)^{-1}}.\r\n$$\r\n\r\nHarmonic Importance uses ReLU to disregard negative importance scores.\r\nThe choice of *Harmonic Mean* boils down to giving more importance to\r\npermutation scores: a harmonic combination is dominated by the smaller\r\nof its two terms, and permutation scores are typically smaller and\r\ncleaner. 
The resulting Harmonic score can be requested through the YAKE\r\nkeywords output.\r\n\r\n``` bash\r\n\r\ntext_modeling_fit.reqYAKEKeywords()\r\n```\r\n\r\n\r\n\r\n> **Best Scoring Method**\r\n>\r\n> The most suitable scoring method for assessing feature importance in\r\n> machine learning models ultimately depends on the user's specific\r\n> needs and context. 
In my view, all three methods---Mean Decrease in\r\n> Impurity (MDI), Harmonic Mean, and Permutation Importance---offer\r\n> valid approaches for evaluating feature significance. Each method has\r\n> its strengths and can be effectively applied across various scenarios,\r\n> making any one of them a potentially good choice depending on the\r\n> particular requirements and goals of the analysis.\r\n\r\n\r\n\r\n## Ideal Use Cases\r\n\r\n-   **Sentiment Analysis** Ideal for businesses looking to gauge\r\n    customer sentiment from reviews, social media posts, or feedback\r\n    surveys. TextMiner can help identify positive, negative, and neutral\r\n    sentiments, enabling companies to understand customer perceptions\r\n    and improve their products or services accordingly.\r\n\r\n-   **Topic Modeling** Useful for content aggregators, news agencies, or\r\n    researchers who need to categorize large volumes of text into\r\n    coherent topics. TextMiner can automate the discovery of prevailing\r\n    themes in documents, making content navigation and organization more\r\n    efficient.\r\n\r\n-   **SEO Keyword Extraction** Digital marketers and content creators\r\n    can leverage TextMiner to extract relevant keywords from articles,\r\n    blog posts, or web pages. This assists in optimizing content for\r\n    search engines, improving visibility, and driving traffic.\r\n\r\n-   **Document Summarization** Beneficial for legal professionals,\r\n    academics, or anyone who needs to digest large amounts of text.\r\n    TextMiner can be used to generate concise summaries of lengthy\r\n    documents, saving time and highlighting critical information.\r\n\r\n-   **Fraud Detection** In finance and cybersecurity, TextMiner can\r\n    analyze communication or transaction descriptions to detect patterns\r\n    indicative of fraudulent activity. 
This proactive identification\r\n    helps mitigate risks and safeguard assets.\r\n\r\n-   **Competitive Analysis** Business analysts and strategists can use\r\n    TextMiner to extract insights from competitor publications, press\r\n    releases, or product descriptions. This enables a deeper\r\n    understanding of market positioning, product features, and strategic\r\n    moves.\r\n\r\n-   **Customer Support Automation** For businesses looking to enhance\r\n    their customer support, TextMiner can categorize incoming queries,\r\n    route them to the appropriate department, and even suggest automated\r\n    responses, improving efficiency and response time.\r\n\r\n-   **Academic Research** Researchers can employ TextMiner to sift\r\n    through academic papers, journals, or datasets, extracting relevant\r\n    information, identifying research trends, and facilitating\r\n    literature reviews.\r\n\r\n-   **Social Media Monitoring** Marketing teams and social media\r\n    managers can use TextMiner to track brand mentions, analyze public\r\n    opinion, and understand consumer trends on social media platforms,\r\n    informing marketing strategies and engagement efforts.\r\n\r\n-   **Language Learning Applications** Developers of educational\r\n    software can integrate TextMiner to analyze language usage, generate\r\n    exercises, or provide feedback on language learning progress,\r\n    enriching the learning experience.\r\n\r\nThe TextMiner component, with its comprehensive text analysis\r\ncapabilities, offers a powerful tool for extracting actionable insights\r\nfrom textual data. Its application can significantly impact\r\ndecision-making, strategic planning, and operational efficiency across a\r\nwide range of sectors.\r\n\r\n## Contributing\r\n\r\nWe welcome contributions, suggestions, and feedback to make this library\r\neven better. 
Feel free to fork the repository, submit pull requests, or\r\nopen issues.\r\n\r\n## Documentation & Examples\r\n\r\nFor documentation and usage examples, visit the GitHub repository:\r\nhttps://github.com/knowusuboaky/textmining_module\r\n\r\n**Author**: Kwadwo Daddy Nyame Owusu - Boakye\\\r\n**Email**: kwadwo.owusuboakye@outlook.com\\\r\n**License**: MIT\r\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "A Python Module for Comprehensive Text Mining, including Keyword Extraction and Text Analysis.",
    "version": "0.1.3",
    "project_urls": {
        "Homepage": "https://github.com/knowusuboaky/textmining_module"
    },
    "split_keywords": [
        "text mining",
        "clustering",
        "correlation",
        "similarity",
        "keyword extraction",
        "text analysis",
        "scoring",
        "bining",
        "data processing",
        "python"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d504cf68bbdb6e9c7fde89eb4a2a2d028c82889be910612dc2e10ba120f833d1",
                "md5": "b94d5efb13b9c90158d97264b10efa58",
                "sha256": "45d4429f95026f45f3c94db45ea226ed521b79b6b3fde150cc33360daee513e3"
            },
            "downloads": -1,
            "filename": "textmining_module-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "b94d5efb13b9c90158d97264b10efa58",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 27673,
            "upload_time": "2024-02-19T13:21:27",
            "upload_time_iso_8601": "2024-02-19T13:21:27.552253Z",
            "url": "https://files.pythonhosted.org/packages/d5/04/cf68bbdb6e9c7fde89eb4a2a2d028c82889be910612dc2e10ba120f833d1/textmining_module-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-19 13:21:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "knowusuboaky",
    "github_project": "textmining_module",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "textmining-module"
}
        