# TakeBlipInsightExtractor Package
_Data & Analytics Research_
## Overview
This document covers the following topics:

* [Intro](#intro)
* [Parameters](#parameters)
* [Example of initialization and usage](#example-of-initialization-and-usage)
## Intro
The Insight Extractor offers a way to analyze large volumes of textual data in order to identify, cluster and detail subjects.
It achieves these results by applying a proprietary Named Entity Recognition (NER) algorithm followed by a clustering algorithm.
The IE Cloud also allows anyone to use this tool without needing significant computational resources of their own.
The package outputs four types of files:
- **Wordcloud**: an image file containing a wordcloud of the most frequent subjects in the text. The colours represent groups of similar subjects.
- **Wordtree**: an HTML file showing the graphical relationship between subjects and example sentences in which they appear. It is an interactive graphic that lets the user navigate along the tree.
- **Hierarchy**: a JSON file containing the hierarchical relationship between subjects.
- **Table**: a CSV file containing the following columns:

  | Message | Entities | Groups | Structured Message |
  | --- | --- | --- | --- |
  | sobre cobranca inexistente | [{'value': 'cobrança', 'lowercase_value': 'cobrança', 'postags': 'SUBS', 'type': 'financial'}] | ['cobrança'] | sobre cobrança inexistente |
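A sketch of parsing the Table output with Python's standard library, using the row shown above (the `Entities` and `Groups` cells are serialized Python literals, so `ast.literal_eval` can parse them safely):

```python
import csv
import io
from ast import literal_eval

# A sample in the Table format described above ('|' is the default separator).
sample = (
    "Message|Entities|Groups|Structured Message\n"
    "sobre cobranca inexistente|"
    "[{'value': 'cobrança', 'lowercase_value': 'cobrança', 'postags': 'SUBS', 'type': 'financial'}]|"
    "['cobrança']|sobre cobrança inexistente\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter='|')
for row in reader:
    # Entities and Groups are serialized Python literals; parse them safely.
    entities = literal_eval(row['Entities'])
    groups = literal_eval(row['Groups'])
    print(entities[0]['type'], groups)
```

For a real output file, replace `io.StringIO(sample)` with an open file handle and keep `delimiter` equal to the `separator` used when running the analysis.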
### Parameters
The following parameters need to be set by the user on the command line:
- **embedding_path**: path to the embedding model, the file should end with .kv;
- **postagging_model_path**: path to the postagging model, the file should end with .pkl;
- **postagging_label_path**: path to the postagging label file, the file should end with .pkl;
- **ner_model_path**: path to the ner model, the file should end with .pkl;
- **ner_label_path**: path to the ner label file, the file should end with .pkl;
- **file**: path to the csv file the user wants to analyze;
- **user_email**: user's Take Blip email where they want to receive the analysis;
- **bot_name**: bot ID.
The following parameters have default settings, but can be customized by the user:
- **node_messages_examples**: an int representing the number of examples output for each subject in the Wordtree file. The default value is 100;
- **similarity_threshold**: a float representing the similarity threshold between subject groups. The default value is 0.65; we recommend not modifying this parameter;
- **percentage_threshold**: a float representing the frequency percentile above which subjects are kept in the analysis. The default value is 0.9;
- **batch_size**: an int representing the batch size. The default value is 50;
- **chunk_size**: an int representing the chunk size used when uploading the file to storage. The default value is 1024;
- **separator**: a str giving the CSV file delimiter character. The default value is '|'.
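To illustrate the intended effect of `percentage_threshold`, here is a minimal sketch of a percentile-based frequency filter. This is only an illustration of the idea (the function and the sample counts are hypothetical, and the package's actual implementation may differ):

```python
from collections import Counter

def filter_subjects(subject_counts: Counter, percentage_threshold: float) -> list:
    """Keep only subjects whose frequency is at or above the given percentile."""
    frequencies = sorted(subject_counts.values())
    # Index of the percentile cutoff within the sorted frequency list.
    cutoff_index = int(percentage_threshold * (len(frequencies) - 1))
    cutoff = frequencies[cutoff_index]
    return [s for s, f in subject_counts.items() if f >= cutoff]

# Hypothetical subject frequencies: with threshold 0.9, only the most
# frequent subjects survive the cut.
counts = Counter({'cobrança': 40, 'boleto': 25, 'cartão': 20, 'senha': 3, 'app': 2})
print(filter_subjects(counts, 0.9))
```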
## Example of initialization and usage
1) Import main packages;
2) Initialize main variables;
3) Initialize eventhub logger;
4) Initialize Insight Extractor;
5) Insight Extractor usage.
An example of the above steps can be found in the Python code below:
1) Import main packages
```python
import uuid
from TakeBlipInsightExtractor.insight_extractor import InsightExtractor
from TakeBlipInsightExtractor.outputs.eventhub_log_sender import EventHubLogSender
```
2) Initialize main variables
```python
embedding_path = '*.kv'
postag_model_path = '*.pkl'
postag_label_path = '*.pkl'
ner_model_path = '*.pkl'
ner_label_path = '*.pkl'
user_email = 'your_email@host.com'
bot_name = 'my_bot_for_insight_extractor'
application_name = 'your application'
eventhub_name = '*'
eventhub_connection_string = '*'
file_name = '*'
input_data = '*.csv'
separator = '|'
similarity_threshold = 0.65
node_messages_examples = 100
batch_size = 1024
percentage_threshold = 0.7
```
3) Initialize eventhub logger
```python
correlation_id = str(uuid.uuid3(uuid.NAMESPACE_DNS, user_email + bot_name))
logger = EventHubLogSender(application_name=application_name,
                           user_email=user_email,
                           bot_name=bot_name,
                           file_name=file_name,
                           correlation_id=correlation_id,
                           connection_string=eventhub_connection_string,
                           eventhub_name=eventhub_name)
```
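Note that the correlation id above is built with `uuid.uuid3`, which is name-based and therefore deterministic: the same user email and bot name always produce the same id, so logs from repeated runs for the same user and bot can be correlated.

```python
import uuid

user_email = 'your_email@host.com'
bot_name = 'my_bot_for_insight_extractor'

a = str(uuid.uuid3(uuid.NAMESPACE_DNS, user_email + bot_name))
b = str(uuid.uuid3(uuid.NAMESPACE_DNS, user_email + bot_name))
assert a == b  # name-based UUIDs are reproducible across runs
print(a)
```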
4) Initialize Insight Extractor
```python
insight_extractor = InsightExtractor(input_data,
                                     separator=separator,
                                     similarity_threshold=similarity_threshold,
                                     embedding_path=embedding_path,
                                     postagging_model_path=postag_model_path,
                                     postagging_label_path=postag_label_path,
                                     ner_model_path=ner_model_path,
                                     ner_label_path=ner_label_path,
                                     user_email=user_email,
                                     bot_name=bot_name,
                                     logger=logger)
```
5) Insight Extractor usage
```python
insight_extractor.predict(percentage_threshold=percentage_threshold,
                          node_messages_examples=node_messages_examples,
                          batch_size=batch_size)
```
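The exact input schema is not documented here; as an illustration only, a minimal '|'-delimited input CSV could be produced like this (the `Message` column name is an assumption, matching the first column of the Table output):

```python
import csv

messages = ['sobre cobranca inexistente', 'problema com boleto']

# Write a '|'-delimited file matching the default `separator`.
with open('input_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerow(['Message'])  # assumed column name
    writer.writerows([[m] for m in messages])
```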