hades-nlp

Name	hades-nlp
Version	0.1.2
home_page
Summary	Homologous Automated Document Exploration and Summarization - A powerful tool for comparing similarly structured documents
upload_time	2023-08-07 11:12:19
maintainer
docs_url	None
author	Artur Żółkowski
requires_python	>=3.9, !=2.7.*, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, !=3.5.*, !=3.6.*, !=3.7.*, !=3.8.*
license
keywords	nlp documents topic modeling summarization machine learning natural language processing text analysis text mining text summarization
VCS
bugtrack_url
requirements	No requirements were recorded.

# HADES: Homologous Automated Document Exploration and Summarization
A powerful tool for comparing similarly structured documents

[![PyPI version](https://badge.fury.io/py/hades-nlp.svg)](https://pypi.org/project/hades-nlp/)
[![Downloads](https://static.pepy.tech/badge/hades-nlp)](https://pepy.tech/project/hades-nlp)

## Overview
`HADES` is a **Python** package for comparing similarly structured documents. HADES is designed to streamline the work of professionals dealing with large volumes of documents, such as policy documents, legal acts, and scientific papers. The tool employs a multi-step pipeline that begins with processing PDF documents using topic modeling, summarization, and analysis of the most important words for each topic. The process concludes with an interactive web app with visualizations that facilitate the comparison of the documents. HADES has the potential to significantly improve the productivity of professionals dealing with high volumes of documents, reducing the time and effort required to complete tasks related to comparative document analysis.

## Installation
The latest released version of the `HADES` package is available on the [Python Package Index (PyPI)](https://pypi.org/project/hades-nlp/). To install it:

1. Install the spaCy `en_core_web_sm` or `en_core_web_lg` model for English according to the [instructions](https://spacy.io/models/en).

2. Install the `HADES` package using pip:

```sh
pip install -U hades-nlp
```
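
Before running the pipeline, it can help to confirm that the spaCy model from step 1 is importable. The following is a minimal sketch, assuming the model was installed as a regular Python package named `en_core_web_sm` or `en_core_web_lg`; the helper `installed_models` is illustrative and not part of HADES.

```python
from importlib.util import find_spec

# Package names under which spaCy distributes its small and large English models.
CANDIDATE_MODELS = ("en_core_web_sm", "en_core_web_lg")


def installed_models(candidates=CANDIDATE_MODELS):
    """Return the candidate model packages that are importable in this environment."""
    return [name for name in candidates if find_spec(name) is not None]


if __name__ == "__main__":
    found = installed_models()
    if found:
        print("spaCy models available:", ", ".join(found))
    else:
        print("No English spaCy model found; see https://spacy.io/models/en")
```
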
The source code and development version are currently hosted on [GitHub](https://github.com/MI2DataLab/HADES).

## Usage
The `HADES` package is designed to be used in a Python environment. The package can be imported as follows:

```python
from hades.data_loading import load_processed_data
from hades.topic_modeling import ModelOptimizer, save_data_for_app, set_openai_key
from my_documents_data import PARAGRAPHS, COMMON_WORDS, STOPWORDS
```
The `load_processed_data` function loads the documents to be processed. The `ModelOptimizer` class is used to optimize the topic modeling process. The `save_data_for_app` function saves the data for the interactive web app. The `set_openai_key` function sets the OpenAI API key.
`my_documents_data` is a user-provided module that contains information about the documents to be processed. The `PARAGRAPHS` variable is a list of strings that identify the paragraphs of the documents. The `COMMON_WORDS` variable holds the most common words in the documents; the example below indexes it by paragraph. The `STOPWORDS` variable is a list of strings that represent the most common words in the documents that should be excluded from the analysis.
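
As an illustration, a `my_documents_data.py` module could look like the sketch below. All values are hypothetical placeholders, and the layout of `COMMON_WORDS` as a dictionary keyed by paragraph name is an assumption inferred from the `COMMON_WORDS[paragraph]` lookup in the usage example, not something the package documents here.

```python
# my_documents_data.py (hypothetical contents; adapt to your own corpus)

# Names of the paragraphs (document sections) that are modeled separately.
PARAGRAPHS = ["introduction", "objectives"]

# Assumption: frequent words per paragraph, keyed by paragraph name,
# because the usage example looks up COMMON_WORDS[paragraph].
COMMON_WORDS = {
    "introduction": ["plan", "national"],
    "objectives": ["target", "objective"],
}

# Words excluded from the analysis for every paragraph.
STOPWORDS = ["shall", "hereby", "thereof"]

# Sanity check: every paragraph needs an entry in COMMON_WORDS.
missing = [p for p in PARAGRAPHS if p not in COMMON_WORDS]
assert not missing, f"No common words defined for: {missing}"
```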

First, the documents are loaded and processed:
```python
set_openai_key("my openai key")
data_path = "my/data/path"
processed_df = load_processed_data(
    data_path=data_path,
    stop_words=STOPWORDS,
    id_column='country',
    flattened_by_col='my_column',
)
```
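
Rather than hard-coding the API key in a script, the key can be read from the environment. This is a minimal sketch; the helper `read_openai_key` and the variable name `OPENAI_API_KEY` are assumptions made for illustration and are not part of HADES.

```python
import os


def read_openai_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage with HADES (requires the package to be installed):
# set_openai_key(read_openai_key())
```
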
After the documents are loaded, the topic modeling process is optimized for each paragraph:
```python
model_optimizers = []
for paragraph in PARAGRAPHS:
    filter_dict = {'paragraph': paragraph}
    model_optimizer = ModelOptimizer(
        processed_df,
        'country',
        'section',
        filter_dict,
        "lda",
        COMMON_WORDS[paragraph],
        (3,6),
        alpha=100
    )
    model_optimizer.name_topics_automatically_gpt3()
    model_optimizers.append(model_optimizer)
```
For each paragraph, a `ModelOptimizer` instance is used to optimize the topic modeling process. The `name_topics_automatically_gpt3` method automatically names the topics using the OpenAI GPT-3 API. Users can also call the `name_topics_manually` method to name the topics manually.
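
The `filter_dict` argument in the loop restricts modeling to one paragraph at a time. The sketch below shows one plausible reading of such a filter: keep only the rows whose columns equal the given values. It uses plain dictionaries instead of the processed DataFrame, and `filter_rows` is a hypothetical helper, not the actual HADES implementation.

```python
def filter_rows(rows, filter_dict):
    """Keep rows whose values match every column/value pair in filter_dict."""
    return [
        row for row in rows
        if all(row.get(col) == value for col, value in filter_dict.items())
    ]


# Toy stand-in for the processed data: one row per country and paragraph.
rows = [
    {"country": "A", "paragraph": "introduction", "text": "..."},
    {"country": "A", "paragraph": "objectives", "text": "..."},
    {"country": "B", "paragraph": "introduction", "text": "..."},
]

selected = filter_rows(rows, {"paragraph": "introduction"})
print([row["country"] for row in selected])  # ['A', 'B']
```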

Finally, the data is saved for the interactive web app:
```python
save_data_for_app(model_optimizers, path='path/to/results', do_summaries=True)
```
Setting the `do_summaries` parameter to `True` generates summaries for each topic.

Once the data is saved, the interactive web app can be launched:
```sh
hades run-app --config path/to/results/config.json
```
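
The same command can be started from Python. The sketch below assumes that `config.json` sits directly inside the results directory, as in the command above; `build_run_app_command` and `launch_app` are illustrative helpers, not part of HADES.

```python
import subprocess
from pathlib import Path


def build_run_app_command(results_dir):
    """Build the `hades run-app` command line for a results directory."""
    config_path = Path(results_dir) / "config.json"
    return ["hades", "run-app", "--config", str(config_path)]


def launch_app(results_dir):
    """Start the web app if the saved config exists; raise otherwise."""
    config_path = Path(results_dir) / "config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"Missing config file: {config_path}")
    # Blocks until the app process exits.
    subprocess.run(build_run_app_command(results_dir), check=True)
```

Checking for the config file first gives a clearer error when `save_data_for_app` has not been run yet.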

***

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "hades-nlp",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9, !=2.7.*, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, !=3.5.*, !=3.6.*, !=3.7.*, !=3.8.*",
    "maintainer_email": "",
    "keywords": "nlp,documents,topic modeling,summarization,machine learning,natural language processing,text analysis,text mining,text summarization",
    "author": "Artur \u017b\u00f3\u0142kowski",
    "author_email": "artur.zolkowski@wp.pl",
    "download_url": "https://files.pythonhosted.org/packages/f9/9c/0584f7eadeacc610312511a7ea840a2c34809236836ea557db99f8ef59c9/hades_nlp-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Homologous Automated Document Exploration and Summarization - A powerful tool for comparing similarly structured documents",
    "version": "0.1.2",
    "project_urls": null,
    "split_keywords": [
        "nlp",
        "documents",
        "topic modeling",
        "summarization",
        "machine learning",
        "natural language processing",
        "text analysis",
        "text mining",
        "text summarization"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b93f13386ae8829f65610f85db3fca35f8a00bf7796995e8187bea10e0b29096",
                "md5": "8b58ecfb68ed2882d6c7c698fd260b21",
                "sha256": "f9e3ea412e38e6a696f044af12c38b8b7ea38940be6cf6545c5a7c78b4251f99"
            },
            "downloads": -1,
            "filename": "hades_nlp-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8b58ecfb68ed2882d6c7c698fd260b21",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9, !=2.7.*, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, !=3.5.*, !=3.6.*, !=3.7.*, !=3.8.*",
            "size": 36518,
            "upload_time": "2023-08-07T11:12:18",
            "upload_time_iso_8601": "2023-08-07T11:12:18.029316Z",
            "url": "https://files.pythonhosted.org/packages/b9/3f/13386ae8829f65610f85db3fca35f8a00bf7796995e8187bea10e0b29096/hades_nlp-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f99c0584f7eadeacc610312511a7ea840a2c34809236836ea557db99f8ef59c9",
                "md5": "a43c1b751d0a8c28d877da0d21fa0180",
                "sha256": "4f5b64c5618c37f96fe51f2322b3fc59c408240a2b679744816bef429823d9e7"
            },
            "downloads": -1,
            "filename": "hades_nlp-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "a43c1b751d0a8c28d877da0d21fa0180",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9, !=2.7.*, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*, !=3.5.*, !=3.6.*, !=3.7.*, !=3.8.*",
            "size": 28531,
            "upload_time": "2023-08-07T11:12:19",
            "upload_time_iso_8601": "2023-08-07T11:12:19.833263Z",
            "url": "https://files.pythonhosted.org/packages/f9/9c/0584f7eadeacc610312511a7ea840a2c34809236836ea557db99f8ef59c9/hades_nlp-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-07 11:12:19",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "hades-nlp"
}
