SNPTMT

Name: SNPTMT
Version: 0.0.11
Home page: https://github.com/FIvER4IK/snptmt
Summary: Python module for searching for new popular topics in a message thread
Upload time: 2023-06-09 17:42:54
Author: FIvER4IK
Requires Python: >=3.4
Keywords: clusters, clustering, short text, search, new popular topics, message thread
Requirements: no requirements were recorded.
# SNPTMT

## User installation
```
pip install SNPTMT
```

## Loading and using modules
```
import SNPTMT.snptmt
```
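
A short usage sketch (it assumes the functions described below are exposed directly on the `SNPTMT.snptmt` submodule; aliasing it keeps the calls short):

```
import SNPTMT.snptmt as snptmt

# Functions from this README are then called through the alias, e.g.:
snptmt.download_stopwords()
```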

## Necessary modules
All of these modules should be installed and imported: `pandas`, `pymorphy2`, `nltk`, `ssl`, `re`, `spacy`, `scipy`, `math`, `random` (see the import block below).

```
import re
import ssl
import math
import random

import pandas as pd
import pymorphy2

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

import spacy

import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import cdist, squareform
```

## Function description

```
download_stopwords()
```
Function that downloads the NLTK stopwords.

<br>

```
delete_stopwords(df)
```
Function that removes stopwords from the "message" column of the pandas DataFrame (df).

<br>


```
deEmojify(text)
```
Removes emojis from a single string; this function is used inside delete_emojies(df) and is optional for direct use.

<br>

```
delete_emojies(df)
```
Function that removes all emojis from the "message" column of the pandas DataFrame (df).

<br>

```
deSigns(text)
```
Removes signs (punctuation and similar characters) from a single string; this function is used inside delete_signs(df) and is optional for direct use.

<br>

```
delete_signs(df)
```
Function that removes all signs from the "message" column of the pandas DataFrame (df).

<br>

```
lemmatization(df)
```
Function that lemmatizes all lines in the "message" column of the pandas DataFrame (lemmatization is the process of reducing different inflected forms of a word to a single base form).

<br>

```
tokenizing(df)
```
Function that creates a new column "tokenized" containing the tokenized form of every line of the "message" column; optional for direct use.
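
Putting the preprocessing helpers together, a typical cleaning pipeline might look like the sketch below. This is an illustration only: it assumes the functions are exposed on `SNPTMT.snptmt` and that each helper returns the updated DataFrame (if they modify `df` in place instead, drop the assignments).

```
import pandas as pd
import SNPTMT.snptmt as snptmt

# Toy message thread; the library expects the text in a "message" column.
df = pd.DataFrame({"message": [
    "Hello everyone! 😀 Any news about the release?",
    "The release is planned for Friday, see the announcement.",
    "Friday release confirmed!!!",
]})

snptmt.download_stopwords()        # fetch the NLTK stopword lists once

df = snptmt.delete_stopwords(df)   # drop stopwords from "message"
df = snptmt.delete_emojies(df)     # strip emojis
df = snptmt.delete_signs(df)       # strip signs / punctuation
df = snptmt.lemmatization(df)      # reduce words to their base forms
df = snptmt.tokenizing(df)         # adds the "tokenized" column
```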

<br>

```
first_clustering(df, start_message, end_message)
```

Function for the very first clustering. It takes three arguments: df (pandas DataFrame), start_message (index of the first message) and end_message (index of the last message). It returns a cluster_dict dictionary in which each key is a cluster index and each value is a list of message indexes, where every stored index equals the actual index minus start_message. In other words, the result of every clustering is bound to the index of the very first message: if the first message had index x, the results of all subsequent clusterings are shifted by x indexes. For the other functions to work correctly, it is not recommended to convert cluster_dict back to actual indexes.
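
A short sketch of how the offset convention reads in practice (toy data and hypothetical cluster contents; cluster_dict itself should stay offset-based, as noted above):

```
import pandas as pd
import SNPTMT.snptmt as snptmt

# Toy, already-preprocessed thread; this batch covers indexes 2..4.
df = pd.DataFrame({"message": ["release friday", "release confirmed",
                               "login bug", "login bug fixed", "new bug report"]})

start_message, end_message = 2, 4
cluster_dict = snptmt.first_clustering(df, start_message, end_message)

# A value such as cluster_dict[0] == [0, 2] would refer to the messages whose
# actual DataFrame indexes are start_message + 0 and start_message + 2.
for cluster_id, members in cluster_dict.items():
    actual_indexes = [start_message + i for i in members]
    print(cluster_id, actual_indexes)   # inspection only; keep cluster_dict offset-based
```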

<br>

```
add_points(df, start_message, end_message, cluster_dict)
```

Function used for all clusterings except the first one. It takes four arguments: df (pandas DataFrame), start_message (index of the first message), end_message (index of the last message) and cluster_dict (the cluster_dict returned by the previous clustering function, either first_clustering() or add_points()). See the end-to-end sketch after remove_outdated_clusters() below.

<br>

```
initialize_cluster_counters(cluster_dict)
```

Function for initializing the cluster_counters variable; it should be called only once, right after the very first clustering (i.e. after first_clustering()).

<br>

```
find_base_clusters(cluster_dict_prev, cluster_dict)
```

Function for finding base clusters for the second clustering in the chain. It uses Intersection over Union between cluster_dict and cluster_dict_prev to find, for the clusters in cluster_dict, their base clusters in cluster_dict_prev. Its output provides the base_clusters argument for remove_outdated_clusters().

<br>

```
remove_outdated_clusters(cluster_dict, cluster_dict_prev, base_clusters, cluster_counters, threshold, added_points)
```

Removes outdated clusters from the cluster dictionary. A cluster is considered outdated if no new elements have been added to it during the period while its counter stays at or below the threshold. The counter increases by (1 - 1/number_of_added_points) every time no points are added to a specific cluster, and is reset to 0 when points are added.

Parameters:
- cluster_dict (dict): dictionary of clusters from the last clustering.
- cluster_dict_prev (dict): dictionary of clusters from the previous clustering.
- base_clusters (dict): dictionary of base clusters of cluster_dict taken from cluster_dict_prev.
- cluster_counters (dict): counter for every cluster.
- added_points (int): number of added points.
- threshold (int): parameter that determines how long a cluster should live. Defaults to 1.

Returns:
- cluster_dict (dict): updated cluster dictionary.
- last_updated (dict): updated cluster_counters dictionary.
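
For reference, the intended call order of the clustering functions can be sketched as below. It continues from the preprocessed `df` in the earlier sketch; the batch size, the threshold value, the fixed start_message and the assumption that initialize_cluster_counters() returns the counters dictionary are illustrative choices, not part of the API description above.

```
import SNPTMT.snptmt as snptmt

# df: preprocessed DataFrame with a "message" column (see the preprocessing sketch above).
batch = 50       # hypothetical number of new messages per clustering step
threshold = 1    # default cluster-lifetime parameter of remove_outdated_clusters()

# 1. Very first clustering plus the one-time counter initialization.
start_message = 0
cluster_dict = snptmt.first_clustering(df, start_message, batch)
cluster_counters = snptmt.initialize_cluster_counters(cluster_dict)

# 2. Every later batch: add the new points, find base clusters and remove
#    clusters that received nothing new for too long. start_message stays
#    fixed so the results remain bound to the very first message's index.
for end_message in range(2 * batch, len(df), batch):
    cluster_dict_prev = cluster_dict
    cluster_dict = snptmt.add_points(df, start_message, end_message, cluster_dict_prev)

    base_clusters = snptmt.find_base_clusters(cluster_dict_prev, cluster_dict)
    cluster_dict, cluster_counters = snptmt.remove_outdated_clusters(
        cluster_dict, cluster_dict_prev, base_clusters,
        cluster_counters, threshold, batch)   # last argument: added_points
```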

            
