kanonym4text

Name	kanonym4text
Version	0.2.7
home_page	https://github.com/neumanh/K-anonymity-fot-texts
Summary	k-anonymity for texts
upload_time	2023-07-23 12:23:13
maintainer
docs_url	None
author	Lior Trieman, Hadas Neuman
requires_python
license
keywords	python k-anonymity privacy nlp
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# kanonym4text: K-anonymity guarantee for texts



K-Anonymity for text (kanonym4text for short) is an open-source library that receives a corpus from the user (as a *dataframe*) and returns a k-anonymized corpus, along with additional information about the anonymization process performed.
kanonym4text is designed to be easy to use and to **guarantee** anonymization at a level predefined by the user (k), while still preserving some of the text's utility.
This repo and package are part of our Y-Data data science final project, and we would love to hear your feedback and learn from it!

## Overview
In this project, we aim to apply data science techniques to anonymize textual data while preserving its utility. K-anonymity is a technique used to ensure that an individual in a dataset cannot be identified by linking their attributes to external information; it does so by forcing each row to be identical to k-1 other rows. The anonymized data can be used for various purposes, such as data sharing, research, and analysis, without compromising privacy. We plan on creating a novel algorithm for k-anonymity that specifically addresses unstructured data items, such as texts, using various NLP techniques, from classical to modern DL-based solutions, and on testing the utility of the anonymized data.
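To make the definition concrete, here is a minimal sketch of a k-anonymity check. It assumes each document has already been reduced to a hashable row (here, a tuple of words); the helper name `is_k_anonymous` is hypothetical and is not part of the kanonym4text API.

```python
from collections import Counter

def is_k_anonymous(rows, k):
    """Return True if every distinct row appears at least k times,
    i.e. each row is identical to at least k-1 other rows."""
    counts = Counter(rows)
    return all(count >= k for count in counts.values())

# Toy corpus: each document is represented as a tuple of words (an assumption).
rows = [
    ("good", "product"),
    ("good", "product"),
    ("bad", "service"),
]
print(is_k_anonymous(rows, 2))  # False: ("bad", "service") appears only once

rows.append(("bad", "service"))
print(is_k_anonymous(rows, 2))  # True: every row now has a twin
```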
We have tested the library on two main datasets:
1) Amazon dataset,
2) Enron emails dataset

We show it is able to generate anonymized corpora in both cases.

## Package High-Level Algorithm:

1) Data Preprocessing: tokenization, lemmatization, and stop word removal.
2) K-Anonymization: Generalization and Reduction.
3) Evaluation: utility is evaluated by embedding distance and semantic scores.
4) Visualization (optional): dataset properties before and after anonymization are visualized.
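As a rough illustration of step 1, the sketch below lowercases a text, tokenizes it, and drops stop words. It is a simplified stand-in, not the package's actual preprocessing: the `preprocess` function and the tiny `STOP_WORDS` set are hypothetical, and lemmatization is omitted.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, tokenize on runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_words]

print(preprocess("The battery of the phone is great"))
# ['battery', 'phone', 'great']
```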




# Getting Started
You can get started with kanonym4text immediately by installing it with pip:

## Download the package

```
pip install kanonym4text
```

## Step 1 - create a data frame from your corpus
The code receives a data frame containing the corpus in the following format:
"txt" - the column that holds the texts (default column name)
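For instance, a corpus held in a Python list can be wrapped in a data frame with the default column name. The sample texts below are made up for illustration.

```python
import pandas as pd

# Made-up sample texts; replace them with your own corpus.
texts = [
    "The delivery was fast and the product works well.",
    "Product arrived quickly and works fine.",
    "Customer service never answered my emails.",
]

# One row per document, stored under the default "txt" column.
df = pd.DataFrame({"txt": texts})
print(df.shape)  # (3, 1)
```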

## Step 2 - import the following and initialize the object
Use a [gensim word embedding model](https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models) of your choice (default - *glove-twitter-25*).
```
from kanonym4text import Kanonym
kan = Kanonym()
```

## Step 3 - read the df:
```
import pandas as pd
df = pd.read_csv('YOUR_FILE.csv')
```

## Step 4 - run the anonymization process on your corpus
```
dfa, dist = kan.anonymize(df, k=2)
```

**Running instructions:**
The main function is called *anonymize*. 

Its input parameters are:
---------------------------

* `df` - Input Dataframe
* `k` - The anonymity level: each anonymized text is made identical to at least k-1 others
* `col` - The column in df that holds the text to anonymize. Default - txt
* `num_stop` - Number of stop words to use. Default - 1000
* `num_jobs` - Number of CPUs to utilize. Default - 1. Use -1 for all CPUs.
* `verbose` - Output text level. Default - 0.

The function outputs are:    
---------------------------
A tuple with these items:
1. A dataframe - the same df the user sent with additional columns:
> 1. general_txt - text after "generalization"
> 2. anon_txt_history - changes performed on the text during the anonymization process:    
>       [] - replaced     
>       {} - lemmatized     
>       () - protected word (stop word)     
> 3. anon_txt - the resulting anonymized text
> 4. neighbors - indices of the k neighbors (BoW anonymized)

2. The average cosine distance between the sentence embeddings of the documents before and after anonymization
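
The second output can be pictured with a small sketch. It assumes each document has one embedding vector before and one after anonymization, and it averages the pairwise cosine distances; the toy vectors and both function names are hypothetical, and the package's own embedding model and aggregation may differ.

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_cosine_distance(before, after):
    """Average the distance between each document's two embeddings."""
    distances = [cosine_distance(u, v) for u, v in zip(before, after)]
    return sum(distances) / len(distances)

# Toy 2-D vectors standing in for sentence embeddings (an assumption).
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[1.0, 0.0], [1.0, 0.0]]
print(mean_cosine_distance(before, after))  # 0.5
```

A value near 0 means the anonymized documents stay close to the originals in embedding space; larger values indicate more semantic drift.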

# Running Example

Use the following link to run some examples of the package on your own dataframe:

https://colab.research.google.com/drive/1eMSSvBxtsNFMOvKrUXgbsx1g3KOQD56s#scrollTo=ci2qjboGCt0A&uniqifier=1

Or use the following code:
```
from kanonym4text import Kanonym
import pandas as pd

df = pd.read_csv('YOUR_FILE.csv')

k = 2  # anonymity level chosen by the user
kan = Kanonym('glove-twitter-25')  # create the anonymizer with a gensim embedding model
dfa, dist = kan.anonymize(df, k=k, n_jobs=-1, plot=True)
```

# Support

## Create a Bug Report
If you see an error message or run into an issue, please create a bug report. This effort is valued and helps all users.

## Submit a Feature Request
If you have an idea, or you're missing a capability that would make development easier and more robust, please submit a feature request.

If a similar feature request already exists, don't forget to leave a "+1". If you add some more information such as your thoughts and vision about the feature, your comments will be embraced warmly :)

# Contributing

kanonym4text is an open-source project. We are committed to a fully transparent development process and highly appreciate any contributions. Whether you are helping us fix bugs, proposing new features, improving our documentation, or spreading the word - we would love to have you!

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/neumanh/K-anonymity-fot-texts",
    "name": "kanonym4text",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "python,k-anonymity,privacy,NLP",
    "author": "Lior Trieman, Hadas Neuman",
    "author_email": "liortr30@gmail.com, hadas.doron@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/eb/0f/1c0e25ae4c1a40ddab6c04874d619b8199b8425642e1d35cd01399c70fa6/kanonym4text-0.2.7.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "k-anonymity for texts",
    "version": "0.2.7",
    "project_urls": {
        "Download": "https://github.com/neumanh/K-anonymity-fot-texts/archive/refs/tags/0.2.6.tar.gz",
        "Homepage": "https://github.com/neumanh/K-anonymity-fot-texts"
    },
    "split_keywords": [
        "python",
        "k-anonymity",
        "privacy",
        "nlp"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a014b2a2a79541dbe8532825228dd450646315f13e7319d6971ab34a7007272f",
                "md5": "8d2b4bfa8d987c4e888b9ad2f8c934cb",
                "sha256": "053cf8a4ae714cae323626d262d79594fde4b98407e80d4a29cecca61c04dd58"
            },
            "downloads": -1,
            "filename": "kanonym4text-0.2.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8d2b4bfa8d987c4e888b9ad2f8c934cb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 40590,
            "upload_time": "2023-07-23T12:23:11",
            "upload_time_iso_8601": "2023-07-23T12:23:11.001982Z",
            "url": "https://files.pythonhosted.org/packages/a0/14/b2a2a79541dbe8532825228dd450646315f13e7319d6971ab34a7007272f/kanonym4text-0.2.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "eb0f1c0e25ae4c1a40ddab6c04874d619b8199b8425642e1d35cd01399c70fa6",
                "md5": "791529ace7eff434924d0f479dd850e5",
                "sha256": "6d07f44c258f5cfb534d5c056b885f57394d4bb038bbbc28c72aa0870836ae3b"
            },
            "downloads": -1,
            "filename": "kanonym4text-0.2.7.tar.gz",
            "has_sig": false,
            "md5_digest": "791529ace7eff434924d0f479dd850e5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 40461,
            "upload_time": "2023-07-23T12:23:13",
            "upload_time_iso_8601": "2023-07-23T12:23:13.274289Z",
            "url": "https://files.pythonhosted.org/packages/eb/0f/1c0e25ae4c1a40ddab6c04874d619b8199b8425642e1d35cd01399c70fa6/kanonym4text-0.2.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-23 12:23:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "neumanh",
    "github_project": "K-anonymity-fot-texts",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "kanonym4text"
}
