anonym

Name: anonym
Version: 0.1.1
Home page: https://gitlab.com/datainnovatielab/public/anonym
Summary: Python package anonym
Upload time: 2023-11-25 20:39:43
Author: Erdogan Taskesen
Requires Python: >=3
Requirements: No requirements were recorded.

# anonym

[![Python](https://img.shields.io/pypi/pyversions/anonym)](https://img.shields.io/pypi/pyversions/anonym)
[![Pypi](https://img.shields.io/pypi/v/anonym)](https://pypi.org/project/anonym/)
[![Docs](https://img.shields.io/badge/Sphinx-Docs-Green)](https://anonym-datainnovatielab-public-3ae525a7078644e2013f2d5d2c9a0825.gitlab.io/)
[![Downloads](https://static.pepy.tech/personalized-badge/anonym?period=month&units=international_system&left_color=grey&right_color=brightgreen&left_text=PyPI%20downloads/month)](https://pepy.tech/project/anonym)
[![Downloads](https://static.pepy.tech/personalized-badge/anonym?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=Downloads)](https://pepy.tech/project/anonym)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://gitlab.com/datainnovatielab/public/anonym/-/blob/main/LICENSE)
[![Issues](https://img.shields.io/badge/issues-you_like-yellow)](https://gitlab.com/datainnovatielab/public/anonym/-/issues)
[![Project Status](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)

* The ``anonym`` library is designed to anonymize sensitive data in Python, allowing users to work with, share, or publish their data without compromising privacy or violating data protection regulations. It uses Named Entity Recognition (NER) from ``spacy`` to identify sensitive information in the data. Once identified, the library leverages the ``faker`` library to generate fake but realistic replacements. Depending on the type of sensitive information (like names, addresses, dates), corresponding faker methods are used, ensuring the anonymized data maintains a similar structure and format to the original, making it suitable for further data analysis or testing.

* The ``anonym`` algorithm is designed to anonymize data in a DataFrame. It works by replacing real data with fake data, while maintaining the structure and format of the original data. Here's a step-by-step explanation of how it works:

**1. Initialization**: The anonym class is initialized with a language parameter (default is 'dutch') and a verbosity level (default is 'info'). The language parameter is used to load the appropriate language model for named entity recognition (NER), and the verbosity level sets the logger's verbosity.

**2. Data Import**: The import_data method is used to import a dataset from a given file path. The data is read into a pandas DataFrame.

**3. Data Anonymization**: The anonymize method is the core of the algorithm. It takes a DataFrame and optional parameters for specifying columns to fake or not to fake, and a NER blacklist. The method works in three sub-steps:

**3a. The extract_entities function** is called to extract all entities from the DataFrame. This function uses the ``spacy`` library's NER capabilities to identify entities in the data. If a column is specified in the fakeit parameter, the entities in that column are replaced with the specified fake replacement. If a column is specified in the do_not_fake parameter, it is left untouched. Otherwise, NER is performed on each row of the column.

**3b. The generate_fake_labels function** is then called to generate fake labels for the extracted entities. This function uses the ``faker`` library to generate fake data that matches the type of the original data (e.g., names, companies, dates, cities).

**3c. The replace_label_with_fake function** is then used to replace the original entities in the DataFrame with the generated fake labels.

**4. Data Export**: The to_csv method is used to write the anonymized DataFrame to a CSV file.

**5. Example Data Import**: The import_example method is used to import example datasets from a GitHub source or a specified URL.
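
The extract, generate, and replace steps described above can be illustrated with a small standard-library sketch. This is not the library's implementation: a regular expression stands in for the ``spacy`` NER model, numbered placeholders stand in for ``faker`` output, and the `toy_*` function names and the sample rows are hypothetical. It only shows the idea of mapping each detected entity to one consistent replacement while skipping columns that should not be faked.

```python
import re
from itertools import count

# Stand-in for spaCy NER (assumption): two consecutive capitalized words
# are treated as a person name. Real NER is far more robust than this.
PERSON_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def toy_extract_entities(rows, do_not_fake=()):
    """Collect the distinct person-like strings found in the text columns."""
    entities = []
    for row in rows:
        for column, value in row.items():
            if column in do_not_fake or not isinstance(value, str):
                continue
            for match in PERSON_PATTERN.findall(value):
                if match not in entities:
                    entities.append(match)
    return entities


def toy_generate_fake_labels(entities):
    """Map every entity to exactly one replacement (stand-in for faker)."""
    counter = count(1)
    return {entity: f"Person_{next(counter)}" for entity in entities}


def toy_replace_label_with_fake(rows, mapping, do_not_fake=()):
    """Return new rows in which every known entity is swapped for its fake."""
    anonymized = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column not in do_not_fake and isinstance(value, str):
                for real, fake in mapping.items():
                    value = value.replace(real, fake)
            new_row[column] = value
        anonymized.append(new_row)
    return anonymized


# Hypothetical sample data; the "city" column is excluded from faking.
rows = [
    {"id": 1, "note": "Anna Jansen called Pieter Bakker.", "city": "Den Haag"},
    {"id": 2, "note": "Pieter Bakker replied.", "city": "Den Haag"},
]
skip = ("city",)

entities = toy_extract_entities(rows, do_not_fake=skip)
mapping = toy_generate_fake_labels(entities)
anonymized = toy_replace_label_with_fake(rows, mapping, do_not_fake=skip)

print(anonymized[0]["note"])  # Person_1 called Person_2.
print(anonymized[1]["note"])  # Person_2 replied.
```

Because the mapping is built once and reused for every row, the same original name always receives the same replacement, which keeps relationships between rows intact.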


	Start
	  |
	  v
	Initialize `anonym` class
	  |
	  v
	Import data using `import_data` method
	  |
	  v
	Anonymize data using `anonymize` method
	  |         |
	  |         v
	  |     Extract entities using `extract_entities` function
	  |         |
	  |         v
	  |     Generate fake labels using `generate_fake_labels` function
	  |         |
	  |         v
	  |     Replace original labels with fake ones using `replace_label_with_fake` function
	  v
	Export anonymized data using `to_csv` method
	  |
	  v
	End

The algorithm also includes several utility functions for text cleaning, preprocessing, filtering values, checking the ``spacy`` model, and setting the logger. The main function at the end of the script demonstrates how to use the anonym class to import an example dataset, anonymize it, and plot the results.
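
The README does not show these utility functions, so the following is only a guess at what text cleaning and value filtering with a NER blacklist might look like. The function names, the `min_length` parameter, and the sample entities are hypothetical; the labels (`PERSON`, `CARDINAL`, `GPE`, `ORG`) are assumed to follow the label scheme used by spaCy's English pipelines.

```python
import re


def clean_text(value):
    """Collapse runs of whitespace into one space and strip both ends."""
    return re.sub(r"\s+", " ", value).strip()


def filter_entities(entities, blacklist=("CARDINAL",), min_length=2):
    """Keep (text, label) pairs whose label is not blacklisted
    and whose cleaned text has at least `min_length` characters."""
    kept = []
    for text, label in entities:
        text = clean_text(text)
        if label in blacklist or len(text) < min_length:
            continue
        kept.append((text, label))
    return kept


# Hypothetical NER output as (text, label) pairs.
found = [
    ("  Anna   Jansen ", "PERSON"),
    ("3", "CARDINAL"),
    ("Utrecht", "GPE"),
    ("A", "ORG"),
]
result = filter_entities(found)
print(result)  # [('Anna Jansen', 'PERSON'), ('Utrecht', 'GPE')]
```

Blacklisting a label such as `CARDINAL` keeps plain numbers from being replaced, which matters when numeric columns should stay usable for analysis.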


## Documentation

* [**anonym documentation pages (Sphinx)**](https://anonym-datainnovatielab-public-3ae525a7078644e2013f2d5d2c9a0825.gitlab.io/)


## Contents
- [Installation](#installation)
- [Contribute](#contribute)
- [Citation](#citation)
- [License](#licence)

## Installation
* Install anonym from PyPI (recommended). anonym is compatible with Python 3.6+ and runs on Linux, macOS and Windows.
* A new environment can be created as follows:

```bash
conda create -n env_anonym python=3.10
conda activate env_anonym
```

```bash
pip install anonym            # normal install
pip install --upgrade anonym # or update if needed
```

* Alternatively, you can install from the GitLab source:
```bash
# Install directly from the GitLab source
pip install git+https://gitlab.com/datainnovatielab/public/anonym.git
# Or pin a specific tag (here 0.1.0)
pip install git+https://gitlab.com/datainnovatielab/public/anonym.git@0.1.0

# By cloning
git clone https://gitlab.com/datainnovatielab/public/anonym.git
cd anonym
pip install -U .
```  

### Import anonym package
```python
import anonym
```

### Example
```python
# Load library
from anonym import anonym
# Initialize
model = anonym(language='english', verbose='info')
# Import example data set
df = model.import_example('titanic')
# Anonymize the data set
df_fake = model.anonymize(df)
```
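
After anonymizing, it can be useful to check how much of each column actually changed. The helper below is a standard-library sketch of such a check, not part of anonym; `changed_fraction` and the sample rows are hypothetical, and tables are represented as lists of dictionaries rather than DataFrames to keep the example dependency-free.

```python
def changed_fraction(original, anonymized):
    """Per column, return the share of rows whose value differs."""
    if len(original) != len(anonymized):
        raise ValueError("both tables must have the same number of rows")
    if not original:
        return {}
    fractions = {}
    for column in original[0]:
        changed = sum(
            1
            for before, after in zip(original, anonymized)
            if before[column] != after[column]
        )
        fractions[column] = changed / len(original)
    return fractions


# Made-up rows for illustration only.
original = [
    {"Name": "Anna Jansen", "Age": 22},
    {"Name": "Pieter Bakker", "Age": 26},
]
anonymized = [
    {"Name": "Person_1", "Age": 22},
    {"Name": "Pieter Bakker", "Age": 26},
]

report = changed_fraction(original, anonymized)
print(report)  # {'Name': 0.5, 'Age': 0.0}
```

A column with a fraction of 0.0 was left untouched, which is expected for columns excluded from faking and a warning sign for columns that hold sensitive values.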


### References
* https://gitlab.com/datainnovatielab/public/anonym

### Citation
Please cite anonym in your publications if it is useful for your research (see citation).
   
### Contribute
* All kinds of contributions are welcome!

### Licence
See [LICENSE](LICENSE) for details.

            
