# KafkaAnonymizer
## What is it?
The KafkaAnonymizer, powered by the Qurix Dataframe Anonymizer, is a Python package for anonymizing data within Kafka streams. It helps you achieve data privacy compliance and protect sensitive information in real-time data pipelines.
## Main Features
1. Descriptive Statistics:
Generate comprehensive descriptive statistics for each column, including mean, min, max, frequency, unique values, and data type.
2. Anonymization Techniques:
Anonymize diverse data types such as float, int, string, and date columns based on statistical properties like mean, standard deviation, count, min, and max values.
3. Configurability:
Customize the anonymization process using white and blacklists, providing fine-grained control over which columns to include or exclude from anonymization.
4. String Anonymization Providers:
Support various string anonymization strategies, including generic text, gender, addresses, and person names.
Specify preferred string anonymization providers for each string column.
5. Dataframe Anonymization:
Anonymize entire DataFrames using the `anonymize_dataframe` method.
Choose specific columns for anonymization through white and blacklists.
6. Randomization and Shuffling:
Utilize randomization techniques to generate synthetic data, ensuring representative yet anonymized information.
Implement shuffling mechanisms for randomizing string values during the anonymization process.
7. Data Type Handling:
Handle different data types (float, int, object, datetime) with dedicated anonymization logic for each type.
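
To make the statistics-based and shuffling ideas above concrete, here is a minimal standard-library sketch. It is not the package's implementation: the helper names `synthesize_numeric` and `shuffle_strings` are hypothetical, and the choice of a normal distribution clipped to the observed range is an assumption made purely for illustration.

```python
import random
import statistics


def synthesize_numeric(values, rng):
    """Draw synthetic floats from a normal distribution fitted to `values`.

    Each draw is clipped to the observed min/max so the original range is kept.
    (Hypothetical helper; the package may use a different distribution.)
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    low, high = min(values), max(values)
    return [min(max(rng.gauss(mean, std), low), high) for _ in values]


def shuffle_strings(values, rng):
    """Return the same strings in random order, breaking the link to their rows."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
ages = [22.0, 38.0, 26.0, 35.0, 54.0]
names = ["Anna", "Ben", "Cara", "Dev", "Eli"]

synthetic_ages = synthesize_numeric(ages, rng)
shuffled_names = shuffle_strings(names, rng)
```

Note that shuffling keeps the original values in the column and only changes which row they belong to, so it is usually combined with a replacement strategy for identifying strings such as names.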
## Requirements
- `confluent-kafka`
You can install this dependency manually or use the provided `requirement.txt` file in the repository.
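
As a rough illustration of where `confluent-kafka` could fit, the sketch below reads a small batch of JSON messages, passes the decoded rows through an anonymization callable, and writes the result to an output topic. Everything here is an assumption rather than the package's API: the function names `decode_batch` and `run_once`, the JSON message format, and the topic names are hypothetical. The consumer and producer are passed in, so the function only relies on the `poll`, `produce`, and `flush` methods that `confluent_kafka.Consumer` and `confluent_kafka.Producer` provide.

```python
import json
from typing import Callable, Iterable


def decode_batch(raw_messages: Iterable[bytes]) -> list[dict]:
    """Decode UTF-8 JSON payloads into row dictionaries (assumed message format)."""
    return [json.loads(raw.decode("utf-8")) for raw in raw_messages]


def run_once(consumer, producer, out_topic: str,
             anonymize: Callable[[list[dict]], list[dict]],
             batch_size: int = 100, timeout: float = 1.0) -> int:
    """Poll up to `batch_size` messages, anonymize them, and publish the result."""
    raw = []
    while len(raw) < batch_size:
        msg = consumer.poll(timeout)
        if msg is None:      # nothing arrived within the timeout
            break
        if msg.error():      # skip error events in this sketch
            continue
        raw.append(msg.value())

    rows = anonymize(decode_batch(raw))
    for row in rows:
        producer.produce(out_topic, value=json.dumps(row).encode("utf-8"))
    producer.flush()
    return len(rows)


# Possible wiring (assumed broker address, group id, and topic names):
#
#   from confluent_kafka import Consumer, Producer
#   consumer = Consumer({"bootstrap.servers": "localhost:9092",
#                        "group.id": "anonymizer",
#                        "auto.offset.reset": "earliest"})
#   consumer.subscribe(["raw-events"])
#   producer = Producer({"bootstrap.servers": "localhost:9092"})
#   run_once(consumer, producer, "anonymized-events", anonymize=my_anonymize)
```

One way to write the `anonymize` callable is to build a `pandas.DataFrame` from the rows, call `anonymize_dataframe` on it, and convert the result back with `to_dict(orient="records")`.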
## Installation
1. Create a New Virtual Environment (named `venv` in this case):
```bash
python3 -m venv venv
```
2. Activate the Virtual Environment:
```bash
source venv/bin/activate
```
3. Install the Package:
To install the `qurix-dataframe-anonymizer` package, use `pip`:
```bash
pip install qurix-dataframe-anonymizer
```
## Usage
### Dataframe anonymizer
Anonymize dataframes using the DataframeAnonymizer class:
```python
import pandas as pd
from qurix.dataframe.anonymizer import DataframeAnonymizer, AnonymizeStrProvider
df = pd.read_csv("<my_csv_file.csv>")
anonymizer = DataframeAnonymizer()
df_anonymized = anonymizer.anonymize_dataframe(df)
df_anonymized.head()
# Dictionary mapping a column to a specific string anonymization provider, e.g. GENDER, PERSON_NAME
anonymize_str_map = {
"Sex": AnonymizeStrProvider.GENDER,
"Name": AnonymizeStrProvider.PERSON_NAME
}
# Anonymize
df_anonymized = anonymizer.anonymize_dataframe(df, anonymize_str_map)
df_anonymized.head()
# For more advanced usage and customization, explore additional parameters
# of the anonymize_dataframe method, such as white and blacklists.
anonymized_df = anonymizer.anonymize_dataframe(df, white_list=["column1"], black_list=["column2"])
# Specify string anonymization providers
anonymized_df = anonymizer.anonymize_dataframe(df, anonymize_str_map={"column3": "gender"})
```
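
The `white_list` and `black_list` parameters above decide which columns are anonymized. The sketch below shows one plausible way such a selection could work. The function name `select_columns` is hypothetical, and the precedence rule (whitelist narrows the candidates first, blacklist then removes columns) is an assumption; the package's actual behaviour may differ.

```python
def select_columns(columns, white_list=None, black_list=None):
    """Return the columns to anonymize, preserving their original order.

    Assumed rule: without a whitelist every column is a candidate;
    blacklisted columns are always excluded.
    """
    if white_list is None:
        candidates = list(columns)
    else:
        allowed = set(white_list)
        candidates = [c for c in columns if c in allowed]
    excluded = set(black_list or [])
    return [c for c in candidates if c not in excluded]


columns = ["Name", "Sex", "Age", "Fare"]
only_some = select_columns(columns, white_list=["Name", "Age"], black_list=["Age"])
everything = select_columns(columns)
```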
## Contact
For any inquiries or questions, feel free to [reach out](https://qurix.tech/about_us.html).
Raw data
{
"_id": null,
"home_page": "https://github.com/qurixtechnology/qurix-dataframe-anonymizer.git",
"name": "qurix-dataframe-anonymizer",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.10, <4",
"maintainer_email": "",
"keywords": "python",
"author": "qurix Technology",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/10/ef/5a175fd0a8eaadacbbad07c50601f63b2593eaad5fad08d000a200ef9933/qurix-dataframe-anonymizer-0.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "qurix dataframe anonymizer for kafka",
"version": "0.2.0",
"project_urls": {
"Homepage": "https://github.com/qurixtechnology/qurix-dataframe-anonymizer.git"
},
"split_keywords": [
"python"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1779c1b0527be4b60333d3fa7cbba1d0e88af24ba54b5efc77b2557cfc2b0ae9",
"md5": "a1bdd47269d9359ec9c33ec1ff4ccf72",
"sha256": "ef7ef82046ee9c35920c0c1db0b275a3e7d31d80300e91222bf34a0e05d9de4a"
},
"downloads": -1,
"filename": "qurix_dataframe_anonymizer-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a1bdd47269d9359ec9c33ec1ff4ccf72",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10, <4",
"size": 6884,
"upload_time": "2023-11-22T11:26:34",
"upload_time_iso_8601": "2023-11-22T11:26:34.753263Z",
"url": "https://files.pythonhosted.org/packages/17/79/c1b0527be4b60333d3fa7cbba1d0e88af24ba54b5efc77b2557cfc2b0ae9/qurix_dataframe_anonymizer-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "10ef5a175fd0a8eaadacbbad07c50601f63b2593eaad5fad08d000a200ef9933",
"md5": "777041dfb890dafa40c3df6c9a3720ff",
"sha256": "e11f52246b153377d901663cb4380b59cb60539b5f1b2b66cc449b09d2fdbca6"
},
"downloads": -1,
"filename": "qurix-dataframe-anonymizer-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "777041dfb890dafa40c3df6c9a3720ff",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10, <4",
"size": 6123,
"upload_time": "2023-11-22T11:26:36",
"upload_time_iso_8601": "2023-11-22T11:26:36.277482Z",
"url": "https://files.pythonhosted.org/packages/10/ef/5a175fd0a8eaadacbbad07c50601f63b2593eaad5fad08d000a200ef9933/qurix-dataframe-anonymizer-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-22 11:26:36",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "qurixtechnology",
"github_project": "qurix-dataframe-anonymizer",
"github_not_found": true,
"lcname": "qurix-dataframe-anonymizer"
}