# Data Anonymizer Script
This project anonymizes sensitive tabular data with configurable, per-column methods built on Polars.
## 📦 Features
- Full masking
- Email masking
- Phone number masking
- Replace with static values
- Replace by substring or dictionary
- Sequential numeric and alphabetical replacement
- Truncation
- Initials extraction
- Age and date generalization
- Random choice substitution
- Fake numeric generation
- Column shuffling
- Date offset
- Conditional anonymization
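
To make a few of the listed methods concrete, here is a minimal plain-Python sketch of email masking, initials extraction, and truncation. It is independent of the package's Polars implementation. The function names mirror method names from the example config, but the exact output formats (for instance, which characters of an email survive masking) are assumptions.

```python
def mask_email(value: str, mask_char: str = "*") -> str:
    """Keep the first character of the local part and the whole domain (assumed format)."""
    local, sep, domain = value.partition("@")
    if not sep or not local:
        # Not a well-formed address: mask every character.
        return mask_char * len(value)
    return local[0] + mask_char * (len(local) - 1) + "@" + domain


def initials_only(value: str) -> str:
    """Reduce a full name to dotted initials (assumed format)."""
    return "".join(part[0].upper() + "." for part in value.split())


def truncate(value: str, length: int) -> str:
    """Keep only the first `length` characters."""
    return value[:length]


print(mask_email("maria@example.com"))   # m****@example.com
print(initials_only("Ana Maria Souza"))  # A.M.S.
print(truncate("Great service", 5))      # Great
```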
## ⚙️ How it works
1. The script reads a CSV file into a Polars DataFrame.
2. It loads a JSON config describing which columns to anonymize and how.
3. Each rule is applied and the resulting DataFrame is written to output.
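
The three steps above can be sketched with the standard library alone. This is an illustration of the flow, not the package's code: the real script works on a Polars DataFrame, and the `METHODS` registry and `anonymize_csv` function are hypothetical names introduced here.

```python
import csv
import io
import json

# Hypothetical registry: method name -> function applied to each cell value.
METHODS = {
    "initials_only": lambda value: "".join(p[0].upper() + "." for p in value.split()),
    "truncate": lambda value, length: value[:length],
}


def anonymize_csv(csv_text: str, config_text: str) -> str:
    # 1. Read the CSV into memory (the real script uses a Polars DataFrame).
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)

    # 2. Load the JSON config describing which columns to anonymize and how.
    config = json.loads(config_text)

    # 3. Apply each rule, then write the result.
    for column, rule in config["columns"].items():
        # A rule is either a bare method name or {"method": ..., "params": {...}}.
        if isinstance(rule, str):
            method, params = rule, {}
        else:
            method, params = rule["method"], rule.get("params", {})
        for row in rows:
            row[column] = METHODS[method](row[column], **params)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


data = "name,comments\nAna Souza,Helpful staff\n"
config = (
    '{"columns": {"name": "initials_only", '
    '"comments": {"method": "truncate", "params": {"length": 5}}}}'
)
print(anonymize_csv(data, config))
```

With this input, the `name` column becomes `A.S.` and `comments` becomes `Helpf`.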
## 🧪 Example Config
```json
{
  "columns": {
    "name": "initials_only",
    "email": "mask_email",
    "phone": "mask_number",
    "cpf": {
      "method": "replace_with_fake",
      "params": {
        "digits": 11
      }
    },
    "username": {
      "method": "replace_by_contains",
      "params": {
        "mapping": {
          "admin": "user",
          "root": "guest"
        }
      }
    },
    "status": {
      "method": "replace_by_dict",
      "params": {
        "mapping": {
          "active": "A",
          "inactive": "I"
        }
      }
    },
    "id_seq": {
      "method": "sequential_numeric",
      "params": {
        "prefix": "ID"
      }
    },
    "ref_code": {
      "method": "sequential_alpha",
      "params": {
        "prefix": "REF"
      }
    },
    "comments": {
      "method": "truncate",
      "params": {
        "length": 5
      }
    },
    "age": "generalize_age",
    "birth_date": {
      "method": "generalize_date",
      "params": {
        "mode": "month_year"
      }
    },
    "state": {
      "method": "random_choice",
      "params": {
        "choices": [
          "SP",
          "RJ",
          "MG",
          "BA"
        ]
      }
    },
    "last_access": {
      "method": "date_offset",
      "params": {
        "min_days": -2,
        "max_days": 2
      }
    },
    "feedback": "shuffle"
  }
}
```
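
The `date_offset` and `generalize_date` entries take parameters whose meaning this README does not spell out. The sketch below shows one plausible reading in plain Python: `date_offset` shifts a date by a random whole number of days between `min_days` and `max_days` (inclusive), and `generalize_date` with `"mode": "month_year"` keeps only the year and month. Both the inclusive bounds and the `YYYY-MM` output format are assumptions, as is the optional `rng` argument added for reproducibility.

```python
import random
from datetime import date, timedelta
from typing import Optional


def date_offset(value: date, min_days: int, max_days: int,
                rng: Optional[random.Random] = None) -> date:
    """Shift a date by a random number of days drawn from [min_days, max_days]."""
    rng = rng or random.Random()
    return value + timedelta(days=rng.randint(min_days, max_days))


def generalize_date(value: date, mode: str = "month_year") -> str:
    """Coarsen a date; only the 'month_year' mode from the example config is sketched."""
    if mode == "month_year":
        return value.strftime("%Y-%m")
    raise ValueError(f"unsupported mode: {mode}")


last_access = date(2025, 8, 2)
shifted = date_offset(last_access, min_days=-2, max_days=2, rng=random.Random(0))
print(abs((shifted - last_access).days) <= 2)  # True
print(generalize_date(date(1990, 5, 17)))      # 1990-05
```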
## 🧠 Conditional Rules
A rule can also be restricted by a `condition` on the value of another column:
```json
"cpf": {
  "method": "replace_with_fake",
  "params": {
    "digits": 11
  },
  "condition": {
    "column": "status",
    "operator": "equals",
    "value": "active"
  }
}
```
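
A minimal plain-Python sketch of how such a rule might behave: `cpf` is replaced only in rows where `status` equals `"active"`. It assumes that rows failing the condition are left untouched and handles only the `equals` operator. The helper names and the `rng` argument are hypothetical.

```python
import random


def replace_with_fake(digits: int, rng: random.Random) -> str:
    """Generate a random numeric string with exactly `digits` digits."""
    return "".join(rng.choice("0123456789") for _ in range(digits))


def apply_conditionally(rows, column, digits, condition, rng):
    """Replace `column` only in rows where the condition column equals the given value."""
    for row in rows:
        if row[condition["column"]] == condition["value"]:
            row[column] = replace_with_fake(digits, rng)
    return rows


rows = [
    {"cpf": "12345678901", "status": "active"},
    {"cpf": "98765432100", "status": "inactive"},
]
condition = {"column": "status", "operator": "equals", "value": "active"}
apply_conditionally(rows, "cpf", 11, condition, random.Random(42))

print(len(rows[0]["cpf"]))  # 11 (replaced with a fake 11-digit string)
print(rows[1]["cpf"])       # 98765432100 (unchanged: status is not "active")
```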
## ⚖️ Supported Condition Operators
| Operator | Description |
|----------------|----------------------------------------|
| equals | Equal to |
| not_equals | Not equal to |
| in | Value in list |
| not_in | Value not in list |
| gt | Greater than |
| gte | Greater than or equal to |
| lt | Less than |
| lte | Less than or equal to |
| contains | Substring exists in string |
| not_contains | Substring does not exist in string |
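
One way to read the table is as a lookup from operator name to a two-argument predicate. The sketch below expresses that in plain Python; the `OPERATORS` dictionary and the `matches` helper are hypothetical and are not taken from the package, which presumably builds Polars expressions instead.

```python
import operator

# Operator name -> predicate taking (cell value, configured value).
OPERATORS = {
    "equals": operator.eq,
    "not_equals": operator.ne,
    "in": lambda cell, values: cell in values,
    "not_in": lambda cell, values: cell not in values,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
    "contains": lambda cell, sub: sub in cell,
    "not_contains": lambda cell, sub: sub not in cell,
}


def matches(row: dict, condition: dict) -> bool:
    """Evaluate a condition object against a single row."""
    check = OPERATORS[condition["operator"]]
    return check(row[condition["column"]], condition["value"])


print(matches({"age": 42}, {"column": "age", "operator": "gte", "value": 18}))  # True
print(matches({"state": "SP"}, {"column": "state", "operator": "in", "value": ["SP", "RJ"]}))  # True
print(matches({"email": "a@corp.com"}, {"column": "email", "operator": "not_contains", "value": "corp"}))  # False
```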
## 📁 Project Structure
```
.
├── main.py              # Entry point to run anonymization
├── anonymizer.py        # Core logic for applying anonymization rules
├── config.json          # Example configuration file
├── sensitive_data.csv   # Input file to be anonymized
├── README.md            # Project documentation
└── requirements.txt     # Project dependencies
```
## 🛠️ Requirements
- Python 3.12+
- [Polars](https://pola.rs/) >= 1.31.0

Create a virtual environment and install the dependencies:
```bash
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
```
## 🚀 Run the script
```bash
python main.py
```
Before running, update the paths to the input CSV and the config JSON as needed.
## 🔮 Possible Future Features
- Hashing support for specific fields
- Redaction rules using regex
- Support for nested or JSON-style fields
- CLI interface with rich options
- Parallel processing for large datasets
## Raw data
{
"_id": null,
"home_page": null,
"name": "cloakdata",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": "anonymization, data privacy, polars, etl, data masking",
"author": "Jeferson Peter",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/c6/a1/f95b2a65cdc67790217338dc146a046d7bb95990d89a0d6b3097481b5f5f/cloakdata-1.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A lightweight library for anonymizing tabular datasets using Polars",
"version": "1.0.0",
"project_urls": null,
"split_keywords": [
"anonymization",
" data privacy",
" polars",
" etl",
" data masking"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "3b81a348bd743204012aa7325ecdd29e083c05342dd908e3c02562e6aaf6882f",
"md5": "86a11d01b893bcf34692239e3ed2c5b2",
"sha256": "284210649bca294eff6bf70e07965afcf1e5eb1e85c94dfe3d4598d5618b17d4"
},
"downloads": -1,
"filename": "cloakdata-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "86a11d01b893bcf34692239e3ed2c5b2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12",
"size": 9918,
"upload_time": "2025-08-02T00:04:25",
"upload_time_iso_8601": "2025-08-02T00:04:25.995923Z",
"url": "https://files.pythonhosted.org/packages/3b/81/a348bd743204012aa7325ecdd29e083c05342dd908e3c02562e6aaf6882f/cloakdata-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "c6a1f95b2a65cdc67790217338dc146a046d7bb95990d89a0d6b3097481b5f5f",
"md5": "f287a8a1dd70908554fd73c5a7562fdb",
"sha256": "4082bd6666ad15b4a05c3cb30e93869185ff918bf153c9c9d41cbb41ba5c420e"
},
"downloads": -1,
"filename": "cloakdata-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "f287a8a1dd70908554fd73c5a7562fdb",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12",
"size": 10615,
"upload_time": "2025-08-02T00:04:27",
"upload_time_iso_8601": "2025-08-02T00:04:27.255290Z",
"url": "https://files.pythonhosted.org/packages/c6/a1/f95b2a65cdc67790217338dc146a046d7bb95990d89a0d6b3097481b5f5f/cloakdata-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-02 00:04:27",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "cloakdata"
}