cloakdata

Name	cloakdata
Version	1.0.0
home_page	None
Summary	A lightweight library for anonymizing tabular datasets using Polars
upload_time	2025-08-02 00:04:27
maintainer	None
docs_url	None
author	Jeferson Peter
requires_python	>=3.12
license	None
keywords	anonymization data privacy polars etl data masking

# Data Anonymizer Script

This project anonymizes sensitive tabular data with configurable, per-column methods built on Polars.

## 📦 Features

- Full masking
- Email masking
- Phone number masking
- Replace with static values
- Replace by substring or dictionary
- Sequential numeric and alphabetical replacement
- Truncation
- Initials extraction
- Age and date generalization
- Random choice substitution
- Fake numeric generation
- Column shuffling
- Date offset
- Conditional anonymization
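
To make a few of these transformations concrete, here is a minimal plain-Python sketch of email masking, initials extraction, and truncation. The function bodies and the exact output formats (for example, how much of an email address stays visible) are illustrative assumptions; the library's own Polars-based implementation may behave differently.

```python
def mask_email(value: str, mask_char: str = "*") -> str:
    """Keep the first character of the local part and the domain; mask the rest."""
    local, sep, domain = value.partition("@")
    if not sep or not local:
        # Not a recognizable email address: mask every character.
        return mask_char * len(value)
    return local[0] + mask_char * (len(local) - 1) + "@" + domain


def initials_only(value: str) -> str:
    """Reduce a full name to the upper-cased initial of each word."""
    return "".join(word[0].upper() + "." for word in value.split())


def truncate(value: str, length: int) -> str:
    """Keep only the first `length` characters."""
    return value[:length]


print(mask_email("maria.silva@example.com"))       # m**********@example.com
print(initials_only("Maria da Silva"))             # M.D.S.
print(truncate("Customer asked for a refund", 5))  # Custo
```

Each function maps one cell value to one anonymized value, which is the shape a per-column rule needs.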

## ⚙️ How it works

1. The script reads a CSV file into a Polars DataFrame.
2. It loads a JSON config describing which columns to anonymize and how.
3. Each rule is applied and the resulting DataFrame is written to output.
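
The three steps above can be sketched end to end. To stay dependency-free, this sketch swaps the Polars DataFrame for the standard-library `csv` module, so it shows the flow rather than the library's actual code. The `METHODS` registry, the `full_mask` method name, and the `anonymize_csv` function are hypothetical names introduced for illustration.

```python
import csv
import io
import json

# Hypothetical rule implementations, keyed by the method name used in the config.
METHODS = {
    "full_mask": lambda value, **_: "*" * len(value),
    "truncate": lambda value, length=3, **_: value[:length],
}


def anonymize_csv(csv_text: str, config_text: str) -> str:
    # Step 1: read the CSV (the library reads into a Polars DataFrame here).
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # Step 2: load the JSON config that maps columns to rules.
    config = json.loads(config_text)
    # Step 3: apply each rule to its column, then write the result.
    for column, rule in config["columns"].items():
        if isinstance(rule, str):
            rule = {"method": rule}  # expand the string shorthand
        func = METHODS[rule["method"]]
        params = rule.get("params", {})
        for row in rows:
            row[column] = func(row[column], **params)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


csv_text = "name,comments\nAna,Excellent service\n"
config_text = json.dumps({
    "columns": {
        "name": "full_mask",
        "comments": {"method": "truncate", "params": {"length": 5}},
    }
})
print(anonymize_csv(csv_text, config_text))
```

Keeping the rules in a dictionary keyed by method name means a new method only needs one new entry, with no change to the loop.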

## 🧪 Example Config

```json
{
  "columns": {
    "name": "initials_only",
    "email": "mask_email",
    "phone": "mask_number",
    "cpf": {
      "method": "replace_with_fake",
      "params": {
        "digits": 11
      }
    },
    "username": {
      "method": "replace_by_contains",
      "params": {
        "mapping": {
          "admin": "user",
          "root": "guest"
        }
      }
    },
    "status": {
      "method": "replace_by_dict",
      "params": {
        "mapping": {
          "active": "A",
          "inactive": "I"
        }
      }
    },
    "id_seq": {
      "method": "sequential_numeric",
      "params": {
        "prefix": "ID"
      }
    },
    "ref_code": {
      "method": "sequential_alpha",
      "params": {
        "prefix": "REF"
      }
    },
    "comments": {
      "method": "truncate",
      "params": {
        "length": 5
      }
    },
    "age": "generalize_age",
    "birth_date": {
      "method": "generalize_date",
      "params": {
        "mode": "month_year"
      }
    },
    "state": {
      "method": "random_choice",
      "params": {
        "choices": [
          "SP",
          "RJ",
          "MG",
          "BA"
        ]
      }
    },
    "last_access": {
      "method": "date_offset",
      "params": {
        "min_days": -2,
        "max_days": 2
      }
    },
    "feedback": "shuffle"
  }
}
```

## 🧠 Conditional Rules

You can also apply a rule conditionally, based on the value of another column:

```json
{
  "columns": {
    "cpf": {
      "method": "replace_with_fake",
      "params": {
        "digits": 11
      },
      "condition": {
        "column": "status",
        "operator": "equals",
        "value": "active"
      }
    }
  }
}
```
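
As a rough illustration of the rule above, the following plain-Python sketch replaces `cpf` with random digits only in rows where `status` equals `"active"`. It assumes that rows failing the condition are left unchanged, which the README does not state explicitly. `fake_digits` and `apply_when_equal` are hypothetical helpers.

```python
import random


def fake_digits(digits: int, rng: random.Random) -> str:
    """Generate a random numeric string with exactly `digits` characters."""
    return "".join(rng.choice("0123456789") for _ in range(digits))


def apply_when_equal(rows, target, cond_column, cond_value, replace):
    """Replace `target` only in rows where `cond_column` equals `cond_value`."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        if row[cond_column] == cond_value:
            new_row[target] = replace(row[target])
        result.append(new_row)
    return result


rows = [
    {"cpf": "12345678901", "status": "active"},
    {"cpf": "98765432100", "status": "inactive"},
]
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
masked = apply_when_equal(
    rows, "cpf", "status", "active", lambda _: fake_digits(11, rng)
)
for row in masked:
    print(row)
```

The first row gets a new 11-digit value, while the second row keeps its original `cpf` because its status is `"inactive"`.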

## ⚖️ Supported Condition Operators

| Operator       | Description                            |
|----------------|----------------------------------------|
| equals         | Equal to                               |
| not_equals     | Not equal to                           |
| in             | Value in list                          |
| not_in         | Value not in list                      |
| gt             | Greater than                           |
| gte            | Greater than or equal to               |
| lt             | Less than                              |
| lte            | Less than or equal to                  |
| contains       | Substring exists in string             |
| not_contains   | Substring does not exist in string     |

## 📁 Project Structure

```
.
├── main.py                 # Entry point to run anonymization
├── anonymizer.py           # Core logic for applying anonymization rules
├── config.json             # Example configuration file
├── sensitive_data.csv      # Input file to be anonymized
├── README.md               # Project documentation
└── requirements.txt        # Project dependencies
```

## 🛠️ Requirements

- Python 3.12+
- [Polars](https://pola.rs/) >= 1.31.0

Create a virtual environment and install the dependencies:

```bash
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
```

## 🚀 Run the script

```bash
python main.py
```

Before running, update the paths to the input CSV and the config JSON as needed.

## 🔮 Possible Future Features

- Hashing support for specific fields
- Redaction rules using regex
- Support for nested or JSON-style fields
- Command-line interface with rich options
- Parallel processing for large datasets

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "cloakdata",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.12",
    "maintainer_email": null,
    "keywords": "anonymization, data privacy, polars, etl, data masking",
    "author": "Jeferson Peter",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/c6/a1/f95b2a65cdc67790217338dc146a046d7bb95990d89a0d6b3097481b5f5f/cloakdata-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A lightweight library for anonymizing tabular datasets using Polars",
    "version": "1.0.0",
    "project_urls": null,
    "split_keywords": [
        "anonymization",
        " data privacy",
        " polars",
        " etl",
        " data masking"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3b81a348bd743204012aa7325ecdd29e083c05342dd908e3c02562e6aaf6882f",
                "md5": "86a11d01b893bcf34692239e3ed2c5b2",
                "sha256": "284210649bca294eff6bf70e07965afcf1e5eb1e85c94dfe3d4598d5618b17d4"
            },
            "downloads": -1,
            "filename": "cloakdata-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "86a11d01b893bcf34692239e3ed2c5b2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12",
            "size": 9918,
            "upload_time": "2025-08-02T00:04:25",
            "upload_time_iso_8601": "2025-08-02T00:04:25.995923Z",
            "url": "https://files.pythonhosted.org/packages/3b/81/a348bd743204012aa7325ecdd29e083c05342dd908e3c02562e6aaf6882f/cloakdata-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "c6a1f95b2a65cdc67790217338dc146a046d7bb95990d89a0d6b3097481b5f5f",
                "md5": "f287a8a1dd70908554fd73c5a7562fdb",
                "sha256": "4082bd6666ad15b4a05c3cb30e93869185ff918bf153c9c9d41cbb41ba5c420e"
            },
            "downloads": -1,
            "filename": "cloakdata-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "f287a8a1dd70908554fd73c5a7562fdb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12",
            "size": 10615,
            "upload_time": "2025-08-02T00:04:27",
            "upload_time_iso_8601": "2025-08-02T00:04:27.255290Z",
            "url": "https://files.pythonhosted.org/packages/c6/a1/f95b2a65cdc67790217338dc146a046d7bb95990d89a0d6b3097481b5f5f/cloakdata-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-02 00:04:27",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "cloakdata"
}
