# AI Data Scrubber
The AI Data Scrubber is a lightweight, privacy-focused tool that removes personal information from text documents before you upload them to Large Language Models (LLMs). You can use it to clean sensitive documents such as resumes or contracts. It is not guaranteed to remove everything, so you should still review the output before uploading your file to an LLM.
It uses a mixture of regular expressions and named entity recognition (NER) models from [spaCy](https://spacy.io/). This approach is less accurate than asking an LLM to remove PII, but it means you don't have to run an LLM on your own machine or upload your document to one.
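To illustrate the regex half of this approach, here is a minimal, self-contained sketch of pattern-based scrubbing using only Python's standard library. The patterns and placeholder names below are illustrative assumptions, not the library's actual implementation (the real tool also uses spaCy's NER for names and addresses):

```python
import re

# Hypothetical patterns for demonstration only -- the real library's
# patterns are more thorough and handle more US formats.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def scrub(text: str) -> str:
    """Replace emails and US-style phone numbers with placeholders."""
    # Replace emails first so their digits can't be mistaken for phone numbers.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(scrub("Email: john.smith@example.com, Phone: (555) 123-4567"))
# → Email: [EMAIL], Phone: [PHONE]
```

Regex handles well-structured identifiers like these reliably; free-form entities such as names need the NER step, which is why the tool combines both.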
## What It Does
The AI Data Scrubber removes personal information including:
- Names
- Email addresses
- Phone numbers
- Street addresses & ZIP codes
- URLs
- License plates
Currently, only US formats are supported.
You can run it from the command line or import it into your own Python scripts.
## Quick Start
```bash
# Install using pip
pip install ai-data-scrubber
# Download required language model (~560MB)
python -m spacy download en_core_web_lg
# Clean your file
ai-data-scrubber your-file.txt
```
## Usage
**Command Line:**
```bash
# Auto-generates an output file with _scrubbed suffix
ai-data-scrubber input.txt
# Or you can specify the output file with the -o flag
ai-data-scrubber input.txt -o output.txt
```
**Python:**
```python
from ai_data_scrubber import scrub_text, scrub_file
# Scrub text directly
cleaned = scrub_text("Your text with personal information here")
# Or scrub a file
scrub_file("input.txt", "output.txt")
```
## Example
**Original text:**
```
John Smith
123 Main Street, Apt 4B
New York, NY 10001
Email: john.smith@example.com
Phone: (555) 123-4567
```
**Cleaned text:**
```
[NAME]
[ADDRESS], [UNIT]
New York, NY [ZIP_CODE]
Email: [EMAIL]
Phone: [PHONE]
```
## Documentation
You can find further documentation in the `docs` folder:
- [Installation Guide](docs/INSTALLATION.md) - Detailed installation including from source
- [Performance Testing](docs/TESTING.md) - How to run a testing script with 20 sample resumes
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Contributing
Contributions welcome! Feel free to submit a Pull Request.