# PhantomText Toolkit
PhantomText is a Python library for content injection, content obfuscation, file scanning, and file sanitization across PDF, DOCX, and HTML documents. It provides tools for hiding or obfuscating document content and for detecting and removing such content.
## Features
- **Content Injection**: Inject hidden content into documents using techniques such as zero-size text and transparent text. Camouflage, out-of-bound positioning, and metadata injection are in development.
- **Content Obfuscation**: Obfuscate sensitive text with zero-width characters, homoglyphs, diacritical marks, and bidirectional text reordering.
- **File Scanning**: Scan files for malicious content or vulnerabilities using the `FileScanner` class that detects obfuscated and injected content.
- **File Sanitization**: Sanitize files to remove harmful content with the `FileSanitizer` class.
## Attack Families
### Obfuscation Attacks
- **Zero-Width Characters**: Uses invisible Unicode characters (Zero Width Space, Zero Width Non-Joiner, etc.) to obfuscate text
- **Homoglyph Characters**: Replaces characters with visually similar Unicode characters from different scripts
- **Diacritical Marks**: Adds combining diacritical marks to characters to alter their appearance
- **Bidi/Reordering**: Uses Unicode bidirectional override characters to manipulate text direction and rendering
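To make the first two techniques concrete, here is a minimal standard-library sketch. It does not use PhantomText and is not how the library implements these attacks; `interleave_zero_width` and the variable names are hypothetical and chosen only for illustration. It shows why text that looks unchanged to a reader no longer matches a plain substring search.

```python
ZERO_WIDTH_SPACE = "\u200b"  # U+200B ZERO WIDTH SPACE
CYRILLIC_A = "\u0430"        # U+0430 CYRILLIC SMALL LETTER A, looks like Latin "a"


def interleave_zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of adjacent characters."""
    return ZERO_WIDTH_SPACE.join(text)


original = "email@example.com"

# Zero-width variant: renders like the original but has extra code points.
zero_width_variant = interleave_zero_width(original)
print(original == zero_width_variant)          # False
print(len(original), len(zero_width_variant))  # 17 33
print("example" in zero_width_variant)         # False

# Homoglyph variant: same length, but Latin "a" is swapped for Cyrillic "а".
homoglyph_variant = original.replace("a", CYRILLIC_A)
print(original == homoglyph_variant)           # False
print(len(homoglyph_variant))                  # 17
```

Removing the inserted code points (or mapping the homoglyphs back) restores the original string, which is the basis for the scanning and sanitization features described later.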
### Injection Attacks
- **Zero-Size Injection**: Injects content using zero or near-zero font sizes to make text invisible
- **Transparent Injection**: Injects content using transparent colors or opacity settings
- **Camouflage Injection**: (In development) Hides content by matching background colors or patterns
- **Out-of-Bound Injection**: (In development) Places content outside visible document boundaries
- **Metadata Injection**: (In development) Embeds content in document metadata
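The sketch below illustrates why zero-size and transparent injection matter for automated text extraction. It uses only Python's standard `html.parser` module, not PhantomText; the `TextExtractor` class and the inline CSS used for the hidden spans are assumptions made for this example and may differ from what the library emits.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect every text node and ignore styling, as a naive loader might."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


visible = "<p>Quarterly report</p>"
# Hypothetical hidden payloads: a zero-size span and a fully transparent span.
zero_size = '<span style="font-size:0">Hidden content</span>'
transparent = '<span style="opacity:0">Invisible text</span>'

parser = TextExtractor()
parser.feed(f"<html><body>{visible}{zero_size}{transparent}</body></html>")
parser.close()

extracted = parser.chunks
print(extracted)  # ['Quarterly report', 'Hidden content', 'Invisible text']
```

A browser would show only "Quarterly report", while the extractor returns all three strings because it never evaluates the CSS.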
## Supported Formats
PhantomText supports the following document formats:
- PDF
- DOCX
- HTML
## Installation
Install PhantomText with pip:
```
pip install phantomtext
```
## Usage
### Content Injection Example
```python
from phantomtext.content_injection import ContentInjector
injector = ContentInjector()
injector.inject_content('document.pdf', 'New Content')
```
### Content Obfuscation Example
```python
from phantomtext.content_obfuscation import ContentObfuscator
obfuscator = ContentObfuscator()
# Basic obfuscation
obfuscated_content = obfuscator.obfuscate_content('Sensitive Information')
# Advanced obfuscation with specific techniques
content = "Sensitive info: email@example.com and phone 123-456-7890."
target = "email@example.com"
# Zero-width character obfuscation
obfuscated = obfuscator.obfuscate(content, target,
                                  obfuscation_technique="zeroWidthCharacter",
                                  modality="default",
                                  file_format="html")

# Homoglyph character obfuscation
obfuscated = obfuscator.obfuscate(content, target,
                                  obfuscation_technique="homoglyph",
                                  file_format="pdf")

# Diacritical marks obfuscation
obfuscated = obfuscator.obfuscate(content, target,
                                  obfuscation_technique="diacritical",
                                  modality="heavy",
                                  file_format="docx")

# Bidi/reordering character obfuscation
obfuscated = obfuscator.obfuscate(content, target,
                                  obfuscation_technique="bidi",
                                  modality="default",
                                  file_format="html")
```
### Technique-Specific Injection Example
```python
from phantomtext.injection.zerosize_injection import ZeroSizeInjection
from phantomtext.injection.transparent_injection import TransparentInjection
# Zero-size injection
injector = ZeroSizeInjection(modality="default", file_format="pdf")
injector.apply(input_document="document.pdf",
               injection="Hidden content",
               output_path="injected_document.pdf")

# Transparent injection
injector = TransparentInjection(modality="opacity-0", file_format="html")
injector.apply(input_document="document.html",
               injection="Invisible text",
               output_path="injected_document.html")
```
#### Supported Attacks
##### Obfuscation Attacks
| **Attack Family** | **Attack Name** | **Variant** | **HTML** | **DOCX** | **PDF** |
|-------------------|---------------------------|---------------|----------|----------|---------|
| Obfuscation | diacritical_marks | default | ✅ | ✅ | ✅ |
| | | heavy | ✅ | ✅ | ✅ |
| Obfuscation | homoglyph_characters | default | ✅ | ✅ | ✅ |
| Obfuscation | zero_width_characters | default | ✅ | ✅ | ✅ |
| | | heavy | ✅ | ✅ | ✅ |
| Obfuscation | bidi_reordering | default | ✅ | ✅ | ✅ |
| | | heavy | ✅ | ✅ | ✅ |
##### Injection Attacks
| **Attack Family** | **Attack Name** | **Variant** | **HTML** | **DOCX** | **PDF** |
|-------------------|---------------------------|----------------------|----------|----------|---------|
| Injection | zero_size | default | ✅ | ✅ | ✅ |
| | | close-to-zero | ✅ | ❌ | ✅ |
| Injection | transparent | default | ✅ | ✅ | ✅ |
| | | opacity-0 | ✅ | ❌ | ✅ |
| | | opacity-close-to-zero| ✅ | ❌ | ✅ |
| | | vanish | ❌ | ✅ | ❌ |
| Injection | camouflage | default | 🚧 | 🚧 | 🚧 |
| Injection | out_of_bound | default | 🚧 | 🚧 | 🚧 |
| Injection | metadata | default | 🚧 | 🚧 | 🚧 |
**Legend:**
- ✅ Implemented and working
- ❌ Not supported for this format
- 🚧 Placeholder implementation (not yet functional)
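When scripting against several formats, it can help to check a variant against the injection table before calling an injector. The lookup below is a hand-written transcription of that table, not part of the PhantomText API; `INJECTION_SUPPORT` and `is_supported` are hypothetical names, and the mapping should be updated if the table changes.

```python
ALL_FORMATS = {"html", "docx", "pdf"}

# (attack name, variant) -> formats marked as implemented in the table above.
# Placeholder attacks map to an empty set because they are not yet functional.
INJECTION_SUPPORT = {
    ("zero_size", "default"): ALL_FORMATS,
    ("zero_size", "close-to-zero"): {"html", "pdf"},
    ("transparent", "default"): ALL_FORMATS,
    ("transparent", "opacity-0"): {"html", "pdf"},
    ("transparent", "opacity-close-to-zero"): {"html", "pdf"},
    ("transparent", "vanish"): {"docx"},
    ("camouflage", "default"): set(),
    ("out_of_bound", "default"): set(),
    ("metadata", "default"): set(),
}


def is_supported(attack: str, variant: str, file_format: str) -> bool:
    """Return True if the table lists this combination as implemented."""
    formats = INJECTION_SUPPORT.get((attack, variant), set())
    return file_format.lower() in formats


print(is_supported("transparent", "vanish", "docx"))  # True
print(is_supported("transparent", "vanish", "pdf"))   # False
```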
### File Scanning Example
```python
from phantomtext.file_scanning import FileScanner
scanner = FileScanner()
# Scan a single file
result = scanner.scan_file('document.docx')
print(f"Malicious content found: {result['malicious_content_found']}")
print(f"Vulnerabilities: {result['vulnerabilities']}")
# Scan an entire directory
reports = scanner.scan_dir('./output')
for report in reports:
    if report['malicious_content_found']:
        print(f"⚠️ Issues found in {report['file_path']}")
        for vulnerability in report['vulnerabilities']:
            print(f"  - {vulnerability}")
```
### Detection Capabilities
The `FileScanner` class can detect the following obfuscation techniques:
- Zero-width character sequences
- Homoglyph character substitutions
- Diacritical mark insertions
- Bidirectional text overrides
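For intuition, the four checks listed above can be approximated on a plain string with the standard `unicodedata` module. This is a rough heuristic sketch, not the `FileScanner` implementation: `find_suspicious` is a hypothetical helper, the character sets are a small assumed subset, and the mixed-script test will also flag some legitimate multilingual words and decomposed accents.

```python
import unicodedata

# Assumed, non-exhaustive sets of invisible and direction-control characters.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
                 "\u2066", "\u2067", "\u2068", "\u2069"}


def find_suspicious(text: str) -> list:
    """Return a sorted list of heuristic findings for a plain string."""
    findings = set()

    # Character-level checks.
    for ch in text:
        if ch in ZERO_WIDTH:
            findings.add("zero-width character")
        elif ch in BIDI_CONTROLS:
            findings.add("bidirectional control")
        elif unicodedata.combining(ch):
            findings.add("combining mark")

    # Word-level check: letters from more than one script in a single word.
    for word in text.split():
        scripts = set()
        for ch in word:
            if ch.isalpha():
                # The first word of the Unicode name approximates the script,
                # e.g. "LATIN", "CYRILLIC", "GREEK".
                scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
        if len(scripts) > 1:
            findings.add("mixed scripts (possible homoglyphs)")

    return sorted(findings)


print(find_suspicious("plain text"))    # []
print(find_suspicious("pa\u200bss"))    # ['zero-width character']
print(find_suspicious("p\u0430ss"))     # ['mixed scripts (possible homoglyphs)']
```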
### File Sanitization Example
```python
from phantomtext.file_sanitization import FileSanitizer
sanitizer = FileSanitizer()
sanitizer.sanitize_file('malicious_file.txt')
```
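As an illustration of what sanitizing text can involve, the sketch below strips invisible characters and combining marks from a string using only the standard library. It is an assumption-laden example, not the `FileSanitizer` implementation: `strip_hidden_characters` is a hypothetical helper, it does not reverse homoglyph substitutions, and it also removes legitimate accents and joiners that some scripts require.

```python
import unicodedata

# Assumed, non-exhaustive set of zero-width and bidirectional control characters.
INVISIBLE = {
    "\u200b", "\u200c", "\u200d", "\u2060", "\ufeff",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def strip_hidden_characters(text: str) -> str:
    """Drop invisible characters and all combining marks from a string."""
    # Decompose first so precomposed letters expose their combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        if ch not in INVISIBLE and not unicodedata.combining(ch)
    ]
    return unicodedata.normalize("NFC", "".join(kept))


print(strip_hidden_characters("e\u200bm\u0301ail\u202e"))  # email
print(strip_hidden_characters("caf\u00e9"))                # cafe
```

The second call shows the trade-off: the accent in "café" is removed along with the injected marks, so a production sanitizer would need a more selective policy.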
## Citation
If you use PhantomText in your research, please cite our paper:
```bibtex
@article{castagnaro2025hidden,
title={The Hidden Threat in Plain Text: Attacking RAG Data Loaders},
author={Castagnaro, Alberto and Salviati, Umberto and Conti, Mauro and Pajola, Luca and Pizzi, Simeone},
journal={arXiv preprint arXiv:2507.05093},
year={2025}
}
```
## Contributing
Contributions are welcome! Please feel free to submit a pull request or open an issue for any enhancements or bug fixes.
## License
This project is licensed under the MIT License. See the LICENSE file for more details.
Raw data
{
"_id": null,
"home_page": "https://github.com/lucapajola/PhantomText",
"name": "phantomtext",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "text, obfuscation, steganography, content, injection, document, security",
"author": "Luca Pajola",
"author_email": "Luca Pajola <luca.pajola@example.com>",
"download_url": "https://files.pythonhosted.org/packages/8b/1c/81d2ca990e7f7921aee9eb5d4720fc551418d65dea1035e95905be19a4d2/phantomtext-0.1.1.tar.gz",
"platform": null,
"description": "# PhantomText Toolkit\n\nPhantomText is a Python library designed for handling content injection, content obfuscation, file scanning, and file sanitization across various document formats. This toolkit provides a comprehensive set of tools to manage and secure document content effectively.\n\n## Features\n\n- **Content Injection**: Easily inject content into different document formats using various steganographic techniques like zero-size text, transparent text, and out-of-bound positioning.\n- **Content Obfuscation**: Protect sensitive information with advanced obfuscation techniques including zero-width characters, homoglyphs, diacritical marks, and bidirectional text reordering.\n- **File Scanning**: Scan files for malicious content or vulnerabilities using the `FileScanner` class that detects obfuscated and injected content.\n- **File Sanitization**: Sanitize files to remove harmful content with the `FileSanitizer` class.\n\n## Attack Families\n\n### Obfuscation Attacks\n- **Zero-Width Characters**: Uses invisible Unicode characters (Zero Width Space, Zero Width Non-Joiner, etc.) 
to obfuscate text\n- **Homoglyph Characters**: Replaces characters with visually similar Unicode characters from different scripts\n- **Diacritical Marks**: Adds combining diacritical marks to characters to alter their appearance\n- **Bidi/Reordering**: Uses Unicode bidirectional override characters to manipulate text direction and rendering\n\n### Injection Attacks\n- **Zero-Size Injection**: Injects content using zero or near-zero font sizes to make text invisible\n- **Transparent Injection**: Injects content using transparent colors or opacity settings\n- **Camouflage Injection**: (In development) Hides content by matching background colors or patterns\n- **Out-of-Bound Injection**: (In development) Places content outside visible document boundaries\n- **Metadata Injection**: (In development) Embeds content in document metadata\n\n## Supported Formats\n\nPhantomText supports the following document formats:\n\n- PDF\n- DOCX\n- HTML\n\n## Installation\n\nTo install PhantomText, you can use pip:\n\n```\npip install phantomtext\n```\n\n## Usage\n\n### Content Injection Example\n\n```python\nfrom phantomtext.content_injection import ContentInjector\n\ninjector = ContentInjector()\ninjector.inject_content('document.pdf', 'New Content')\n```\n\n### Content Obfuscation Example\n\n```python\nfrom phantomtext.content_obfuscation import ContentObfuscator\n\nobfuscator = ContentObfuscator()\n\n# Basic obfuscation\nobfuscated_content = obfuscator.obfuscate_content('Sensitive Information')\n\n# Advanced obfuscation with specific techniques\ncontent = \"Sensitive info: email@example.com and phone 123-456-7890.\"\ntarget = \"email@example.com\"\n\n# Zero-width character obfuscation\nobfuscated = obfuscator.obfuscate(content, target, \n obfuscation_technique=\"zeroWidthCharacter\", \n modality=\"default\", \n file_format=\"html\")\n\n# Homoglyph character obfuscation\nobfuscated = obfuscator.obfuscate(content, target, \n obfuscation_technique=\"homoglyph\", \n 
file_format=\"pdf\")\n\n# Diacritical marks obfuscation\nobfuscated = obfuscator.obfuscate(content, target, \n obfuscation_technique=\"diacritical\", \n modality=\"heavy\", \n file_format=\"docx\")\n\n# Bidi/reordering character obfuscation\nobfuscated = obfuscator.obfuscate(content, target, \n obfuscation_technique=\"bidi\", \n modality=\"default\", \n file_format=\"html\")\n```\n\n### Content Injection Example\n\n```python\nfrom phantomtext.injection.zerosize_injection import ZeroSizeInjection\nfrom phantomtext.injection.transparent_injection import TransparentInjection\n\n# Zero-size injection\ninjector = ZeroSizeInjection(modality=\"default\", file_format=\"pdf\")\ninjector.apply(input_document=\"document.pdf\", \n injection=\"Hidden content\", \n output_path=\"injected_document.pdf\")\n\n# Transparent injection\ninjector = TransparentInjection(modality=\"opacity-0\", file_format=\"html\")\ninjector.apply(input_document=\"document.html\", \n injection=\"Invisible text\", \n output_path=\"injected_document.html\")\n```\n\n#### Supported Attacks\n\n##### Obfuscation Attacks\n\n| **Attack Family** | **Attack Name** | **Variant** | **HTML** | **DOCX** | **PDF** |\n|-------------------|---------------------------|---------------|----------|----------|---------|\n| Obfuscation | diacritical_marks | default | \u2705 | \u2705 | \u2705 |\n| | | heavy | \u2705 | \u2705 | \u2705 |\n| Obfuscation | homoglyph_characters | default | \u2705 | \u2705 | \u2705 |\n| Obfuscation | zero_width_characters | default | \u2705 | \u2705 | \u2705 |\n| | | heavy | \u2705 | \u2705 | \u2705 |\n| Obfuscation | bidi_reordering | default | \u2705 | \u2705 | \u2705 |\n| | | heavy | \u2705 | \u2705 | \u2705 |\n\n##### Injection Attacks\n\n| **Attack Family** | **Attack Name** | **Variant** | **HTML** | **DOCX** | **PDF** |\n|-------------------|---------------------------|----------------------|----------|----------|---------|\n| Injection | zero_size | default | \u2705 | \u2705 | \u2705 |\n| | 
| close-to-zero | \u2705 | \u274c | \u2705 |\n| Injection | transparent | default | \u2705 | \u2705 | \u2705 |\n| | | opacity-0 | \u2705 | \u274c | \u2705 |\n| | | opacity-close-to-zero| \u2705 | \u274c | \u2705 |\n| | | vanish | \u274c | \u2705 | \u274c |\n| Injection | camouflage | default | \ud83d\udea7 | \ud83d\udea7 | \ud83d\udea7 |\n| Injection | out_of_bound | default | \ud83d\udea7 | \ud83d\udea7 | \ud83d\udea7 |\n| Injection | metadata | default | \ud83d\udea7 | \ud83d\udea7 | \ud83d\udea7 |\n\n**Legend:**\n- \u2705 Implemented and working\n- \u274c Not supported for this format\n- \ud83d\udea7 Placeholder implementation (not yet functional)\n\n### File Scanning Example\n\n```python\nfrom phantomtext.file_scanning import FileScanner\n\nscanner = FileScanner()\n\n# Scan a single file\nresult = scanner.scan_file('document.docx')\nprint(f\"Malicious content found: {result['malicious_content_found']}\")\nprint(f\"Vulnerabilities: {result['vulnerabilities']}\")\n\n# Scan an entire directory\nreports = scanner.scan_dir('./output')\nfor report in reports:\n if report['malicious_content_found']:\n print(f\"\u26a0\ufe0f Issues found in {report['file_path']}\")\n for vulnerability in report['vulnerabilities']:\n print(f\" - {vulnerability}\")\n```\n\n### Detection Capabilities\n\nThe FileScanner can detect the following obfuscation techniques:\n- Zero-width character sequences\n- Homoglyph character substitutions \n- Diacritical mark insertions\n- Bidirectional text overrides\n\n### File Sanitization Example\n\n```python\nfrom phantomtext.file_sanitization import FileSanitizer\n\nsanitizer = FileSanitizer()\nsanitizer.sanitize_file('malicious_file.txt')\n```\n\n## Citation\n\nIf you use PhantomText in your research, please cite our paper:\n\n```bibtex\n@article{castagnaro2025hidden,\n title={The Hidden Threat in Plain Text: Attacking RAG Data Loaders},\n author={Castagnaro, Alberto and Salviati, Umberto and Conti, Mauro and Pajola, Luca and Pizzi, Simeone},\n 
journal={arXiv preprint arXiv:2507.05093},\n year={2025}\n}\n```\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a pull request or open an issue for any enhancements or bug fixes.\n\n## License\n\nThis project is licensed under the MIT License. See the LICENSE file for more details.\n",
"bugtrack_url": null,
"license": null,
"summary": "A toolkit for content injection, obfuscation, scanning, and sanitization of various document formats. If you use this library, please cite: Castagnaro et al. 'The Hidden Threat in Plain Text: Attacking RAG Data Loaders' (2025).",
"version": "0.1.1",
"project_urls": {
"Bug Reports": "https://github.com/lucapajola/PhantomText/issues",
"Homepage": "https://github.com/lucapajola/PhantomText",
"Source": "https://github.com/lucapajola/PhantomText"
},
"split_keywords": [
"text",
" obfuscation",
" steganography",
" content",
" injection",
" document",
" security"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "2188406cdb863721694ef03d49059a97455ca99c0f4a48bfa58a52d4a16bbd81",
"md5": "e0115cdadac142317d8a28a42399291f",
"sha256": "fc8d79527ce9531737eed97194cedc5a2c3b2fa5a330198a31a711adc06b6c8a"
},
"downloads": -1,
"filename": "phantomtext-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e0115cdadac142317d8a28a42399291f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 412744,
"upload_time": "2025-07-24T06:39:17",
"upload_time_iso_8601": "2025-07-24T06:39:17.103682Z",
"url": "https://files.pythonhosted.org/packages/21/88/406cdb863721694ef03d49059a97455ca99c0f4a48bfa58a52d4a16bbd81/phantomtext-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8b1c81d2ca990e7f7921aee9eb5d4720fc551418d65dea1035e95905be19a4d2",
"md5": "7c6ba3b1e69f82587fc8cda8ed375623",
"sha256": "81d565cb40043fbe876cb8f6b9ec84206a673a0bc60da827e395046b96f08972"
},
"downloads": -1,
"filename": "phantomtext-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "7c6ba3b1e69f82587fc8cda8ed375623",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 1023355,
"upload_time": "2025-07-24T06:39:18",
"upload_time_iso_8601": "2025-07-24T06:39:18.798918Z",
"url": "https://files.pythonhosted.org/packages/8b/1c/81d2ca990e7f7921aee9eb5d4720fc551418d65dea1035e95905be19a4d2/phantomtext-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-24 06:39:18",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "lucapajola",
"github_project": "PhantomText",
"github_not_found": true,
"lcname": "phantomtext"
}