phantomtext


Name: phantomtext
Version: 0.1.1
Home page: https://github.com/lucapajola/PhantomText
Summary: A toolkit for content injection, obfuscation, scanning, and sanitization of various document formats. If you use this library, please cite: Castagnaro et al. 'The Hidden Threat in Plain Text: Attacking RAG Data Loaders' (2025).
Upload time: 2025-07-24 06:39:18
Maintainer: None
Docs URL: None
Author: Luca Pajola
Requires Python: >=3.7
License: None
Keywords: text, obfuscation, steganography, content, injection, document, security
Requirements: No requirements were recorded.

# PhantomText Toolkit

PhantomText is a Python library for content injection, content obfuscation, file scanning, and file sanitization across PDF, DOCX, and HTML documents. It covers both sides of hidden-text attacks: crafting documents that carry invisible or obfuscated content, and detecting or removing such content.

## Features

- **Content Injection**: Inject hidden content into documents using techniques such as zero-size text and transparent text. Camouflage, out-of-bound positioning, and metadata injection are in development.
- **Content Obfuscation**: Obfuscate sensitive text using zero-width characters, homoglyphs, diacritical marks, and bidirectional text reordering.
- **File Scanning**: Scan files for malicious content or vulnerabilities using the `FileScanner` class that detects obfuscated and injected content.
- **File Sanitization**: Sanitize files to remove harmful content with the `FileSanitizer` class.

## Attack Families

### Obfuscation Attacks
- **Zero-Width Characters**: Uses invisible Unicode characters (Zero Width Space, Zero Width Non-Joiner, etc.) to obfuscate text
- **Homoglyph Characters**: Replaces characters with visually similar Unicode characters from different scripts
- **Diacritical Marks**: Adds combining diacritical marks to characters to alter their appearance
- **Bidi/Reordering**: Uses Unicode bidirectional override characters to manipulate text direction and rendering
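
The four families above can be illustrated with a few lines of standard-library Python. This is a minimal sketch of the general techniques, not PhantomText's implementation: the function names, the choice of U+200B and U+0323, and the small Latin-to-Cyrillic map are all assumptions made for the example.

```python
ZWSP = "\u200b"                 # ZERO WIDTH SPACE
RLO = "\u202e"                  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"                  # POP DIRECTIONAL FORMATTING
COMBINING_DOT_BELOW = "\u0323"  # combining diacritical mark

# Hand-picked Latin -> Cyrillic lookalikes (illustrative, far from exhaustive).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}


def zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of characters."""
    return ZWSP.join(text)


def homoglyph(text: str) -> str:
    """Swap Latin letters for similar-looking Cyrillic ones where a mapping exists."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def diacritical(text: str) -> str:
    """Append a combining dot below to every alphabetic character."""
    return "".join(ch + COMBINING_DOT_BELOW if ch.isalpha() else ch for ch in text)


def bidi(text: str) -> str:
    """Store the text reversed inside RLO...PDF so a bidi-aware renderer may show it in the original order."""
    return RLO + text[::-1] + PDF


target = "email"
for technique in (zero_width, homoglyph, diacritical, bidi):
    obfuscated = technique(target)
    # A plain substring search no longer finds the target in any of the outputs.
    print(technique.__name__, target in obfuscated)
```

In each case the stored character sequence differs from what a reader is meant to see, which is why plain keyword matching on extracted text misses the target.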

### Injection Attacks
- **Zero-Size Injection**: Injects content using zero or near-zero font sizes to make text invisible
- **Transparent Injection**: Injects content using transparent colors or opacity settings
- **Camouflage Injection**: (In development) Hides content by matching background colors or patterns
- **Out-of-Bound Injection**: (In development) Places content outside visible document boundaries
- **Metadata Injection**: (In development) Embeds content in document metadata
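
For HTML, the zero-size and transparent families come down to wrapping the payload in an element whose style hides it. The sketch below shows that idea with plain string handling; `inject_hidden_span` and the `STYLES` values are hypothetical names chosen for illustration and are not taken from the library.

```python
import html

# Inline styles standing in for the two implemented families (assumed values).
STYLES = {
    "zero_size": "font-size:0",
    "transparent": "opacity:0",
}


def inject_hidden_span(document: str, payload: str, family: str = "zero_size") -> str:
    """Insert a visually hidden <span> carrying `payload` just before </body>."""
    span = f'<span style="{STYLES[family]}">{html.escape(payload)}</span>'
    head, sep, tail = document.rpartition("</body>")
    if not sep:  # no closing body tag found: append at the end instead
        return document + span
    return head + span + sep + tail


page = "<html><body><p>Hello</p></body></html>"
print(inject_hidden_span(page, "Hidden content"))
print(inject_hidden_span(page, "Invisible text", family="transparent"))
```

The injected text is invisible in a browser but is still present in the markup, so a loader that extracts text without evaluating styles will pick it up.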

## Supported Formats

PhantomText supports the following document formats:

- PDF
- DOCX
- HTML

## Installation

Install PhantomText with pip (Python 3.7 or later):

```
pip install phantomtext
```

## Usage

### Content Injection Example

```python
from phantomtext.content_injection import ContentInjector

injector = ContentInjector()
injector.inject_content('document.pdf', 'New Content')
```

### Content Obfuscation Example

```python
from phantomtext.content_obfuscation import ContentObfuscator

obfuscator = ContentObfuscator()

# Basic obfuscation
obfuscated_content = obfuscator.obfuscate_content('Sensitive Information')

# Advanced obfuscation with specific techniques
content = "Sensitive info: email@example.com and phone 123-456-7890."
target = "email@example.com"

# Zero-width character obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="zeroWidthCharacter", 
                                  modality="default", 
                                  file_format="html")

# Homoglyph character obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="homoglyph", 
                                  file_format="pdf")

# Diacritical marks obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="diacritical", 
                                  modality="heavy", 
                                  file_format="docx")

# Bidi/reordering character obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="bidi", 
                                  modality="default", 
                                  file_format="html")
```
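
Because obfuscated strings usually look unchanged on screen, it can help to list the non-ASCII code points they contain. The helper below is a small standard-library sketch; `describe_non_ascii` is a hypothetical name, and the sample string is hand-built rather than produced by `obfuscate`.

```python
import unicodedata


def describe_non_ascii(text: str) -> list:
    """Return (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# Hand-built stand-in for an obfuscated string: one zero-width space, one Cyrillic "a".
sample = "em\u200bail@ex\u0430mple.com"
for entry in describe_non_ascii(sample):
    print(entry)
```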

### Technique-Specific Injection Example

```python
from phantomtext.injection.zerosize_injection import ZeroSizeInjection
from phantomtext.injection.transparent_injection import TransparentInjection

# Zero-size injection
injector = ZeroSizeInjection(modality="default", file_format="pdf")
injector.apply(input_document="document.pdf", 
               injection="Hidden content", 
               output_path="injected_document.pdf")

# Transparent injection
injector = TransparentInjection(modality="opacity-0", file_format="html")
injector.apply(input_document="document.html", 
               injection="Invisible text", 
               output_path="injected_document.html")
```

#### Supported Attacks

##### Obfuscation Attacks

| **Attack Family** | **Attack Name**           | **Variant**   | **HTML** | **DOCX** | **PDF** |
|-------------------|---------------------------|---------------|----------|----------|---------|
| Obfuscation       | diacritical_marks         | default       | ✅        | ✅        | ✅       |
|                   |                           | heavy         | ✅        | ✅        | ✅       |
| Obfuscation       | homoglyph_characters      | default       | ✅        | ✅        | ✅       |
| Obfuscation       | zero_width_characters     | default       | ✅        | ✅        | ✅       |
|                   |                           | heavy         | ✅        | ✅        | ✅       |
| Obfuscation       | bidi_reordering           | default       | ✅        | ✅        | ✅       |
|                   |                           | heavy         | ✅        | ✅        | ✅       |

##### Injection Attacks

| **Attack Family** | **Attack Name**           | **Variant**          | **HTML** | **DOCX** | **PDF** |
|-------------------|---------------------------|----------------------|----------|----------|---------|
| Injection         | zero_size                 | default              | ✅        | ✅        | ✅       |
|                   |                           | close-to-zero        | ✅        | ❌        | ✅       |
| Injection         | transparent               | default              | ✅        | ✅        | ✅       |
|                   |                           | opacity-0            | ✅        | ❌        | ✅       |
|                   |                           | opacity-close-to-zero| ✅        | ❌        | ✅       |
|                   |                           | vanish               | ❌        | ✅        | ❌       |
| Injection         | camouflage                | default              | 🚧        | 🚧        | 🚧       |
| Injection         | out_of_bound              | default              | 🚧        | 🚧        | 🚧       |
| Injection         | metadata                  | default              | 🚧        | 🚧        | 🚧       |

**Legend:**
- ✅ Implemented and working
- ❌ Not supported for this format
- 🚧 Placeholder implementation (not yet functional)
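
When scripting over many variants and formats, it may be convenient to encode the injection table above as data and check a combination before using it. The following is a sketch: `INJECTION_SUPPORT` and `is_supported` are hypothetical helpers transcribed by hand from the table, not part of the PhantomText API.

```python
# True = implemented, False = not supported, None = placeholder (not yet functional).
INJECTION_SUPPORT = {
    ("zero_size", "default"): {"html": True, "docx": True, "pdf": True},
    ("zero_size", "close-to-zero"): {"html": True, "docx": False, "pdf": True},
    ("transparent", "default"): {"html": True, "docx": True, "pdf": True},
    ("transparent", "opacity-0"): {"html": True, "docx": False, "pdf": True},
    ("transparent", "opacity-close-to-zero"): {"html": True, "docx": False, "pdf": True},
    ("transparent", "vanish"): {"html": False, "docx": True, "pdf": False},
    ("camouflage", "default"): {"html": None, "docx": None, "pdf": None},
    ("out_of_bound", "default"): {"html": None, "docx": None, "pdf": None},
    ("metadata", "default"): {"html": None, "docx": None, "pdf": None},
}


def is_supported(attack: str, variant: str, file_format: str) -> bool:
    """Return True only for combinations the table marks as implemented."""
    row = INJECTION_SUPPORT.get((attack, variant))
    return bool(row) and row.get(file_format.lower()) is True


# List every implemented injection variant for DOCX.
for (attack, variant), formats in INJECTION_SUPPORT.items():
    if formats["docx"] is True:
        print(attack, variant)
```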

### File Scanning Example

```python
from phantomtext.file_scanning import FileScanner

scanner = FileScanner()

# Scan a single file
result = scanner.scan_file('document.docx')
print(f"Malicious content found: {result['malicious_content_found']}")
print(f"Vulnerabilities: {result['vulnerabilities']}")

# Scan an entire directory
reports = scanner.scan_dir('./output')
for report in reports:
    if report['malicious_content_found']:
        print(f"⚠️ Issues found in {report['file_path']}")
        for vulnerability in report['vulnerabilities']:
            print(f"  - {vulnerability}")
```
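
For larger directories, the per-file reports can be condensed into a single summary. This sketch assumes each report is a dict with the `file_path`, `malicious_content_found`, and `vulnerabilities` keys used in the example above; `summarize` is a hypothetical helper, and the sample reports are made up for illustration.

```python
from collections import Counter


def summarize(reports):
    """Count flagged files and tally vulnerability descriptions across reports."""
    flagged = [r for r in reports if r.get("malicious_content_found")]
    # str() keeps the tally working whatever type the entries turn out to have.
    counts = Counter(str(v) for r in flagged for v in r.get("vulnerabilities", []))
    return {
        "scanned": len(reports),
        "flagged": [r.get("file_path") for r in flagged],
        "by_vulnerability": dict(counts),
    }


# Made-up reports; real ones would come from scanner.scan_dir(...).
example_reports = [
    {"file_path": "a.html", "malicious_content_found": True, "vulnerabilities": ["zero-width", "bidi"]},
    {"file_path": "b.pdf", "malicious_content_found": False, "vulnerabilities": []},
]
print(summarize(example_reports))
```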

### Detection Capabilities

The `FileScanner` class can detect the following obfuscation techniques:
- Zero-width character sequences
- Homoglyph character substitutions  
- Diacritical mark insertions
- Bidirectional text overrides
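
To give a sense of what such checks involve, here is a simplified detector built on the standard `unicodedata` module. It is a sketch under stated assumptions, not the logic `FileScanner` uses: the character sets are partial, the script test relies on the first word of each Unicode character name, and `detect` is a hypothetical function.

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # embeddings, overrides, pop
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates
}


def _script(ch: str) -> str:
    """Rough script label: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]


def detect(text: str) -> set:
    """Return the set of obfuscation families that appear to be present in `text`."""
    findings = set()
    for word in text.split():
        scripts = {_script(ch) for ch in word if ch.isalpha()}
        # A single word mixing Latin with Cyrillic or Greek letters is suspicious.
        if "LATIN" in scripts and scripts & {"CYRILLIC", "GREEK"}:
            findings.add("homoglyph")
    for ch in text:
        if ch in ZERO_WIDTH:
            findings.add("zero_width")
        elif ch in BIDI_CONTROLS:
            findings.add("bidi")
        elif unicodedata.combining(ch):
            findings.add("diacritical")
    return findings


print(detect("Contact p\u0430ypal sup\u200bport"))
```

Note that the combining-mark check also fires on legitimate text stored in decomposed form, so a real scanner would need thresholds or context to limit false positives.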

### File Sanitization Example

```python
from phantomtext.file_sanitization import FileSanitizer

sanitizer = FileSanitizer()
sanitizer.sanitize_file('malicious_file.html')
```
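
At the level of plain text, sanitization amounts to removing or normalizing the characters that carry the obfuscation. The sketch below shows one way to do this with the standard library. It is not `FileSanitizer`'s implementation; `sanitize_text` and its `strip_marks` flag are hypothetical, and homoglyph replacement is left out because it needs a confusables mapping.

```python
import unicodedata

# Zero-width and bidirectional control characters to drop (partial list).
STRIP = set(
    "\u200b\u200c\u200d\u2060\ufeff"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)


def sanitize_text(text: str, strip_marks: bool = False) -> str:
    """Remove invisible control characters; optionally drop all combining marks."""
    cleaned = "".join(ch for ch in text if ch not in STRIP)
    if strip_marks:
        # Decompose first so precomposed letters lose their accents too.
        decomposed = unicodedata.normalize("NFD", cleaned)
        cleaned = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", cleaned)


print(sanitize_text("pa\u200bss\u202eword\u202c"))
print(sanitize_text("e\u0323m\u0323ail", strip_marks=True))
```

`strip_marks=True` is deliberately aggressive: it also removes accents from legitimate text such as "café", so it is off by default. Removing bidi controls makes the stored order visible but does not reorder text that was stored reversed.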

## Citation

If you use PhantomText in your research, please cite our paper:

```bibtex
@article{castagnaro2025hidden,
  title={The Hidden Threat in Plain Text: Attacking RAG Data Loaders},
  author={Castagnaro, Alberto and Salviati, Umberto and Conti, Mauro and Pajola, Luca and Pizzi, Simeone},
  journal={arXiv preprint arXiv:2507.05093},
  year={2025}
}
```

## Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue for any enhancements or bug fixes.

## License

This project is licensed under the MIT License. See the LICENSE file for more details.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/lucapajola/PhantomText",
    "name": "phantomtext",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "text, obfuscation, steganography, content, injection, document, security",
    "author": "Luca Pajola",
    "author_email": "Luca Pajola <luca.pajola@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/8b/1c/81d2ca990e7f7921aee9eb5d4720fc551418d65dea1035e95905be19a4d2/phantomtext-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A toolkit for content injection, obfuscation, scanning, and sanitization of various document formats. If you use this library, please cite: Castagnaro et al. 'The Hidden Threat in Plain Text: Attacking RAG Data Loaders' (2025).",
    "version": "0.1.1",
    "project_urls": {
        "Bug Reports": "https://github.com/lucapajola/PhantomText/issues",
        "Homepage": "https://github.com/lucapajola/PhantomText",
        "Source": "https://github.com/lucapajola/PhantomText"
    },
    "split_keywords": [
        "text",
        " obfuscation",
        " steganography",
        " content",
        " injection",
        " document",
        " security"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2188406cdb863721694ef03d49059a97455ca99c0f4a48bfa58a52d4a16bbd81",
                "md5": "e0115cdadac142317d8a28a42399291f",
                "sha256": "fc8d79527ce9531737eed97194cedc5a2c3b2fa5a330198a31a711adc06b6c8a"
            },
            "downloads": -1,
            "filename": "phantomtext-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e0115cdadac142317d8a28a42399291f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 412744,
            "upload_time": "2025-07-24T06:39:17",
            "upload_time_iso_8601": "2025-07-24T06:39:17.103682Z",
            "url": "https://files.pythonhosted.org/packages/21/88/406cdb863721694ef03d49059a97455ca99c0f4a48bfa58a52d4a16bbd81/phantomtext-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8b1c81d2ca990e7f7921aee9eb5d4720fc551418d65dea1035e95905be19a4d2",
                "md5": "7c6ba3b1e69f82587fc8cda8ed375623",
                "sha256": "81d565cb40043fbe876cb8f6b9ec84206a673a0bc60da827e395046b96f08972"
            },
            "downloads": -1,
            "filename": "phantomtext-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "7c6ba3b1e69f82587fc8cda8ed375623",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 1023355,
            "upload_time": "2025-07-24T06:39:18",
            "upload_time_iso_8601": "2025-07-24T06:39:18.798918Z",
            "url": "https://files.pythonhosted.org/packages/8b/1c/81d2ca990e7f7921aee9eb5d4720fc551418d65dea1035e95905be19a4d2/phantomtext-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-24 06:39:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "lucapajola",
    "github_project": "PhantomText",
    "github_not_found": true,
    "lcname": "phantomtext"
}
        