phantomtext

A toolkit for content injection, obfuscation, scanning, and sanitization of various document formats. If you use this library, please cite: Castagnaro et al. 'The Hidden Threat in Plain Text: Attacking RAG Data Loaders' (2025).

Version 0.1.1, published on PyPI (1 maintainer)

PhantomText Toolkit

PhantomText is a Python library for content injection, content obfuscation, file scanning, and file sanitization across several document formats. It covers both sides of the problem: producing documents with hidden or obfuscated content, and detecting and removing that content.

Features

  • Content Injection: Inject hidden content into documents using techniques such as zero-size text and transparent text. Out-of-bound positioning is in development.
  • Content Obfuscation: Obfuscate sensitive text with zero-width characters, homoglyphs, diacritical marks, and bidirectional text reordering.
  • File Scanning: Scan files for obfuscated and injected content with the FileScanner class.
  • File Sanitization: Sanitize files to remove harmful content with the FileSanitizer class.

Attack Families

Obfuscation Attacks

  • Zero-Width Characters: Uses invisible Unicode characters (Zero Width Space, Zero Width Non-Joiner, etc.) to obfuscate text
  • Homoglyph Characters: Replaces characters with visually similar Unicode characters from different scripts
  • Diacritical Marks: Adds combining diacritical marks to characters to alter their appearance
  • Bidi/Reordering: Uses Unicode bidirectional override characters to manipulate text direction and rendering
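
The four obfuscation families above can be illustrated with a minimal, standard-library-only sketch. This is not PhantomText's implementation: the function names, the constants, and the small Latin-to-Cyrillic map are all assumptions chosen for illustration.

```python
# Illustrative sketch only; none of these names come from PhantomText.
ZERO_WIDTH_SPACE = "\u200b"
COMBINING_DOT_ABOVE = "\u0307"
RIGHT_TO_LEFT_OVERRIDE = "\u202e"
POP_DIRECTIONAL_FORMATTING = "\u202c"

# Tiny Latin -> Cyrillic look-alike map (hypothetical, far from exhaustive).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}


def zero_width(text: str) -> str:
    """Insert a zero-width space between every pair of characters."""
    return ZERO_WIDTH_SPACE.join(text)


def homoglyph(text: str) -> str:
    """Swap mapped Latin letters for visually similar Cyrillic ones."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def diacritical(text: str) -> str:
    """Append a combining mark to every ASCII letter."""
    return "".join(
        ch + COMBINING_DOT_ABOVE if ch.isascii() and ch.isalpha() else ch
        for ch in text
    )


def bidi(text: str) -> str:
    """Store the text reversed inside a right-to-left override,
    so a bidi-aware renderer displays it in the original order."""
    return RIGHT_TO_LEFT_OVERRIDE + text[::-1] + POP_DIRECTIONAL_FORMATTING


if __name__ == "__main__":
    target = "email@example.com"
    for technique in (zero_width, homoglyph, diacritical, bidi):
        variant = technique(target)
        # A plain substring search no longer finds the original string.
        print(technique.__name__, target in variant)
```

In each case the stored code points differ from the original string, which is why naive keyword matching on extracted text can miss the target even though a human reader sees roughly the same thing.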

Injection Attacks

  • Zero-Size Injection: Injects content using zero or near-zero font sizes to make text invisible
  • Transparent Injection: Injects content using transparent colors or opacity settings
  • Camouflage Injection: (In development) Hides content by matching background colors or patterns
  • Out-of-Bound Injection: (In development) Places content outside visible document boundaries
  • Metadata Injection: (In development) Embeds content in document metadata
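
For HTML, the zero-size and transparent families can be sketched as wrapping the payload in a styled span. The helper below is a hypothetical illustration, not PhantomText's code: `inject_hidden`, the `STYLES` table, and the choice to insert before the last closing body tag are all assumptions.

```python
import html
import re

# Hypothetical technique -> inline CSS mapping, for illustration only.
STYLES = {
    "zero-size": "font-size:0",
    "transparent": "opacity:0",
}


def inject_hidden(html_doc: str, payload: str, technique: str = "zero-size") -> str:
    """Return html_doc with payload added in a visually hidden <span>.

    The span goes just before the last closing body tag, or at the end
    of the document if no such tag is found.
    """
    span = f'<span style="{STYLES[technique]}">{html.escape(payload)}</span>'
    closers = list(re.finditer(r"</body\s*>", html_doc, flags=re.IGNORECASE))
    if not closers:
        return html_doc + span
    idx = closers[-1].start()
    return html_doc[:idx] + span + html_doc[idx:]


if __name__ == "__main__":
    doc = "<html><body><p>Visible text</p></body></html>"
    print(inject_hidden(doc, "Hidden content"))
    print(inject_hidden(doc, "Invisible text", technique="transparent"))
```

A browser renders nothing for the injected span, while a text extractor that ignores CSS still returns the payload.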

Supported Formats

PhantomText supports the following document formats:

  • PDF
  • DOCX
  • HTML

Installation

Install PhantomText with pip:

pip install phantomtext

Usage

Content Injection Example

from phantomtext.content_injection import ContentInjector

injector = ContentInjector()
injector.inject_content('document.pdf', 'New Content')

Content Obfuscation Example

from phantomtext.content_obfuscation import ContentObfuscator

obfuscator = ContentObfuscator()

# Basic obfuscation
obfuscated_content = obfuscator.obfuscate_content('Sensitive Information')

# Advanced obfuscation with specific techniques
content = "Sensitive info: email@example.com and phone 123-456-7890."
target = "email@example.com"

# Zero-width character obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="zeroWidthCharacter", 
                                  modality="default", 
                                  file_format="html")

# Homoglyph character obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="homoglyph", 
                                  file_format="pdf")

# Diacritical marks obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="diacritical", 
                                  modality="heavy", 
                                  file_format="docx")

# Bidi/reordering character obfuscation
obfuscated = obfuscator.obfuscate(content, target, 
                                  obfuscation_technique="bidi", 
                                  modality="default", 
                                  file_format="html")

Content Injection Example (Technique-Specific Classes)

from phantomtext.injection.zerosize_injection import ZeroSizeInjection
from phantomtext.injection.transparent_injection import TransparentInjection

# Zero-size injection
injector = ZeroSizeInjection(modality="default", file_format="pdf")
injector.apply(input_document="document.pdf", 
               injection="Hidden content", 
               output_path="injected_document.pdf")

# Transparent injection
injector = TransparentInjection(modality="opacity-0", file_format="html")
injector.apply(input_document="document.html", 
               injection="Invisible text", 
               output_path="injected_document.html")

Supported Attacks

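Obfuscation Attacks

Attack Family | Attack Name | Variant | HTML | DOCX | PDF
Obfuscation | diacritical_marks | default | | |
Obfuscation | diacritical_marks | heavy | | |
Obfuscation | homoglyph_characters | default | | |
Obfuscation | zero_width_characters | default | | |
Obfuscation | zero_width_characters | heavy | | |
Obfuscation | bidi_reordering | default | | |
Obfuscation | bidi_reordering | heavy | | |

Injection Attacks

Attack Family | Attack Name | Variant | HTML | DOCX | PDF
Injection | zero_size | default | | |
Injection | zero_size | close-to-zero | | |
Injection | transparent | default | | |
Injection | transparent | opacity-0 | | |
Injection | transparent | opacity-close-to-zero | | |
Injection | transparent | vanish | | |
Injection | camouflage | default | 🚧 | 🚧 | 🚧
Injection | out_of_bound | default | 🚧 | 🚧 | 🚧
Injection | metadata | default | 🚧 | 🚧 | 🚧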

Legend:

  • ✅ Implemented and working
  • ❌ Not supported for this format
  • 🚧 Placeholder implementation (not yet functional)

File Scanning Example

from phantomtext.file_scanning import FileScanner

scanner = FileScanner()

# Scan a single file
result = scanner.scan_file('document.docx')
print(f"Malicious content found: {result['malicious_content_found']}")
print(f"Vulnerabilities: {result['vulnerabilities']}")

# Scan an entire directory
reports = scanner.scan_dir('./output')
for report in reports:
    if report['malicious_content_found']:
        print(f"⚠️ Issues found in {report['file_path']}")
        for vulnerability in report['vulnerabilities']:
            print(f"  - {vulnerability}")
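
When scanning many files, it can help to aggregate the per-file reports. The sketch below assumes each report is a dict with the keys used in the example above ('file_path', 'malicious_content_found', 'vulnerabilities') and, as a further assumption, that each vulnerability entry is a hashable value such as a string. `summarize` is a hypothetical helper, not part of PhantomText.

```python
from collections import Counter


def summarize(reports):
    """Aggregate scan reports into overall counts.

    Assumes the dict keys shown in the scanning example and that
    vulnerability entries are hashable (e.g. strings).
    """
    flagged = [r for r in reports if r["malicious_content_found"]]
    counts = Counter(v for r in flagged for v in r["vulnerabilities"])
    return {
        "files_scanned": len(reports),
        "files_flagged": len(flagged),
        "by_vulnerability": dict(counts),
    }


if __name__ == "__main__":
    # Hand-written sample reports, not real scanner output.
    sample = [
        {"file_path": "a.html", "malicious_content_found": True,
         "vulnerabilities": ["zero_width_characters"]},
        {"file_path": "b.pdf", "malicious_content_found": False,
         "vulnerabilities": []},
    ]
    print(summarize(sample))
```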

Detection Capabilities

The FileScanner can detect the following obfuscation techniques:

  • Zero-width character sequences
  • Homoglyph character substitutions
  • Diacritical mark insertions
  • Bidirectional text overrides
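
To make these checks concrete, here is a minimal sketch of how each one might be approximated on plain text with the standard library. It is not FileScanner's logic: `find_obfuscation`, the character sets, and the mixed-script heuristic for homoglyphs are assumptions, and the heuristic will misfire on legitimately mixed-script words.

```python
import unicodedata

# Illustrative character sets; real scanners may cover more code points.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_obfuscation(text: str) -> list[str]:
    """Return labels for the obfuscation families that appear in text."""
    findings = []
    if any(ch in ZERO_WIDTH for ch in text):
        findings.append("zero_width_characters")
    if any(ch in BIDI_CONTROLS for ch in text):
        findings.append("bidi_reordering")
    # Only separate combining marks are flagged; precomposed letters
    # such as U+00E9 pass through.
    if any(unicodedata.combining(ch) for ch in text):
        findings.append("diacritical_marks")
    # Heuristic: a single token mixing Latin with Cyrillic or Greek letters.
    for token in text.split():
        scripts = {
            unicodedata.name(ch, "").split(" ")[0]
            for ch in token
            if ch.isalpha()
        }
        if "LATIN" in scripts and scripts & {"CYRILLIC", "GREEK"}:
            findings.append("homoglyph_characters")
            break
    return findings


if __name__ == "__main__":
    print(find_obfuscation("p\u0430ypal login"))  # Cyrillic 'a' inside a Latin word
    print(find_obfuscation("plain text"))
```

The script test relies on the first word of each character's Unicode name, which is a rough proxy for the script property but needs no third-party packages.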

File Sanitization Example

from phantomtext.file_sanitization import FileSanitizer

sanitizer = FileSanitizer()
sanitizer.sanitize_file('malicious_file.txt')
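
As a rough idea of what text-level sanitization can involve, the sketch below strips format-control characters (which include zero-width and bidi controls) and combining marks. It is an assumption-laden illustration, not FileSanitizer's behavior: `sanitize_text` is a hypothetical name, and the approach is deliberately lossy. It also removes legitimate accents and joiners, does not map homoglyphs back to Latin, and does not restore the order of text reversed for a bidi override.

```python
import unicodedata


def sanitize_text(text: str) -> str:
    """Drop format-control (Cf) characters and combining marks.

    Lossy by design: accents are removed along with injected marks.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch
        for ch in decomposed
        if unicodedata.category(ch) != "Cf" and not unicodedata.combining(ch)
    ]
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    dirty = "e\u200bm\u0307ail@example.com"
    print(sanitize_text(dirty))
```

A production sanitizer would likely need an allow-list per language so that accented text and scripts that depend on joiners survive cleaning.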

Citation

If you use PhantomText in your research, please cite our paper:

@article{castagnaro2025hidden,
  title={The Hidden Threat in Plain Text: Attacking RAG Data Loaders},
  author={Castagnaro, Alberto and Salviati, Umberto and Conti, Mauro and Pajola, Luca and Pizzi, Simeone},
  journal={arXiv preprint arXiv:2507.05093},
  year={2025}
}

Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue for any enhancements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for more details.
