LLM Token Obfuscator

A tool for obfuscating text by manipulating token IDs while preserving token count and structure. Originally developed for benchmarking LLM inference performance and prefix-caching behavior by generating test data that preserves token patterns while obscuring the underlying text.

Overview

This project provides a system for obfuscating text by applying a shift to token IDs. The obfuscation is reversible and preserves the token count, making it useful for:

  • Testing LLM systems with obfuscated content
  • Benchmarking tokenization performance
  • Creating privacy-preserving datasets
  • Generating synthetic text with realistic token distributions

Example of obfuscated text patterns:

Original: The quick brown fox jumps
Obfuscated: eng($_ ét rl manga

Original: The quick brown fox runs
Obfuscated: eng($_ ét rl Android

Note how the common prefix "The quick brown" is obfuscated to "eng($_ ét" in both cases, preserving the pattern.
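A deterministic per-token shift reproduces this behavior. The sketch below is a toy illustration (the word-level vocabulary and `obfuscate` helper are hypothetical; the real package shifts BPE token IDs from a model's tokenizer):

```python
# With the same shift, identical token prefixes obfuscate to identical prefixes,
# because every token ID is shifted independently by the same amount.
VOCAB = ["The", "quick", "brown", "fox", "jumps", "runs", "over", "dog"]
IDS = {w: i for i, w in enumerate(VOCAB)}

def obfuscate(text, shift):
    return " ".join(VOCAB[(IDS[w] + shift) % len(VOCAB)] for w in text.split())

a = obfuscate("The quick brown fox jumps", shift=2)
b = obfuscate("The quick brown fox runs", shift=2)
assert a.split()[:4] == b.split()[:4]  # shared prefix survives obfuscation
assert a.split()[4] != b.split()[4]    # the differing last tokens still differ
```

This prefix stability is what makes the obfuscated data useful for exercising prefix caching.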

Installation

pip install llm-obfuscator

From Source

  • Clone the repository:
git clone https://github.com/yourusername/llm-obfusicator.git
cd llm-obfusicator
  • Install dependencies:
pip install -r requirements.txt
  • Install the package in development mode:
pip install -e .

Usage

Command Line Interface

The package provides a command-line interface for easy use:

# Tokenize text
llm-obfuscator tokenize gpt-4 "Hello, world!"

# Obfuscate text
llm-obfuscator obfuscate gpt-4 "Hello, world!"

# Obfuscate text with a fixed shift
llm-obfuscator obfuscate gpt-4 "Hello, world!" --shift 42

Python API

from llm_obfuscator import obfuscate_text, tokenize_text

# Obfuscate text using a specific model's tokenizer
obfuscated = obfuscate_text("gpt-4", "Hello, world!")
print(obfuscated)

# Use a fixed shift value for deterministic results
obfuscated = obfuscate_text("gpt-4", "Hello, world!", shift=42)
print(obfuscated)

# Tokenize text
tokens = tokenize_text("gpt-4", "Hello, world!")
print(tokens)

Supported Models

The system supports both OpenAI and HuggingFace tokenizers:

  • OpenAI models and tiktoken encodings: gpt-4, gpt-3.5-turbo, cl100k_base, etc.
  • HuggingFace models: gpt2, bert-base-uncased, etc.

Testing

The project includes several test suites to validate the obfuscation system:

Running All Tests

The easiest way to run all tests is to use the provided shell script:

# Make the script executable (if needed)
chmod +x run_all_tests.sh

# Run all tests
./run_all_tests.sh

This will run all test files in sequence, including unit tests and specialized test scripts.

Running Basic Tests

# Run all tests
python -m pytest tests/

# Run specific test file
python -m pytest tests/test.py

Specialized Test Scripts

The project includes specialized test scripts for different aspects of the obfuscation system:

# Test with real-world examples
python tests/test_real_world.py

# Test obfuscation demonstration
python tests/test_obfuscation.py

# Test mathematical properties
python tests/test_mapping_properties.py

Validation

The obfuscation system has been validated to ensure:

  • Token Count Preservation: The number of tokens remains the same after obfuscation
  • One-to-One Mapping: The obfuscation is a bijection over the vocabulary, so distinct token IDs never collide
  • Frequency Preservation: Token frequency distributions are preserved
  • Reversibility: The original text can be recovered by applying the reverse shift
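All four properties follow from the modular shift on token IDs. A minimal self-contained sketch (the 100-ID toy vocabulary and `shift_ids` helper are hypothetical, not the package's API) checks each one at the ID level:

```python
from collections import Counter

# Toy check of the four validation properties at the token-ID level.
VOCAB_SIZE = 100  # real tokenizers use tens of thousands of IDs
SHIFT = 42

def shift_ids(ids, shift, vocab_size=VOCAB_SIZE):
    """Shift every token ID by `shift`, wrapping modulo the vocab size."""
    return [(i + shift) % vocab_size for i in ids]

original = [5, 17, 5, 99, 17, 5]
obfuscated = shift_ids(original, SHIFT)

# Token count preservation
assert len(obfuscated) == len(original)
# One-to-one mapping: the whole vocabulary maps onto itself with no collisions
assert sorted(shift_ids(range(VOCAB_SIZE), SHIFT)) == list(range(VOCAB_SIZE))
# Frequency preservation: the distribution of counts is unchanged
assert sorted(Counter(original).values()) == sorted(Counter(obfuscated).values())
# Reversibility: the negative shift recovers the original IDs
assert shift_ids(obfuscated, -SHIFT) == original
```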

How It Works

The obfuscation process works as follows:

  • Text is tokenized using the specified model's tokenizer
  • Each token ID is shifted by a fixed amount (either specified or randomly generated)
  • The shifted tokens are detokenized back to text

The shift operation is performed modulo the tokenizer's vocabulary size (roughly 50,000 for GPT-2-style tokenizers, about 100,000 for cl100k_base) so that every shifted token ID remains within the valid vocabulary range.
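The three steps can be sketched end to end with a toy word-level tokenizer (the vocabulary and `obfuscate` helper here are illustrative; the real package delegates tokenization to the chosen model's tokenizer):

```python
# Toy pipeline: tokenize -> shift token IDs mod vocab size -> detokenize.
VOCAB = ["The", "quick", "brown", "fox", "jumps", "runs", "over", "dog"]
TOKEN_ID = {word: i for i, word in enumerate(VOCAB)}

def obfuscate(text, shift):
    ids = [TOKEN_ID[w] for w in text.split()]           # 1. tokenize
    shifted = [(i + shift) % len(VOCAB) for i in ids]   # 2. shift modulo vocab size
    return " ".join(VOCAB[i] for i in shifted)          # 3. detokenize

obfuscated = obfuscate("The quick brown fox jumps", shift=3)
print(obfuscated)  # -> "fox jumps runs over dog"
# Applying the negative shift reverses the obfuscation exactly.
assert obfuscate(obfuscated, shift=-3) == "The quick brown fox jumps"
```

Because the shift is a bijection on IDs, applying the negative shift (or `vocab_size - shift`) inverts it exactly at the token level.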

Note: Token count preservation has a known margin of error of up to 8% for obfuscated texts, since detokenizing shifted IDs and re-tokenizing the result does not always round-trip to the same number of tokens. We are working on improving this.

License

MIT License

