
Security News
Socket Releases Free Certified Patches for Critical vm2 Sandbox Escape
A critical vm2 sandbox escape can allow untrusted JavaScript to break isolation and execute commands on the host Node.js process.
navi-sanitize
Advanced tools
Input sanitization pipeline for untrusted text. Deterministic. No ML. Legitimate Unicode preserved.
Deterministic input sanitization for untrusted text. Zero dependencies. Legitimate Unicode preserved by design.
Documentation · Getting Started · API Reference · Threat Model
from navi_sanitize import clean
clean("Неllo Wоrld") # "Hello World" — Cyrillic Н/о replaced
clean("price:\u200b 0") # "price: 0" — zero-width space stripped
clean("file\x00.txt") # "file.txt" — null byte removed
See the invisible:
evil = "system\u200b\u200cprompt" # looks like "systemprompt" but has 2 hidden chars
len(evil) # 14 (not 12!)
clean(evil) # "systemprompt" — hidden chars stripped
Opt-in utilities for deeper analysis: decode_evasion() peels nested URL/HTML/hex encodings, detect_scripts() and is_mixed_script() flag mixed-script spoofing.
Untrusted text contains invisible attacks: homoglyph substitution, zero-width characters, null bytes, fullwidth encoding, template/prompt injection delimiters. These bypass validation, poison templates, and fool humans.
navi-sanitize fixes the text before it reaches your application. It doesn't detect attacks — it removes them.
LLM prompt pipelines — User input flows into system prompts, RAG context, and tool calls. Invisible Unicode (tag block characters, bidi overrides) encodes instructions that tokenizers read but humans can't see. Homoglyphs bypass keyword filters. navi-sanitize strips these vectors before text reaches the model, and the pluggable escaper lets you add vendor-specific prompt escaping on top.
Web applications — Jinja2 SSTI, path traversal, and fullwidth encoding bypasses are well-known but tedious to cover manually. A single clean(user_input, escaper=jinja2_escaper) call handles homoglyph-disguised payloads like {{ cоnfig }} (Cyrillic о) that naive escaping misses.
Identity and anti-phishing — pаypal.com (Cyrillic а) renders identically to paypal.com in most fonts. Homoglyph replacement normalizes display names, URLs, and email addresses to catch spoofing that visual inspection misses.
Log analysis and SIEM — Attackers embed bidi overrides and zero-width characters in log entries to hide indicators of compromise from analysts and pattern-matching tools. Sanitizing log data on ingest ensures what you search is what's actually there.
Config and data ingestion — YAML, TOML, and JSON parsed from untrusted sources can carry null bytes that truncate C-extension processing, zero-width characters that break key matching, and homoglyphs that create near-duplicate keys. walk(parsed_config) sanitizes every string in a nested structure in one call.
navi-sanitize is the only library that combines invisible character stripping, homoglyph replacement, NFKC normalization, and pluggable escaping in a single zero-dependency pipeline. Existing tools solve pieces of this problem:
| navi-sanitize | Unidecode / anyascii | confusable_homoglyphs | ftfy | MarkupSafe / nh3 | |
|---|---|---|---|---|---|
| Purpose | Security sanitization | ASCII transliteration | Homoglyph detection | Encoding repair | HTML escaping |
| Invisible chars | Strips 492 (bidi, tag block, ZW, VS, C0/C1) | Incidental | No | Partial (preserves bidi, ZW, VS) | No |
| Homoglyphs | Replaces 66 curated pairs | Transliterates all non-ASCII | Detects only (no replace) | No | No |
| NFKC | Yes | No | No | NFC (NFKC optional) | No |
| Null bytes | Yes | No | No | No | No |
| Preserves Unicode | Yes (CJK, Arabic, emoji intact) | No (destroys all non-ASCII) | Yes | Yes | Yes |
| Pluggable escaper | Yes | No | No | No | N/A (HTML-specific) |
| Dependencies | Zero | Zero | Zero | wcwidth | C ext / Rust ext |
Key differences:
" into "Zhong" and Cyrillic sentences into gibberish. navi-sanitize normalizes only the 66 highest-risk lookalikes and leaves legitimate Unicode intact.navi_sanitize.clean() inside a pydantic AfterValidator or cerberus coercion chain for validated, sanitized output.Every string passes through stages in order. Each stage returns clean output and a warning if it changed anything.
| Stage | What it does |
|---|---|
| Null bytes | Strip \x00 |
| Invisibles | Strip zero-width, Unicode Tag block, bidi controls |
| NFKC | Normalize fullwidth ASCII to standard ASCII |
| Homoglyphs | Replace Cyrillic/Greek lookalikes with Latin equivalents |
| Re-NFKC | Re-normalize after homoglyph replacement (ensures idempotency) |
| Escaper | Pluggable — you choose what to escape for |
The first five stages are universal. The escaper is where you tell the pipeline what the output is for.
from navi_sanitize import clean, jinja2_escaper, path_escaper
# For Jinja2 templates
clean("{{ malicious }}", escaper=jinja2_escaper)
# For filesystem paths
clean("../../etc/passwd", escaper=path_escaper)
# For LLM prompts — bring your own
clean(user_input, escaper=my_prompt_escaper)
# No escaper — just the universal stages
clean(user_input)
An escaper is a function: str -> str. Write one in three lines.
Security note: The escaper runs as the final pipeline stage. Its output is not re-sanitized. Built-in escapers are tested. Custom escapers are your responsibility — a buggy escaper can re-introduce characters the pipeline removed.
# Pydantic — validate then sanitize
from typing import Annotated
from pydantic import BaseModel, AfterValidator
from navi_sanitize import clean
SafeStr = Annotated[str, AfterValidator(clean)]
class UserInput(BaseModel):
name: SafeStr
bio: SafeStr
# FastAPI — sanitize at the edge
from fastapi import Depends, Query
from navi_sanitize import clean
def safe_query(q: str = Query()) -> str:
return clean(q)
@app.get("/search")
def search(q: str = Depends(safe_query)):
return {"results": find(q)}
# Jinja2 — sanitize before rendering
from navi_sanitize import clean, jinja2_escaper
safe_context = {k: clean(v, escaper=jinja2_escaper) for k, v in user_data.items()}
template.render(**safe_context)
See examples/ for runnable scripts covering LLM pipelines, FastAPI/Pydantic, and log sanitization.
pip install navi-sanitize
from navi_sanitize import walk
# Recursively sanitize every string in a dict/list
spec = walk(untrusted_json)
walk() warns when nesting exceeds 128 levels by default; pass max_depth= to adjust. Traverses dicts and lists only — tuples and sets pass through by reference.
These utilities are not part of clean() and are never run automatically. You must call them explicitly.
from navi_sanitize import decode_evasion, clean, detect_scripts, is_mixed_script, path_escaper
# Double-encoded path traversal
raw = "%252e%252e%252fetc%252fpasswd"
# 1. Peel nested encodings (URL → HTML entities → hex escapes)
peeled = decode_evasion(raw) # "../../etc/passwd"
# 2. Sanitize through the universal pipeline
cleaned = clean(peeled, escaper=path_escaper) # "etc/passwd"
# 3. Check for mixed-script spoofing (useful on raw or pre-clean input)
if is_mixed_script(raw) or is_mixed_script(peeled):
flag_for_review(raw)
decode_evasion(text, *, max_layers=3) — iterative URL/HTML/hex decoding; stops when a pass produces no changedetect_scripts(text) — returns script buckets present in text (latin, cyrillic, greek, etc.)is_mixed_script(text) — True when 2+ scripts detectedScript detection can be applied pre-clean too — most useful on raw input for phishing detection.
navi-sanitize operates at the character level. It does not cover:
markupsafe.escape(), nh3.clean())clean())These are different problems with mature, purpose-built solutions. navi-sanitize handles what they don't: the invisible, character-level content that slips past them.
The pipeline never errors on valid string input. It always produces output. Non-string arguments raise TypeError. When it changes something, it logs a warning.
import logging
logging.basicConfig()
clean("pаypal.com")
# WARNING:navi_sanitize:Replaced 1 homoglyph(s) in value
# Returns: "paypal.com"
Measured on Python 3.12, single thread. clean() is the per-string cost; walk() includes the iterative copy pass.
| Scenario | Mean | Ops/sec |
|---|---|---|
clean() — short, clean text (no-op) | 2.8 us | 358K |
clean() — short, hostile (all stages fire) | 67 us | 15K |
clean() — 13KB clean text | 810 us | 1.2K |
clean() — 10KB hostile text | 449 us | 2.2K |
clean() — 100KB hostile payload | 5.7 ms | 176 |
walk() — 100-item nested dict, clean | 537 us | 1.9K |
walk() — 100-item nested dict, hostile | 6.9 ms | 144 |
MIT
FAQs
Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Security News
A critical vm2 sandbox escape can allow untrusted JavaScript to break isolation and execute commands on the host Node.js process.

Research
Five malicious NuGet packages impersonate Chinese .NET libraries to deploy a stealer targeting browser credentials, crypto wallets, SSH keys, and local files.

Security News
pnpm 11 turns on a 1-day Minimum Release Age and blocks exotic subdeps by default, adding safeguards against fast-moving supply chain attacks.