## Security Vulnerability: Unicode Homoglyph Bypass in External Content Wrapping Function
The `wrapExternalContent` function (and its derivatives `wrapWebContent` and `buildSafeExternalPrompt`) wraps untrusted external content in special boundary markers (`<<<EXTERNAL_UNTRUSTED_CONTENT id="...">>>`) to separate it from trusted instructions. To prevent spoofing with visually similar Unicode characters, the content first passes through `foldMarkerText`, which maps certain Unicode homoglyphs to their ASCII equivalents (e.g., fullwidth `＜` (U+FF1C) → `<`).

However, the current mapping table (`ANGLE_BRACKET_MAP`) covers only a subset of the characters that can stand in for `<` and `>`. Many other Unicode characters that resemble angle brackets are missing, including guillemets, double angle brackets, and ornamental variants. An attacker could inject a marker such as `«EXTERNAL_UNTRUSTED_CONTENT»`. Because the guillemets are not normalized, the subsequent regex (`/<<<EXTERNAL_UNTRUSTED_CONTENT...>>>/gi`) fails to detect it, and the spoofed marker is left untouched.

If the underlying LLM treats these Unicode characters as equivalent to `<` and `>` (through tokenization or Unicode normalization), the security boundary can be bypassed, potentially allowing malicious instruction injection. This issue is a follow-up to previous security reports (#30948 and #30951) and is a proactive hardening measure, not a response to an active exploit.
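The hardening described above can be illustrated with a minimal TypeScript sketch. It is not the project's actual code: the report does not show the contents of `ANGLE_BRACKET_MAP` or the full regex, so the table `EXTRA_ANGLE_HOMOGLYPHS`, the helpers `foldAngleHomoglyphs` and `neutralizeSpoofedMarkers`, the tolerant pattern, and the `[[MARKER_SANITIZED]]` placeholder are all hypothetical names chosen for illustration. The code point list is also an assumption and is not exhaustive.

```typescript
// Hypothetical extended homoglyph table (NOT the real ANGLE_BRACKET_MAP).
// Every entry maps one code unit to one ASCII character, so string offsets
// stay aligned between the original text and the folded text.
const EXTRA_ANGLE_HOMOGLYPHS: Record<string, "<" | ">"> = {
  "\uFF1C": "<", "\uFF1E": ">", // fullwidth less-than / greater-than
  "\u00AB": "<", "\u00BB": ">", // guillemets (double angle quotation marks)
  "\u2039": "<", "\u203A": ">", // single angle quotation marks
  "\u2329": "<", "\u232A": ">", // left/right-pointing angle brackets
  "\u27E8": "<", "\u27E9": ">", // mathematical angle brackets
  "\u27EA": "<", "\u27EB": ">", // mathematical double angle brackets
  "\u3008": "<", "\u3009": ">", // CJK angle brackets
  "\u300A": "<", "\u300B": ">", // CJK double angle brackets
  "\u276C": "<", "\u276D": ">", // medium angle bracket ornaments
  "\u276E": "<", "\u276F": ">", // heavy angle quotation mark ornaments
};

// Character class built from the table keys; none of them is special in a class.
const HOMOGLYPH_PATTERN = new RegExp(
  "[" + Object.keys(EXTRA_ANGLE_HOMOGLYPHS).join("") + "]",
  "g",
);

// Fold every listed homoglyph to its ASCII angle bracket.
function foldAngleHomoglyphs(text: string): string {
  return text.replace(HOMOGLYPH_PATTERN, (ch) => EXTRA_ANGLE_HOMOGLYPHS[ch] ?? ch);
}

// Assumed tolerant pattern: one or more brackets on each side, because a
// double-bracket glyph folds to a single ASCII bracket in this sketch.
const SPOOFED_MARKER_SOURCE = "<+\\s*EXTERNAL_UNTRUSTED_CONTENT[^<>]*>+";

// Detect markers on the folded copy, but rewrite only the matching ranges of
// the original text, so legitimate guillemets elsewhere are left untouched.
function neutralizeSpoofedMarkers(text: string): string {
  const folded = foldAngleHomoglyphs(text);
  const pattern = new RegExp(SPOOFED_MARKER_SOURCE, "gi");
  let out = "";
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(folded)) !== null) {
    out += text.slice(last, match.index) + "[[MARKER_SANITIZED]]";
    last = match.index + match[0].length;
  }
  return out + text.slice(last);
}
```

In this sketch, input containing `«EXTERNAL_UNTRUSTED_CONTENT»` has the spoofed marker replaced by the placeholder, while ordinary quoted text such as `«bonjour»` passes through unchanged. Matching on a folded copy rather than rewriting the whole input is a design choice made here to avoid altering legitimate non-ASCII content. A hand-maintained table will always lag behind the set of confusable characters, so it is a mitigation and not a complete defense.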
---
- **Source**: 
- **Sector**: The Network
- **Tags**: security, spoofing, unicode, regex, llm
- **Credibility**: unverified
- **Published**: 2026-03-06 03:13:07
- **ID**: 2303
- **URL**: https://whisperx.ai/en/intel/2303