
unicode-segmenter
A lightweight implementation of the Unicode Text Segmentation (UAX #29)
Spec compliant: Up-to-date Unicode data, verified by the official Unicode test suites and fuzzed with the native Intl.Segmenter, and maintaining 100% test coverage.
Excellent compatibility: It works well on older browsers, edge runtimes, React Native (Hermes) and QuickJS.
Zero dependencies: It doesn't bloat node_modules or network bandwidth, much like a small snippet.
Small bundle size: It effectively compresses the Unicode data and provides a bundler-friendly format.
Extremely efficient: It's carefully optimized for runtime performance, making it the fastest one in the ecosystem—outperforming even the built-in Intl.Segmenter.
TypeScript: It's fully type-checked, and provides type definitions and JSDoc.
ESM-first: It primarily supports ES modules, and still supports CommonJS.
[!NOTE] unicode-segmenter is now an e18e recommendation!
Unicode® 17.0.0
Unicode® Standard Annex #29 - Revision 47 (2025-08-17)
Entries for Unicode text segmentation.
unicode-segmenter/grapheme: Segments and counts extended grapheme clusters
unicode-segmenter/intl-adapter: Intl.Segmenter adapter
unicode-segmenter/intl-polyfill: Intl.Segmenter polyfill
And matchers for extra use cases.
unicode-segmenter/emoji: Matches single codepoint emojis
unicode-segmenter/general: Matches single codepoint alphanumerics
unicode-segmenter/grapheme
Utilities for text segmentation by extended grapheme cluster rules.
import { graphemeSegments } from 'unicode-segmenter/grapheme';
[...graphemeSegments('a̐éö̲\r\n')];
// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
// 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
// 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }
import { splitGraphemes } from 'unicode-segmenter/grapheme';
[...splitGraphemes('#️⃣*️⃣0️⃣1️⃣2️⃣')];
// 0: #️⃣
// 1: *️⃣
// 2: 0️⃣
// 3: 1️⃣
// 4: 2️⃣
import { countGraphemes } from 'unicode-segmenter/grapheme';
'👋 안녕!'.length;
// => 6
countGraphemes('👋 안녕!');
// => 5
'a̐éö̲'.length;
// => 7
countGraphemes('a̐éö̲');
// => 3
[!NOTE]
countGraphemes() is a small wrapper around graphemeSegments(). If you need it more than once at a time, consider memoization or use graphemeSegments() or splitGraphemes() once instead.
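The advice in the note can be sketched as "segment once, derive everything from the result". This is only an illustration: it uses the built-in Intl.Segmenter as a stand-in so it runs without installing anything, and `splitGraphemesStandIn` and `summarize` are hypothetical names, not part of the library. With the package installed, spreading `splitGraphemes(str)` into an array would take the stand-in's place.

```javascript
// Stand-in for `splitGraphemes` from 'unicode-segmenter/grapheme' (assumption:
// swapped in so the sketch runs without installing the package).
function splitGraphemesStandIn(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(str), (s) => s.segment);
}

// Hypothetical helper: segment once, then derive every value from the array
// instead of calling a counting function repeatedly.
function summarize(str, limit) {
  const graphemes = splitGraphemesStandIn(str);
  return {
    count: graphemes.length,
    first: graphemes[0] ?? '',
    truncated: graphemes.slice(0, limit).join(''),
  };
}

const summary = summarize('👋 안녕!', 3);
// summary.count === 5, summary.truncated === '👋 안'
```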
graphemeSegments() also exposes some information gathered during segmentation to support advanced use cases.
For example, knowing the Grapheme_Cluster_Break category at the beginning and end of a segment can help approximately infer the applied boundary rule.
import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';
function* matchEmoji(str) {
  // Iterate over the `str` parameter
  for (const { segment, _catBegin } of graphemeSegments(str)) {
    // `_catBegin` identified as Extended_Pictographic means the segment is emoji
    if (_catBegin === GraphemeCategory.Extended_Pictographic) {
      yield segment;
    }
  }
}
[...matchEmoji('1🌷2🎁3💩4😜5👍')]
// 0: 🌷
// 1: 🎁
// 2: 💩
// 3: 😜
// 4: 👍
Or build something more advanced, like a Unicode-aware TTY string width utility.
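A rough sketch of what such a width utility could look like, under loud assumptions: the built-in Intl.Segmenter stands in for `graphemeSegments()` so the snippet runs on its own, the table of wide ranges is deliberately incomplete, and `isWide` and `stringWidth` are hypothetical names. A real utility would need the full East Asian Width data.

```javascript
// Stand-in segmenter (assumption): with the library, iterate
// `graphemeSegments(str)` from 'unicode-segmenter/grapheme' instead.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Deliberately incomplete list of double-width ranges, for illustration only.
const WIDE_RANGES = [
  [0x1100, 0x115f], // Hangul Jamo initial consonants
  [0x3040, 0x30ff], // Hiragana and Katakana
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xac00, 0xd7a3], // Hangul Syllables
  [0xff01, 0xff60], // Fullwidth forms
];

function isWide(cp) {
  return WIDE_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Hypothetical helper: each grapheme cluster occupies one or two columns,
// decided by its first code point, so combining marks add no width.
function stringWidth(str) {
  let width = 0;
  for (const { segment } of segmenter.segment(str)) {
    const cp = segment.codePointAt(0);
    if (cp < 0x20 || cp === 0x7f) continue; // control characters take no columns
    width += isWide(cp) || /\p{Emoji_Presentation}/u.test(segment) ? 2 : 1;
  }
  return width;
}

stringWidth('abc'); // 3
stringWidth('안녕'); // 4
```

Measuring by grapheme cluster rather than by code unit is the point of the design: a base letter followed by combining marks counts as a single column.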
unicode-segmenter/intl-adapter
Intl.Segmenter API adapter (only granularity: "grapheme" is available so far)
import { Segmenter } from 'unicode-segmenter/intl-adapter';
// Same API as `Intl.Segmenter`
const segmenter = new Segmenter();
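Because the adapter is described as having the same API as Intl.Segmenter, calling code can be written against either constructor. The sketch below assumes exactly that and nothing more; `listGraphemes` is a hypothetical helper, and the built-in is passed in so the snippet runs without the package installed.

```javascript
// Hypothetical helper: works with any constructor that follows the
// `Intl.Segmenter` API, which the adapter is documented to mirror.
function listGraphemes(SegmenterCtor, str) {
  const segmenter = new SegmenterCtor();
  return Array.from(segmenter.segment(str), ({ segment, index }) => ({ segment, index }));
}

// The built-in is used here; `Segmenter` from 'unicode-segmenter/intl-adapter'
// would be passed the same way.
const result = listGraphemes(Intl.Segmenter, 'a\u0310e\u0301!');
// result[1] is { segment: 'e\u0301', index: 2 }
```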
unicode-segmenter/intl-polyfill
Intl.Segmenter API polyfill (only granularity: "grapheme" is available so far)
// Apply polyfill to the `globalThis.Intl` object.
import 'unicode-segmenter/intl-polyfill';
const segmenter = new Intl.Segmenter();
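If the polyfill should load only where the built-in is missing, one possible pattern is a feature check followed by a dynamic import. This is a sketch, not the library's documented usage: `hasNativeSegmenter` and `ensureSegmenter` are hypothetical names, and it assumes the runtime or bundler supports dynamic `import()`.

```javascript
// True when the host already provides Intl.Segmenter.
function hasNativeSegmenter() {
  return typeof Intl === 'object' && typeof Intl.Segmenter === 'function';
}

// Hypothetical loader: pull in the polyfill only when the built-in is missing.
async function ensureSegmenter() {
  if (!hasNativeSegmenter()) {
    await import('unicode-segmenter/intl-polyfill');
  }
  return new Intl.Segmenter();
}
```

Note that a host-provided Intl.Segmenter may ship older Unicode data than the library, so skipping the polyfill trades data freshness for bundle size.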
unicode-segmenter/emoji
Utilities for matching emoji-like characters.
import {
isEmojiPresentation, // match \p{Emoji_Presentation}
isExtendedPictographic, // match \p{Extended_Pictographic}
} from 'unicode-segmenter/emoji';
isEmojiPresentation('😍'.codePointAt(0));
// => true
isEmojiPresentation('♡'.codePointAt(0));
// => false
isExtendedPictographic('😍'.codePointAt(0));
// => true
isExtendedPictographic('♡'.codePointAt(0));
// => true
unicode-segmenter/general
Utilities for matching alphanumeric characters.
import {
isLetter, // match \p{L}
isNumeric, // match \p{N}
isAlphabetic, // match \p{Alphabetic}
isAlphanumeric, // match [\p{N}\p{Alphabetic}]
} from 'unicode-segmenter/general';
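To make the property names in those comments concrete, here are regex-based equivalents built on Unicode property escapes. They are stand-ins with hypothetical names, not the library's implementation, and they assume the matchers take a code point the way the emoji matchers do in the examples above.

```javascript
// Regex-based stand-ins (hypothetical names) that take a code point.
const fromCp = (cp) => String.fromCodePoint(cp);
const isLetterRe = (cp) => /^\p{L}$/u.test(fromCp(cp));
const isNumericRe = (cp) => /^\p{N}$/u.test(fromCp(cp));
const isAlphabeticRe = (cp) => /^\p{Alphabetic}$/u.test(fromCp(cp));
const isAlphanumericRe = (cp) => /^[\p{N}\p{Alphabetic}]$/u.test(fromCp(cp));

const romanEight = 0x2167; // 'Ⅷ', general category Nl (letter number)

isLetterRe('a'.codePointAt(0)); // true
isLetterRe(romanEight); // false: Nl is not part of \p{L}
isNumericRe(romanEight); // true: \p{N} covers Nd, Nl, and No
isAlphabeticRe(romanEight); // true: Alphabetic includes Nl
```

The Roman numeral shows why `\p{L}` and `\p{Alphabetic}` are listed separately: the two properties overlap but are not the same set.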
unicode-segmenter uses only fundamental features of ES2015, making it compatible with most browsers.
To ensure compatibility, the runtime should support:
If the runtime doesn't support these features, they can easily be provided with tools like Babel.
Since Hermes doesn't support the Intl.Segmenter API yet, unicode-segmenter is a good alternative.
unicode-segmenter compiles to smaller and more efficient Hermes bytecode than other JavaScript libraries. See the benchmark for details.
unicode-segmenter aims to be lighter and faster than alternatives in the ecosystem while remaining fully spec compliant. The benchmark therefore tracks several libraries' performance, bundle size, and Unicode version compliance.
unicode-segmenter/grapheme vs Intl.Segmenter API

| Name | Unicode® | ESM? | Size | Size (min) | Size (min+gzip) | Size (min+br) | Size (min+zstd) |
|---|---|---|---|---|---|---|---|
| unicode-segmenter/grapheme | 17.0.0 | ✔️ | 11,873 | 7,754 | 3,857 | 3,121 | 3,984 |
| graphemer | 15.0.0 | ✖️ | 410,435 | 95,104 | 15,752 | 10,660 | 15,911 |
| grapheme-splitter | 10.0.0 | ✖️ | 122,254 | 23,682 | 7,852 | 4,802 | 6,753 |
| @formatjs/intl-segmenter* | 17.0.0 | ✖️ | 268,301 | 176,759 | 45,988 | 31,701 | 45,370 |
| unicode-segmentation* | 15.1.0 | - | 56,529 | 52,439 | 24,108 | 17,343 | 24,375 |
| Intl.Segmenter* | - | - | 0 | 0 | 0 | 0 | 0 |

@formatjs/intl-segmenter handles grapheme, word, and sentence, but it's not tree-shakable.
The unicode-segmentation size contains only the minimum WASM binary and the bindings needed to run the benchmark. It will increase if more features are exposed.
Intl.Segmenter's Unicode data depends on the host, and may not be up-to-date.
Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.

| Name | Bytecode size | Bytecode size (gzip)* |
|---|---|---|
| unicode-segmenter/grapheme | 21,395 | 12,027 |
| graphemer | 134,085 | 31,770 |
| grapheme-splitter | 63,942 | 19,165 |
| @formatjs/intl-segmenter | 329,547 | 136,751 |
Here is a brief explanation, and you can see archived benchmark results.
Performance in Node.js/Bun/Deno: unicode-segmenter/grapheme has best-in-class performance, outperforming even the built-in Intl.Segmenter.
Performance in Browsers: Performance in browser environments varies greatly due to differences in browser engines, which makes benchmarking inconsistent.
Performance in React Native: unicode-segmenter/grapheme is still faster than alternatives when compiled to Hermes bytecode. It's 3~8x faster than graphemer and 20~26x faster than grapheme-splitter, with the performance gap increasing with input size.
Performance in QuickJS: unicode-segmenter/grapheme is the only usable library in terms of performance.
Instead of trusting these claims, you can try yarn perf:grapheme directly in your environment or build your own benchmark.
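For the "build your own benchmark" route, a minimal timing harness might look like the following. It is a sketch under assumptions: `bench` and `countWithIntl` are hypothetical names, only the built-in Intl.Segmenter is measured so the snippet runs without any packages, and a global `performance` object is assumed. To compare libraries, pass another counting function such as `countGraphemes` to `bench` in the same way.

```javascript
// Hypothetical harness: times `fn(input)` over several rounds and returns
// the fastest round in milliseconds.
function bench(fn, input, rounds = 5, iterations = 1000) {
  let best = Infinity;
  for (let r = 0; r < rounds; r++) {
    const start = performance.now();
    for (let i = 0; i < iterations; i++) fn(input);
    best = Math.min(best, performance.now() - start);
  }
  return best;
}

// Baseline candidate: count graphemes with the built-in segmenter.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const countWithIntl = (str) => {
  let n = 0;
  for (const _ of segmenter.segment(str)) n++;
  return n;
};

const ms = bench(countWithIntl, '👋 안녕! '.repeat(100));
console.log(`Intl.Segmenter: ${ms.toFixed(2)} ms per 1000 runs`);
```

Taking the fastest of several rounds reduces noise from JIT warm-up and garbage collection, though results will still vary by engine and input.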
The Rust Unicode team (@unicode-rs):
The initial implementation was ported manually from the unicode-segmentation library.
Marijn Haverbeke (@marijnh):
His library inspired a technique that greatly compresses the Unicode data tables.
FAQs
The npm package unicode-segmenter receives a total of 76,682 weekly downloads. As such, unicode-segmenter is classified as popular.
We found that unicode-segmenter demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.