
A JavaScript library that breaks strings into their individual user-perceived characters (including emojis!)
The graphemer npm package is designed to handle and manipulate strings in a way that is aware of grapheme clusters, which are the visually recognizable characters, taking into account complex script behaviors and emoji sequences. This is particularly useful for applications that need to correctly count, split, or iterate over characters in strings that include multi-byte characters, emoji, and other composite characters that standard JavaScript string methods may not handle correctly.
Splitting strings into grapheme clusters
This feature allows you to split a string into an array of its constituent grapheme clusters, which is especially useful for strings containing emojis or characters from complex scripts.
const Graphemer = require('graphemer').default;
const splitter = new Graphemer();
const clusters = splitter.splitGraphemes('👋🏽 World');
console.log(clusters); // ["👋🏽", " ", "W", "o", "r", "l", "d"]
Counting grapheme clusters in a string
This feature provides the ability to count the number of grapheme clusters in a string, offering a more accurate character count for strings that include complex characters.
const Graphemer = require('graphemer').default;
const splitter = new Graphemer();
const count = splitter.countGraphemes('👋🏽 World');
console.log(count); // 7
Iterating over grapheme clusters in a string
This feature enables iteration over each grapheme cluster in a string, allowing for operations to be performed on each visual character unit.
const Graphemer = require('graphemer').default;
const splitter = new Graphemer();
for (const cluster of splitter.iterateGraphemes('👋🏽 World')) {
  console.log(cluster);
}
Similar to graphemer, split-graphemes provides functionality for splitting strings into grapheme clusters. However, graphemer offers additional features such as counting and iterating over graphemes, making it a more versatile choice for complex string manipulation.
While unicode-properties offers detailed information about Unicode characters, including their properties and categories, graphemer focuses specifically on the manipulation and handling of grapheme clusters within strings. This makes graphemer more specialized for tasks involving visual character units.
This library continues the work of Grapheme Splitter and supports the following Unicode versions:
[v1.4.0]
[v1.3.0]
[v1.1.0]
[v1.0.0] (Unicode 10 supported by grapheme-splitter)
In JavaScript there is not always a one-to-one relationship between string characters and what a user would call a separate visual "letter". Some symbols are represented by several characters. This can cause issues when splitting strings and inadvertently cutting a multi-char letter in half, or when you need the actual number of letters in a string.
For example, emoji characters like "🌷","🎁","💩","😜" and "👍" are represented by two JavaScript characters each (high surrogate and low surrogate). That is,
'🌷'.length == 2;
The combined emoji are even longer:
'🏳️‍🌈'.length == 6; // white flag + variation selector + zero-width joiner + rainbow
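To make the gap between UTF-16 code units and code points concrete, here is a minimal sketch that uses only built-in JavaScript. The variable names are illustrative, and the emoji are written as escape sequences so that the invisible zero-width joiner in the rainbow flag survives copy and paste.

```javascript
// Emoji written as escapes so invisible characters are not lost in transit.
const tulip = '\u{1F337}'; // 🌷
const rainbowFlag = '\u{1F3F3}\uFE0F\u200D\u{1F308}'; // white flag + VS16 + ZWJ + rainbow

// .length counts UTF-16 code units, not user-perceived characters.
console.log(tulip.length);       // 2
console.log(rainbowFlag.length); // 6

// Array.from iterates by code point: this repairs surrogate pairs,
// but still breaks the flag sequence into its 4 code points.
console.log(Array.from(tulip).length);       // 1
console.log(Array.from(rainbowFlag).length); // 4

// Slicing by code unit can cut a surrogate pair in half.
const half = tulip.slice(0, 1);
console.log(half === '\uD83C'); // true: a lone high surrogate
```

Iterating by code point is therefore a partial fix only: it handles single emoji, but not sequences that several code points build together.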
What's more, some languages often include combining marks - characters that are used to modify the letters before them. Common examples are the German letter ü and the Spanish letter ñ. Sometimes they can be represented alternatively both as a single character and as a letter + combining mark, with both forms equally valid:
var two = 'n\u0303'; // unnormalized two-char n + combining tilde (◌̃), renders as ñ
var one = '\u00F1'; // normalized single-char ñ
console.log(one != two); // prints 'true'
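As a small illustration of what normalization does for this pair, the sketch below uses the built-in String.prototype.normalize with the NFC and NFD forms. The variable names are made up for the example.

```javascript
const decomposed = '\u006E\u0303'; // n + combining tilde (two code units)
const composed = '\u00F1';         // precomposed ñ (one code unit)

// The two strings render identically but are not equal.
console.log(decomposed === composed); // false

// NFC composes where a precomposed code point exists; NFD decomposes.
console.log(decomposed.normalize('NFC') === composed); // true
console.log(composed.normalize('NFD') === decomposed); // true
console.log(decomposed.normalize('NFC').length);       // 1
```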
Unicode normalization, as performed by ECMAScript 6's String.prototype.normalize, can sometimes fix those differences and turn two-char sequences into single characters, but it is not enough in all cases. Some languages, like Hindi, make extensive use of combining marks on letters, and the resulting combinations have no dedicated single-codepoint form due to the sheer number of possibilities. For example, the Hindi word "अनुच्छेद" consists of 5 letters and 3 combining marks:
अ + न + ु + च + ् + छ + े + द
which is in fact just 5 user-perceived letters:
अ + नु + च् + छे + द
and which Unicode normalization would not combine properly. There are also the unusual letter+combining mark combinations which have no dedicated Unicode codepoint. The string Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘ obviously has 5 separate letters, but is in fact comprised of 58 JavaScript characters, most of which are combining marks.
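The limit of normalization for the Hindi example can be checked directly. This is a minimal sketch with the word spelled as escape sequences (the variable names are illustrative): NFC leaves the string unchanged because no precomposed code points exist for these letter and mark pairs.

```javascript
// "अनुच्छेद" as escapes: 5 letters and 3 combining marks.
const word = '\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926';

const nfc = word.normalize('NFC');
console.log(word.length);  // 8
console.log(nfc.length);   // 8: nothing could be composed
console.log(nfc === word); // true
```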
Enter the graphemer library. It can be used to properly split JavaScript strings into what a human user would call separate letters (or "extended grapheme clusters" in Unicode terminology), no matter what their internal representation is. It is an implementation of the Default Grapheme Cluster Boundary rules of UAX #29.
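To give a feel for what grouping code points into clusters involves, here is a deliberately simplified sketch. It is not graphemer's algorithm and not a UAX #29 implementation: the hypothetical helper naiveSplitGraphemes only attaches combining marks (Unicode general category M) to the preceding code point, and it ignores emoji ZWJ sequences, Hangul Jamo, regional-indicator flags and CRLF.

```javascript
// Simplified approximation: one non-mark code point followed by any number
// of combining marks. The second alternative catches marks with no base.
function naiveSplitGraphemes(text) {
  return text.match(/\P{M}\p{M}*|\p{M}+/gu) || [];
}

// n + combining tilde stays together with its base letter.
console.log(naiveSplitGraphemes('n\u0303o').length); // 2

// "अनुच्छेद": 8 code points grouped into 5 clusters.
const hindi = '\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926';
console.log(naiveSplitGraphemes(hindi).length); // 5

// The u flag makes the pattern match whole code points, so a surrogate
// pair such as 🌷 is never cut in half.
console.log(naiveSplitGraphemes('\u{1F337}').length); // 1
```

The cases this sketch skips are exactly where a full rule set, such as the one graphemer implements, is needed.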
Install graphemer using the npm command below:
$ npm i graphemer
If you're using TypeScript or a compiler like Babel (or something like Create React App), things are pretty simple: just import, initialize and use!
import Graphemer from 'graphemer';
const splitter = new Graphemer();
// split the string to an array of grapheme clusters (one string each)
const graphemes = splitter.splitGraphemes(string);
// iterate the string to an iterable iterator of grapheme clusters (one string each)
const graphemeIterator = splitter.iterateGraphemes(string);
// or do this if you just need their number
const graphemeCount = splitter.countGraphemes(string);
If you're using vanilla Node you can use the require() function.
const Graphemer = require('graphemer').default;
const splitter = new Graphemer();
const graphemes = splitter.splitGraphemes(string);
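One common use of the split result is truncating text without cutting a letter in half. The sketch below is an assumption-laden illustration: truncateGraphemes and standInSplit are hypothetical names, and the stand-in splitter only groups combining marks with their base so that the example runs without dependencies. In real code you would pass something like (s) => splitter.splitGraphemes(s) instead.

```javascript
// `splitGraphemes` is any function that maps a string to an array of clusters.
function truncateGraphemes(text, maxGraphemes, splitGraphemes) {
  const clusters = splitGraphemes(text);
  if (clusters.length <= maxGraphemes) return text;
  return clusters.slice(0, maxGraphemes).join('') + '\u2026'; // append an ellipsis
}

// Stand-in splitter for this demo only; it is not UAX #29 compliant.
const standInSplit = (s) => s.match(/\P{M}\p{M}*|\p{M}+/gu) || [];

const word = '\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926'; // अनुच्छेद

// Cluster-aware truncation keeps the vowel sign attached to its letter.
console.log(truncateGraphemes(word, 2, standInSplit)); // अनु…

// Slicing by code unit drops the vowel sign and changes the syllable.
console.log(word.slice(0, 2)); // अन
```

Taking the splitter as a parameter keeps the helper independent of any particular library, which also makes it easy to test.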
import Graphemer from 'graphemer';
const splitter = new Graphemer();
// plain latin alphabet - nothing spectacular
splitter.splitGraphemes('abcd'); // returns ["a", "b", "c", "d"]
// two-char emojis and six-char combined emoji
splitter.splitGraphemes('🌷🎁💩😜👍🏳️‍🌈'); // returns ["🌷","🎁","💩","😜","👍","🏳️‍🌈"]
// diacritics as combining marks, 10 JavaScript chars
splitter.splitGraphemes('Ĺo͂řȩm̅'); // returns ["Ĺ","o͂","ř","ȩ","m̅"]
// individual Korean characters (Jamo), 4 JavaScript chars
splitter.splitGraphemes('뎌쉐'); // returns ["뎌","쉐"]
// Hindi text with combining marks, 8 JavaScript chars
splitter.splitGraphemes('अनुच्छेद'); // returns ["अ","नु","च्","छे","द"]
// demonic multiple combining marks, 75 JavaScript chars
splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'); // returns ["Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍","A̴̵̜̰͔ͫ͗͢","L̠ͨͧͩ͘","G̴̻͈͍͔̹̑͗̎̅͛́","Ǫ̵̹̻̝̳͂̌̌͘","!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"]
Graphemer is built with TypeScript and, of course, includes type declarations.
import Graphemer from 'graphemer';
const splitter = new Graphemer();
const split: string[] = splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞');
See Contribution Guide.
This library is a fork of the incredible work done by Orlin Georgiev and Huáng Jùnliàng at https://github.com/orling/grapheme-splitter.
The original library was heavily influenced by Devon Govett's excellent grapheme-breaker CoffeeScript library.
FAQs
The npm package graphemer receives a total of 24,931,764 weekly downloads. As such, graphemer popularity was classified as popular.
We found that graphemer demonstrated an unhealthy version release cadence and project activity because the last version was released a year ago. It has 1 open source maintainer collaborating on the project.