What is grapheme-splitter?
The grapheme-splitter npm package is designed to split strings into their constituent graphemes according to the Unicode Standard. This is particularly useful for accurately counting characters, especially in languages or scripts where a single visual character (grapheme cluster) may be composed of multiple Unicode characters. It helps in handling strings in a way that is more aligned with user expectations across different languages and scripts.
What are grapheme-splitter's main functionalities?
Splitting strings into graphemes
This feature splits a string into an array of its constituent graphemes. The code sample passes in a complex emoji built from multiple Unicode code points; code points that belong to one grapheme cluster are returned together as a single array element.
"use strict";
const GraphemeSplitter = require('grapheme-splitter');
const splitter = new GraphemeSplitter();
console.log(splitter.splitGraphemes('👩❤️💋👩'));
Counting graphemes in a string
This feature provides the ability to count the number of graphemes in a string, which can be more accurate than counting code points or bytes, especially for strings containing complex characters or emojis. The code sample demonstrates counting the number of graphemes in a string that visually appears as a single character.
"use strict";
const GraphemeSplitter = require('grapheme-splitter');
const splitter = new GraphemeSplitter();
console.log(splitter.countGraphemes('👩❤️💋👩'));
Other packages similar to grapheme-splitter
stringz
stringz is a package that provides robust string manipulation functions, including length calculation and substring extraction, taking into account Unicode characters and grapheme clusters. It compares to grapheme-splitter by offering a broader set of string manipulation functionalities while also handling grapheme clusters accurately.
Background
In JavaScript there is not always a one-to-one relationship between string characters and what a user would call a separate visual "letter". Some symbols are represented by several characters. This can cause issues when splitting strings and inadvertently cutting a multi-char letter in half, or when you need the actual number of letters in a string.
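The splitting problem described above can be reproduced with built-in string methods alone. The sketch below is only an illustration: it writes the tulip emoji (U+1F337) as explicit escapes, an assumption made so the example does not depend on how this page is encoded, and shows how slicing by JavaScript characters cuts the emoji in half.

```javascript
// U+1F337 TULIP written as its UTF-16 surrogate pair.
const tulip = "\uD83C\uDF37";
const text = "a" + tulip + "b";

// length counts UTF-16 code units: 1 + 2 + 1.
console.log(text.length); // 4

// Taking the "first two characters" keeps "a" plus a lone high surrogate.
const cut = text.slice(0, 2);
console.log(cut.length); // 2
console.log(cut.charCodeAt(1).toString(16)); // d83c

// Splitting on the empty string also separates the two surrogates.
const pieces = text.split("");
console.log(pieces.length); // 4
```

The lone surrogate left at the end of `cut` is not a valid character on its own, which is why a user would see a broken symbol there.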
For example, emoji characters like "🌷","🎁","💩","😜" and "👍" are represented by two JavaScript characters each (high surrogate and low surrogate). That is,
"🌷".length == 2
The combined emoji are even longer:
"🏳️🌈".length == 6
What's more, some languages often include combining marks - characters that are used to modify the letters before them. Common examples are the German letter ü and the Spanish letter ñ. Sometimes they can be represented alternatively both as a single character and as a letter + combining mark, with both forms equally valid:
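var two = "n\u0303"; // "n" followed by a combining tilde (two characters)
var one = "\u00F1"; // precomposed "ñ" (one character)
console.log(one != two); // prints true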
Unicode normalization, as performed by the popular punycode.js library or ECMAScript 6's String.prototype.normalize, can sometimes fix those differences and turn two-char sequences into single characters. But it is not enough in all cases. Some languages like Hindi make extensive use of combining marks on their letters, and many of the resulting combinations have no dedicated single code point because of the sheer number of possibilities.
For example, the Hindi word "अनुच्छेद" consists of 5 letters and 3 combining marks:
अ + न + ु + च + ् + छ + े + द
which is in fact just 5 user-perceived letters:
अ + नु + च् + छे + द
and which Unicode normalization would not combine properly.
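The limits of normalization can be checked directly. The sketch below uses only the built-in String.prototype.normalize; the strings are written as escapes (an assumption made for clarity, so the exact code points are visible) and the variable names are illustrative.

```javascript
// "n" followed by U+0303 COMBINING TILDE, and the precomposed U+00F1.
const two = "n\u0303";
const one = "\u00F1";

console.log(one !== two); // true
// NFC composes the two-character form into the single code point.
console.log(one === two.normalize("NFC")); // true

// The Hindi word from the text: 5 letters and 3 combining marks.
const word = "\u0905\u0928\u0941\u091A\u094D\u091B\u0947\u0926";
const normalized = word.normalize("NFC");

// No precomposed forms exist for these combinations, so nothing changes.
console.log(word.length, normalized.length); // 8 8
```

Normalization settles which representation is used, but it does not report where one user-perceived letter ends and the next begins.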
There are also the unusual letter+combining mark combinations which have no dedicated Unicode codepoint. The string Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘ obviously has 5 separate letters, but is in fact comprised of 58 JavaScript characters, most of which are combining marks.
Enter the grapheme-splitter.js library. It can be used to properly split JavaScript strings into what a human user would call separate letters (or "extended grapheme clusters" in Unicode terminology), no matter what their internal representation is. It implements the Default Grapheme Cluster Boundary rules of UAX #29.
Installation
You can use the index.js file directly as-is. Or you can install grapheme-splitter to your project using the NPM command below:
$ npm install --save grapheme-splitter
Tests
To run the tests on grapheme-splitter, use the command below:
$ npm test
Usage
Just initialize and use:
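var splitter = new GraphemeSplitter();
// split the string into an array of grapheme clusters (one string each)
var graphemes = splitter.splitGraphemes(string);
// or get an iterator over the grapheme clusters (one string each)
var graphemeIterator = splitter.iterateGraphemes(string);
// or do this if you only need their number
var graphemeCount = splitter.countGraphemes(string);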
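A common use of these calls is shortening text without cutting a letter in half. The helper below is a hypothetical sketch, not part of the library: `truncateGraphemes` and its parameters are names invented for this example, and it relies only on the `splitGraphemes` method shown above.

```javascript
// Hypothetical helper: keep at most maxGraphemes user-perceived characters.
// `splitter` is any object exposing splitGraphemes(string), for example a
// GraphemeSplitter instance.
function truncateGraphemes(splitter, text, maxGraphemes) {
  const graphemes = splitter.splitGraphemes(text);
  if (graphemes.length <= maxGraphemes) {
    return text; // already short enough, return unchanged
  }
  // Join whole grapheme clusters so no cluster is split.
  return graphemes.slice(0, maxGraphemes).join("");
}

// Usage with the library (assumes grapheme-splitter is installed):
//   const GraphemeSplitter = require('grapheme-splitter');
//   truncateGraphemes(new GraphemeSplitter(), "abcd", 2); // "ab"
```

Passing the splitter in as an argument keeps the helper independent of how the library is loaded, whether through require or by including index.js directly.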
Examples
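var splitter = new GraphemeSplitter();
// plain Latin alphabet: returns ["a", "b", "c", "d"]
splitter.splitGraphemes("abcd");
// two-char emoji and a combined emoji
splitter.splitGraphemes("🌷🎁💩😜👍🏳️🌈");
// diacritics as combining marks
splitter.splitGraphemes("Ĺo͂řȩm̅");
// Korean characters
splitter.splitGraphemes("뎌쉐");
// Hindi text with combining marks: 5 user-perceived letters
splitter.splitGraphemes("अनुच्छेद");
// many stacked combining marks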
splitter.splitGraphemes("Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞");
TypeScript
Grapheme splitter includes TypeScript declarations.
import GraphemeSplitter = require('grapheme-splitter')
const splitter = new GraphemeSplitter()
const split: string[] = splitter.splitGraphemes('Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞')
Acknowledgements
This library is heavily influenced by Devon Govett's excellent grapheme-breaker CoffeeScript library at https://github.com/devongovett/grapheme-breaker with an emphasis on ease of integration and pure JavaScript implementation.