github.com/lovelydeng/uniseg

Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and string width calculation for monospace fonts. Unicode Text Segmentation conforms to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode Line Breaking conforms to Unicode Standard Annex #14 (https://unicode.org/reports/tr14/).

In short, using this package, you can split a string into grapheme clusters (what people would usually refer to as a "character"), into words, and into sentences. Or, in its simplest case, this package allows you to count the number of characters in a string, especially when it contains complex characters such as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or other languages. Additionally, you can use it to implement line breaking (or "word wrapping"), that is, to determine where text can be broken over to the next line when the line is not wide enough to fit the entire text. Finally, you can use it to calculate the display width of a string for monospace fonts.

If you just want to count the number of characters in a string, you can use GraphemeClusterCount. If you want to determine the display width of a string, you can use StringWidth. If you want to iterate over a string, you can use Step, StepString, or the Graphemes class (more convenient but less performant). These provide all of the information at once: grapheme clusters, word boundaries, sentence boundaries, line breaks, and monospace character widths. The specialized functions FirstGraphemeCluster, FirstGraphemeClusterInString, FirstWord, FirstWordInString, FirstSentence, and FirstSentenceInString can be used if only one type of information is needed.

Consider the rainbow flag emoji: 🏳️‍🌈. On most modern systems, it appears as one character. But its string representation actually has 14 bytes, so counting bytes (or using len("🏳️‍🌈")) will not work as expected. Counting runes won't, either: the flag has 4 Unicode code points, thus 4 runes. The standard library calls utf8.RuneCountInString("🏳️‍🌈") and len([]rune("🏳️‍🌈")) both return 4. The GraphemeClusterCount function returns 1 for the rainbow flag emoji. The Graphemes class and a variety of functions in this package allow you to split strings into their grapheme clusters.

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. This package provides methods for determining word boundaries.

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides methods for determining sentence boundaries.

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides methods to determine the positions in a string where a line must be broken, may be broken, or must not be broken.

Monospace width, as referred to in this package, is the width of a string in a monospace font. This is commonly used in terminal user interfaces or in text displays and editors that don't support proportional fonts. A width of 1 corresponds to a single character cell. The C function wcswidth() and its implementations in other programming languages are in widespread use for the same purpose. However, there is no standard for the calculation of such widths, and this package differs from wcswidth() in a number of ways, presumably to generate more visually pleasing results.

To start, we assume that every code point has a width of 1, with the following exceptions: For Hangul grapheme clusters composed of conjoining Jamo and for Regional Indicators (flags), all code points except the first one have a width of 0. For grapheme clusters starting with an Extended Pictographic, any additional code point will force a total width of 2, except if the Variation Selector-15 (U+FE0E) is included, in which case the total width is always 1. Grapheme clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.

Note that whether these widths appear correct depends on your application's render engine, the extent to which it conforms to the Unicode Standard, and its choice of font.

v0.0.0-20221120141218-19f3806b842a

Version published: 2 years ago

Unicode Text Segmentation for Go

This Go package implements Unicode Text Segmentation according to Unicode Standard Annex #29, Unicode Line Breaking according to Unicode Standard Annex #14 (Unicode version 14.0.0), and monospace font string width calculation similar to wcwidth.

Background

Grapheme Clusters

In Go, strings are read-only slices of bytes. They can be turned into Unicode code points using a for ... range loop or by conversion: []rune(str). However, multiple code points may be combined into one user-perceived character, or what the Unicode specification calls a "grapheme cluster". Here are some examples:

String	Bytes (UTF-8)	Code points (runes)	Grapheme clusters
Käse	6 bytes: `4b 61 cc 88 73 65`	5 code points: `4b 61 308 73 65`	4 clusters: `[4b],[61 308],[73],[65]`
🏳️‍🌈	14 bytes: `f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88`	4 code points: `1f3f3 fe0f 200d 1f308`	1 cluster: `[1f3f3 fe0f 200d 1f308]`
🇩🇪	8 bytes: `f0 9f 87 a9 f0 9f 87 aa`	2 code points: `1f1e9 1f1ea`	1 cluster: `[1f1e9 1f1ea]`

This package provides tools to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.
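The byte and code point columns of the table above can be reproduced with the standard library alone, which also shows why neither count matches the number of user-perceived characters. The following is a minimal sketch; the helper name counts is made up for this illustration and is not part of the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper (not part of uniseg). It returns the
// UTF-8 byte length and the number of code points (runes) in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []string{
		"Ka\u0308se",                       // "Käse" written with a combining diaeresis
		"\U0001F3F3\uFE0F\u200D\U0001F308", // rainbow flag
		"\U0001F1E9\U0001F1EA",             // German flag
	}
	for _, s := range samples {
		b, r := counts(s)
		fmt.Printf("%d bytes, %d code points:", b, r)
		// Ranging over a string decodes one code point per iteration.
		for _, cp := range s {
			fmt.Printf(" %x", cp)
		}
		fmt.Println()
	}
}
```

Neither number equals the grapheme cluster count in the last column of the table; that count is what GraphemeClusterCount reports.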

Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Searching may also use word boundaries in determining matching items. This package provides tools to determine word boundaries within strings.

Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides tools to determine sentence boundaries within strings.

Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides tools to determine where a string may or may not be broken and where it must be broken (for example after newline characters).
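To illustrate how break opportunities feed into word wrapping, here is a minimal greedy sketch that uses only the standard library. It assumes the text has already been split into runs that each end at a break opportunity, with display widths known in advance. The segment type and the wrap function are hypothetical names for this illustration; they are not part of the package, and trailing spaces are simply counted toward the line width.

```go
package main

import (
	"fmt"
	"strings"
)

// segment is a hypothetical, pre-computed run of text that ends at a line
// break opportunity. It is an assumption of this sketch, not a uniseg type.
type segment struct {
	text      string
	width     int  // display width in character cells, assumed to be known
	mustBreak bool // true if a line break is mandatory after this run
}

// wrap greedily packs segments into lines of at most maxWidth cells.
// A segment wider than maxWidth is placed on a line of its own.
func wrap(segs []segment, maxWidth int) []string {
	var (
		lines    []string
		cur      strings.Builder
		curWidth int
	)
	flush := func() {
		lines = append(lines, cur.String())
		cur.Reset()
		curWidth = 0
	}
	for _, s := range segs {
		// Start a new line if this run would overflow the current one.
		if curWidth > 0 && curWidth+s.width > maxWidth {
			flush()
		}
		cur.WriteString(s.text)
		curWidth += s.width
		// Mandatory breaks end the line regardless of remaining space.
		if s.mustBreak {
			flush()
		}
	}
	if curWidth > 0 {
		flush()
	}
	return lines
}

func main() {
	segs := []segment{
		{"First ", 6, false},
		{"line.", 5, true},
		{"Second ", 7, false},
		{"line.", 5, false},
	}
	for _, l := range wrap(segs, 8) {
		fmt.Printf("%q\n", l)
	}
}
```

In real use, the runs and their widths would come from a segmentation pass over the string rather than being written out by hand.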

Monospace Width

Most terminals or text displays / text editors using a monospace font (for example source code editors) use a fixed width for each character. Some characters such as emojis or characters found in Asian and other languages may take up more than one character cell. This package provides tools to determine the number of cells a string will take up when displayed in a monospace font. See the package documentation for more information.

Installation

go get github.com/rivo/uniseg

Examples

Counting Characters in a String

n := uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")
fmt.Println(n)
// 2

Calculating the Monospace String Width

width := uniseg.StringWidth("🇩🇪🏳️‍🌈!")
fmt.Println(width)
// 5

Using the `Graphemes` Class

This is the most convenient method of iterating over grapheme clusters:

gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
	fmt.Printf("%x ", gr.Runes())
}
// [1f44d 1f3fc] [21]

Using the `Step` or `StepString` Function

This is orders of magnitude faster than the Graphemes class, but it requires the handling of states and boundaries:

str := "🇩🇪🏳️‍🌈"
state := -1
var c string
for len(str) > 0 {
	c, str, _, state = uniseg.StepString(str, state)
	fmt.Printf("%x ", []rune(c))
}
// [1f1e9 1f1ea] [1f3f3 fe0f 200d 1f308]

Advanced Examples

Breaking into grapheme clusters and evaluating line breaks:

str := "First line.\nSecond line."
state := -1
var (
	c          string
	boundaries int
)
for len(str) > 0 {
	c, str, boundaries, state = uniseg.StepString(str, state)
	fmt.Print(c)
	if boundaries&uniseg.MaskLine == uniseg.LineCanBreak {
		fmt.Print("|")
	} else if boundaries&uniseg.MaskLine == uniseg.LineMustBreak {
		fmt.Print("‖")
	}
}
// First |line.
// ‖Second |line.‖
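The expression boundaries&uniseg.MaskLine in the example above relies on several pieces of information being packed into one integer. The sketch below illustrates that masking idiom in isolation. The constant names echo the package's, but their values are assumptions made for this illustration and should not be read as the package's actual definitions.

```go
package main

import "fmt"

// Illustrative constants only. The names mirror uniseg's MaskLine,
// LineCanBreak, and LineMustBreak, but the values are assumed for this
// sketch and are not taken from the package source.
const (
	lineDontBreak = 0
	lineCanBreak  = 1
	lineMustBreak = 2
	maskLine      = 3 // selects the bits that hold the line break value
	maskWord      = 4 // an unrelated flag stored in the same integer
)

// describeLineBreak extracts the line break value from a packed integer,
// ignoring any other flags that share it.
func describeLineBreak(boundaries int) string {
	switch boundaries & maskLine {
	case lineCanBreak:
		return "can break"
	case lineMustBreak:
		return "must break"
	default:
		return "don't break"
	}
}

func main() {
	fmt.Println(describeLineBreak(lineCanBreak | maskWord))
	fmt.Println(describeLineBreak(lineMustBreak))
	fmt.Println(describeLineBreak(maskWord))
}
```

Masking before comparing matters: without it, a value that also carries another flag would not compare equal to the line break constant.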

If you're only interested in word segmentation, use FirstWord or FirstWordInString:

str := "Hello, world!"
state := -1
var c string
for len(str) > 0 {
	c, str, state = uniseg.FirstWordInString(str, state)
	fmt.Printf("(%s)\n", c)
}
// (Hello)
// (,)
// ( )
// (world)
// (!)

Similarly, use

- FirstGraphemeCluster or FirstGraphemeClusterInString for grapheme cluster determination only,
- FirstSentence or FirstSentenceInString for sentence segmentation only, and
- FirstLineSegment or FirstLineSegmentInString for line breaking / word wrapping (although using Step or StepString is preferred as it will observe grapheme cluster boundaries).

Finally, if you need to reverse a string while preserving grapheme clusters, use ReverseString:

fmt.Println(uniseg.ReverseString("🇩🇪🏳️‍🌈"))
// 🏳️‍🌈🇩🇪
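To see why reversal has to respect grapheme clusters, compare a naive rune-by-rune reversal with a reversal of clusters that have already been split. This sketch uses only the standard library. Both helper names are hypothetical, and the clusters are split by hand here rather than by the package.

```go
package main

import "fmt"

// reverseRunes reverses s one code point at a time. This breaks up
// multi-code-point clusters such as flags.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// reverseClusters reverses the order of grapheme clusters that were split
// beforehand, keeping the code points inside each cluster in order.
func reverseClusters(clusters []string) string {
	out := ""
	for i := len(clusters) - 1; i >= 0; i-- {
		out += clusters[i]
	}
	return out
}

func main() {
	flag := "\U0001F1E9\U0001F1EA" // German flag: two regional indicators

	fmt.Printf("%x\n", []rune(reverseRunes(flag+"!")))
	// [21 1f1ea 1f1e9]

	fmt.Printf("%x\n", []rune(reverseClusters([]string{flag, "!"})))
	// [21 1f1e9 1f1ea]
}
```

The first result swaps the two regional indicators, so the sequence no longer encodes the German flag, while the second keeps the flag intact.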

Documentation

Refer to https://pkg.go.dev/github.com/rivo/uniseg for the package's documentation.

Dependencies

This package does not depend on any packages outside the standard library.

Sponsor this Project

Become a Sponsor on GitHub to support this project!

Your Feedback

Add your issue on GitHub, preferably before submitting any PRs. Feel free to get in touch if you have any questions.

Package last updated on 20 Nov 2022
