uctools
Tools for showing information about Unicode characters (UTF-8) and
performing normalization.
Copyright © 2018, Luís Gomes luismsgomes@gmail.com.
The following command line tools are provided:
ucinfo
writes to stdout information (including the name) about each Unicode character read from stdin
ucenum
enumerates on stdout all Unicode characters of a chosen category
ucnorm
applies a standard Unicode normalization (NFC, NFKC, NFD or NFKD)
ucinfo
The ucinfo tool reads UTF-8 text from stdin and writes to stdout information
about each character, one per line.
The output has 5 tab-separated columns:
1. the character itself, if printable, or an escaped representation of it
2. the decimal codepoint of the character
3. the number of bytes that the character occupies
4. the Unicode category of the character
5. the Unicode name of the character
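The five columns can be approximated with Python's standard library. The following is only a minimal sketch, not the actual ucinfo source: the function name `describe`, the escaping style for non-printable characters and the `<unnamed>` placeholder are assumptions made for illustration.

```python
import unicodedata


def describe(ch):
    """Return five tab-separated fields describing a single character."""
    # Assumption: non-printable characters are shown as Python-style escapes.
    shown = ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
    fields = [
        shown,                              # 1. the character or its escape
        str(ord(ch)),                       # 2. decimal codepoint
        str(len(ch.encode("utf-8"))),       # 3. size in bytes when UTF-8 encoded
        unicodedata.category(ch),           # 4. Unicode category abbreviation
        unicodedata.name(ch, "<unnamed>"),  # 5. Unicode name (placeholder if none)
    ]
    return "\t".join(fields)


# A, LATIN SMALL LETTER E WITH ACUTE, EURO SIGN
for ch in "A\u00e9\u20ac":
    print(describe(ch))
```

Control characters such as the newline have no Unicode name, which is why the sketch passes a default value to `unicodedata.name`.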
ucenum
The ucenum tool takes a category abbreviation as argument and outputs a list
of all characters belonging to that category. The categories are:
L
Letter
Lu
Letter, Uppercase
Ll
Letter, Lowercase
Lt
Letter, Titlecase
Lm
Letter, Modifier
Lo
Letter, Other
M
Mark
Mn
Mark, Nonspacing
Mc
Mark, Spacing Combining
Me
Mark, Enclosing
N
Number
Nd
Number, Decimal Digit
Nl
Number, Letter
No
Number, Other
P
Punctuation
Pc
Punctuation, Connector
Pd
Punctuation, Dash
Ps
Punctuation, Open
Pe
Punctuation, Close
Pi
Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
Pf
Punctuation, Final quote (may behave like Ps or Pe depending on usage)
Po
Punctuation, Other
S
Symbol
Sm
Symbol, Math
Sc
Symbol, Currency
Sk
Symbol, Modifier
So
Symbol, Other
Z
Separator
Zs
Separator, Space
Zl
Separator, Line
Zp
Separator, Paragraph
C
Other
Cc
Other, Control
Cf
Other, Format
Cs
Other, Surrogate
Co
Other, Private Use
Cn
Other, Not Assigned
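An enumeration of this kind can be sketched by scanning every codepoint and filtering on `unicodedata.category`. This is an illustration only: the function name `chars_in_category` is hypothetical, and treating a one-letter abbreviation such as `L` as a prefix that matches `Lu`, `Ll`, etc. is an assumption about how the major categories are handled.

```python
import sys
import unicodedata


def chars_in_category(abbrev):
    """Yield every character whose category starts with the abbreviation."""
    for codepoint in range(sys.maxunicode + 1):
        ch = chr(codepoint)
        # Assumption: prefix matching, so "S" also matches "Sm", "Sc", "Sk", "So".
        if unicodedata.category(ch).startswith(abbrev):
            yield ch


currency = list(chars_in_category("Sc"))
print(len(currency), "".join(currency[:5]))
```

The exact number of characters reported depends on the version of the Unicode database bundled with the Python interpreter.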
ucnorm
This program reads UTF-8 text from stdin and writes it to
stdout after applying the specified normalization algorithm.
The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various ways. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
Even if two Unicode strings look the same to a human reader, if one
has combining characters and the other doesn’t, they may not compare
equal.
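The C WITH CEDILLA example can be checked directly in Python. This short sketch only uses `unicodedata.normalize` from the standard library; the variable names are chosen for illustration.

```python
import unicodedata

precomposed = "\u00c7"       # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "\u0043\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# The two strings render identically but hold different codepoints.
print(precomposed == decomposed)  # False

# After normalizing both sides to the same form, they compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```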
For each character, there are two normal forms:
- Normal form D (NFD) is also known as canonical decomposition, and
  translates each character into its decomposed form.
- Normal form C (NFC) first applies a canonical decomposition, then
  composes pre-combined characters again.
Besides these two forms, there are two further normal forms
based on compatibility equivalence:
- Normal form KD (NFKD) will apply the compatibility decomposition,
  i.e. replace all compatibility characters with their equivalents.
- Normal form KC (NFKC) first applies the compatibility decomposition,
  followed by the canonical composition.
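To see how the four forms differ, the sketch below normalizes a two-character string and lists the resulting codepoints. The helper name `codepoints` and the sample string (C WITH CEDILLA followed by the ligature U+FB01, LATIN SMALL LIGATURE FI) are chosen for illustration and do not come from uctools.

```python
import unicodedata


def codepoints(form, text):
    """Normalize text with the given form and list its codepoints."""
    normalized = unicodedata.normalize(form, text)
    return " ".join(f"U+{ord(ch):04X}" for ch in normalized)


text = "\u00c7\ufb01"  # C WITH CEDILLA, then LATIN SMALL LIGATURE FI

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, codepoints(form, text))

# NFC U+00C7 U+FB01
# NFD U+0043 U+0327 U+FB01
# NFKC U+00C7 U+0066 U+0069
# NFKD U+0043 U+0327 U+0066 U+0069
```

The canonical forms leave the ligature untouched, while the compatibility forms replace it with the plain letters f and i.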
Compatibility decomposition ensures that compatibility-equivalent
characters compare equal (i.e. map to the same codepoints). In Unicode, certain
characters are supported which normally would be unified with other
characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the
same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is
supported in Unicode for compatibility with existing character sets
(e.g. gb2312).
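The Roman numeral example can be verified with a few lines of Python. This is a small illustrative check; the variable names are not part of uctools.

```python
import unicodedata

roman_one = "\u2160"  # ROMAN NUMERAL ONE
latin_i = "\u0049"    # LATIN CAPITAL LETTER I

# Canonical normalization keeps the compatibility character as it is.
canonical = unicodedata.normalize("NFC", roman_one)
print(canonical == latin_i)   # False

# Compatibility normalization replaces it with its equivalent.
compatible = unicodedata.normalize("NFKC", roman_one)
print(compatible == latin_i)  # True
```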
This program uses the normalization algorithms implemented in Python's
standard library. See:
https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize
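A filter of this kind might look roughly like the sketch below. It is not the ucnorm source: the function name `normalize_stream`, the line-by-line processing and the error handling are all assumptions, and in-memory streams stand in for stdin and stdout so that the example is self-contained.

```python
import io
import unicodedata

FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def normalize_stream(form, infile, outfile):
    """Copy infile to outfile, normalizing each line with the given form."""
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form!r}")
    for line in infile:
        outfile.write(unicodedata.normalize(form, line))


# Stand-ins for stdin and stdout (hypothetical input text).
source = io.StringIO("C\u0327a va\n")
sink = io.StringIO()
normalize_stream("NFC", source, sink)
print(sink.getvalue() == "\u00c7a va\n")  # True
```

In a real command line tool, `sys.stdin` and `sys.stdout` would be passed in place of the `io.StringIO` objects.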