uctools

Unicode tools.

1.3.0
PyPI

Maintainers: 1

uctools

Tools for showing information about Unicode characters in UTF-8 text and for performing normalization.

The following command line tools are provided:

ucinfo
    writes to stdout information (including the name) about each Unicode character read from stdin

ucenum
    enumerates on stdout all Unicode characters of a chosen category

ucnorm
    applies a standard Unicode normalization (NFC, NFKC, NFD or NFKD)

ucinfo

The ucinfo tool reads UTF-8 text from stdin and writes to stdout information about each character, one per line. The output has 5 tab-separated columns:

1. the character itself, if printable, or an escaped representation of it
2. the decimal codepoint of the character
3. the number of bytes that the character occupies
4. the Unicode category of the character
5. the Unicode name of the character
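
The five columns can be reproduced with Python's standard library. The following is only a sketch of the idea, not the actual ucinfo source: the function name describe, the escaping rule for non-printable characters, and the <unnamed> placeholder for characters without a Unicode name are all assumptions made for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return five tab-separated columns describing one character."""
    # Column 1: the character itself, or an escape if it is not printable.
    shown = ch if ch.isprintable() else ch.encode("unicode_escape").decode("ascii")
    # Column 2: decimal codepoint.
    codepoint = str(ord(ch))
    # Column 3: number of bytes in the UTF-8 encoding.
    nbytes = str(len(ch.encode("utf-8")))
    # Column 4: two-letter Unicode general category.
    category = unicodedata.category(ch)
    # Column 5: Unicode name; control characters have none, so a
    # placeholder (an assumption of this sketch) is used instead.
    name = unicodedata.name(ch, "<unnamed>")
    return "\t".join([shown, codepoint, nbytes, category, name])


for ch in "A\u00e9\u20ac\n":
    print(describe(ch))
```

For the euro sign, for instance, this sketch yields the codepoint 8364, a length of 3 bytes, the category Sc and the name EURO SIGN.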

ucenum

The ucenum tool takes a category abbreviation as argument and outputs a list of all characters belonging to that category. The categories are:

L
    Letter
Lu
    Letter, Uppercase
Ll
    Letter, Lowercase
Lt
    Letter, Titlecase
Lm
    Letter, Modifier
Lo
    Letter, Other
M
    Mark
Mn
    Mark, Nonspacing
Mc
    Mark, Spacing Combining
Me
    Mark, Enclosing
N
    Number
Nd
    Number, Decimal Digit
Nl
    Number, Letter
No
    Number, Other
P
    Punctuation
Pc
    Punctuation, Connector
Pd
    Punctuation, Dash
Ps
    Punctuation, Open
Pe
    Punctuation, Close
Pi
    Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
Pf
    Punctuation, Final quote (may behave like Ps or Pe depending on usage)
Po
    Punctuation, Other
S
    Symbol
Sm
    Symbol, Math
Sc
    Symbol, Currency
Sk
    Symbol, Modifier
So
    Symbol, Other
Z
    Separator
Zs
    Separator, Space
Zl
    Separator, Line
Zp
    Separator, Paragraph
C
    Other
Cc
    Other, Control
Cf
    Other, Format
Cs
    Other, Surrogate
Co
    Other, Private Use
Cn
    Other, Not Assigned
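
As a rough illustration of how such an enumeration can be done, the sketch below walks over every codepoint and keeps those whose category matches. It is not the ucenum implementation: the function name chars_in_category is hypothetical, and treating a one-letter abbreviation such as L as a prefix that matches Lu, Ll, Lt, Lm and Lo is an assumption.

```python
import sys
import unicodedata


def chars_in_category(abbrev: str):
    """Yield every character whose general category starts with abbrev.

    A two-letter abbreviation (e.g. "Lt") matches exactly one category;
    a one-letter abbreviation (e.g. "L") matches the whole major class.
    """
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch).startswith(abbrev):
            yield ch


# The "Separator, Line" category contains a single character.
print([f"U+{ord(c):04X}" for c in chars_in_category("Zl")])  # ['U+2028']

# Larger categories depend on the Unicode version bundled with Python.
titlecase = list(chars_in_category("Lt"))
print(len(titlecase))
```

Scanning the full codepoint range is simple but not fast; for a one-off command line call that is usually acceptable.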

ucnorm

This program reads UTF-8 text from stdin and writes it to stdout after applying the specified normalization algorithm.

The Unicode standard defines various normalization forms of a Unicode string, based on the definitions of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in more than one way. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

Even if two Unicode strings look the same to a human reader, they may not compare equal if one uses combining characters and the other does not.
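
The C WITH CEDILLA example can be checked directly with unicodedata.normalize from Python's standard library. The variable names below are chosen only for this illustration.

```python
import unicodedata

precomposed = "\u00C7"   # LATIN CAPITAL LETTER C WITH CEDILLA
combining = "C\u0327"    # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# The two strings render identically but hold different codepoints.
print(precomposed == combining)  # False

# After normalizing both to the same form, they compare equal.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True
```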

For each character, there are two normal forms:

Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence:

Normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents.
Normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.

Compatibility decomposition ensures that equivalent characters will compare equal (i.e. have the same codepoints). In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).
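
To make the difference between the four forms concrete, the sketch below normalizes a two-character string made of the examples used in this section and prints the resulting codepoints. The helper codepoints is hypothetical and exists only to make the output readable.

```python
import unicodedata

# C WITH CEDILLA followed by ROMAN NUMERAL ONE
text = "\u00C7\u2160"


def codepoints(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX codepoints."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, codepoints(unicodedata.normalize(form, text)))

# NFC  U+00C7 U+2160          (composed, numeral untouched)
# NFD  U+0043 U+0327 U+2160   (decomposed, numeral untouched)
# NFKC U+00C7 U+0049          (composed, numeral replaced by I)
# NFKD U+0043 U+0327 U+0049   (decomposed, numeral replaced by I)
```

Only the two compatibility forms replace the Roman numeral with the plain letter, which is why they should be used with care when the distinction matters.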

This program uses the normalization algorithms implemented in Python's standard library. See: https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize
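
A filter of this kind can be sketched in a few lines around unicodedata.normalize. This is an illustration under assumptions, not the ucnorm source: the function normalize_stream and its arguments are hypothetical, and the way the real tool receives the form name on the command line is not shown here.

```python
import io
import unicodedata

FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def normalize_stream(form: str, infile, outfile) -> None:
    """Read all text from infile, normalize it, and write it to outfile."""
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form!r}")
    outfile.write(unicodedata.normalize(form, infile.read()))


# In-memory streams stand in for stdin and stdout in this demo.
source = io.StringIO("C\u0327a va\n")
sink = io.StringIO()
normalize_stream("NFC", source, sink)
print(sink.getvalue() == "\u00C7a va\n")  # True
```

A command line wrapper would pass sys.stdin and sys.stdout instead of the in-memory streams, with both opened as UTF-8.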

Keywords

text unicode character
