unicategories
Unicode category database, generated and cached on setup.
This module exposes a categories dictionary of RangeGroup instances
containing all Unicode category character ranges detected on your system.
Example
from unicategories import categories
upperchars = categories['Lu'].characters()
print('Unicode uppercase characters are "%s"' % ''.join(upperchars))
RangeGroup
Immutable iterable (based on tuple, with some useful methods) of (start, end)
tuples which are, like Python's range, open at the end.
This representation was chosen for memory efficiency: storing every character
individually would take a lot of memory.
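To illustrate why half-open ranges are compact, here is a minimal sketch that rebuilds such ranges from the standard library's unicodedata module. The category_ranges helper is hypothetical and is not how unicategories generates its database; it only shows the (start, end) representation described above.

```python
import sys
import unicodedata


def category_ranges(category):
    """Collect half-open (start, end) code point ranges for one category.

    Hypothetical helper for illustration, not part of unicategories.
    """
    ranges = []
    start = None
    for code in range(sys.maxunicode + 1):
        if unicodedata.category(chr(code)) == category:
            if start is None:
                start = code  # a new run of matching code points begins
        elif start is not None:
            ranges.append((start, code))  # end is exclusive, like range()
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode + 1))
    return tuple(ranges)


nd_ranges = category_ranges('Nd')
total = sum(end - start for start, end in nd_ranges)
print(nd_ranges[0])  # (48, 58): the ASCII digits '0' to '9'
print(len(nd_ranges), 'ranges describe', total, 'characters')
```

Each run of consecutive code points costs one tuple regardless of its length, so a category with many contiguous characters needs far fewer entries than a per-character list.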
The RangeGroup class provides the following methods:
range_group.characters()
type: () -> typing.Iterator[str]
Get an iterator over all characters in this range group.
:returns: iterator of characters (str of size 1)
range_group.codes()
type: () -> typing.Iterator[int]
Get an iterator over all Unicode code points contained in this range group.
:returns: iterator of code points (int)
range_group.has(character)
type: (typing.Union[str, int]) -> bool
Check whether a character (or its code point) is part of this range group.
:param character: character or unicode code point to look for
:returns: True if character is contained by any range, False otherwise
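The behaviour of the three methods can be sketched with a small stand-in class. MiniRangeGroup below is hypothetical and only mirrors the documented signatures; the real RangeGroup may be implemented differently.

```python
class MiniRangeGroup(tuple):
    """Hypothetical stand-in: a tuple of half-open (start, end) ranges."""

    def codes(self):
        # Yield every code point of every range; end is exclusive.
        for start, end in self:
            yield from range(start, end)

    def characters(self):
        # Lazily convert code points to one-character strings.
        return map(chr, self.codes())

    def has(self, character):
        # Accept either a one-character string or an integer code point.
        code = character if isinstance(character, int) else ord(character)
        return any(start <= code < end for start, end in self)


# Two sample ranges: ASCII digits and Arabic-Indic digits.
digits = MiniRangeGroup([(0x30, 0x3A), (0x660, 0x66A)])
print(''.join(digits.characters()))
print(digits.has('7'), digits.has(0x3A))  # True False
```

Because codes() and characters() are lazy, iterating a large group never materializes every character at once.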
Unicode categories
Taken from Wikipedia.
Value | Category (major, minor) | Basic type | Character assigned | Fixed | Remarks |
--- | --- | --- | --- | --- | --- |
Lu | Letter, uppercase | Graphic | Character | | |
Ll | Letter, lowercase | Graphic | Character | | |
Lt | Letter, titlecase | Graphic | Character | | Ligatures containing uppercase followed by lowercase letters (e.g., Dž, Lj, Nj, and Dz) |
Lm | Letter, modifier | Graphic | Character | | |
Lo | Letter, other | Graphic | Character | | |
Mn | Mark, nonspacing | Graphic | Character | | |
Mc | Mark, spacing combining | Graphic | Character | | |
Me | Mark, enclosing | Graphic | Character | | |
Nd | Number, decimal digit | Graphic | Character | | All these, and only these, have Numeric Type = De |
Nl | Number, letter | Graphic | Character | | Numerals composed of letters or letterlike symbols (e.g., Roman numerals) |
No | Number, other | Graphic | Character | | E.g., vulgar fractions, superscript and subscript digits |
Pc | Punctuation, connector | Graphic | Character | | Includes "_" underscore |
Pd | Punctuation, dash | Graphic | Character | | Includes several hyphen characters |
Ps | Punctuation, open | Graphic | Character | | Opening bracket characters |
Pe | Punctuation, close | Graphic | Character | | Closing bracket characters |
Pi | Punctuation, initial quote | Graphic | Character | | Opening quotation mark. Does not include the ASCII "neutral" quotation mark. May behave like Ps or Pe depending on usage |
Pf | Punctuation, final quote | Graphic | Character | | Closing quotation mark. May behave like Ps or Pe depending on usage |
Po | Punctuation, other | Graphic | Character | | |
Sm | Symbol, math | Graphic | Character | | |
Sc | Symbol, currency | Graphic | Character | | |
Sk | Symbol, modifier | Graphic | Character | | |
So | Symbol, other | Graphic | Character | | |
Zs | Separator, space | Graphic | Character | | Includes the space, but not TAB, CR, or LF, which are Cc |
Zl | Separator, line | Format | Character | | Only U+2028 LINE SEPARATOR (LSEP) |
Zp | Separator, paragraph | Format | Character | | Only U+2029 PARAGRAPH SEPARATOR (PSEP) |
Cc | Other, control | Control | Character | Fixed 65 | No name, <control> |
Cf | Other, format | Format | Character | | Includes the soft hyphen, control characters to support bi-directional text, and language tag characters |
Cs | Other, surrogate | Surrogate | Not (but abstract) | Fixed 2,048 | No name, <surrogate> |
Co | Other, private use | Private-use | Not (but abstract) | Fixed 137,468 total: 6,400 in BMP, 131,068 in Planes 15–16 | No name, <private-use> |
Cn | Other, not assigned | Noncharacter | Not | Fixed 66 | No name, <noncharacter> |
Cn | Other, not assigned | Reserved | Not | Not fixed | No name, <reserved> |
In addition to that, unicategories provides the general categories L, M, N, P, S, Z and C.
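In the Unicode standard, a general category is the union of the subcategories sharing its first letter (for example, L covers Lu, Ll, Lt, Lm and Lo). A minimal sketch using the standard library's unicodedata module shows that relationship; the major_category helper is hypothetical, and whether unicategories derives its general categories this way is an assumption.

```python
import unicodedata


def major_category(character):
    """Return the one-letter general category (hypothetical helper)."""
    # unicodedata.category returns a two-letter code such as 'Lu';
    # its first letter names the general category.
    return unicodedata.category(character)[0]


for ch in ('A', 'a', '5', '_', ' '):
    print(repr(ch), unicodedata.category(ch), major_category(ch))
```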