pyuegc
A pure-Python implementation of the Unicode algorithm for breaking strings of text (i.e., code point sequences) into extended grapheme clusters (“user-perceived characters”) as specified in UAX #29, “Unicode Text Segmentation.” This package conforms to version 16.0 of the Unicode standard, released in September 2024, and has been tested against the official Unicode grapheme break test file (GraphemeBreakTest.txt).
Installation and updates
To install the package, run:
pip install pyuegc
To upgrade to the latest version, run:
pip install pyuegc --upgrade
Changelog
See the project changelog for the latest updates and changes.
Unicode character database (UCD) version
To retrieve the version of the Unicode character database in use:
>>> from pyuegc import UCD_VERSION
>>> UCD_VERSION
'16.0.0'
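If grapheme clusters are combined with other Unicode-aware processing, it can help to check that both sides use the same Unicode release. The sketch below compares a version string such as `UCD_VERSION` with the version of the interpreter's own `unicodedata` module. The helper `same_unicode_release` is a hypothetical name introduced for illustration and is not part of pyuegc.

```python
import unicodedata


def same_unicode_release(version_a: str, version_b: str) -> bool:
    """Return True if both strings name the same major.minor Unicode release.

    Hypothetical helper for illustration only; not part of pyuegc.
    """
    return version_a.split(".")[:2] == version_b.split(".")[:2]


# Unicode version of the character database bundled with this interpreter.
print(unicodedata.unidata_version)

# "16.0.0" stands in for pyuegc.UCD_VERSION here.
print(same_unicode_release("16.0.0", unicodedata.unidata_version))
```

With the package installed, pass `UCD_VERSION` as the first argument instead of the literal string.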
Example usage
from pyuegc import EGC
def _output(unistr, egc):
    return f"""\
# String: {unistr}
# Length of string: {len(unistr)}
# EGC: {egc}
# Length of EGC: {len(egc)}
"""
unistr = "Python"
egc = EGC(unistr)
print(_output(unistr, egc))
unistr = "e\u0301le\u0300ve"
egc = EGC(unistr)
print(_output(unistr, egc))
unistr = "Z̷̳̎a̸̛ͅl̷̻̇g̵͉̉o̸̰͒"
egc = EGC(unistr)
print(_output(unistr, egc))
unistr = "기운찰만하다"
egc = EGC(unistr)
print(_output(unistr, egc))
unistr = "পৌষসংক্রান্তির"
egc = EGC(unistr)
print(_output(unistr, egc))
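Because `EGC` returns a sequence of clusters, it can also be used to cut text to a number of user-perceived characters rather than code points. The following is a minimal sketch under that assumption. `truncate_clusters` and its `segment` parameter are hypothetical names introduced here, not part of pyuegc; the segmenter is passed in as an argument so that `EGC` can be supplied by the caller.

```python
from typing import Callable, Sequence


def truncate_clusters(
    text: str, limit: int, segment: Callable[[str], Sequence[str]]
) -> str:
    """Keep at most `limit` clusters of `text`, as split by `segment`.

    Hypothetical helper for illustration only; not part of pyuegc.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return "".join(segment(text)[:limit])


text = "e\u0301le\u0300ve"

# Plain slicing counts code points: the first two code points are the
# letter "e" and its combining acute accent, i.e. one visible character.
print(repr(text[:2]))

# With pyuegc installed:
#     from pyuegc import EGC
#     truncate_clusters(text, 2, EGC)
```

Called with `EGC` as the segmenter, the function keeps the accented letter together with the following `l`, whereas the plain slice stops after the accent.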
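Reversing a string code point by code point can detach combining marks from their base letters and attach them to the wrong ones. Reversing the list of extended grapheme clusters returned by EGC keeps each mark with its base, so the visual appearance of the characters is preserved regardless of the Unicode normalization form: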
unistr = "ai\u0302ne\u0301e"
print(f"# Reversed string: {''.join(reversed(unistr))!r}")
print(f"# EGC processed and reversed: {''.join(reversed(EGC(unistr)))!r}")
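To see why cluster-wise reversal keeps the accents in place, the sketch below uses a deliberately simplified grouping built on the standard library's `unicodedata.combining`. It is not the UAX #29 algorithm and not how pyuegc works internally: it only attaches combining marks to the preceding code point and ignores Hangul, emoji sequences, and the other rules. `naive_clusters` is a hypothetical name used for illustration.

```python
import unicodedata


def naive_clusters(text: str) -> list[str]:
    """Group each code point with the combining marks that follow it.

    Simplified illustration only; NOT the UAX #29 algorithm.
    """
    clusters: list[str] = []
    for ch in text:
        # A non-zero combining class marks a combining character.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


unistr = "ai\u0302ne\u0301e"

# Code-point reversal: the marks now follow different base letters.
by_code_point = "".join(reversed(unistr))
print(f"# Reversed by code point: {by_code_point!r}")

# Cluster reversal: each mark stays with its original base letter.
by_cluster = "".join(reversed(naive_clusters(unistr)))
print(f"# Reversed by cluster:    {by_cluster!r}")
```

In the code-point reversal the circumflex ends up after `n` and the acute after the final `e` of the original string, while in the cluster reversal both marks stay on the letters they started on.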
Related resources
This implementation is based on the following resources:
Licenses
The code is licensed under the MIT license.
Usage of Unicode data files is governed by the UNICODE TERMS OF USE. Further specifications of rights and restrictions pertaining to the use of the Unicode data files and software can be found in the Unicode Data Files and Software License, a copy of which is included as UNICODE-LICENSE.