best-effort representations using smaller coded character sets (ASCII,
ISO 8859, etc.). The translation tables used by the codecs are from
the transtab collection by Markus Kuhn.

Three types of transliterating codecs are provided:

- "long", using as many characters as needed to make a natural
  replacement. For example, \u00e4 LATIN SMALL LETTER A WITH
  DIAERESIS ä will be replaced with ae.
- "short", using the minimum number of characters to make a
  replacement. For example, \u00e4 LATIN SMALL LETTER A WITH
  DIAERESIS ä will be replaced with a.
- "one", only performing single character replacements. Characters
  that cannot be transliterated with a single character are passed
  through unchanged. For example, \u2639 WHITE FROWNING FACE ☹
  will be passed through unchanged.

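The difference between the three variants can be sketched without the library. The tables below are tiny hypothetical stand-ins, not the transtab data that translitcodec ships, and the helper name ``transliterate_with`` is made up for illustration. Deriving the "one" table by filtering the "short" table is an assumption about the behaviour described above, not a statement about the library's internals.

```python
# Toy tables for illustration only; the real codecs use the much larger
# transtab data.
LONG_TABLE = {"ä": "ae", "€": "EUR", "☺": ":-)"}
SHORT_TABLE = {"ä": "a", "€": "E", "☺": ":-)"}
# Assumption: "one" keeps only replacements that are a single character,
# so anything else is passed through unchanged.
ONE_TABLE = {k: v for k, v in SHORT_TABLE.items() if len(v) == 1}


def transliterate_with(text, table):
    """Replace each character found in `table`; leave the rest as-is."""
    return text.translate(str.maketrans(table))


sample = "fäcil € ☺"
long_result = transliterate_with(sample, LONG_TABLE)
short_result = transliterate_with(sample, SHORT_TABLE)
one_result = transliterate_with(sample, ONE_TABLE)

print(long_result)   # faecil EUR :-)
print(short_result)  # facil E :-)
print(one_result)    # facil E ☺
```

In this sketch the smiley survives the "one" pass because its only replacement is three characters long, which mirrors the pass-through rule described for the "one" codec.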
Using the codecs is simple::

  >>> import translitcodec
  >>> import codecs
  >>> codecs.encode('fácil € ☺', 'translit/long')
  'facil EUR :-)'
  >>> codecs.encode('fácil € ☺', 'translit/short')
  'facil E :-)'

The codecs return Unicode by default. To receive a bytestring back,
either chain the output of encode() to another codec, or append the
name of the desired byte encoding to the codec name::

  >>> codecs.encode('fácil € ☺', 'translit/one').encode('ascii', 'replace')
  b'facil E ?'
  >>> 'fácil € ☺'.encode('translit/one/ascii', 'replace')
  b'facil E ?'

The package also supplies a 'transliterate' codec, an alias for
'translit/long'.

Another way to use the library is through an error handler. The
following error handlers are available:

- 'strict/translit/long', 'strict/translit/short', 'strict/translit/one' - similar to 'strict'
- 'ignore/translit/long', 'ignore/translit/short', 'ignore/translit/one' - similar to 'ignore'
- 'replace/translit/long', 'replace/translit/short', 'replace/translit/one' - similar to 'replace'

These error handlers work like Python's built-in ones, except that
transliteration is attempted first. The built-in behaviour (raise, drop,
or substitute '?') then applies only to characters that could not be
transliterated.
For example::

  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'replace/translit/long').decode('ISO-8859-2')
  'Zażółć gęślą jaźń EUR :-)?!@#'
  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'replace/translit/short').decode('ISO-8859-2')
  'Zażółć gęślą jaźń E :-)?!@#'
  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'replace/translit/one').decode('ISO-8859-2')
  'Zażółć gęślą jaźń E ??!@#'
  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'ignore/translit/long').decode('ISO-8859-2')
  'Zażółć gęślą jaźń EUR :-)!@#'
  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'ignore/translit/short').decode('ISO-8859-2')
  'Zażółć gęślą jaźń E :-)!@#'
  >>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'ignore/translit/one').decode('ISO-8859-2')
  'Zażółć gęślą jaźń E !@#'

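The "transliterate first, then fall back" idea can be approximated with the standard library's ``codecs.register_error``. What follows is only a minimal sketch under stated assumptions: the handler name ``replace/translit/toy``, the function ``replace_translit_toy`` and the two-entry table are all hypothetical, and this is not how translitcodec itself is implemented.

```python
import codecs

# Hypothetical two-entry table; the library relies on the transtab data.
TOY_TABLE = {"€": "EUR", "☺": ":-)"}


def replace_translit_toy(error):
    """Transliterate the unencodable span, falling back to '?'."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # The span may cover several consecutive unencodable characters.
    bad = error.object[error.start:error.end]
    replacement = "".join(TOY_TABLE.get(ch, "?") for ch in bad)
    # Return the replacement text and the index at which to resume.
    return replacement, error.end


codecs.register_error("replace/translit/toy", replace_translit_toy)

encoded = "Zażółć € ☺另".encode("iso-8859-2", "replace/translit/toy")
decoded = encoded.decode("iso-8859-2")
print(decoded)  # Zażółć EUR :-)?
```

The Polish letters are encoded directly because ISO 8859-2 contains them. The handler is consulted only for the euro sign, the smiley and the CJK character, and only the last one, which has no entry in the toy table, ends up as ``?``.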
translitcodec Changes
0.7.0
Released on May 8, 2021
- Added support for error handlers
- Fixed conversion of the German eszett char
0.6.0
Released on December 13, 2020
- Add support for Python 3.9
0.5.2
Released on January 19, 2020
- Install package with setuptools
0.5.1
Released on January 19, 2020
- Add python_requires to prevent installation on Python 2
0.5
Released on January 18, 2020
0.4
Released on May 11, 2015
- Added Python 3 compatibility
0.3
Released on February 14, 2011
0.2
Released on January 27, 2011
- Resolves issue of "TypeError: character mapping must return integer,
  None or unicode" when a blank value (e.g. \N{ZERO WIDTH SPACE} \u200B)
  was encoded. Unicode blanks are now returned.
- Characters in the ASCII range are no longer included in the translation
  tables.
0.1
Released on December 28, 2008
- Initial packaged release.