
uroman is a universal romanizer. It converts text in any script to the standard Latin alphabet.
Example (Greek): Νεπάλ → Nepal
Example (Hindi): नेपाल → nepaal
Example (Urdu): نیپال → nypal
Example (Chinese): 三万一 → 31000
New Python version: 1.3.1 (released on June 27, 2024)
Last Perl version: 1.2.8 (released on April 23, 2021)
Author: Ulf Hermjakob, USC Information Sciences Institute
python3 -m pip install uroman
python3 -m uroman "Игорь Стравинский"
python3 -m uroman Игорь -l ukr
python3 -m uroman Ντέιβις Καπ -l ell
python3 -m uroman "\u03C0\u03B9" -d
python3 -m uroman -l hin -i mini-test/hin.txt
python3 -m uroman -l fas -i mini-test/fas.txt -o mini-test/fas-rom.jsonl -f edges
python3 -m uroman < mini-test/multi-script.txt > mini-test/multi-script.uroman.txt
python3 -m uroman -h
Note: Using the uroman CLI for single strings can be useful for simple tests,
but it is inefficient at scale because data resources are loaded every time. It is more efficient to romanize entire files or to use uroman inside Python as shown further below.
Note: The mini-test directory is included in this release. Use the command python3 -m uroman x --verbose to find it.
You can compare your output mini-test/multi-script.uroman.txt with reference output mini-test/multi-script.uroman-ref.txt
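The comparison against the reference output can be sketched with the Python standard library alone. The helper name `diff_romanizations` is an assumption made for illustration and is not part of uroman; it returns an empty list when the two files match.

```python
import difflib
from pathlib import Path


def diff_romanizations(output_path, reference_path):
    """Return unified-diff lines between two UTF-8 text files.

    An empty list means the files have identical lines.
    Hypothetical helper for illustration; not part of the uroman package.
    """
    output_lines = Path(output_path).read_text(encoding="utf-8").splitlines()
    reference_lines = Path(reference_path).read_text(encoding="utf-8").splitlines()
    # Lines only in the reference are prefixed '-', lines only in the output '+'.
    return list(difflib.unified_diff(reference_lines, output_lines,
                                     fromfile=str(reference_path),
                                     tofile=str(output_path),
                                     lineterm=""))
```

For example, `diff_romanizations('mini-test/multi-script.uroman.txt', 'mini-test/multi-script.uroman-ref.txt')` would list any lines where your output differs from the reference.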
Direct inputs (zero or more) | such as ‘Игорь Стравинский’ and ‘Ντέιβις’ above. |
-l --lcode | language code according to ISO-639-3, e.g. -l ukr for Ukrainian, -l hin for Hindi, -l fas for Persian |
-i --input_filename | alternative: stdin. Note: If both direct inputs and input_filename are given, the romanization results for direct inputs will be written to stderr. |
-o --output_filename | alternative: stdout |
-f --rom_format | Output format choices:
|
-d --decode_unicode | Decode Unicode escape sequences such as ‘\u03C0\u03B9’ to ‘πι’ which in turn will be romanized to ‘pi’. This is useful for input formats such as JSON. |
-h --help | Use this option to see the full argument structure with all options. |
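To make the effect of -d concrete, here is a minimal sketch of what decoding `\uXXXX` escape sequences means. It is an illustration only, not uroman's own implementation; the function name `decode_unicode_escapes` is hypothetical, and the sketch handles only four-digit `\u` escapes (no surrogate pairs).

```python
import re

# Matches a literal backslash, 'u', and exactly four hexadecimal digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_unicode_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it denotes.

    Illustration only; uroman's -d option may behave differently in edge cases.
    """
    return _ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


print(decode_unicode_escapes(r"\u03C0\u03B9"))  # prints: πι
```

The decoded string is what then gets romanized, which is why input with such escapes (for example from JSON) needs this step first.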
import uroman as ur
uroman = ur.Uroman() # load uroman data (takes about a second or so)
print(uroman.romanize_string('Игорь Стравинский'))
print(uroman.romanize_string('Игорь', lcode='ukr'))
uroman.romanize_file(input_filename='mini-test/multi-script.txt',
output_filename='mini-test/multi-script.uroman.jsonl',
rom_format=ur.RomFormat.LATTICE)
uroman = ur.Uroman(data_dir)
This constructor method loads data needed for the romanization of different languages. This constructor call might take about a second (real time) to load all of the romanization data, but it is necessary only once for multiple subsequent romanization calls.
data_dir | data directory (optional, default: standard uroman data directory) |
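Because the constructor cost is paid only once, one way to structure an application is to create the romanizer lazily and reuse it for every call. The sketch below is an assumption about how you might organize this; `LazyRomanizer` is a hypothetical wrapper, not part of uroman. It relies only on the `romanize_string` method described in this document.

```python
class LazyRomanizer:
    """Create the underlying romanizer on first use, then reuse it.

    Hypothetical wrapper for illustration. `factory` is any zero-argument
    callable that returns an object with a romanize_string method,
    for example ur.Uroman after `import uroman as ur`.
    """

    def __init__(self, factory):
        self._factory = factory
        self._instance = None

    def romanize(self, text, **kwargs):
        if self._instance is None:
            # The data load happens here, once.
            self._instance = self._factory()
        return self._instance.romanize_string(text, **kwargs)


# Usage, assuming the uroman package is installed:
#   import uroman as ur
#   lazy = LazyRomanizer(ur.Uroman)
#   print(lazy.romanize('Игорь Стравинский'))
#   print(lazy.romanize('Игорь', lcode='ukr'))
```

Programs that never romanize anything then skip the load entirely, and programs that romanize many strings pay for it once.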
uroman.romanize_string(s, lcode, rom_format)
This method takes a string s and returns its romanization in the format selected by rom_format: a string (default), or a list of edges.
s | string to be romanized, e.g. "ایران" |
lcode | language code, optional, a 3-letter code such as 'eng' for English (ISO-639-3) |
rom_format | Output format choices:
|
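When a corpus mixes languages, each string can be passed with its own language code. The following is a minimal sketch under that assumption; `romanize_tagged` is a hypothetical helper, and it uses only the `romanize_string(s, lcode=...)` call form shown in this document.

```python
def romanize_tagged(romanizer, tagged_texts):
    """Romanize (text, lcode) pairs with a single romanizer object.

    Hypothetical helper for illustration. lcode is an ISO 639-3 code
    such as 'ukr', or None when the language is unknown.
    """
    results = []
    for text, lcode in tagged_texts:
        if lcode is None:
            results.append(romanizer.romanize_string(text))
        else:
            results.append(romanizer.romanize_string(text, lcode=lcode))
    return results


# Usage, assuming the uroman package is installed:
#   import uroman as ur
#   romanizer = ur.Uroman()
#   pairs = [('Игорь', 'ukr'), ('Игорь Стравинский', None)]
#   print(romanize_tagged(romanizer, pairs))
```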
uroman.romanize_file(input_filename, output_filename, lcode)
This method romanizes a file input_filename to output_filename.
input_filename | default: stdin (for input_filename value of None) |
output_filename | default: stdout (for output_filename value of None) |
lcode | language code (optional), a 3-letter code such as 'eng' for English (ISO-639-3) |
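As a sketch of how romanize_file might be applied to several files, the helper below romanizes every matching file in a directory. The name `romanize_directory`, the `*.txt` pattern, and the `.uroman.txt` output suffix are assumptions chosen for illustration; only the `input_filename`, `output_filename` and `lcode` arguments come from the description above.

```python
from pathlib import Path


def romanize_directory(romanizer, src_dir, dst_dir, lcode=None, pattern="*.txt"):
    """Romanize each file in src_dir matching pattern into dst_dir.

    Hypothetical helper for illustration. Returns the list of output paths.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob(pattern)):
        out = dst / (src.stem + ".uroman.txt")
        kwargs = {"input_filename": str(src), "output_filename": str(out)}
        if lcode is not None:
            kwargs["lcode"] = lcode
        romanizer.romanize_file(**kwargs)
        written.append(out)
    return written


# Usage, assuming the uroman package is installed:
#   import uroman as ur
#   romanize_directory(ur.Uroman(), 'mini-test', 'mini-test-rom', lcode='hin', pattern='hin.txt')
```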
The old Perl version is included on GitHub, but not on PyPI.
$ uroman.pl [-l <lang-code>] [--chart] [--no-cache] < STDIN
where the optional <lang-code> is a 3-letter language code, e.g. ara, bel, bul, deu, ell, eng, fas,
grc, heb, kaz, kir, lav, lit, mkd, mkd2, oss, pnt, pus, rus, srp, srp2, tur, uig, ukr, yid.
--chart specifies chart output (in JSON format) to represent alternative romanizations.
--no-cache disables caching.
Note: Directories text and test are under uroman's root directory on GitHub.
uroman.pl < text/zho.txt
uroman.pl -l tur < text/tur.txt
uroman.pl -l heb --chart < text/heb.txt
uroman.pl < test/multi-script.txt > test/multi-script.uroman-perl.txt
Identifying the input as Arabic, Belarusian, Bulgarian, English, German, Ancient Greek, Modern Greek, Pontic Greek, Hebrew, Kazakh, Kyrgyz, Latvian, Lithuanian, Macedonian, Ossetian, Persian, Russian, Serbian, Turkish, Ukrainian, Uyghur or Yiddish will improve romanization for those languages, as some letters in those languages have different sound values from other languages using the same script (Arabic vs. Persian, Russian vs. Ukrainian, Hebrew vs. Yiddish). The language code has no effect for other languages in this version.
Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of Association for Computational Linguistics, Demo Track. ACL-2018 Best Demo Paper Award. Paper in ACL Anthology | Poster | BibTex
Changes in version 1.3.0
Changes in version 1.2.8
Changes in version 1.2.7
Changes in version 1.2.6
Changes in version 1.2.5
Changes in version 1.2.4
Changes in version 1.2
Changes in version 1.1 (major upgrade)
Changes in version 1.0 (major upgrade)
Changes in version 0.7 (minor upgrade)
Changes in version 0.6 (minor upgrade)
Changes in version 0.5 (minor upgrade)
Changes in version 0.4 (minor upgrade)
New features in version 0.3
Earlier versions of this tool were based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116, and by research sponsored by Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, Air Force Laboratory, DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.