Hyperglot – a database and tools for detecting language support in fonts
Hyperglot is an open research project dedicated to documenting how the world’s languages are written. By mapping orthographies and their requirements, it supports inclusive, multilingual type design and equitable access to high-quality typography for underserved communities. Hyperglot currently covers 783 languages, representing approximately 7.3 billion speakers, and is developed as open source by Rosetta Type/Research in collaboration with a global community of contributors and licensed under the Apache 2.0 license.
Hyperglot is available as:
📖 Learn more about Hyperglot
🙋 Read the FAQ
💰 Sponsor via GitHub or directly via Hyperglot sponsorship. Any and all contributions are much appreciated! 🙏
Data validity & contributing
Hyperglot is a work in progress and provided AS IS. The validity of language data varies and continues to improve. Each language includes a validity label (todo, draft, preliminary, verified) to help you assess the data.
Mapping all the world’s languages is a huge task, and we need help from native speakers and language users! If you notice an error or see that a language is missing, please get in touch (via email or Issues). We welcome contributions and will credit your input.
The data structure is documented in a separate README file along with guidelines for contributing.
Core concepts
The following concepts are essential to understanding how Hyperglot works.
A language can be written in one or more scripts. Each such writing system is represented in Hyperglot as an orthography. Most languages have a single primary orthography; however, some use multiple orthographies either independently (for example, in different regions) or concurrently (such as Serbian or Japanese).
In the database, an orthography contains the following character sets:
base – the required, essential characters,
aux – non-essential, recommended characters,
marks – combining marks,
punctuation,
numerals, and
currency.
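As a rough illustration, the character sets of one orthography can be pictured as a small record. The sketch below is a hypothetical Python model, not Hyperglot's actual data format or API (the real database structure is documented separately). The class name `Orthography`, the `required` helper, and the sample characters are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Orthography:
    """Hypothetical in-memory model of one orthography's character sets."""
    script: str
    base: set = field(default_factory=set)         # required, essential characters
    aux: set = field(default_factory=set)          # non-essential, recommended characters
    marks: set = field(default_factory=set)        # combining marks
    punctuation: set = field(default_factory=set)
    numerals: set = field(default_factory=set)
    currency: set = field(default_factory=set)

    def required(self, check=("base",)):
        """Return the union of the character sets selected for checking."""
        chars = set()
        for name in check:
            chars |= getattr(self, name)
        return chars


# Illustrative sample only, not taken from the Hyperglot database.
sample = Orthography(
    script="Latin",
    base=set("abcdefghijklmnopqrstuvwxyzáéíóú"),
    aux=set("àèñ"),
    punctuation=set(".,;:!?"),
    numerals=set("0123456789"),
)

# Checking against "base" only versus "base" plus "aux".
base_only = sample.required()
base_and_aux = sample.required(("base", "aux"))
```

Keeping each set separate makes it straightforward to decide later which sets a font must cover.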
A script, however, is more than a collection of characters. It also defines how characters interact when combined. This behavior is known as shaping and, in digital fonts, is implemented using OpenType features.
Read the detailed description of the database structure
Language support detection process
To detect language support in a font, Hyperglot performs the following checks:
- Required characters are present. Which characters are considered required is specified by filtering based on language/orthography status, data validity, and by selecting which character sets to check against.
- Precomposed character combinations are handled by the font. For character combinations that have a unique code point in Unicode, one of the following (depending on the setting):
- The encoded, precomposed character combinations are present.
- Base characters and mark characters from these combinations are present independently.
- Both of the above.
- Shaping behavior is correctly handled by the font, where applicable:
- Required mark-positioning instructions are present.
- Required alternates for joining behavior (for example, in Arabic) are present.
- Conjunct syllable construction in Brahmi-derived scripts is supported. (Currently supported only for Hindi/Devanagari.)
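The character-level part of these checks can be sketched with Python's standard `unicodedata` module. This is a minimal illustration under stated assumptions, not Hyperglot's implementation: the function names `decompose` and `supports`, the `mode` values, and the sample font coverage are hypothetical, and shaping checks (mark positioning, joining, conjuncts) are not modeled because they require inspecting a font's OpenType features.

```python
import unicodedata


def decompose(char):
    """Return the canonical decomposition (NFD) of a character as a list."""
    return list(unicodedata.normalize("NFD", char))


def supports(required, cmap, mode="precomposed"):
    """Check required characters against a font's set of encoded characters.

    mode is one of (hypothetical names):
      "precomposed" - every required character must be encoded as-is
      "decomposed"  - base and mark components must be encoded independently
      "both"        - both of the above must hold
    """
    if mode not in ("precomposed", "decomposed", "both"):
        raise ValueError(f"unknown mode: {mode}")
    precomposed_ok = all(c in cmap for c in required)
    decomposed_ok = all(p in cmap for c in required for p in decompose(c))
    if mode == "precomposed":
        return precomposed_ok
    if mode == "decomposed":
        return decomposed_ok
    return precomposed_ok and decomposed_ok


# Hypothetical font that encodes "a", "e", and the combining acute (U+0301),
# but not the precomposed "á" (U+00E1).
cmap = {"a", "e", "\u0301"}
required = {"a", "\u00e1"}

print(supports(required, cmap, "precomposed"))  # False
print(supports(required, cmap, "decomposed"))   # True
```

In this sample the font fails the precomposed check but passes the decomposed one, which is the situation the setting described above is meant to distinguish.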
Additional design-related notes are provided for the user’s discretion when assessing design quality. Hyperglot does not assess the font design in any way.
Command-line tools
Installation
You will need to have Python 3 installed. Install via pip:
pip install hyperglot
Besides the main hyperglot command used for font inspection, the package also includes:
hyperglot-report – explore missing language support (see below).
hyperglot-data – review language data stored in the database.
hyperglot-validate, hyperglot-save, and hyperglot-export – manage and process data when contributing.
Basic usage
Use:
hyperglot path/to/font.otf
to output a list of supported languages (and other data) for a font. Use:
hyperglot path/to/font.otf path/to/anotherfont.otf …
to check several fonts at once, or their combined coverage (with -m union).
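One plausible way to picture combined coverage is at the character level: merge the characters encoded by each font, then check a language's requirements against the merged set. The sketch below assumes that reading. The function name `combined_characters`, its `mode` values, and the sample coverage are hypothetical and do not describe how Hyperglot combines fonts internally.

```python
def combined_characters(fonts, mode="union"):
    """Combine the encoded characters of several fonts.

    fonts: dict mapping a font file name to the set of characters it encodes
    mode:  "union" (encoded in at least one font) or
           "intersection" (encoded in every font)
    """
    charsets = list(fonts.values())
    if not charsets:
        return set()
    if mode == "union":
        return set().union(*charsets)
    if mode == "intersection":
        return set.intersection(*charsets)
    raise ValueError(f"unknown mode: {mode}")


# Made-up coverage for two fonts.
fonts = {
    "font.otf": set("abc"),
    "anotherfont.otf": set("cde"),
}
required = set("ace")

print(required <= combined_characters(fonts, "union"))         # True
print(required <= combined_characters(fonts, "intersection"))  # False
```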
Advanced options
-c, --check: Specify which character sets to check against. Options are 'base, auxiliary, punctuation, numerals, currency, all', or a comma-separated combination of these. (Default: 'base')
--validity: Filter languages by data validity level. Options are 'todo, draft, preliminary, verified'. (Default: 'preliminary')
-s, --status: Specify which languages to consider when checking support. Options are 'living, historical, constructed, all', or a comma-separated combination of these. (Default: 'living,constructed')
-o, --orthography: Which orthographies to consider when checking support for a language. Options are 'primary, secondary, historical, transliteration, all', or a comma-separated combination of these. (Default: 'primary')
-d, --decomposed: For precomposed character combinations, require only the individual component characters. By default, precomposed character combinations are also required when they have a unique code point in Unicode. (Default: False)
-m, --marks: Require that a font include all combining marks used by a language’s orthography. By default, only marks that are not part of precomposed character combinations are required. (Default: False)
--sort: Specify the sort order. Use "speakers" to sort by number of speakers. (Default: "alphabetic")
--sort-dir: Specify the sort direction. Use "desc" for descending order. (Default: "asc" for ascending order)
-y, --output: Specify a file path to write the output to, in YAML format. For a single input font, the output is a subset of the Hyperglot database containing the languages and orthographies supported by the font. When multiple fonts are provided, the YAML file contains a top-level key for each font. If the -m option is provided, the output includes the specific intersection or union result.
-t, --shaping-threshold: Set the frequency threshold for complex-script shaping checks. A font passes when it renders correctly for combinations at or above this threshold. Frequencies range from 1.0 (most frequent combinations) to 0.0 (rarest combinations). (Default: 0.01)
--no-shaping: Disable shaping checks (mark attachment, joining behavior, and conjunct shaping). (Default: shaping checks enabled)
-v, --verbose: Enable verbose logging.
-V, --version: Print the Hyperglot version number.
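The effect of the shaping threshold can be illustrated with a small filter: only combinations whose frequency is at or above the threshold are tested. This is a sketch of the idea as described above, not Hyperglot's code. The function name `passes_shaping`, the placeholder sequence names, and the frequencies are made up for illustration.

```python
def passes_shaping(combinations, renders_ok, threshold=0.01):
    """Return True if every combination at or above the threshold renders correctly.

    combinations: dict mapping a character sequence to its relative frequency,
                  from 1.0 (most frequent) down to 0.0 (rarest)
    renders_ok:   set of sequences the font is assumed to shape correctly
    """
    checked = [seq for seq, freq in combinations.items() if freq >= threshold]
    return all(seq in renders_ok for seq in checked)


# Placeholder sequences with made-up frequencies.
combinations = {"seq_common": 0.4, "seq_medium": 0.2, "seq_rare": 0.001}
renders_ok = {"seq_common", "seq_medium"}

print(passes_shaping(combinations, renders_ok))               # True
print(passes_shaping(combinations, renders_ok, threshold=0))  # False
```

Lowering the threshold makes the check stricter, because rarer combinations are included.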
Explore missing language support
The hyperglot-report command reports missing characters and shaping support. A common use case is identifying languages that could be supported with minimal additional work in a given font. The command accepts the same options as hyperglot, plus the following:
--report-missing: Report languages missing n or fewer characters. If n is 0, all languages with any number of missing characters are reported. (Default: 0)
--report-marks: Report languages missing n or fewer mark-attachment sequences. If n is 0, all languages with any number of missing mark-attachment sequences are reported. (Default: 0)
--report-joining: Report languages missing n or fewer joining sequences. If n is 0, all languages with any number of missing joining sequences are reported. (Default: 0)
--report-all: Set or override all other --report-* options.
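The "n or fewer" rule used by these options can be sketched as a simple filter. The example below is hypothetical and does not call Hyperglot: the function name `report_missing`, the placeholder language names, and the sample data are assumptions. It also assumes that languages with nothing missing are left out of the report.

```python
def report_missing(missing_by_language, n=0):
    """Select languages missing n or fewer characters.

    missing_by_language: dict mapping a language name to its set of missing
    characters. Languages with nothing missing are skipped (an assumption).
    With n == 0, every language with at least one missing character is kept.
    """
    report = {}
    for language, missing in missing_by_language.items():
        if not missing:
            continue
        if n == 0 or len(missing) <= n:
            report[language] = sorted(missing)
    return report


# Made-up results for three placeholder languages.
sample = {
    "Language A": {"ñ"},
    "Language B": {"ă", "ș", "ț"},
    "Language C": set(),
}

near_misses = report_missing(sample, n=2)  # only "Language A"
everything = report_missing(sample)        # "Language A" and "Language B"
```

A small n surfaces the languages that are closest to being supported, which matches the use case described above.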
Roadmap
Other
A comparison of Hyperglot and the Unicode CLDR (this may be outdated).
Notes
- Fonts included in the repository for testing purposes are licensed under their respective licenses.
- Data included in the other directory is replicated from various public domain and open source origins for comparison and aggregation (mostly present in historic commits of this repository).