hyperglot

Detect language support for font binaries

PyPI (pip) | Version: 0.6.2 | Maintainers: 2

Hyperglot – a database and tools for detecting language support in fonts

Hyperglot is an open research project dedicated to documenting how the world’s languages are written. By mapping orthographies and their requirements, it supports inclusive, multilingual type design and equitable access to high-quality typography for underserved communities. Hyperglot currently covers 783 languages, representing approximately 7.3 billion speakers, and is developed as open source by Rosetta Type/Research in collaboration with a global community of contributors and licensed under the Apache 2.0 license.

Hyperglot is available as:

  • the Hyperglot web apps,
  • the command-line tool: hyperglot,
  • the Python package: import hyperglot (see examples for basic usage).

📖 Learn more about Hyperglot
🙋 Read the FAQ

💰 Sponsor via GitHub or directly via Hyperglot sponsorship. Any and all contributions are much appreciated! 🙏

Data validity & contributing

Hyperglot is a work in progress and provided AS IS. The validity of language data varies and continues to improve. Each language includes a validity label (todo, draft, preliminary, verified) to help you assess the data.

Mapping all the world’s languages is a huge task—we need help from native speakers and language users! If you notice an error or see that a language is missing, please get in touch (via email or Issues). We welcome contributions and will credit your input.

The data structure is documented in a separate README file along with guidelines for contributing.

Core concepts

The following concepts are essential to understanding how Hyperglot works.

A language can be written in one or more scripts. Each such writing system is represented in Hyperglot as an orthography. Most languages have a single primary orthography; however, some use multiple orthographies either independently (for example, in different regions) or concurrently (such as Serbian or Japanese).

In the database, an orthography contains the following character sets:

  • base – the required, essential characters,
  • aux – non-essential, recommended characters,
  • marks – combining marks,
  • punctuation,
  • numerals, and
  • currency.
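
To make the structure above concrete, here is a minimal sketch of how one orthography record could be modelled. The Orthography class, its required() helper, and the sample characters are hypothetical illustrations, not Hyperglot's actual classes or data; only the six character-set names come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Orthography:
    """Hypothetical stand-in for one orthography record (not Hyperglot's own class)."""

    script: str
    status: str = "primary"
    base: set[str] = field(default_factory=set)
    aux: set[str] = field(default_factory=set)
    marks: set[str] = field(default_factory=set)
    punctuation: set[str] = field(default_factory=set)
    numerals: set[str] = field(default_factory=set)
    currency: set[str] = field(default_factory=set)

    def required(self, check=("base",)):
        """Return the union of the character sets named in `check`."""
        chars = set()
        for name in check:
            chars |= getattr(self, name)
        return chars


# Illustrative data only, not copied from the database.
example = Orthography(
    script="Latin",
    base=set("abc"),
    aux={"\u00e9"},  # e with acute accent
    numerals=set("0123456789"),
)

print(sorted(example.required()))
print(len(example.required(("base", "numerals"))))
```

Keeping each set separate lets a caller decide which ones count as required for a given check, instead of baking that decision into the data.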

A script, however, is more than a collection of characters. It also defines how characters interact when combined. This behavior is known as shaping and, in digital fonts, is implemented using OpenType features.

Read the detailed description of the database structure

Language support detection process

To detect language support in a font, Hyperglot performs the following checks:

  • Required characters are present. Which characters are considered required is specified by filtering based on language/orthography status, data validity, and by selecting which character sets to check against.
  • Precomposed character combinations are handled by the font. For character combinations that have a unique code point in Unicode, one of the following must hold (depending on the settings):
    • The encoded, precomposed character combinations are present.
    • Base characters and mark characters from these combinations are present independently.
    • Both of the above.
  • Shaping behavior is correctly handled by the font, where applicable:
    • Required mark-positioning instructions are present.
    • Required alternates for joining behavior (for example, in Arabic) are present.
    • Conjunct syllable construction in Brahmi-derived scripts is supported. (Currently supported only for Hindi/Devanagari.)
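
The character-level part of these checks can be approximated with plain set logic. The sketch below is a simplification under stated assumptions: missing_characters and the pretend font are hypothetical names, the font is reduced to a set of code points, and shaping checks are not modelled at all. It only illustrates the difference between requiring a precomposed code point and requiring its components.

```python
import unicodedata


def missing_characters(required, font_codepoints, decomposed=False):
    """Return the required characters the font cannot represent.

    required: iterable of single-character strings.
    font_codepoints: set of integer code points the font covers.
    decomposed: if True, a precomposed character is judged by its
        canonical (NFD) components instead of its own code point.
    """
    missing = set()
    for char in required:
        parts = unicodedata.normalize("NFD", char)
        # Characters without a decomposition are always checked as-is.
        needed = parts if (decomposed and len(parts) > 1) else char
        if not all(ord(c) in font_codepoints for c in needed):
            missing.add(char)
    return missing


# A pretend font with "a", "e" and the combining acute accent (U+0301),
# but without the precomposed "e with acute" (U+00E9).
font = {ord("a"), ord("e"), 0x0301}
required = ["a", "\u00e9"]

print(missing_characters(required, font))
print(missing_characters(required, font, decomposed=True))
```

With the default setting the precomposed character is reported as missing; with decomposed=True the same font passes because both components are present.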

Additional design-related notes are provided for the user’s discretion when assessing design quality. Hyperglot does not assess the font design in any way.

Command-line tools

Installation

You will need to have Python 3 installed. Install via pip:

pip install hyperglot

Besides the main hyperglot command used for font inspection, the package also includes:

  • hyperglot-report – explore missing language support (see below).
  • hyperglot-data – review language data stored in the database.
  • hyperglot-validate, hyperglot-save, and hyperglot-export – manage and process data when contributing.

Basic usage

Use:

hyperglot path/to/font.otf

to output a list of supported languages (and other data) for a font. Use:

hyperglot path/to/font.otf path/to/anotherfont.otf …

to check several fonts at once, or their combined coverage (with -m union).

Advanced options

  • -c, --check: Specify which character sets to check against. Options are 'base, auxiliary, punctuation, numerals, currency, all', or a comma-separated combination of these. (Default: 'base')
  • --validity: Filter languages by data validity level. Options are 'todo, draft, preliminary, verified'. (Default: 'preliminary')
  • -s, --status: Specify which languages to consider when checking support. Options are 'living, historical, constructed, all', or a comma-separated combination of these. (Default: 'living,constructed')
  • -o, --orthography: Which orthographies to consider when checking support for a language. Options are 'primary, secondary, historical, transliteration, all', or a comma-separated combination of these. (Default: 'primary')
  • -d, --decomposed: For precomposed character combinations, require only the individual component characters. By default, precomposed character combinations are also required when they have a unique code point in Unicode. (Default: False)
  • -m, --marks: Require that a font include all combining marks used by a language’s orthography. By default, only marks that are not part of precomposed character combinations are required. (Default: False)
  • --sort: Specify the sort order. Use "speakers" to sort by number of speakers. (Default: "alphabetic")
  • --sort-dir: Specify the sort direction. Use "desc" for descending order. (Default: "asc" for ascending order)
  • -y, --output: Specify a file path to write the output to, in YAML format. For a single input font, the output is a subset of the Hyperglot database containing the languages and orthographies supported by the font. When multiple fonts are provided, the YAML file contains a top-level key for each font. If the -m option is provided, the output includes the specific intersection or union result.
  • -t, --shaping-threshold: Set the frequency threshold for complex-script shaping checks. A font passes when it renders correctly for combinations at or above this threshold. Frequencies range from 1.0 (most frequent combinations) to 0.0 (rarest combinations). (Default: 0.01)
  • --no-shaping: Disable shaping checks (mark attachment, joining behavior, and conjunct shaping). (Default: shaping checks enabled)
  • -v, --verbose: Enable verbose logging.
  • -V, --version: Print the Hyperglot version number.
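
The effect of the shaping threshold described above can be sketched as a simple filter. This is an illustration only: passes_shaping, the result mapping, and the sample frequencies are assumptions made for the example, and the sketch does not reproduce how Hyperglot actually tests rendering.

```python
def passes_shaping(results, threshold=0.01):
    """Decide whether a font passes, ignoring combinations rarer than `threshold`.

    results: mapping of combination -> (frequency, renders_ok), where
        frequency runs from 1.0 (most frequent) down to 0.0 (rarest).
    """
    return all(ok for freq, ok in results.values() if freq >= threshold)


# Hypothetical check results for two character combinations.
results = {
    "common-combination": (0.5, True),
    "rare-combination": (0.001, False),
}

print(passes_shaping(results))                 # rare failure is below the default threshold
print(passes_shaping(results, threshold=0.0))  # every combination must render correctly
```

Lowering the threshold makes the check stricter, because rarer combinations are then included in the set that must render correctly.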

Explore missing language support

The hyperglot-report command reports missing characters and shaping support. A common use case is identifying languages that could be supported with minimal additional work in a given font. The command accepts the same options as hyperglot, plus the following:

  • --report-missing: Report languages missing n or fewer characters. If n is 0, all languages with any number of missing characters are reported. (Default: 0)
  • --report-marks: Report languages missing n or fewer mark-attachment sequences. If n is 0, all languages with any number of missing mark-attachment sequences are reported. (Default: 0)
  • --report-joining: Report languages missing n or fewer joining sequences. If n is 0, all languages with any number of missing joining sequences are reported. (Default: 0)
  • --report-all: Set or override all other --report-* options.
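
The "n or fewer" rule shared by the --report-* options can be sketched as follows. The function name, the input mapping, and the placeholder language keys are hypothetical; the sketch only mirrors the rule as worded above, including the special meaning of 0, and assumes fully supported languages are left out of the report.

```python
def languages_to_report(missing_by_language, n=0):
    """Select languages that are missing at most `n` items.

    missing_by_language: mapping of language -> set of missing items.
    n: upper bound on missing items; 0 means "any number".
    Languages with nothing missing are skipped (assumed: nothing to report).
    """
    report = {}
    for language, missing in missing_by_language.items():
        if not missing:
            continue
        if n == 0 or len(missing) <= n:
            report[language] = missing
    return report


# Placeholder keys and characters, not real database entries.
missing_by_language = {
    "language-a": set(),
    "language-b": {"x"},
    "language-c": {"x", "y", "z"},
}

print(sorted(languages_to_report(missing_by_language)))
print(sorted(languages_to_report(missing_by_language, n=1)))
```

A small n narrows the report to languages that are close to being supported, which matches the use case of finding support that needs minimal additional work.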

Roadmap

  • 🪶 Change licence to Apache 2
  • 💰 Invite sponsorship and funding #174
  • 🤖 Basic analysis of shaping support provided by the font (GPOS and GSUB): check whether character combinations are affected by font OpenType features, enabling scalable support for complex combinations (e.g., Arabic, Hindi/Devanagari). #176
  • ➡️ Export in a format suitable for submission to Unicode CLDR
  • 🌍 Database web app: add links to other resources per language
  • 📚 Improve language data, sources, and validity for languages with fewer authoritative references #157
  • 🌍 Add data for more African languages and scripts, e.g., N'Ko #195
  • 🇮🇳 Add more shaping checks for Brahmi-derived scripts #176
  • 🇧🇷 Add data for indigenous Brazilian languages (Rafael Dietzch and students)
  • 🇺🇳 Secure funding to expand language coverage

Other

A comparison of Hyperglot and the Unicode CLDR (this may currently be outdated).

Notes

  • Fonts included in the repository for testing purposes are licensed under their respective licenses.
  • Data included in the other directory is replicated from various public domain and open source origins for comparison and aggregation (mostly present in historic commits of this repository).
