@fgv/ts-bcp47
TypeScript Utilities for BCP-47 Language Tags
Summary
TypeScript utilities for parsing, manipulating and comparing BCP-47 language tags.
Installation
with npm:
npm install @fgv/ts-bcp47
API Documentation
Extracted API documentation is here
Overview
Classes and functions to parse, validate, normalize and compare BCP-47 language tags.
TL;DR
For those who already understand BCP-47 language tags and just want to get started, here are a few examples:
import { Bcp47 } from '@fgv/ts-bcp47';
// Parse a tag and extract the primary language and region subtags as written.
const { primaryLanguage, region } = Bcp47.tag('en-us').orThrow().subtags;
// Parse the same tag, normalizing it to canonical form.
const canonical = Bcp47.tag('en-us', { normalization: 'canonical' }).orThrow().subtags;
// Normalize a grandfathered tag to its preferred form.
const preferred = Bcp47.tag('art-lojban', { normalization: 'preferred' }).orThrow().tag;
// Compare pairs of tags; each call yields a similarity score from 0.0 to 1.0.
const caseOnly = Bcp47.similarity('es-MX', 'es-mx').orThrow();
const explicitScript = Bcp47.similarity('es-MX', 'es-latn-mx').orThrow();
const macroRegion = Bcp47.similarity('es-419', 'es-MX').orThrow();
const otherRegion = Bcp47.similarity('es-419', 'es-ES').orThrow();
const neutralRegion = Bcp47.similarity('es', 'es-MX').orThrow();
const otherLanguage = Bcp47.similarity('en', 'es').orThrow();
const otherScript = Bcp47.similarity('zh-Hans', 'zh-Hant').orThrow();
Note: This library uses the Result pattern, so the return value from any method that might fail is a Result object that must be tested for success or failure. These examples use orThrow to convert an error result to an exception; orDefault can be used instead to convert an error result to undefined.
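To make the note above concrete, here is a minimal sketch of how a Result-style wrapper with orThrow and orDefault might behave. The class below and the parseNonEmpty helper are hypothetical illustrations written for this README; they are not the library's actual implementation.

```typescript
// Hypothetical, simplified Result type; NOT the class shipped by the library.
class Result<T> {
  private constructor(
    private readonly _value: T | undefined,
    private readonly _message: string | undefined
  ) {}

  static succeed<T>(value: T): Result<T> {
    return new Result<T>(value, undefined);
  }

  static fail<T>(message: string): Result<T> {
    return new Result<T>(undefined, message);
  }

  isSuccess(): boolean {
    return this._message === undefined;
  }

  // Convert a failure into an exception.
  orThrow(): T {
    if (this._message !== undefined) {
      throw new Error(this._message);
    }
    return this._value as T;
  }

  // Convert a failure into undefined.
  orDefault(): T | undefined {
    return this._message === undefined ? this._value : undefined;
  }
}

// Hypothetical operation that can fail: accept only non-empty tags.
function parseNonEmpty(tag: string): Result<string> {
  return tag.length > 0
    ? Result.succeed(tag)
    : Result.fail<string>('tag must not be empty');
}

const parsedTag = parseNonEmpty('en-US').orThrow(); // the tag itself
const missingTag = parseNonEmpty('').orDefault(); // undefined, no exception
```

Callers that want to handle the failure explicitly would test isSuccess first; orThrow and orDefault are shortcuts for the two most common policies.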
Anatomy of a BCP-47 language tag.
As specified in RFC 5646, a language tag consists of a series of subtags (mostly optional), each of which describes some aspect of the language being referenced.
Subtags
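The full set of subtags that make up a language tag is, in order: primary language, extended language (extlang), script, region, variant, extension and private-use.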
Grandfathered Tags
The RFC allows for a handful of grandfathered tags which do not meet the current specification. Those tags are recognized in their entirety and are not composed of subtags, so for grandfathered tags only, even the primary language is undefined.
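The subtag structure described above can be sketched as a small syntactic parser. This is a simplified illustration of the RFC 5646 grammar, not the library's parser: the ParsedTag shape and every name other than primaryLanguage and region are assumptions made for this example. Grandfathered tags are matched as whole strings, so their primary language stays undefined.

```typescript
// Hypothetical shape; only primaryLanguage and region appear in the examples above.
interface ParsedTag {
  grandfathered?: string;
  primaryLanguage?: string;
  extlangs: string[];
  script?: string;
  region?: string;
  variants: string[];
  extensions: string[];
  privateUse: string[];
}

// Grandfathered tags listed in RFC 5646, lowercased for lookup.
const GRANDFATHERED: string[] = [
  'en-gb-oed', 'i-ami', 'i-bnn', 'i-default', 'i-enochian', 'i-hak', 'i-klingon',
  'i-lux', 'i-mingo', 'i-navajo', 'i-pwn', 'i-tao', 'i-tay', 'i-tsu',
  'sgn-be-fr', 'sgn-be-nl', 'sgn-ch-de', 'art-lojban', 'cel-gaulish', 'no-bok',
  'no-nyn', 'zh-guoyu', 'zh-hakka', 'zh-min', 'zh-min-nan', 'zh-xiang'
];

const LANGUAGE = /^[A-Za-z]{2,8}$/;
const EXTLANG = /^[A-Za-z]{3}$/;
const SCRIPT = /^[A-Za-z]{4}$/;
const REGION = /^(?:[A-Za-z]{2}|[0-9]{3})$/;
const VARIANT = /^(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})$/;
const SINGLETON = /^[0-9A-WY-Za-wy-z]$/;
const EXTENSION_VALUE = /^[A-Za-z0-9]{2,8}$/;
const PRIVATE_VALUE = /^[A-Za-z0-9]{1,8}$/;

// Returns the subtags of a well-formed tag, or undefined if the tag is malformed.
function parseTag(tag: string): ParsedTag | undefined {
  const parsed: ParsedTag = { extlangs: [], variants: [], extensions: [], privateUse: [] };
  if (GRANDFATHERED.indexOf(tag.toLowerCase()) >= 0) {
    parsed.grandfathered = tag; // recognized as a whole; no subtags
    return parsed;
  }
  const parts = tag.split('-');
  let i = 0;
  const next = (pattern: RegExp): boolean => i < parts.length && pattern.test(parts[i]);

  if (parts[0].toLowerCase() !== 'x') {
    if (!next(LANGUAGE)) return undefined;
    const language = parts[i++];
    parsed.primaryLanguage = language;
    // Extlangs may only follow a two- or three-letter language, at most three of them.
    while (language.length <= 3 && parsed.extlangs.length < 3 && next(EXTLANG)) {
      parsed.extlangs.push(parts[i++]);
    }
    if (next(SCRIPT)) parsed.script = parts[i++];
    if (next(REGION)) parsed.region = parts[i++];
    while (next(VARIANT)) parsed.variants.push(parts[i++]);
    while (next(SINGLETON)) {
      const extension = [parts[i++]];
      while (next(EXTENSION_VALUE)) extension.push(parts[i++]);
      if (extension.length < 2) return undefined; // singleton needs a value
      parsed.extensions.push(extension.join('-'));
    }
  }
  if (i < parts.length && parts[i].toLowerCase() === 'x') {
    i++;
    while (next(PRIVATE_VALUE)) parsed.privateUse.push(parts[i++]);
    if (parsed.privateUse.length === 0) return undefined;
  }
  // Any leftover subtag means the tag is not well-formed.
  return i === parts.length ? parsed : undefined;
}
```

The sketch only checks syntax and keeps subtags exactly as written; it does not consult the IANA registry.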
Validation
Tag validation considers the tag in its current form and never changes the tag itself.
The specification defines two levels of conformance for language tags, and this library defines a third.
Well-Formed Tags
A well-formed tag meets the basic syntactic requirements of the specification, but might not be valid in terms of content.
Valid Tags
A valid tag meets both the syntactic and semantic requirements of the specification, meaning that either all subtags or the full tag (in the case of grandfathered tags) are registered in the IANA language subtag registry, and neither extension nor variant subtags are repeated.
Strictly Valid Tags
A strictly valid tag is valid according to the specification and also meets the rules for variant and extlang prefixes defined by the specification and recorded in the language registry.
Examples
Some examples:
eng-US is well-formed because it meets the language tag syntax, but it is not valid because eng is not a registered language subtag.
en-US is both well-formed and valid, because en is a registered language subtag and US is a registered region subtag.
es-valencia-valencia is well-formed but not valid, because the valencia variant subtag is repeated.
es-valencia is well-formed and valid, but it is not strictly valid because the language subtag registry defines a ca prefix for the valencia subtag.
ca-valencia is well-formed, valid, and strictly valid.
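The three levels can be illustrated with a toy classifier. This sketch assumes a tiny hand-written excerpt of the registry (just enough for the examples above) and only understands tags of the form language, optional region, then variants; the names classify, REGISTERED_LANGUAGES, REGISTERED_REGIONS and VARIANT_PREFIXES are hypothetical and not part of the library.

```typescript
type ValidationLevel = 'not well-formed' | 'well-formed' | 'valid' | 'strictly valid';

// Hypothetical excerpt of the IANA registry, lowercased for lookup.
const REGISTERED_LANGUAGES: string[] = ['en', 'es', 'ca'];
const REGISTERED_REGIONS: string[] = ['us'];
// Registered variants and the language prefixes the registry records for them.
const VARIANT_PREFIXES: Record<string, string[]> = { valencia: ['ca'] };

function classify(tag: string): ValidationLevel {
  // Simplified syntax: language[-region][-variant...]; scripts, extlangs,
  // extensions and private-use subtags are out of scope for this sketch.
  const pattern = /^([a-z]{2,8})(?:-([a-z]{2}|[0-9]{3}))?((?:-[a-z0-9]{5,8})*)$/;
  const found = pattern.exec(tag.toLowerCase());
  if (found === null) {
    return 'not well-formed';
  }
  const language = found[1];
  const region: string | undefined = found[2];
  const variants = found[3] ? found[3].slice(1).split('-') : [];

  // Valid: every subtag is registered and no variant is repeated.
  const registered =
    REGISTERED_LANGUAGES.indexOf(language) >= 0 &&
    (region === undefined || REGISTERED_REGIONS.indexOf(region) >= 0) &&
    variants.every((v) => VARIANT_PREFIXES[v] !== undefined);
  const repeated = variants.some((v, index) => variants.indexOf(v) !== index);
  if (!registered || repeated) {
    return 'well-formed';
  }

  // Strictly valid: each variant is used with one of its registered prefixes.
  // Real prefixes can be longer than a language; this sketch compares the language only.
  const prefixesOk = variants.every(
    (v) => (VARIANT_PREFIXES[v] || []).indexOf(language) >= 0
  );
  return prefixesOk ? 'strictly valid' : 'valid';
}

console.log(classify('eng-US'));
console.log(classify('es-valencia'));
console.log(classify('ca-valencia'));
```

A tag with no variants has no prefix rules to violate, so this sketch reports en-US as strictly valid.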
Normalization
Normalization transforms a tag to produce a new tag which is semantically identical, but preferred for some reason.
Not-normalized
A non-normalized tag must be well-formed and might be valid or strictly valid, but it does not use the letter case conventions recommended in the spec.
Canonical Form
A tag in canonical form meets all of the letter case conventions recommended by the specification, in addition to being at least well-formed.
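The letter case conventions in RFC 5646 (section 2.1.1) are simple enough to sketch directly: everything is lowercase, except that two-letter subtags are uppercased and four-letter subtags are title-cased when they neither start the tag nor follow a singleton. The function name toCanonicalCase is made up for this example and is not the library's API.

```typescript
// Apply the RFC 5646 case conventions to a well-formed tag.
function toCanonicalCase(tag: string): string {
  const parts = tag.split('-');
  const cased: string[] = [];
  let afterSingleton = false;
  for (let index = 0; index < parts.length; index++) {
    const lower = parts[index].toLowerCase();
    if (index > 0 && !afterSingleton && lower.length === 2) {
      // Two-letter subtag after the first position: a region, e.g. "US".
      cased.push(lower.toUpperCase());
    } else if (index > 0 && !afterSingleton && lower.length === 4) {
      // Four-letter subtag after the first position: a script, e.g. "Latn".
      cased.push(lower.charAt(0).toUpperCase() + lower.slice(1));
    } else {
      cased.push(lower);
    }
    // Everything after an extension or private-use singleton stays lowercase.
    if (lower.length === 1) {
      afterSingleton = true;
    }
  }
  return cased.join('-');
}

console.log(toCanonicalCase('en-latn-us')); // en-Latn-US
console.log(toCanonicalCase('zh-cmn-hans')); // zh-cmn-Hans
```

The sketch assumes its input is already well-formed; it changes case only and never adds, removes or reorders subtags.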
Preferred Form
In addition to being strictly valid and canonical, tags in preferred form do not have any deprecated, redundant or suppressed subtags.
Examples
zh-cmn-hans is strictly valid, but not canonical or preferred.
zh-cmn-Hans is strictly valid and canonical, but not preferred, because the subtag registry lists zh-cmn-Hans as redundant, with the preferred value cmn-Hans.
cmn-Hans is strictly valid, canonical and preferred.
en-latn-us is strictly valid, but not canonical or preferred.
en-Latn-US is strictly valid and canonical, but not preferred, because the subtag registry lists Latn as the suppressed script for the en language.
en-US is strictly valid, canonical and preferred.
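The step from canonical to preferred form in these examples can be sketched with two lookups. The tables below are a hand-copied excerpt of registry data covering only the tags mentioned in this README, and toPreferred is a hypothetical name; the sketch assumes its input is already canonical and that any script subtag directly follows the language.

```typescript
// Whole-tag replacements (redundant and grandfathered entries with a preferred value).
const WHOLE_TAG_PREFERRED: Record<string, string> = {
  'zh-cmn-Hans': 'cmn-Hans',
  'art-lojban': 'jbo'
};

// Suppress-Script entries: the script that should be omitted for a language.
const SUPPRESS_SCRIPT: Record<string, string> = {
  en: 'Latn',
  es: 'Latn'
};

// Convert a canonical tag to preferred form (simplified).
function toPreferred(canonicalTag: string): string {
  const whole = WHOLE_TAG_PREFERRED[canonicalTag];
  if (whole !== undefined) {
    return whole;
  }
  const parts = canonicalTag.split('-');
  const suppressed = SUPPRESS_SCRIPT[parts[0]];
  // Drop the script subtag when it is the suppressed script for the language.
  const kept = parts.filter(
    (part, index) => !(index === 1 && suppressed !== undefined && part === suppressed)
  );
  return kept.join('-');
}

console.log(toPreferred('zh-cmn-Hans')); // cmn-Hans
console.log(toPreferred('en-Latn-US')); // en-US
```

A fuller treatment would also replace deprecated individual subtags with their preferred values, which this sketch leaves out.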
Tag Matching
The similarity function (Bcp47.similarity) matches language tags using semantic similarity, unlike RFC 4647, which relies on purely syntactic rules. This semantic match yields much better results in many cases.
For any given language tag pair, the similarity function returns a score in the range 0.0 (no similarity) to 1.0 (exact match).
The degrees of similarity are (from most to least similar):
exact (1.0) - The two language tags are semantically identical.
variant (0.9) - The tags vary only in extension or private-use subtags.
region (0.8) - The tags match on language, script and region but vary in variant, extension or private-use subtags.
macroRegion (0.7) - The tags match on language and script, and one of the region subtags is a macro-region (e.g. 419 for Latin America) which encompasses the second region subtag.
neutralRegion (0.6) - The tags match on language and script, and only one of the tags contains a region subtag.
affinity (0.5) - The tags match on language and script, and the two region subtags have an orthographic affinity. Orthographic affinity is defined in this package in the overrides.json file.
preferredRegion (0.4) - The tags match on language and script, and one of the tags has the preferred region subtag for the language. Preferred region is also defined in this package in overrides.json.
sibling (0.3) - The tags match on language and script but both have region subtags that are otherwise unrelated.
undetermined (0.2) - One of the languages is the special language und.
none (0.0) - The tags do not match at all.
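As a rough illustration of how these degrees fall out of a subtag-by-subtag comparison, here is a simplified scorer for tags made of a language, an optional script and an optional region. It is a sketch under stated assumptions, not the library's algorithm: the macro-region and suppress-script tables are small hypothetical excerpts, a script mismatch is assumed to score none, and the variant, region, affinity and preferredRegion degrees are left out because they depend on subtags and override data not modelled here.

```typescript
const SCORE = {
  exact: 1.0, macroRegion: 0.7, neutralRegion: 0.6,
  sibling: 0.3, undetermined: 0.2, none: 0.0
};

// Hypothetical excerpts of registry data, lowercased for comparison.
const DEFAULT_SCRIPT: Record<string, string> = { en: 'latn', es: 'latn' };
const MACRO_REGIONS: Record<string, string[]> = { '419': ['mx', 'ar', 'co'] };

interface SimpleTag {
  language: string;
  script?: string;
  region?: string;
}

function parseSimple(tag: string): SimpleTag | undefined {
  const pattern = /^([a-z]{2,3})(?:-([a-z]{4}))?(?:-([a-z]{2}|[0-9]{3}))?$/;
  const found = pattern.exec(tag.toLowerCase());
  if (found === null) {
    return undefined;
  }
  const language = found[1];
  // A missing script is treated as the language's suppressed script, if known.
  return { language, script: found[2] || DEFAULT_SCRIPT[language], region: found[3] };
}

function encompasses(macro: string, region: string): boolean {
  return (MACRO_REGIONS[macro] || []).indexOf(region) >= 0;
}

// Returns a score from 0.0 to 1.0, or undefined for tags this sketch cannot parse.
function sketchSimilarity(first: string, second: string): number | undefined {
  const a = parseSimple(first);
  const b = parseSimple(second);
  if (a === undefined || b === undefined) {
    return undefined;
  }
  if (a.language !== b.language) {
    const hasUnd = a.language === 'und' || b.language === 'und';
    return hasUnd ? SCORE.undetermined : SCORE.none;
  }
  if (a.script !== b.script) {
    return SCORE.none; // assumption: different scripts do not match
  }
  if (a.region === b.region) {
    return SCORE.exact;
  }
  if (a.region === undefined || b.region === undefined) {
    return SCORE.neutralRegion;
  }
  if (encompasses(a.region, b.region) || encompasses(b.region, a.region)) {
    return SCORE.macroRegion;
  }
  return SCORE.sibling;
}

console.log(sketchSimilarity('es-419', 'es-MX')); // 0.7
console.log(sketchSimilarity('es', 'es-MX')); // 0.6
```

Because the comparison works on parsed subtags, es-MX and es-latn-mx score as exact here even though the strings differ, which is the kind of result a purely syntactic match cannot give.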
See Also
RFC 5646 - Tags for Identifying Languages
IANA Language Subtag Registry