URI::IDNA
An implementation of IDNA2008, UTS46, the WHATWG URL Standard's IDNA, and Punycode in pure Ruby.
This gem provides a number of functions for converting internationalized domain names (IDNs) between the Unicode and ASCII Compatible Encoding (ACE) forms.
Installation
Add to your Gemfile:
gem "uri-idna"
And then run bundle install.
Usage
There are plenty of ways to convert IDNs between Unicode and ACE forms.
IDNA2008
RFC 5891 defines two protocols for IDN conversion: Registration and Domain Name Lookup.
Registration protocol
URI::IDNA.register(alabel:, ulabel:, **options)
Options
- check_hyphens: true – whether to check hyphens according to Section 5.4.
- leading_combining: true – whether to check leading combining marks according to Section 5.4.
- check_joiners: true – whether to check CONTEXTJ code points according to Section 5.4.
- check_others: true – whether to check CONTEXTO code points according to Section 5.4.
- check_bidi: true – whether to check bidirectional characters according to Section 5.4.
require "uri/idna"
URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp", ulabel: "ハロー・ワールド.jp")
URI::IDNA.register(ulabel: "ハロー・ワールド.jp")
URI::IDNA.register(alabel: "xn--gdkl8fhk5egc.jp")
URI::IDNA.register(ulabel: "☕.us")
#<URI::IDNA::InvalidCodepointError: Codepoint U+2615 at position 1 of "☕" not allowed>
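To make the check_hyphens option more concrete, here is a minimal standalone sketch of RFC 5891's hyphen restrictions: a label must not start or end with a hyphen, and must not contain "--" in the third and fourth positions. The hyphens_ok? helper is hypothetical and is not part of uri-idna; the gem's own check may differ in detail.

```ruby
# Hypothetical helper (not part of uri-idna): applies the RFC 5891
# hyphen restrictions to a single Unicode label.
def hyphens_ok?(label)
  # A label must not start or end with a hyphen.
  return false if label.start_with?("-") || label.end_with?("-")
  # "--" in the third and fourth positions is reserved (it is how "xn--" begins).
  return false if label[2, 2] == "--"

  true
end

puts hyphens_ok?("example")  # true
puts hyphens_ok?("-example") # false
puts hyphens_ok?("ab--cd")   # false
```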
Domain Name Lookup Protocol
URI::IDNA.lookup(domain_name, **options)
Options
- check_hyphens: true – whether to check hyphens according to Section 4.2.3.1.
- leading_combining: true – whether to check leading combining marks according to Section 4.2.3.2.
- check_joiners: true – whether to check CONTEXTJ code points according to Section 4.2.3.3.
- check_others: true – whether to check CONTEXTO code points according to Section 4.2.3.3.
- check_bidi: true – whether to check bidirectional characters according to Section 4.2.3.4.
- verify_dns_length: true – whether to check DNS length according to Section 4.4.
require "uri/idna"
URI::IDNA.lookup("ハロー・ワールド.jp")
URI::IDNA.lookup("xn--pck0a1b0a6a2e.jp")
URI::IDNA.lookup("Ῠ.me")
#<URI::IDNA::InvalidCodepointError: Codepoint U+1FE8 at position 1 of "Ῠ" not allowed>
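As an illustration of what a DNS length check such as verify_dns_length typically involves, the sketch below applies the usual DNS limits to an ASCII domain name: each label is 1 to 63 octets and the whole name, ignoring a trailing root dot, is at most 253 octets. The dns_length_ok? helper is hypothetical and is not the gem's implementation.

```ruby
# Hypothetical helper (not part of uri-idna): checks the common DNS
# length limits on an already-ASCII (ACE) domain name.
def dns_length_ok?(ascii_domain)
  name = ascii_domain.chomp(".") # ignore a single trailing root dot
  return false if name.empty? || name.bytesize > 253

  # The -1 limit keeps empty labels, so "a..com" is rejected.
  name.split(".", -1).all? { |label| (1..63).cover?(label.bytesize) }
end

puts dns_length_ok?("example.com")        # true
puts dns_length_ok?("#{'a' * 64}.com")    # false
puts dns_length_ok?("a..com")             # false
```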
Unicode UTS46 (TR46)
Current revision: 31
UTS46 defines two IDN conversion functions: ToASCII and ToUnicode.
ToASCII
URI::IDNA.to_ascii(domain_name, **options)
Options
require "uri/idna"
URI::IDNA.to_ascii("Bloß.de")
URI::IDNA.to_ascii("Bloß.de", transitional_processing: true)
URI::IDNA.to_ascii("☕.us")
ToUnicode
URI::IDNA.to_unicode(domain_name, **options)
Options
require "uri/idna"
URI::IDNA.to_unicode("xn--blo-7ka.de")
IDNA2008 compatibility
It's possible to apply the UTS46 mapping first and then run IDNA2008 processing, so that the result fully conforms to IDNA2008:
require "uri/idna"
char = "⼤"
char.ord
char.downcase.ord
URI::IDNA::UTS46::Mapping.call(char).ord
domain = "⼤.cn"
URI::IDNA.lookup(domain)
mapped_domain = URI::IDNA::UTS46::Mapping.call(domain)
URI::IDNA.lookup(mapped_domain)
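The character in the example above is U+2F24 (KANGXI RADICAL BIG), a compatibility character rather than the ordinary ideograph U+5927. As a rough illustration that needs only the Ruby standard library, NFKC normalization performs the same substitution for this particular character. This is only an approximation for demonstration: the UTS46 mapping table is not the same thing as NFKC, and other characters may map differently.

```ruby
# Standard-library illustration only; this is not the UTS46 mapping itself.
radical = "\u2F24"                         # KANGXI RADICAL BIG
mapped  = radical.unicode_normalize(:nfkc) # compatibility normalization

puts format("U+%04X", radical.ord)          # U+2F24
puts format("U+%04X", radical.downcase.ord) # U+2F24 (downcasing changes nothing)
puts format("U+%04X", mapped.ord)           # U+5927
```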
WHATWG
WHATWG's URL Standard uses the UTS46 algorithm to define its ToASCII and ToUnicode functions. It abstracts all available flags and exposes only one instead: the be_strict flag.
Note that the check_hyphens UTS46 option is set to false in this algorithm.
ToASCII
URI::IDNA.whatwg_to_ascii(domain_name, **options)
Options
- be_strict: true – defines the values of the use_std3_ascii_rules and verify_dns_length UTS46 options.
require "uri/idna"
URI::IDNA.whatwg_to_ascii("Bloß.de")
URI::IDNA.whatwg_to_ascii("2003_rules.com", be_strict: false)
URI::IDNA.whatwg_to_ascii("2003_rules.com")
#<URI::IDNA::InvalidCodepointError: Codepoint U+005F at position 5 of "2003_rules" not allowed>
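The underscore is rejected in strict mode because STD3 rules limit ASCII characters in a label to letters, digits, and the hyphen. A minimal sketch of that rule follows; std3_violation is a hypothetical helper, not part of uri-idna, and it reports the first offending ASCII character with its 1-based position.

```ruby
# Hypothetical helper (not part of uri-idna): finds the first ASCII
# character in a label that STD3 rules do not allow.
def std3_violation(label)
  label.each_char.with_index(1) do |char, position|
    next unless char.ascii_only? # non-ASCII is handled by other checks
    return [char, position] unless char.match?(/[A-Za-z0-9-]/)
  end
  nil
end

p std3_violation("2003_rules") # ["_", 5]
p std3_violation("2003-rules") # nil
```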
ToUnicode
URI::IDNA.whatwg_to_unicode(domain_name, **options)
Options
- be_strict: true – defines the value of the use_std3_ascii_rules UTS46 option.
require "uri/idna"
URI::IDNA.whatwg_to_unicode("xn--blo-7ka.de")
Punycode
The Punycode module converts between Unicode and Punycode. Note that Punycode on its own is not IDNA2008 compliant: the module only performs the encoding, and no validation is performed.
require "uri/idna/punycode"
URI::IDNA::Punycode.encode("ハロー・ワールド")
URI::IDNA::Punycode.decode("gdkl8fhk5egc")
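For readers curious about what the encoding step does, here is a compact sketch of the RFC 3492 encoding procedure in plain Ruby. It is written for illustration and is not the gem's implementation; the PunycodeSketch module and its method names are hypothetical. Like the gem's Punycode module, it performs no validation and does not add the "xn--" prefix.

```ruby
# Illustrative RFC 3492 encoder; PunycodeSketch is a hypothetical name.
module PunycodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128

  module_function

  # Maps a digit value 0..35 to "a".."z", "0".."9".
  def digit(value)
    (value < 26 ? 97 + value : 22 + value).chr
  end

  # Bias adaptation function from RFC 3492, Section 6.1.
  def adapt(delta, num_points, first_time)
    delta = first_time ? delta / DAMP : delta / 2
    delta += delta / num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= BASE - TMIN
      k += BASE
    end
    k + (((BASE - TMIN + 1) * delta) / (delta + SKEW))
  end

  def encode(input)
    codepoints = input.codepoints
    basic = codepoints.select { |c| c < 0x80 }
    output = basic.pack("U*")          # basic code points are copied first
    output << "-" unless basic.empty?  # delimiter, only if any were copied

    n = INITIAL_N
    delta = 0
    bias = INITIAL_BIAS
    handled = basic.length

    while handled < codepoints.length
      m = codepoints.select { |c| c >= n }.min # next code point to insert
      delta += (m - n) * (handled + 1)
      n = m
      codepoints.each do |c|
        delta += 1 if c < n
        next unless c == n

        # Emit delta as a generalized variable-length integer.
        q = delta
        k = BASE
        loop do
          t = (k - bias).clamp(TMIN, TMAX)
          break if q < t

          output << digit(t + ((q - t) % (BASE - t)))
          q = (q - t) / (BASE - t)
          k += BASE
        end
        output << digit(q)
        bias = adapt(delta, handled + 1, handled == basic.length)
        delta = 0
        handled += 1
      end
      delta += 1
      n += 1
    end
    output
  end
end

puts PunycodeSketch.encode("bücher") # bcher-kva
puts PunycodeSketch.encode("bloß")   # blo-7ka
```

The second result matches the "xn--blo-7ka.de" form used in the ToUnicode example earlier in this README once the "xn--" prefix is added.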
Full technical reference:
IDNA2008
Punycode
- RFC 3492 – Punycode: A Bootstring encoding of Unicode
UTS46 (also referenced as TR46)
Development
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.
Generating Unicode data
This gem uses Unicode data files to perform IDN conversion. To generate new Unicode data files, run bundle exec rake idna:generate.
To specify the Unicode version, use the VERSION environment variable, e.g. VERSION=15.1.0 bundle exec rake idna:generate.
By default, the Unicode version is the one used by the running Ruby version (RbConfig::CONFIG["UNICODE_VERSION"]).
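To see which Unicode version that default resolves to on a given machine, the value can be read directly from RbConfig, which is part of the Ruby standard library:

```ruby
require "rbconfig"

# The Unicode version the running Ruby interpreter was built against.
unicode_version = RbConfig::CONFIG["UNICODE_VERSION"]
puts "Ruby #{RUBY_VERSION} uses Unicode #{unicode_version}"
```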
To set the directory for generated files, use the DEST_DIR environment variable, e.g. DEST_DIR=lib/uri/idna/data bundle exec rake idna:generate.
Unicode data is cached in the tmp directory by default. To change it, use the CACHE_DIR environment variable, e.g. CACHE_DIR=~/.cache/unicode_data bundle exec rake idna:generate.
Inspect Unicode data
To inspect Unicode data, run bundle exec rake 'idna:inspect[<HEX_CODE>]'.
To specify the Unicode version or the cache directory, use the VERSION or CACHE_DIR environment variables, e.g. VERSION=15.1.0 bundle exec rake 'idna:inspect[1f495]'.
Update UTS46 test suite data
To update UTS46 test suite data, run bundle exec rake idna:update_uts46_test_suite.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/skryukov/uri-idna.
License
The gem is available as open source under the terms of the MIT License.