base62
base62 is a correctly implemented, compact and fast implementation of Base62
encoding/decoding algorithm.
It is inspired by the java implementation by glowfall.
This Base62
implementation can encode/decode both integers and bytes of arbitrary length,
the correctness is tested using a large number of randomly generated bytes.
It is much faster than big.Int
based implementation and is not much slower than typical Base64
implementations. See the benchmark results below.
Why Base62
In comparison with Base64/Base32, Base62 is more friendly to human:
- it contains only alpha-numerical symbols, no special characters
- can be validated by eyes and more simple regexp
- can be fully selected by mouse double-click in any text editors and browser address bar
- it's compact and generates shorter strings than Base32
- it's the most natural and unambiguous way to encode your data in human-readable form :)
Variations of Base62 algorithm cann be widely used to represent authentication data in printable and
easy-copyable form, for example to encode OAuth 2.0 access_token
data.
Usage
Encode(src []byte) []byte
EncodeToString(src []byte) string
Decode(src []byte) ([]byte, error)
DecodeString(src string) ([]byte, error)
FormatInt(num int64) []byte
FormatUint(num uint64) []byte
ParseInt(src []byte) (int64, error)
ParseUint(src []byte) (uint64, error)
EncodeToBuf(dst []byte, src []byte) []byte
DecodeToBuf(dst []byte, src []byte) ([]byte, error)
AppendInt(dst []byte, num int64) []byte
AppendUint(dst []byte, num uint64) []byte
enc := NewEncoding("...my-62-byte-string-alphabet...")
enc.XXX()
Benchmark
Benchmark_Encode-12 11654754 97.41 ns/op
Benchmark_Decode-12 15481666 73.60 ns/op
Benchmark_EncodeToString-12 11950086 99.54 ns/op
Benchmark_DecodeString-12 16301325 74.36 ns/op
Benchmark_EncodeToBuf-12 13855840 84.68 ns/op
Benchmark_DecodeToBuf-12 97695962 12.21 ns/op
Benchmark_EncodeInteger-12 29119437 41.30 ns/op
Benchmark_DecodeInteger-12 120328183 9.917 ns/op
Benchmark_Encode_BigInt-12 1000000 1048 ns/op
Benchmark_Base64_EncodeToString-12 17803440 70.12 ns/op
Benchmark_Base64_DecodeString-12 19884616 55.09 ns/op
Benchmark_Base64_Encode-12 68163142 17.93 ns/op
Benchmark_Base64_Decode-12 41990004 28.25 ns/op
How it works
Encoding Base64 and Base32 both are bit-aligned, 6 bits for Base64 (2^6 = 64)
and 5 bits for Base32 (2^5 = 32), but Base62 is not bit-aligned, it's inefficient
to do divmod operation for non-bit-aligned integers, typical Base62 implementations
are BigInt based, which encodes input data block by block to get better performance,
(e.g. https://github.com/keybase/saltpack/blob/master/encoding/basex).
64 characters can fully represent 6 bits, but 62 characters can not, if each character
represents 5 bits, that is the Base32 encoding.
A naive BigInt based algorithm gives the shortest result, which is roughly like this
(e.g. https://github.com/eknkc/basex/blob/6baac8ea8b19cc66d125286d213770fec0691867/basex.go#L46):
digits := []int{0}
for i := 0; i < len(src); i++ {
carry := int(src[i])
for j := 0; j < len(digits); j++ {
carry += digits[j] << 8
digits[j] = carry % e.base
carry = carry / e.base
}
for carry > 0 {
digits = append(digits, carry%e.base)
carry = carry / e.base
}
}
_ = digits
This works but the time complexity is O(n^2) where n is the length of src. If the
input can be very large, the cost is unacceptable.
Improved algorithm splits the input into blocks, which reduce the time complexity to
O(kn) where k is the block size, and n is the length of src, it generates slightly
longer result than naive BigInt algorithm, but it is worth for the reduced time
complexity. When k is n, it degrades to the naive algorithm, when k is 1, the output
length is twice of the input, we don't want that. 32 is chosen as the block size
in library saltpack/encoding/basex.
Inspired by the java implementation by glowfall,
this library is not BigInt based, it encodes and decodes any arbitrary bytes in O(n)
time complexity. Here is a brief description of the algorithm.
This library uses a variadic length encoding, for each 6 bits, if the value is in range
[0, 62), it can be directly map to the 62 characters, if the value is 62 or 63 which
exceeds 61, we turn to encoding just the lower 5 bits. If we find a way to recognize
the 5 bits pattern, then we can correctly decode it back to the source data.
The binary representation of 62 and 63 is:
62 - 0b_0011_1110
63 - 0b_0011_1111
They have a common mask 0b_0011_1110
, in range [0, 62), there are another two integers
have a similar mask 0b_0001_1110
, 30 and 31, which is:
30 - 0b_0001_1110
31 - 0b_0001_1111
The four integers 30, 31, 62, 63 share a common mask 0b_0001_1110
, while all other
integers in range [0, 64) don't share the mask, i.e. for all other integers, the
expression value & 0b_0001_1110 == 0b_0001_1110
evaluates to false.
This is the key point!
We define a compactMask
as 0b_0001_1110
.
When encoding, for each 6 bits integer x, if x & compactMask == compactMask
is true,
it must be one of 30, 31, 62 or 63, we just encode the lower 5 bits, which are
0b_0001_1110
(30) and 0b_0001_1111
(31), we leave the 6-th bit to next byte.
When decoding, for each encoded byte x, we check x & compactMask == compactMask
, when
it is false, we know that the byte represents 6 bits in the raw data, it is in range
[0, 30) or [32, 62), else it represents 5 bits in the raw data, it is 30 or 31.
That is it, by using variadic length encoding, we successfully limit the value range
to [0, 62), we get very compact result and a simple O(n) time complexity for encoding
and decoding data of arbitrary length.
Compatibility
This library guarantees that it can correctly decode data encoded by itself.
The encoded result is not compatible with BigInt based algorithm,
(e.g. saltpack/encoding/basex, GMP and GnuPG).
The algorithm may be ported to other languages in future (if someone does a porting,
I'm glad to link it here :-) ).
Changelog
v1.1.0 - 2022/1/3
- Refactor the encoding code to be simpler and cleaner, as a bonus, it gives better
performance which is 1.5X~ faster than the old.
- Add a brief description of the algorithm and state the compatibility with other
Base62 implementations.
v1.0.0 - 2021/10/23
First stable release, the package has been used in several small and medium projects.
This release adds new APIs which help to reuse buffers to reduce memory allocation.