What is jschardet?
jschardet is a JavaScript library for character encoding detection. It is a port of the Python chardet library and is used to detect the character encoding of a given text. This can be particularly useful when dealing with text data from various sources where the encoding is unknown.
What are jschardet's main functionalities?
Detect Character Encoding
This feature allows you to detect the character encoding of a given text. The `detect` method returns an object with the encoding and confidence level.
const jschardet = require('jschardet');
const text = 'Some text with unknown encoding';
const result = jschardet.detect(text);
console.log(result);
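Once an encoding has been detected, the usual next step is to decode the original bytes with it. The following is a minimal sketch of that step, not part of jschardet itself: `decodeWithDetection` and its `fallback` parameter are hypothetical names, and it assumes a runtime that provides the standard `TextDecoder` (recent Node.js versions and browsers do). Some encoding names a detector reports may not be labels that `TextDecoder` recognises, so the sketch falls back to UTF-8 in that case.

```javascript
// Decode raw bytes using a detection result shaped like { encoding, confidence }.
// `decodeWithDetection` and `fallback` are hypothetical names for this sketch.
function decodeWithDetection(bytes, detection, fallback = 'utf-8') {
  const label = detection && detection.encoding ? detection.encoding : fallback;
  let decoder;
  try {
    // TextDecoder throws a RangeError for labels it does not recognise.
    decoder = new TextDecoder(label);
  } catch (err) {
    decoder = new TextDecoder(fallback);
  }
  return decoder.decode(bytes);
}

// Usage, assuming jschardet is installed and `bytes` is a Node.js Buffer:
//   const result = jschardet.detect(bytes.toString('latin1'));
//   const text = decodeWithDetection(bytes, result);
```

Passing the detection result in as an argument keeps the decoding step independent of how the detection was done.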
Other packages similar to jschardet
chardet
chardet is a character encoding detection library for Node.js that offers functionality similar to jschardet's. Like jschardet, it is implemented in JavaScript (its source is TypeScript) rather than as a native addon. Its detection approach is based on ICU's charset detector rather than on Python's chardet, so the two libraries may rank candidate encodings differently for the same input.
iconv-lite
iconv-lite is a character encoding conversion library for Node.js. Its focus is converting text from one encoding to another, and its detection support is limited (for example, working out UTF-16 byte order when decoding). It covers conversion, which jschardet does not do, so the two are often used together: detect with jschardet, then decode with iconv-lite.
node-icu-charset-detector
node-icu-charset-detector is a Node.js binding for the charset detection functionality of the ICU (International Components for Unicode) library. Because it relies on a native library, it is heavier to install and deploy than jschardet, which is plain JavaScript, but it offers high accuracy.
JsChardet
Port of Python's chardet (https://github.com/chardet/chardet).
License
LGPL
How To Use It
Node
npm install jschardet
var jschardet = require("jschardet")
// "àíàçã" in UTF-8
jschardet.detect("\xc3\xa0\xc3\xad\xc3\xa0\xc3\xa7\xc3\xa3")
// { encoding: "UTF-8", confidence: 0.9690625 }
// "次常用國字標準字體表" in Big5
jschardet.detect("\xa6\xb8\xb1\x60\xa5\xce\xb0\xea\xa6\x72\xbc\xd0\xb7\xc7\xa6\x72\xc5\xe9\xaa\xed")
// { encoding: "Big5", confidence: 0.99 }
// "<string>Martin Kühl</string>" in windows-1252
jschardet.detectAll("\x3c\x73\x74\x72\x69\x6e\x67\x3e\x4d\x61\x72\x74\x69\x6e\x20\x4b\xfc\x68\x6c\x3c\x2f\x73\x74\x72\x69\x6e\x67\x3e")
// [
// {encoding: "windows-1252", confidence: 0.95},
// {encoding: "ISO-8859-2", confidence: 0.8796300205763055},
// {encoding: "SHIFT_JIS", confidence: 0.01}
// ]
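The escaped strings in the examples above are byte strings: each character stands for one byte of the encoded text. When the input comes from a file or a socket it usually arrives as a Buffer instead, so here is a small sketch of producing the same kind of string in Node. `toByteString` is a hypothetical helper name, not something jschardet exports.

```javascript
// Turn raw bytes into a byte string: one character per byte, each with a
// code point in the range 0x00-0xFF. `toByteString` is a hypothetical name.
function toByteString(bytes) {
  // 'latin1' maps every byte to the character with the same code point.
  return Buffer.from(bytes).toString('latin1');
}

// "àíàçã" (written with escapes) encoded as UTF-8 gives the same string
// that the first example passes to detect().
const utf8Bytes = Buffer.from('\u00e0\u00ed\u00e0\u00e7\u00e3', 'utf8');
console.log(toByteString(utf8Bytes) === '\xc3\xa0\xc3\xad\xc3\xa0\xc3\xa7\xc3\xa3'); // true
```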
Browser
Copy and include jschardet.min.js in your web page.
This library is also available on cdnjs at https://cdnjs.cloudflare.com/ajax/libs/jschardet/1.4.1/jschardet.min.js
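In a browser there is no Buffer, so bytes read from a file or a fetch response need to be turned into a byte string some other way. The sketch below shows one approach using only typed arrays. `bytesToByteString` and `chunkSize` are hypothetical names, and the commented usage assumes that the included script exposes a global called `jschardet`, which is not stated above.

```javascript
// Browser-friendly sketch: build a byte string from a Uint8Array without Buffer.
// `bytesToByteString` and `chunkSize` are hypothetical names.
function bytesToByteString(bytes, chunkSize = 0x8000) {
  let out = '';
  for (let i = 0; i < bytes.length; i += chunkSize) {
    // Convert in chunks so large inputs stay under the argument-count
    // limit of String.fromCharCode.apply.
    out += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
  }
  return out;
}

// Usage in a page, assuming a global `jschardet` and a hypothetical
// <input type="file"> element referenced by `fileInput`:
//   const buf = await fileInput.files[0].arrayBuffer();
//   const result = jschardet.detect(bytesToByteString(new Uint8Array(buf)));
```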
Options
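// Log the confidence levels of each candidate encoding; useful when the result is unexpected.
jschardet.enableDebug();
// Lower the minimum accepted confidence; 0 means a guess is always returned.
jschardet.detect(str, { minimumThreshold: 0 });
// Restrict detection to the listed encodings.
jschardet.detect(str, { detectEncodings: ["UTF-8", "windows-1252"] });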
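A similar policy can also be applied after detection, to an array shaped like the `detectAll` output shown earlier. This is only a sketch of that idea and not how jschardet applies its options internally. `pickEncoding`, `allowed`, and `minConfidence` are hypothetical names, and the 0.2 default is this sketch's own choice.

```javascript
// Pick the most confident candidate from a detectAll-style array, keeping only
// encodings on an allow-list and at or above a minimum confidence.
// `pickEncoding`, `allowed`, and `minConfidence` are hypothetical names.
function pickEncoding(candidates, { allowed = null, minConfidence = 0.2 } = {}) {
  const allowedSet = allowed && new Set(allowed.map((name) => name.toLowerCase()));
  const usable = candidates.filter((c) =>
    c.encoding &&
    c.confidence >= minConfidence &&
    (!allowedSet || allowedSet.has(c.encoding.toLowerCase()))
  );
  // filter() returned a new array, so sorting it leaves the input untouched.
  usable.sort((a, b) => b.confidence - a.confidence);
  return usable.length > 0 ? usable[0] : null;
}

const candidates = [
  { encoding: 'windows-1252', confidence: 0.95 },
  { encoding: 'ISO-8859-2', confidence: 0.8796300205763055 },
  { encoding: 'SHIFT_JIS', confidence: 0.01 },
];
console.log(pickEncoding(candidates, { allowed: ['ISO-8859-2', 'SHIFT_JIS'] }));
// { encoding: 'ISO-8859-2', confidence: 0.8796300205763055 }
```

Returning `null` when nothing qualifies leaves the choice of fallback encoding to the caller.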
Supported Charsets
- Big5, GB2312/GB18030, EUC-TW, HZ-GB-2312, and ISO-2022-CN (Traditional and Simplified Chinese)
- EUC-JP, SHIFT_JIS, and ISO-2022-JP (Japanese)
- EUC-KR and ISO-2022-KR (Korean)
- KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, and windows-1251 (Russian)
- ISO-8859-2 and windows-1250 (Hungarian)
- ISO-8859-5 and windows-1251 (Bulgarian)
- windows-1252
- ISO-8859-7 and windows-1253 (Greek)
- ISO-8859-8 and windows-1255 (Visual and Logical Hebrew)
- TIS-620 (Thai)
- UTF-32 BE, LE, 3412-ordered, or 2143-ordered (with a BOM)
- UTF-16 BE or LE (with a BOM)
- UTF-8 (with or without a BOM)
- ASCII
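Several entries in the list above depend on a byte order mark (BOM) at the start of the input. To illustrate what that means, here is a sketch of BOM sniffing based on the standard Unicode byte sequences. It is not jschardet's own code: `sniffBom` and `BOMS` are hypothetical names, and the labels it returns are this sketch's own and may differ from the names the library reports.

```javascript
// Illustrative BOM sniffing; not jschardet's own code. The labels are this
// sketch's own and may differ from the names the library reports.
const BOMS = [
  // Four-byte marks come first because FF FE is a prefix of FF FE 00 00.
  { label: 'UTF-32LE', bytes: [0xff, 0xfe, 0x00, 0x00] },
  { label: 'UTF-32BE', bytes: [0x00, 0x00, 0xfe, 0xff] },
  { label: 'UTF-32 (3412-ordered)', bytes: [0xfe, 0xff, 0x00, 0x00] },
  { label: 'UTF-32 (2143-ordered)', bytes: [0x00, 0x00, 0xff, 0xfe] },
  { label: 'UTF-8', bytes: [0xef, 0xbb, 0xbf] },
  { label: 'UTF-16LE', bytes: [0xff, 0xfe] },
  { label: 'UTF-16BE', bytes: [0xfe, 0xff] },
];

// Return the label of the first BOM that matches the start of `bytes`
// (an array or Uint8Array), or null when no BOM is present.
function sniffBom(bytes) {
  const match = BOMS.find((bom) => bom.bytes.every((b, i) => bytes[i] === b));
  return match ? match.label : null;
}
```

Without a BOM, the UTF-16 and UTF-32 variants cannot be told apart this way, which is why the list marks them as "with a BOM".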
Technical Information
I haven't been able to create tests to correctly detect:
- ISO-2022-CN
- windows-1250 in Hungarian
- windows-1251 in Bulgarian
- windows-1253 in Greek
- EUC-CN
Development
Use `npm run dist` to update the distribution files. They're available at https://github.com/aadsm/jschardet/tree/master/dist.
Authors
Ported from Python to JavaScript by António Afonso (https://github.com/aadsm/jschardet)
Transformed into an npm package by Markus Ast (https://github.com/brainafk)