What is chardet?
The chardet npm package is a character encoding detector library, which allows you to determine the encoding of a given piece of text or a file. It is based on the character detection component of the ICU (International Components for Unicode) project and can be useful when dealing with text data that does not have encoding information.
What are chardet's main functionalities?
Detecting encoding of a text buffer
This code reads a file and uses chardet to detect the encoding of its content. The 'detect' function takes a buffer and returns the name of the encoding it believes the text is in.
const chardet = require('chardet');
const fs = require('fs');
fs.readFile('/path/to/file', (err, data) => {
  if (err) throw err;
  const encoding = chardet.detect(data);
  console.log(encoding);
});
Detecting encoding with confidence
This code creates a buffer from a string and uses chardet's 'analyse' function to get an array of candidate encodings with their confidence scores, sorted from most to least likely.
const chardet = require('chardet');
const buffer = Buffer.from('Some text with unknown encoding');
// analyse() returns every candidate; detect() returns only the top one
const result = chardet.analyse(buffer);
console.log(result);
Detecting encoding of a file stream
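This code creates a read stream from a file, collects its chunks, and passes the combined buffer to 'detect' once the stream ends. The README below documents no stream-specific method, so the stream is buffered first.
const chardet = require('chardet');
const fs = require('fs');
const stream = fs.createReadStream('/path/to/file');
const chunks = [];
stream.on('data', (chunk) => chunks.push(chunk));
stream.on('error', (err) => { throw err; });
stream.on('end', () => {
  // Detect once all chunks have been collected
  const encoding = chardet.detect(Buffer.concat(chunks));
  console.log(encoding);
});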
Other packages similar to chardet
iconv-lite
iconv-lite is a character encoding conversion library. Unlike chardet, which detects the encoding, iconv-lite is used to convert from one encoding to another. It supports many encodings and is often used in conjunction with chardet to first detect the encoding and then convert the text.
jschardet
jschardet is a port of the Python library chardet. It serves the same purpose as the chardet npm package: detecting the character encoding of text. The two libraries may differ in implementation details and in the specific encodings each supports.
encoding
The encoding npm package is another library for encoding and decoding text. It provides a simpler API for converting between encodings but does not have the detection capabilities of chardet. It's often used when the encoding is already known.
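The detect-then-convert workflow described under iconv-lite can be sketched without extra dependencies. The example below is a minimal sketch, not part of chardet or iconv-lite: it assumes the detected name is a label that the built-in TextDecoder accepts (for example UTF-8) and falls back to UTF-8 otherwise. The function name decodeDetected is hypothetical.

```javascript
// Hypothetical helper: decode bytes using an encoding name such as the one
// returned by chardet.detect(). Falls back to UTF-8 when the name is missing
// or is not a label that TextDecoder recognizes.
function decodeDetected(bytes, encodingName) {
  let decoder;
  try {
    decoder = new TextDecoder(encodingName || 'utf-8');
  } catch (err) {
    // TextDecoder throws on labels it does not support
    decoder = new TextDecoder('utf-8');
  }
  return decoder.decode(bytes);
}

// Usage with chardet (assumes the package is installed):
//   const text = decodeDetected(data, chardet.detect(data));
const sampleBytes = new TextEncoder().encode('héllo');
console.log(decodeDetected(sampleBytes, 'UTF-8'));
```

For encodings that TextDecoder does not handle in a given runtime, a conversion library such as iconv-lite is the usual choice.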
chardet
Chardet is a character encoding detection module written in pure JavaScript (TypeScript). The module uses occurrence analysis to determine the most probable encoding.
- Packed size is only 22 KB
- Works in all environments: Node / Browser / Native
- Works on all platforms: Linux / Mac / Windows
- No dependencies
- No native code / bindings
- 100% written in TypeScript
- Extensive code coverage
Installation
npm i chardet
Usage
To return the encoding with the highest confidence:
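import chardet from 'chardet';
// From an in-memory buffer
const fromBuffer = chardet.detect(Buffer.from('hello there!'));
// Or from a file, asynchronously (top-level await requires an ES module)
const fromFile = await chardet.detectFile('/path/to/file');
// Or from a file, synchronously
const fromFileSync = chardet.detectFileSync('/path/to/file');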
To return the full list of possible encodings, use the analyse method:
import chardet from 'chardet';
chardet.analyse(Buffer.from('hello there!'));
The returned value is an array of objects sorted by confidence in descending order:
[
{ confidence: 90, name: 'UTF-8' },
{ confidence: 20, name: 'windows-1252', lang: 'fr' },
];
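Because the result is a ranked list, callers can apply their own cut-off instead of trusting only the top entry. The following is a minimal sketch that works on an array of the shape shown above; confidentEncodings and the default threshold of 50 are hypothetical choices, not part of chardet.

```javascript
// Hypothetical helper: keep only candidates at or above a confidence
// threshold. `candidates` has the shape shown above:
// [{ confidence, name, lang? }, ...], sorted by confidence descending.
function confidentEncodings(candidates, minConfidence = 50) {
  return candidates
    .filter((candidate) => candidate.confidence >= minConfidence)
    .map((candidate) => candidate.name);
}

// Sample data copied from the output above; in real use it would come from
// chardet.analyse(buffer).
const candidates = [
  { confidence: 90, name: 'UTF-8' },
  { confidence: 20, name: 'windows-1252', lang: 'fr' },
];
console.log(confidentEncodings(candidates)); // [ 'UTF-8' ]
```

An empty result signals that no candidate cleared the threshold, which a caller might treat as "encoding unknown".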
In the browser, you can use a Uint8Array instead of a Buffer:
import chardet from 'chardet';
chardet.analyse(new Uint8Array([0x68, 0x65, 0x6c, 0x6c, 0x6f]));
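Browser APIs such as File.arrayBuffer() and Response.arrayBuffer() yield an ArrayBuffer rather than a Uint8Array, so a small conversion step is often needed first. The sketch below shows one way to do that; toBytes is a hypothetical helper, not a chardet function.

```javascript
// Hypothetical helper: normalize common binary inputs to a Uint8Array,
// the type the example above passes to chardet.analyse().
function toBytes(input) {
  if (input instanceof Uint8Array) return input; // includes Node's Buffer
  if (input instanceof ArrayBuffer) return new Uint8Array(input);
  if (ArrayBuffer.isView(input)) {
    // Other typed arrays and DataView: view the same underlying bytes
    return new Uint8Array(input.buffer, input.byteOffset, input.byteLength);
  }
  throw new TypeError('Expected an ArrayBuffer or a typed array');
}

// The bytes from the example above spell "hello" in ASCII.
const helloBytes = toBytes(new Uint8Array([0x68, 0x65, 0x6c, 0x6c, 0x6f]).buffer);
console.log(new TextDecoder().decode(helloBytes));
// Usage with chardet (assumes the package is installed):
//   chardet.analyse(toBytes(await file.arrayBuffer()));
```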
Working with large data sets
When the data set is huge and you want to optimize performance (at the cost of some accuracy), you can sample only the first N bytes of the buffer:
chardet
.detectFile('/path/to/file', { sampleSize: 32 })
.then((encoding) => console.log(encoding));
You can also specify where to begin reading from in the buffer:
chardet
.detectFile('/path/to/file', { sampleSize: 32, offset: 128 })
.then((encoding) => console.log(encoding));
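To make the two options concrete, the sketch below shows the byte window they appear to describe, applied to an in-memory buffer. It assumes sampleSize counts bytes starting at offset; sampleWindow is a hypothetical helper, not a chardet function, and chardet's own file reading may differ in detail.

```javascript
// Hypothetical helper: return the slice of `bytes` that starts at `offset`
// and spans at most `sampleSize` bytes (assumed semantics of the options).
function sampleWindow(bytes, { sampleSize = bytes.length, offset = 0 } = {}) {
  const start = Math.min(offset, bytes.length);
  const end = Math.min(start + sampleSize, bytes.length);
  return bytes.subarray(start, end); // a view; no bytes are copied
}

// 256 bytes holding the values 0..255, so each value equals its index.
const allBytes = new Uint8Array(256).map((_, i) => i);
const windowed = sampleWindow(allBytes, { sampleSize: 32, offset: 128 });
console.log(windowed.length, windowed[0], windowed[windowed.length - 1]); // 32 128 159
// Usage with chardet (assumes the package is installed):
//   chardet.detect(sampleWindow(allBytes, { sampleSize: 32, offset: 128 }));
```

A smaller window means less work per detection but also less evidence, which is the accuracy tradeoff mentioned above.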
Supported Encodings:
- UTF-8
- UTF-16 LE
- UTF-16 BE
- UTF-32 LE
- UTF-32 BE
- ISO-2022-JP
- ISO-2022-KR
- ISO-2022-CN
- Shift_JIS
- Big5
- EUC-JP
- EUC-KR
- GB18030
- ISO-8859-1
- ISO-8859-2
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- windows-1250
- windows-1251
- windows-1252
- windows-1253
- windows-1254
- windows-1255
- windows-1256
- KOI8-R
Currently only these encodings are supported.
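Since the list is closed, it can be useful to check an encoding name against it before relying on detection, for example when validating a user-supplied hint. The following is a minimal sketch: the names are copied from the list above, isListedEncoding is a hypothetical helper, and the normalization (ignoring case, spaces, hyphens, and underscores) is an assumption made so that variants such as UTF-16LE still match.

```javascript
// Names copied from the supported-encodings list above.
const SUPPORTED_ENCODINGS = [
  'UTF-8', 'UTF-16 LE', 'UTF-16 BE', 'UTF-32 LE', 'UTF-32 BE',
  'ISO-2022-JP', 'ISO-2022-KR', 'ISO-2022-CN',
  'Shift_JIS', 'Big5', 'EUC-JP', 'EUC-KR', 'GB18030',
  'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-5', 'ISO-8859-6',
  'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9',
  'windows-1250', 'windows-1251', 'windows-1252', 'windows-1253',
  'windows-1254', 'windows-1255', 'windows-1256',
  'KOI8-R',
];

// Assumed normalization: ignore case, whitespace, hyphens, and underscores.
const normalizeName = (name) => name.toLowerCase().replace(/[\s_-]/g, '');
const SUPPORTED_SET = new Set(SUPPORTED_ENCODINGS.map(normalizeName));

// Hypothetical helper: true when `name` matches an entry in the list above.
function isListedEncoding(name) {
  return typeof name === 'string' && SUPPORTED_SET.has(normalizeName(name));
}

console.log(isListedEncoding('utf-16le'));   // true
console.log(isListedEncoding('ISO-8859-3')); // false
```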
TypeScript?
Yes. Type definitions are included.