# guessenc

Infer HTML encoding from response headers & content. Goes above and beyond the encoding detection done by most HTTP client libraries.
## Basic Usage

The main function exported by `guessenc` is `infer_encoding()`.
```python
>>> import requests
>>> from guessenc import infer_encoding
>>> resp = requests.get("http://www.fatehwatan.ps/page-183525.html")
>>> resp.raise_for_status()
>>> infer_encoding(resp.content, resp.headers)
(<Source.META_HTTP_EQUIV: 2>, 'cp1256')
```
This tells us that the detected encoding is cp1256 and that it was retrieved from an HTML `<meta>` tag with `http-equiv="Content-Type"`.

Detail on the signature of `infer_encoding()`:
```python
def infer_encoding(
    content: Optional[bytes] = None,
    headers: Optional[Mapping[str, str]] = None,
) -> Pair:
    ...
```
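The `Pair` annotation is not expanded in the signature above; given the documented return value, it is presumably a type alias along these lines (a reconstruction, not guessenc's actual source):

```python
from typing import Optional, Tuple

# Presumed alias: a (source, encoding) pair, where the second element is
# the canonical codec name, or None if no encoding was detected.
Pair = Tuple["Source", Optional[str]]
```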
The `content` parameter represents the page HTML, such as `response.content`.

The `headers` parameter represents the HTTP response headers, such as `response.headers`. If provided, this should be a data structure supporting case-insensitive lookup, such as `requests.structures.CaseInsensitiveDict` or `multidict.CIMultiDict`.

Both parameters are optional.
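For example, a charset carried in the `Content-Type` header should be found no matter how the header name is cased. A sketch, with the exact canonical name returned being an assumption:

```python
>>> from requests.structures import CaseInsensitiveDict
>>> headers = CaseInsensitiveDict({"content-type": "text/html; charset=UTF-8"})
>>> infer_encoding(b"", headers)
(<Source.CHARSET_HEADER: 0>, 'utf-8')
```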
The return type is a tuple. The first element of the tuple is a member of the `Source` enum (see Search Process below); the source indicates where the detected encoding comes from. The second element of the tuple is either a `str`, which is the canonical name of the detected encoding, or `None` if no encoding is found.
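In practice, you can unpack the pair and choose your own fallback when nothing is found. A minimal sketch, reusing `resp` from the example above:

```python
source, encoding = infer_encoding(resp.content, resp.headers)
if encoding is not None:
    text = resp.content.decode(encoding)
else:
    # Source.COULD_NOT_DETECT: fall back to a permissive decode.
    text = resp.content.decode("utf-8", errors="replace")
```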
## Where Do Other Libraries Fall Short?
The `requests` library "[follows] RFC 2616 to the letter" in using the HTTP headers to determine the encoding of the response content. This means, among other things, using ISO-8859-1 as a fallback if no charset is given, despite the fact that UTF-8 has long since dwarfed all other encodings in usage on web pages:
```python
# From requests.adapters.HTTPAdapter.build_response():
response.encoding = get_encoding_from_headers(response.headers)
```
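You can observe this fallback directly with `requests.utils.get_encoding_from_headers()` (behavior as of recent `requests` versions):

```python
>>> from requests.utils import get_encoding_from_headers
>>> get_encoding_from_headers({"Content-Type": "text/html"})
'ISO-8859-1'
>>> get_encoding_from_headers({"Content-Type": "text/html; charset=utf-8"})
'utf-8'
```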
If `requests` does not find an HTTP `Content-Type` header at all, it will fall back to detection via `chardet` rather than looking in the HTML tags for meaningful information. There's nothing at all wrong with this; it just means that the `requests` maintainers have chosen to focus on the power of `requests` as an HTTP library, not an HTML library. If you want more fine-grained control over encoding detection, try `infer_encoding()`.
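For the page from the Basic Usage example, whose `Content-Type` header carries no charset, the difference looks roughly like this (assuming the server's response has not changed):

```python
>>> resp = requests.get("http://www.fatehwatan.ps/page-183525.html")
>>> resp.encoding  # requests' RFC 2616 fallback
'ISO-8859-1'
>>> infer_encoding(resp.content, resp.headers)
(<Source.META_HTTP_EQUIV: 2>, 'cp1256')
```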
This is not to single out `requests` either; other libraries do the same dance with encoding detection. `aiohttp`, for instance, checks the `Content-Type` header, or otherwise defaults to UTF-8 without looking anywhere else.
## Search Process

The function `guessenc.infer_encoding()` looks in a handful of places to extract an encoding, in this order, and stops when it finds one:
- In the `charset` value from the `Content-Type` HTTP entity header.
- In the `charset` value from a `<meta charset="xxxx">` HTML tag.
- In the `charset` value from a `<meta>` tag with `http-equiv="Content-Type"`.
- Using the `chardet` library.
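For instance, a page whose only hint is a `<meta charset>` tag should stop at the second step. A sketch, with the exact canonical name returned being an assumption:

```python
>>> html = b'<html><head><meta charset="utf-8"></head><body></body></html>'
>>> infer_encoding(html)
(<Source.META_CHARSET: 1>, 'utf-8')
```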
Each of the above "sources" is signified by a corresponding member of the `Source` enum:
```python
import enum


class Source(enum.Enum):
    """Indicates where our detected encoding came from."""

    CHARSET_HEADER = 0
    META_CHARSET = 1
    META_HTTP_EQUIV = 2
    CHARDET = 3
    COULD_NOT_DETECT = 4
```
If none of the four sources from the list above yields a viable encoding, this is indicated by `Source.COULD_NOT_DETECT`.
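With no content and no headers to inspect, you would presumably see exactly that:

```python
>>> infer_encoding()
(<Source.COULD_NOT_DETECT: 4>, None)
```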