# guessenc

Infer HTML encoding from response headers & content. Goes above and beyond the encoding detection done by most HTTP client libraries.
## Basic Usage

The main function exported by `guessenc` is `infer_encoding()`.
```python
>>> import requests
>>> from guessenc import infer_encoding
>>> resp = requests.get("http://www.fatehwatan.ps/page-183525.html")
>>> resp.raise_for_status()
>>> infer_encoding(resp.content, resp.headers)
(<Source.META_HTTP_EQUIV: 2>, 'cp1256')
```
This tells us that the detected encoding is cp1256 and that it was retrieved from an HTML `<meta>` tag with `http-equiv="Content-Type"`.

Detail on the signature of `infer_encoding()`:
```python
def infer_encoding(
    content: Optional[bytes] = None,
    headers: Optional[Mapping[str, str]] = None,
) -> Pair:
    ...
```
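The `Pair` annotation is not expanded in the signature above; given the documented return value, it is presumably a type alias along these lines (a reconstruction, not guessenc's actual source):

```python
from typing import Optional, Tuple

# Presumed alias: a (source, encoding) pair, where the second element is
# the canonical codec name, or None if no encoding was detected.
Pair = Tuple["Source", Optional[str]]
```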
The `content` parameter represents the page HTML, such as `response.content`.

The `headers` parameter represents the HTTP response headers, such as `response.headers`. If provided, this should be a data structure supporting case-insensitive lookup, such as `requests.structures.CaseInsensitiveDict` or `multidict.CIMultiDict`.

Both parameters are optional.
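For example, a charset carried in the `Content-Type` header should be found no matter how the header name is cased. A sketch, with the exact canonical name returned being an assumption:

```python
>>> from requests.structures import CaseInsensitiveDict
>>> headers = CaseInsensitiveDict({"content-type": "text/html; charset=UTF-8"})
>>> infer_encoding(b"", headers)
(<Source.CHARSET_HEADER: 0>, 'utf-8')
```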
The return type is a tuple. The first element of the tuple is a member of the `Source` enum (see Search Process below); the source indicates where the detected encoding comes from. The second element of the tuple is either a `str`, which is the canonical name of the detected encoding, or `None` if no encoding is found.
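In practice, you can unpack the pair and choose your own fallback when nothing is found. A minimal sketch, reusing `resp` from the example above:

```python
source, encoding = infer_encoding(resp.content, resp.headers)
if encoding is not None:
    text = resp.content.decode(encoding)
else:
    # Source.COULD_NOT_DETECT: fall back to a permissive decode.
    text = resp.content.decode("utf-8", errors="replace")
```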
## Where Do Other Libraries Fall Short?
The `requests` library "[follows] RFC 2616 to the letter" in using the HTTP headers to determine the encoding of the response content. This means, among other things, using ISO-8859-1 as a fallback if no charset is given, despite the fact that UTF-8 has long since dwarfed all other encodings in usage on web pages:
```python
# From requests.adapters.HTTPAdapter.build_response():
response.encoding = get_encoding_from_headers(response.headers)
```
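You can observe this fallback directly with `requests.utils.get_encoding_from_headers()` (behavior as of recent `requests` versions):

```python
>>> from requests.utils import get_encoding_from_headers
>>> get_encoding_from_headers({"Content-Type": "text/html"})
'ISO-8859-1'
>>> get_encoding_from_headers({"Content-Type": "text/html; charset=utf-8"})
'utf-8'
```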
If `requests` does not find an HTTP `Content-Type` header at all, it will fall back to detection via `chardet` rather than looking in the HTML tags for meaningful information. There's nothing at all wrong with this; it just means that the `requests` maintainers have chosen to focus on the power of `requests` as an HTTP library, not an HTML library. If you want more fine-grained control over encoding detection, try `infer_encoding()`.
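For the page from the Basic Usage example, whose `Content-Type` header carries no charset, the difference looks roughly like this (assuming the server's response has not changed):

```python
>>> resp = requests.get("http://www.fatehwatan.ps/page-183525.html")
>>> resp.encoding  # requests' RFC 2616 fallback
'ISO-8859-1'
>>> infer_encoding(resp.content, resp.headers)
(<Source.META_HTTP_EQUIV: 2>, 'cp1256')
```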
This is not to single out `requests` either; other libraries do the same dance with encoding detection. `aiohttp`, for instance, checks the `Content-Type` header, or otherwise defaults to UTF-8 without looking anywhere else.
## Search Process

The function `guessenc.infer_encoding()` looks in a handful of places to extract an encoding, in this order, and stops when it finds one:
- In the `charset` value from the `Content-Type` HTTP entity header.
- In the `charset` value from a `<meta charset="xxxx">` HTML tag.
- In the `charset` value from a `<meta>` tag with `http-equiv="Content-Type"`.
- Using the `chardet` library.
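For instance, a page whose only hint is a `<meta charset>` tag should stop at the second step. A sketch, with the exact canonical name returned being an assumption:

```python
>>> html = b'<html><head><meta charset="utf-8"></head><body></body></html>'
>>> infer_encoding(html)
(<Source.META_CHARSET: 1>, 'utf-8')
```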
Each of the above "sources" is signified by a corresponding member of the `Source` enum:
```python
import enum


class Source(enum.Enum):
    """Indicates where our detected encoding came from."""

    CHARSET_HEADER = 0
    META_CHARSET = 1
    META_HTTP_EQUIV = 2
    CHARDET = 3
    COULD_NOT_DETECT = 4
```
If none of the four sources from the list above yields a viable encoding, this is indicated by `Source.COULD_NOT_DETECT`.
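With no content and no headers to inspect, you would presumably see exactly that:

```python
>>> infer_encoding()
(<Source.COULD_NOT_DETECT: 4>, None)
```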