Infer HTML encoding from response headers & content. Goes above and beyond the encoding detection done by most HTTP client libraries.
The main function exported by guessenc is infer_encoding().
>>> import requests
>>> from guessenc import infer_encoding
>>> resp = requests.get("http://www.fatehwatan.ps/page-183525.html")
>>> resp.raise_for_status()
>>> infer_encoding(resp.content, resp.headers)
(<Source.META_HTTP_EQUIV: 2>, 'cp1256')
This tells us that the detected encoding is cp1256, and that it was retrieved from an HTML <meta> tag with http-equiv="Content-Type".
Detail on the signature of infer_encoding():
def infer_encoding(
    content: Optional[bytes] = None,
    headers: Optional[Mapping[str, str]] = None
) -> Pair:
    ...
The content represents the page HTML, such as response.content.
The headers represents the HTTP response headers, such as response.headers. If provided, this should be a data structure supporting a case-insensitive lookup, such as requests.structures.CaseInsensitiveDict or multidict.CIMultiDict.
Both parameters are optional.
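To illustrate why a case-insensitive lookup matters for the headers argument, here is a small sketch using only the standard library. It uses email.message.Message purely as a convenient stand-in for a case-insensitive container; whether guessenc accepts that particular type is not stated here, so treat it as an illustration of the lookup behavior only.

```python
from email.message import Message

# A plain dict is case-sensitive: a server that sends "content-type"
# in lowercase is missed by a lookup for "Content-Type".
plain = {"content-type": "text/html; charset=utf-8"}
missed = plain.get("Content-Type")

# email.message.Message matches header names case-insensitively,
# which is the behavior the headers argument is expected to have.
msg = Message()
msg["content-type"] = "text/html; charset=utf-8"
found = msg.get("Content-Type")

print(missed)
print(found)
```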
The return type is a tuple.
The first element of the tuple is a member of the Source enum (see Search Process below). The source indicates where the detected encoding comes from.
The second element of the tuple is either a str, which is the canonical name of the detected encoding, or None if no encoding is found.
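Because the second element may be None, calling code needs a fallback before decoding. The helper below is a hypothetical sketch (decode_page and its default argument are not part of guessenc) showing one way to consume that value.

```python
from typing import Optional


def decode_page(content: bytes, encoding: Optional[str], default: str = "utf-8") -> str:
    """Decode page bytes with the detected encoding, or a default if None.

    Hypothetical helper: `encoding` is meant to be the second element of
    the tuple described above. Undecodable bytes are replaced, not raised.
    """
    return content.decode(encoding or default, errors="replace")


# Bytes as they might arrive from a cp1256-encoded page.
sample = "سلام".encode("cp1256")
text = decode_page(sample, "cp1256")

# With no detected encoding, the helper falls back to UTF-8.
fallback_text = decode_page(b"plain ascii", None)
```

With guessenc installed, the encoding argument would come from unpacking the result, for example `_, encoding = infer_encoding(resp.content, resp.headers)`.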
The requests library "[follows] RFC 2616 to the letter" in using the HTTP headers to determine the encoding of the response content. This means, among other things, using ISO-8859-1 as a fallback if no charset is given, even though UTF-8 is by far the most widely used encoding on web pages.
# requests/adapters.py
response.encoding = get_encoding_from_headers(response.headers)
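The practical cost of that fallback can be sketched without requests at all. The function below is a simplified stand-in for the header-only behavior described above, not the actual requests implementation; header_only_encoding is a hypothetical name.

```python
from email.message import Message
from typing import Optional


def header_only_encoding(content_type: Optional[str]) -> Optional[str]:
    """Simplified header-only detection (an assumption, not requests' code).

    Returns the charset parameter if present, ISO-8859-1 for text types
    without a charset, and None otherwise.
    """
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


# A UTF-8 page served with "Content-Type: text/html" and no charset.
raw = "café".encode("utf-8")
wrong = raw.decode(header_only_encoding("text/html"))  # mojibake
right = raw.decode("utf-8")
```

Decoding with the fallback never raises, because every byte is valid ISO-8859-1, so the corruption is silent.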
If requests does not find an HTTP Content-Type header at all, it will fall back to detection via chardet rather than looking in the HTML tags for meaningful information. There's nothing at all wrong with this; it just means that the requests maintainers have chosen to focus on the power of requests as an HTTP library, not an HTML library. If you want more fine-grained control over encoding detection, try infer_encoding().
This is not to single out requests, either. Other libraries do the same dance with encoding detection: aiohttp checks the Content-Type header and otherwise defaults to UTF-8 without looking anywhere else.
The function guessenc.infer_encoding() looks in a handful of places to extract an encoding, in this order, and stops when it finds one:
1. The charset value from the Content-Type HTTP entity header.
2. The charset value from a <meta charset="xxxx"> HTML tag.
3. The charset value from a <meta> tag with http-equiv="Content-Type".
4. The chardet library.
Each of the above "sources" is signified by a corresponding member of the Source enum:
class Source(enum.Enum):
    """Indicates where our detected encoding came from."""
    CHARSET_HEADER = 0
    META_CHARSET = 1
    META_HTTP_EQUIV = 2
    CHARDET = 3
    COULD_NOT_DETECT = 4
If none of the four sources from the list above returns a viable encoding, this is indicated by Source.COULD_NOT_DETECT.
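The search order above can be sketched with the standard library alone. This is an illustrative approximation under stated assumptions, not guessenc's actual implementation: infer_encoding_sketch, canonical, charset_param, and MetaScanner are hypothetical names, the Source enum is copied locally so the block runs without guessenc, and the last step is a pluggable fallback callable rather than a hard dependency on chardet.

```python
import codecs
import enum
from email.message import Message
from html.parser import HTMLParser
from typing import Callable, Mapping, Optional, Tuple


class Source(enum.Enum):
    CHARSET_HEADER = 0
    META_CHARSET = 1
    META_HTTP_EQUIV = 2
    CHARDET = 3
    COULD_NOT_DETECT = 4


def canonical(name: Optional[str]) -> Optional[str]:
    """Return Python's canonical codec name, or None if unknown."""
    if not name:
        return None
    try:
        return codecs.lookup(name.strip()).name
    except (LookupError, ValueError):
        return None


def charset_param(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


class MetaScanner(HTMLParser):
    """Record the first charset found in each kind of <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.meta_charset: Optional[str] = None
        self.http_equiv: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if self.meta_charset is None and a.get("charset"):
            self.meta_charset = a["charset"]
        equiv = (a.get("http-equiv") or "").lower()
        if self.http_equiv is None and equiv == "content-type":
            self.http_equiv = charset_param(a.get("content"))


def infer_encoding_sketch(
    content: Optional[bytes] = None,
    headers: Optional[Mapping[str, str]] = None,
    fallback: Optional[Callable[[bytes], Optional[str]]] = None,
) -> Tuple[Source, Optional[str]]:
    # Step 1: charset from the Content-Type HTTP header.
    if headers:
        enc = canonical(charset_param(headers.get("Content-Type")))
        if enc:
            return Source.CHARSET_HEADER, enc
    if content:
        # latin-1 maps every byte to a character, so the ASCII markup
        # can be scanned before the real encoding is known.
        scanner = MetaScanner()
        scanner.feed(content.decode("latin-1"))
        scanner.close()
        # Step 2: <meta charset="...">.
        enc = canonical(scanner.meta_charset)
        if enc:
            return Source.META_CHARSET, enc
        # Step 3: <meta http-equiv="Content-Type" content="...">.
        enc = canonical(scanner.http_equiv)
        if enc:
            return Source.META_HTTP_EQUIV, enc
        # Step 4: statistical detection, supplied by the caller.
        if fallback is not None:
            enc = canonical(fallback(content))
            if enc:
                return Source.CHARDET, enc
    return Source.COULD_NOT_DETECT, None
```

If chardet is installed, a caller could pass something like `lambda data: chardet.detect(data)["encoding"]` as the fallback. Normalizing names through codecs.lookup() is one way to obtain a canonical name such as cp1256 from an alias such as windows-1256.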