Security News
Research
Data Theft Repackaged: A Case Study in Malicious Wrapper Packages on npm
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
nokogiri-html5-inference
Advanced tools
Given HTML5 input, make a reasonable guess at how to parse it correctly.
Nokogiri::HTML5::Inference
makes reasonable inferences that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a web developer might need in a view library.
This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
The HTML5 Spec defines some very precise context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML without knowing the parent node -- also called the "context node" -- in which it will be inserted.
Most content in an HTML5 document can be parsed assuming the parser's mode will be in the "in body" insertion mode, but there are some notable exceptions. Perhaps the most problematic to web developers are the table-related tags, which will not be parsed properly unless the parser is in the "in table" insertion mode.
For example:
Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
# => "foo" # where did the tag go!?
In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here", and drop the tag. This particular fragment must be parsed "in the context" of a table in order to parse properly.
Thankfully, libgumbo and Nokogiri allow us to set the context node:
Nokogiri::HTML5::DocumentFragment.new(
Nokogiri::HTML5::Document.new,
"<td>foo</td>",
"table" # <--- this is the context node
).to_html
# => "<tbody><tr><td>foo</td></tr></tbody>"
This result is almost correct, but we're seeing another HTML5 parsing rule in action: there may be intermediate parent tags that the HTML5 spec requires to be inserted by the parser. In this case, the <td>
tag must be wrapped in <tbody><tr>
tags.
We can fix this to only return the tags we provided by using the <template>
tag as the context node, which the HTML5 spec provides exactly for this purpose:
Nokogiri::HTML5::DocumentFragment.new(
Nokogiri::HTML5::Document.new,
"<td>foo</td>",
"template" # <--- this is the context node
).to_html
# => "<td>foo</td>"
Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse
does:
Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
# => "<td>foo</td>"
Given an input String containing HTML5, infer the best way to parse it by calling Nokogiri::HTML5::Inference.parse
.
If the input is a document, you'll get a Nokogiri::HTML5::Document
back:
html = <<~HTML
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
<h1>Hello, world!</h1>
</body>
</html>
HTML
Nokogiri::HTML5::Inference.parse(html)
# => #(Document:0x1f04 {
# name = "document",
# children = [
# #(DTD:0x2030 { name = "html" }),
# #(Element:0x2134 {
# name = "html",
# attribute_nodes = [ #(Attr:0x2260 { name = "lang", value = "en" })],
# ...
# #(Element:0x2a44 {
# name = "body",
# children = [
# #(Text "\n "),
# #(Element:0x2bd4 { name = "h1", children = [ #(Text "Hello, world!")] }),
# #(Text "\n \n\n")]
# })]
# })]
# })
If the input is a fragment, you'll get back a Nokogiri::XML::NodeSet
:
Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
# => [
# #<Nokogiri::XML::Element:0x4074 name="tr"
# children=[
# #<Nokogiri::XML::Element:0x4038 name="td" children=[#<Nokogiri::XML::Text:0x4024 "hello">]>,
# #<Nokogiri::XML::Element:0x4060 name="td" children=[#<Nokogiri::XML::Text:0x404c "world!">]>
# ]>
# ]
Both of these return types respond to the same query methods like #css
and #xpath
, tree-traversal methods like #children
, and serialization methods like #to_html
.
The implementation is currently pretty hacky and only looks at the first tag in the input to make decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
The implementation also is almost certainly incomplete, meaning there are HTML5 tags that aren't handled by this library as you might expect.
This implementation is probably OK for handling untrusted content, but it's still new and I haven't really thought very hard about it yet. If you want to use it on untrusted content, open an issue and talk with us about your use case so we can help keep you secure!
We would welcome bug reports and pull requests improving this library!
Install the gem and add to the application's Gemfile by executing:
bundle add nokgiri-html5-inference
If bundler is not being used to manage dependencies, install the gem by executing:
gem install nokgiri-html5-inference
After checking out the repo, run bin/setup
to install dependencies. Then, run rake test
to run the tests. You can also run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
, which will create a git tag for the version, push git commits and the created tag, and push the .gem
file to rubygems.org.
Bug reports and pull requests are welcome on GitHub at https://github.com/flavorjones/nokogiri-html5-inference. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
The gem is available as open source under the terms of the MIT License.
Everyone interacting in the Nokogiri::HTML5::Inference
project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.
FAQs
Unknown package
We found that nokogiri-html5-inference demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
Research
The Socket Research Team breaks down a malicious wrapper package that uses obfuscation to harvest credentials and exfiltrate sensitive data.
Research
Security News
Attackers used a malicious npm package typosquatting a popular ESLint plugin to steal sensitive data, execute commands, and exploit developer systems.
Security News
The Ultralytics' PyPI Package was compromised four times in one weekend through GitHub Actions cache poisoning and failure to rotate previously compromised API tokens.