
Webtractor is a Ruby library that extracts the main content from web pages such as news articles and blog posts. The result is the main content alone, without boilerplate (menus, footers, comments, etc.).
You can install it directly via gem:
gem install webtractor
Or you can put it in your Gemfile:
gem 'webtractor'
Then run:
bundle install
extractor = Webtractor::Extractor.new
result = extractor.extract_from_url 'http://techcrunch.com/2014/05/24/dont-believe-anyone-who-tells-you-learning-to-code-is-easy/'
puts result.text
Or
extractor = Webtractor::Extractor.new
result = extractor.extract '<html><body>...</body></html>'
Or
page = Nokogiri::HTML(...)
extractor = Webtractor::Extractor.new
result = extractor.extract_from_xml page
You can also access the Nokogiri document from the result via its xml attribute:
puts result.xml.xpath('...').text
The process of getting the main content from a web page is simple. It consists of applying a sequence of filters to the document, where each filter receives as input the output of the previously applied filter.
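The chaining idea can be sketched in a few lines of plain Ruby. This is an illustration only, not Webtractor's actual implementation: the StripScripts and StripComments classes and the run_filters helper are hypothetical names, and the filters here work on a plain string rather than a Nokogiri document so the example has no dependencies.

```ruby
# Hypothetical stand-in filters (not part of Webtractor). Real filters
# receive a Nokogiri document; these take a String to stay dependency-free.
class StripScripts
  def process(page)
    page.gsub(%r{<script.*?</script>}m, '')
  end
end

class StripComments
  def process(page)
    page.gsub(/<!--.*?-->/m, '')
  end
end

# Hypothetical helper: feed the output of each filter into the next, in order.
def run_filters(page, filters)
  filters.reduce(page) { |current, filter| filter.process(current) }
end

html = '<p>Hello</p><!-- ad --><script>track()</script>'
cleaned = run_filters(html, [StripScripts.new, StripComments.new])
puts cleaned
```

Because each filter only needs a process method that takes a page and returns a page, filters can be added, removed, or reordered without touching the others.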
You can look at the names of default filters:
p Webtractor::Filters::DefaultFilter.new.filters.map{|f| f.class.to_s}
You can remove any filter:
extractor.remove_filter Webtractor::Filters::RemoveComments
You can also create your own filter. A filter can be any class that implements a process method which takes the page as an argument and returns the page:
class RemoveBolds
  # Remove all <b> elements and return the modified page.
  def process page
    page.css('b').remove
    page
  end
end
extractor.add_filter RemoveBolds.new
This library is distributed under the Beerware license.