webtractor

  • 0.0.3
  • Rubygems
  • Maintainers: 1

Webtractor

Webtractor is a Ruby library that extracts the main content from web pages such as news articles and blog posts. The result is the main content alone, without boilerplate (menus, footers, comments, etc.).

Installation

You can install it directly via gem:

gem install webtractor

Or you can put it in your Gemfile:

gem 'webtractor'

Then run:

bundle install

Basic usage

require 'webtractor'

extractor = Webtractor::Extractor.new
result = extractor.extract_from_url 'http://techcrunch.com/2014/05/24/dont-believe-anyone-who-tells-you-learning-to-code-is-easy/'
puts result.text

Or extract from an HTML string:

extractor = Webtractor::Extractor.new
result = extractor.extract '<html><body>...</body></html>'

Or extract from an already parsed Nokogiri document:

page = Nokogiri::HTML(...)
extractor = Webtractor::Extractor.new
result = extractor.extract_from_xml page

You can also access the Nokogiri document from the result via its xml attribute:

puts result.xml.xpath('...').text 

Advanced usage

Extracting the main content from a web page is simple: a sequence of filters is applied to the document, and each filter receives the output of the previously applied filter as its input.
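
The chaining idea can be sketched in plain Ruby. This is only an illustration of the pattern, not Webtractor's actual implementation: the class names and the run_filters helper are hypothetical, and plain strings stand in for a parsed document.

```ruby
# Illustrative only: a generic filter chain, not Webtractor's internals.
# Each filter responds to #process, takes a document, and returns a document.

class StripWhitespace
  def process(doc)
    doc.strip
  end
end

class CollapseSpaces
  def process(doc)
    doc.gsub(/\s+/, ' ')
  end
end

# Feed each filter the output of the previous one.
def run_filters(doc, filters)
  filters.reduce(doc) { |current, filter| filter.process(current) }
end

filters = [StripWhitespace.new, CollapseSpaces.new]
puts run_filters("  main   content \n here  ", filters)
# prints "main content here"
```

Because every filter returns the document it was given, filters can be reordered, removed, or added without changing the others.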

You can list the class names of the default filters:

p Webtractor::Filters::DefaultFilter.new.filters.map{|f| f.class.to_s}

You can remove any filter:

extractor.remove_filter Webtractor::Filters::RemoveComments

You can also create your own filter. A filter can be any class that implements a process method, which takes the page as an argument and returns the page:

class RemoveBolds
  def process page
    page.css('b').remove
    page
  end
end

extractor.add_filter RemoveBolds.new
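
A custom filter can also take configuration through its constructor. The following is a minimal sketch, assuming the page passed to process is a Nokogiri document as in the RemoveBolds example; RemoveSelector is a hypothetical class and not part of the library.

```ruby
# Hypothetical custom filter: removes every node matching a CSS selector.
# Assumes the page is a Nokogiri document, as in the RemoveBolds example.
class RemoveSelector
  def initialize(selector)
    @selector = selector
  end

  def process(page)
    page.css(@selector).remove # detach the matching nodes from the document
    page                       # return the page so the next filter can use it
  end
end

# Usage, assuming an extractor created as shown earlier:
# extractor.add_filter RemoveSelector.new('aside')
# extractor.add_filter RemoveSelector.new('.share-buttons')
```

One parameterized class like this avoids writing a separate filter class for each element you want to drop.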

License

This library is distributed under the Beerware license.

FAQs

Package last updated on 26 May 2014
