flyerhzm-regexp_crawler


  • 0.9.1
  • Rubygems

h1. RegexpCrawler

RegexpCrawler is a crawler that uses regular expressions to extract data from websites. It is easy to use and needs little code if you are familiar with regular expressions.


h2. Install


gem sources -a http://gems.github.com
gem install flyerhzm-regexp_crawler

h2. Usage

It's really easy to use; sometimes a single line is enough.


RegexpCrawler::Crawler.new(options).start

options is a hash with the following keys:

  • :start_page, mandatory, a string giving the URL where the crawler starts
  • :continue_regexp, optional, a regexp that selects which URLs the crawler continues to crawl; it is applied with String#scan, and the first non-nil result is used
  • :capture_regexp, mandatory, a regexp that defines the content to extract; it is applied with Regexp#match, and all group captures are collected
  • :named_captures, mandatory, an array of strings naming the captured groups of :capture_regexp
  • :model, optional if :save_method is defined, a string naming the model class of the results
  • :save_method, optional if :model is defined, a proc that defines how to save each crawled result; it accepts two parameters: the result for one crawled page, then the crawled URL
  • :headers, optional, a hash of HTTP headers
  • :encoding, optional, a string naming the encoding of the crawled pages; results are converted to UTF-8
  • :need_parse, optional, a proc that decides whether a page should be parsed by the regexp; it accepts two parameters: the crawled URI, then the response body of the crawled page
  • :logger, optional, true to log to STDOUT, or a Logger object to log to that logger
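The matching steps described for :continue_regexp and :capture_regexp can be tried in plain Ruby, without the gem. The following is a minimal sketch and not the gem's actual implementation: the sample HTML, both patterns, and the variable names (html, links, captured) are assumptions made for illustration.

```ruby
# Illustration only: mimics the documented String#scan / Regexp#match steps.
# The HTML and both patterns are made-up sample data, not taken from the gem.
html = '<a href="/projects/1">One</a><h1>Title one</h1><p>Body one</p>'

continue_regexp = %r{<a href="(/projects/\d+)">}
capture_regexp  = %r{<h1>(.*?)</h1>.*<p>(.*?)</p>}m
named_captures  = ['title', 'body']

# String#scan returns one array of groups per match; keep the first non-nil group.
links = html.scan(continue_regexp).map { |groups| groups.compact.first }

# Regexp#match returns a MatchData object; pair each group capture with its name.
match = capture_regexp.match(html)
captured = {}
if match
  named_captures.each_with_index do |name, i|
    captured[name.to_sym] = match[i + 1]
  end
end

puts links.inspect
puts captured.inspect
```

Here links holds the URLs a crawler would visit next, and captured holds one page's result keyed by the names from named_captures.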

If the crawler defines :model but no :save_method, RegexpCrawler::Crawler#start returns an array of results, such as


[{:model_name => {:attr_name => 'attr_value'}, :page => 'website url'}, {:model_name => {:attr_name => 'attr_value'}, :page => 'another website url'}]
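A returned array of this shape can be consumed with ordinary hash access. The sketch below uses hand-written sample data instead of a real crawl; the :post model name, the :title attribute, and the example.com URLs are hypothetical.

```ruby
# Sample data in the documented return shape.
# :post, :title, and the URLs are hypothetical, chosen for illustration.
results = [
  { :post => { :title => 'First' },  :page => 'http://example.com/posts/1' },
  { :post => { :title => 'Second' }, :page => 'http://example.com/posts/2' }
]

summary = results.map do |result|
  page = result[:page]
  # Every key other than :page is a model name mapped to its attribute hash.
  model_name, attributes = result.reject { |key, _| key == :page }.first
  "#{page}: #{model_name} #{attributes[:title]}"
end

puts summary
```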

h2. Example

A script to synchronize your GitHub projects, excluding forks. See example/github_projects.rb.


require 'rubygems'
require 'regexp_crawler'

crawler = RegexpCrawler::Crawler.new(
  :start_page => "http://github.com/flyerhzm",
  :continue_regexp => %r{
}m,
  :capture_regexp => %r{(.*?).*(.*?).*(
.*?
)
}m,
  :named_captures => ['title', 'description', 'body'],
  :save_method => Proc.new do |result, page|
    puts '============================='
    puts page
    puts result[:title]
    puts result[:description]
    puts result[:body][0..100] + "..."
  end,
  :need_parse => Proc.new do |page, response_body|
    page =~ %r{http://github.com/flyerhzm/\w+} && !response_body.index(/Fork of.*?/)
  end)
crawler.start

The results are as follows:


=============================
http://github.com/flyerhzm/bullet/tree/master
bullet
A rails plugin/gem to kill N+1 queries and unused eager loading

Bullet

The Bullet plugin/gem is designed to help you increase your...
=============================
http://github.com/flyerhzm/regexp_crawler/tree/master
regexp_crawler
A crawler which use regular expression to catch data.

RegexpCrawler

RegexpCrawler is a crawler which use regex expressi...
=============================
http://github.com/flyerhzm/sitemap/tree/master
sitemap
This plugin will generate a sitemap.xml from sitemap.rb whose format is very similar to routes.rb

Sitemap

This plugin will generate a sitemap.xml or sitemap.xml.gz ...
=============================
http://github.com/flyerhzm/visual_partial/tree/master
visual_partial
This plugin provides a way that you can see all the partial pages rendered. So it can prevent you from using partial page too much, which hurts the performance.

VisualPartial

This plugin provides a way that you can see all the ...
=============================
http://github.com/flyerhzm/chinese_regions/tree/master
chinese_regions
provides all chinese regions, cities and districts

ChineseRegions

Provides all chinese regions, cities and districts<...
=============================
http://github.com/flyerhzm/chinese_permalink/tree/master
chinese_permalink
This plugin adds a capability for ar model to create a seo permalink with your chinese text. It will translate your chinese text to english url based on google translate.

This plugin adds a capability for ar model to cre...
=============================
http://github.com/flyerhzm/codelinestatistics/tree/master
codelinestatistics
The code line statistics takes files and directories from GUI, counts the total files, total sizes of files, total lines, lines of codes, lines of comments and lines of blanks in the files, displays the results and can also export results to html file.

codelinestatistics README file:

----------------------------------------
Wha...
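
The :need_parse filter from the example can be exercised on its own to see which pages it lets through. This is a small sketch: the proc body is copied from the example, while the URLs and response bodies passed to it are made-up sample strings, and the page argument is assumed to be a String.

```ruby
# Same predicate as the :need_parse proc in the example above.
need_parse = Proc.new do |page, response_body|
  page =~ %r{http://github.com/flyerhzm/\w+} && !response_body.index(/Fork of.*?/)
end

# Made-up inputs: a project page, a forked project page, and the profile page.
own_project  = need_parse.call('http://github.com/flyerhzm/bullet', '<div>Bullet</div>')
forked       = need_parse.call('http://github.com/flyerhzm/rails', '<div>Fork of rails/rails</div>')
profile_page = need_parse.call('http://github.com/flyerhzm', '<div>profile</div>')

puts own_project.inspect
puts forked.inspect
puts profile_page.inspect
```

Only the first call yields a truthy value: the fork is rejected by the "Fork of" check, and the profile URL has no project path for the first pattern to match.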

Package last updated on 11 Aug 2014
