# creepy-crawler

Ruby web crawler that takes a URL as input and produces a sitemap using a neo4j graph database - nothing creepy about it.

[Build Status badge]

## Installation

#### Clone

```sh
git clone https://github.com/udryan10/creepy-crawler.git && cd creepy-crawler
```

#### Install Required Gems

```sh
bundle install
```

#### Install graph database

```sh
rake neo4j:install
```

#### Start graph database

```sh
rake neo4j:start
```
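Once `rake neo4j:start` returns, the database's web interface (see Output below) should be listening on neo4j's default port, 7474. A quick sanity check from Ruby, assuming the default port:

```ruby
require 'net/http'

# Confirm neo4j is up by hitting its web interface on the default port.
response = Net::HTTP.get_response(URI('http://localhost:7474/'))
puts "neo4j responded with HTTP #{response.code}"
```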

#### Requirements

  1. Gems listed in the Gemfile
  2. Ruby 1.9+
  3. neo4j
  4. Oracle JDK 7 (for the neo4j graph database)
  5. lsof (for the neo4j graph database)

## Usage

### Code

#### Require

```ruby
require './creepy-crawler'
```

#### Start a crawl

```ruby
Creepycrawler.crawl("http://example.com")
```

#### Limit number of pages to crawl

```ruby
Creepycrawler.crawl("http://example.com", :max_page_crawl => 500)
```

#### Extract some (potentially) useful statistics

```ruby
crawler = Creepycrawler.crawl("http://example.com", :max_page_crawl => 500)
# list of broken links
puts crawler.broken_links
# list of sites that were visited
puts crawler.visited_queue
# count of crawled pages
puts crawler.page_crawl_count
```

#### Options

```ruby
DEFAULT_OPTIONS = {
  # whether to print crawling information
  :verbose => true,
  # whether to obey robots.txt
  :obey_robots => true,
  # maximum number of pages to crawl; nil will attempt to crawl all pages
  :max_page_crawl => nil,
  # whether pages should be written to the database. Likely only used for
  # testing, but useful if you only want the broken_links data
  :graph_to_neo4j => true
}
```
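These defaults can be overridden per crawl. As a minimal sketch using the API shown above: a quiet crawl that skips writing to neo4j, for when only the broken-link report is of interest:

```ruby
require './creepy-crawler'

# Quiet crawl that skips graphing to neo4j - per the options above, this is
# useful when you only want the broken_links data.
crawler = Creepycrawler.crawl("http://example.com",
                              :verbose => false,
                              :graph_to_neo4j => false,
                              :max_page_crawl => 250)
puts crawler.broken_links
```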

#### Examples

Examples are located in the `examples/` directory.

### Command line

```sh
# Crawl site
ruby creepy-crawler.rb --site "http://google.com"

# Get command options
ruby creepy-crawler.rb --help
```

Note: if behind a proxy, export your proxy environment variables:

```sh
export http_proxy=<proxy_host>; export https_proxy=<proxy_host>
```
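The same variables can also be set from Ruby before the crawl starts. A sketch, assuming the underlying HTTP library honors these environment variables (the proxy address below is a hypothetical example):

```ruby
# Hypothetical proxy address - substitute your own proxy host and port.
ENV['http_proxy']  = 'http://proxy.example.com:3128'
ENV['https_proxy'] = 'http://proxy.example.com:3128'

require './creepy-crawler'
Creepycrawler.crawl("http://example.com")
```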

### Docker

For testing, I have included the ability to run the environment and a crawl inside a Docker container.

## Output

creepy-crawler uses the neo4j graph database to store and display the site map.

#### Web interface

neo4j has a web interface for viewing and interacting with the graph data. When running on localhost, visit http://localhost:7474/webadmin/

  1. Click the Data Browser tab
  2. Enter a query to search for nodes (the query below returns all nodes):

     ```
     START root=node(*) RETURN root
     ```

  3. Click into a node
  4. Click "switch view mode" at the top right to view a graphical map

Note: to have the map display URL names instead of node numbers, you must create a style.

#### REST interface

neo4j also has a full REST API for programmatic access to the data.
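For example, the Cypher endpoint can be queried with nothing but Ruby's standard library. A sketch against the neo4j 1.x/2.x-era legacy REST API (the endpoint path may differ on other neo4j versions):

```ruby
require 'net/http'
require 'uri'
require 'json'

# POST a Cypher query to neo4j's legacy REST endpoint.
uri = URI('http://localhost:7474/db/data/cypher')
request = Net::HTTP::Post.new(uri.request_uri,
                              'Content-Type' => 'application/json',
                              'Accept'       => 'application/json')
request.body = { 'query' => 'START root=node(*) RETURN root' }.to_json

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts JSON.parse(response.body)
```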

### Example Output Map

[Output Map image]

## TODO

  1. Convert to a gem
  2. Multi-threading to increase crawl performance
