An easy (yet efficient) Ruby gem to crawl your favorite website.
Open your terminal, then:
git clone https://github.com/htaidirt/super_crawler
cd super_crawler
bundle
./bin/console
Then
sc = SuperCrawler::Crawl.new('https://gocardless.com')
sc.start(10) # => Start crawling the website using 10 threads
sc.render(5) # => Show the first 5 results of the crawling as sitemap
Add this line to your application's Gemfile:
gem 'super_crawler'
And then execute:
bundle install
Or install it yourself as:
gem install super_crawler
Want to experiment with the gem without installing it? Clone the repository and run bin/console for an interactive prompt that will allow you to experiment.
This gem is an experiment and isn't meant for production use. Please use it with caution if you want to use it in your projects.
There are also many limitations that weren't handled due to time constraints. You'll find more information on the limitations below.
SuperCrawler gem was only tested on MRI 2.3.1 and Rubinius 2.5.8.
Starting from a given URL, the crawler extracts all the internal links and assets within the page. The links are added to a list of unique links for further exploration. The crawler repeats the exploration visiting all the links until no new link is found.
Because a crawl can involve thousands of pages, and each page fetch waits on the network, the crawler uses threads to process pages nearly in parallel.
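The exploration loop described above can be sketched with Ruby's standard library alone. This is a minimal illustration under stated assumptions, not the gem's actual implementation: `SITE` is a hypothetical in-memory stand-in for the network, and `crawl_site` is an assumed name.

```ruby
require 'set'

# Hypothetical in-memory "website": each URL maps to the internal links found on that page.
SITE = {
  'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
  'http://example.com/a' => ['http://example.com/', 'http://example.com/c'],
  'http://example.com/b' => ['http://example.com/a'],
  'http://example.com/c' => []
}.freeze

# Visits every reachable link exactly once, using a pool of worker threads.
def crawl_site(start_url, number_of_threads = 10)
  seen    = Set.new([start_url]) # unique links discovered so far
  queue   = Queue.new            # links waiting to be explored
  pending = 1                    # links queued or currently being processed
  lock    = Mutex.new
  queue << start_url

  workers = Array.new(number_of_threads) do
    Thread.new do
      while (url = queue.pop)       # a nil entry signals that the crawl is over
        links = SITE.fetch(url, []) # stand-in for fetching and scraping the page
        lock.synchronize do
          links.each do |link|
            next unless seen.add?(link) # skip links that are already known
            pending += 1
            queue << link
          end
          pending -= 1
          # Nothing queued or in progress: wake every worker so it can exit.
          number_of_threads.times { queue << nil } if pending.zero?
        end
      end
    end
  end
  workers.each(&:join)
  seen.to_a
end

all_links = crawl_site('http://example.com/', 3)
```

The `pending` counter is one way to detect that no new link can appear; the order of `all_links` depends on thread scheduling.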
In order to keep the code readable and structured, we created two classes:
SuperCrawler::Scrap is responsible for scraping a single page and extracting all relevant information (internal links and assets).
SuperCrawler::Crawl is responsible for crawling a whole website by collecting and managing links (using SuperCrawler::Scrap on every internal link found). This class is also responsible for rendering results.
Open your favorite Ruby console and require the gem:
require 'super_crawler'
Read the following if you would like to crawl a single web page and extract relevant information (internal links and assets).
page = SuperCrawler::Scrap.new( url )
Where url should be the URL of the page you would like to scrape.
Note: If the given URL has a missing scheme (http:// or https://), SuperCrawler will prepend http:// to the URL.
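The scheme rule in the note above could look like the following sketch. `with_scheme` is a hypothetical helper name used for illustration; the gem's own code may differ.

```ruby
# Hypothetical helper: prepend "http://" when the URL has no http(s) scheme.
def with_scheme(url)
  url =~ %r{\Ahttps?://}i ? url : "http://#{url}"
end

with_scheme('example.com')         # => "http://example.com"
with_scheme('https://example.com') # => "https://example.com"
```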
Run page.url to get the encoded URL.
Run page.get_links to get the list of internal links in the page. An internal link is a link that has the same scheme and host as the provided URL. Subdomains are rejected.
This method searches the href attribute of all <a> anchor tags.
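To make the "same scheme and host" rule concrete, here is a small sketch based on Ruby's standard `uri` library. `internal_link?` is a hypothetical helper, not part of the gem's documented API.

```ruby
require 'uri'

# Hypothetical helper: a link is internal when scheme and host both match the base URL.
def internal_link?(base_url, link)
  base   = URI.parse(base_url)
  target = URI.join(base_url, link) # resolves relative links against the page URL
  target.scheme == base.scheme && target.host == base.host
rescue URI::Error
  false # malformed links are treated as external
end

internal_link?('https://example.com/', '/pricing')                  # => true
internal_link?('https://example.com/', 'https://blog.example.com/') # => false (subdomain)
internal_link?('https://example.com/', 'http://example.com/')       # => false (different scheme)
```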
Run page.get_images to get a list of image links within the page. The image links are extracted from the src="..." attribute of all <img> tags.
Note: Images included using CSS or JavaScript aren't detected by the method.
Note 2: This method returns an array of absolute URLs.
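As an illustration of how src values can be turned into absolute URLs, the sketch below resolves each one against the page URL with `URI.join`. `image_urls` is a hypothetical helper, and the regular expression is only there to keep the example dependency-free; it is an assumption, not how the gem parses HTML.

```ruby
require 'uri'

# Hypothetical helper: collect <img src="..."> values and make them absolute.
# A regex keeps this sketch dependency-free; a real HTML parser is more robust.
def image_urls(page_url, html)
  html.scan(/<img\b[^>]*\bsrc=["']([^"']+)["']/i)
      .flatten
      .map { |src| URI.join(page_url, src).to_s }
      .uniq
end

html = '<img src="/logo.png"><img alt="x" src="https://cdn.example.org/a.jpg"><img src="/logo.png">'
image_urls('https://example.com/about', html)
# => ["https://example.com/logo.png", "https://cdn.example.org/a.jpg"]
```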
Run page.get_stylesheets to get a list of stylesheet links within the page. The links are extracted from the href="..." attribute of all <link rel="stylesheet"> tags.
Run page.get_scripts to get a list of script links within the page. The links are extracted from the src="..." attribute of all <script> tags.
Run page.get_assets to get a list of all assets (links of images, stylesheets and scripts) as a hash of arrays.
sc = SuperCrawler::Crawl.new(url)
where url is the URL of the website to crawl.
Next, start the crawler:
sc.start(number_of_threads)
where number_of_threads is the number of threads that will perform the job (10 by default). This can take some time, depending on the site to crawl.
To access the crawl results, use the following:
sc.links # The array of unique internal links
sc.crawl_results # Array of hashes containing links and assets for every unique internal link found
To see the crawling as a sitemap, use:
sc.render(5) # Will render the sitemap of the first 5 pages
TODO: Add more sophisticated rendering methods that can write to files in different formats (HTML, XML, JSON, ...)
After sc.start, you can access all collected resources (links and assets) using sc.crawl_results. This has the following structure:
[
  {
    url: 'http://example.com/',
    links: [...array of internal links...],
    assets: {
      images: [...array of images links],
      stylesheets: [...array of stylesheets links],
      scripts: [...array of scripts links],
    }
  },
  ...
]
You can use sc.crawl_results.select{ |resource| ... } to select a particular resource.
Example:
images = sc.crawl_results.map{ |page| page[:assets][:images] }.flatten.uniq
# => Returns an array of all unique images found during the crawling
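A select call works the same way. The sketch below runs on a hypothetical `crawl_results` sample that mirrors the structure shown earlier, so it can be tried without crawling anything.

```ruby
# Hypothetical sample with the same shape as sc.crawl_results.
crawl_results = [
  { url: 'http://example.com/',
    links: ['http://example.com/about'],
    assets: { images: ['http://example.com/logo.png'], stylesheets: [], scripts: [] } },
  { url: 'http://example.com/about',
    links: [],
    assets: { images: [], stylesheets: ['http://example.com/app.css'], scripts: [] } }
]

# Keep only the pages that reference at least one image.
with_images = crawl_results.select { |page| page[:assets][:images].any? }
with_images.map { |page| page[:url] } # => ["http://example.com/"]
```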
You can collect all assets of a given type from a crawl into a single array by using the following:
images = sc.get_assets :images # => Returns an array of unique images
stylesheets = sc.get_assets :stylesheets # => Returns an array of unique stylesheets
scripts = sc.get_assets :scripts # => Returns an array of unique scripts
It is important to note that all the given arrays contain unique absolute URLs. As said before, the assets are not necessarily internal assets.
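Since the returned assets may live on other hosts, you may want to keep only those served by the crawled site. The following is a sketch of one way to do that; `internal_assets` is a hypothetical helper that is not provided by the gem.

```ruby
require 'uri'

# Hypothetical helper: keep only the assets served from the same host as site_url.
def internal_assets(site_url, assets)
  host = URI.parse(site_url).host
  assets.select { |asset| URI.parse(asset).host == host }
end

images = ['http://example.com/logo.png', 'https://cdn.example.org/hero.jpg']
internal_assets('http://example.com', images) # => ["http://example.com/logo.png"]
```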
Currently, the gem has the following limitations:
Only links in <a href="..."> tags are extracted
Only images in <img src="..."/> tags are extracted
Only stylesheets in <link rel="stylesheet" href="..." /> tags are extracted
Only scripts in <script src="..."> tags are extracted
After checking out the repo, run bin/setup to install dependencies. Then, run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Bug reports and pull requests are welcome on GitHub at https://github.com/htaidirt/super_crawler. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
Please, follow this process:
The gem is available as open source under the terms of the MIT License.