Browser Crawler visits the pages available on a site and extracts useful information.
It can help with maintaining lists of internal and external links,
creating sitemaps, doing visual testing with screenshots,
or preparing a list of URLs for a more sophisticated tool such as Wraith.
Browser-based crawling is performed with the help of Capybara and Chrome. JavaScript is executed before a page is analyzed, which allows crawling dynamic content. Browser-based crawling is essentially an alternative to Wraith's spider mode, which parses only server-side rendered HTML.
By default, the crawler visits pages by following the links it extracts. No buttons are clicked other than during the optional authentication step, so the crawler does not modify the site and can be treated as non-invasive.
Add this line to your application's Gemfile:
gem 'browser_crawler'
And then execute:
$ bundle
Or install it yourself as:
$ gem install browser_crawler
Without authentication:
crawl http://localhost:3000
With authentication, screenshots, and the number of visited pages limited to 1:
crawl https://your.site.com/welcome -u username -p password -n 1 -s tmp/screenshots
# or
export username=dima
export password=secret
#...
crawl https://your.site.com/welcome -n 1 -s tmp/screenshots
Generate an index from the captured screenshots. The index is saved to tmp/screenshots/index.html:
bin/crawl -s tmp/screenshots
See additional options with:
bin/crawl -h
When crawling finishes, the report is saved to the tmp/crawl_report.yml file by default.
You can specify the file path using command line options.
Below is an example script that configures the crawler, points it at the github.com site,
and then records the resulting report as a YAML file.
crawler = BrowserCrawler::Engine.new({
  browser_options: {
    headless: true,
    window_size: [1200, 1600],
    timeout: 60,
    # command-line switches passed to the Chrome process itself
    browser_options: { 'no-sandbox': nil }
  },
  max_pages: 10,
  deep_visit: true
})
crawler.extract_links(url: 'https://github.com')
crawler.report_save
This gem uses cuprite as an external dependency. Cuprite drives the browser directly, without intermediaries such as chromedriver.
browser_options - configures the headless Chrome browser through cuprite.
max_pages - limits the number of pages to crawl. By default it is nil, which lets the crawler browse all pages within the domain.
deep_visit - a mode in which the crawler checks external resources without collecting links from them.
The crawler also supports before and after callbacks; inside all of them you can use the Capybara DSL:
crawler = BrowserCrawler::Engine.new()
# scroll down page before scan.
crawler.before do
  page.execute_script 'window.scrollBy(0,10000)'
end

crawler.after do
  page.body
end
crawler.extract_links(url: 'https://github.com')
With type: :each the callbacks run around each page scan instead:
crawler = BrowserCrawler::Engine.new()

# scroll down page before scan.
crawler.before type: :each do
  page.execute_script 'window.scrollBy(0,10000)'
end

crawler.after type: :each do
  page.body
end
crawler.extract_links(url: 'https://github.com')
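For example, a per-page callback can use the Capybara DSL to dismiss an overlay before each page is scanned. This is a sketch; the #accept-cookies selector is a hypothetical element id, not something the gem provides:
crawler = BrowserCrawler::Engine.new()

# Close a cookie banner, if present, before each page is scanned.
# '#accept-cookies' is a hypothetical selector used only for illustration.
crawler.before type: :each do
  page.find('#accept-cookies').click if page.has_css?('#accept-cookies', wait: 0)
end

crawler.extract_links(url: 'https://github.com')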
Default behavior: the crawler sends all links found on a page to an unvisited_links array and then browses each of them. The unvisited_links callback allows you to change this behavior.
crawler = BrowserCrawler::Engine.new()
# scan_result is an array of links from the scanned page.
crawler.unvisited_links do
  @page_inspector.scan_result
end
crawler.extract_links(url: 'https://github.com')
Changed behavior: the crawler browses only links that contain /best-links.
crawler = BrowserCrawler::Engine.new()
crawler.unvisited_links do
  @page_inspector.scan_result.select { |link| link.include?('/best-links') }
end
crawler.extract_links(url: 'https://github.com')
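Another common use, shown here as a sketch, is to skip links the crawler should never follow, such as sign-out URLs (the /logout fragment is only an illustration):
crawler = BrowserCrawler::Engine.new()

# Skip sign-out links so the crawler does not terminate its own session.
crawler.unvisited_links do
  @page_inspector.scan_result.reject { |link| link.include?('/logout') }
end

crawler.extract_links(url: 'https://github.com')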
Default behavior: the crawler collects all links from a page and moves from one page to another. The change_page_scan_rules callback controls which links are collected from a page.
crawler = BrowserCrawler::Engine.new()
crawler.change_page_scan_rules do
  page.all('a').map { |a| a['href'] }
end
crawler.extract_links(url: 'https://github.com')
Changed behavior: the crawler collects links only from pages under /help/, and only those matching the a.paginations selector.
crawler = BrowserCrawler::Engine.new()
crawler.change_page_scan_rules do
  if URI.parse(page.current_url).to_s.include?('/help/')
    page.all('a.paginations').map { |a| a['href'] }
  else
    []
  end
end
crawler.extract_links(url: 'https://github.com')
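The same callback can also narrow scanning to part of the page, for example only links inside the page's main content element. This is a sketch; the main selector is an assumption about the target markup:
crawler = BrowserCrawler::Engine.new()

# Collect only links inside the <main> element, skipping header and footer navigation.
crawler.change_page_scan_rules do
  page.all('main a').map { |a| a['href'] }
end

crawler.extract_links(url: 'https://github.com')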
Save the report to a specific folder:
crawler = BrowserCrawler::Engine.new()
crawler.extract_links(url: 'https://github.com')
crawler.report_save(folder_path: './reports/')
If the folder doesn't exist, BrowserCrawler creates it for the report.
Save the report as a YAML file:
crawler = BrowserCrawler::Engine.new()
crawler.extract_links(url: 'https://github.com')
crawler.report_save(type: :yaml)
Save the report as a CSV file:
crawler = BrowserCrawler::Engine.new()
crawler.extract_links(url: 'https://github.com')
crawler.report_save(type: :csv)
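Once saved, the report is plain YAML (or CSV) and can be post-processed with standard tooling. A minimal sketch, assuming the default report path mentioned earlier in this README; the report's internal structure depends on the gem version:
require 'yaml'

# Load the report written by crawler.report_save.
report = YAML.load_file('tmp/crawl_report.yml')

# Inspect the top-level structure before relying on specific keys.
puts report.inspect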
Browser Crawler can be useful for updating the paths: section of Wraith's configs.
Provided the Wraith config is placed in the wraith/configs/capture.yaml file, do:
crawl https://your.site.com/welcome -c wraith/configs/capture.yaml
Or, if you have a crawling report available, use it without the URL to skip crawling:
bin/crawl -c tmp/wraith_config.yml -r tmp/crawl_report.yml
The current version has the authentication process hardcoded: the path to the login form and the field names used are specific to the project the crawler was extracted from. Configuration may be added in a future version.
It should be easy to crawl the site as part of automated testing, e.g. to verify the list of pages available on the site or to generate a visual report (Wraith does this better).
By integrating browser_crawler into the application's test suite, it becomes possible to reach pages and content that are not easily accessible on the real site, e.g. after performing data modifications.
Integration into the test suite also makes it possible to use all the tools, mocks, and helpers created to simulate user behavior, e.g. mocking external requests with VCR. A sketch of such a spec is shown below.
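A minimal sketch, assuming an RSpec suite, a locally running application under test, and VCR already configured; the spec name, host, and cassette name are hypothetical:
# spec/features/crawl_spec.rb (hypothetical example, not part of the gem)
require 'browser_crawler'
require 'vcr'

RSpec.describe 'site crawl' do
  it 'visits internal pages without leaving the test environment' do
    # Replay recorded external HTTP requests instead of hitting real services.
    VCR.use_cassette('crawl_external_requests') do
      crawler = BrowserCrawler::Engine.new(max_pages: 20)
      crawler.extract_links(url: 'http://localhost:3000')
      crawler.report_save(folder_path: 'tmp/')
    end
  end
end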
After checking out the repo, run bin/setup to install dependencies. Then run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Bug reports and pull requests are welcome on GitHub at https://github.com/dimasamodurov/browser_crawler.
MIT