websitiary


  • 0.1.0
  • Rubygems

websitiary by Thomas Link http://rubyforge.org/projects/websitiary/

This is a script for monitoring webpages that reuses other programs to do the actual work. By default, it works on an ASCII basis, i.e. with the output of text-based web browsers. With the help of some friends, it can also work with HTML.

== DESCRIPTION:

This is a script for monitoring webpages that reuses other programs (w3m, diff, webdiff etc.) to do most of the actual work. By default, it works on an ASCII basis, i.e. with the output of text-based web browsers like w3m (or lynx, links etc.), as that output can easily be post-processed. With the help of some friends (see the section below on requirements), it can also work with HTML. E.g., if you have websec installed, you can also use its webdiff program to show colored diffs.

By default, this script will use w3m to dump HTML pages and then run diff over the current page and the previous backup. Some pages are better viewed with lynx or links. Downloaded documents (HTML or ASCII) can be post-processed (e.g., filtered through some ruby block that extracts elements via hpricot and the like). Please see the configuration options below to find out how to change this globally or for a single source.
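The default cycle described above (dump the page, compare it with the previous backup, keep the new dump as the next backup) can be sketched roughly as follows. This is only an illustration, not websitiary's actual code: check_page and naive_diff are hypothetical names, the downloader is any callable that returns the page text, and the line comparison is a crude stand-in for the external diff program.

```ruby
require 'fileutils'

# Crude line-based comparison standing in for the external `diff` program.
def naive_diff(old_text, new_text)
  old_lines = old_text.lines.map(&:chomp)
  new_lines = new_text.lines.map(&:chomp)
  removed = (old_lines - new_lines).map { |line| "< #{line}" }
  added   = (new_lines - old_lines).map { |line| "> #{line}" }
  removed + added
end

# Hypothetical helper: fetch the page through `downloader`, compare it with
# the backup at `backup_path`, then store the new dump as the next backup.
# Returns the changed lines, or nil on the first run (no backup yet).
def check_page(url, backup_path, downloader)
  current = downloader.call(url)
  changes = File.exist?(backup_path) ? naive_diff(File.read(backup_path), current) : nil
  FileUtils.mkdir_p(File.dirname(backup_path))
  File.write(backup_path, current)
  changes
end
```

A real downloader could be something like lambda {|url| `w3m -dump "#{url}"`}, assuming w3m is installed.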

=== CAVEAT:

The script also includes experimental support for monitoring whole websites. Basically, this script supports robots.txt directives (see requirements), but this is hardly tested and may not work in some cases.

While it is acceptable to ignore robots.txt on your own websites, it is not acceptable on anyone else's. Please make sure that the webpages you run this program on allow such a use. Some webpages disallow the use of any automatic downloader or offline reader in their user agreements.
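To give an idea of what honouring robots.txt involves, here is a deliberately naive sketch that reads the Disallow rules of the "User-agent: *" group and checks a path against them. It is not how websitiary or robot_rules.rb work internally; disallowed_prefixes and allowed? are hypothetical names, and Allow rules, wildcards and agent-specific groups are ignored.

```ruby
# Collect the Disallow prefixes that apply to every user agent ("*").
def disallowed_prefixes(robots_txt)
  rules = []
  applies = false    # does the current group apply to "*"?
  in_header = false  # are we still reading the group's User-agent lines?
  robots_txt.each_line do |raw|
    field, value = raw.sub(/#.*/, '').split(':', 2)
    field = field.to_s.strip.downcase
    value = value.to_s.strip
    next if field.empty?
    if field == 'user-agent'
      applies = false unless in_header # a new group starts here
      applies ||= (value == '*')
      in_header = true
    else
      in_header = false
      rules << value if field == 'disallow' && applies && !value.empty?
    end
  end
  rules
end

# A path is allowed unless it starts with one of the disallowed prefixes.
def allowed?(robots_txt, path)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end
```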

== FEATURES/PROBLEMS:

  • Download webpages at defined intervals
  • Compare webpages with previous backups
  • Display differences between the current version and the backup
  • Provide hooks to post-process the downloaded documents and the diff
  • Display a one-page report summarizing all news
  • Automatically open the report in your favourite web-browser
  • Quite customizable

ISSUES, TODO:

  • Improved support for robots.txt (test it)
  • The use of :website_below and :website is hardly tested (please report errors).
  • download => :body_html tries to rewrite references (a, img), which may fail on certain kinds of URLs (please report errors).
  • When using :body_html for download, it may happen that some JavaScript code is stripped, which breaks some JavaScript-generated links.
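Rewriting references presumably means resolving relative links against the URL of the page they were found on. A minimal sketch with Ruby's standard URI library, assuming that reading is right (absolute_reference is a hypothetical helper, not websitiary's own function):

```ruby
require 'uri'

# Resolve a possibly relative reference against the URL of its page.
# References that cannot be parsed are returned unchanged.
def absolute_reference(page_url, href)
  URI.join(page_url, href).to_s
rescue URI::Error
  href
end
```

For a page at http://www.example.com/foo/bar.html, the reference 'baz.html' would become http://www.example.com/foo/baz.html, while references that are already absolute stay as they are.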

== SYNOPSIS:

=== Usage Example:

Run "profile"

websitiary profile

Edit "~/.websitiary/profile.rb"

websitiary --edit=profile

View the latest report

websitiary --review

Refetch all sources regardless of :days and :hours restrictions

websitiary -signore_age=true

Create html and rss reports for my websites

websitiary -fhtml,rss mysites

For example output see:

=== Configuration

Profiles are plain ruby files (with the '.rb' suffix) stored in ~/.websitiary/.

The profile config.rb is always loaded if available.

==== default 'PROFILE1', 'PROFILE2' ...

Set the default profile(s).

Example:

default 'my_profile'

==== diff 'CMD "%s" "%s"'

Use this shell command to make the diff. %s %s will be replaced with the old and new filename.

diff is used by default.

==== diffprocess lambda {|text| ...}

Use this ruby snippet to post-process the diff.
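As an illustration, a diffprocess block could reduce a plain diff to the added lines only. This sketch assumes the normal output format of diff, in which added lines start with "> "; ADDED_ONLY is a hypothetical name chosen for this example.

```ruby
# Hypothetical diffprocess block: keep only the lines that diff marks as
# added ("> ") and drop the marker itself.
ADDED_ONLY = lambda do |text|
  text.lines
      .select { |line| line.start_with?('> ') }
      .map { |line| line.sub(/\A> /, '') }
      .join
end
```

In a profile, such a block could presumably be passed as diffprocess ADDED_ONLY, or written inline as a lambda.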

==== download 'CMD "%s"'

Use this shell command to download a page. %s will be replaced with the url.

w3m is used by default.

Example:

download 'lynx -dump "%s"'
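The 'CMD "%s"' templates used for diff, download, edit and view look like ordinary Ruby format strings. A minimal sketch of the substitution, assuming plain format semantics (fill_command is a hypothetical helper, not part of websitiary):

```ruby
# Fill a command template with its arguments: one "%s" for download, edit
# and view (url or filename), two for diff (old and new filename).
def fill_command(template, *args)
  format(template, *args)
end

download_cmd = fill_command('lynx -dump "%s"', 'http://www.example.com')
diff_cmd     = fill_command('diff "%s" "%s"', 'page.old', 'page.new')
```

Because the result is handed to a shell, a url or filename that itself contains double quotes would break such a command; the sketch does not guard against that.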

==== downloadprocess lambda {|text| ...}

Use this ruby snippet to post-process what was downloaded.
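For text dumps, a downloadprocess block can be as simple as cutting away the navigation that precedes the interesting part of a page. The sketch below is hypothetical: SKIP_TO_NEWS and the "Latest news" marker are made up for the example.

```ruby
# Hypothetical downloadprocess block: drop everything before the first line
# that starts with "Latest news"; keep the whole text if there is no marker.
SKIP_TO_NEWS = lambda do |text|
  lines = text.lines
  start = lines.index { |line| line.start_with?('Latest news') } || 0
  lines[start..-1].join
end
```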

==== edit 'CMD "%s"'

Use this shell command to edit a profile. %s will be replaced with the filename.

vi is used by default.

Example:

edit 'gvim "%s"&'

==== option TYPE, OPTION => VALUE

Set a global option.

TYPE can be one of:

:diff:: Generate a diff
:diffprocess:: Post-process a diff (if necessary)
:format:: Format the diff for output
:download:: Download webpages
:downloadprocess:: Post-process downloaded webpages
:page:: The :format field defines the format of the final report. Here VALUE is a format string that takes 3 variables as arguments: report title, toc, contents.

OPTION is a symbol.

VALUE is either a format string or a block of code (of class Proc).

Example:

set :download, :foo => lambda {|url| get_url(url)}
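A :page format string with its three arguments (report title, toc, contents) might look like the sketch below. The markup is invented for illustration and is not websitiary's default template; PAGE_FORMAT and render_page are hypothetical names.

```ruby
# Hypothetical report template: three %s slots, filled in the order
# report title, table of contents, contents.
PAGE_FORMAT = "<html><head><title>%s</title></head>\n" \
              "<body>\n%s\n%s\n</body></html>\n"

def render_page(title, toc, contents)
  format(PAGE_FORMAT, title, toc, contents)
end
```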

==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]

Set the output format. Format can be one of:

  • html
  • text, txt (this only works with text based downloaders)
  • rss (proof of concept only; it requires :rss[:url] to be set to the URL where the RSS feed will be published, using the configuration command option :rss, :url => URL; you must either use a text-based downloader or add :rss_format => 'html' to the URL options)

==== set OPTION => VALUE; set TYPE, OPTION => VALUE; unset OPTIONS

(Un)Set an option for the following source commands.

Example:

set :download, :foo => lambda {|url| get_url(url)}
set :days => 7, :sort => true
unset :days, :sort

==== source URL(S), [OPTIONS]

Options:

:cols => FROM..TO:: Use only these columns from the output (applied after the :lines option)

:depth => INTEGER:: In conjunction with a :website type of :download option, fetch url up to this depth.

:diff => "CMD", :diff => SHORTCUT:: Use this command to make the diff for this page. Possible values for SHORTCUT are: :webdiff (useful in conjunction with :download => :curl, :wget, or :body_html). :body_html, :website_below, :website and :openuri are synonyms for :webdiff.

:diffprocess => lambda {|text| ...}:: Use this ruby snippet to post-process this diff

:download => "CMD", :download => SHORTCUT:: Use this command to download this page. For possible values for SHORTCUT see the section on shortcuts below.

:downloadprocess => lambda {|text| ...}:: Use this ruby snippet to post-process what was downloaded. This is the place where, e.g., hpricot can be used to extract certain elements from the HTML code. Example: lambda {|text| Hpricot(text).at('div#content').inner_html}

:format => "FORMAT %s STRING", :format => SHORTCUT:: The format string for the diff text. The default (the :diff shortcut) wraps the output in +pre+ tags. :webdiff, :body_html, :website_below, :website, and :openuri will simply add a newline character.

:hours => HOURS, :days => DAYS:: Don't download the file again unless the stored copy is older than this

:ignore_age => true:: Ignore any :days and :hours settings. This is useful in some cases when set on the command line.

:lines => FROM..TO:: Use only these lines from the output

:match => REGEXP:: When recursively walking a website, follow only links that match this regexp.

:sort => true, :sort => lambda {|a,b| ...}:: Sort lines in output

:strip => true:: Strip empty lines

:title => "TEXT":: Display TEXT instead of URL

:use => SYMBOL:: Use SYMBOL for any other option. I.e. :download => :body_html :diff => :webdiff can be abbreviated as :use => :body_html (because for :diff :body_html is a synonym for :webdiff).
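The text-filtering options (:lines, :cols, :sort, :strip) can be pictured as a small pipeline over the lines of the dumped page. The sketch below is an assumption about how they combine, not websitiary's implementation: filter_output is a hypothetical helper, and only the ordering of :lines before :cols is taken from the description above.

```ruby
# Apply :lines, then :cols, then :strip and :sort to a text dump.
def filter_output(text, lines: nil, cols: nil, sort: false, strip: false)
  out = text.lines.map(&:chomp)
  out = out[lines] || [] if lines                # :lines => FROM..TO
  out = out.map { |line| line[cols] || '' } if cols # :cols => FROM..TO
  out = out.reject { |line| line.strip.empty? } if strip
  out = sort.respond_to?(:call) ? out.sort(&sort) : out.sort if sort
  out.join("\n")
end
```

With :lines => 10..-1, for instance, the first ten lines of a page (often a navigation header) would be ignored when comparing versions.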

Example configuration file extract:

source 'URL', :days => 7, :download => :lynx

# Daily
set :days => 1
source 'http://www.example.com', :use => :body_html,
    :downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}

# Weekly
set :days => 7
source 'http://www.example.com', :lines => 10..-1, :title => 'My Page'

# Bi-weekly
set :days => 14
source <<URLS
http://www.example.com
http://www.example.com/page.html
URLS

# Make HTML diffs and highlight occurrences of a word
source 'http://www.example.com',
    :title => 'Example',
    :use => :body_html,
    :diffprocess => highlighter(/word/i)

# Download the whole website below this path
# (only php and html pages)
source 'http://www.example.com/foo/bar.html',
    :title => 'Example -- Bar',
    :use => :website_below,
    :match => /\.(php|html)\b/,
    :depth => 2

unset :days
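The :days, :hours and :ignore_age options amount to an age check on the stored copy of a page. A minimal sketch of such a check follows; refetch? is a hypothetical helper, and adding :days and :hours together when both are given is an assumption, since the text does not say how they combine.

```ruby
# Decide whether a page should be downloaded again.
# `last_fetch` is the Time of the stored copy, or nil if there is none.
def refetch?(last_fetch, now: Time.now, days: nil, hours: nil, ignore_age: false)
  return true if ignore_age || last_fetch.nil?
  min_age = (days || 0) * 86_400 + (hours || 0) * 3_600 # in seconds
  now - last_fetch >= min_age
end
```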

==== view 'CMD "%s"'

Use this shell command to view the output (usually an HTML file). %s will be replaced with the filename.

w3m is used by default.

Example:

view 'gnome-open "%s"' # Gnome Desktop
view 'kfmclient "%s"' # KDE
view 'cygstart "%s"' # Cygwin
view 'start "%s"' # Windows
view 'firefox "%s"'

=== Shortcuts for use with :use, :download and other options

:w3m:: Use w3m for downloading the source. Use diff for generating diffs.

:lynx:: Use lynx for downloading the source. Use diff for generating diffs.

:links:: Use links for downloading the source. Use diff for generating diffs.

:curl:: Use curl for downloading the source. Use webdiff for generating diffs.

:wget:: Use wget for downloading the source. Use webdiff for generating diffs.

:openuri:: Use open-uri for downloading the source. Use webdiff for generating diffs.

:body_html:: This requires hpricot to be installed. Use open-uri for downloading the source, use only the body. Use webdiff for generating diffs. Try to rewrite references (a, img) so that they point to the webpage. By default, this will also strip tags like script, form, object ...

:website:: Use :body_html to download the source. Follow all links referring to the same host with the same file suffix. Use webdiff for generating diffs.

:website_below:: Use :body_html to download the source. Follow all links referring to the same host and a file below the top directory with the same file suffix. Use webdiff for generating diffs.

:website_txt:: Use :website to download the source but convert the output to plain text.

:website_txt_below:: Use :website_below to download the source but convert the output to plain text.

Any shortcuts relying on :body_html will also try to rewrite any references so that the links point to the webpage.
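The link-following rules of :website and :website_below (same host, same file suffix, optionally below the top directory, optionally restricted by :match) can be sketched with Ruby's standard URI library. This is an interpretation of the description above rather than websitiary's actual code; follow_link? is a hypothetical helper, and :depth handling is left out.

```ruby
require 'uri'

# Should `link`, found on the page at `start_url`, be followed?
def follow_link?(start_url, link, below: false, match: nil)
  start = URI.parse(start_url)
  target = URI.join(start_url, link)
  return false unless target.host == start.host
  return false unless File.extname(target.path) == File.extname(start.path)
  top_dir = start.path.sub(%r{[^/]*\z}, '') # directory part of the start page
  return false if below && !target.path.start_with?(top_dir)
  return false if match && target.to_s !~ match
  true
rescue URI::Error
  false # unparsable links are never followed
end
```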

== REQUIREMENTS:

websitiary is a ruby-based application. You thus need a ruby interpreter.

Whether you actually need the following libraries and applications depends on how you use websitiary.

By default this script expects the following applications to be present:

  • diff
  • vi (or some other editor)

and one of:

  • w3m (default)
  • lynx
  • links

The use of :webdiff as the :diff application requires websec[http://download.savannah.gnu.org/releases/websec/] to be installed. In conjunction with :body_html, :openuri, or :curl, this will give you colored HTML diffs. Why not use +websec+ itself if I have to install it anyway, you might ask. Well, +websec+ is written in perl and I didn't quite manage to make it work the way I wanted. websitiary is meant to be easier to configure.

For downloading HTML, you need one of these:

  • open-uri
  • wget
  • curl

The following ruby libraries are needed in conjunction with :body_html and :website related shortcuts:

  • hpricot
  • robot_rules.rb

I personally would suggest to choose the following setup:

== INSTALL:

=== Use rubygems

Run

gem install websitiary

This will download the package and install it.

=== Use the zip

The zip[http://rubyforge.org/frs/?group_id=4030] contains a file setup.rb that does the work. Run

ruby setup.rb

=== Copy Manually

Get the single-file websitiary[http://rubyforge.org/frs/?group_id=4030] script and copy it to some directory in $PATH.

=== Initial Configuration

Please check the requirements section above and get the extra libraries needed:

  • hpricot
  • robot_rules.rb

You might then want to create a profile ~/.websitiary/config.rb that is loaded on every run. In this profile you could set the default output viewer and profile editor, as well as a default profile.

Example:

# Load standard.rb if no profile is given on the command line.
default 'standard'

# Use cygwin's cygstart to view the output with the default HTML 
# viewer
view '/usr/bin/cygstart "%s"'

# Use Windows gvim from cygwin ruby which is why we convert the path 
# first
edit 'gvim $(cygpath -w -- "%s")'

The location of these configuration files may differ. If the environment variable $HOME is defined, the default is $HOME/.websitiary/ unless one of the following directories exists, in which case that directory is used instead:

  • $USERPROFILE/websitiary (on Windows)
  • SYSCONFDIR/websitiary (where SYSCONFDIR is usually /etc; to find out, run: ruby -rrbconfig -e "p RbConfig::CONFIG['sysconfdir']"; older rubies expose the same hash as Config::CONFIG)

If neither directory exists and no $HOME variable is defined, the current directory will be used.
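The lookup described above can be sketched as follows. config_dir is a hypothetical helper; the order in which the two alternative directories are tried and the '/etc' default for SYSCONFDIR are assumptions, and the environment and the directory test are passed in so that the logic can be tried without touching the real file system.

```ruby
# Resolve the configuration directory: an existing $USERPROFILE/websitiary
# or SYSCONFDIR/websitiary wins, then $HOME/.websitiary, then the current
# directory.
def config_dir(env: ENV, sysconfdir: '/etc', exists: File.method(:directory?))
  candidates = []
  candidates << File.join(env['USERPROFILE'], 'websitiary') if env['USERPROFILE']
  candidates << File.join(sysconfdir, 'websitiary')
  found = candidates.find { |dir| exists.call(dir) }
  return found if found
  return File.join(env['HOME'], '.websitiary') if env['HOME']
  Dir.pwd
end
```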

== LICENSE:

websitiary Webpage Monitor
Copyright (C) 2007 Thomas Link

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

vi: ft=rd:tw=72:ts=4

Package last updated on 25 Jul 2009
