Pismo extracts machine-usable metadata from unstructured (or poorly structured) English-language HTML documents. The data Pismo can extract includes titles, feed URLs, ledes, body text, image URLs, dates, and keywords.
All tests pass on Ruby 1.9.3 and 2.0.0. Pismo currently fails on JRuby 1.7.2 due to dependencies.
February 27, 2013: Version 0.7.4 has been released to ensure Ruby 2.0.0 compatibility, but significant pull requests have yet to be reviewed and merged.
December 19, 2010: Version 0.7.2 has been released - it includes a patch from Darcy Laycock to fix keyword extraction problems on some pages, has switched from Jeweler to Bundler for management of the gem, and adds support for JRuby 1.5.6 by skipping stemming on that platform.
A basic example of extracting metadata from a Web page:
require 'pismo'
# Load a Web page (you could pass an IO object or a string with existing HTML data along, as you prefer)
doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')
doc.title # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
doc.author # => "Peter Cooper"
doc.lede # => "Cramp (GitHub repo) is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
doc.keywords # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
There's also a shorter "convenience" method which might be handy in IRB - it does the same as Pismo::Document.new:
Pismo['http://www.rubyflow.com/items/4082'].title # => "Install Ruby as a non-root User"
The current metadata methods are:
These methods are not fully documented here yet - you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
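As a minimal sketch of choosing a "best" result yourself, the helper below picks one title from an array of candidates such as the one #titles returns. The helper name pick_best_title and its shortest-wins heuristic are illustrative assumptions, not part of Pismo, and the sample candidates are made up for the example.

```ruby
# Pick one "best" title from a list of candidates, such as the array that
# doc.titles returns. The heuristic is an illustrative assumption, not
# something Pismo does itself: drop blanks and duplicates, then prefer the
# shortest candidate, since longer ones often carry a " | Site Name" suffix.
def pick_best_title(candidates)
  cleaned = candidates.compact.map(&:strip).reject(&:empty?).uniq
  cleaned.min_by(&:length)
end

# Hypothetical candidates, standing in for doc.titles:
candidates = [
  "Cramp: Asychronous Event-Driven Ruby Web App Framework | Ruby Inside",
  "Cramp: Asychronous Event-Driven Ruby Web App Framework",
  "  ",
  nil
]

best = pick_best_title(candidates)
puts best
```

With a loaded document, the same helper could be called as pick_best_title(doc.titles).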
The html_body and body methods will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader". #body returns it as plain text, while #html_body preserves some basic HTML styling.
The default reader is the "tree" reader. This works in a similar fashion to Arc90's Readability or Safari Reader algorithm.
New! The keywords method accepts optional arguments. These are the current defaults:
:stem_at => 20, :word_length_limit => 15, :limit => 20, :remove_stopwords => true, :minimum_score => 2
You can also pass an array to keywords with :hints => arr if you want only words of your choosing to be found.
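One way to keep overrides tidy is to merge them into the documented defaults before passing them on. The sketch below assumes the options are given to keywords as a single hash; KEYWORD_DEFAULTS and keyword_options are hypothetical names introduced for illustration and are not part of Pismo.

```ruby
# Defaults copied from the list above.
KEYWORD_DEFAULTS = {
  :stem_at => 20,
  :word_length_limit => 15,
  :limit => 20,
  :remove_stopwords => true,
  :minimum_score => 2
}.freeze

# Hypothetical helper (not part of Pismo): merge overrides into the defaults
# and reject misspelled option names before they reach doc.keywords.
# :hints is allowed in addition to the options that have defaults.
def keyword_options(overrides = {})
  unknown = overrides.keys - KEYWORD_DEFAULTS.keys - [:hints]
  unless unknown.empty?
    raise ArgumentError, "unknown keyword option(s): #{unknown.inspect}"
  end
  KEYWORD_DEFAULTS.merge(overrides)
end

opts = keyword_options(:limit => 5, :minimum_score => 3)
# With a loaded document you would then call: doc.keywords(opts)
p opts
```

Rejecting unknown keys up front means a typo such as :limt fails loudly instead of silently falling back to the default.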
There are some shortcomings or problems that I'm aware of and am going to pursue:
A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is great for testing, or for calling Pismo from a non-Ruby script. The output is currently in YAML.
./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
---
:url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
:title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
:lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
:author: Peter Cooper
:datetime: 2010-01-07 12:00:00 +00:00
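If you capture that output from another process, it can be read back with Ruby's standard YAML library. The sketch below uses the sample output above as a literal string in place of captured command output, and assumes a Ruby recent enough to support the permitted_classes: keyword of YAML.safe_load, which is newer than the Ruby versions mentioned earlier.

```ruby
require 'yaml'

# Sample output copied from the run above; in practice this string would
# come from capturing the command's standard output.
output = <<~YAML
  ---
  :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
  :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
  :lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
  :author: Peter Cooper
  :datetime: 2010-01-07 12:00:00 +00:00
YAML

# The keys are Ruby symbols and the datetime is a timestamp, so both classes
# must be permitted explicitly when loading safely.
metadata = YAML.safe_load(output, permitted_classes: [Symbol, Time])

puts metadata[:title]
puts metadata[:author]
```

Note that the keys come back as symbols (:title, :author), not strings.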
If you call pismo with a URL and no other arguments, it starts an IRB session so you can work directly in Ruby. The URL provided is loaded and assigned to both the constant 'P' and the variable @p.
You can access Pismo's stopword list directly:
Pismo.stopwords # => [.., .., ..]
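As a rough illustration of putting such a list to use, the sketch below filters stopwords out of an array of words. The helper remove_stopwords and the short sample list are assumptions made for the example. With the gem loaded you could pass Pismo.stopwords instead, assuming it returns an array of lowercase strings.

```ruby
# Remove stopwords from a list of words. The comparison is case-insensitive
# and assumes the stopword list holds plain strings.
def remove_stopwords(words, stopwords)
  lookup = {}
  stopwords.each { |w| lookup[w.downcase] = true }
  words.reject { |w| lookup[w.downcase] }
end

# Hypothetical sample list; with the gem loaded you could pass Pismo.stopwords.
sample_stopwords = %w[a the is of and]
words = %w[Cramp is a new framework of the Rails era]

filtered = remove_stopwords(words, sample_stopwords)
p filtered
```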
Pismo supports different readers for extracting the #body and #html_body from the web page.
The "cluster" reader uses an algorithm that tries to cluster contiguous content blocks together to identify the main document body. This is based on the ExtractContent gem (http://rubyforge.org/projects/extractcontent/).
The reader can be specified as an option to Pismo::Document.new:
doc = Pismo::Document.new(url, :reader => :cluster)
Apache 2.0 License - See LICENSE for details. Copyright (c) 2009, 2010 Peter Cooper
In short, you can use Pismo for whatever you like, commercial or not, but please include a brief credit (as in the NOTICE file, as per the Apache 2.0 License) somewhere deep in your license file or similar. If you're nice and have the time, let me know if you're using it and/or share any significant changes or improvements you make.