goose-parser
This tool takes routine crawling to a new level. It can parse a web page in a few moments. All you need to do is specify parsing rules based on CSS selectors. It's so simple that even a goose can do it. The library can parse data types such as grids, collections, and simple objects. The parser supports pagination via infinite scroll and pages. It also offers pre-parse actions and post-parse transformations.
npm install goose-parser
This library depends on PhantomJS 2.0. Follow the instructions provided by the link or build it manually.
All CSS selectors can be written in Sizzle format.
An environment is the special context in which Parser has to be executed. Its main purpose is to provide a method for evaluating JS on the page.
This environment is used for running Parser on Node.
var env = new PhantomEnvironment({
url: 'http://google.com',
});
The only required parameter is url. It contains the URL of the site where Parser will start.
This environment can take snapshots and supports proxy lists, a custom proxy rotator, white and black lists for resource loading, and other features. Find more info about the options here.
This environment is used for running Parser in the browser.
var env = new BrowserEnvironment();
To create a packed JS file with Parser, execute the following command:
npm run build
Parser.js is the main component of the package which performs page parsing.
var parser = new Parser({
environment: env,
pagination: pagination
});
Fields:
parser.parse({
actions: actions,
rules: parsingRules
});
Fields:
Add a custom action using the addAction method. The custom function runs in the context of Actions.
Example
parser.addAction('custom-click', function(options) {
// do some action
});
Params:
Add a custom transformation using the addTransformation method.
Example
parser.addTransformation('custom-transform', function (options, result) {
return result + options.increment;
});
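A minimal sketch of how a named transformation with the (options, result) signature could be registered and then applied in sequence. The TransformRegistry class and its apply method are hypothetical stand-ins written for illustration; they are not goose-parser's internals.

```javascript
// Hypothetical stand-in for a transformation registry (not the library's real implementation).
class TransformRegistry {
  constructor() {
    this.transforms = new Map();
  }

  // Mirrors the addTransformation(type, fn) call shape shown above.
  addTransformation(type, fn) {
    this.transforms.set(type, fn);
  }

  // Applies each step in order, feeding the previous result into the next step.
  apply(steps, value) {
    return steps.reduce((result, step) => {
      const fn = this.transforms.get(step.type);
      if (!fn) {
        throw new Error('Unknown transformation: ' + step.type);
      }
      return fn(step, result);
    }, value);
  }
}

const registry = new TransformRegistry();
registry.addTransformation('custom-transform', function (options, result) {
  return result + options.increment;
});

const transformed = registry.apply(
  [
    { type: 'custom-transform', increment: 2 },
    { type: 'custom-transform', increment: 3 }
  ],
  10
);
console.log(transformed); // 15
```

Each step object doubles as the options argument, which is why increment sits next to type in the step definition.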
Params:
Add a custom pagination using the addPagination method. The custom pagination functions run in the context of Paginator.
Usage
parser.addPagination('custom-pagination', function (options) {
// Paginate function
// return vow.resolve();
}, function (options, timeout) {
// Check pagination function
// return vow.resolve();
});
Params:
The paginate function should return a Promise.
The check-pagination function should return a Promise.
Example
Describe new pagination type
var previousPageHtml;
parser.addPagination('clickPerPage', function (options) {
var selector = options.scope + ':eq(' + this._currentPage + ')';
return this._env
.evaluateJs(options.pageScope, this._getPaginatePageHtml)
.then(function (html) {
previousPageHtml = html;
return this._actions.click(selector);
}, this);
}, function (options, timeout) {
return this._actions.wait(this._getPaginatePageHtml, function (html) {
return html !== null && html !== previousPageHtml;
}, [options.pageScope], timeout)
});
Use it
var pagination = {
type: 'clickPerPage', // your custom type
pageScope: '.page__content',
scope: '.page__pagination'
};
var parser = new Parser({
environment: env,
pagination: pagination
});
parser.parse({
rules: {}
});
The purpose of this rule is to retrieve simple textual node value(s).
Example:
Parsing rule
{
name: 'node',
scope: 'div.simple-node'
}
HTML
<div class='simple-node'>simple-value</div>
Parsing result
{
node: 'simple-value'
}
Fields:
The result of an id rule is stored under the name _id. If a function is specified, the parser will call it for each row. See more info in the example.
Attribute values are read with node.getAttribute(attr).
The purpose of this rule is to retrieve a collection of nodes.
Example:
Parsing rule
{
name: 'row',
scope: 'div.collection-node',
collection: [
{
name: 'node1',
scope: 'div.simple-node1'
},
{
name: 'node2',
scope: 'div.simple-node2'
},
{
name: 'nested',
scope: 'div.nested-node',
collection: [
{
name: 'node3',
scope: 'div.simple-node3'
}
]
}
]
}
HTML
<div class='collection-node'>
<div class='simple-node1'>simple-value1</div>
<div class='simple-node2'>simple-value2</div>
<div class='nested-node'>
<div class='simple-node3'>simple-value3</div>
</div>
</div>
Parsing result
{
row: {
node1: 'simple-value1',
node2: 'simple-value2',
nested: {
node3: 'simple-value3'
}
}
}
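To make the mapping from rule to result concrete, here is a minimal sketch that resolves the collection rule above against an in-memory stand-in for the page. The page object and the applyRule function are hypothetical; the real parser evaluates selectors inside an environment rather than walking a plain object.

```javascript
// Hypothetical stand-in for the page: children are keyed by their selector,
// and leaf values hold the text of the node.
const page = {
  'div.collection-node': {
    'div.simple-node1': 'simple-value1',
    'div.simple-node2': 'simple-value2',
    'div.nested-node': {
      'div.simple-node3': 'simple-value3'
    }
  }
};

// A rule with `collection` yields an object built from its child rules;
// a rule without it yields the text found at its scope.
function applyRule(rule, root) {
  const node = root[rule.scope];
  if (!rule.collection) {
    return node;
  }
  const result = {};
  rule.collection.forEach(function (child) {
    result[child.name] = applyRule(child, node);
  });
  return result;
}

const rule = {
  name: 'row',
  scope: 'div.collection-node',
  collection: [
    { name: 'node1', scope: 'div.simple-node1' },
    { name: 'node2', scope: 'div.simple-node2' },
    {
      name: 'nested',
      scope: 'div.nested-node',
      collection: [{ name: 'node3', scope: 'div.simple-node3' }]
    }
  ]
};

const output = {};
output[rule.name] = applyRule(rule, page);
console.log(JSON.stringify(output, null, 2));
```

Nested collections recurse with the parent node as the new root, which mirrors how each child scope is resolved relative to its parent rule.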
Fields:
The purpose of this rule is to retrieve a collection of collections.
Example:
Parsing rule
{
scope: 'div.collection-node',
collection: [[
{
name: 'node1',
scope: 'div.simple-node1'
},
{
name: 'node2',
scope: 'div.simple-node2'
}
]]
}
HTML
<div>
<div class='collection-node'>
<div class='simple-node1'>simple-value1</div>
<div class='simple-node2'>simple-value2</div>
</div>
<div class='collection-node'>
<div class='simple-node1'>simple-value3</div>
<div class='simple-node2'>simple-value4</div>
</div>
</div>
Parsing result
[
{
node1: 'simple-value1',
node2: 'simple-value2'
},
{
node1: 'simple-value3',
node2: 'simple-value4'
}
]
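The difference from the collection rule is that the double array produces one result object per matched scope. A minimal sketch of that behavior follows, assuming the matched scopes have already been extracted into plain objects; matchedScopes and applyGridRule are hypothetical names used only for illustration.

```javascript
// Hypothetical stand-in for the matched scopes: one entry per div.collection-node on the page.
const matchedScopes = [
  { 'div.simple-node1': 'simple-value1', 'div.simple-node2': 'simple-value2' },
  { 'div.simple-node1': 'simple-value3', 'div.simple-node2': 'simple-value4' }
];

// A grid rule wraps its child rules in a double array: collection: [[ ... ]].
// Every matched scope becomes one row in the resulting array.
function applyGridRule(rule, scopes) {
  const childRules = rule.collection[0];
  return scopes.map(function (scope) {
    const row = {};
    childRules.forEach(function (child) {
      row[child.name] = scope[child.scope];
    });
    return row;
  });
}

const gridRule = {
  scope: 'div.collection-node',
  collection: [[
    { name: 'node1', scope: 'div.simple-node1' },
    { name: 'node2', scope: 'div.simple-node2' }
  ]]
};

const rows = applyGridRule(gridRule, matchedScopes);
console.log(rows);
```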
Fields:
Parsing rule with id = function
var id = 0;
{
scope: 'div.collection-node',
collection: [[
{
id: function (rule, result) {
return ++id;
},
scope: 'simple-reference'
},
{
name: 'node',
scope: 'div.simple-node'
}
]]
}
Parsing rule with id from scope
{
scope: 'div.collection-node',
collection: [[
{
id: true,
scope: 'simple-reference'
},
{
name: 'node',
scope: 'div.simple-node'
}
]]
}
HTML
<div>
<div class='collection-node'>
<div class='simple-reference'>1</div>
<div class='simple-node'>simple-value1</div>
</div>
<div class='collection-node'>
<div class='simple-reference'>2</div>
<div class='simple-node'>simple-value2</div>
</div>
</div>
Parsing result
[
{
_id: 1,
node: 'simple-value1'
},
{
_id: 2,
node: 'simple-value2'
}
]
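The two id variants can be compared side by side with a small sketch. It assumes that a child rule carrying id writes its value to _id instead of a named field, and that the id function receives the rule and the value found at its scope; both points are assumptions, and scopes and applyGridRuleWithId are hypothetical names.

```javascript
// Hypothetical stand-in for the matched scopes: one object per div.collection-node.
const scopes = [
  { 'simple-reference': '1', 'div.simple-node': 'simple-value1' },
  { 'simple-reference': '2', 'div.simple-node': 'simple-value2' }
];

// Assumed behavior: a child rule with `id` fills `_id`, every other child fills its `name`.
function applyGridRuleWithId(rule, matched) {
  return matched.map(function (scope) {
    const row = {};
    rule.collection[0].forEach(function (child) {
      const value = scope[child.scope];
      if (typeof child.id === 'function') {
        row._id = child.id(child, value); // argument meaning is an assumption
      } else if (child.id === true) {
        row._id = value; // taken from the text at the scope
      } else {
        row[child.name] = value;
      }
    });
    return row;
  });
}

let counter = 0;
const byFunction = applyGridRuleWithId({
  scope: 'div.collection-node',
  collection: [[
    { id: function (rule, result) { return ++counter; }, scope: 'simple-reference' },
    { name: 'node', scope: 'div.simple-node' }
  ]]
}, scopes);

const byScope = applyGridRuleWithId({
  scope: 'div.collection-node',
  collection: [[
    { id: true, scope: 'simple-reference' },
    { name: 'node', scope: 'div.simple-node' }
  ]]
}, scopes);

console.log(byFunction, byScope);
```

In this sketch the scope-based variant keeps the id as the raw text of the node, so any numeric conversion would have to happen in a transformation.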
This is a way to parse collection-based data. See more info in Paginator.js
This type of pagination allows parsing collections with infinite scroll.
{
type: 'scroll',
interval: 500
}
Fields:
This type of pagination allows parsing collections with ajax-page pagination.
JS definition
{
type: 'page',
scope: '.page',
pageScope: '.pageContainer',
}
HTML
<div>
<div class='pageContainer'>
<div class='collection-node'>
<div class='simple-node1'>simple-value1</div>
<div class='simple-node2'>simple-value2</div>
</div>
<div class='collection-node'>
<div class='simple-node1'>simple-value3</div>
<div class='simple-node2'>simple-value4</div>
</div>
</div>
<div class='pagination'>
<div class='page'>1</div>
<div class='page'>2</div>
<div class='page'>3</div>
</div>
</div>
Fields:
Actions are executed on the page before the parse process. Every action can return a result of its execution.
Click the element on the page.
Example:
{
type: 'click',
scope: '.open-button'
}
Fields:
type: click for that action.
Wait for the element on the page.
Example:
{
type: 'wait',
scope: '.open-button.done'
}
Fields:
type: wait for that action.
Type text into the element.
Example:
{
type: 'type',
scope: 'input',
text: 'Some text to enter'
}
Fields:
type: type for that action.
Check whether the element exists on the page.
Example:
{
type: 'exist',
scope: '.some-element'
}
Fields:
type: exist for that action.
An action that helps to create an if statement based on another action.
Example:
{
type: 'conditionalActions',
conditions: [
{
type: 'exist',
scope: '.element-to-check'
}
],
actions: [
{
type: 'click',
scope: '.element-to-check',
waitForPage: true
}
]
}
In this example, the parser checks whether the element .element-to-check is present on the page and, if so, clicks it.
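The if-like behavior can be sketched in a few lines. This is a simplified, synchronous illustration: presentSelectors, handlers, and runAction are hypothetical names, and the real actions run against the page asynchronously.

```javascript
// Hypothetical stand-in for the page: the set of selectors that currently match an element.
const presentSelectors = new Set(['.element-to-check']);
const clicked = [];

// Hypothetical handlers keyed by action type.
const handlers = {
  exist: function (action) {
    return presentSelectors.has(action.scope);
  },
  click: function (action) {
    clicked.push(action.scope);
    return true;
  }
};

// Runs the nested actions only when every condition returns a truthy result.
function runAction(action) {
  if (action.type === 'conditionalActions') {
    const allPassed = action.conditions.every(function (condition) {
      return runAction(condition);
    });
    if (!allPassed) {
      return false;
    }
    action.actions.forEach(function (step) {
      runAction(step);
    });
    return true;
  }
  return handlers[action.type](action);
}

const ran = runAction({
  type: 'conditionalActions',
  conditions: [{ type: 'exist', scope: '.element-to-check' }],
  actions: [{ type: 'click', scope: '.element-to-check' }]
});
console.log(ran, clicked);
```

When the condition fails, the click handler is never invoked, so the page state is left untouched.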
Fields:
type: conditionalActions for that action.
Add a custom action using the addAction method. The custom function runs in the context of Actions.
Example
actions.addAction('custom-click', function(options) {
// do some action
});
Transformations convert a parsed value into a specific form.
Format a date to a specific view (using Moment.js).
{
type: 'date',
locale: 'ru',
from: 'HH:mm D MMM YYYY',
to: 'YYYY-MM-DD'
}
Replace a value using a regular expression.
{
type: 'replace',
re: ['\\s', 'g'],
to: ''
}
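A minimal sketch of what this definition would do to a parsed value, assuming that re holds the pattern and flags passed to the RegExp constructor and that to is the replacement string. The applyReplace function is a hypothetical illustration, not the library's code.

```javascript
// Assumption: `re` is [pattern, flags] for the RegExp constructor, `to` is the replacement.
function applyReplace(options, value) {
  const regex = new RegExp(options.re[0], options.re[1]);
  return String(value).replace(regex, options.to);
}

// Strips all whitespace, as in the definition above.
const cleaned = applyReplace({ type: 'replace', re: ['\\s', 'g'], to: '' }, ' 1 250 000 ');
console.log(cleaned); // "1250000"
```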
Add a custom transformation using the addTransformation method.
Example
transformations.addTransformation('custom-transform', function (options, result) {
return result + options.increment;
});
To run the tests, use the command:
npm test
To run the tests in the browser, first build them with the command:
npm run build-test
Then run the built file in the browser.
All parser components are instrumented with the debug library, which makes the application easy to debug.
Set the DEBUG variable to the names of the JS files whose debug information you want to show.
DEBUG=Parser,Actions node app.js
var env = new PhantomEnvironment({
url: uri,
screen: {
width: 1080,
height: 200
}
});
var parser = new Parser({
environment: env,
pagination: {
type: 'scroll',
interval: 500
}
});
parser.parse({
actions: [
{
type: 'wait',
timeout: 2 * 60 * 1000,
scope: '.container',
parentScope: 'body'
}
],
rules: {
scope: '.outer-wrap',
collection: [[
{
name: 'node1',
scope: '.node1',
actions: [
{
type: 'click',
scope: '.prepare-node1'
},
{
type: 'wait',
scope: '.prepare-node1.clicked'
}
],
collection: [
{
name: 'subNode',
scope: '.sub-node',
collection: [[
{
name: 'date',
scope: '.date-node',
transform: [
{
type: 'date',
locale: 'ru',
from: 'HH:mm D MMM YYYY',
to: 'YYYY-MM-DD'
}
]
},
{
name: 'number',
scope: '.number-node'
}
]]
}
]
},
{
name: 'prices',
scope: '.price'
}
]]
}
}).done(function (results) {
// do whatever with results
});
FAQs
Multi environment web page parser
We found that goose-parser demonstrated an unhealthy version release cadence and project activity because the last version was released a year ago. It has 2 open source maintainers collaborating on the project.