Floodesh
Floodesh is a middleware-based web spider written in Node.js. "Floodesh" is a combination of two words, flood and mesh.
Requirements
Gearman Server Installation
Make sure g++, make, libboost-all-dev, gperf, libevent-dev and uuid-dev have been installed.
wget -O - https://launchpad.net/gearmand/1.2/1.1.12/+download/gearmand-1.1.12.tar.gz | tar xzvf -
cd gearmand-1.1.12
./configure
make
make install
Install
$ npm install -g floodesh-cli
Usage
Generate a new app from templates with a single command.
$ mkdir floodesh_demo
$ cd floodesh_demo
$ floodesh-cli init # all necessary files will be generated in your directory.
Before use, make sure /data/tests and /var/log/bda/tests exist and are writable. You can customize the path by modifying logBaseDir in config/[env]/index.js.
Context
A context instance is a kind of finite-state machine implemented with generators, an ECMAScript 6 feature. Through the context, we can access almost all fields of the response and request, for example:
worker.use((ctx, next) => {
  ctx.content = ctx.body.toString();
  return next();
});
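To illustrate how middlewares of the form (ctx, next) hand control to one another, here is a minimal sketch. The compose helper is hypothetical and is not floodesh's actual implementation; it only shows the calling pattern, and it handles synchronous middlewares only (Koa-style composition also wraps each step in a Promise).

```javascript
// Hypothetical helper: chains (ctx, next) middlewares in registration order.
// Illustration only, not floodesh's actual implementation.
function compose(middlewares) {
  return function run(ctx) {
    function dispatch(i) {
      if (i === middlewares.length) return undefined; // end of the chain
      const fn = middlewares[i];
      // `next` hands control to the following middleware.
      return fn(ctx, () => dispatch(i + 1));
    }
    return dispatch(0);
  };
}

// A stand-in context; in floodesh, ctx.body would be the response Buffer.
const ctx = { body: Buffer.from('<html>hello</html>') };

const run = compose([
  (ctx, next) => {
    ctx.content = ctx.body.toString(); // same step as the example above
    return next();
  },
  (ctx, next) => {
    ctx.contentLength = ctx.content.length; // sees what the first one set
    return next();
  },
]);

run(ctx);
console.log(ctx.content, ctx.contentLength);
```

Each middleware decides whether to continue by calling next(); returning without calling it stops the chain in this sketch.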
Request
ctx.querystring
Get querystring.
ctx.idempotent
Check if the request is idempotent.
ctx.search
Get the search string. Unlike querystring, it includes the leading "?".
ctx.method
Get request method.
ctx.query
Get parsed query-string.
ctx.path
Get the request pathname.
ctx.url
Return the request URL, the same as ctx.href.
ctx.origin
Get the origin of URL, for instance, "https://www.google.com".
ctx.protocol
Return the protocol string "http:" or "https:".
ctx.host
Parse the "Host" header field host and support X-Forwarded-Host when a proxy is enabled.
ctx.hostname
Parse the "Host" header field hostname and support X-Forwarded-Host when a proxy is enabled.
ctx.secure
Check if protocol is https.
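As a rough illustration of how the request fields above relate to one another, the following sketch breaks a URL apart with Node's built-in URL class. It does not use floodesh itself, and the example URL is made up; the property names on the left simply mirror the ctx fields described above.

```javascript
// Node's built-in WHATWG URL class, used here only to illustrate the fields.
const url = new URL('https://www.google.com/search?q=floodesh&page=2');

const parts = {
  href: url.href,                              // the full URL
  origin: url.origin,                          // scheme plus host
  protocol: url.protocol,                      // includes the trailing ":"
  host: url.host,
  hostname: url.hostname,
  path: url.pathname,
  search: url.search,                          // keeps the leading "?"
  querystring: url.search.slice(1),            // same string without the "?"
  query: Object.fromEntries(url.searchParams), // parsed key/value pairs
  secure: url.protocol === 'https:',
};

console.log(parts);
```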
Response
ctx.status
Get status code from response.
ctx.message
Get status message from response.
ctx.body
Get the response body in Buffer.
ctx.length
Get length of response body.
ctx.type
Get the response mime type, for instance, "text/html".
ctx.lastModifieds
Get the Last-Modified date in Date form, if it exists.
ctx.etag
Get the ETag of a response.
Return the response header.
ctx.contentType
ctx.get(key)
Get a value by key from the response headers.
ctx.is(types)
types String|Array
Return: String|false|null
Check if the incoming response contains the "Content-Type" header field, and whether it contains any of the given mime types. If there is no response body, null is returned. If there is no content type, false is returned. Otherwise, it returns the first type that matches.
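The three possible return values can be sketched with a stand-alone function. This is a simplified illustration, not floodesh's code: the function name and its hasBody and contentType parameters are hypothetical, it only does exact mime-type comparison, and returning false when nothing matches is an assumption.

```javascript
// Hypothetical stand-alone version of the check described above.
// `hasBody` and `contentType` stand in for what the context knows
// about the response.
function is(hasBody, contentType, types) {
  if (!hasBody) return null;       // no response body
  if (!contentType) return false;  // no Content-Type header
  const list = Array.isArray(types) ? types : [types];
  // Drop parameters such as "; charset=utf-8" before comparing.
  const actual = contentType.split(';')[0].trim().toLowerCase();
  const match = list.find((t) => t.toLowerCase() === actual);
  // Assumption: no match is reported as false.
  return match === undefined ? false : match;
}

console.log(is(true, 'text/html; charset=utf-8', ['application/json', 'text/html']));
console.log(is(false, 'text/html', 'text/html'));
console.log(is(true, undefined, 'text/html'));
```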
Other
tasks
Array of pending crawling tasks. A task is an object consisting of Options and next, where next is the name of the function in your spider to call for the next task. Supported format:
[{
opt:[Options](https://github.com/request/request#requestoptions-callback),
next:String
}]
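A minimal sketch of queueing tasks in this format follows. The queueDetailPages helper, the parseDetail function name and the example URLs are all hypothetical, and the ctx object is a plain stand-in rather than a real floodesh context.

```javascript
// Hypothetical helper: queue one follow-up task per link.
// `parseDetail` is an assumed name of a function defined in your spider.
function queueDetailPages(ctx, links) {
  for (const uri of links) {
    ctx.tasks.push({
      opt: { uri, method: 'GET' }, // options in the request library's format
      next: 'parseDetail',         // name of the spider function to call
    });
  }
}

const ctx = { tasks: [] }; // stand-in for a real context
queueDetailPages(ctx, ['https://example.com/a', 'https://example.com/b']);
console.log(ctx.tasks);
```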
dataSet
dataSet is a map that stores results, which will be parsed and saved by floodesh.
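The sketch below shows one way results might be stored. It assumes dataSet behaves like a standard JavaScript Map, which the description suggests but does not confirm; if it is a plain object, property assignment would be used instead. The saveResult helper, the 'items' key and the stand-in ctx are hypothetical.

```javascript
// Stand-in context; assumes dataSet is a standard JavaScript Map.
const ctx = {
  body: Buffer.from('{"title":"demo"}'),
  dataSet: new Map(),
};

// Hypothetical parsing step: decode the body and store the result under a key.
function saveResult(ctx) {
  const parsed = JSON.parse(ctx.body.toString());
  ctx.dataSet.set('items', [parsed]); // 'items' is an arbitrary example key
}

saveResult(ctx);
console.log(ctx.dataSet.get('items'));
```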
Middlewares