# crawler-ts
Lightweight crawler written in TypeScript using ES6 generators.
## Installation

```bash
npm install --save crawler-ts crawler-ts-htmlparser2
```
## Examples
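A minimal end-to-end sketch follows. It assumes the HTML `requester`, `parser` and `follower` are exported from `crawler-ts-htmlparser2` under exactly those names, that locations are WHATWG `URL` objects, and that the value returned by `createCrawler` is an async generator function taking a start location; verify all of these against the packages.

```typescript
import { createCrawler } from 'crawler-ts';
// The three export names below are assumed for illustration; check the package for the real ones.
import { requester, parser, follower } from 'crawler-ts-htmlparser2';

async function main() {
  const crawl = createCrawler({
    requester,
    parser,
    follower,
    shouldParse: () => true,
    shouldYield: () => true,
    // Only queue links that stay on the example.com host.
    shouldQueue: ({ location }) => String(location).includes('example.com'),
  });

  // Assumed call shape: the crawler is an async generator over crawl results,
  // started from a root location (assumed here to be a URL object).
  for await (const { location } of crawl(new URL('https://example.com/'))) {
    console.log('Visited', String(location));
  }
}

main();
```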
## API

The `createCrawler` function expects the following options as its first parameter.
```typescript
interface Options<L, R, P> {
  // Requests a location and returns a response, or undefined.
  requester(location: L): ValueOrPromise<R | undefined>;
  // Decides whether a response should be parsed.
  shouldParse(props: PreParseProps<L, R>): ValueOrPromise<boolean>;
  // Parses a response into a result, or undefined.
  parser(props: PreParseProps<L, R>): ValueOrPromise<P | undefined>;
  // Decides whether a parsed result should be yielded by the crawler.
  shouldYield(props: PostParseProps<L, R, P>): ValueOrPromise<boolean>;
  // Yields new locations derived from the parsed result.
  follower(props: PostParseProps<L, R, P>): AsyncGenerator<L>;
  // Decides whether a followed location should be queued for crawling.
  shouldQueue(props: { location: L; origin: L; response: R; parsed: P }): ValueOrPromise<boolean>;
  // Optional logger.
  logger?: Logger;
}

interface PreParseProps<L, R> {
  location: L;
  response: R;
}

interface PostParseProps<L, R, P> extends PreParseProps<L, R> {
  parsed: P;
}

type ValueOrPromise<T> = T | Promise<T>;
```
Built-in modules are available that implement some of these options; see the Modules section below.
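As a sketch of how these options fit together, the snippet below wires a toy crawler over an in-memory site map. The option implementations follow the interface above; how the value returned by `createCrawler` is invoked (here, as an async generator function taking a start location) is an assumption to check against the package.

```typescript
import { createCrawler } from 'crawler-ts';

// A toy "site": each page maps to the pages it links to.
// L = string (page path), R = string[] (raw links), P = string[] (parsed links).
const site: Record<string, string[]> = {
  '/': ['/a', '/b'],
  '/a': ['/b'],
  '/b': [],
};

const crawl = createCrawler({
  requester: (location: string) => site[location],
  shouldParse: () => true,
  parser: ({ response }: { response: string[] }) => response,
  shouldYield: () => true,
  follower: async function* ({ parsed }: { parsed: string[] }) {
    yield* parsed;
  },
  shouldQueue: ({ location }: { location: string }) => location in site,
});

async function main() {
  // Assumed call shape: an async generator over crawl results, started at '/'.
  for await (const { location, parsed } of crawl('/')) {
    console.log(location, parsed);
  }
}

main();
```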
## Modules
### crawler-ts-fetch

This module implements a `requester` that uses node-fetch to request content over HTTP. See modules/crawler-ts-fetch.
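The module's own export names are not listed here, so as an illustration the snippet below shows what a node-fetch based `requester` for URL locations can look like; treat it as a sketch rather than the package's actual API.

```typescript
import fetch from 'node-fetch';

// A requester over HTTP: L = URL, R = string (the response body).
// Returns undefined for locations that could not be fetched successfully.
async function requester(location: URL): Promise<string | undefined> {
  const response = await fetch(location.href);
  if (!response.ok) {
    return undefined;
  }
  return response.text();
}
```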
### crawler-ts-htmlparser2

This module implements a `requester`, `parser` and `follower` for HTML. The `requester` uses crawler-ts-fetch to request content over HTTP. The `parser` uses htmlparser2 to parse HTML files. The `follower` uses the parser result to find `<a>` anchor elements and yields their `href` attributes. See modules/crawler-ts-htmlparser2.
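To illustrate the follower's job, the sketch below extracts `href` values from anchor elements with htmlparser2 and yields them; the real module's types and export names may differ.

```typescript
import { parseDocument, DomUtils } from 'htmlparser2';

// A follower over parsed HTML: yields the href of every <a> element.
// Here the "parsed" value is the DOM produced by htmlparser2's parseDocument.
async function* follower({ parsed }: { parsed: ReturnType<typeof parseDocument> }) {
  const anchors = DomUtils.findAll(
    (element) => element.name === 'a',
    parsed.children
  );
  for (const anchor of anchors) {
    const href = DomUtils.getAttributeValue(anchor, 'href');
    if (href) {
      yield href;
    }
  }
}
```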
### crawler-ts-fs

This module implements a `requester`, `parser` and `follower` for the file system. The `requester` uses `fs.stat` to request file information. The `parser` by default simply returns the response from the `requester`. The `follower` follows directories.
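As an illustration of what "follows directories" means, the sketch below pairs an `fs.stat` based requester with a follower that yields the entries of every directory it visits. It uses Node's `fs/promises` directly and is not the module's actual source.

```typescript
import { stat, readdir } from 'fs/promises';
import type { Stats } from 'fs';
import { join } from 'path';

// Requester over the file system: L = string (path), R = Stats.
async function requester(location: string): Promise<Stats | undefined> {
  try {
    return await stat(location);
  } catch {
    return undefined;
  }
}

// Follower: for directories, yield each entry's full path so it can be queued.
async function* follower({ location, parsed }: { location: string; parsed: Stats }) {
  if (parsed.isDirectory()) {
    for (const entry of await readdir(location)) {
      yield join(location, entry);
    }
  }
}
```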
## Author
Gillis Van Ginderachter
## License
GNU General Public License v3.0