Tartarus Dataset Fetch and Cleanup Tools
Various scripts for downloading and building datasets.
TL;DR
npx -p @tartarus/data td fetch wikimedia --output /tmp/
Prerequisites
Requires node and wget.
(Tested on macOS with node==10.14.2 and wget==1.20.3.)
Usage
td fetch gutenberg --output /my/download/path
td fetch wikimedia --output /my/download/path \
--language en \
--language es \
--site wiki \
--site wikiquote
td spider --output /my/download/path --site /my/site/config/file.ts
Spiders & Crawling
td spider collects sequential data from APIs. Both iterative counters (e.g. a page number) and extractable URLs (e.g. a 'next page' link) are supported.
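The two navigation styles can be sketched as a pair of target callbacks. This is a minimal illustration rather than the package's own code: HandleLike is a hypothetical stand-in that declares only the SpiderHandle methods used here, the `next` field is an assumed response shape, and both "no data before the first fetch" and "null means no further target" are assumptions about the spider's behavior.

```typescript
// Hypothetical stand-in for the parts of SpiderHandle used below, so the
// sketch compiles without the package installed.
interface HandleLike {
  getBaseTarget(): string;
  getIteration(): number;
  // Assumption: null before the first page has been fetched.
  getResponseData(): { raw: string; data: any } | null;
}

// Style 1: iterative counter. Appends the current iteration as a page number.
const counterTarget = (h: HandleLike): string | null =>
  `${h.getBaseTarget()}?page=${h.getIteration()}`;

// Style 2: extractable URL. Follows a "next" link found in the previous
// response (assumed field name) and returns null when there is none.
const nextLinkTarget = (h: HandleLike): string | null => {
  const parsed = h.getResponseData();
  if (parsed === null) {
    return h.getBaseTarget(); // nothing fetched yet: start at the base target
  }
  const next = parsed.data != null ? parsed.data.next : undefined;
  return typeof next === 'string' && next.length > 0 ? next : null;
};
```

Either function could be supplied as the target element of the navigator configuration, depending on how the API paginates.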
SpiderSiteConfig
A spider requires a JS/TS configuration file to customize its use.
import { SpiderJsonData, SpiderHttpNavigator, SpiderStore, SpiderHandle } from '@tartarus/data'

export default {
  name: 'myexamplesite',
  data: new SpiderJsonData(),
  navigator: new SpiderHttpNavigator(
    {
      baseTarget: 'https://api.domain.ext/v1/list',
      // baseTarget has no query string, so the first parameter is introduced with '?'
      target: (h: SpiderHandle): string | null => `${h.getBaseTarget()}?page=${h.getIteration()}`,
      // Placeholder: always false here; return true once the last page has been reached
      isDone: (h: SpiderHandle): boolean => false
    }
  ),
  store: new SpiderStore(
    {
      subDirectoryDepth: 3,
      filename: (h: SpiderHandle) => `${h.getIteration()}.json`,
    }
  ),
  request: {
    headers: {
      'User-Agent': 'Tartarus-Data-Spider/1.0 (me@email.ext)'
    },
    method: 'get',
    responseEncoding: 'utf8'
  },
  behavior: {
    delay: 1500,
    retryDelay: 15000,
    maxRetries: 15,
  }
}
SpiderHandle
The spider exposes its current status and the most recently downloaded page by passing an instance of the SpiderHandle class to the callback functions.
SpiderHandle.getResponseData(): SpiderParsedData
Returns an object that describes the data received in response to a successful query.
The data element contains a parsed (JSON) object of the response.
The raw element contains a string representation of the response data.
interface SpiderParsedData {
  raw: string;
  data: any;
}
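As one possible use of the parsed data, an isDone callback could stop the spider once a page comes back empty. The sketch below assumes a hypothetical response body with an `items` array; ParsedDataHandle is a local stand-in for the single SpiderHandle method it needs, not a type exported by the package.

```typescript
interface SpiderParsedData {
  raw: string;
  data: any;
}

// Local stand-in for the one SpiderHandle method used here.
interface ParsedDataHandle {
  getResponseData(): SpiderParsedData;
}

// Done when the response has no `items` array or the array is empty.
// The `items` field is an assumed API shape, not something the spider defines.
const isDone = (h: ParsedDataHandle): boolean => {
  const data = h.getResponseData().data;
  const items = data != null ? data.items : undefined;
  return !Array.isArray(items) || items.length === 0;
};
```

Treating a missing `items` field as "done" errs on the side of stopping, so an unexpected error payload would not keep the spider running.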
SpiderHandle.getResponse(): SpiderNavigatorFetchResponse | null
Returns an object that contains a descriptor of a successful query (rawResponse) and the raw data received (rawData).
The contents of the rawResponse element depend on the type of SpiderNavigator in use: for SpiderHttpNavigator it will be an AxiosResponse<any>; for SpiderFileNavigator it will be set to null.
interface SpiderNavigatorFetchResponse {
  rawData: string;
  rawResponse: any;
}
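Because rawResponse varies by navigator, code that inspects it probably needs to handle each case. The following is a minimal sketch of such a check; describeFetch is a hypothetical helper, and the only Axios detail it relies on is the `status` field of an Axios response.

```typescript
interface SpiderNavigatorFetchResponse {
  rawData: string;
  rawResponse: any;
}

// Summarizes a fetch result, e.g. for logging.
function describeFetch(res: SpiderNavigatorFetchResponse | null): string {
  if (res == null) {
    return 'no response';
  }
  const size = `${res.rawData.length} chars`;
  if (res.rawResponse == null) {
    return `file, ${size}`; // file-based navigation carries no HTTP descriptor
  }
  return `HTTP ${res.rawResponse.status}, ${size}`;
}
```

Inside a callback it could be called as describeFetch(h.getResponse()).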
SpiderHandle.getBaseTarget(): string
Returns the value passed as the baseTarget element to the SpiderNavigator instance, typically a URL.
SpiderHandle.getIteration(): number
Returns the current iteration.
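One use of the iteration number is in the store's filename callback. The sketch below zero-pads it so that downloaded files sort in fetch order when listed alphabetically; the six-digit width is an arbitrary choice, and IterationHandle is a local stand-in for the one SpiderHandle method used.

```typescript
// Local stand-in for the one SpiderHandle method used here.
interface IterationHandle {
  getIteration(): number;
}

// Builds a filename such as "000042.json". Numbers with more than six
// digits are left unpadded rather than truncated.
const paddedFilename = (h: IterationHandle): string => {
  const digits = String(h.getIteration());
  const padded = digits.length >= 6 ? digits : ('000000' + digits).slice(-6);
  return `${padded}.json`;
};
```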
SpiderHandle.getPath(relativeFilename: string): string
Returns an absolute path to relativeFilename in the output directory.
SpiderHandle.getSiteConfig(): SpiderSiteConfig
Returns the contents of the site configuration as described above.
SpiderHandle.getSpider(): SpiderTask
Returns the Task instance of the spider.