fetch-filecache-for-crawling

Implementation of a `fetch` that extends the implementation from `node-fetch` to add an HTTP cache using a local cache folder, for crawling purposes.

1.4.0
Version published: 7 years ago

Weekly downloads: 375; decreased by 8.76%

Maintainers: 1

Created: 7 years ago

Implementation of fetch with a file-based HTTP cache for crawling purposes

Node.js module that exports a fetch function that extends the implementation from node-fetch to add an HTTP cache using a local cache folder.

The code was developed for a particular scenario with specific requirements in mind, and no attempts were made to generalize them. Publication as an npm package is mostly intended to ease reuse by a couple of specific projects.

The module is intended for crawling and makes the following assumptions, which may not hold in other scenarios:

Throughout the application's lifetime, information in the cache is always considered valid. In other words, a second fetch on the same URL always returns the content from the cache and does not lead to a second request on the network. That assumption would obviously be false if the goal were to load a resource that changes in real time. To that end, the code keeps the list of fetched URLs in memory. That list keeps growing, and could thus be considered a memory leak if the application were to run forever.
The user wants to preserve cached files in a folder, even after the application is done running. That file cache is used on the next run of the application to send conditional requests.
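
The first assumption can be illustrated with a minimal sketch. This is not the package's actual code: `createOncePerUrlFetcher` and `fetchOnce` are hypothetical names, and the sketch only covers the in-memory part (one network request per URL per run), not the file cache or conditional requests.

```javascript
// Hypothetical helper, not part of fetch-filecache-for-crawling.
// Wraps any fetch-like function so that each URL is requested at most once
// during the lifetime of the process.
function createOncePerUrlFetcher(fetchImpl) {
  // Maps each URL to the promise returned by its first fetch. Entries are
  // never evicted, so the map grows with the number of distinct URLs crawled.
  const fetched = new Map();

  return function fetchOnce(url) {
    if (!fetched.has(url)) {
      fetched.set(url, Promise.resolve(fetchImpl(url)));
    }
    // Every later call for the same URL reuses the same promise.
    return fetched.get(url);
  };
}
```

Because the map is never pruned, memory use grows with the number of distinct URLs, which is the trade-off described above and is acceptable for a crawl that eventually terminates.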

Installation

Run npm install fetch-filecache-for-crawling

Usage

const fetch = require('fetch-filecache-for-crawling');

// URLs to crawl, some of which may be identical
const urls = [
  'https://caniuse.com/data.json',
  'https://caniuse.com/data.json'
];

Promise.all(urls.map(url =>
  fetch(url, { logToConsole: true })
    .then(response => response.json())
    .then(json => console.log(Object.keys(json.data).length +
      ' entries in Can I Use'))
)).catch(err => console.error(err));

Configuration

On top of usual fetch options, the following optional parameters can be passed to fetch in the options parameter to change default behavior:

cacheFolder: the name of the cache folder to use. By default, the code caches all files in a folder named .cache.
resetCache: set to true to empty the cache folder when the application starts. Defaults to false. Note that the cache folder will only be reset once, regardless of whether the parameter is set to true in subsequent calls to fetch.
avoidNetworkRequests: set to true to consider that responses in the cache folder are always valid when they exist. Defaults to false, which means that the method will send a conditional HTTP request to check whether the response in the cache is still valid.
forceRefresh: set to true to force another network fetch on a URL that has already been fetched during this crawl. This can be particularly useful when a fetch returns a temporary HTTP error if you want to try again later on. Defaults to false, meaning that, during the lifetime of an application, the response in the cache is automatically returned when the underlying URL has already been fetched.
logToConsole: set to true to output progress messages to the console. Defaults to false. Each message starts with the ID of the request it relates to, so that messages from concurrent requests can be told apart.

For instance, you may do:

const fetch = require('fetch-filecache-for-crawling');

fetch('https://www.w3.org/', {
  resetCache: true,
  cacheFolder: 'mycache',
  logToConsole: true
}).then(response => {});
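
The forceRefresh option lends itself to a small retry helper. The sketch below is only an illustration: `fetchWithRetry` is a hypothetical name, treating any status of 500 or above as a temporary error is an assumption, and the fetch function is passed in as `fetchFn` so that the helper can be exercised with a stub instead of the network.

```javascript
// Hypothetical helper, not part of fetch-filecache-for-crawling.
// `fetchFn` is expected to behave like the function exported by the package.
async function fetchWithRetry(fetchFn, url, options = {}, maxRetries = 2) {
  let response = await fetchFn(url, options);

  // Assumption: a status of 500 or above denotes a temporary error.
  for (let attempt = 0; attempt < maxRetries && response.status >= 500; attempt++) {
    // Without forceRefresh, a URL already fetched during this run would be
    // answered from the cache instead of being requested again.
    response = await fetchFn(url, { ...options, forceRefresh: true });
  }

  return response;
}
```

It could be called as `fetchWithRetry(require('fetch-filecache-for-crawling'), 'https://www.w3.org/', { logToConsole: true })`. A real crawler would probably also wait between attempts.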

If a config.json file exists in the current folder, the code will try to parse it as JSON and will look for the above parameters in that file. Configuration parameters provided in the options parameter take precedence over those defined in config.json.
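
The precedence rule can be sketched as a simple merge. This is an illustration under stated assumptions rather than the package's actual code: `DEFAULTS`, `readFileConfig` and `resolveOptions` are hypothetical names, the default values are those documented above, and silently ignoring a missing or malformed config.json is an assumption.

```javascript
const fs = require('fs');
const path = require('path');

// Default values as documented in the Configuration section.
const DEFAULTS = {
  cacheFolder: '.cache',
  resetCache: false,
  avoidNetworkRequests: false,
  forceRefresh: false,
  logToConsole: false
};

// Hypothetical helper: returns the parsed content of config.json in `dir`,
// or an empty object if the file is missing or is not valid JSON (assumption).
function readFileConfig(dir = process.cwd()) {
  try {
    return JSON.parse(fs.readFileSync(path.join(dir, 'config.json'), 'utf8'));
  } catch (err) {
    return {};
  }
}

// Later sources override earlier ones: defaults < config.json < options.
function resolveOptions(options = {}, fileConfig = readFileConfig()) {
  return { ...DEFAULTS, ...fileConfig, ...options };
}
```

With a config.json containing `{ "cacheFolder": "fromfile", "logToConsole": true }`, calling `resolveOptions({ cacheFolder: 'mycache' })` would yield `cacheFolder: 'mycache'` and `logToConsole: true`, with the remaining parameters left at their defaults.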

Licensing

The code is available under an MIT license.

Package last updated on 08 Apr 2018
