robots-txt-parser
A lightweight robots.txt parser for Node.js with support for wildcards, caching and promises.
Installing
Via npm: npm install robots-txt-parser --save
Getting Started
After installing robots-txt-parser it needs to be required and initialised:
const robotsParser = require('robots-txt-parser');
const robots = robotsParser({
  userAgent: 'Googlebot', // The user agent to match against robots.txt rules.
  allowOnNeutral: false, // Whether to allow a URL when allow and disallow rules are equally balanced.
});
Example Usage:
const robotsParser = require('robots-txt-parser');
const robots = robotsParser({
  userAgent: 'Googlebot',
  allowOnNeutral: false,
});

robots.useRobotsFor('http://example.com')
  .then(() => {
    // Synchronous check against the now-cached robots.txt.
    robots.canCrawlSync('http://example.com/news');
    // Callback style.
    robots.canCrawl('http://example.com/news', (value) => {
      console.log('Crawlable: ', value);
    });
    // Promise style.
    robots.canCrawl('http://example.com/news')
      .then((value) => {
        console.log('Crawlable: ', value);
      });
  });
Condensed Documentation
Below is a condensed form of the documentation; each entry is a method available on the parser instance (robots in the examples above).
| Method | Parameters | Returns |
|---|---|---|
| parseRobots(key, string) | key:String, string:String | None. |
| isCached(domain) | domain:String | Boolean for whether the robots.txt for the domain is cached. |
| fetch(url) | url:String | Promise, resolves with the parsed robots.txt once retrieved. |
| useRobotsFor(url) | url:String | Promise, resolved when the robots.txt is fetched and parsed. |
| canCrawl(url, callback) | url:String, callback:Func (Opt) | Promise, resolves with Boolean. |
| getSitemaps(callback) | callback:Func (Opt) | Promise if no callback provided, resolves with [String]. |
| getCrawlDelay(callback) | callback:Func (Opt) | Promise if no callback provided, resolves with Number. |
| getCrawlableLinks(links, callback) | links:[String], callback:Func (Opt) | Promise if no callback provided, resolves with [String]. |
| getPreferredHost(callback) | callback:Func (Opt) | Promise if no callback provided, resolves with String. |
| setUserAgent(userAgent) | userAgent:String | None. |
| setAllowOnNeutral(allow) | allow:Boolean | None. |
| clearCache() | None | None. |
Full Documentation
parseRobots
robots.parseRobots(key, string)
Parses a string representation of a robots.txt file and caches it under the given key.
Parameters
- key -> Can be any URL.
- string -> String representation of a robots.txt file.
Returns
None.
Example
robots.parseRobots('https://example.com', `
User-agent: *
Allow: /*.php$
Disallow: /
`);
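Once parsed, the cached entry can be activated and queried. A minimal sketch, assuming useRobotsFor picks up the copy cached above under the same key:
robots.useRobotsFor('https://example.com')
  .then(() => {
    // Expected: true, '/index.php' matches the Allow: /*.php$ rule.
    console.log(robots.canCrawlSync('https://example.com/index.php'));
    // Expected: false, '/about' falls under Disallow: /.
    console.log(robots.canCrawlSync('https://example.com/about'));
  });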
isCached
robots.isCached(domain)
A method used to check if a robots.txt has already been fetched and parsed.
Parameters
- domain -> Can be any URL.
Returns
Returns true if a robots.txt has already been fetched and cached by the robots-txt-parser.
Example
robots.isCached('https://example.com');
robots.isCached('example.com');
fetch
robots.fetch(url)
Attempts to fetch and parse the robots.txt file located at the given url. This method bypasses the built-in cache and always retrieves a fresh copy of the robots.txt.
Parameters
- url -> The URL of the robots.txt file to fetch.
Returns
Returns a Promise which resolves with the parsed robots.txt once it has been fetched.
Example
robots.fetch('https://example.com/robots.txt')
  .then((tree) => {
    console.log(Object.keys(tree));
  });
useRobotsFor
robots.useRobotsFor(url)
Attempts to download and use the robots.txt for the given url. If the robots.txt has already been downloaded, the cached copy is read instead.
Parameters
- url -> Can be any URL.
Returns
Returns a Promise that resolves once the robots.txt is fetched and parsed.
Example
robots.useRobotsFor('https://example.com/news')
  .then(() => {
    // Subsequent canCrawl/canCrawlSync calls are checked against example.com's robots.txt.
  });
canCrawl
robots.canCrawl(url, callback)
Tests whether a url can be crawled for the current active robots.txt and user agent. If a robots.txt isn't cached for the domain of the url, it is fetched and parsed before returning a boolean value.
Parameters
- url -> Any URL.
- callback -> An optional callback, if undefined returns a promise.
Returns
Returns a Promise which will resolve with a boolean value.
Example
robots.canCrawl('https://example.com/news')
  .then((crawlable) => {
    console.log(crawlable);
  });
getSitemaps
robots.getSitemaps(callback)
Returns a list of the sitemaps declared in the active robots.txt.
Parameters
- callback -> An optional callback, if undefined returns a promise.
Returns
Returns a Promise which will resolve with an array of strings.
Example
robots.getSitemaps()
  .then((sitemaps) => {
    console.log(sitemaps);
  });
getCrawlDelay
robots.getCrawlDelay(callback)
Returns the crawl delay to apply between requests, as specified in the currently active robots.txt for the active user agent.
Parameters
- callback -> An optional callback, if undefined returns a promise.
Returns
Returns a Promise which will resolve with an Integer.
Example
robots.getCrawlDelay()
  .then((crawlDelay) => {
    console.log(crawlDelay);
  });
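A common use is throttling requests to a host. A minimal sketch, where fetchPage stands in for a hypothetical HTTP client of your own:
robots.getCrawlDelay()
  .then((crawlDelay) => {
    // crawlDelay is in seconds; fall back to 0 if no crawl delay applies.
    const delayMs = (crawlDelay || 0) * 1000;
    setTimeout(() => fetchPage('https://example.com/next-page'), delayMs);
  });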
getCrawlableLinks
robots.getCrawlableLinks(links, callback)
Takes an array of links and returns an array of links which are crawlable
for the current active robots.txt.
Parameters
- links -> An array of links to check for crawlability.
- callback -> An optional callback, if undefined returns a promise.
Returns
A Promise that resolves with an Array of the links that are crawlable.
Example
robots.getCrawlableLinks(['example.com/test/news', 'example.com/test/news/article'])
  .then((links) => {
    console.log(links);
  });
getPreferredHost
robots.getPreferredHost(callback)
Returns the preferred host name specified in the active robots.txt's host: directive, or undefined if there isn't one.
Parameters
- callback -> An optional callback, if undefined returns a promise.
Returns
A String if the host is defined, undefined otherwise.
Example
robots.getPreferredHost()
  .then((host) => {
    console.log(host);
  });
setUserAgent
robots.setUserAgent(userAgent)
Sets the current user agent to use when checking if a link can be crawled.
Parameters
- userAgent -> The user agent string to check robots.txt rules against.
Returns
undefined
Example
robots.setUserAgent('exampleBot');
robots.setUserAgent('testBot');
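The active user agent decides which group of rules in the robots.txt applies, so switching it can change the outcome of later checks. A minimal sketch, assuming a cached robots.txt that contains a group specific to testBot:
robots.parseRobots('https://example.com', `
User-agent: *
Allow: /

User-agent: testBot
Disallow: /private
`);

robots.useRobotsFor('https://example.com')
  .then(() => {
    robots.setUserAgent('exampleBot');
    console.log(robots.canCrawlSync('https://example.com/private')); // Expected: true, falls back to the * group.
    robots.setUserAgent('testBot');
    console.log(robots.canCrawlSync('https://example.com/private')); // Expected: false, matched by Disallow: /private.
  });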
setAllowOnNeutral
robots.setAllowOnNeutral(allow)
Sets how canCrawl behaves when the robots.txt rules are neutral on whether a link should be crawled, i.e. the allow and disallow rules are equally balanced: if true, such links are treated as crawlable; if false, they are not.
Parameters
- allow -> A boolean value.
Returns
undefined
Example
robots.setAllowOnNeutral(true);
robots.setAllowOnNeutral(false);
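A sketch of the effect, assuming a robots.txt in which an Allow and a Disallow rule of equal specificity both match the same URL, leaving the decision neutral:
robots.parseRobots('https://example.com', `
User-agent: *
Allow: /page
Disallow: /page
`);

robots.useRobotsFor('https://example.com')
  .then(() => {
    robots.setAllowOnNeutral(true);
    console.log(robots.canCrawlSync('https://example.com/page')); // Expected: true.
    robots.setAllowOnNeutral(false);
    console.log(robots.canCrawlSync('https://example.com/page')); // Expected: false.
  });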
clearCache
robots.clearCache()
The cache can grow large over extended crawling; this method resets it.
Parameters
None
Returns
None
Example
robots.clearCache();
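A short sketch of the cache being emptied, assuming useRobotsFor stores the fetched robots.txt in the cache:
robots.useRobotsFor('https://example.com')
  .then(() => {
    console.log(robots.isCached('https://example.com')); // Expected: true.
    robots.clearCache();
    console.log(robots.isCached('https://example.com')); // Expected: false.
  });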
Synchronous API
Synchronous variants of the API; these will be deprecated in a future version.
canCrawlSync
robots.canCrawlSync(url)
Tests whether a url can be crawled for the current active robots.txt and user agent. This won't attempt to fetch the robots.txt if it is not cached.
Parameters
- url -> Any URL.
Returns
Returns a boolean value depending on whether the url is crawlable. If there is no cached robots.txt for this url, it will always return true.
Example
robots.canCrawlSync('https://example.com/news');
getSitemapsSync
robots.getSitemapsSync()
Returns a list of the sitemaps declared in the active robots.txt.
Parameters
None
Returns
An Array of Strings.
Example
robots.getSitemapsSync();
getCrawlDelaySync
robots.getCrawlDelaySync()
Returns the crawl delay specified in the active robots.txt for the active user agent.
Parameters
None
Returns
An Integer greater than or equal to 0.
Example
robots.getCrawlDelaySync();
getCrawlableLinksSync
robots.getCrawlableLinksSync(links)
Takes an array of links and returns an array of links which are crawlable
for the current active robots.txt.
Parameters
- links -> An array of links to check for crawlability.
Returns
An Array of the links that are crawlable.
Example
robots.getCrawlableLinksSync(['example.com/test/news', 'example.com/test/news/article']);
getPreferredHostSync
robots.getPreferredHostSync()
Returns the preferred host name specified in the active robots.txt's host: directive or undefined if there isn't one.
Parameters
None
Returns
A String if the host is defined, undefined otherwise.
Example
robots.getPreferredHostSync();
License
See LICENSE file.
Resources