Broken Links Checker
A /smart/ standalone package that checks for broken links in HTML files that reside in a given path. This package is compatiable with Netlify servers, as it will honor any redirects mentioned in the _redirects
file. Additionaly, to avoid unneccesary bandwith & time, the checker will cache URL results for the life of the process, and resolve local URLs to local files.
To run, use the CLI as follows:
> node index.js --path ~/dev/Hashicorp/middleman/build
Or, in the working directory:
> npx broken-links-checker --path ./build
CLI Parameters
Yes | path | Path of HTML files | |
No | baseUrl | Base URL to apply to relative links | https://www.hashicorp.com/ |
No | verbose | | false |
No | configPath | Path to a linksrc JSON config file | |
Configuration
The checker can be fine-tuned using a .linksrc.json
file. By default, the checker will attemp to read a config file from the supplied path parameter, falling back to the current working directory, and finally to the user's home folder. Here are all the parameters (and their default values) you may specify in the config file:
{
"verbose": false,
"baseUrl": "https://www.hashicorp.com/",
"maxConcurrentUrlRequests": 50,
"maxConcurrentFiles": 10,
"allowedTags": {
"a": "href",
"script": "src",
"style": "href",
"iframe": "src",
"img": "src",
"embed": "src",
"audio": "src",
"video": "src"
},
"allowedProtocols": ["http:", "https:"],
"excludePatterns": [
"^https?://.*linkedin.com",
"^https?://.*example.com",
"^https?://.*localhost"
],
"excludePrivateIPs": true,
"checkAnchorLinks": true,
"timeout": 60000,
"retry405Head": true,
"retryENOTFOUND": true,
"maxRetries": 3,
"useCache": true,
"avoidHeadForPatterns": [
"^https?://.*linkedin.com",
"^https?://.*youtube.com",
"^https?://.*meetup.com"
]
}
Smart Local URL Resolution
Links are divided into two categories: internal links, and external ones. To determine if a link is internal, we parsed it's hostname and compare it to configured baseUrl
hostname. While external links are simply checked using an HTTP HEAD request, internal ones are a bit more nuanced, since we're relying on Middleman: we construct a local path for the given link by replacing the public host, with our local path (i.e. path
parameter). We then try to match, in this order, an HTML file to the resulting path: ${path}
, ${path}.html
, and finally ${path}/index.html
.
Caching
Each URL checked is stored in a local memory cache to avoid rechecking the same URL. Although unadvised, this behavior can be disabled.
URL Resolving
External URL fetching is a bit more complex than a simple HEAD request. Some hosts apply explicit rate limits, and some apply implicit ones (e.g. LinkedIn). When we fetch an external URL, we first check out local cache. In case of a cache miss, we'll make a HEAD request to the given URL, up to maxRetries
times (defaults to 3). HEAD requests require less bandwidth and are less obtrusive and GET requests. Some hosts are set to not respond to HEAD requets, and will respond with a HTTP/405 Method Not Allowed
, in which case we'll retry the request with a GET request.
Netlify Redirects
Netlify redirects in the form of a _redirects
file are honored and processed using the NetlifyDevServer
class. Redirects files are auto detected in the root of the given path
.