# 404 Crawler 🏊‍♂️
A command-line interface to crawl a sitemap and detect 404 pages.
## 📊 Usage

### Install
Make sure npm is installed on your computer. To learn more, see https://docs.npmjs.com/downloading-and-installing-node-js-and-npm.
In a terminal, run:

```sh
npm install -g @algolia/404-crawler
```
After that, the `404crawler` command will be available in your terminal.
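The basic invocation points the `crawl` command at a sitemap URL; all options are described below:

```sh
404crawler crawl --sitemap-url <url> [options]
```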
### Examples
- Crawl and detect every 404 page from the Algolia website's sitemap:

  ```sh
  404crawler crawl -u https://algolia.com/sitemap.xml
  ```

- Use JavaScript rendering to crawl and identify all 404 or "Not Found" pages on the Algolia website:

  ```sh
  404crawler crawl -u https://algolia.com/sitemap.xml --render-js
  ```

- Crawl and identify all 404 pages on the Algolia website by analyzing its sitemap, including all sub-path variations:

  ```sh
  404crawler crawl -u https://algolia.com/sitemap.xml --include-variations
  ```
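- Crawl in parallel with Chromium and write the results to a JSON file. This example simply combines options documented below; the values shown are illustrative:

  ```sh
  404crawler crawl -u https://algolia.com/sitemap.xml --run-in-parallel --batch-size 20 --browser-type chromium --output crawler/results.json
  ```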
### Options
- `--sitemap-url` or `-u`: **Required.** URL of the `sitemap.xml` file.
- `--render-js` or `-r`: Use JavaScript rendering to crawl and identify a "Not Found" page even when the status code isn't 404. This option is useful for websites that return a 200 status code even when the page is not found (for example, Next.js with a custom not-found page).
- `--output` or `-o`: Output path for the JSON file of results, for example `crawler/results.json`. If not set, no file is written after the crawl.
- `--include-variations` or `-v`: Include all sub-path variations of the URLs found in the `sitemap.xml`. For example, if `https://algolia.com/foo/bar/baz` is found in the sitemap, the crawler will test `https://algolia.com/foo/bar/baz`, `https://algolia.com/foo/bar`, `https://algolia.com/foo`, and `https://algolia.com` (see the sketch after this list).
- `--exit-on-detection` or `-e`: Exit as soon as a 404 or "Not Found" page is detected.
- `--run-in-parallel` or `-p`: Run the crawler with multiple pages in parallel. By default, the number of parallel instances is 10; see the `--batch-size` option to configure this number.
- `--batch-size` or `-s`: Number of parallel crawler instances to run: the higher this number, the more resources are consumed. Only available when the `--run-in-parallel` option is set. Defaults to 10.
- `--browser-type` or `-b`: Type of browser used to crawl pages. Can be `firefox`, `chromium`, or `webkit`. Defaults to `firefox`.
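For a rough sense of what `--include-variations` and `--run-in-parallel`/`--batch-size` imply, here is a minimal TypeScript sketch. It is illustrative only, not the package's actual implementation; the helper names `subPathVariations` and `crawlInBatches` are hypothetical.

```ts
// Illustrative sketch only; not the package's actual code.

// --include-variations: derive every ancestor path of a sitemap URL,
// e.g. /foo/bar/baz -> /foo/bar -> /foo -> site root.
function subPathVariations(rawUrl: string): string[] {
  const { origin, pathname } = new URL(rawUrl);
  const segments = pathname.split("/").filter(Boolean);
  return segments
    .map((_, i) => `${origin}/${segments.slice(0, segments.length - i).join("/")}`)
    .concat(origin);
}

// subPathVariations("https://algolia.com/foo/bar/baz") ->
// [ "https://algolia.com/foo/bar/baz",
//   "https://algolia.com/foo/bar",
//   "https://algolia.com/foo",
//   "https://algolia.com" ]

// --run-in-parallel / --batch-size: crawl URLs in fixed-size batches
// instead of strictly one at a time (the default of 10 mirrors the docs).
async function crawlInBatches(
  urls: string[],
  crawl: (url: string) => Promise<void>,
  batchSize = 10,
): Promise<void> {
  for (let i = 0; i < urls.length; i += batchSize) {
    await Promise.all(urls.slice(i, i + batchSize).map(crawl));
  }
}
```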
## 👨‍💻 Get started (maintainers)
This CLI is built with TypeScript and uses `ts-node` to run the code locally.
### Install

Install all dependencies:

```sh
pnpm i
```
### Run locally

```sh
pnpm 404crawler crawl <options>
```
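For example, to reproduce the first usage example above against your local checkout:

```sh
pnpm 404crawler crawl -u https://algolia.com/sitemap.xml
```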
### Deploy

- Update the `package.json` version
- Commit and push the changes
- Build the JS files in `dist/` with `pnpm build`
- Initialize npm with the Algolia org as scope: `npm init --scope=algolia`
- Follow the instructions
- Publish the package with `npm publish`
## 🔗 References
This package uses: