astro-robots-txt
This Astro integration generates a robots.txt for your Astro project during build.
Why astro-robots-txt?
The robots.txt file informs search engines which pages on your website should be crawled. See Google's own advice on robots.txt to learn more.
In an Astro project, you usually create robots.txt in a text editor and place it in the public/ directory. In that case, you must manually keep the site option in astro.config.* in sync with the Sitemap: record in robots.txt. This breaks the DRY principle.
Sometimes, especially during development, you need to prevent your site from being indexed. To do this, you place the meta tag <meta name="robots" content="noindex"> in the <head> section of your pages or add X-Robots-Tag: noindex to the HTTP response headers, and then add the lines User-agent: * and Disallow: / to robots.txt. Again, you have to do this manually in two different places.
astro-robots-txt can help in both cases on the robots.txt side. See details in this demo repo.
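For the second case, the robots.txt side can be driven by a single flag instead of being edited by hand. The sketch below is one possible approach, not something the integration does on its own: `policyFor` is a hypothetical helper, and how you decide whether a build is a production build (for example, an environment variable of your choosing) is left as an assumption. It relies only on the `policy` option described under Configuration.

```javascript
// Hypothetical helper: returns a value for the integration's `policy` option.
function policyFor(isProduction) {
  if (isProduction) {
    // Same rule as the documented default policy: allow everything.
    return [{ userAgent: '*', allow: '/' }];
  }
  // Outside production, ask every crawler to stay away from the whole site.
  return [{ userAgent: '*', disallow: '/' }];
}

console.log(policyFor(false));
```

In astro.config.mjs the result would be passed as `robotsTxt({ policy: policyFor(isProduction) })`, so the rule lives in one place.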
Installation
Quick Install
The experimental astro add command-line tool automates the installation for you. Run one of the following commands in a new terminal window. (If you aren't sure which package manager you're using, run the first command.) Then, follow the prompts, and type "y" in the terminal (meaning "yes") for each one.
npx astro add astro-robots-txt
yarn astro add astro-robots-txt
pnpx astro add astro-robots-txt
Then, restart the dev server by typing CTRL-C and then npm run astro dev in the terminal window that was running Astro.
Because this command is new, it might not set things up properly. If that happens, open an issue on Astro's GitHub and try the manual installation steps below.
Manual Install
First, install the astro-robots-txt package using your package manager. If you're using npm or aren't sure, run this in the terminal:
npm install --save-dev astro-robots-txt
Then, apply this integration to your astro.config.* file using the integrations property:
astro.config.mjs
import robotsTxt from 'astro-robots-txt';

export default {
  integrations: [robotsTxt()],
};
Then, restart the dev server.
Usage
The astro-robots-txt integration requires a deployment / site URL for generation. Add your site's URL to your astro.config.* using the site property.
Then, apply this integration to your astro.config.* file using the integrations property.
astro.config.mjs
import { defineConfig } from 'astro/config';
import robotsTxt from 'astro-robots-txt';

export default defineConfig({
  site: 'https://example.com',
  integrations: [robotsTxt()],
});
Note that unlike other configuration options, site is set in the root defineConfig object, rather than inside the robotsTxt() call.
Now, build your site for production via the astro build command. You should find your robots.txt under dist/robots.txt!
Warning
If you forget to add a site, you'll get a friendly warning when you build, and the robots.txt file won't be generated.
Example of generated `robots.txt` file
robots.txt
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap-index.xml
Configuration
To configure this integration, pass an object to the robotsTxt() function call in astro.config.mjs.
astro.config.mjs
...
export default defineConfig({
  integrations: [robotsTxt({
    transform: ...
  })]
});
sitemap
| Type | Required | Default value |
|---|---|---|
| Boolean / String / String[] | No | true |
If you omit the sitemap parameter or set it to true, the resulting robots.txt will contain Sitemap: your-site-url/sitemap-index.xml.
If you want the robots.txt file without the Sitemap: ... entry, set the sitemap parameter to false.
astro.config.mjs
import robotsTxt from 'astro-robots-txt';

export default {
  site: 'https://example.com',
  integrations: [
    robotsTxt({
      sitemap: false,
    }),
  ],
};
When sitemap is a String or String[], each value must be a valid URL. Only the http and https protocols are allowed.
astro.config.mjs
import robotsTxt from 'astro-robots-txt';

export default {
  site: 'https://example.com',
  integrations: [
    robotsTxt({
      sitemap: [
        'https://example.com/first-sitemap.xml',
        'http://another.com/second-sitemap.xml',
      ],
    }),
  ],
};
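The URL rule above can be checked before the values ever reach the config. The following is a minimal sketch of such a check, assuming the standard WHATWG `URL` class available in Node.js; `isAllowedSitemapUrl` is a hypothetical name, and the integration's own validation may be stricter or report errors differently.

```javascript
// Hypothetical pre-check mirroring the documented rule:
// the value must parse as a URL and use http or https.
function isAllowedSitemapUrl(value) {
  let url;
  try {
    url = new URL(value); // throws for strings that are not absolute URLs
  } catch {
    return false;
  }
  return url.protocol === 'http:' || url.protocol === 'https:';
}

console.log(isAllowedSitemapUrl('https://example.com/first-sitemap.xml'));
console.log(isAllowedSitemapUrl('ftp://example.com/sitemap.xml'));
```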
sitemapBaseFileName
| Type | Required | Default value |
|---|---|---|
| String | No | sitemap-index |
Sitemap file name without the file extension (.xml). It is used if the sitemap parameter is true or omitted.
Note: the @astrojs/sitemap and astro-sitemap integrations produce sitemap-index.xml as their primary output. That is why the default value of sitemapBaseFileName is sitemap-index.
astro.config.mjs
import robotsTxt from 'astro-robots-txt';

export default {
  site: 'https://example.com',
  integrations: [
    robotsTxt({
      sitemapBaseFileName: 'custom-sitemap',
    }),
  ],
};
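To make the relationship between site, sitemapBaseFileName and the generated Sitemap: record concrete, here is a small sketch. It assumes the file name is simply resolved against the site URL; `defaultSitemapUrl` is a hypothetical helper, not part of the integration, and the integration's handling of sites with a sub-path may differ.

```javascript
// Hypothetical helper: builds the URL expected in the "Sitemap:" record
// when the `sitemap` option is true or omitted.
function defaultSitemapUrl(site, sitemapBaseFileName = 'sitemap-index') {
  // Resolve "<base name>.xml" relative to the site URL.
  return new URL(`${sitemapBaseFileName}.xml`, site).href;
}

console.log(`Sitemap: ${defaultSitemapUrl('https://example.com')}`);
console.log(`Sitemap: ${defaultSitemapUrl('https://example.com', 'custom-sitemap')}`);
```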
host
| Type | Required | Default value |
|---|---|---|
| Boolean / String | No | undefined |
Some crawlers (Yandex) support a Host directive, which allows websites with multiple mirrors to specify their preferred domain.
astro.config.mjs
import robotsTxt from 'astro-robots-txt';

export default {
  site: 'https://example.com',
  integrations: [
    robotsTxt({
      host: 'your-domain-name.com',
    }),
  ],
};
If the host option is set to true, the Host output is automatically resolved from the site option in your Astro config.
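As an illustration of what "resolved from site" could mean, the sketch below derives a host value from a site URL with the standard `URL` class. `resolveHost` is a hypothetical helper, and which part of the URL the integration actually emits is an assumption here.

```javascript
// Hypothetical helper: picks the value for the "Host:" record.
function resolveHost(host, site) {
  if (host === true) {
    // Assumption: the host part of the site URL (no protocol, no path).
    return new URL(site).host;
  }
  if (typeof host === 'string') {
    return host; // an explicit domain is used as given
  }
  return undefined; // no Host record when the option is omitted or false
}

console.log(resolveHost(true, 'https://example.com'));
```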
transform
| Type | Required | Default value |
|---|---|---|
| (content: String): String or (content: String): Promise<String> | No | undefined |
Sync or async function called just before writing the text output to disk.
astro.config.mjs
import robotsTxt from 'astro-robots-txt';

export default {
  site: 'https://example.com',
  integrations: [
    robotsTxt({
      transform(content) {
        return `# Some comments before the main content.\n# Second line.\n\n${content}`;
      },
    }),
  ],
};
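Because transform may also return a Promise, it can wait for data before producing the final text. The sketch below shows an async variant under stated assumptions: `loadExtraRules` is a hypothetical stand-in for any asynchronous source (a file, a CMS, an API), and `appendExtraRules` is an illustrative name, not part of the integration.

```javascript
// Hypothetical async source of additional robots.txt lines.
async function loadExtraRules() {
  return ['User-agent: BadBot', 'Disallow: /'];
}

// Async transform: appends the extra lines after the generated content.
async function appendExtraRules(content) {
  const extra = await loadExtraRules();
  // Trim trailing whitespace so exactly one blank line separates the groups.
  return `${content.trimEnd()}\n\n${extra.join('\n')}\n`;
}

appendExtraRules('User-agent: *\nAllow: /\n').then((text) => console.log(text));
```

It would be passed to the integration as `robotsTxt({ transform: appendExtraRules })`.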
policy
| Type | Required | Default value |
|---|---|---|
| Policy[] | No | [{ allow: '/', userAgent: '*' }] |
List of Policy rules.
Type Policy
| Name | Type | Required | Description |
|---|---|---|---|
| userAgent | String | Yes | Name of the automated client (search engine crawler) the rule applies to. Wildcards are allowed. |
| disallow | String / String[] | No | Paths that must not be crawled. |
| allow | String / String[] | No | Paths that may be crawled. |
| crawlDelay | Number | No | Minimum interval (in seconds) the crawler should wait after loading one page before starting the next. |
| cleanParam | String / String[] | No | Indicates that the page's URL contains parameters that should be ignored during crawling. Maximum string length is limited to 500. |
astro.config.mjs
import robotsTxt from 'astro-robots-txt';

export default {
  site: 'https://example.com',
  integrations: [
    robotsTxt({
      policy: [
        {
          userAgent: 'Googlebot',
          allow: '/',
          disallow: ['/search'],
          crawlDelay: 2,
        },
        {
          userAgent: 'OtherBot',
          allow: ['/allow-for-all-bots', '/allow-only-for-other-bot'],
          disallow: ['/admin', '/login'],
          crawlDelay: 2,
        },
        {
          userAgent: '*',
          allow: '/',
          disallow: '/search',
          crawlDelay: 10,
          cleanParam: 'ref /articles/',
        },
      ],
    }),
  ],
};
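To show roughly how Policy fields correspond to robots.txt directives, here is a small renderer sketch. It is not the integration's implementation: `renderPolicy` and `toArray` are hypothetical names, and the order of directives and the blank line between groups are assumptions, so the real output may be laid out differently.

```javascript
// Normalize a String / String[] option to an array ([] when omitted).
const toArray = (value) => (value === undefined ? [] : [].concat(value));

// Hypothetical renderer: one directive group per Policy rule.
function renderPolicy(policy) {
  return policy
    .map((rule) =>
      [
        `User-agent: ${rule.userAgent}`,
        ...toArray(rule.allow).map((path) => `Allow: ${path}`),
        ...toArray(rule.disallow).map((path) => `Disallow: ${path}`),
        ...(rule.crawlDelay === undefined ? [] : [`Crawl-delay: ${rule.crawlDelay}`]),
        ...toArray(rule.cleanParam).map((param) => `Clean-param: ${param}`),
      ].join('\n'),
    )
    .join('\n\n'); // assumption: groups separated by a blank line
}

console.log(
  renderPolicy([
    { userAgent: 'OtherBot', disallow: ['/admin', '/login'], crawlDelay: 2 },
    { userAgent: '*', allow: '/', disallow: '/search', cleanParam: 'ref /articles/' },
  ]),
);
```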
Examples
Contributing
You're welcome to submit an issue or PR!
Changelog
See CHANGELOG.md for a history of changes to this integration.
Inspirations