SEO (Search Engine Optimization)
robots.txt Configurations
Here are some common robots.txt configurations:
- Block all crawlers: To block all crawlers from accessing your website, you can use the following configuration:
User-Agent: *
Disallow: /
- Allow all crawlers: To allow all crawlers to access your entire website, you can use the following configuration:
User-Agent: *
Disallow:
- Block specific crawlers: To block specific crawlers from accessing your website, you can specify the User-Agent directive for each crawler and use the Disallow directive to block access:
User-Agent: BadBot
Disallow: /
User-Agent: AnotherBadBot
Disallow: /
- Block specific sections: To block specific sections of your website from being crawled, you can use the Disallow directive:
User-Agent: *
Disallow: /secret-folder/
Disallow: /private/
- Allow and disallow specific sections: To allow and disallow specific sections of your website, you can use the Disallow and Allow directives in combination:
User-Agent: *
Disallow: /private/
Allow: /private/public-page.html
These are just a few examples of common robots.txt configurations. The specific configuration you use will depend on your website and the needs of your business. It's important to carefully consider the implications of your configuration before making changes to your robots.txt file.
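If you want to sanity-check a configuration before deploying it, Python's standard-library urllib.robotparser can parse robots.txt rules and report whether a given path may be fetched. The sketch below reuses the "block specific sections" rules from above; note that urllib.robotparser does simple prefix matching and does not interpret wildcard patterns such as /*.png$, so treat it as a rough check only.

```python
from urllib.robotparser import RobotFileParser

# The "block specific sections" rules shown above.
robots_txt = """\
User-Agent: *
Disallow: /secret-folder/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) returns True if the rules allow crawling the URL.
print(parser.can_fetch("*", "/index.html"))           # True  -> no rule matches
print(parser.can_fetch("*", "/private/report.html"))  # False -> matches Disallow: /private/
```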
Disallowing image crawling
To disallow image crawling, you can add the following lines to your robots.txt file:
User-Agent: *
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$
Disallow: /*.gif$
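The * and $ wildcards used here are extensions to the original robots.txt rules that major crawlers such as Googlebot honor: * matches any sequence of characters and a trailing $ anchors the pattern to the end of the URL. As a rough, simplified model of how such a pattern is applied (not a full robots.txt parser), the sketch below translates a pattern into a regular expression and tests a few paths against it:

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Convert a robots.txt path pattern using * and $ into a regex.

    Simplified model: '*' matches any run of characters and a trailing
    '$' anchors the pattern to the end of the URL path.
    """
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile("^" + body + ("$" if anchored else ""))

blocked_jpg = robots_pattern_to_regex("/*.jpg$")

print(bool(blocked_jpg.match("/images/photo.jpg")))      # True  -> would be disallowed
print(bool(blocked_jpg.match("/images/photo.jpg?v=2")))  # False -> query string breaks the $ anchor
print(bool(blocked_jpg.match("/docs/page.html")))        # False -> not an image URL
```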
Allowing Google and Yahoo!, but rejecting all others
To allow Google and Yahoo! to crawl your website while rejecting all other crawlers, you can use the following configuration in your robots.txt file:
User-Agent: Googlebot
Disallow:
User-Agent: Yahoo! Slurp
Disallow:
User-Agent: *
Disallow: /
The first two groups specify that Googlebot and Yahoo! Slurp are allowed to crawl your website, while the final group (User-Agent: * with Disallow: /) blocks all other crawlers from accessing your website.
It's important to note that while this configuration will block most unwanted crawlers, it may not block all of them. Some crawlers may ignore the robots.txt file, or may impersonate other search engines to bypass the restrictions. As such, it's important to monitor your server logs to ensure that your website is not being crawled excessively.
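A simple way to do that monitoring is to tally the user-agent strings seen in your access log. The sketch below assumes a web server writing the common "combined" log format to a file named access.log; both the file name and the format are assumptions about your setup:

```python
from collections import Counter

def top_user_agents(log_path: str, limit: int = 10) -> list[tuple[str, int]]:
    """Count requests per user-agent in a combined-format access log.

    In the combined format the user-agent is the last double-quoted
    field on each line; lines that don't fit that shape are skipped.
    """
    counts: Counter[str] = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            parts = line.rsplit('"', 2)  # ... "referer" "user-agent"\n
            if len(parts) == 3:
                counts[parts[1]] += 1
    return counts.most_common(limit)

if __name__ == "__main__":
    for agent, hits in top_user_agents("access.log"):
        print(f"{hits:8d}  {agent}")
```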
Blocking Office documents
To block crawling of office documents (e.g. Microsoft Word, Excel, and PowerPoint files), you can use the following configuration in your robots.txt file:
User-Agent: *
Disallow: /*.doc$
Disallow: /*.docx$
Disallow: /*.xls$
Disallow: /*.xlsx$
Disallow: /*.ppt$
Disallow: /*.pptx$
This configuration specifies that all crawlers (indicated by User-Agent: *) should not crawl any URLs that end with the file extensions .doc, .docx, .xls, .xlsx, .ppt, or .pptx.
It's important to note that while disallowing the crawling of office documents can help reduce the amount of bandwidth and server resources used, it may also negatively impact the visibility and ranking of your website in search results. This is because these documents may contain important information and context that can be used by search engines to understand the content and structure of your website. As such, disallowing the crawling of office documents should be done with care and only if necessary.
Blocking the Internet Archive crawler
To block the Internet Archive's crawler (used by the Wayback Machine), you can use the following configuration in your robots.txt file:
User-Agent: ia_archiver
Disallow: /
These lines specify that the ia_archiver crawler should not crawl any pages on your website.
It's important to note that while blocking the Internet Archiver can prevent your website from being archived and preserve your privacy, it may also negatively impact the visibility and discoverability of your website in search results. This is because archived pages can provide additional context and information that can be used by search engines to understand the content and history of your website. As such, blocking the Internet Archiver should be done with care and only if necessary.
Summary of robots.txt Directives
robots.txt is a file that webmasters can use to control how web robots (often referred to as "bots" or "crawlers") interact with their website. The file is located at the root of a website and provides instructions to bots on which pages or sections of the website they are allowed or disallowed to crawl.
The robots.txt file uses a specific format to specify the rules: each group of rules begins with a User-Agent directive that identifies the bot being targeted, followed by one or more Disallow directives that specify the pages or sections of the website that the bot should not crawl.
Here is a summary of the robots.txt directives in table format:
| Directive | Example | Description |
|---|---|---|
| User-Agent | User-Agent: Googlebot | Identifies the bot being targeted. The rules that follow apply to the specified bot. |
| Disallow | Disallow: /secret | Specifies the pages or sections of the website that the bot should not crawl. |
| Allow | Allow: /secret/allowed-page | Specifies the pages or sections of the website that the bot is allowed to crawl, even if a parent directory is disallowed. |
| Sitemap | Sitemap: https://example.com/sitemap.xml | Specifies the location of the website's sitemap file, giving bots a roadmap of all the pages and sections so they can crawl the website more efficiently. |
| Crawl-delay | Crawl-delay: 2 | Specifies the number of seconds the bot should wait between subsequent requests, which can be used to prevent the bot from overloading the server. |
| Wildcards | Disallow: /*.png$ | Specifies patterns that match URLs the bot should not crawl. The $ symbol indicates that the pattern should only match URLs ending with the specified extension (here, .png). |
Examples:
- To block all bots from crawling a website, provide the location of the sitemap, and set a crawl delay of 20 seconds:
User-Agent: *
Disallow: /
Sitemap: https://example.com/sitemap.xml
Crawl-delay: 20
- To allow Googlebot to crawl the entire website, provide the location of the sitemap, and set a crawl delay of 20 seconds:
User-Agent: Googlebot
Disallow:
Sitemap: https://example.com/sitemap.xml
Crawl-delay: 20
- To block Googlebot from crawling the /secret directory and provide the location of the sitemap:
User-Agent: Googlebot
Disallow: /secret
Sitemap: https://example.com/sitemap.xml
- To allow Googlebot and Bingbot to crawl the entire website, but block all other bots, provide the location of the sitemap, and set a crawl delay of 20 seconds:
User-Agent: Googlebot
Disallow:
Sitemap: https://example.com/sitemap.xml
Crawl-delay: 20
User-Agent: Bingbot
Disallow:
Sitemap: https://example.com/sitemap.xml
Crawl-delay: 20
User-Agent: *
Disallow: /
Sitemap: https://example.com/sitemap.xml
Crawl-delay: 20
- To disallow all bots from crawling images (for example, .png files), but allow all other pages:
User-Agent: *
Disallow: /*.png$
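To see how a bot that honors these directives would interpret a live file, Python's urllib.robotparser can fetch a robots.txt and report the permissions, crawl delay, and sitemap locations it declares. A minimal sketch, using https://example.com/robots.txt as a placeholder URL:

```python
from urllib.robotparser import RobotFileParser

# Placeholder URL; point this at your own site's robots.txt.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetch and parse the file

user_agent = "Googlebot"
print(parser.can_fetch(user_agent, "https://example.com/secret/page.html"))
print(parser.crawl_delay(user_agent))  # Crawl-delay for this bot, or None if not set
print(parser.site_maps())              # list of Sitemap URLs, or None (Python 3.8+)
```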
It's important to note that while the robots.txt file provides a way to control bot behavior, it is not a guarantee that bots will comply with the instructions. Some bots may ignore the robots.txt file, while others may impersonate other bots to bypass the restrictions. As such, it's important to monitor your server logs and use other security measures to ensure that your website is not being crawled excessively or inappropriately.