What is a Robots.txt file?
A robots.txt file is simply a text file within your website’s directory that instructs search engine crawlers which pages on a website to crawl and which to ignore. These crawl instructions are defined by “disallowing” or “allowing” the behavior of specific (or all) web crawling software. The robots.txt file is also where search engines learn where they can find the sitemap.
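For example, a very small robots.txt file might look like this (the /private/ folder and the sitemap URL are hypothetical placeholders):
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml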
How does Robots.txt work?
The robots.txt file is part of the Robots Exclusion Protocol (REP), a group of standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or website-wide instructions for how search engines should treat links (such as “nofollow” or “follow”).
Using robots.txt to manage access for web crawlers
Below are some examples of robots.txt in action for a www.example.com site (robots.txt file URL: www.example.com/robots.txt).
Blocking all web crawlers from all content
This instruction tells all web crawlers not to crawl any page on www.example.com, including the homepage.
User-agent: *
Disallow: /
Allowing all web crawlers access to all content
The rule below tells web crawlers to crawl all pages on www.example.com, including the homepage.
User-agent: *
Allow: /
Blocking a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /example-subfolder/
This syntax instructs only Google’s crawler not to crawl any pages under www.example.com/example-subfolder/.
Blocking a specific web crawler from a specific web page
User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html
This syntax instructs only Bing’s crawler to avoid crawling the exact page at www.example.com/example-subfolder/blocked-page.html.
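The groups above can also live together in a single robots.txt file. A crawler generally obeys only the most specific group that matches its user agent, so a combined sketch might look like this:
User-agent: Googlebot
Disallow: /example-subfolder/

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

User-agent: *
Disallow:
Here, Googlebot and Bingbot each follow their own group, while every other crawler falls back to the catch-all group, whose empty Disallow line allows everything.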
How do search engine crawlers use robots.txt?
Search engines have two primary goals:
- To crawl the web to discover content
- To index that content so that it can be found by people who are looking for information.
In general, to crawl websites, search engines follow links to get from one website to another, eventually crawling across billions of links and sites. This crawling behavior is sometimes known as “spidering.” Once at a website and before spidering it, crawlers look for a robots.txt file. If one exists, they will read it before continuing through the site. If the robots.txt file doesn’t contain any disallow rules, or the website doesn’t have a robots.txt file at all, crawlers proceed to crawl the rest of the website.
Specifying the location of the sitemap
As you probably know, sitemaps can greatly help to speed up the indexing of a website. Before search engines can work with your sitemap, you'll need to tell them where to find it. That's also something you can do in the robots.txt file. It's as simple as adding this line:
Sitemap: https://www.example.com/sitemap.xml
Although adding the sitemap to Google Search Console is normally enough for Google, we recommend adding it to the robots.txt as well. This helps other search engines find the sitemap.
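If a site has more than one sitemap, for example one for posts and one for products, multiple Sitemap lines can be listed (the file names below are hypothetical):
Sitemap: https://www.example.com/sitemap-posts.xml
Sitemap: https://www.example.com/sitemap-products.xml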
Other quick robots.txt must-knows:
- To be found, a robots.txt file must be placed in a website’s top-level directory.
- The /robots.txt file is publicly available. Just add /robots.txt to the end of any root domain to see that website’s directives (if that site has a robots.txt file!). That means that anyone can see which pages you have set to be crawled or not crawled, so don’t use robots.txt to hide sensitive user information.
- Some robots may decide to ignore your robots.txt file. This is particularly common with malicious crawlers like email address scrapers and malware robots.
- Each subdomain on a root domain uses separate robots.txt files. That means that both example.com and blog.example.com should have their own robots.txt files (at example.com/robots.txt and blog.example.com/robots.txt).
- Robots.txt is case sensitive: the file must be named “robots.txt” (not robots.TXT, Robots.txt, etc.).
- It is advisable to indicate the location of any sitemaps linked with this domain at the bottom of the robots.txt file.
Technical robots.txt syntax
Moz defines robots.txt syntax as follows: Robots.txt syntax can be thought of as the “language” of robots.txt files. There are 5 common terms you are likely to come across in a robots file. They include:
- User-agent: The specific web crawler to which you are giving crawl instructions — usually a search engine. Most user agents can be found here.
- Allow (Only valid for Googlebot): This directive instructs Googlebot to access a page or subfolder even though its parent page or subfolder may be disallowed.
- Disallow: The directive instructs a user-agent not to crawl a certain URL. Note that only one “Disallow:” line is allowed for each URL.
- Sitemap: Used to call out the location of any XML sitemap(s) linked with this URL. Tip! This directive is only supported by Ask, Bing, Google, and Yahoo.
- Crawl-delay: Refers to the number of seconds a crawler should wait before loading and crawling page content. Tip! Googlebot doesn’t recognize this rule; however, the crawl rate can be set in Google Search Console.
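Putting the five terms together, a complete robots.txt file might look something like the sketch below (the /admin/ paths are hypothetical, and, as noted above, Googlebot ignores the Crawl-delay line):
User-agent: *
Crawl-delay: 10
Disallow: /admin/
Allow: /admin/public/

Sitemap: https://www.example.com/sitemap.xml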
Pattern-matching
When it comes to the exact URLs to allow or block, robots.txt files can get fairly complex, as they permit the use of pattern-matching to cover a range of possible URL options. Both Bing and Google recognize two pattern-matching characters that can be used to identify pages or subfolders an SEO wants excluded: the dollar sign ($) and the asterisk (*). The ($) matches the end of the URL, and (*) is a wildcard that represents any sequence of characters. Google provides a great list of possible pattern-matching syntax and examples here.
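For instance, the hypothetical rules below use both characters: the first blocks any URL that contains a question mark (such as internal search results), and the second blocks any URL that ends in .pdf:
User-agent: *
Disallow: /*?
Disallow: /*.pdf$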
Where to put robots.txt?
The robots.txt file must be placed at the root of the site host to which it applies. For example, to control crawling on all URLs below http://www.example.com/, the robots.txt file must be located at http://www.example.com/robots.txt. It cannot be located in a subdirectory (for instance, at http://example.com/pages/robots.txt). If you’re unsure how to access your site root, or need permission to do so, contact your web hosting service provider. Pro tip! If you can’t access your website root, use an alternative blocking method such as meta tags.
Why is robots.txt essential?
To block non-public pages
Yes, sometimes you may have pages on your website that you don’t want to be indexed — for example, a login page. If you have such pages, it is ok to use robots.txt to block them from search engine crawlers and bots.
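A hedged sketch: if login and account pages lived under /login/ and /account/ (hypothetical paths), the following rules would keep compliant crawlers out of them:
User-agent: *
Disallow: /login/
Disallow: /account/
Keep in mind, as discussed in the best practices below, that robots.txt alone won’t keep a page out of the index if other sites link to it.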
Maximize crawl budget
If you’re having a rough time getting all of your pages indexed, you might have a crawl budget problem. By blocking insignificant pages with robots.txt, you let Googlebot spend more of your crawl budget on the pages that actually matter.
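For instance (these paths are purely hypothetical), a site might keep crawlers out of internal search results, tag archives, and cart pages so the budget is spent on pages that matter:
User-agent: *
Disallow: /search/
Disallow: /tag/
Disallow: /cart/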
Prevent indexing of resources
While meta directives can work just as well as robots.txt in stopping pages from getting indexed, they don’t work well for multimedia resources such as images and PDFs. That is where robots.txt comes into play. Bonus! You can always check how many web pages you have indexed in Google Search Console. If the number matches the pages you want indexed, there is no need to worry. But if it doesn’t, you should create a robots.txt file for your site.
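For instance, a site that wants to keep crawlers away from its PDFs and image files might use rules like the following (the folder names are hypothetical):
User-agent: *
Disallow: /pdfs/
Disallow: /images/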
SEO best practices
- Make sure you aren’t blocking any content or sections of your site you want to be crawled.
- Do not use robots.txt to prevent sensitive data from appearing in search results. This is because other pages may link directly to the page containing private information, which may then still be indexed. If you really want to keep a page out of search results, use a different method like the noindex meta directive or password protection.
- Links on pages blocked by robots.txt will not be followed. That means:
  - Unless they are also linked from other search engine-accessible pages (such as pages not blocked via robots.txt, meta robots, etc.), the linked resources will not be crawled and may not be indexed.
  - No link equity can be passed from the blocked page to the link destination. If you have pages to which you want equity to be passed, use a blocking mechanism other than robots.txt.
- Some search engines have multiple crawlers. For example, Google uses Googlebot-Image for image search and Googlebot for organic search. Most crawlers from the same search engine follow the same rules, so there is no need to define rules for each of a search engine’s multiple crawlers. However, having the ability to do so allows you to fine-tune how your website is crawled (see the sketch after this list).
- Make your robots.txt file easy to find. As soon as you have your robots.txt file, it is time to make it live. As noted above, it should live at the root of your domain, i.e. https://example.com/robots.txt. Note that the robots.txt filename is case sensitive, so make sure to use a lowercase “r”.
- A search engine will cache the robots.txt contents but typically updates the cached contents at least once a day. If you change the file and want it updated faster than that, you can submit your robots.txt URL to Google.
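Returning to the point about multiple crawlers, the sketch below (with a hypothetical /photos/ folder) keeps Googlebot-Image out of a photo directory while leaving Google’s organic search crawler unrestricted:
User-agent: Googlebot-Image
Disallow: /photos/

User-agent: Googlebot
Disallow: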
Robots.txt vs. meta robots vs. x-robots
What is the difference between these three types of robot directives? Simply put, robots.txt is the actual text file, whereas meta and x-robots are meta directives. Beyond that, these three serve different functions. Robots.txt determines site- or directory-wide crawl behavior. On the other hand, meta and x-robots can determine indexation behavior at the individual page (or page element) level.