Robots.txt
Robots.txt: Short Summary
Robots.txt is a text file that plays an important role in the indexing of website content. Using the file, webmasters can determine which subpages should be captured and indexed by a crawler such as Googlebot and which should not. This makes robots.txt important for search engine optimisation.
Robots.txt: Detailed Summary
The basis of robots.txt and the related management of indexing is the Robots Exclusion Protocol (also known as the Robots Exclusion Standard), abbreviated REP, which was published in 1994. This protocol specifies various options that webmasters can use to control search engines’ crawlers and the way they work. However, it should be noted that robots.txt is merely a guideline for search engines, which they are not obliged to follow. The file cannot be used to assign access rights or prevent access. But since the major search engines such as Google, Yahoo and Bing have committed themselves to adhering to this guideline, robots.txt can be used to reliably control the indexing of one’s own site.
For the file to actually be read, it must be located in the root directory of the domain (for example, https://www.example.com/robots.txt), and the file name must be written entirely in lower case. Path details within the file are also case-sensitive.
Furthermore, it is important to note that pages can still end up in the index even if they have been excluded in robots.txt. This applies in particular to pages with many backlinks, as these are an important criterion for search engines’ web crawlers.
How is a Robots.txt Structured?
The structure of the file is very simple. At the beginning, the user-agents to which the following rules apply are specified. A user-agent is simply a search engine crawler. To enter the correct names here, you need to know what the individual providers have called their user-agents. The most common user-agents are:
- Googlebot (normal Google search engine)
- Googlebot-News (a bot that is no longer used, but whose instructions are also followed by the normal Googlebot)
- Googlebot-Image (Google image search)
- Googlebot-Video (Google video search)
- Googlebot-Mobile (Google mobile search)
- AdsBot-Google (Google Ads, formerly AdWords)
- Slurp (Yahoo)
- Bingbot (Bing)
The first line of a robots.txt could therefore look like this: “User-agent: Googlebot”. Once the desired user-agents have been specified, the actual instructions follow. As a rule, these begin with “Disallow:”, after which the webmaster specifies which directory or directories the crawlers should ignore during indexing. As an alternative to the Disallow command, an Allow entry can also be made. This makes it easier to separate which directories may be used for indexing and which may not. The Allow entry is optional, however, whereas each block of rules must contain at least one Disallow line.
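Put together, a minimal robots.txt block could look like the following sketch; the directory and file names are purely illustrative:

User-agent: Googlebot
Disallow: /internal/
Allow: /internal/annual-report.html

Here, Googlebot is asked to ignore everything below the hypothetical /internal/ directory, while the Allow line re-admits a single page within it.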
In addition to specifying individual directories, “Disallow” (or “Allow”) can also be used with “wildcards”, i.e. placeholders that allow more general rules for indexing to be formulated. An asterisk (*) serves as a wildcard for any character string. With the entry “Disallow: /*”, for example, the entire domain could be excluded from indexing (the simpler “Disallow: /” has the same effect), while “User-agent: *” can be used to set rules for all web crawlers for the domain. The second placeholder is the dollar sign ($). It specifies that a rule should only apply to the end of a string. With the entry “Disallow: /*.pdf$”, all pages whose URL ends in “.pdf” would be excluded from indexing.
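As an illustration, a wildcard block for all crawlers might look like this; the paths are invented for the example:

User-agent: *
Disallow: /*.pdf$
Disallow: /temp/*

The first rule excludes every URL ending in “.pdf” for all user-agents; the second excludes everything below the hypothetical /temp/ directory.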
An XML sitemap can also be referenced in robots.txt. This requires an entry with the following format: “Sitemap: http://www.textbroker.co.uk/sitemap.xml”. Comment lines can also be inserted. To do this, the relevant line must be preceded by a hash symbol (#).
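For example, the end of a robots.txt could combine a comment with such a sitemap reference:

# Reference to the XML sitemap
Sitemap: http://www.textbroker.co.uk/sitemap.xml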
Robots.txt and SEO
Since robots.txt determines which subpages are used for search engine indexing, it is obvious that the file also plays an important role in search engine optimisation. If, for example, a directory of the domain is excluded, all SEO efforts on the corresponding pages will be for nothing, as the crawlers simply do not pay attention to them. Conversely, robots.txt can also be used specifically for SEO, for example, to exclude certain pages and avoid being penalised because of duplicate content.
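A common scenario, for instance, is to exclude print versions or internal search result pages that merely duplicate existing content; the directory names below are only examples:

User-agent: *
Disallow: /print/
Disallow: /search/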
In general, robots.txt is enormously important for search engine optimisation because it can have a massive effect on a page’s ranking. Accordingly, it must be maintained carefully, because errors can quickly creep in that prevent important pages from being captured by crawlers. Caution is especially important when using wildcards, where a typo or small mistake can have a particularly far-reaching effect. For inexperienced users, it is therefore advisable to set no or very few restrictions in the file at first. Further rules can then be added step by step, for example to make SEO measures more effective.
Help With Creating Robots.txt
Although robots.txt is a simple text file that can easily be written with any text editor, errors of the kind described in the section above have a very significant impact and can, in the worst case, massively harm a page’s ranking.
Fortunately, for all those who do not want to venture into robots.txt by themselves, there are numerous free tools on the internet that make creating the file much easier, including Ryte. In addition, there are free tools for checking the file, such as TechnicalSEO.com. Google also offers corresponding services, which can be accessed via Webmaster Tools (now Google Search Console).
Conclusion
Despite its simple structure and relatively low profile, robots.txt is a very important factor when it comes to SEO and a page’s ranking. Admittedly, the rules laid down in the file are not binding. In most cases, however, they are respected by the search engines’ user-agents, meaning that webmasters can use robots.txt to quickly and easily determine which directories and pages of their domain should be used by the search engines for indexing.
However, due to the extensive influence of this file, it is advisable to become familiar with the required syntax or to use one of the free tools available on the internet. Otherwise, there is a risk of excluding pages from indexing that should be included, and vice versa.