Robots.txt prohibits folder indexing. How to prevent specific pages from being indexed

When promoting a website on your own, it is important not only to create unique content and select queries from Yandex statistics (to form a semantic core); you should also pay due attention to an indicator such as site indexing in Yandex and Google. These two search engines dominate the RuNet, and how complete and fast the indexing of your site is in Yandex and Google largely determines the further success of its promotion.



We have two main tools at our disposal for managing site indexing in Google and Yandex. The first, of course, is the robots.txt file, which lets us prohibit indexing of everything on the site that does not contain the main content (engine files and duplicate content); robots.txt is the subject of this article. Besides robots.txt, there is another important tool for managing indexing, the sitemap (Sitemap xml), which I have already covered in some detail in the article linked to.

Robots.txt - why it is so important to manage site indexing in Yandex and Google

Robots.txt and Sitemap xml (files that allow you to manage site indexing) are very important for the successful development of your project, and this is by no means an unfounded statement. In the article on Sitemap xml (see the link above), I cited the results of a very important study of the most common technical mistakes of novice webmasters: in second and third place (after non-unique content) were precisely robots.txt and Sitemap xml, or rather, either the absence of these files or their incorrect composition and use.

It is necessary to understand very clearly that not all the contents of a site (files and directories) created on any engine (CMS Joomla, SMF or WordPress) should be available for indexing by Yandex and Google (I do not consider other search engines, due to their small share in RuNet search).

If you do not specify certain rules of behavior for search engine bots in robots.txt, then during indexing many pages that are not related to the site's content will end up in search engines, and multiple duplication of information content may also occur (the same material will be available via different site links), which search engines do not like. A good solution is to disable indexing of such material in robots.txt.

The robots.txt file is used to set rules of behavior for search bots. With its help we can influence the process of site indexing by Yandex and Google. Robots.txt is a regular text file that you can create and subsequently edit in any text editor (for example, Notepad++). The search robot will look for this file in the root directory of your site, and if it does not find it, it will index everything it can reach.

Therefore, after writing the required robots.txt file (all letters in the name must be lowercase, no capital letters), it needs to be saved to the root folder of the site, for example using the FileZilla FTP client, so that it is available at this address: http://your_site.ru/robots.txt.

By the way, if you want to know what the robots.txt file of a particular site looks like, it is enough to add /robots.txt to the address of that site's main page. This can be helpful in determining the best option for your own robots.txt file, but keep in mind that the optimal robots.txt file will look different for different site engines (the prohibition of indexing in robots.txt will need to cover different folders and files of the engine). Therefore, if you want to decide on the best version of the robots.txt file, say, for a forum on SMF, you need to study robots.txt files for forums built on this engine.

Directives and rules for writing the robots.txt file (disallow, user-agent, host)

The robots.txt file has a very simple syntax, which is described in great detail, for example, in the Yandex Help. Typically, the robots.txt file indicates which search robot the directives described below are intended for (the "User-agent" directive), then the allowing ("Allow") and prohibiting ("Disallow") directives themselves; the "Sitemap" directive is also actively used to tell search engines exactly where the sitemap file is located.

It is also useful to indicate in the robots.txt file which of your site's mirrors is the main one, using the "Host" directive. Even if your site does not have mirrors, it is useful to indicate in this directive which spelling of your site is the main one, with or without www, because this is also a kind of mirroring. I talked about this in detail in the article: Domains with and without www - the history of their appearance, the use of 301 redirects to glue them together.

Now let's talk a little about the rules for writing a robots.txt file. The directives in the robots.txt file look like this:
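A minimal sketch of such a file (the folder name here is only an example):

User-agent: *
Disallow: /cgi-bin/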

A correct robots.txt file must contain at least one "Disallow" directive after each "User-agent" entry. An empty robots.txt file implies permission to index the entire site.

"User-agent" directive must contain the name of the search robot. Using this directive in robots.txt, you can configure site indexing for each specific search robot (for example, create a ban on indexing a separate folder only for Yandex). An example of writing a “User-agent” directive addressed to all search robots visiting your resource looks like this:

Let me give you a few simple examples of managing site indexing in Yandex, Google and other search engines using robots.txt directives, with an explanation of their actions.

    1. The code below for the robots.txt file allows all search robots to index the entire site without any exceptions. This is specified by an empty Disallow directive.
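    User-agent: *
    Disallow: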

    2. The following robots.txt file will prohibit all search engines from indexing the contents of the /image/ directory (http://mysite.ru/image/ is the path to this directory):
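    User-agent: *
    Disallow: /image/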

    3. When describing paths for the Allow and Disallow directives, you can use the symbols "*" and "$", thus defining certain logical expressions. The symbol "*" means any (including empty) sequence of characters. The following example prevents all search engines from indexing files on the site with the extension ".aspx":

    User-agent: *
    Disallow: *.aspx

To avoid unpleasant problems with site mirrors (see Domains with and without www - the history of their appearance, the use of 301 redirects to glue them together), it is recommended to add the Host directive to robots.txt, which points the Yandex robot to the main mirror of your site. According to the rules for writing robots.txt, the User-agent entry must contain at least one Disallow directive (usually an empty one that does not prohibit anything):

User-agent: Yandex
Disallow:
Host: www.site.ru

Robots and Robots.txt - prohibiting search engines from indexing duplicates on the site


There is another way to configure indexing of individual website pages for Yandex and Google. To do this, the Robots meta tag is placed inside the "HEAD" tag of the desired page, and this is repeated for every page to which a particular indexing rule (ban or allow) needs to be applied. Example of using the meta tag:

<meta name="robots" content="noindex,nofollow">

In this case, the robots of all search engines will have to forget about indexing this page (this is indicated by noindex in the meta tag) and about analyzing the links placed on it (this is indicated by nofollow).

There are only two pairs of Robots meta tag directives: index and follow:

  1. Index - indicates whether the robot can index this page
  2. Follow - whether it can follow links from the page

The default values are "index" and "follow". There is also a shortened version using "all" and "none", which indicate that all directives are active or, accordingly, the opposite: all=index,follow and none=noindex,nofollow.
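In shorthand form, these two extremes of the meta tag look like this:

<meta name="robots" content="all">
<meta name="robots" content="none">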

For a WordPress blog, you can customize the Robots meta tag, for example, using the All in One SEO Pack plugin. Well, that’s it, the theory is over and it’s time to move on to practice, namely, to compiling optimal robots.txt files for Joomla, SMF and WordPress.

As you know, projects created on the basis of any engine (Joomla, WordPress, SMF, etc.) have many auxiliary files that do not carry any information load.

If you do not prohibit indexing of all this garbage in robots.txt, then the time allotted by the Yandex and Google search engines for indexing your site will be spent by the search robots on sorting through the engine files in search of the information component, i.e. the content, which, by the way, in most CMSs is stored in a database that search robots cannot reach anyway (you can work with databases through PhpMyAdmin). In that case, the Yandex and Google robots may have no time left for full indexing of the site.

In addition, you should strive for unique content on your project and should not allow duplication of your site's content (information content) during indexing. Duplication may occur if the same material is available at different URLs. The Yandex and Google search engines will detect duplicates while indexing the site and may take measures to somewhat pessimize your resource if there are a large number of them.

If your project is created on the basis of any engine (Joomla, SMF, WordPress), then duplication of content will occur with a high probability, which means you need to deal with it, including by disabling indexing in robots.txt.

For example, in WordPress, pages with very similar content can be indexed by Yandex and Google if indexing of category content, tag archive content and date archive content is allowed. But if you use the Robots meta tag to ban indexing of the tag archive and the date archive (you can keep the tags but prohibit indexing of category content), then duplication of content will not occur. For this purpose in WordPress it is best to use the capabilities of the All in One SEO Pack plugin.

The situation with duplication of content is even more difficult in the SMF forum engine. If you do not fine-tune (prohibit) site indexing in Yandex and Google via robots.txt, then multiple duplicates of the same posts will end up in the search engine index. Joomla sometimes has a problem with indexing and duplicating the content of regular pages and their print copies.

Robots.txt is intended for setting global rules that prohibit indexing of entire site directories, or of files and directories whose names contain specified characters (by mask). You can see examples of setting such indexing prohibitions in the first part of this article.

To prohibit indexing of a single page in Yandex and Google, it is convenient to use the Robots meta tag, which is written in the header (between the HEAD tags) of the desired page. More details about the syntax of the Robots meta tag are given a little higher in the text. To prohibit indexing of part of a page's content, you can use the NOINDEX tag, but it is supported only by the Yandex search engine.

Host directive in robots.txt for Yandex

Now let's look at specific examples of robots.txt designed for different engines - Joomla, WordPress and SMF. Naturally, all three robots.txt files created for different engines will differ significantly (if not radically) from each other. However, all of these robots.txt files will have one thing in common, and that point is related to the Yandex search engine.

Since the Yandex search engine carries quite a lot of weight in the RuNet, you need to take into account all the nuances of its work, and for correct indexing of a site in Yandex the Host directive is needed in robots.txt. This directive will explicitly indicate to Yandex the main mirror of your site. You can read more about this here: The Host directive, which allows you to set the main website mirror for Yandex.

To specify the Host directive, it is recommended to use a separate User-agent block in the robots.txt file intended only for Yandex (User-agent: Yandex). This is due to the fact that other search engines may not understand the Host directive and, accordingly, its inclusion in the User-agent directive intended for all search engines (User-agent: *) can lead to negative consequences and incorrect indexing of your site.

It is hard to say what the situation really is, because search engine algorithms are a thing in themselves, so it is better to do everything in robots.txt as advised. But in this case, you will have to duplicate in the User-agent: Yandex block all the rules that you specified in the User-agent: * block. If you leave the User-agent: Yandex block with only an empty Disallow: directive, you will thereby allow Yandex to index the entire site. See the sketch below.
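A minimal sketch of such a structure (the /cgi-bin path is only an example):

User-agent: *
Disallow: /cgi-bin

User-agent: Yandex
Disallow: /cgi-bin
Host: www.site.ru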

Before moving on to considering specific options for the robots.txt file, I would like to remind you that you can check the operation of your robots.txt file in Yandex Webmaster and Google Webmaster.

Correct robots.txt for SMF forum

User-agent: *
Allow: /forum/*sitemap
Allow: /forum/*arcade
Allow: /forum/*rss
Disallow: /forum/attachments/
Disallow: /forum/avatars/
Disallow: /forum/Packages/
Disallow: /forum/Smileys/
Disallow: /forum/Sources/
Disallow: /forum/Themes/
Disallow: /forum/Games/
Disallow: /forum/*.msg
Disallow: /forum/*.new
Disallow: /forum/*sort
Disallow: /forum/*topicseen
Disallow: /forum/*wap
Disallow: /forum/*imode
Disallow: /forum/*action

User-agent: Slurp
Crawl-delay: 100

Please note that this robots.txt is for the case where your SMF forum is installed in the forum directory of the main site. If the forum is not in the directory, then simply remove /forum from all rules. The authors of this version of the robots.txt file for a forum on the SMF engine say that it will give the maximum effect for proper indexing in Yandex and Google if you do not activate friendly URLs (FUR) on your forum.

Friendly URLs in SMF can be activated or deactivated in the forum admin panel along the following path: in the left column of the admin panel, select "Characteristics and Settings", and at the bottom of the window that opens find the "Allow friendly URLs" item, where you can check or uncheck the box.

Another correct robots.txt file for an SMF forum (but probably not yet fully tested):

User-agent: *
Allow: /forum/*sitemap
Allow: /forum/*arcade # if the game mod is not installed, delete this line without leaving a blank line
Allow: /forum/*rss
Allow: /forum/*type=rss
Disallow: /forum/attachments/
Disallow: /forum/avatars/
Disallow: /forum/Packages/
Disallow: /forum/Smileys/
Disallow: /forum/Sources/
Disallow: /forum/Themes/
Disallow: /forum/Games/
Disallow: /forum/*.msg
Disallow: /forum/*.new
Disallow: /forum/*sort
Disallow: /forum/*topicseen
Disallow: /forum/*wap
Disallow: /forum/*imode
Disallow: /forum/*action
Disallow: /forum/*prev_next
Disallow: /forum/*all
Disallow: /forum/*go.php # or whatever redirect you have
Host: www.mysite.ru # indicate your main mirror

User-agent: Slurp
Crawl-delay: 100

As you can see, in this robots.txt the Yandex-only Host directive is included in the User-agent block intended for all search engines. I would probably still add a separate User-agent block in robots.txt just for Yandex, repeating all the rules. But decide for yourself.

Note that both versions of this file contain the following block:

User-agent: Slurp
Crawl-delay: 100

This is due to the fact that the Yahoo search engine (Slurp is the name of its search bot) indexes the site in many threads, which can negatively affect its performance. In this robots.txt rule, the Crawl-delay directive allows you to set for the Yahoo search robot the minimum period of time (in seconds) between the end of downloading one page and the start of downloading the next. This relieves the load on the server when the site is indexed by the Yahoo search engine.

To prevent indexing in Yandex and Google of print versions of SMF forum pages, it is recommended to perform the operations described below (to carry them out, you will need to open some SMF files for editing using the FileZilla program). In the Sources/Printpage.php file, find (for example, using the built-in search in Notepad++) the line:

In the Themes/name_of_theme/Printpage.template.php file, find the line:

If you also want the print version to have a link to go to the full version of the forum (if some of the print pages have already been indexed in Yandex and Google), then in the same file Printpage.template.php you find the line with opening HEAD tag:

You can read more about this variant of the robots.txt file for an SMF forum in the corresponding thread of the Russian-language SMF support forum.

Correct robots.txt for a Joomla site

Robots.txt is a special file located in the root directory of the site. The webmaster indicates in it which pages and data should be excluded from indexing by search engines. The file contains directives that describe access to sections of the site (the so-called robots exclusion standard). For example, you can use it to set different access settings for search robots designed for mobile devices and for desktop computers. It is very important to set it up correctly.

Is robots.txt necessary?

With robots.txt you can:

  • prohibit indexing of similar and unnecessary pages so as not to waste the crawl budget (the number of URLs that a search robot can crawl in one crawl); this way the robot will be able to index more important pages.
  • hide images from search results.
  • close unimportant scripts, style files and other non-critical page resources from indexing.

However, if blocking scripts and styles will prevent the Google or Yandex crawler from analyzing pages, do not block these files.

Where is the Robots.txt file located?

If you just want to see what is in the robots.txt file, then simply enter in the address bar of your browser: site.ru/robots.txt.

Physically, the robots.txt file is located in the root folder of the site on the hosting. I have hosting beget.ru, so I will show the location of the robots.txt file on this hosting.


How to create the correct robots.txt

The robots.txt file consists of one or more rules. Each rule blocks or allows path indexing on the site.

  1. In a text editor, create a file called robots.txt and fill it out according to the rules below.
  2. The robots.txt file must be an ASCII or UTF-8 encoded text file. Characters in other encodings are not allowed.
  3. There should be only one such file on the site.
  4. The robots.txt file must be placed in the root directory of the site. For example, to control the indexing of all pages on the site http://www.example.com/, the robots.txt file should be located at http://www.example.com/robots.txt. It must not be in a subdirectory (for example, at the address http://example.com/pages/robots.txt). If you have difficulty accessing the root directory, contact your hosting provider. If you do not have access to the site's root directory, use an alternative blocking method such as meta tags.
  5. The robots.txt file can be added to addresses with subdomains (for example, http://website.example.com/robots.txt) or non-standard ports (for example, http://example.com:8181/robots.txt).
  6. Check the file in the Yandex.Webmaster service and Google Search Console.
  7. Upload the file to the root directory of your site.

Here is an example robots.txt file with two rules. Its explanation is below.

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml

Explanation

  1. A user agent named Googlebot should not index the directory http://example.com/nogooglebot/ and its subdirectories.
  2. All other user agents have access to the entire site (can be omitted, the result will be the same, since full access is granted by default).
  3. The Sitemap file for this site is located at http://www.example.com/sitemap.xml.

Disallow and Allow directives

To prevent indexing and robot access to the site or some of its sections, use the Disallow directive.

User-agent: Yandex
Disallow: / # blocks access to the entire site

User-agent: Yandex
Disallow: /cgi-bin # blocks access to pages
                   # starting with "/cgi-bin"

According to the standard, it is recommended to insert an empty line before each User-agent directive.

The # symbol is intended to describe comments. Everything after this character and before the first line break is not taken into account.

To allow robot access to the site or some of its sections, use the Allow directive:

User-agent: Yandex
Allow: /cgi-bin
Disallow: /
# prohibits downloading everything except pages
# starting with "/cgi-bin"

It is not allowed to have empty line breaks between the User-agent, Disallow and Allow directives.

The Allow and Disallow directives from the corresponding User-agent block are sorted by the length of the URL prefix (from smallest to largest) and are applied sequentially. If several directives are suitable for a given site page, the robot selects the last one in the order of appearance in the sorted list. Thus, the order of the directives in the robots.txt file does not affect how the robot uses them. Examples:

# Original robots.txt:
User-agent: Yandex
Allow: /catalog
Disallow: /
# Sorted robots.txt:
User-agent: Yandex
Disallow: /
Allow: /catalog
# allows downloading only pages
# starting with "/catalog"

# Original robots.txt:
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog
# Sorted robots.txt:
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
# prohibits downloading pages starting with "/catalog",
# but allows pages starting with "/catalog/auto" to be downloaded

If there is a conflict between two directives with prefixes of the same length, the Allow directive takes precedence.

Using special characters * and $

When specifying the paths of the Allow and Disallow directives, you can use the special characters * and $, thus specifying certain regular expressions.

The special character * means any (including empty) sequence of characters.

The special character $ means the end of the line, the character before it is the last one.
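For example, a sketch in the spirit of the Yandex documentation:

User-agent: Yandex
Disallow: /example$ # prohibits "/example", but does not prohibit "/example.html"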

User-agent: Yandex
Disallow: /cgi-bin/*.aspx # prohibits "/cgi-bin/example.aspx"
                          # and "/cgi-bin/private/test.aspx"
Disallow: /*private # prohibits not only "/private",
                    # but also "/cgi-bin/private"

Sitemap Directive

If you are using a Sitemap file to describe the site structure, specify the path to the file as a parameter of the Sitemap directive (if there are several files, specify all of them). Example:

User-agent: Yandex
Allow: /

sitemap: https://example.com/site_structure/my_sitemaps1.xml
sitemap: https://example.com/site_structure/my_sitemaps2.xml

The directive is intersectional, so it will be used by the robot regardless of the place in the robots.txt file where it is specified.

The robot will remember the path to the file, process the data and use the results in subsequent download sessions.

Crawl-delay directive

If the server is heavily loaded and does not have time to process the robot's requests, use the Crawl-delay directive. It allows you to set the search robot the minimum period of time (in seconds) between the end of loading one page and the start of loading the next.
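For example, a delay of two seconds (the value here is only an illustration; pick one based on your server load):

User-agent: *
Crawl-delay: 2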

Before changing the site crawl speed, find out which pages the robot accesses more often.

  • Analyze the server logs. Contact the person responsible for the site or the hosting provider.
  • Look at the list of URLs on the Indexing → Crawl statistics page in Yandex.Webmaster (set the switch to All pages).

If you find that the robot is accessing service pages, prevent them from being indexed in the robots.txt file using the Disallow directive. This will help reduce the number of unnecessary calls from the robot.

Clean-param directive

The directive only works with the Yandex robot.

If site page addresses contain dynamic parameters that do not affect their content (session identifiers, users, referrers, etc.), you can describe them using the Clean-param directive.

Yandex Robot, using this directive, will not repeatedly reload duplicate information. This will increase the efficiency of crawling your site and reduce the load on the server.

For example, the site has pages:

www.example.com/some_dir/get_book.pl?ref=site_1&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_2&book_id=123
www.example.com/some_dir/get_book.pl?ref=site_3&book_id=123

The ref parameter is used only to track which resource the request was made from and does not change the content; the same page with the book book_id=123 will be shown at all three addresses. Then, if you specify the directive as follows:

User-agent: Yandex
Disallow:
Clean-param: ref /some_dir/get_book.pl

The Yandex robot will reduce all page addresses to one:

www.example.com/some_dir/get_book.pl?book_id=123

If such a page is available on the site, it will be included in the search results.

Directive Syntax

Clean-param: p0[&p1&p2&..&pn]

The first field, separated by &, lists the parameters that the robot does not need to take into account. The second field specifies the path prefix of the pages for which the rule should be applied.

Note. The Clean-param directive is intersectional, so it can be specified anywhere in the robots.txt file. If several directives are specified, all of them will be taken into account by the robot.

The prefix can contain a regular expression in a format similar to the robots.txt file, but with some restrictions: only the characters A-Za-z0-9.-/*_ can be used. In this case, the * symbol is interpreted in the same way as in the robots.txt file: the * symbol is always implicitly appended to the end of the prefix. For example:

Clean-param: s /forum/showthread.php

Case is taken into account. There is a limit on the length of the rule - 500 characters. For example:

Clean-param: abc /forum/showthread.php
Clean-param: sid&sort /forum/*.php
Clean-param: someTrash&otherTrash

HOST directive

At the moment, Yandex has stopped supporting this directive.

Correct robots.txt: setup

The contents of the robots.txt file differ depending on the type of site (online store, blog), the CMS used, structure features and a number of other factors. Therefore, creating this file for a commercial website, especially if it is a complex project, should be done by an SEO specialist with sufficient experience.

An unprepared person will most likely not be able to make the right decision regarding which part of the content is better to close from indexing and which part to allow to appear in search results.

Correct Robots.txt example for WordPress

User-agent: *                   # general rules for robots, except Yandex and Google, because the rules for them are below
Disallow: /cgi-bin              # folder on hosting
Disallow: /?                    # all request parameters on the main page
Disallow: /wp-                  # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
Disallow: /wp/                  # if there is a subdirectory /wp/ where the CMS is installed (if not, the rule can be deleted)
Disallow: *?s=                  # search
Disallow: *&s=                  # search
Disallow: /search/              # search
Disallow: /author/              # author archive
Disallow: /users/               # author archive
Disallow: */trackback           # trackbacks, notifications in comments about the appearance of an open link to an article
Disallow: */feed                # all feeds
Disallow: */rss                 # rss feed
Disallow: */embed               # all embeddings
Disallow: */wlwmanifest.xml     # Windows Live Writer manifest xml file (if you don't use it, the rule can be deleted)
Disallow: /xmlrpc.php           # WordPress API file
Disallow: *utm*=                # links with utm tags
Disallow: *openstat=            # links with openstat tags
Allow: */uploads                # open the folder with the uploads files
Sitemap: http://site.ru/sitemap.xml # sitemap address

User-agent: GoogleBot           # rules for Google (I do not duplicate comments)
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: /wp/
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: */wlwmanifest.xml
Disallow: /xmlrpc.php
Disallow: *utm*=
Disallow: *openstat=
Allow: */uploads
Allow: /*/*.js                  # open js scripts inside /wp- (/*/ - for priority)
Allow: /*/*.css                 # open css files inside /wp- (/*/ - for priority)
Allow: /wp-*.png                # images in plugins, cache folder, etc.
Allow: /wp-*.jpg                # images in plugins, cache folder, etc.
Allow: /wp-*.jpeg               # images in plugins, cache folder, etc.
Allow: /wp-*.gif                # images in plugins, cache folder, etc.
Allow: /wp-admin/admin-ajax.php # used by plugins so as not to block JS and CSS

User-agent: Yandex              # rules for Yandex (I do not duplicate comments)
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: /wp/
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: */wlwmanifest.xml
Disallow: /xmlrpc.php
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-Param: utm_source&utm_medium&utm_campaign # Yandex recommends not blocking such pages from indexing, but deleting the tag parameters; Google does not support such rules
Clean-Param: openstat           # similar

Robots.txt example for Joomla

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/

Robots.txt example for Bitrix

User-agent: *
Disallow: /*index.php$
Disallow: /bitrix/
Disallow: /auth/
Disallow: /personal/
Disallow: /upload/
Disallow: /search/
Disallow: /*/search/
Disallow: /*/slide_show/
Disallow: /*/gallery/*order=*
Disallow: /*?print=
Disallow: /*&print=
Disallow: /*register=
Disallow: /*forgot_password=
Disallow: /*change_password=
Disallow: /*login=
Disallow: /*logout=
Disallow: /*auth=
Disallow: /*?action=
Disallow: /*action=ADD_TO_COMPARE_LIST
Disallow: /*action=DELETE_FROM_COMPARE_LIST
Disallow: /*action=ADD2BASKET
Disallow: /*action=BUY
Disallow: /*bitrix_*=
Disallow: /*backurl=*
Disallow: /*BACKURL=*
Disallow: /*back_url=*
Disallow: /*BACK_URL=*
Disallow: /*back_url_admin=*
Disallow: /*print_course=Y
Disallow: /*COURSE_ID=
Disallow: /*?COURSE_ID=
Disallow: /*?PAGEN
Disallow: /*PAGEN_1=
Disallow: /*PAGEN_2=
Disallow: /*PAGEN_3=
Disallow: /*PAGEN_4=
Disallow: /*PAGEN_5=
Disallow: /*PAGEN_6=
Disallow: /*PAGEN_7=

Disallow: /*PAGE_NAME=search
Disallow: /*PAGE_NAME=user_post
Disallow: /*PAGE_NAME=detail_slide_show
Disallow: /*SHOWALL
Disallow: /*show_all=
Sitemap: http://path to your XML format map

Robots.txt example for MODx

User-agent: *
Disallow: /assets/cache/
Disallow: /assets/docs/
Disallow: /assets/export/
Disallow: /assets/import/
Disallow: /assets/modules/
Disallow: /assets/plugins/
Disallow: /assets/snippets/
Disallow: /install/
Disallow: /manager/
Sitemap: http://site.ru/sitemap.xml

Robots.txt example for Drupal

User-agent: *
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
Disallow: /profile
Disallow: /profile/*
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /index.php
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: *register*
Disallow: *login*
Disallow: /top-rated-
Disallow: /messages/
Disallow: /book/export/
Disallow: /user2userpoints/
Disallow: /myuserpoints/
Disallow: /tagadelic/
Disallow: /referral/
Disallow: /aggregator/
Disallow: /files/pin/
Disallow: /your-votes
Disallow: /comments/recent
Disallow: /*/edit/
Disallow: /*/delete/
Disallow: /*/export/html/
Disallow: /taxonomy/term/*/0$
Disallow: /*/edit$
Disallow: /*/outline$
Disallow: /*/revisions$
Disallow: /*/contact$
Disallow: /*downloadpipe
Disallow: /node$
Disallow: /node/*/track$
Disallow: /*&
Disallow: /*%
Disallow: /*?page=0
Disallow: /*section
Disallow: /*order
Disallow: /*?sort*
Disallow: /*&sort*
Disallow: /*votesupdown
Disallow: /*calendar
Disallow: /*index.php
Allow: /*?page=
Disallow: /*?
Sitemap: http://path to your XML format map

ATTENTION!

CMSs are constantly updated, so you may need to block other pages from indexing. Depending on the purpose, a ban on indexing can be removed or, conversely, added.

Check robots.txt

Each search engine has its own requirements for the design of the robots.txt file.

To check the correctness of the syntax and structure of the robots.txt file, you can use one of the online services. For example, Yandex and Google offer their own site analysis services for webmasters, which include robots.txt analysis:

Checking robots.txt for the Yandex search robot

This can be done using a special tool from Yandex - Yandex.Webmaster; there are two options.

Option 1:

In the drop-down list at the top right, select Robots.txt analysis, or follow the link http://webmaster.yandex.ru/robots.xml

Don't forget that all the changes you make to the robots.txt file will not be available immediately, but only after some time.

Checking robots.txt for the Google search robot

  1. In Google Search Console, select your site, go to the inspection tool, and review the contents of your robots.txt file. Syntax and logic errors in it will be highlighted, and their number will be indicated under the editing window.
  2. At the bottom of the interface page, specify the desired URL in the appropriate window.
  3. From the drop-down menu on the right, select the robot.
  4. Click the CHECK button.
  5. The status AVAILABLE or NOT AVAILABLE will be displayed. In the first case, Google robots can go to the address you specified; in the second, they cannot.
  6. If necessary, make changes to the menu and run the check again. Attention! These corrections will not be automatically added to the robots.txt file on your site.
  7. Copy the modified content and add it to the robots.txt file on your web server.

In addition to verification services from Yandex and Google, there are many other online robots.txt validators.

Robots.txt generators

  1. Service from SEOlib.ru. Using this tool you can quickly get and check the restrictions in the Robots.txt file.
  2. Generator from pr-cy.ru. As a result of the Robots.txt generator, you will receive text that must be saved in a file called Robots.txt and uploaded to the root directory of your site.

I was faced with the task of excluding pages containing a certain query string (unique reports for the user, each of which has its own address) from indexing by search engines. I solved this problem for myself, and also decided to fully understand the issues of allowing and prohibiting site indexing. This material is dedicated to this. It covers not only advanced use cases for robots.txt, but also other, lesser-known ways to control site indexing.

There are many examples on the Internet of how to exclude certain folders from indexing by search engines. But a situation may arise when you need to exclude pages, and not all, but containing only the specified parameters.

Example page with parameters: site.ru/?act=report&id=7a98c5

Here act is the name of a variable whose value is report, and id is also a variable, with the value 7a98c5. That is, the query string (parameters) comes after the question mark.

There are several ways to block pages with parameters from indexing:

  • using the robots.txt file
  • using rules in the .htaccess file
  • using the robots meta tag

Controlling indexing in the robots.txt file

Robots.txt file

The robots.txt file is a simple text file that is located in the root directory (folder) of the site and contains one or more entries. A typical example of file content:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this file, three directories are excluded from indexing.

Remember that the line with "Disallow" must be written separately for each URL prefix you want to exclude. That is, you cannot write "Disallow: /cgi-bin/ /tmp/" in one line. Also remember the special meaning of empty lines - they separate blocks of records.

Regular expressions are supported neither in the User-agent line nor in Disallow.

The robots.txt file should be located in the root folder of your site. Its syntax is as follows:

User-agent: *
Disallow: /folder_or_page_prohibited_for_indexing
Disallow: /other_folder

The User-agent value * (asterisk) matches any value, i.e. the rules are intended for all search engines. Instead of an asterisk, you can specify the name of the specific search engine for which the rule is intended.

More than one Disallow directive can be specified.

You can use wildcard characters in your robots.txt file:

  • * denotes 0 or more instances of any valid character. Those. this is any string, including an empty one.
  • $ marks the end of the URL.

Other characters, including &, ?, =, etc. are taken literally.

Prohibiting indexing of a page with certain parameters using robots.txt

So, I want to block addresses like this (where VALUE can be any string): site.ru/?act=report&id=VALUE

The rule for this is:

User-agent: *
Disallow: /*?*act=report&id=*

In it, the / (slash) means the root folder of the site; it is followed by * (asterisk), which means "anything". That is, this can be any relative address, for example:

  • /page.php
  • /order/new/id

Then follows ? (question mark), which is interpreted literally, i.e. as a question mark. Therefore, what follows it is the query string.

The second * means that anything can appear in the query string.

Then comes the sequence of characters act=report&id=*, in which act=report&id= is interpreted literally, as is, and the last asterisk again means any string.

Prohibiting indexing by search engines while allowing advertising network crawlers

If you have closed your site from indexing for search engines, or have closed certain sections of it, then AdSense advertising will not be shown on them! Placing advertisements on pages that are closed from indexing may be considered a violation in other affiliate networks.

To fix this, add to the very beginning of the file robots.txt the following lines:

User-agent: Mediapartners-Google
Disallow:

User-agent: AdsBot-Google*
Disallow:

User-Agent: YandexDirect
Disallow:

With these lines we allow the Mediapartners-Google, AdsBot-Google* and YandexDirect bots to index the site.

Those. the robots.txt file for my case looks like this:

User-agent: Mediapartners-Google
Disallow:

User-agent: AdsBot-Google*
Disallow:

User-Agent: YandexDirect
Disallow:

User-agent: *
Disallow: /*?*act=report&id=*

Prevent all pages with a query string from being indexed

This can be done as follows:

User-agent: *
Disallow: /*?*

This example blocks all pages containing a ? (question mark) in the URL.

Remember: a question mark immediately after the domain name, e.g. site.ru/?, is equivalent to the index page, so be careful with this rule.

Prohibiting indexing of pages with a certain parameter passed by the GET method

For example, if you need to block URLs whose query string contains the order parameter, the following rule is suitable:

User-agent: *
Disallow: /*?*order=

Prevent indexing of pages with any of several parameters

Let's say we want to prevent indexing of pages whose query string contains the dir parameter, or the order parameter, or the p parameter. To do this, list each of the blocking options in separate rules, like this:

User-agent: *
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=

How to prevent search engines from indexing pages that have several specific parameters in their URLs

For example, you need to exclude from indexing a page whose query string contains the dir parameter, the order parameter and the p parameter. For example, a page with this URL should be excluded from indexing: mydomain.com/new-printers?dir=asc&order=price&p=3

This can be achieved using the directive:

User-agent: *
Disallow: /*?dir=*&order=*&p=*

Instead of parameter values ​​that may change constantly, use asterisks. If a parameter always has the same value, then use its literal spelling.
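For example, if the dir parameter always arrives with the same value asc (an assumption for illustration), the rule from the previous example could spell that value out literally:

User-agent: *
Disallow: /*?dir=asc&order=*&p=*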

How to block a site from indexing

To prevent all robots from indexing the entire site:

User-agent: *
Disallow: /

Allow all robots full access

To give all robots full access to index the site:

User-agent: *
Disallow:

Either just create an empty /robots.txt file, or don't use it at all - by default, everything that is not prohibited for indexing is considered open. Therefore, an empty file or its absence means permission for full indexing.

Prohibiting all search engines from indexing part of the site

To close some sections of the site from all robots, use directives of the following type, in which replace the values ​​with your own:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

Blocking individual robots

To block access to individual robots and search engines, use the robot's name in the line User-agent. In this example, access is denied to BadBot:

User-agent: BadBot
Disallow: /

Remember: many robots ignore the robots.txt file, so this is not a reliable means of stopping a site or part of it from being indexed.

Allow the site to be indexed by one search engine

Let's say we want to allow only Google to index the site, and deny access to other search engines, then do this:

User-agent: Google
Disallow:

User-agent: *
Disallow: /

The first two lines give permission to the Google robot to index the site, and the last two lines prohibit all other robots from doing so.

Ban on indexing all files except one

The Allow directive defines paths that should be accessible to the specified search robots. If a path is not specified, it is ignored.

Usage:

Allow: [path]

Important: Allow must come before Disallow.

Note: Allow is not part of the standard, but many popular search engines support it.

Alternatively, using Disallow you can deny access to all folders except one file or one folder.
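A minimal sketch of the Allow approach (the folder and file names are hypothetical): the whole /docs/ directory is closed, but one file in it stays open, because the longer Allow prefix takes precedence.

User-agent: *
Allow: /docs/public.html
Disallow: /docs/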

How to check the operation of robots.txt

Yandex.Webmaster has a tool for checking whether specific addresses are allowed or prohibited for indexing according to your site's robots.txt file.

To do this, go to the Tools tab and select Robots.txt analysis. The file should load automatically; if an old version is shown, click the Check button:

Then, in the Are URLs allowed? field, enter the addresses you want to check. You can enter many addresses at once; each of them must be placed on a new line. When everything is ready, click the Check button.

In the Result column, if a URL is closed for indexing by search robots, it will be marked with a red light; if open, with a green one.

Search Console has a similar tool. It is located in the Crawl section and is called the robots.txt file inspection tool.

If you have updated the robots.txt file, click the Submit button, and then in the window that opens click Submit again:

After this, reload the page (F5 key):

Enter the address to check, select the bot and click the Check button:

Prohibiting page indexing using the robots meta tag

If you want to close a page from indexing, then inside its head tag write:

<meta name="robots" content="noindex, nofollow">

Using the X-Robots-Tag HTTP header, set in the .htaccess file, you can also indicate what type of files are prohibited from indexing.

For example, a ban on indexing all files with the .PDF extension:

Header set X-Robots-Tag "noindex, nofollow"

Prohibition for indexing all image files (.png, .jpeg, .jpg, .gif):

Header set X-Robots-Tag "noindex"
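In .htaccess, these Header lines are usually wrapped in FilesMatch blocks so that they apply only to the chosen file types; a sketch assuming an Apache server with mod_headers enabled:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

<FilesMatch "\.(png|jpe?g|gif)$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>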

Blocking access to search engines using mod_rewrite

In fact, everything that was described above DOES NOT GUARANTEE that search engines and prohibited robots will not access and index your site. There are robots that “respect” the robots.txt file, and there are those that simply ignore it.

Using mod_rewrite, you can block access for certain bots:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Google [OR]
RewriteCond %{HTTP_USER_AGENT} Yandex
RewriteRule ^ - [F]

The above directives will block access to Google and Yandex robots for the entire site.

If, for example, you need to close only one folder, report/, then the following directives will completely block access to this folder (a 403 Access Denied response code will be issued) for the Google and Yandex crawlers:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Google [OR]
RewriteCond %{HTTP_USER_AGENT} Yandex
RewriteRule ^report/ - [F]

If you are interested in blocking access for search engines to individual pages and sections of a site using mod_rewrite, then write in the comments and ask your questions - I will prepare more examples.


The purpose of this guide is to help webmasters and administrators use robots.txt.

Introduction

The robots exclusion standard is very simple at its core. In short, it works like this:

When a robot that follows the standard visits a site, it first requests a file called "/robots.txt". If such a file is found, the robot searches it for instructions prohibiting indexing of certain parts of the site.

Where to place the robots.txt file

The robot simply requests the URL “/robots.txt” on your site; the site in this case is a specific host on a specific port.

Site URL -> robots.txt file URL
http://www.w3.org/ -> http://www.w3.org/robots.txt
http://www.w3.org:80/ -> http://www.w3.org:80/robots.txt
http://www.w3.org:1234/ -> http://www.w3.org:1234/robots.txt
http://w3.org/ -> http://w3.org/robots.txt

There can be only one "/robots.txt" file on a site. For example, you should not place robots.txt files in user subdirectories - robots will not look for them there anyway. If you want to be able to maintain indexing rules per subdirectory, you will need a way to programmatically assemble them into a single robots.txt file located at the root of the site.

Remember that URLs are case sensitive and the file name “/robots.txt” must be written entirely in lowercase.

Wrong locations of robots.txt:
http://www.w3.org/admin/robots.txt - the file is not located at the root of the site
http://www.w3.org/~timbl/robots.txt - the file is not located at the root of the site
ftp://ftp.w3.com/robots.txt - robots do not index FTP
http://www.w3.org/Robots.txt - the file name is not in lowercase

As you can see, the robots.txt file should be placed exclusively at the root of the site.

What to write in the robots.txt file

The robots.txt file usually contains something like:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, indexing of three directories is prohibited.

Note that each directory is listed on a separate line - you cannot write "Disallow: /cgi-bin/ /tmp/". You also cannot split one Disallow or User-agent statement across several lines, because line breaks are used to separate instructions from each other.

Regular expressions and wildcards cannot be used either. The “asterisk” (*) in the User-agent instruction means “any robot”. Instructions like “Disallow: *.gif” or “User-agent: Ya*” are not supported.

The specific instructions in robots.txt depend on your site and what you want to prevent from being indexed. Here are some examples:

Block the entire site from being indexed by all robots

User-agent: *
Disallow: /

Allow all robots to index the entire site

User-agent: *
Disallow:

Or you can simply create an empty file “/robots.txt”.

Block only a few directories from indexing

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

Prevent site indexing for only one robot

User-agent: BadBot
Disallow: /

Allow one robot to index the site and deny all others

User-agent: Yandex
Disallow:

User-agent: *
Disallow: /

Deny all files except one from indexing

This is quite difficult, because there is no "Allow" statement in the original robots exclusion standard. Instead, you can move all files except the one you want to allow for indexing into a subdirectory and prohibit that subdirectory from being indexed:

User-agent: *
Disallow: /docs/

Or you can individually prohibit each of the files that should not be indexed:

User-agent: *
Disallow: /private.html
Disallow: /foo.html
Disallow: /bar.html

From the author: Do you have pages on your website that you don't want search engines to see? From this article you will learn in detail how to prevent page indexing in robots.txt, whether this is correct and how to generally block access to pages.

So, you need to prevent certain pages from being indexed. The easiest way to do this is in the robots.txt file itself by adding the necessary lines to it. Note that we specified folder addresses relative to the site root; the URLs of specific pages are specified in the same way, or you can use an absolute path.

Let's say my blog has a couple of pages: contacts, about me and my services. I wouldn't want them to be indexed. Accordingly, we write:

User-agent: *
Disallow: /kontakty/
Disallow: /about/
Disallow: /uslugi/

Another variant

Great, but this is not the only way to block the robot's access to certain pages. The second is to place a special meta tag in the HTML code. Naturally, place it only on those pages that need to be closed. It looks like this:

<meta name="robots" content="noindex,nofollow">

The tag must be placed in the head container of the HTML document to work correctly. As you can see, it has two attributes. Name is set to robots and indicates that these directions are intended for web crawlers.

The content attribute must have two values, separated by a comma. The first is a ban or permission to index the text information on the page, the second indicates whether to index the links on the page.

Thus, if you want the page not to be indexed at all, specify the values noindex, nofollow, that is, do not index the text and prohibit following links, if any. There is a rule that if there is no text on a page, it will not be indexed. That is, if all the text is closed with noindex, then there is nothing to index, so nothing will be included in the index.

In addition, there are the following values:

noindex, follow – prohibition of text indexing, but permission to follow links;

index, nofollow – can be used when the content should be taken into the index, but all links in it should be closed.

index, follow – default value. Everything is permitted.
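For example, to keep the links on a page crawlable while keeping its text out of the index, the tag (placed in head in the same way) would be:

<meta name="robots" content="noindex, follow">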