Robots are automated programs run by search engines on the Internet, such as Google, Yahoo or Bing. These robots visit web pages to gather the information they contain and add it to the search engine's index, a process usually known as indexing or positioning a website on the Internet.
With a well-configured file you can help these robots pick out the right information more quickly, which means better crawlability, better positioning in search engines and, moreover, fewer problems overall.
These robots are also known as "spiders", "crawlers", "bots" or indexers.
1.- What the robots.txt file is and what it is for
The robots.txt file is a plain text file created by the user to control robot access to the hosting. This file sets out recommendations that search robots are expected to follow. In other words, it tells them what you do not want them to index. This way they select your website's information better and improve its positioning.
The robots.txt file must be uploaded to the root of the hosting to tell robots which pages or directories you do not want indexed. There must be only one robots.txt file per website.
Setting up this file is important, as it brings benefits, for example:
It helps important web content get indexed more smoothly, improving positioning on the Internet. It can also speed up robot crawling, improving how the website performs.
It blocks access for certain robots, since some of them only cause problems on the web because they are not search engines. It also limits the information you expose, so that private personal data cannot be found on Google.
It reduces server overload, because the access rate of some robots can be controlled. Some robots make such a high number of requests that they can saturate the server and slow down navigation for real users.
2.- How to create a robots.txt file
The robots file is created using two directives:
User-agent: (spider name)
Disallow: (path)
The spider name is the name of the search engine robot. To indicate that the restrictions apply to all search engines, put "*" instead of the robot's name.
The "path" is the name of the file or folder you do not want indexed. To block indexing of every document in a directory, the path must end with the "/" character after the directory name. In other words, the format is:
Disallow: /directory/
Examples:
Disallow: / blocks access to the whole hosting.
Disallow: /forum/ blocks access to the forum directory.
Disallow: allows access to the whole hosting.
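These rules can be tested locally with Python's standard urllib.robotparser module. This is a minimal sketch: example.com and the /forum/ rule simply mirror the examples above.

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring the examples above (hypothetical site)
rules = """
User-agent: *
Disallow: /forum/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The home page is allowed, the forum directory is not
print(parser.can_fetch("*", "https://example.com/index.html"))    # True
print(parser.can_fetch("*", "https://example.com/forum/topic1"))  # False
```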
3.- How to enter comments in a file
If you want to add comments to the file, start the line with the "#" character. That line is then treated as a comment and ignored.
Example:
# We give Webcrawler full access, as Disallow is empty.
User-agent: webcrawler
Disallow:
4.- What the "Crawl-delay" directive is
If you check your statistics, you may see that some robots crawling the web make so many requests to the server that they overload it. To avoid this overload, you can use the "Crawl-delay" directive, which sets the time between each robot access.
Example:
User-agent: *
Crawl-delay: 60
This tells robots to wait 60 seconds between each access. The drawback of this directive is that it does not affect all robots; some that do honour it are MSNBot, Slurp and Googlebot.
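Python's standard urllib.robotparser module can read this value back, which is handy when checking the file. A minimal sketch with hypothetical rules:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules declaring a 60-second delay for all robots
rules = """
User-agent: *
Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite crawler should wait this many seconds between requests
print(parser.crawl_delay("*"))  # 60
```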
5.- Other directives to control access time
To control when robots index pages, the following directives can be used (note that these are non-standard extensions and many robots ignore them):
# Allow search engines from 2:00 am to 7:45 am (times are always Greenwich Mean Time)
Visit-time:0200-0745
# One document every 30 minutes
Request-rate: 1/30m
# Combined: one document every 10 minutes, between 1 p.m. and 5 p.m.
Request-rate: 1/10m 1300-1659
It is important to check the file before uploading it to the hosting: if it contains errors, unwanted robots may index the website, or it may be indexed incorrectly. It could also happen that none of the robots you do want to index the website can do so properly.
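One way to check the file before uploading it is to test a few representative URLs against a draft of the rules, for example with Python's standard urllib.robotparser module. This is a sketch: the draft rules and URLs are hypothetical placeholders for your own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft of the robots.txt you are about to upload
draft = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(draft.splitlines())

# URLs you expect to stay indexable (True) or blocked (False)
checks = {
    "https://example.com/": True,
    "https://example.com/private/data.html": False,
}

for url, expected in checks.items():
    allowed = parser.can_fetch("*", url)
    print("OK" if allowed == expected else "UNEXPECTED", url)
```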
6.- What a robots.txt file should look like
To allow all robots access to the hosting:
User-agent: *
Disallow:
Crawl-delay: 60
To deny all robots access to the hosting:
User-agent: *
Disallow: /
Crawl-delay: 60
To deny robots access to a particular page:
User-agent: *
Disallow: /file.html
Request-rate: 1/10m 1300-1659
To limit access to specific directories:
This setting is recommended, as it blocks all robots from the folders you have specified, and also restricts how often robots can access the site to avoid saturating the server.
User-agent: *
Disallow: /Folder1/
Disallow: /Folder2/
Crawl-delay: 60
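The configurations above can be verified the same way. A small helper using Python's standard urllib.robotparser module (a sketch, with hypothetical URLs) shows the difference between blocking everything and blocking a single page:

```python
from urllib.robotparser import RobotFileParser

def allowed(rules, url, agent="*"):
    """Parse a robots.txt body and test whether one URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)

# Blocking the whole hosting blocks every URL
print(allowed("User-agent: *\nDisallow: /", "https://example.com/a.html"))              # False
# Blocking a single page only affects that page
print(allowed("User-agent: *\nDisallow: /file.html", "https://example.com/file.html"))  # False
print(allowed("User-agent: *\nDisallow: /file.html", "https://example.com/other.html")) # True
```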
7.- How to set up a robots.txt file in a particular CMS
Many content managers such as Joomla, Drupal, WordPress, etc., are likely to already ship their own robots.txt along with the application. All that needs to be done is to add the "Crawl-delay" directive so as not to overload the page, and to indicate the directories or pages that should not be indexed.
Examples of robots.txt:
For WordPress:
User-agent: *
Crawl-Delay: 60
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /files/
Disallow: /category/
Disallow: /tag/*
Disallow: /tag/
Disallow: /wp-*
Disallow: /login/
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.php$
User-agent: *
Allow: /
User-agent: Googlebot-Image
Disallow: /
User-agent: Jennifer
Disallow: /
User-agent: duggmirror
Disallow: /
For a Drupal:
User-agent: *
Crawl-delay: 60
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /sites/
Disallow: /themes/
# Files
Disallow: /changelog.txt
Disallow: /cron.php
Disallow: /install.mysql.txt
Disallow: /install.pgsql.txt
Disallow: /install.php
Disallow: /install.txt
Disallow: /license.txt
Disallow: /maintainers.txt
Disallow: /update.php
Disallow: /upgrade.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password
Disallow: /user/login
# Paths (not clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/register/
Disallow: /?q=user/password/
Disallow: /?q=user/login/
# Extras on drupal.org
# no access for table sorting paths or any paths that have parameters
Disallow: /*?sort*
Disallow: /*&sort*
Disallow: /*?solrsort*
Disallow: /*&solrsort*
# no access to profiles that are often targeted by spammers.
Disallow: /profile/interest/*
Disallow: /profile/industries/*
Disallow: /profile/companies/*
# Disallow bogus aggregator pages
Disallow: /aggregator
# Disallow project search
Disallow: /project/issues/search/*
Disallow: /project/issues/*
# Disallow book export
Disallow: /book/export/*
# Disallow pift tests
Disallow: /pift/retest/*
# Disallow project subscription
Disallow: /project/issues/subscribe-mail/*
For a Joomla:
User-agent: *
Crawl-delay: 60
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/
For a Prestashop:
User-agent: *
Crawl-delay: 60
Disallow: /cgi-bin/
Disallow: /img/
Disallow: /js/
Disallow: /mails/
Disallow: /modules/
Disallow: /themes/
Disallow: /translations/
Disallow: /tools/
Disallow: /override/
Disallow: /classes/
Disallow: /config/
Disallow: /controllers/
Disallow: /download/
Disallow: /localization/
Disallow: /log/
Disallow: /tests/
Disallow: /upload/
Disallow: /webservice/
Disallow: /404.php
Disallow: /address.php
Disallow: /addresses.php
Disallow: /authentication.php
Disallow: /best-sales.php
Disallow: /cart.php
Disallow: /category.php
Disallow: /cms.php
Disallow: /contact-form.php
Disallow: /discount.php
Disallow: /guest-tracking.php
Disallow: /history.php
Disallow: /identity.php
Disallow: /images.inc.php
Disallow: /init.php
Disallow: /my-account.php
Disallow: /order.php
Disallow: /order-detail.php
Disallow: /order-follow.php
Disallow: /order-opc.php
Disallow: /order-slip.php
Disallow: /order-history.php
Disallow: /pagination.php
Disallow: /password.php
Disallow: /pdf-invoice.php
Disallow: /pdf-order-return.php
Disallow: /pdf-order-slip.php
Disallow: /product-sort.php
Disallow: /product-comparison.php
Disallow: /product.php
Disallow: /search.php
Disallow: /statistics.php