How to set up the robots.txt file

Robots are automated programs run by Internet search services such as Google, Yahoo or Bing. These robots visit web pages looking for information and add what they find to the search engines, a process usually known as indexing or positioning a website on the Internet.

A well-configured file helps these robots pick out the right information more quickly, giving you better web crawlability and better positioning in search engines, and it can also avoid some drawbacks.

These robots are also called "spiders", "crawlers", "bots" or indexers.

1.- What the robots.txt file is and what it is for
The robots.txt file is a plain text file created by the user to control robot access to the hosting. This file sets out recommendations that search robots are expected to follow. In other words, it tells them what you do not want them to index. That way, they select your website's information better and your positioning improves.

The robots.txt file must be uploaded to the root of the hosting to tell robots which pages or directories you do not want indexed. There must be only one robots.txt file per website.

Setting up this file is important, as it brings several benefits:
  • It helps important web content get indexed more smoothly, improving positioning on the Internet. It can also speed up robot crawling, improving use of the web.
  • It blocks access for certain robots; some of them only cause problems on the web because they are not search engines. It also limits the information you expose, so that private personal data cannot be found on Google.
  • It reduces server overload, because the access rate of some robots can be controlled. Some of these robots make such a high number of requests that they can saturate the server and slow down navigation for real users.

2.- How to create a robots.txt file
The robots file is created using two directives:

User-agent: (spider name)
Disallow: (path)

The spider name is the name of the search-engine robot. To indicate that the restrictions affect all search engines, put "*" instead of the name of a specific robot.
The path is the name of the file or folder you do not want indexed. To block the indexing of every document in a directory, the path must include the "/" character at the end of the directory name. In other words, the format is:

Disallow: /directory/
Examples:

Disallow: / forbids access to the entire hosting.
Disallow: /forum/ forbids access to the forum directory.
Disallow: (left empty) allows access to the entire hosting.
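As a sketch of how these rules behave, Python's standard urllib.robotparser module can evaluate Disallow rules against sample URLs. The rule set and the example.com URLs below are illustrative assumptions, not part of any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set mirroring the examples above:
# forbid the /forum/ directory for all robots.
rules = """
User-agent: *
Disallow: /forum/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(robot, url) answers: may this robot index this URL?
print(parser.can_fetch("*", "https://example.com/forum/topic1"))  # False
print(parser.can_fetch("*", "https://example.com/index.html"))    # True
```

Any URL whose path starts with /forum/ is refused, while everything else stays indexable, which is exactly what the second example rule above expresses.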

3.- How to enter comments in a file
If you want to enter a comment in the file, start the line with the "#" sign. This marks the line as a comment, and robots will not take it into account.

Example:
# We give full access to WebCrawler, as Disallow is empty.
User-agent: webcrawler
Disallow:

4.- What the "Crawl-delay" is
If you check your statistics, you will sometimes see robots that crawl the web making a multitude of requests to the server until it becomes overloaded. To avoid this overload, use the "Crawl-delay" directive, which indicates the time, in seconds, that must pass between each robot access.

Example:

User-agent: *
Crawl-delay: 60

This indicates that robots have to wait 60 seconds between accesses. The drawback of this directive is that it does not affect all robots; some that do honor it are MSNBot, Slurp and Googlebot.
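The same standard urllib.robotparser module can read the Crawl-delay value back out of a rule set, which is a quick way to confirm the directive was written correctly. The 60-second record below is an illustrative assumption:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical record with a 60-second delay for all robots.
rules = """
User-agent: *
Crawl-delay: 60
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# crawl_delay() (available since Python 3.6) returns the delay in
# seconds for a given robot, or None if no Crawl-delay applies.
print(parser.crawl_delay("*"))  # 60
```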

5.- Other directives to control access time.
To monitor the time when robots index pages, some of these directives can be used:

# Allow search engines from 2:00 to 7:45 (times are always Greenwich Mean Time)
Visit-time: 0200-0745
# One document every 30 minutes
Request-rate: 1/30m
# Combined: one document every 10 minutes, and only between 1 p.m. and 5 p.m.
Request-rate: 1/10m 1300-1659

It is important to check the file before uploading it to the hosting: if it contains any errors, unwanted robots may index the web, or index it incorrectly, and it could even happen that none of the robots you do want indexing the web manage to do so correctly.
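One way to run that check is to parse the draft file locally with Python's standard urllib.robotparser module and spot-check the URLs you care about before uploading. The draft content and the example.com URLs below are illustrative assumptions:

```python
from urllib.robotparser import RobotFileParser

# Draft robots.txt content to verify before uploading (illustrative).
draft = """
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(draft.splitlines())

# URLs that must stay indexable should map to True, hidden ones to False.
checks = {
    "https://example.com/": True,
    "https://example.com/private/data.html": False,
    "https://example.com/tmp/cache.html": False,
}
for url, expected in checks.items():
    assert parser.can_fetch("*", url) == expected, url
print("robots.txt behaves as expected")
```

If any assertion fails, the failing URL is printed, telling you which rule to revisit before the file goes live.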

6.- What a robots.txt file should look like
To allow all robots access to the hosting:

User-agent: *
Disallow:
Crawl-delay: 60

To deny all robots access to the hosting:

User-agent: *
Disallow: /
Crawl-delay: 60

To deny robots access to a particular page:

User-agent: *
Disallow: /file.html
Request-rate: 1/10m 1300-1659

To limit access to specific directories:
This is the recommended setup, as it stops all robots from accessing the folders you have indicated and also limits how often robots access the site, avoiding saturation of the server.

User-agent: *
Disallow: /Folder1/
Disallow: /Folder2/
Crawl-delay: 60

A list of all robots is available on this website: http://www.robotstxt.org/db.html

7.- How to set up a robots.txt file in a particular CMS
Many content managers, such as Joomla, Drupal or WordPress, usually ship with their own robots.txt alongside the application. All that needs to be done is to add the "Crawl-delay" directive, so the page does not get overloaded, and to indicate the directories or articles that should not be indexed.
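As a small sketch of that idea, a helper could append a Crawl-delay directive to an existing CMS robots.txt only when one is missing. The function name `ensure_crawl_delay` and the 60-second default are assumptions for illustration, not part of any CMS:

```python
def ensure_crawl_delay(robots_txt: str, delay: int = 60) -> str:
    """Return robots_txt with a Crawl-delay directive appended
    if the file does not already contain one (case-insensitive)."""
    if "crawl-delay" in robots_txt.lower():
        return robots_txt
    return robots_txt.rstrip("\n") + f"\nCrawl-delay: {delay}\n"

# Example with a minimal WordPress-style file.
cms_robots = "User-agent: *\nDisallow: /wp-admin/\n"
print(ensure_crawl_delay(cms_robots))
```

Because the check is case-insensitive and the function returns unchanged input when the directive already exists, running it twice never duplicates the line.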

Examples of robots.txt:
For WordPress:

User-agent: *
Crawl-Delay: 60
Disallow: /wp-content/
Disallow: /wp-includes
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /files/
Disallow: /category/
Disallow: /tag/*
Disallow: /tag/
Disallow: /wp-*
Disallow: /login/
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.php$
User-agent:
Allow: /
User-agent: Googlebot-Image
Disallow: /
User-agent: Jennifer
Disallow: /
User-agent: duggmirror
Disallow: /

For Drupal:

User-agent: *
Crawl-delay: 60
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /sites/
Disallow: /themes/
# Files
Disallow: /changelog.txt
Disallow: /cron.php
Disallow: /install.mysql.txt
Disallow: /install.pgsql.txt
Disallow: /install.php
Disallow: /install.txt
Disallow: /license.txt
Disallow: /maintainers.txt
Disallow: /update.php
Disallow: /upgrade.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password
Disallow: /user/login
# Paths (not clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/register/
Disallow: /?q=user/password
Disallow: /?q=user/login
# Extras on drupal.org
# no access for table sorting paths or any paths that have parameters
Disallow: /*?sort*
Disallow: /*&sort*
Disallow: /*?solrsort*
Disallow: /*&solrsort*
# no access to profiles that are often targeted by spammers.
Disallow: /profile/interest/*
Disallow: /profile/industries/*
Disallow: /profile/companies/*
# Disallow bogus aggregator pages
Disallow: /aggregator
# Disallow project search
Disallow: /project/issues/search/*
Disallow: /project/issues/*
# Disallow book export
Disallow: /book/export/*
# Disallow pift tests
Disallow: /pift/retest/*
# Disallow project subscription
Disallow: /project/issues/subscribe-mail/*

For Joomla:

User-agent: *
Crawl-delay: 60
Disallow: /administrator
Disallow: /cache/
Disallow: /components/
Disallow: /images/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /media/
Disallow: /modules/
Disallow: /plugins/
Disallow: /templates/
Disallow: /tmp/
Disallow: /xmlrpc/

For PrestaShop:

User-agent: *
Crawl-delay: 60
Disallow: /cgi-bin/
Disallow: /img/
Disallow: /js/
Disallow: /mails/
Disallow: /modules/
Disallow: /themes/
Disallow: /translations/
Disallow: /tools/
Disallow: /override/
Disallow: /classes/
Disallow: /config/
Disallow: /controllers/
Disallow: /download/
Disallow: /localization/
Disallow: /log/
Disallow: /tests/
Disallow: /upload/
Disallow: /webservice/
Disallow: /404.php
Disallow: /address.php
Disallow: /addresses.php
Disallow: /authentication.php
Disallow: /best-sales.php
Disallow: /cart.php
Disallow: /category.php
Disallow: /cms.php
Disallow: /contact-form.php
Disallow: /discount.php
Disallow: /guest-tracking.php
Disallow: /history.php
Disallow: /identity.php
Disallow: /images.inc.php
Disallow: /init.php
Disallow: /my-account.php
Disallow: /order.php
Disallow: /order-detail.php
Disallow: /order-follow.php
Disallow: /order-opc.php
Disallow: /order-slip.php
Disallow: /order-history.php
Disallow: /pagination.php
Disallow: /password.php
Disallow: /pdf-invoice.php
Disallow: /pdf-order-return.php
Disallow: /pdf-order-slip.php
Disallow: /product-sort.php
Disallow: /product-comparison.php
Disallow: /product.php
Disallow: /search.php
Disallow: /statistics.php


For more information, you can contact us.