robots.txt - why is this simple file still so widely used?

    Posted on: Tuesday, June 5th, 2007 | Written by Wakish

    robots.txt, robots.txt, robots.txt..
    You may have come across this strange name at least once if you are an internet surfer.

    robots.txt, robots.txt, robots.txt..
    You have definitely seen this “.txt” more than once if you are a blogger using automated tools like WordPress or Blogspot.

    robots.txt, robots.txt, robots.txt..
    Now, you are undoubtedly curious about this “text file” if you are hosting and customizing your own blog or website.
    Urrrmmm… so let’s try to find out about this mysterious thing together, you and me :)

    You: “robots.txt is a robot hiding in a file?”
    Wakish: “I would say it is just a plain text file.”
    You: “Hahaha.. trying to be smart, Wakish? Hah.. my 12-year-old brother knows this much!”

    You: “Let’s get serious, I need to know about it!!”
    Wakish: “OK, relax and let me guide you into the world of robots.. urrmm, sorry, robots.txt :)”

    Let’s try giving a definition of robots.txt

    - It is a simple, plain text file.
    - It is placed in the root directory of a website.
    - It controls which pages, images or other files a Web Robot (aka search engine spider or bot) may crawl.
    - It can restrict access to the whole website, or part of it, for one specific robot or for all robots.
    - It provides instructions for robots crawling the website.

    Now on the net, whenever two or more entities interact, there’s bound to be a protocol (a set of rules). Here, the protocol is named The Robots Exclusion Standard, or The Robots Exclusion Protocol.

    The Robots Exclusion Standard

    - is a set of rules, or a convention, which governs the behaviour of a Web Robot with respect to a website’s directories
    A raw fact: This protocol relies on the cooperation of Web Robots.
    Q => What if these bots do not cooperate?
    A => Nothing forces them to. Not all bots abide by this protocol, and those that ignore it are called ‘Bad Bots’ or ‘Spam Bots’; so robots.txt is not a 100% guaranteed way to restrict access to your files and directories.

    A thought:

    In this era of intense competition and the urge for high rankings, most intelligent, business-minded companies try to conform as closely as possible to universally set standards and protocols. Therefore, in the near future, the population of Bad Bots will probably shrink, boosting the importance of this protocol.

    Mechanism of robots.txt & a Web Robot

    STEP 1: Web Spider visits a website, e.g. http://wakish.info/
    STEP 2: Spider checks for http://wakish.info/robots.txt
    STEP 3: If robots.txt is found, analyse the instructions in the file & proceed to STEP 5
    STEP 4: If not found, the failed request is recorded as an error in the site’s log file & the spider proceeds to STEP 6

    STEP 5: Crawl/index the website according to the instructions defined in robots.txt, then proceed to STEP 7
    STEP 6: Index the website in the manner I (the spider) want

    STEP 7: Crawl that website as many times as ‘I want’ (say n times)
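The steps above can be sketched in a few lines of Python using the standard library's `urllib.robotparser`. This is only a rough illustration, not how any particular search engine works: the function name `plan_crawl`, the robot name `ExampleBot` and the sample paths are made up for the example, and the robots.txt text is passed in as a string instead of being downloaded.

```python
from urllib.robotparser import RobotFileParser


def plan_crawl(robots_txt, user_agent, paths):
    """Return the paths a cooperative robot would crawl.

    robots_txt is the text of the site's robots.txt, or None when the
    file was not found (STEP 4). plan_crawl is a hypothetical helper.
    """
    if robots_txt is None:
        # STEP 4 -> STEP 6: no file, so the spider crawls as it likes.
        return list(paths)
    # STEP 3: analyse the instructions in the file.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # STEP 5: keep only the paths the instructions allow.
    return [p for p in paths if parser.can_fetch(user_agent, p)]


sample = "User-agent: *\nDisallow: /private/\n"
paths = ["/index.html", "/private/notes.html"]

print(plan_crawl(sample, "ExampleBot", paths))  # robots.txt found
print(plan_crawl(None, "ExampleBot", paths))    # robots.txt missing
```

A real spider would fetch the file over HTTP first; `RobotFileParser` also offers `set_url()` and `read()` for that.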

    Benefits of robots.txt

    1) Minimize errors in the log file - As you observed in STEP 4 above, if you do not provide a robots.txt file, an error message is logged. Now if a Web Bot crawls or accesses your site 100 times in a day (STEP 7), imagine the size of the error logs.

    2) Save bandwidth - As you saw in STEP 7, a bot can access your site ‘n’ times a day, crawling your images, HTML files, etc. each time. All this may consume considerable bandwidth and add server load, especially if you are running a site with a lot of images and graphics. Using robots.txt lets you define this behaviour.

    3) Keep part of your website private - Spiders will crawl ALL your files if you don’t instruct them otherwise. At times you might want to keep a certain image file, for instance, out of public view.

    4) Be more professional - If you are investing effort in doing your best in your endeavour, you should put every chance and opportunity on your side; consider doing things the right way, and use a robots.txt.

    5) Boost your site rank - Web Robots are greedy for information and pay respect (by increasing your PR value in their database) to those who have a lot of good content. Hence, why not provide them with the right content in the right format? :) So try to strike a balance between so-called “machine readability” and “human readability”.

    Usage Syntax

    When writing your own robots.txt, pay particular attention to the following:
    1) Case sensitivity
    2) Use the exact name of the existing web spider
    3) Where the colons (:) are placed
    4) When to use the asterisk (*)
    5) Bots read the instructions in robots.txt in a “Top to Bottom” fashion.
    6) The text file should be named “robots.txt”, with the ‘s’ in it

    (Note: A text file can be created using Notepad and saving it with the filename ‘robots’. If you are using Linux, you can do this with an editor like KWrite, Kate or any other text editor, saving it with the name “robots” and explicitly adding the extension ‘.txt’)

    User-agent: *
    Disallow:

    Explanation

    - ‘*’ means ‘all robots that exist’
    - Since ‘Disallow’ has no value assigned to it, nothing is blocked, so ALL files on the website can be crawled and indexed
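To see this behaviour in practice, here is a small sketch that feeds the two-line file to Python's standard `urllib.robotparser`. The robot names and paths are placeholders chosen for the example; the second rule set (`Disallow: /`) is added only as a contrast.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # contrast case, blocks everything


def may_fetch(rules, agent, path):
    """Ask Python's parser whether `agent` may fetch `path` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Placeholder robot names and paths, used only for illustration.
for agent in ("Googlebot", "ExampleBot"):
    for path in ("/", "/images/logo.png", "/private/page.html"):
        print(agent, path,
              "empty Disallow:", may_fetch(ALLOW_ALL, agent, path),
              "| Disallow /:", may_fetch(BLOCK_ALL, agent, path))
```

With the empty `Disallow` every check comes back allowed, while `Disallow: /` refuses every path.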

    line 1 User-agent: BadBot
    line 2 Disallow: /
    line 3
    line 4 User-agent: *
    line 5 Disallow: /thisFolder/
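As a quick check of what these five lines mean, the sketch below runs them (without the ‘line N’ labels) through Python's standard `urllib.robotparser`. `GoodBot` and the sample paths are invented for the example, and the result reflects how Python's parser interprets the rules, which is not necessarily how every robot does.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /thisFolder/
"""


def allowed(agent, path, rules=RULES):
    """Return True if `agent` may fetch `path` according to `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# BadBot is shut out of the whole site.
print(allowed("BadBot", "/index.html"))
# Every other robot (GoodBot is a made-up name) is only kept out of /thisFolder/.
print(allowed("GoodBot", "/index.html"))
print(allowed("GoodBot", "/thisFolder/page.html"))
```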

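    Pitfall

    - If the instruction set from line 4 came before the one at line 1, a bot that simply stops at the first matching group would match ‘*’ first, and BadBot would still be able to access your site (see usage 5 above). Robots that follow the standard apply the most specific matching User-agent group wherever it appears, but listing specific bots first is the safe order.

    Any other alternative to robots.txt?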

    Yes, the HTML META tags.
    e.g:

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW" />

    But this is not really the advised way to restrict web robots. The preferred and more flexible way is undoubtedly robots.txt.
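For the curious, this is roughly how a robot could read that META tag from a page. It is a minimal sketch using Python's standard `html.parser`; the class name `RobotsMetaParser`, the helper `robots_directives` and the sample page are made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW" /></head></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```

Note that the tag has to be added to every page you want to protect, which is one reason a single robots.txt file is more convenient.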

    Some additional notes

    1) There are tools that help you create your robots.txt in a more user-friendly manner; they are known as ‘robots.txt generators’
    2) To check your syntax, there are ‘robots.txt syntax checkers’
    3) robots.txt also plays a role in gauging ROI (Return On Investment)

    ROI refers to how much profit or revenue a marketing campaign generates; for instance, an investment of $50 in Google AdWords generating a profit of $500.
    Now, when a website has not been assigned a Google PR yet, the robots.txt can be examined to understand why; for example, we might see that Google was inadvertently blocked from crawling that site.

    I end here, hoping that this article helps you gain enough insight into the famous ‘robots.txt’ to help yourself and your friends, and to continue your journey on the net with yet more tools in your database of internet knowledge ;)
    Cheers!

    Posted in: Webmasters, Internet, Did You Know..

      (2) Comments made - Say your part!

    1. From Garala on March 9th, 2008 at 5:59 pm

      Robots.txt files are also used in forum softwares such as vBulletin, IPB, SMF, and many others.

      I never have used robots.txt.

    2. From Ikki on April 11th, 2008 at 10:34 pm

      Hey Wakish!

      Read your comment on my blog and wanted to come by ;) I never used robots.txt on my sites until a few days ago when I decided to do a little research about it.

      Everyone should use it to help crawlers index their sites better ;)
