robots.txt – why is this simple file still so widely used?


robots.txt, robots.txt, robots.txt..
You may have come across this strange name at least once if you are an internet surfer.

robots.txt, robots.txt, robots.txt..
You have definitely seen this “.txt” more than once if you are a blogger using automated tools like WordPress, Blogspot, etc.

robots.txt, robots.txt, robots.txt..
Now, you are undoubtedly curious about this “text file” if you are hosting and customizing your own blog/website.
Urrrmmm….so let’s try to find out about this mysterious thing together, you and me.

You: “robots.txt is a robot hiding in a file?”
Wakish: “I would say it is just a plain text file.”
You: “Hahaha.. trying to be smart, Wakish? Hah.. My 12-year-old brother knows this much!”

You: “Let’s get serious, I need to know about it!!”
Wakish: “Ok, relax and let me guide you into the world of robots.. urrmm, sorry, robots.txt.”

Let’s try giving a definition of robots.txt. A robots.txt file:

is a simple, plain text file
is placed in the root directory of a website
is used to control which pages, images or other files may be crawled and indexed by a Web Robot (aka Search Engine Spider or bot)
can restrict the access of one specific robot, or of all robots, to a website or part of it
provides instructions for robots crawling a website

Now on the net, whenever 2 or more entities interact, there’s bound to be a protocol (a set of rules), and here this protocol is named The Robots Exclusion Standard or The Robots Exclusion Protocol.

The Robots Exclusion Standard

is a set of rules, or a convention, which governs the behaviour of a Web Robot with respect to a website’s directories
A raw fact: This protocol relies on the cooperation of Web Robots.
Q => What if these Bots do not cooperate?
A => Indeed, not all Bots abide by this protocol, and those that ignore it are called ‘Bad Bots’ or ‘Spam Bots’; so robots.txt is not a 100% guaranteed way to restrict access to your files and directories.

A thought:

In this era of intense competition and the urge to rank high, most intelligent and business-minded companies try to conform as closely as possible to universally accepted standards and protocols. Therefore, in the near future, the population of Bad Bots will hopefully shrink, thus boosting the importance of this protocol.

Mechanism of robots.txt & a Web Robot

STEP 1: Web Spider visits a website, e.g. http://wakish.info/
STEP 2: Spider checks for http://wakish.info/robots.txt
STEP 3: If robots.txt is found, analyse the instructions in the file & proceed to STEP 5
STEP 4: If it is not found, a “file not found” error is recorded in the website’s log file & the spider proceeds to STEP 6

STEP 5: Crawl/index the website according to the instructions defined in robots.txt
STEP 6: Index the website in the manner I (the spider) want

STEP 7: Crawl that website as many times as ‘I want’ (say n times)
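
The steps above can be sketched in Python with the standard library's urllib.robotparser module. This is only an illustration under assumptions: the helper plan_crawl, the bot name ExampleBot and the sample paths are made up for the example, and the content of robots.txt is passed in as a string (or None for "not found") instead of being downloaded.

```python
from urllib.robotparser import RobotFileParser


def plan_crawl(robots_body, agent, paths):
    """Return the paths a polite robot would crawl.

    robots_body is the text of robots.txt, or None when the file
    was not found (STEP 4), in which case nothing is restricted.
    """
    if robots_body is None:
        # STEP 4 -> STEP 6: no instructions, crawl as the robot sees fit.
        return list(paths)

    # STEP 3 -> STEP 5: analyse the instructions and obey them.
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return [p for p in paths if parser.can_fetch(agent, p)]


# Hypothetical rules and paths, for illustration only.
rules = "User-agent: *\nDisallow: /private/\n"
paths = ["/", "/about.html", "/private/notes.html"]

print(plan_crawl(rules, "ExampleBot", paths))  # robots.txt found
print(plan_crawl(None, "ExampleBot", paths))   # robots.txt missing
```

With the file present, the path under /private/ is dropped from the list; with the file missing, every path is kept.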

Benefits of robots.txt

1) Minimize errors in the log file – As you have observed in STEP 4 above, if you do not provide a robots.txt file, an error message is logged. Now if a Web Bot crawls or accesses your site 100 times in a day (STEP 7), then imagine the size of the error log.

2) Save bandwidth – As you have seen in STEP 7, a bot can access your site ‘n’ times a day, each time crawling your images, HTML files, etc. All this may consume considerable bandwidth and server load, especially if you are running a site with a lot of images and graphics. With robots.txt you can tell bots which files to leave alone.

3) Protect the privacy of your website or part of your files – Spiders will crawl ALL your files if you don’t instruct them otherwise. At times you might want to keep a certain image file, for instance, out of public search results.

4) Be more professional – If you are investing effort in doing your best in your endeavour, you should put all chances and opportunities on your side; consider doing things the right way, hence use a robots.txt.

5) Boost your site rank – Web Robots are greedy for information and pay respect (by increasing your PR value in their database) to those who have a lot of good content. Hence, why not provide them with the right content in the right format? So try to strike a balance between so-called ‘machine readability’ and ‘human readability’.

Usage Syntax

When writing your own robots.txt, pay particular attention to the following:
1) Case sensitivity (paths such as /thisFolder/ must match the case used on the website)
2) Use the exact name of the existing web spider
3) Where the colons (:) are placed
4) When to use the asterisk (*)
5) Bots read the instructions in robots.txt in a “Top to Bottom” fashion.
6) The text file should be named “robots.txt”, with the ‘s’ in it

(Note: A text file can be created by using Notepad and saving it with the filename ‘robots’. If you are using Linux, you can do this with an editor like KWrite, Kate or any other text editor, saving it with the name “robots” and explicitly adding the extension ‘.txt’.)
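
Some of the points above (the colon and the exact field names) can be checked mechanically. The following Python sketch is a toy checker, not a full validator: the function name lint_robots and the set of accepted field names are assumptions made for this example, and field names are compared without regard to case.

```python
# Field names this toy checker accepts (an assumption, not an official list).
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap"}


def lint_robots(text):
    """Return a list of (line_number, problem) pairs."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing colon"))
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append((number, "unknown field: " + field))
        elif field == "user-agent":
            seen_agent = True
        elif field in {"disallow", "allow"} and not seen_agent:
            problems.append((number, "rule before any User-agent line"))
    return problems


# A deliberately broken sample file.
sample = (
    "Disallow: /tmp/\n"
    "User-agent *\n"
    "User-agent: *\n"
    "Disalow: /x/\n"
    "Disallow: /y/\n"
)
for problem in lint_robots(sample):
    print(problem)
```

The sample triggers three complaints: a rule placed before any User-agent line, a line without a colon, and a misspelt field name.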

User-agent: *
Disallow:

Explanation

– ‘*’ means ‘all robots that exist’
– Since ‘Disallow’ has no value assigned to it, nothing is off limits: ALL files on the website may be crawled and indexed

line 1 User-agent: BadBot
line 2 Disallow: /
line 3
line 4 User-agent: *
line 5 Disallow: /thisFolder/
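
As a hedged check of the example above, Python's standard urllib.robotparser module can be asked what each robot may fetch. GoodBot and the sample paths are hypothetical names used only for this sketch, and the line labels are left out because they are not part of the file itself.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /thisFolder/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# BadBot matches its own group and is shut out of the whole site.
print(parser.can_fetch("BadBot", "/index.html"))
# Any other robot falls under '*' and is only kept out of /thisFolder/.
print(parser.can_fetch("GoodBot", "/index.html"))
print(parser.can_fetch("GoodBot", "/thisFolder/page.html"))
```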

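Pitfall

– If the instruction set at line 4 came before the one at line 1, a robot that simply reads top to bottom and applies the first group that matches it would treat BadBot as just another ‘*’ robot, so BadBot would still be able to access your site, as per point 5) above! Robots that follow the standard pick the group naming them over the ‘*’ group regardless of order, but listing the specific groups first is the safe habit.

Any other alternative to robots.txt?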

Yes, the HTML META tags.
e.g:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW" />

However, this tag has to be added to every page you want to protect, so it is not really the advised way to restrict web robots. The preferred and more flexible way is undoubtedly the robots.txt file.
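
To illustrate how a robot might read such a tag, here is a minimal Python sketch built on the standard html.parser module. The class name RobotsMetaParser and the sample page are assumptions made for the example; real crawlers may interpret the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before this call.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


# Hypothetical page carrying the tag shown above.
page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW" /></head></html>'
reader = RobotsMetaParser()
reader.feed(page)

print(sorted(reader.directives))
```

A robot honouring the tag would neither index this page (noindex) nor follow its links (nofollow).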

Some additional notes

1) There is software that can help you create your robots.txt in a more user-friendly manner; such tools are known as ‘robots.txt generators’
2) And to check your usage syntax, there are ‘robots.txt syntax checkers’
3) robots.txt also plays a role in gauging ROI (Return On Investment)

ROI – refers to how much profit or revenue a marketing campaign generates; for instance, an investment of $50 in Google AdWords generating a profit of $500.
Now, when a website has not been assigned a Google PR yet, its robots.txt can be examined to understand why; e.g. we might see that Google was inadvertently blocked from crawling that site.
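
Such an examination can be sketched in Python with the standard urllib.robotparser module. The helper blocked_agents, the bot name ExampleBot and the sample rules are hypothetical; Googlebot is used here as the name of Google's crawler.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_body, agents, path="/"):
    """Return the agents that may not fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return [a for a in agents if not parser.can_fetch(a, path)]


# Hypothetical file, e.g. left over from a development phase.
rules = "User-agent: Googlebot\nDisallow: /\n\nUser-agent: *\nDisallow:\n"

print(blocked_agents(rules, ["Googlebot", "ExampleBot"]))
```

If Googlebot shows up in the returned list for the site's home page, the robots.txt is a likely reason why the site is not being crawled.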

I end here, hoping that my effort in producing this article helps you gain enough insight into the famous ‘robots.txt’ so that you can help your friends, help yourself, and move on in your journey on the net with yet more tools in your database of internet knowledge 😉
Cheers!

Valuable Feedback / Comment / Review From People Like You

  1. Garala says:

    Robots.txt files are also used in forum softwares such as vBulletin, IPB, SMF, and many others.

    I never have used robots.txt.

  2. Ikki says:

    Hey Wakish!

    Read your comment on my blog and wanted to come by 😉 I never used robots.txt on my sites until a few days ago when I decided to do a little research about it.

    Everyone should use it to help crawlers index their sites better 😉

  3. Wakish says:

    Thanks guys!

  4. Tusetaulley says:

    Greetings,

    What is the best web hosting company?

    I’m need to build a web site for my boss.

    Thank you,

    -Jen

  5. Wakish says:

    Hi Tusetaulley, it really depends on your boss’s requirements, on how much traffic you are expecting, and on the nature of the web system you plan to build. If you provide some more info, I could perhaps give you more relevant details.

    Else, I would recommend host like HostIcan, MediaTemple..
    or even bluehost (for medium size websites)

  6. Robot text can do alot.

  7. rams says:

    hi wakish!
    i’ve been reading blogs or articles on your website since i came at work(around 4 hours ago). the reason i’am still on your website, i guess u guessed..;) the site is very informative. i’ve been recently give responsibility to work on SEO. i guess i’ll go on reading for some more time everything i find useful and necessary before i start scripting my own documentation to submit to my boss. Thanks once again. good luck

  8. Wakish says:

    Hi rams, thanks for your visit, wishes and appreciation 😉
