robots.txt, robots.txt, robots.txt..
You have definitely seen this “.txt” more than once if you are a blogger using automated tools like WordPress, Blogspot, etc.
robots.txt, robots.txt, robots.txt..
Now, you are undoubtedly curious about this “text file” if you are hosting and customizing your own blog/website.
Urrrmmm….so let’s try to find out about this mysterious thing together, you and me
You: “robots.txt is a robot hiding in a file?”
Wakish: “I will say it is just a plain text-file?”
You: “Hahaha.. trying to be smart, Wakish? Hah.. my 12-year-old brother knows this much!”
You: “Let’s get serious, I need to know about it!!”
Wakish: “Ok, relax and let me guide you in the world of robots..urrmm, sorry robots.txt ”
Let’s try to define what a robots.txt is. A robots.txt:
– is a simple, plain text file
– is placed in the root directory of a website
– is used to control which pages, images or other files may be crawled and indexed by a Web Robot (aka Search Engine Spiders or bots)
– can restrict one specific robot, or all robots, from accessing a website or part of it
– provides instructions for robots crawling a website
Now on the net, whenever two or more entities interact, there’s bound to be a protocol (a set of rules), and here this protocol is named The Robots Exclusion Standard or The Robots Exclusion Protocol.
The Robots Exclusion Standard
– is a set of rules or a convention which governs the behaviour of a Web Robot with respect to a Website’s directories
A raw fact: This protocol relies on the cooperation of Web Robots.
Q => What if these Bots do not cooperate?
A => Not all Bots abide by this protocol, and those that ignore it are called ‘Bad Bots’ or ‘Spam Bots’; so robots.txt is not a 100% guaranteed way to restrict access to your files and directories.
In this era of intense competition and the urge to rank high, most intelligent, business-minded companies try to conform as closely as possible to universally accepted standards and protocols. So the population of Bad Bots should shrink over time, which only boosts the importance of this protocol.
Mechanism of robots.txt & a Web Robot
STEP 1: Web Spider visits a website, e.g. https://wakish.info/
STEP 2: Spider checks for https://wakish.info/robots.txt
STEP 3: If robots.txt is found, analyse the instructions in the file & proceed to STEP 5
STEP 4: If not found, the failed request (typically a 404 “Not Found”) is recorded in the server’s log file & proceed to STEP 6
STEP 5: Crawl/index the website according to the instructions defined in robots.txt & proceed to STEP 7
STEP 6: Index the website in whatever manner I (the spider) want
STEP 7: Crawl that website as many times as ‘I want’ (say n times)
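The steps above can be sketched in a few lines of Python with the standard library’s urllib.robotparser. This is only a rough sketch, not how any real spider is written: the helper names robots_url and may_crawl and the bot name ExampleBot are made up for illustration, and the robots.txt body is passed in as text instead of being downloaded.

```python
from typing import Optional
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """STEP 2: robots.txt is looked up at the root of the site."""
    return urljoin(page_url, "/robots.txt")


def may_crawl(robots_body: Optional[str], user_agent: str, page_url: str) -> bool:
    """Decide whether a polite spider should fetch page_url.

    robots_body is the text of robots.txt, or None if the file was not found.
    """
    if robots_body is None:
        # STEP 4 / STEP 6: no robots.txt, so nothing restricts the spider.
        return True
    # STEP 3 / STEP 5: read the instructions and obey them.
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, page_url)


# Hypothetical robots.txt content, for illustration only.
sample = "User-agent: *\nDisallow: /private/\n"

print(robots_url("https://wakish.info/blog/post.html"))
print(may_crawl(None, "ExampleBot", "https://wakish.info/anything.html"))
print(may_crawl(sample, "ExampleBot", "https://wakish.info/private/a.html"))
```

Note that the decision is taken by the spider itself, which is why the protocol only works when the bot cooperates.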
Benefits of robots.txt
1) Minimize errors in log file – As you observed in STEP 4 above, if you do not provide a robots.txt file, an error is logged each time a bot asks for it. Now if Web Bots access your site 100 times a day (STEP 7), imagine the size of your error logs.
2) Save bandwidth – As you have seen in STEP 7, a bot can visit your site ‘n’ times a day, and each time crawl your images, HTML files, etc. All of this can eat up considerable bandwidth and add server load, especially if you are running a site with a lot of images and graphics. Using robots.txt lets you tell bots which parts to skip.
3) Keep part of your website out of search results – Spiders will crawl ALL your files if you don’t instruct them otherwise. At times you might want to keep a certain image file, for instance, away from public view. Remember, though, that robots.txt is itself publicly readable and only cooperative bots obey it, so it is no protection for truly private files.
4) Be more professional – If you are investing effort in doing your best, you should put every chance and opportunity on your side; do things the right way, hence use a robots.txt
5) Boost your site rank – Web Robots are greedy for information and pay respect (by increasing your PR value in their database) to those who have a lot of good content. So why not provide them with the right content in the right format? Try to strike a balance between so-called “machine readability” and “human readability”.
When writing your own robots.txt, pay particular attention to the following:
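To make benefit 2) a bit more concrete, here is a small sketch of a robots.txt that keeps bots out of an image folder and asks them to slow down. Take it with a pinch of salt: Crawl-delay is an extension that is not part of the original standard and not every crawler honours it, and the folder /images/ and the bot name ExampleBot are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: skip the heavy image folder and wait
# 10 seconds between requests (Crawl-delay is a non-standard extension).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("ExampleBot")
images_allowed = parser.can_fetch("ExampleBot", "/images/photo.png")
pages_allowed = parser.can_fetch("ExampleBot", "/index.html")

print("requested delay in seconds:", delay)
print("may fetch images:", images_allowed)
print("may fetch pages:", pages_allowed)
```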
1) Case sensitivity
2) Use the exact name of the existing web spider
3) Where the colons (:) are placed
4) When to use asterisk (*)
5) Bots read instructions in the robots.txt in a “Top to Bottom” fashion.
6) the text-file should be named “robots.txt” with the ‘s’ in it
(Note: A text file can be created using Notepad and saving it with the filename ‘robots’. If you are using Linux, you can do this with an editor like KWrite, Kate or any other text editor, saving it with the name “robots” and explicitly adding the extension ‘.txt’.)
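A few of the points in the checklist above (colons, field names, paths) can be checked mechanically. The following is a toy sketch, not a full validator: the function name lint_robots and the list of known fields are my own choices for illustration, and real syntax checkers do much more.

```python
from typing import List

# Field names this toy checker knows about (compared case-insensitively).
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots(text: str) -> List[str]:
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing ':' separator")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif (field.lower() in {"disallow", "allow"}
              and value and not value.startswith(("/", "*"))):
            problems.append(f"line {number}: path should start with '/'")
    return problems


broken = "User-agent: *\nDisallow /tmp/\nDisalow: /x/\nDisallow: tmp/\n"
for problem in lint_robots(broken):
    print(problem)
```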
line 1 User-agent: *
line 2 Disallow:
– ‘*’ means ‘All Robots that exist’
– Since ‘Disallow’ has no value assigned to it, ALL files on the website may be crawled and indexed
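The “allow everything” case just described (a ‘*’ user-agent with an empty Disallow) can be checked with Python’s standard urllib.robotparser. A minimal sketch, where the bot names and paths are invented purely for the test:

```python
from urllib.robotparser import RobotFileParser

# '*' addresses every robot; an empty Disallow blocks nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

# Try a couple of made-up bots against a couple of made-up paths.
results = [
    parser.can_fetch(bot, path)
    for bot in ("ExampleBot", "AnotherBot")
    for path in ("/", "/images/logo.png", "/private/notes.html")
]
print(all(results))
```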
line 1 User-agent: BadBot
line 2 Disallow: /
line 4 User-agent: *
line 5 Disallow: /thisFolder/
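Here is a quick way to see what the BadBot example above means for different robots, again with Python’s standard urllib.robotparser. It is only a sketch: ExampleBot is a made-up name standing in for “any other robot”, and a real Bad Bot is of course free to ignore the file entirely.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example, without the 'line N' labels.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /thisFolder/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# BadBot is shut out of the whole site.
badbot_home = parser.can_fetch("BadBot", "/index.html")
# Every other robot is only kept out of /thisFolder/.
other_home = parser.can_fetch("ExampleBot", "/index.html")
other_folder = parser.can_fetch("ExampleBot", "/thisFolder/page.html")
# Paths are case-sensitive: /thisfolder/ is a different path.
other_lowercase = parser.can_fetch("ExampleBot", "/thisfolder/page.html")

print(badbot_home, other_home, other_folder, other_lowercase)
```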
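– According to usage 5) above, a simple parser that reads from top to bottom and meets the instruction set from line 4 before the one at line 1 would still let BadBot access your site (everything except /thisFolder/). Crawlers that follow the standard pick the group that names them regardless of order, but listing specific bots first is the safe habit.
Any other alternative to robots.txt?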
Yes, the HTML META tags.
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW" />
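But this is not really the advised way to restrict web robots: the tag has to be added to every single page, and a robot must still fetch the page in order to read it. The preferred and more flexible way is undoubtedly robots.txt.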
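For the curious, this is roughly how a robot could pick up the ROBOTS META tag from a page, sketched with Python’s standard html.parser. The class name RobotsMetaParser and the sample page are made up for illustration; real crawlers handle many more cases.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


# A made-up page carrying the tag from the article.
page = (
    "<html><head>"
    '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW" />'
    "</head><body>Hello</body></html>"
)

meta = RobotsMetaParser()
meta.feed(page)
print(sorted(meta.directives))
```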
Some additional notes
1) There is software that can help you create your robots.txt easily and in a more user-friendly manner; such tools are known as ‘robots.txt generators’
2) And to check your syntax, there are ‘robots.txt syntax checkers’
3) robots.txt also plays a role in gauging ROI (Return On Investment)
ROI – refers to how much profit or revenue a marketing campaign generates; for instance, an investment of $50 in Google AdWords generating a profit of $500
Now, when a website has not been assigned a Google PR yet, its robots.txt can be examined to understand why; e.g. we might find that Google was inadvertently blocked from crawling that site.
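That last check, whether Google was blocked by accident, is easy to sketch with Python’s standard urllib.robotparser. The helper name is_blocked and the two sample files are made up for illustration; “Googlebot” is the name Google’s crawler announces itself with.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_body: str, user_agent: str = "Googlebot", path: str = "/") -> bool:
    """Return True if the given robots.txt text forbids user_agent from path."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return not parser.can_fetch(user_agent, path)


# A classic accident: a single slash shuts every robot out of the whole site.
accidental = "User-agent: *\nDisallow: /\n"
# What the owner probably meant: nothing is disallowed.
intended = "User-agent: *\nDisallow:\n"

print(is_blocked(accidental))
print(is_blocked(intended))
```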
I end here, hoping that my effort to produce this article will help you gain enough insight into the famous ‘robots.txt’ so that you can help yourself and your friends, and move on in your journey on the net with yet more tools in your database of internet knowledge ;)