Wednesday, June 20, 2012

What is Robots.txt?

The robots exclusion standard, more commonly known as Robots.txt, is a text file placed in the root directory of a website. The Robots.txt file is a convention created to direct the activity of search engine crawlers or web spiders. The file tells crawlers which parts of a website to crawl and which to leave alone, distinguishing between what should be visible to the public and what is meant only for the creators of the website. A Robots.txt file is frequently used by search engines to categorize and archive web pages, or by webmasters while proofreading source code.

A website's Robots.txt file works as a request to specific robots to ignore the directories or files it lists. Websites with sub-domains generally need a Robots.txt file for each sub-domain, so that information not meant for the public is not picked up in a keyword search. It also heightens the keyword density of the actual web page text and keeps visitors from coming across pages that are misleading or irrelevant to their keyword searches. Robots.txt protocols are simply advisory, though: there is no law requiring websites to have a Robots.txt file, or to honor one.
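As a minimal sketch, the robots.txt at the root of a site or sub-domain lists the directories crawlers should skip (the directory names here are hypothetical):

    User-agent: *
    Disallow: /private/
    Disallow: /drafts/

Each sub-domain (for example, blog.thedomainname.com) serves its own copy of this file from its own root.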
  
What are Search Engine Spiders?

These sneaky devils are automated programs that search your website for content marked as available for web robots to retrieve and appropriately rank for the searcher. These spiders, or web crawlers, essentially seek out any information not masked by the robots.txt file.

How Does a Robots.txt Blockage Come About?


Robots.txt files are most commonly used on staging servers. If you find yourself at the mercy of a robots.txt problem, it likely stems from when your staging server was rolled over to the live server. Web developers use a robots.txt file to keep search engines from indexing duplicate copies of your content during the building process, but if that blanket block is carried over when your site eventually goes live, your real pages disappear from search as well.

How To Check Your Site for Robots.txt 


You can manually check your website to rule out the possibility that it is suffering from the effects of an inappropriately placed robots.txt setting. No need to panic over the possibility of being blacklisted by Google; keep calm and follow these simple steps:
  • Enter your domain name followed by a forward slash and robots.txt in the address bar. For example: http://thedomainname.com/robots.txt
  • If the result is a 404 error page, then your site has no robots.txt file at all.
  • An additional route is to log into your Google Webmaster Tools account, which reports which of your URLs are restricted by a robots.txt file.
  • If your robots.txt file shows:
     User-agent: *
     Disallow: /
 
You’ll need to make changes right away: that configuration blocks compliant crawlers from your entire site, and you should never see it on a live website.
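You can confirm the effect of that "block everything" file with Python's standard-library robots.txt parser (the domain is the article's example):

```python
from urllib.robotparser import RobotFileParser

# The rules you should never see on a live site
rules = """User-agent: *
Disallow: /""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# With "Disallow: /", no compliant crawler may fetch any page
print(parser.can_fetch("Googlebot", "http://thedomainname.com/"))       # False
print(parser.can_fetch("Googlebot", "http://thedomainname.com/about"))  # False
```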


How To Prevent Parts of Your Site From Being Indexed


A robots.txt file can work in your favor just as easily as it can hurt your website. To hide certain sections of your site from spiders and web crawlers, you can add rules to your robots.txt file. For example, to keep ads or log files on your site from being crawled, the file should contain:

    User-agent: *
    Disallow: /ads
    Disallow: /logs
  • Unfortunately, robots.txt isn’t a cure-all for everything you would rather not have searched, and you may notice its blanket effect. The basic protocol doesn’t allow wildcards in Disallow lines, nor does it define an “Allow:” line. Google has extended the basic format to support both, but these extensions are not universally accepted, so it is recommended that they be used ONLY within a “User-agent:” section targeting Google’s crawlers.
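The blanket effect of those two rules can be verified with Python's standard-library parser (the domain is the article's example; the file paths are hypothetical):

```python
from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /ads
Disallow: /logs""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Anything whose path starts with /ads or /logs is blocked; the rest is not
print(parser.can_fetch("ExampleBot", "http://thedomainname.com/ads/banner.html"))  # False
print(parser.can_fetch("ExampleBot", "http://thedomainname.com/logs/2012.txt"))    # False
print(parser.can_fetch("ExampleBot", "http://thedomainname.com/index.html"))       # True
```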

 

Does the Robots.txt Prevent Users From Viewing Certain Content?


Absolutely not.  Adding a robots.txt file will only prevent web-screening spiders from retrieving content from those portions of your site.  All content remains available for the viewing pleasure of your visitors, who will be completely unaware of the robots.txt status of the page.  In all honesty, robots.txt only keeps “polite” spiders away from the information; in reality there are likely less well-mannered crawlers weaving through that data anyway.

If you really want to protect certain data, content, or sections of your website, your best bet is to password-protect those areas. Also remember that if you want content officially removed from the index, you must include a robots noindex meta tag on each and every page you want unequivocally removed from the index of your site.
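For reference, the noindex meta tag goes inside the head of each page you want removed from the index:

    <head>
      <meta name="robots" content="noindex">
    </head>

Note that crawlers must be able to reach the page to see this tag, so a page carrying noindex should not also be blocked in robots.txt.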

Understanding the simpler aspects of running and maintaining your website will likely save you money on both the front end and the day-to-day running of your business.  If you find that your website has disappeared from Google search, or is otherwise extremely hard to find, your first step should be to double-check your robots.txt file.  No need to spend extra money on a tech professional when you are well equipped to rule out the easy fixes and get back to the world of the living as far as the web is concerned!
