Robots.txt Guide


A quick guide to the basics and origins of the Robots Exclusion Protocol

Along your SEO journey you may have come across the acronym REP or the term robots.txt. These are two ways of describing the Robots Exclusion Protocol (REP), also known as the Robots Exclusion Standard. This blog post will walk you through the basics of robots.txt and show how it could help your small business.

[Word cloud image created using Jason Davies' Word Cloud Generator]
What the heck is it?

1994 was a big year for SEO. Not only was the first blog created, but the original REP was also formulated. The Robots Exclusion Protocol is a standard, implemented as a simple text file, that communicates with website crawlers. Think of it like a compass that points crawlers in the right direction when it comes to which parts of a website they should scan and index. Search engines are greedy. They want to scan and index as much information as they possibly can, which means they will assume everything on your blog or website is available to scan unless you tell them otherwise. That's where the Robots Exclusion Standard comes in. While this standard can be very helpful to anyone, including small business owners, it must also be used with great care.

Why would I want to use robots.txt?

The Robots Exclusion Protocol essentially allows you to control the crawler traffic on your website. This comes in handy if you don't want Google crawling two very similar pages on your site and wasting what Google terms your "crawl budget". Basically, a crawl budget is the number of pages a search engine will crawl on your site each time it visits, so every crawl spent on a duplicate or unimportant page is one not spent on the content you actually want indexed. As you can imagine, it's imperative to understand this concept if you want to develop and maintain a successful SEO strategy.
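For instance, a hypothetical robots.txt like the one below tells every crawler to skip a duplicate, printer-friendly copy of your pages so that your crawl budget is spent on the originals (the /print/ folder is purely illustrative):

User-agent: *
Disallow: /print/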

How do I use it?

To use the Robots Exclusion Protocol, you'll create a robots.txt file and place it in the top-level directory of your web server. If you have Windows, robotstxt.org recommends using the program notepad.exe to create the file; if you have a Mac, use TextEdit. When a robot looks for the robots.txt file, it strips down the URL in a particular way. For example, if the URL for this blog were http://www.searchable.com/shop/index.html, a crawler would remove the "/shop/index.html" and replace it with "/robots.txt", so the URL looks like this: http://www.searchable.com/robots.txt. What this means for you is that you'll want to put the robots.txt file where you would put your website's main index.html.

As for what exactly to put in the file, here's a robots.txt cheat sheet from Moz covering some of the more common REP language. I've also added some additional helpful text:

Block all web crawlers from ALL content:

User-agent: *
Disallow: /

Block a specific crawler from a specific folder:

User-agent: Googlebot
Disallow: /no-google/

Block a specific crawler from a specific webpage:

User-agent: Googlebot
Disallow: /no-google/blocked-page.html

Sitemap parameter:

User-agent: *
Disallow:
Sitemap: http://www.example.com/none-standard-location/sitemap.xml

Allow all web crawlers to access all files:

User-agent: *
Disallow:

(An empty Disallow line means nothing is off-limits.)
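Before you rely on your new rules, it's worth testing them. As a minimal sketch, Python's standard library ships with a robots.txt parser; the snippet below (reusing the example.com addresses from the cheat sheet above) asks whether a given crawler may fetch a given URL. Note how the parser is pointed at the root-level robots.txt, just as described earlier:

import urllib.robotparser

# Point the parser at the robots.txt sitting in the site's top-level directory
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # fetch and parse the file

# Ask whether a particular crawler may fetch a particular page
print(rp.can_fetch("Googlebot", "http://www.example.com/no-google/blocked-page.html"))
print(rp.can_fetch("*", "http://www.example.com/shop/index.html"))

If your rules block Googlebot from the /no-google/ folder, the first call prints False while the second prints True.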

The Downside

It's important to keep in mind that denying robots the ability to crawl a webpage denies that page its link value. In other words, using the Robots Exclusion Standard could potentially weaken your current search engine optimization. Say someone links to a page on your website that you've hidden from Google using REP. Google can still index that page through the third-party link, effectively sidestepping your robots.txt file. According to Yoast, if you have a section on your website that you do not want to show in Google's search results, but that still attracts a lot of links, you should not use the REP. Yoast suggests using a "noindex, follow" robots meta tag instead. This way, search engines like Google can still properly distribute the link value for that page across your website. Another way to avoid this situation is to password protect a file rather than excluding it from search results with robots.txt.
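For reference, the meta tag Yoast recommends goes in the <head> section of the page you want kept out of the search results; a minimal example looks like this:

<meta name="robots" content="noindex, follow">

The noindex value asks search engines not to list the page, while follow lets them keep crawling its links and passing along their value.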

As you can see, robots.txt can be very useful to anyone in the process of optimizing their website. On the other hand, it can be tricky to use and could hurt your SEO if used improperly. So long as you follow the guidelines outlined above, I'm confident that the Robots Exclusion Protocol will greatly improve your SEO. As always, Searchable is here to help. Leave any questions, comments or suggestions regarding robots.txt in the comments below.

 
