In the world of search, accessibility is a big deal. It can affect whether users get to your website or not. Most of the work goes into optimizing your site so that Google can access your pages; far less often does anyone spend time making sure that Google doesn’t access certain pages.
Why shouldn’t Google access certain pages? Because Googlebot is very busy crawling the entire web, it can only spend so much time on your site. So you want to make sure your site is set up in a way that allows Googlebot to get to your most valuable pages quickly. One of the best ways to do this is to tell Googlebot where not to go.
This is where robots.txt comes in handy. Google refers to the robots.txt file as “a no trespassing sign” because it tells Googlebot what content not to crawl. Here is a short list of pages you should consider blocking.*
1. Internal Search Results
Having a search function on your site is great, but it can lead to a lot of duplicate content issues. This is especially dangerous if you have faceted search on your website. Faceted search creates page after page of nearly identical content, each with a unique URL. The best way to solve this is to block any search results by using the wildcard “*” in conjunction with your URL structure. For example, if I put the following statement in my robots.txt file:
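(This is a sketch; it assumes search results live at URLs like /search?q=widgets, so adjust the path to match your own URL structure.)

User-agent: *
# Block anything after /search? (results pages), but not /search itself
Disallow: /search?*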
Googlebot will not crawl any results pages, only the basic search page. That way the search page itself can be indexed, but not any of the results that may come up.
2. Membership-Only Content
If your site has content that is exclusive to members, it is best to keep Googlebot away from those pages. Web spiders cannot make it past password walls and login portals, so while linking to specific members-only content from every page of your website may help drive new sign-ups, web spiders will be constantly stuck at the login. The best way to handle that content is to put it in a subdirectory and disallow spiders from accessing the subdirectory with a statement similar to this one:
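(Assuming, for the sake of the example, that the members-only content lives in a /members/ subdirectory; swap in your own directory name.)

User-agent: *
# Keep crawlers out of the members-only area
Disallow: /members/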
3. Admin Content
This is more for practicality and security. Nowadays hackers and other malware enthusiasts use Google to find websites that are ripe for the picking. Simply running the query “inurl:wp-login” returned over 1 million results. Disallowing Google from your admin content will help keep your site safe and sound.
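For example, on a WordPress site the admin area lives at /wp-admin/ and the login page at /wp-login.php, so a sketch like this would keep compliant crawlers away from both (adjust the paths if your CMS uses a different admin area):

User-agent: *
# Keep the WordPress admin and login pages out of the crawl
Disallow: /wp-admin/
Disallow: /wp-login.php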
4. Near-Duplicate Pages on Your Site
Running a large website can be challenging. From large service websites with local branches to e-commerce sites that sell red widgets, blue widgets, and green widgets, it’s hard to come up with brand-new copy for each page, especially when the only difference is a color, size, or location. For each group of pages with incredibly similar content, pick the best page and disallow the rest. You can do this by putting the exact URLs in the robots.txt file:
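(The URLs below are hypothetical; the idea is that the blue widget page has the best copy, so the nearly identical color variants are blocked.)

User-agent: *
# Keep only the best widget page crawlable; block the near-duplicates
Disallow: /widgets/red-widget
Disallow: /widgets/green-widget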
5. Near-Duplicate Pages to Other Sites
This is a big problem with e-commerce sites specifically, or any kind of franchise site. Whoever you are working with gave you promotional materials or product descriptions, and you used those instead of writing your own. If this is the case, take the time to rewrite it all in your own words, starting with your most important pages. In the meantime, disallow Google from the pages you haven’t rewritten yet. If the majority of your pages read just like everyone else’s, why should Google rank any of your pages higher?
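A sketch of that stop-gap, assuming the borrowed product descriptions live under a /products/ directory (the paths are placeholders); remove each line as you replace its canned copy with your own:

User-agent: *
# Pages still using the manufacturer's stock descriptions
Disallow: /products/widget-a
Disallow: /products/widget-b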
6. Privacy Policies and Terms of Service
These are a great trust builder, but not awe-inspiring content. While most sites that accept information should have them, that does not mean Google should waste time reading about how you don’t spam your customers. Eventually these may become a ranking factor; currently, however, there is no evidence to suggest Google uses them to distinguish websites.
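If you do decide to block them, a minimal sketch might look like this, assuming those pages live at /privacy-policy/ and /terms-of-service/ (placeholder paths):

User-agent: *
# Boilerplate trust pages that don't need to be crawled
Disallow: /privacy-policy/
Disallow: /terms-of-service/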
7. PPC Landing Pages
Any business running a PPC campaign should have custom landing pages. The point of custom landing pages is to zero in on exact customer intent and to find the best triggers for a specific action. The problem is that you end up with a bunch of pages on your website that have nearly identical content. Once again, we want to keep Google away from these. Have no fear, though: if you put a specific URL in the robots.txt file, the organic search bot will not crawl the page, but Google’s AdsBot (if you are using AdWords) ignores rules aimed at the generic user-agent, so it can still reach your landing pages.
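A sketch, assuming the PPC pages are grouped under a hypothetical /landing/ subdirectory; because AdsBot-Google only obeys rules that name it explicitly, this only affects organic crawling:

User-agent: *
# Keep organic crawlers away from the near-identical PPC landing pages
Disallow: /landing/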
8. Tag Cloud Pages
A tag cloud is a great way to see what a blog is about. It is also a great way to create a lot of duplicate content very quickly. Tag pages can be a pain, because it is very rare that a blog post will only have one tag, so for tag A you have five blog summaries, and for tag D you have four of the same five blog summaries. It’s best to just disallow the whole tag subdirectory, because unless your site is incredibly authoritative, it is unlikely you will rank for any tags.
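A sketch, assuming your blog’s tag pages live under a /tag/ subdirectory (the default on many blogging platforms; adjust the path if yours differs):

User-agent: *
# Block every tag page with one rule
Disallow: /tag/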
9. Pages That Aren’t Providing Any Value
This is similar to disallowing nearly identical pages, but slightly different. If you have old blog content, or whole subdirectories that aren’t receiving any impressions in the search engines, and haven’t for a while, it may be wise to disallow Googlebot from crawling them. Not because the pages have identical content, but because it comes down to the economy of time. If Googlebot has to crawl every page on your site, it may move on to the next site before getting all the way through, which means it crawled the unpopular sections of your site but may have missed the new content you added in another subdirectory.
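A sketch, assuming the stale, zero-impression content is grouped under a hypothetical /blog/archive/ subdirectory:

User-agent: *
# Skip the old archive so crawl time goes to newer pages
Disallow: /blog/archive/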
* Disclaimer: Be very careful when you are disallowing parts of your site. The goal is to optimize a web spider’s crawl path to the most interesting and valuable content you have, not to leave only one page accessible to the web spider. Always double-check your work!