Many of you know I frequent the Zen Cart forum answering questions for shop owners. I recently began to keep a makeshift list of repeatedly asked and misunderstood questions. Aside from the Meta keyword debate (which I will NOT be debating here, as there is nothing to debate), I have seen many questions and deep misunderstandings about blocking pages from indexing, and about when and why to do so.
If you have a question or have a myth or misconception you’d like me to post, you can comment on this post or contact us and we will be happy to post the issue and its solution/explanation for everyone.
Blocking my page in robots.txt will prevent indexing.
This is just not true. Your robots.txt is a tool used to prevent crawling… not indexing. While blocking a page in robots.txt will prevent well-behaved spiders from crawling that page, it will not prevent indexing, remove a page from the index, or block the flow of PageRank to the page.
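To make the crawling-versus-indexing point concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt rules and the shop.example.com URLs are made up for illustration. Notice that the only question robots.txt can answer is "may this spider fetch this URL?"; nothing in it says anything about the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for an example shop; the paths are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /checkout
"""


def can_crawl(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if these robots.txt rules let the user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved spider skips /login but may fetch /products.
print(can_crawl(ROBOTS_TXT, "https://shop.example.com/login"))
print(can_crawl(ROBOTS_TXT, "https://shop.example.com/products"))
```

A blocked URL can still end up in the index if other pages link to it, because the spider never gets to read the page and see any instructions on it.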
Pages blocked in robots.txt will continue to accumulate “link juice” or PageRank, but cannot pass it on to the rest of your site’s pages, since spiders never get to read the links on them. This is a fine time to nofollow the links to that page and prevent wasted link juice. Orrrrrr you can simply read the rest of this post and find out how to stop pages from being indexed, but allow the link juice to flow freely through them.
I think the biggest part of the confusion here for shop owners is how Google used to refer to these pages in Webmaster Tools. Google used to report these pages under “Crawl Errors” and shop owners really thought they had errors. Now Google correctly calls these pages “Blocked by robots.txt”.
In your Webmaster Tools you WILL see an error if you block pages in robots.txt and then include them in your sitemap. This is a pretty clear illustration of conflicting instructions: your sitemap says, “Here, Google, crawl these pages please,” and then the pages in your invitation to crawl are blocked from crawling.
As far as removing pages from the index… You can, after blocking with robots.txt, ask Google to remove the page from their index. However, this can be quite a big and dangerous task. So we really need to know how to get the ancillary pages we don’t want indexed removed naturally from the search indexes, never to be indexed again.
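If you want to catch this conflict before Google reports it, a rough check might look like the sketch below. It assumes a standard sitemaps.org-style sitemap; the robots.txt rules, sitemap contents, and URLs are hypothetical examples, not taken from any real shop.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical robots.txt and sitemap for an example shop.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
"""

SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://shop.example.com/products</loc></url>
  <url><loc>https://shop.example.com/login</loc></url>
</urlset>
"""


def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="*"):
    """List sitemap URLs that robots.txt forbids the user agent to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Every URL printed here is both invited (sitemap) and blocked (robots.txt).
for url in blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML):
    print("Conflict:", url)
```

Any URL this turns up should either come out of the sitemap or be unblocked in robots.txt, depending on what you actually want for that page.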
The simplest and most effective way to prevent pages from being indexed is with a noindex, follow Meta tag. (Notice I did NOT say noindex, nofollow tag.)
A noindex, follow tag is a Meta tag residing in the <head> of a web document which tells well-behaved bots not to index the page but to keep following its links. Inserting this tag (shown below) will naturally remove the page if it is already indexed, and effectively prevent it from future indexing as well.
<meta name="robots" content="noindex, follow" />
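To double-check which instructions a page is really sending, you could read its robots Meta tag with a few lines of Python. This is only a sketch built on the standard html.parser module; the RobotsMetaParser class, the robots_directives helper, and the sample page are names and data I made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also reach this handler.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page carrying a robots Meta tag in its <head>.
PAGE = '<html><head><meta name="robots" content="noindex, follow" /></head><body></body></html>'

found = robots_directives(PAGE)
print("noindex" in found, "nofollow" in found)
```

Keep in mind that a spider can only see this tag if it is allowed to crawl the page, so a page carrying it must not also be blocked in robots.txt.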
The noindex, follow tag has a bigger brother in the noindex, nofollow tag. Truth of the matter is that we often misuse this bigger brother… People will use a noindex, nofollow tag on their login form, for example, to prevent indexing. However, the nofollow part is unnecessary and should likely be left off.
You see, our login form has no valuable content for the search engine’s index, so we ask them not to index it (noindex). But many, if not all, of the links on the login page are links we would really want crawled (followed)… and we are not allowing Google, for example, to follow them when we use this tag: