So have you seen this ugly message in your Google site: command results?
In order to show you the most relevant results, we have omitted some entries very similar to the ### already displayed.
If you like, you can repeat the search with the omitted results included.
What exactly does this daunting phrase mean? I think the key words here are “relevant” and “similar”. Google expressly wants to display highly relevant, unique pages in its search results. This blog post from Google is a MUST read to help you understand duplicate content.
This is a two-part problem… The first part is fairly simple: if a page has no value to a search user, block it from indexing. Pages in your store such as shipping, conditions, and contact pages present no end value to a searcher and thus should not be allowed into the index. As with anything else there are a few ways to accomplish this; given the dynamic nature of most e-commerce shops, blocking them in your robots.txt is both easy and highly effective.
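As an illustrative sketch, a robots.txt along these lines keeps crawlers away from such pages (the paths here are hypothetical examples; substitute your store's actual URLs):

```
User-agent: *
Disallow: /shipping.php
Disallow: /conditions.php
Disallow: /contact_us.php
```

Place the file at the root of your domain (site.com/robots.txt); the search engines check it before crawling.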
Part 2… Duplication is far more complicated, as the duplicate content can come from a variety of sources. The first step is to understand what duplicate content is and why Google finds it less than useful for the main index.
Duplicate Content: 2 pages within the search engine’s index which are substantially similar in their content.
You see, there are in fact many different ways to end up with duplicate content… Here are some common ones we see in the shops.
- Using manufacturers' descriptions, which are likely already indexed on other sites.
- Allowing your session IDs to be crawled and thus indexed.
- Creating a separate product page for each style/color/etc. of a product while reusing mostly the same text.
- Canonical duplication and the lack of a proper default index page.
- Not creating enough unique product/category copy within the template to set pages apart from each other.
The problem with duplicate content is that it puts Google in the driver's seat instead of you. When Google determines a page is too duplicate, it will decide from the available versions which one is most appropriate. There is much theory and discussion as to how much duplication causes a page to be dumped from the main search results. I honestly think this number is around 25-30% duplicate.
The duplication measure applies to the WHOLE page, not just the part you enter as a product description, for example. Let's say you are creating a new product and you have little or no textual content specific to that product… This page is VERY likely to be duplicate, because obviously your template is pretty much the same on a per-page basis. You need enough page-specific, relevant text to set it apart, not only to prevent duplication but to help it rank better as well.
Using straight supplier or manufacturer descriptions is duplicate in the same manner… Remember, Google is not just comparing your pages against your own pages for quality and uniqueness. The best way to avoid this is to take a minute and write some good copy. In lieu of that, you can shingle the text to avoid being duplicate. A shingle is one way search engines measure textual duplication: generally speaking, you need to add or change a word every 10 words (one shingle). The consensus is that most search engines shingle in 10 words… However, we advise clients to use 5-word shingles, as it produces better content.
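To make the shingle idea concrete, here is a minimal sketch of w-shingling with a Jaccard overlap score. The function names and the 10-word window are our illustrative assumptions, not how any particular engine actually implements it:

```python
# Rough sketch of w-shingling: split text into overlapping w-word windows
# and compare two texts by the overlap of their shingle sets.

def shingles(text, w=10):
    """Return the set of w-word shingles in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def similarity(a, b, w=10):
    """Jaccard similarity of two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Notice that changing even one word inside a window breaks every shingle that window participates in, which is why a small edit every 5-10 words drops the measured overlap so quickly.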
Session IDs are a bit tricky. A few things need to happen here: Google and the other search engines ideally should not be allowed to crawl the sessions, and you need to provide the search engines with the proper version of each URL in your sitemap. If Google grabs a URL with the session ID attached, it is obviously 100% duplicate of its canonical version. You see, search engines do not realize the two URLs are for the same page… Anytime the same content is served at different URLs, it's duplicate. Check out the post on duplicate content caused by sessions.
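One way to feed the search engines clean URLs is to strip the session parameter before a URL ever reaches your sitemap. This is a sketch; the parameter names below (osCsid, PHPSESSID, zenid) are common examples from popular carts, not an exhaustive list:

```python
# Sketch: normalize URLs by dropping session-ID query parameters so only
# the canonical version lands in the sitemap. Adjust SESSION_PARAMS for
# whatever your cart software actually appends.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"oscsid", "phpsessid", "zenid", "sid"}

def canonical_url(url):
    """Return the URL with known session-ID parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

print(canonical_url("http://example.com/product.php?id=42&osCsid=abc123"))
# -> http://example.com/product.php?id=42
```

The same normalization belongs anywhere you emit links, so your internal links and your sitemap always agree on one URL per page.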
Duplication due to a canonical URL issue is probably the one we see most often. Of the last 50 reports we have done, 38 had no canonical redirect in place, or the redirect was a 302.
The proper way to repair canonical duplication is to redirect the www to the non-www, or vice versa, with a 301 permanent redirect. Before you begin this task, some research is warranted: which version of your URLs has the most indexed pages and backlinks, for example? Sure, the backlinks will eventually be transferred to the new canonical URL… But losing a bunch of indexed pages and backlinks for 90 days or so is just not worth the vanity of having the www in your URL. Besides, if it's done properly, the www version will redirect seamlessly to the non-www. Once you do this, remember to always link to the proper canonical URL when you create links.
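Assuming your store runs on Apache with mod_rewrite enabled (the typical setup for PHP carts), the www-to-non-www 301 can be sketched in .htaccess like this, with example.com standing in for your domain:

```
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

To force the www version instead, swap the host in the condition and the target: match the bare domain and redirect to the www URL.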
Default indexing is just as common as the canonical www problem; as a matter of fact, we repair them at the same time. If, for example, your domain URL (www.site.com) and (www.site.com/index.php) both work and are not redirected… they are 100% duplicate! If you also have an index.html on top of the www canonical issue, then your main page has six 100% duplicate copies. You can see this would be a problem =-). Default indexing is repaired (nearly 100% of the time) by redirecting index.php and index.html, for example, to the domain URL (www.site.com). Be sure to repair any template links to index.xxx as well.
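Again assuming Apache and mod_rewrite, a sketch of the index redirect in .htaccess might look like the following. The condition on THE_REQUEST matters: it fires only when the visitor explicitly requested /index.php or /index.html, so the server's own internal DirectoryIndex lookup does not loop:

```
RewriteEngine On
RewriteCond %{THE_REQUEST} \ /index\.(php|html)
RewriteRule ^index\.(php|html)$ http://example.com/ [R=301,L]
```

As above, example.com is a placeholder for your own canonical domain.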
Once in a very blue moon we come across a site whose index.php was chosen by Google as the authoritative page and has more links and PageRank… This is the one occasion where you MIGHT consider forcing the domain URL to the index address with a 301 instead.
We strongly suggest you verify your site with Google Webmaster Tools, as it provides some in-depth information regarding your site and its indexing that you really need to be successful.