Archive

Posts Tagged ‘duplicate content’

Google Patent – Duplicate Document Detection

December 16th, 2009

Duplicate document detection in a web crawler system

Is your content unique?

On 12/1/2009, Google was granted a patent by the US Patent Office detailing how duplicate documents are detected in a web crawler system. The patent describes how Google detects duplicate documents and then filters them, or determines which version is the “more important” one, for the purpose of providing unique search results.

We have had good reason to accept most of this information as true for some time, but seeing it detailed in this manner can genuinely help webmasters understand not only what duplicate content is, but also how to create unique pages and content for searchers to find.

Duplicate documents are documents that have substantially identical content, and in some embodiments wholly identical content, but different document addresses.

The patent details 3 possible situations where duplicate content is detected and thus processed as such.

  1. Two documents, including any combination of web pages (documents) and temporary (302) redirect pages, are duplicate if they share the same page content, but have different URLs.
  2. Two temporarily redirected pages are duplicate if they share the same target or destination URL, but have different source URLs.
  3. A regular web page/document and a temporary redirect page are duplicate if the URL of the web page is the target/destination URL of the temporary redirect page or the content of the regular web page is the same as that of the temporarily redirected page.

Some of these situations involve temporarily redirected pages, such as a Meta refresh with a delay over 0 seconds, JavaScript redirects and 302 redirects. Permanently redirected pages are not associated in this manner for duplicate document detection, because the web crawlers do not download the content of the redirecting page, only that of the target or destination page.
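
The three situations above can be sketched as a small pairwise check. This is only an illustration of the rules as summarized in this post, not the patent's actual implementation; the `Page` class, its fields and the `are_duplicates` function are hypothetical names made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    """Hypothetical record for a crawled URL (not from the patent)."""
    url: str
    content: Optional[str] = None          # body the crawler ended up with, if any
    redirect_target: Optional[str] = None  # set only for temporary redirects


def are_duplicates(a: Page, b: Page) -> bool:
    """Apply the three situations listed above to a pair of pages."""
    if a.url == b.url:
        return False  # same address, so not a duplicate pair
    # Situation 1: same content, different URLs.
    if a.content is not None and a.content == b.content:
        return True
    # Situation 2: two temporary redirects that share a target URL.
    if a.redirect_target is not None and a.redirect_target == b.redirect_target:
        return True
    # Situation 3: one page is the target of the other's temporary redirect.
    return a.redirect_target == b.url or b.redirect_target == a.url
```

For example, a page at `http://example.com/` and a temporary redirect at `http://example.com/a` that points to it would be treated as duplicates under situation 3.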

How does Google detect duplicate content?

Google simply compares the pages/documents to each other to determine duplication. They can use a whole page or a thumbprint of it, so to speak, as well as phrases and essentially any content the crawler can read. What was not clearly disclosed is the amount or percentage of duplication that would actually cause Google to filter a page from the index.

Upon determining that two documents are duplicates, Google then attempts to determine which document is most important. This decision is about importance, not PageRank or ownership. Once this decision has been made, Google proceeds to filter the duplicate page(s) from search results in order to provide searchers with unique and qualified search results.
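
As a rough sketch of the compare-then-filter idea, the snippet below groups pages by a content "thumbprint" and keeps one URL per group. It makes several assumptions: the thumbprint is a plain SHA-256 hash of normalized text (so it only catches exact copies, not near-duplicates), and the `importance` scores are placeholder numbers, since how importance is actually measured is not covered here. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash case- and whitespace-normalized text so trivial variations match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_duplicates(pages: dict, importance: dict) -> list:
    """Keep one URL per fingerprint: the one with the highest importance score."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    # Within each group of identical pages, keep only the most important URL.
    return sorted(
        max(urls, key=lambda u: importance.get(u, 0.0))
        for urls in groups.values()
    )
```

Given two URLs with the same text and one with different text, the function returns two URLs: the higher-scored copy and the unique page.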

There are many ways that shop owners create duplicate content. Many are easily managed and some are not. Doing everything in your power to prevent it is the best course of action, as there will inevitably be duplication you cannot prevent.

Some of the most common ways that shop owners duplicate their content:

  • Copying product descriptions from distributors or other indexed web pages
  • Lack of unique page specific content enough to make a page unique within a common site template
  • Posting articles or text from other websites
  • Not properly redirecting the non-canonical URL to the canonical one (www or non-www)
  • Failing to exclude pages like the shopping cart page from crawling and indexing (robots.txt or noindex)
  • Not using Google’s parameter handling to exclude or ignore sorted page URLs
  • Rewritten (SEO) URLs whose original URLs are not permanently (301) redirected to the rewritten version

If you think about it, Google is not trying to penalize anyone, but rather attempting to help searchers find fresh, unique and qualified results. Last time I checked this is the goal of a search engine =-)

admin E-Commerce SEO

Google Attacks Duplicate Content

October 16th, 2009

This seems to have gone fairly unnoticed, but it’s really a very big deal for ecommerce websites as a whole, especially Zen Cart. Duplication from sorted and alternate view URLs continues to be a problem for shop owners. The new rel=”canonical” attribute is excellent, but in our dynamic shopping carts, implementation is a nightmare for most. So, Google to the rescue (at least for their results): in your Google Webmaster Tools you can now adjust your parameter handling for Google.

Parameter Handling:

Parameter handling allows you to view which parameters Google believes should be ignored or not ignored at crawl time, and to overwrite our suggestions if necessary.

This is a very informative video about how Google handles duplicate content and fully debunks the “Duplicate Content Penalty” myth as well.

To make use of the new parameter handling tool in your Google Webmaster Tools, simply expand the “Site Configuration” tab in your left menu and select “Settings”. On the settings page you will find parameter handling, and it might say something like this:

Googlebot has crawled your site and made a few parameter suggestions.
Adjust parameter settings »

Upon clicking the “Adjust parameter settings” link, additional information and settings will drop down below it. In these settings you will see the parameters Google has already discovered, its suggestions, and the opportunity to add parameters of your own.

Most interesting that Google has found proper pages from our Zen Cart store already.

Parameter Handling for a Zen Cart

You can clearly see that Google has already identified some of the most common duplication issues in our Zen Cart and has suggested another, which is also wholly duplicate. Below, you can see that we have taken a few of Google’s suggestions and added them to our ignored parameters.

Zen Cart ignored parameters

While this only helps your Google handling, it certainly gives you some quality insight into your store and its duplicate content, and helps you identify areas where you need to solve the same duplication issues for other crawlers. All in all, nice tool from G.
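
To see why ignoring a parameter removes duplication, here is a minimal sketch that strips ignored parameters from a URL so sorted and session variants collapse to one address. The ignore list (`sort`, `zenid`) and the example URLs are assumptions for illustration only; use the parameters your own Webmaster Tools report shows. The same idea can help when handling these URLs for other crawlers.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical ignore list; replace with the parameters from your own report.
IGNORED_PARAMS = {"sort", "zenid"}


def strip_ignored(url, ignored=IGNORED_PARAMS):
    """Return the URL with ignored query parameters removed, order preserved."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

With this list, `http://shop.example/index.php?main_page=index&cPath=3&sort=20a` and the same URL without `sort` both reduce to a single address.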

Melanie E-Commerce SEO

Does My Online Store have Duplicate Content?

November 2nd, 2008

Google is all over duplicate content, and your site will suffer for it. But what exactly is duplicate content? The biggest problem is that Google doesn’t always tell you there is a problem. You can set up Google Alerts on your domains and addresses, but that’s no help for code and template duplication. So I will try to cover the most common ways to get pages excluded from the main search results for duplicate content.

  1. Your site answers on both the plain http://website and the http://www.website versions of its URLs. Use a redirect or mod_rewrite to fix this. This canonical redirect will help Google understand that the www & non-www are NOT 2 separate pages. You see, URLs are like phone numbers, unique for every user… So Google thinks these 2 versions of your domain’s URLs are in fact different numbers, so to speak.
  2. Your site has no standardized handling method for the entry page, or default indexing. This means the index/default/main page can be viewed live & independently of the www.mysite.com. You fix this with a 301 or permanent redirect. For example /index.html is exactly the same page as your domain url.
  3. If you use a template, this dramatically increases the opportunity for content duplication by using dynamic site elements across many or all pages. To combat this, when page specific text is added, it must be ENOUGH text to make that page stand out and be unique from the other templated pages.
  4. Change your Meta tags and page titles for every page. This one is 100% easy and highly important. Use the <head> elements to properly describe the page and you will be fine.
  5. If you are going to share files across domains, link them ….yes even your own stuff can be duplicate content.
  6. In your Google Webmaster Tools it is advisable that you choose a preferred domain under the diagnostic tab, and by all means, while you are there, check out your content analysis for duplicate information Google has found in your site.
  7. The #1 duplication issue I see among the stores we analyze is product description duplication. You MUST write great unique product descriptions for your product to be successful. Using the manufacturer’s description creates content that is duplicate in whole or in part with every other distributor… Many times the supplier as well. In that case your site is NOT the authority for this content, and Google will not display your products for search.
  8. Your robots.txt WILL NOT reliably keep your pages out of the index… It stops crawling, but a blocked URL can still be indexed if other pages link to it. You must use a noindex, nofollow robots meta tag or other means to specifically DISALLOW indexing of these pages.
  9. In general when adding textual content (product descriptions), paraphrase and add rich text ….that’s really what Google wants.  Unique pages perform well… Duplicate pages never show up.
  10. Be extremely careful with content generators, most times the content is duplicate. Try paying a college student to write the copy for you… Good investment!
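
Items 1 and 2 above come down to choosing one canonical address per page and sending a 301 to it from every other variant. The sketch below shows that decision in Python under stated assumptions: `www.mysite.com` is taken as the preferred host, `index.html` is the only entry-page file handled, and the function names are hypothetical. In practice the redirect itself is usually configured in the web server (for example with mod_rewrite), not in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this example; adjust to your own site.
CANONICAL_HOST = "www.mysite.com"
BARE_HOST = "mysite.com"
INDEX_FILES = ("index.html",)


def canonical_url(url):
    """Map www/non-www and /index.html variants onto one canonical URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST
    path = parts.path or "/"
    head, _, tail = path.rpartition("/")
    if tail in INDEX_FILES:
        path = head + "/"  # /index.html is the same page as /
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


def redirect_for(url):
    """Return (301, target) if the URL needs a permanent redirect, else None."""
    target = canonical_url(url)
    return (301, target) if target != url else None
```

Here `http://mysite.com/index.html` would be redirected to `http://www.mysite.com/`, while `http://www.mysite.com/` itself needs no redirect.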

We theorize that 70% of your total page must be unique. Here are a few tools to help check it:

How do I know I have duplicate content in Google? Well, even though we are signed up for their tools and participate in all their little programs, they don’t bother to go out of their way to inform us, even though most times website duplication is unintentional and sometimes the webmaster is actually the victim (scraping). So here are some helpful tools to monitor your store’s pages in Google.

  • Searching Google for site:yourdomain.com will give you all of the indexed pages, including subdomains. If you track this number over time, a drop may mean you have had pages pulled from the main index. This is vague at best.
  • This is the Google Cache Tool, and you are going to love this. It tells you how many pages in your domain are currently cached in Google.com, when the cached file was recorded and the change in the cache, like +5 or -1 pages etc. A little more help is needed, as we still can’t identify which pages have issues, but at least we have a time frame to narrow it down if you log changes to the site.

Lastly, what do you do if your store’s pages get removed? Think unique and get to work. Once you have stellar, content-rich and unique pages, create links to them to help Google find its love for them once again.

I leave you with this thought for the day, the process is strict and possibly ridiculous right now…..but as with anything else, change inevitably brings about chaos.

Melanie E-Commerce SEO

Shopping Cart Duplication – #1 Cause

September 13th, 2008

At PRO-Webs we complete a great deal of site reports, and this gives us a unique advantage for research, testing and identifying common mistakes. Today, we are going to let shop owners in on the biggest cause of content duplication we have seen to date… But first let’s get into some background and information about the less than stellar results of content duplication.

What is duplicate content?
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Most of the time when we see this, it’s unintentional or at least not malicious in origin: forums that generate both regular and stripped-down mobile-targeted pages, store items shown (and — worse yet — linked) via multiple distinct URLs, and so on. In some cases, content is duplicated across domains in an attempt to manipulate search engine rankings or garner more traffic via popular or long-tail queries.

So duplicate content, very simply put, is any content which is too similar to another page’s (any page’s) content indexed in Google. Duplicate content can cause a host of headaches and is certainly not a concrete foundation for a top performing online store. Why should shop owners care about duplicate content? Simple: Google and other search engines care a great deal.

This is really one of those times when you as a business owner need to walk a mile in Google’s shoes. Google aims to deliver the best UNIQUE content for every search… This is what Google, the business, does for their customers. As I am sure you as a shop owner want the best for your customers, you can clearly understand why Google would not want to deliver the same content to their users in 5 different results. So if there is duplicate content… What does Google do about it?

In theory, when Google is presented with 2 pages which are not unique enough to rank independently, they make a decision as to which page to rank for the terms. There is certainly a very complicated algorithmic computation Google does to determine which page to rank… But that really matters little. Generally speaking, the site which has the greatest authority for the content will rank and the other is shunned from the main searchable index. There are likely a ton of metrics used to determine authority for this purpose, but the largest factors are likely to be PageRank, relevant links, traffic and the age of the page. Seriously though… The simple truth is if you stole it and published it, you are not going to be able to rank it properly.

So back to the purpose of this post…

What is the #1 cause of content duplication in shopping carts?

Many shop owners are distributing or drop shipping the product base from their online store. The drop shipper or main distribution source will nearly always provide product information and descriptions for the products they wish you to sell. In our haste to launch our new store or new product, many shop owners will use all or some large blocks of the distributor’s or manufacturer’s product information… UHOH

There are really only 2 factors of concern with regard to this type of content duplication.

  1. Overall duplication percentage of the page
  2. Large blocks or shingles of text containing keywords

The overall duplication percentage of the page is not nearly the biggest issue, but it can and will cause Google to frown on your pages. Again, there is much debate as to what a tolerable level of duplication is… But Google’s not saying (for obvious reasons). My best estimation is that at or around 25 to 30% duplication, a page will begin to fail you.

The next and really more serious duplication factor in my estimation is large blocks of text containing keywords which are 100% duplicate. Whether it be theory or fact, we know Google is capable of identifying search terms in a page and presenting a snippet for an organic search result description… Therefore, in like function, large blocks of text containing important text and keywords which are duplicate will be easily identified. This is very specifically a debilitating issue when the block of duplicated content is at the top of a container, like a paragraph or div for example.

When content is at the top of a page, paragraph, div or table… Google considers it to be of greater importance than the text below it in the same container. Duplicate content in a position of power such as this is the quickest way for a page to fail at the hands of Google’s duplication filters.

So what’s the answer here…

How do I fix the duplicate content on my site?

Well, the simplest answer is probably not what most shop owners will want to hear (at least in my experience), but create great descriptive and unique content for your shops… Every page! At least any you want to perform well =-P!

So maybe you don’t have the time or skill to write such killer content? You can take a great deal less time and “shingle” the content provided to you by the manufacturers. This is really not a complicated process at all. Google reads text in blocks, or shingles… My best guess is shingles of about 10 words. So if you take the text provided by the manufacturer and change or add a word every 5 words… then it is no longer duplicate =-). Please notice I DID NOT say delete a word, as this does nothing much at all; you must add or change a word. I tell shop owners to work in 5 word shingles simply to be on the safe side, and ideally they have even better content when they are done.
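
The shingle idea can be sketched in a few lines. This is only a model of the reasoning above, not Google's actual method: the 10-word shingle size is the guess stated in this post, and the function names and the placeholder word `changed` are made up for the example. It measures what share of the original text's shingles survive in a rewrite.

```python
def shingles(text, size=10):
    """Return the set of overlapping word runs ("shingles") of the given size."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def duplicate_fraction(original, rewrite, size=10):
    """Share of the original's shingles that still appear in the rewrite."""
    source = shingles(original, size)
    if not source:
        return 0.0
    return len(source & shingles(rewrite, size)) / len(source)


def change_every(text, step=5, marker="changed"):
    """Swap every step-th word for a placeholder to mimic a manual edit."""
    words = text.split()
    return " ".join(
        marker if (i + 1) % step == 0 else word
        for i, word in enumerate(words)
    )
```

Under this model, changing a word every 5 words means every 10-word shingle contains at least one changed word, so none of the original shingles survive and `duplicate_fraction` drops to 0.0. Real edits should of course be meaningful rewording, not a placeholder.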

It can be very disheartening to hear shop owners say “I don’t have time for that”… I just generally respond, you will have plenty of time when your online shop has no sales… Just do it then =-)

Melanie E-Commerce SEO