Google Patent – Duplicate Document Detection


Duplicate document detection in a web crawler system

Is your content unique?

On 12/1/2009, Google was granted a patent by the US Patent Office detailing how duplicate documents are detected in a web crawler system. The patent describes how Google detects duplicates, determines which document is the “more important” version, and then filters the rest for the purpose of providing unique search results.

Most of this information we have had good reason to accept as true for some time, but having it detailed in this manner can genuinely help webmasters understand not only what duplicate content is, but how to create unique pages/content for searchers to find.

Duplicate documents are documents that have substantially identical content, and in some embodiments wholly identical content, but different document addresses.

The patent details three situations in which duplicate content is detected and treated as such:

  1. Two documents, including any combination of regular web pages and temporary (302) redirect pages, are duplicates if they share the same page content but have different URLs.
  2. Two temporarily redirected pages are duplicates if they share the same target or destination URL but have different source URLs.
  3. A regular web page and a temporary redirect page are duplicates if the URL of the web page is the target/destination URL of the redirect, or if the content of the regular web page is the same as that of the temporarily redirected page.

Several of these situations involve temporary redirects, such as a meta refresh of 0 seconds, JavaScript redirects, and 302 redirects. Permanently redirected (301) pages are not associated in this manner for duplicate document detection, because the web crawler does not download the content of the redirecting page, only that of the target or destination page.
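As a rough illustration of those three rules, here is a minimal sketch in Python. The Page record, the field names, and the example URLs are my own invention for illustration, not data structures from the patent.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Page:
        """A crawled record: either a regular page or a temporary redirect.

        For a regular page, `content` holds the page body and `redirect_target` is None.
        For a temporary redirect (302, meta refresh of 0 seconds, JavaScript redirect),
        `redirect_target` holds the destination URL.
        """
        url: str
        content: Optional[str] = None
        redirect_target: Optional[str] = None

    def are_duplicates(a: Page, b: Page) -> bool:
        if a.url == b.url:
            return False  # same address, so not a duplicate pair by this definition

        # Rule 1: same page content, different URLs (any mix of pages and temp redirects).
        if a.content is not None and a.content == b.content:
            return True

        # Rule 2: two temporary redirects that point at the same destination URL.
        if a.redirect_target is not None and a.redirect_target == b.redirect_target:
            return True

        # Rule 3: a regular page whose URL is the destination of a temporary redirect.
        if a.redirect_target == b.url or b.redirect_target == a.url:
            return True

        return False

    # Example: a 302 redirect pointing at a regular page is treated as its duplicate.
    page = Page(url="http://example.com/widgets", content="<html>widgets</html>")
    redir = Page(url="http://example.com/old-widgets", redirect_target="http://example.com/widgets")
    print(are_duplicates(page, redir))  # True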

How does Google detect duplicate content?

Google compares pages/documents to one another to determine whether they are duplicates. It can use the whole page, a fingerprint of the page (a “thumbprint,” so to speak), phrases, and essentially any content the crawler can read. What was not clearly disclosed is the amount or percentage of duplication that would actually trigger a filtering action from Google with regard to the index.
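The patent does not spell out exact thresholds, but comparing a whole-page fingerprint alongside phrase-level “shingle” hashes is one common way such a check can work. Here is a minimal sketch, assuming simple SHA-256 hashes and five-word shingles; this is my illustration, not Google’s actual method.

    import hashlib
    import re

    def normalize(text: str) -> str:
        """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
        return re.sub(r"\s+", " ", text.lower()).strip()

    def whole_page_fingerprint(text: str) -> str:
        """A single hash of the whole page: catches exact or near-exact copies."""
        return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

    def shingle_fingerprints(text: str, k: int = 5) -> set:
        """Hashes of overlapping k-word phrases: allows measuring partial overlap."""
        words = normalize(text).split()
        return {hashlib.sha256(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
                for i in range(max(1, len(words) - k + 1))}

    def overlap(a: str, b: str, k: int = 5) -> float:
        """Fraction of shared shingles (Jaccard similarity), 0.0 to 1.0."""
        sa, sb = shingle_fingerprints(a, k), shingle_fingerprints(b, k)
        return len(sa & sb) / max(1, len(sa | sb))

    page_a = "Blue widget, hand made, free shipping on all orders over fifty dollars."
    page_b = "Blue widget, hand made, free shipping on all orders over fifty dollars!"
    print(whole_page_fingerprint(page_a) == whole_page_fingerprint(page_b))  # False: only the punctuation differs
    print(round(overlap(page_a, page_b), 2))                                 # high overlap: ~0.78

Whatever cutoff Google actually applies to a score like that overlap value is exactly the part the patent leaves undisclosed.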

Upon determining that two documents are duplicates, Google then attempts to determine which document is the most important. This decision is about importance, not PageRank or ownership. Once this decision has been made, Google filters the duplicate page(s) from search results in order to provide searchers with unique and qualified results.
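A sketch of that filtering step: group documents by content fingerprint, keep the most important one in each group, and drop the rest. The URLs, fingerprints, and importance scores below are invented for illustration; the patent does not define the scoring used here.

    from collections import defaultdict

    # Hypothetical crawl data: URL -> (content fingerprint, importance score).
    crawled = {
        "http://example.com/widgets":            ("fp-1", 0.92),
        "http://example.com/widgets?sort=price": ("fp-1", 0.31),
        "http://mirror.example.net/widgets":     ("fp-1", 0.18),
        "http://example.com/about":              ("fp-2", 0.40),
    }

    # Group documents that share a fingerprint into duplicate clusters.
    clusters = defaultdict(list)
    for url, (fingerprint, importance) in crawled.items():
        clusters[fingerprint].append((importance, url))

    # Keep only the most important document from each cluster; filter the rest.
    kept = {max(docs)[1] for docs in clusters.values()}
    filtered = set(crawled) - kept

    print(sorted(kept))      # the representative of each cluster
    print(sorted(filtered))  # duplicates dropped from the result set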

There are many ways that shop owners create duplicate content. Many are easily managed and some are not. Doing everything in your power to prevent it is the best course of action, as there will inevitably be duplication you cannot prevent.

Some of the most common ways that shop owners duplicate their content (a URL cleanup sketch follows the list):

  • Copying product descriptions from distributors or other indexed web pages
  • Not enough unique, page-specific content to differentiate a page within a common site template
  • Posting articles or text from other websites
  • Not properly redirecting the non-canonical URL to the canonical one (www or non-www)
  • Failing to exclude pages like the shopping cart page in robots.txt to prevent indexing
  • Not using Google’s parameter handling to exclude or ignore sorted page URLs
  • Rewritten (SEO) URLs whose original URLs are not redirected to them with a permanent (301) redirect
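Several of those items come down to making sure one piece of content answers to one URL. Here is a small sketch of that kind of URL cleanup, assuming a preferred www host and a hypothetical list of ignorable parameters you would choose for your own shop:

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    PREFERRED_HOST = "www.example.com"               # assumed canonical host for this shop
    IGNORED_PARAMS = {"sort", "order", "sessionid"}  # assumed parameters that only reorder content

    def canonicalize(url: str) -> str:
        """Return one canonical form of a URL so equivalent addresses don't multiply."""
        scheme, host, path, query, _fragment = urlsplit(url)

        # Collapse www/non-www to the single preferred host.
        if host in ("example.com", "www.example.com"):
            host = PREFERRED_HOST

        # Drop parameters that change ordering or sessions but not the content itself.
        params = [(k, v) for k, v in parse_qsl(query) if k.lower() not in IGNORED_PARAMS]

        return urlunsplit((scheme, host, path or "/", urlencode(params), ""))

    print(canonicalize("http://example.com/widgets?sort=price"))
    # -> http://www.example.com/widgets
    print(canonicalize("http://www.example.com/widgets?color=blue&sessionid=abc123"))
    # -> http://www.example.com/widgets?color=blue

On the server side, the same mapping would be enforced with a permanent (301) redirect from every non-canonical form to the canonical one, so crawlers stop seeing the duplicates at all.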

If you think about it, Google is not trying to penalize anyone, but rather attempting to help searchers find fresh, unique and qualified results. Last time I checked, this is the goal of a search engine =-)