How does Google analyze the content of a web page?


On March 22, 2011, Google was granted a new US patent. This patent, named “Determining semantically distinct regions of a document,” describes how Google may decide the importance of a web page’s structural containers based on their location, their commonality across pages, and how they display. This patent may grant us some great insight into how content is valued in various positions and containers on our web pages.

Google grabs the page’s source and attempts to simulate a typical browser display. This lets Google reasonably surmise the different page elements, their purpose, and thus their relative importance. Common areas such as the header, footer, sidebars, and ads are undoubtedly given less weight than the main content area of the page. One cool idea I would like to see come from this is Google downgrading duplication issues for these common areas such as the header and footer. Taking these site-wide areas of duplication out of the duplication filter would let us use a highly usable, site-wide template, which users consistently favor, without the need to “pump up” each page’s unique main-content text just to overcome the common template text. Wouldn’t that be awesome?
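To make the idea concrete, here is a toy sketch in Python (my own illustration, not anything from the patent, and assuming BeautifulSoup is installed) of how a parser might map HTML containers to regions and score main content above the shared template areas. The weights are invented for demonstration:

```python
# Toy sketch: classify page elements by their nearest semantic
# container and assign each a hypothetical importance weight.
from bs4 import BeautifulSoup

# Invented weights: main content counts most, shared template
# areas (header, footer, sidebar, nav) count least.
REGION_WEIGHTS = {"main": 1.0, "article": 1.0, "aside": 0.3,
                  "nav": 0.2, "header": 0.2, "footer": 0.1}

def region_weight(tag):
    """Walk up to the nearest known container and return its
    weight; fall back to 0.5 when no container is recognized."""
    parent = tag.find_parent(list(REGION_WEIGHTS))
    return REGION_WEIGHTS[parent.name] if parent else 0.5

html = """<body>
  <main><p>Gadsden US Flag in durable sewn nylon.</p></main>
  <footer><p>Copyright and shipping boilerplate.</p></footer>
</body>"""

soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all("p"):
    print(region_weight(p), "->", p.get_text())
# 1.0 -> Gadsden US Flag in durable sewn nylon.
# 0.1 -> Copyright and shipping boilerplate.
```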

The new patent (7,913,163) lists several “modules” used to analyze a web document. These modules (link analysis, text analysis, image captioning, and snippet construction) all weight elements relative to their position on the page. For example, a footer link would carry less weight than a link in the main content, and by the same rule a link inside a paragraph in the main content would carry even more weight. Important search terms shoved into the footer or header will not help you nearly as much as the same terms in the main content area. Images surrounded by relevant text are, by the same logic, easier to rank in image search, because the relevant text in close proximity to the image provides more weight.
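As a hedged illustration of that weighting idea, the sketch below scores occurrences of a search term by the weight of the region they appear in; again, the numbers are invented for demonstration, not taken from the patent:

```python
# Illustration only: the same term contributes far less from a
# low-weight region (e.g. a stuffed footer) than from main content.
def weighted_term_score(term, regions):
    """regions: list of (region_weight, text) pairs for one page."""
    term = term.lower()
    return sum(w * text.lower().count(term) for w, text in regions)

page = [
    (1.0, "Gadsden flag in sewn nylon, the classic Gadsden design."),
    (0.1, "flags flags flags flags flags"),  # keyword-stuffed footer
]
print(weighted_term_score("gadsden", page))  # 2.0 from two main-content hits
print(weighted_term_score("flags", page))    # 0.5 despite five footer hits
```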

Something else to consider is the use of proper containers throughout your content. A great example we see quite often is a product description that is just a list, or that lacks proper paragraph tags to logically break the content into separate thoughts. The issue here is that without a proper textual container such as a paragraph, Google has little to generate a snippet from. Google doesn’t always use our own meta description; instead, they pull a snippet from a container that contains, or is relevant to, the searcher’s query. Here is an example:

This product, a Gadsden US Flag, is searched for using many different terms. In the illustration below, you can see that effort was made not only to cover all the ways people search for this flag, but to include each one in a container that Google can find worthy of a proper snippet.

As you can see, the effort to demonstrate the importance and relation of these terms has been mapped and planned out very well. The biggest “tell” is that this page performs very well in search for the terms it was written for. Note that at no time did we spam, stuff, or create anything odd… This is also great text for our shoppers.
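To make the snippet point concrete, here is a minimal sketch of one plausible behavior (my assumption of the mechanism, not the patent’s wording): pick the paragraph that shares the most terms with the query. Without paragraph boundaries there is no coherent container to pick from:

```python
# Minimal sketch: choose the paragraph with the most query-term
# overlap as the snippet source. No <p> containers, no snippet.
def pick_snippet(query, paragraphs, max_len=160):
    terms = set(query.lower().split())
    best = max(paragraphs,
               key=lambda p: len(terms & set(p.lower().split())))
    return best[:max_len]

paragraphs = [
    "Our Gadsden flag is made of durable sewn nylon.",
    "This Don't Tread On Me flag ships the same day you order.",
]
print(pick_snippet("don't tread on me flag", paragraphs))
# -> This Don't Tread On Me flag ships the same day you order.
```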

I think it is possible that incredibly complicated page designs, with important content in irregular or oddly positioned containers, may suffer. In the end, a standard, easy-to-navigate, predictable layout converts best, because shoppers are more comfortable and better able to navigate. What is good for Google is good for your shoppers as well… Win, win!

