Anatomy of a Proper Search Engine Sitemap

Sitemaps 101

Call this an eye-opener, and perhaps a myth buster, for how you think about sitemaps. First and foremost, we are not talking about the HTML sitemap you build for human visitors, but rather a sitemap specifically formatted for search engines to read.

My hope is that you will have a better understanding of the Sitemap Protocol and its intention. Because really, I have heard it all.

First, a few myth busters!

You must have a search engine sitemap – FALSE
You must resubmit your sitemap when pages change – FALSE
The search engines will index your pages if they are in your sitemap – FALSE
In your sitemap you can tell the search engines how often to come back with the <changefreq> tag – FALSE
The <priority> tag forces the search engine to pay special attention to certain pages – FALSE
Altering your <lastmod> date will trick the search engines into crawling more frequently – FALSE

Now that those are out of the way, why should you use a sitemap.xml?

Sites with dynamic content, complicated navigation, a very deep directory structure or poor internal linking are hard for search engines to crawl. For these, and for most sites, a sitemap provides a “road map” that helps the search engines find and crawl your pages.

The universal Sitemap Protocol and the Robots Exclusion Standard are quite complementary. The Sitemap Protocol is a universal tool for delivering pages to the search engines for crawling, while the Robots Exclusion Standard is a mostly universal protocol for keeping pages from being crawled.

So logically speaking, URLs which are blocked from crawling in robots.txt should not be in your sitemap. Additionally, pages blocked from indexing should not be in your sitemap either. There are a very few occasions when a page should be crawled but blocked from indexing. We are not going to get into this, as it is a situation you will probably never see. So rule #1: if the search engines cannot crawl or index a page, it does not belong in your sitemap.
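Here is a minimal sketch of how you might check rule #1 yourself, using Python's standard urllib.robotparser module. The robots.txt rules and the URL list are made up for illustration, and blocked_urls is just a name I picked. It only covers robots.txt blocking, not pages kept out of the index by a noindex tag.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
"""

def blocked_urls(robots_txt, sitemap_urls, user_agent="*"):
    """Return the sitemap URLs that robots.txt forbids the given crawler to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if not parser.can_fetch(user_agent, url)]

# Hypothetical sitemap entries; the last one collides with the Disallow rule.
sitemap_urls = [
    "http://www.example.com/",
    "http://www.example.com/page.html",
    "http://www.example.com/admin/login.html",
]

conflicts = blocked_urls(ROBOTS_TXT, sitemap_urls)
print(conflicts)
```

Anything that shows up in the list is a URL you are telling the search engines to crawl and not to crawl at the same time, so pull it from the sitemap or fix the robots.txt rule.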

The standard format for search engine sitemaps is XML. This is no mistake, and is actually quite appropriate. The sheer simplicity and clear formatting of XML allow the search engines to crawl a great deal of data in a very short period of time. Bingo, we like this!

So I know there are quite a few sitemap modules out there for Zen Cart, and I am not beating anyone up here… But consider these facts.

A sitemap index is NOT required unless your sitemap contains more than 50,000 URLs or is larger than 50MB uncompressed. It is allowed, but why would we make our sitemap any more complicated or longer to crawl than necessary?
This is not a sitemap for humans, so why would we have our URLs displayed in tables and with styles to impede a speedy and efficient crawl?
Your sitemap is for your site’s URLs only, so why would we have resource and developer links in the footer of our sitemaps? Why would we even have a footer?
The links in your sitemap should be in their proper canonical format, not with session IDs and language parameters… That is duplicate content.
Your URLs must be properly escaped. Characters outside plain ASCII have to be URL-encoded, and characters such as & have to be entity-escaped for XML. So you folks using URL rewrites need to be very careful about special characters in your product titles such as {} * & @ ©, or your sitemap will not be valid or accepted.
Unless you are selling images, image links do not belong in your sitemap.
If your site is www, then all URLs in your sitemap are www, no exceptions and no mixtures of www and non-www… Again, duplication.
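
To make the escaping and www points above concrete, here is a rough Python sketch. loc_text and off_host are hypothetical helper names, not part of any sitemap module, and the sample URLs are for illustration only. It assumes the host name itself is plain ASCII.

```python
from urllib.parse import quote, urlsplit
from xml.sax.saxutils import escape

# Characters that may stay as-is in a URL. "%" is kept so that
# escapes already present in the URL are not encoded a second time.
URL_SAFE = ":/?#[]@!$&'()*+,;=%"

def loc_text(url):
    """Return the text to place inside <loc>: percent-encoded, then entity-escaped."""
    encoded = quote(url, safe=URL_SAFE)  # non-ASCII and {} become %XX sequences
    return escape(encoded, {"'": "&apos;", '"': "&quot;"})  # & becomes &amp; and so on

def off_host(urls, canonical_host):
    """Return the URLs whose host differs from the one canonical host."""
    return [url for url in urls if urlsplit(url).hostname != canonical_host]

print(loc_text("http://www.example.com/ümlat.php&q=name"))
# prints http://www.example.com/%C3%BCmlat.php&amp;q=name

print(off_host(["http://www.example.com/", "http://example.com/page.html"],
               "www.example.com"))
# prints ['http://example.com/page.html']
```

If you build the file with an XML library rather than by pasting strings together, the library normally does the entity escaping for you, so in that case only the percent-encoding step applies.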

Submit your new sitemap.xml to Google, Bing and Yahoo in their webmaster tools… Or don’t, and just add its location to your robots.txt. Either way is fine. You do not need to keep resubmitting your sitemap to the search engines; once it is submitted, they know where it is. If you choose to resubmit it after changes, do not abuse the privilege.
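
If you go the robots.txt route, the entry is a single Sitemap: line holding the full URL of the file. As a rough sketch (the robots.txt content below is hypothetical), Python's standard urllib.robotparser can confirm that the line is readable. Note that site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the Sitemap line takes a full URL.
ROBOTS_WITH_SITEMAP = """\
User-agent: *
Disallow: /admin/

Sitemap: http://www.example.com/sitemap.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_WITH_SITEMAP.splitlines())

# site_maps() returns the declared sitemap URLs, or None when there are none.
declared = sitemap_parser.site_maps()
print(declared)
```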

When you have a proper sitemap, it will look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
      http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
  <loc>http://www.example.com/page.html</loc>
  <lastmod>2005-01-01</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.6</priority>
</url>

<url>
  <loc>http://www.example.com/</loc>
  <lastmod>2005-01-01</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>
</urlset>

See, really not for humans… Your sitemap is a lean, mean search engine feeding machine!
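
If you would rather generate a file like the one above than write it by hand, here is a minimal sketch using Python's standard xml.etree.ElementTree. build_sitemap and the page list are hypothetical, not taken from any Zen Cart module. It writes only <loc> and <lastmod>, since the protocol treats everything except <loc> as optional. The xml_declaration argument needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build a <urlset> document from (loc, lastmod) pairs and return it as bytes."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        # ElementTree entity-escapes text on output, so pass URLs that are
        # percent-encoded but NOT already entity-escaped.
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)

# Hypothetical pages: canonical URLs only, all on the same www host.
pages = [
    ("http://www.example.com/", "2005-01-01"),
    ("http://www.example.com/page.html?a=1&b=2", "2005-01-01"),
]

xml_bytes = build_sitemap(pages)
print(xml_bytes.decode("utf-8"))
```

No tables, no styles, no footer. Just URLs.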