SEO 101: Questions about XML Sitemaps

From the archives…

I just read Jeff Atwood’s post on Coding Horror about the importance of Sitemaps. I’m always eager to hear about people’s experiences since I spent so much time on XML Sitemaps and getting sitemaps.org launched while I was at Google. Sitemaps, of course, are supported by Google, Yahoo, and Live Search. All you have to do is reference the Sitemap location in your robots.txt file and all the engines will pick it up.
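For example, a single line anywhere in robots.txt does the trick (the domain and filename here are placeholders — use whatever your Sitemap is actually called):

```
Sitemap: https://www.example.com/sitemap.xml
```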

Atwood noted that he uses Google to search for his own stuff, which makes it that much more frustrating when some of the content isn’t indexed. (Not to mention of course, the lost visitor opportunities.) Once he created an XML Sitemap, Google started finding and indexing more of his pages. Yay!

However, he and his commenters had a few questions about the process, so I thought I’d take a few minutes to answer them. Of course, I don’t work for Google anymore, so these answers are entirely my own. If you want official answers, check out the Official Google Webmaster Help forum.

Why is Google having so much trouble crawling my dynamic site? Can’t Googlebot figure out my URL scheme? (I’m paraphrasing Atwood’s post here.)
I haven’t spent a lot of time studying stackoverflow.com (the site in question), but since Google is crawling and indexing the URLs after finding them in the Sitemap, the problem likely isn’t with the dynamic nature of the URLs themselves. The issue is probably that the internal linking structure doesn’t provide links to every single page. Since Googlebot crawls the web by following links, it wouldn’t know about the unlinked URLs. Atwood notes this possibility:

“On a Q&A site like Stack Overflow, only the most recent questions are visible on the homepage… I guess I was spoiled by my previous experience with blogs, which are almost incestuously hyperlinked, where everything ever posted has a permanent and static hyperlink attached to it, with simple monthly and yearly archive pages. With more dynamic websites, this isn’t necessarily the case.”

Of course, pages with few links to them (particularly no external links) may not have substantial PageRank and therefore are unlikely to rank for anything other than long tail queries. But since the scenario Atwood describes is all about long tail queries (typing in the exact title of a page, for instance), getting those pages crawled and indexed is sufficient.

To dig a bit more into Atwood’s needs, he says, “It’s far easier to outsource the burden of search to Google and their legions of server farms than it is for our tiny development team to do it on our one itty-bitty server. At least not well.” If he’s looking to provide comprehensive search for visitors of his site, he might consider Google’s custom search engine (CSE). Generally, the CSE searches over what’s in the Google index. But if you’ve submitted a Sitemap, Google will maintain a CSE-specific index that contains any URLs from the Sitemap that aren’t in Google’s web search index. So, the CSE could provide even better search results than a regular web search.

Why would Google put some URLs in the CSE-specific index and not the regular web index? Well, Google’s algorithms use lots of criteria for determining not only how to rank pages, but what pages to crawl and index as well. So, if, for instance, Googlebot has crawled what it’s deemed the maximum number of URLs from your site for the week for the web index (I’m over-simplifying here a bit), it can still add the remainder to the CSE index.

It doesn’t sound very scalable. (from John Topley)
You can easily write a script that updates the Sitemap each time the site is updated. And if your Sitemap reaches the maximum size, you can break it up into multiple Sitemaps automatically or you can segment them by folder (or whatever organizational structure works best for you). If you want, you can even ping the search engines each time the Sitemap is updated, or you can just reference it in your robots.txt file as Atwood suggests and let them pick it up.
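As a sketch of what such a script might look like (the URL list is hypothetical — a real script would pull it from the site’s database), this builds one or more Sitemap files, splitting automatically once it hits the Sitemap protocol’s per-file limit of 50,000 URLs:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file URL limit from the Sitemap protocol


def build_sitemaps(urls):
    """Return a list of <urlset> XML strings, starting a new
    file whenever the per-file URL limit is reached."""
    sitemaps = []
    for start in range(0, len(urls), MAX_URLS_PER_FILE):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for loc in urls[start:start + MAX_URLS_PER_FILE]:
            url_el = ET.SubElement(urlset, "url")
            ET.SubElement(url_el, "loc").text = loc
        sitemaps.append(ET.tostring(urlset, encoding="unicode"))
    return sitemaps


# Hypothetical page list; run this whenever the site is updated.
pages = ["https://example.com/questions/%d" % i for i in range(1, 4)]
for xml_doc in build_sitemaps(pages):
    print(xml_doc)
```

If your Sitemaps do split into multiple files, the protocol also lets you list them in a single Sitemap index file, so robots.txt still only needs to reference one location.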

How do you determine change frequency? (John Topley)
If your script can determine this, then you can set it up programmatically. Otherwise, I’d skip this attribute and just concentrate on listing the URLs.
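For reference, changefreq is just one of the optional fields on each URL entry; the values below are purely illustrative:

```xml
<url>
  <loc>https://example.com/questions/42</loc>
  <lastmod>2008-09-15</lastmod>
  <changefreq>weekly</changefreq>
</url>
```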

I think google is not happy with the “dynamic” parts of the url e.g. “?” or “&” (Marcel Sauer)
Google does fine with dynamic URLs. They can have trouble if the dynamic nature of the site leads to things like infinite URLs, lots of URLs that display the same page, crazy parameters, or recursive redirects, but as I noted above, the trouble tends not to be with the URLs themselves, but the fact that they aren’t always well-linked.

8 thoughts on “SEO 101: Questions about XML Sitemaps”

  • Amit Agarwal

    Hi Vanessa – Thanks for this informative article – the frequency field of sitemaps has always been very confusing but glad you covered it.

    I was reading a recent post on SEOMoz that quoted a discussion from SMX East. It says “Put really important pages in your sitemap, rather than every page on your site. ”

    Would love to hear your opinion on this.

  • Vanessa

    There are several ways people approach what to put in Sitemaps:

-Put the important pages in the Sitemap. This method is a good one if it’s problematic to put all pages in the Sitemap. The point of the Sitemap is to let search engines know more about your site, particularly about the pages of your site, and this approach tells the search engines about the pages you care about most. That should give search engines a signal that, all other things being equal, these pages are the ones you care about. (Of course, all signals normally aren’t equal, so instead this will be one signal balanced among many, but the same idea holds.) So, that’s a solid approach.

-Put the non-indexed pages in the Sitemap. The idea behind this method is that search engines already know about the rest of your site, so you’re just making sure they know about these as well. This may seem like the opposite of the first approach: if the first approach signals that the pages in the Sitemap are the most important, wouldn’t search engines read this set of URLs the same way, when really they might be the least important (hence the non-indexing)? It may seem that way, but actually that’s not the case. Since search engines use the Sitemap as one of many signals, what you’re really saying with URLs in a Sitemap is: hey, search engine! Pay attention to these pages! It generally won’t cause the search engine to forsake all the other signals that caused indexing of the other pages; it will just focus some extra attention on these. A Sitemap comes into play the most in the crawling process. So, if some pages aren’t indexed, it makes sense to make sure the search engines know about them so they can crawl them.

    -Put a comprehensive list of URLs in the Sitemap. This is my preferred approach when it’s technically practical. Why not tell search engines what the definitive list of pages on your site is? Why limit it to really important ones? One benefit to this is that there’s at least one place other than crawling that Sitemaps can be helpful, and that’s canonicalization. If a search engine has detected that several URLs display the same page, the version of the URL that’s in the Sitemap is a signal as to which is the canonical version.

    In reality, any of these approaches are good ones. Sitemaps enable the site owner to have a voice in the long list of signals that search engines use to crawl and index pages. Since they’re a signal and not a directive, they don’t correlate to just one option. The signal tells the search engines that you care about their crawlers taking a look at these pages, and many times, they then do.

    I imagine that each search engine uses the Sitemap signals slightly differently, since after all, each search engine has different crawling and indexing algorithms. However, I do think that it would be useful for the search engines to come together and let us know how exactly they use them and how they differ in using them. In particular, it would be very helpful if, as part of sitemaps.org, they got together and made sure they weren’t using Sitemaps for opposing purposes. You don’t want to have a shared standard that is used so differently that if a site owner compiles a Sitemap in a particular way, it helps with one search engine and hurts with another.

    When I worked on the sitemaps.org collaboration, it was all about figuring out what the standard should be and coming together to support it. Now that all the major engines do, I think the next step is sorting out more details about how they’re used (particularly since the search engines should now have lots of data about how they can best be used) and give site owners best practices.

  • Brent D. Payne

    Nice post Vanessa. Interesting how people interact a lot for a while and then not so much for a long while. Hopefully this is coming up on a time where we’ll start doing more interacting naturally again. 😉

    P.S. I’m on season 4 of Buffy. I’m catching up to get the sub-culture of search–that you created.

  • Jeff Atwood

    Hi Vanessa,

    Great article! And thanks for Google webmaster tools & sitemap.org, they’re both fantastic resources.

    One clarification, however:

    “the problem likely isn’t with the dynamic nature of the URLs themselves. The issue is probably that the internal linking structure doesn’t provide links to every single page. Since Googlebot crawls the web by following links, it wouldn’t know about the unlinked URLs”

    I don’t think this is true; every single question in the system can be reached through a direct hyperlink *IF* you follow the pagination links, as I mentioned in my article:

    http://stackoverflow.com/questions
    http://stackoverflow.com/questions?page=2
    http://stackoverflow.com/questions?page=3
    ..
    http://stackoverflow.com/questions?page=931

    The problem, from our perspective, is that Googlebot simply wasn’t doing that.

    Blogs are a simpler case because all the archive pages are generally in the form:

    http://myblog/archives/2008-06
    http://myblog/archives/2008-07

    etcetera. This led us to believe, based on observed behavior, that Googlebot couldn’t follow our pagination links.

    However, in retrospect sitemap.xml is probably a more *efficient* way for Googlebot (and any other search engines) to discover URLs to each question in Stack Overflow. No page loads are incurred on the server, no extra parsing of meaningless (to searchbots) markup, and so forth.

  • Jeff Atwood

    Hi Vanessa,

    I entered a comment reply to this blog entry but it hasn’t been posted yet? Did I mess up, or was it eaten by spam filters somehow?

    At any rate, I just wanted to thank you for the great and informative blog entry.

    Jeff

  • Guardian

    Hey Vanessa,

    A master stroke with an informative as well as resourceful post. I am almost a newbie to search, but I have a great power to locate the right sources in the search industry and you are among them. I have been following your RSS feeds as well as your blog. You would be happy to know that I have gained much knowledge about search just by following your blog as well as Danny’s Daggle.

  • Richard McLaughin

    (whine) I still find a lot of pages that are in my xml file that Google has yet to find.

    Great post.
