Everyone loves benchmarks! What’s normal? Am I (I mean, my site) better or worse than the average?
Search engine crawl behavior is no different. Is my site being crawled more or less than other sites? (See my earlier post on Googlebot crawl behavior, crawl budgeting, and crawl efficiency.)
Joost de Valk recently talked about the crawl behavior of Google and Bing (and SEO tools) related to his site yoast.com. But is what he’s seeing for his site “typical”? (Note that he looked at activity from both search engine crawlers and SEO tools, but I focused solely on search engine crawlers. He also wasn’t looking at the data to discover benchmarks; his point was about energy consumption, which is a good one, but not related to what I was looking at for this post.)
Keylime Toolbox Crawl Analytics analyzes server logs to provide insights on search engine crawl behavior. I looked at 40 US-based sites of varying size and technical state (in various unrelated industries). The smallest site was less than 200 pages and the largest site was more than 10 million pages.
What did I discover? There are no averages, no benchmarks. Like much of SEO, the data is useful only in context with your own site.
Google Search Console Crawl Stats Numbers Don’t Equate to the Number of URLs Crawled a Day
As I’ll be discussing at SMX Advanced during the “Solving Complex SEO Problems When Standard Fixes Don’t Apply” session, Google Search Console Crawl Stats reports:
- Are across all of Google’s crawlers, not just the crawlers used for indexing (they include crawling for AdSense and AdWords, for instance).
- Report pages crawled per day, but not unique, indexable pages. For instance, Google may request the same page multiple times, and may request all of the associated resources for a page (images, CSS files, JavaScript files, and so on). These “pages” might also be non-indexable (for example, they might redirect, be non-canonical, include a noindex attribute, and so on).
Below is a sample from a Keylime Toolbox Crawl Analytics report (in Excel) that shows how URLs may be crawled multiple times a day:
And below is an example of resources being crawled. It’s important that Googlebot be able to crawl these, but they aren’t indexed separately, so they don’t contribute to the total indexable URL count:
What this means is that if you are using the Google Search Console Crawl Stats “pages crawled per day” number as a general estimate of the number of indexable URLs on your site crawled per day (for instance, to calculate how long it will take Google to recrawl your site and for changes to be reflected in Google’s index), you are likely not getting a good estimate. If Crawl Stats reports 10,000 pages crawled a day but only 1,000 of those are unique, indexable URLs, a recrawl estimate based on the Crawl Stats number will be off by a factor of ten.
You can use server logs to determine how many unique, indexable URLs Google is really crawling a day by looking at:
- Unique URLs (vs. total requests) – Keylime Toolbox reports each of these metrics separately.
- Unique URLs that return a 200 or 304 – Keylime Toolbox lists these separately.
- Unique, indexable URLs – this one’s a little harder. The way I do it is to copy the list of URLs that return a 200 or 304 (from the Keylime Toolbox report) into a separate Excel file, filter out resources, and then crawl the remaining URLs by uploading the list to Screaming Frog. From that output, I can organize the URLs into noindex, non-canonical, and canonical. (To separate canonical from non-canonical URLs, I create an extra column in Excel and use the formula =EXACT(A2,V2), where column A holds the crawled URL and column V holds the canonical value.) A scripted version of the first two bullets is sketched below.
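If you’d rather script the log-filtering part than work in Excel, here’s a minimal sketch in Python. It assumes a combined log format and a file named access.log (both assumptions; adjust the regex and filename for your own server), matches Googlebot by user agent string only (no reverse DNS verification), and counts total requests, unique URLs, and unique URLs returning a 200 or 304:

```python
import re
from collections import Counter

# Matches the combined log format: IP, identd, user, [time],
# "METHOD path HTTP/x.x", status, size, "referer", "user agent"
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

requests = Counter()          # every Googlebot request, per URL
urls_200_304 = set()          # unique URLs that returned 200 or 304

with open("access.log") as f:  # assumed filename
    for line in f:
        m = LINE_RE.match(line)
        if not m or "Googlebot" not in m.group("agent"):
            continue
        path = m.group("path")
        requests[path] += 1
        if m.group("status") in ("200", "304"):
            urls_200_304.add(path)

print(f"Total Googlebot requests:  {sum(requests.values())}")
print(f"Unique URLs requested:     {len(requests)}")
print(f"Unique URLs with 200/304:  {len(urls_200_304)}")
```

From there, the resource filtering and the Screaming Frog crawl proceed as described above.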
From the resulting list of canonical URLs, you might still have duplicates if the canonical attributes aren’t set up correctly, so sort the URLs alphabetically and check this list for duplication issues such as:
- URLs with a trailing slash and without
- URLs with varied case
- URLs with optional parameters
You can normally tell by skimming whether duplication issues exist. If not, then congratulations! The count of URLs on this resulting canonical list is the number of unique, indexable URLs that Google crawled that day! (If duplication issues do exist, this process helps to identify them, and you can then generally use filter patterns to reduce the list further to a true canonical set.)
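If the list is too long to skim comfortably, you can group URLs by a normalized form and flag any group with more than one member. A rough sketch; the normalization rules here are illustrative, and which parameters count as “optional” depends on your site:

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Replace with the canonical URL list from the previous step.
canonical_urls = [
    "https://example.com/Widgets/",
    "https://example.com/widgets",
    "https://example.com/widgets?ref=nav",
]

def normalize(url):
    """Collapse common duplication patterns: case, trailing slash,
    and query parameters (treated here as optional)."""
    parts = urlsplit(url)
    path = parts.path.lower().rstrip("/") or "/"
    return parts.netloc.lower() + path  # query string dropped

groups = defaultdict(list)
for url in canonical_urls:
    groups[normalize(url)].append(url)

for key, urls in groups.items():
    if len(urls) > 1:
        print(f"Possible duplicates for {key}:")
        for u in urls:
            print("  " + u)
```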
Using this process, here are results for two example sites:
- Site 1
  - GSC Crawl Stats – 1.3 million URLs crawled a day
  - Keylime Toolbox server log analysis – 800k URLs requested by Googlebot
  - Unique, indexable URLs – 1k URLs
- Site 2
  - GSC Crawl Stats – 120k URLs crawled a day
  - Keylime Toolbox server log analysis – 100k URLs requested by Googlebot
  - Unique, indexable URLs – 3k URLs
But even that’s not the full picture. Google recrawls some pages from one day to the next, so if Google crawls 1,000 unique pages each day, that doesn’t mean it will have crawled 5,000 unique pages in 5 days. There’s often overlap from day to day (in the sites I looked at, it varied from 10% to 80%).
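If you extract the unique, indexable URLs for several days in a row, you can measure this overlap directly and see the effective recrawl rate. A minimal sketch with placeholder data (in practice, build these sets with the log-parsing sketch above):

```python
# Per-day sets of unique, indexable URLs from the log analysis
# (placeholder data for illustration).
daily_url_sets = [
    {"/a", "/b", "/c", "/d"},
    {"/c", "/d", "/e", "/f"},
    {"/f", "/g", "/h", "/i"},
]

def overlap_pct(previous_day, day):
    """Share of today's URLs that were already crawled yesterday."""
    return 100.0 * len(previous_day & day) / len(day) if day else 0.0

for i in range(1, len(daily_url_sets)):
    pct = overlap_pct(daily_url_sets[i - 1], daily_url_sets[i])
    print(f"Day {i + 1}: {pct:.0f}% overlap with the previous day")

# The cumulative count shows the effective recrawl rate, which is
# what you'd use to estimate how long a full recrawl will take.
seen = set()
for i, day in enumerate(daily_url_sets, start=1):
    seen |= day
    print(f"After day {i}: {len(seen)} distinct indexable URLs crawled")
```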
My Actual Indexable URLs Crawled a Day Is Really Small! What Now?!
Is it a problem if the percentage of unique, indexable URLs is so small? Maybe, but maybe not. Google does have to crawl all of the 404s and the redirects and the resources and the non-canonical URLs, so the goal is not to get the crawl to a state where only unique, indexable URLs are crawled.
If the site is fully indexed and Google’s index generally reflects the latest content on the site, then the crawl may be fine. But if the site isn’t well indexed or it takes a long time for changes on the site to be reflected in Google’s index, then crawl efficiency improvements may be a higher priority than they otherwise would be.
Google and Bing Crawl Behavior Doesn’t Correlate
I found no pattern in Google vs. Bing crawl behavior across the sites. In some cases, the crawl volumes were similar. For some sites, Bing crawled significantly more. For others, Google crawled significantly more. (The difference ranged from Bing crawling only 1% of Google’s volume to Bing crawling 9,000% of Google’s volume.)
What does this mean? Depends on the site. For instance: If a site has a tricky technical implementation that Bing is having trouble with, Bing might crawl it less or might get caught up in loops and crawl it more. If Google has penalized a site, it might crawl it less.
Don’t Worry About Yahoo’s Crawler
Yahoo still crawls the web, but not much. For many sites in my sample, Yahoo crawled less than a hundred pages a day. For a few sites, Yahoo crawled around 5k URLs a day (vs. a million or more requests from Google), possibly for structured data extraction.
Search Engine Crawling: Every Site Is Different
Search engine crawling is very dependent on the particulars of the site.
- 404s – In nearly all cases, 404s and other errors comprised less than 10% of the crawl, and in most cases less than 5%. Is less better? Maybe, maybe not. If the crawl doesn’t contain any 404s, for example, it’s possible the site is misconfigured to return a 200 response code for invalid URLs.
- Redirects – the percentage of 301s and 302s varied widely. That’s to be expected since some sites have fairly recently migrated from http to https, a few changed their URL structures for other reasons, and so on. Typically, you’ll see a spike in redirects if you do some kind of migration, and then the percentage of redirects should decline once the search engines have crawled all of them. In the sample I looked at, the redirect percentage ranged from 0% to 60%.
Below is an example of the Keylime Toolbox graph that enables you to track these trends for your site:
- Unique URLs – Google often requests the same URL multiple times a day, particularly when signals associated with the page indicate it may change often (it’s the home page, it’s on a news site, the content actually does change a lot). Also, some of the requests are for resources (like CSS or JavaScript files), which may be required to render many of the pages on the site. I found no pattern in the percentage of total requests that were unique: it ranged from 9% to 100%, with everything in between.
The chart below shows the percentage of unique URLs crawled by Google in a day (ordered by site size: the smallest site is on the left and the largest is on the right).
This metric can’t be used to monitor or measure anything directly (you have to look at the actual URLs crawled for that), but it is useful for better understanding the crawl and for calculating how long it will take for SEO improvements to be reflected in performance.
Site size also wasn’t necessarily correlated with the number of URLs crawled a day, although Google does generally crawl more pages per day on larger sites than on smaller ones. The chart below shows site size (the blue line) compared to the total number of URLs Googlebot requests in a day for the data set I analyzed.
If I Can’t Benchmark, Then How Can I Use the Data?
Maybe you can’t use crawl metrics for benchmarking, but the data is useful in all kinds of ways. Below are just a few examples. What data is important for you to track depends on the site.
- How many unique, indexable pages are really being crawled each day? How long will it actually take for changes to be reflected in Google’s index?
- Is crawl efficiency an issue? If the site is being fairly comprehensively crawled, maybe not. Having the full picture helps prioritize crawl efficiency improvements. (As I’ll talk about at the SMX session, other data goes into this prioritization, such as how well the site is indexed and how often the site content changes.)
- As you make improvements, you can use the initial metrics to monitor changes. As you fix broken links, 404s should go down. As you redirect non-canonical URLs (like those with varied case), the number of URLs that return a 200 may briefly go down and redirects may briefly go up, and the ratio of canonical vs. non-canonical URLs being crawled should shift for the better.
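If you want to track those shifts yourself, a simple daily roll-up of status code percentages from the logs is enough to watch the trends. A sketch with placeholder data; in practice, you’d aggregate these counts while parsing the logs, as in the first sketch above:

```python
from collections import Counter

# Placeholder data: one Counter of status codes per day of
# Googlebot requests.
daily_status_counts = {
    "2018-05-01": Counter({"200": 9500, "301": 300, "302": 50, "404": 150}),
    "2018-05-02": Counter({"200": 9700, "301": 200, "404": 100}),
}

def status_share(counts, *codes):
    """Percentage of the day's requests that returned any of `codes`."""
    total = sum(counts.values())
    return 100.0 * sum(counts.get(c, 0) for c in codes) / total if total else 0.0

for day in sorted(daily_status_counts):
    counts = daily_status_counts[day]
    print(
        f"{day}: 200s {status_share(counts, '200'):.1f}%  "
        f"redirects {status_share(counts, '301', '302'):.1f}%  "
        f"404s {status_share(counts, '404'):.1f}%"
    )
```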
Of course, server log analysis is useful for lots of other reasons. If you’d like to see what insights are available, check out the details and take a look at the server log import process. Email us at support@keylimetoolbox.com for more details on how to get started with Keylime Toolbox Crawl Analytics. (It’s only $49/month for daily log processing!)