Google recently posted about Googlebot’s “crawl budget”, which they define as a combination of a site’s “crawl rate limit” and Google’s “crawl demand” for the URLs of that site. The post contains a lot of great information, but how can you best apply that information to the specifics of your site?
I’ve been helping site teams with crawl issues for more than 10 years, including back when I worked at Google and helped launch tools such as the Googlebot Activity report and the Crawl Rate Control in 2006 (now called Crawl Stats and Crawl Rate). Below are my takeaways for sites of all sizes.
How Does Google’s “Crawl Budget” Fit into SEO and How Sites Are Crawled, Indexed, and Ranked?
Googlebot is the system that crawls web site pages (URLs). The process basically goes like this:
- Google “discovers” URLs in a variety of ways (XML Sitemaps, internal links, external links, guessing based on common web patterns, and so on).
- Google aggregates the URLs it has found from the various sources into one consolidated list and then sorts that list into a kind of priority order.
- Google sets what they are calling a “crawl budget”, which determines how fast they can crawl the URLs on the site.
- A “scheduler” directs Googlebot to crawl the URLs in the priority order, under the constraints of the crawl budget.
This process is ongoing. So URLs are always being added to the list and the list is always being prioritized (as URLs from the list are crawled, new URLs are added, and so on), and the crawl budget is adjusted.
As Google also notes in the post, the “crawl budget” is based on both a “crawl rate limit” (technical limitations) and “crawl demand” (how important Google finds the URLs). Google may crawl fewer URLs than the crawl rate limit allows based on what URLs they think are worth spending resources crawling.
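To make the moving parts concrete, here is a toy model of the process in Python. It is only a sketch of the idea described above, not how Google actually implements it: the `ToyScheduler` class, the numeric priorities, and the "fetches per crawl cycle" unit for the budget are all my own assumptions for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedUrl:
    # heapq is a min-heap, so the priority is stored negated:
    # the URL with the highest priority sorts first.
    sort_key: float
    url: str = field(compare=False)


class ToyScheduler:
    """Toy crawl queue (hypothetical; not Google's implementation)."""

    def __init__(self, crawl_budget: int):
        # Assumed unit: maximum number of fetches per crawl cycle.
        self.crawl_budget = crawl_budget
        self._heap: list[QueuedUrl] = []

    def discover(self, url: str, priority: float) -> None:
        """Add a discovered URL to the consolidated, prioritized list."""
        heapq.heappush(self._heap, QueuedUrl(-priority, url))

    def crawl_cycle(self) -> list[str]:
        """Crawl in priority order until the budget for this cycle is spent."""
        crawled = []
        while self._heap and len(crawled) < self.crawl_budget:
            crawled.append(heapq.heappop(self._heap).url)
        return crawled


scheduler = ToyScheduler(crawl_budget=2)
scheduler.discover("/", priority=1.0)
scheduler.discover("/search?q=puppies", priority=0.1)
scheduler.discover("/products/blue-dress", priority=0.7)
print(scheduler.crawl_cycle())  # ['/', '/products/blue-dress']
```

With a budget of two fetches, the low-priority search URL waits for a later cycle. That is the whole point of the rest of this post: keep the budget large and keep low-value URLs from crowding the queue.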
Should I Care About This? Is Any of it Actionable?
My explanation above simplifies the process quite a bit, but the key elements most site owners and digital marketers care about are:
- URLs can’t rank unless they’re indexed, and they won’t be indexed unless they’re crawled
- If you want a URL to be crawled (and indexed and ranked), make sure:
- the URL is discoverable
- Google finds the URL valuable (so it has high priority in the crawl list)
- Google doesn’t spend all of the crawl budget on URLs that you don’t care about
- the crawl budget is as large as possible
As Google notes in their post, if your site is fairly small (fewer than a few thousand URLs), Google will likely have no trouble crawling all of the URLs of the site fairly quickly, regardless of the crawl budget (although keep reading for caveats to that).
But, as Google also notes, larger sites and sites that dynamically create pages can take steps both to ensure Google’s crawl budget is as large as possible and that Google is using that crawl budget to crawl the URLs you care about most.
What Is Google’s Crawl Rate Limit?
Google crawls the URLs of a site using a combination of settings:
- The number of connections open at once (the number of simultaneous requests from Google to your site’s server)
- The amount of time between requests
For example, if Google has set your site’s crawl rate limit to 10 connections with 5 seconds between requests, then Google could crawl a maximum of 120 URLs in 60 seconds (10 requests every 5 seconds, 12 times a minute).
You can see that in action by going to the site in Google Search Console, then clicking Site Settings. Below are examples from two sites: the first screenshot in each row is the slowest setting available to the site and the second screenshot in each row is the fastest setting available to the site.
What’s missing from this explanation is that Google doesn’t crawl every site constantly. Google hasn’t provided details about how long Googlebot spends crawling sites, but this is in part based on crawl demand (see more on that below).
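The arithmetic from the 10-connection example above can be written as a small helper. This is a sketch under a simplifying assumption of my own: every connection sends one request, waits out the delay, and repeats for the whole window. The function name and parameters are hypothetical, and because Google doesn't crawl constantly, the result is an upper bound rather than a prediction.

```python
import math


def max_urls_crawled(connections: int, delay_seconds: float,
                     window_seconds: float = 60) -> int:
    """Upper bound on fetches in a time window for a given crawl rate limit."""
    if connections < 0 or delay_seconds <= 0:
        raise ValueError("connections must be >= 0 and delay_seconds > 0")
    # Each connection can make this many requests within the window.
    requests_per_connection = math.floor(window_seconds / delay_seconds)
    return connections * requests_per_connection


print(max_urls_crawled(connections=10, delay_seconds=5))  # 120
```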
How Does Google Determine a Site’s Crawl Rate Limit?
Google doesn’t want to accidentally bring your server down. The crawl rate limit is intended to ensure that doesn’t happen. Factors that go into a site’s crawl rate limit include:
- How quickly the server that hosts the site responds to requests
- Whether the server returns server errors or timeouts in response to requests
Can You Change Your Site’s Crawl Rate Limit?
As you might imagine, the best way to improve your site’s crawl rate limit is to make your server faster. If your site’s server responds quickly to requests and doesn’t return errors or timeouts, Google will adjust the crawl rate limit to open more connections at once, with less time between requests.
Server Speed vs. Page Speed
In SEO, you hear a lot about “page speed”. This isn’t the same as server speed. “Page speed” generally refers to page load times (how long a page takes to render for a visitor), whereas server speed is how long the server takes to respond to the request for the page. Page load times are impacted by server speed, but they’re not the same thing.
Google recommends reducing server response time to under 200ms and notes that “dozens of potential factors” might slow down the server.
Google’s John Mueller noted last year that a server response time of over two seconds is “an extremely high response time” that results in “severely limiting the number of URLs [Google will] crawl from” a site. He later elaborated in a discussion of that comment that server response time and page loading speed “are very different things, have different effects, and are commonly confused”.
For more details on measuring server response time separately from page load times, see MaAnna Stephenson’s post on server response time vs. page load time for Google crawl and Patrick Sexton’s post on improving server response times through changing your web host or web server software.
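If you want a rough number of your own, the sketch below times how long a server takes to start responding, using only Python's standard library. It is an approximation: the measurement includes DNS lookup and connection setup, it runs from wherever you run the script (not from Google's network), and the function names are my own. The 200 ms and two-second thresholds are the ones quoted above.

```python
import time
import urllib.request


def server_response_ms(url: str, timeout: float = 10.0) -> float:
    """Milliseconds from sending the request until the response headers arrive.

    Nothing is rendered and the body is not downloaded, so this approximates
    server response time, not page load time.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        elapsed_ms = (time.perf_counter() - start) * 1000
        response.read(1)  # touch the body so the connection closes cleanly
    return elapsed_ms


def describe_response_time(ms: float) -> str:
    """Compare a measurement with the thresholds mentioned in this post."""
    if ms < 200:
        return "under Google's recommended 200 ms"
    if ms > 2000:
        return "over two seconds (extremely high)"
    return "slower than recommended"
```

For example, `describe_response_time(server_response_ms("https://example.com/"))` would label a single request. One request is noisy, so take several measurements at different times of day before drawing conclusions.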
Server Errors
You can find out if Googlebot is getting server errors when crawling your site from the Crawl Errors report in Google Search Console.
- The URL Errors > Server Error report will show you the URLs Google tried to crawl, the response code (such as 502 or 504), and a graph that helps you know if the number of errors is increasing.
- The Site Errors section will show you more critical errors that kept Googlebot from accessing the site at all, including issues with overall connectivity, timeouts, and DNS issues.
See Google’s help documentation for more details on these types of errors and how to fix them.
With Google’s crawl error reports, it can be difficult to know just how much of a problem the URL errors are (the site errors always indicate a critical problem). You definitely want to reduce the number of server errors to keep Google from lowering the crawl rate limit, but the issue is more critical if 80% of the requests returned an error than if 2% of the requests returned an error.
The only way to know that level of detail is by analyzing your server’s access logs (since that provides data on both successfully crawled URLs and URLs that returned errors). Keylime Toolbox provides that with our Crawl Analytics tool (and lots of other great tools exist as well). Keylime Toolbox Crawl Analytics is $49/month for daily analysis (either as a standalone tool or as an add on to another plan). If you’d like to set up a Crawl Analytics plan, just email us at support@keylimetoolbox.com and we’ll help you get started.
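If you'd rather start with a quick script, here is a minimal sketch of that calculation. It assumes your server writes the common "combined" access log format and identifies Googlebot by user agent string alone (which can be spoofed, so treat the numbers as approximate). The sample log lines and function names are made up for illustration.

```python
import re
from collections import Counter

# Pulls the request, status code, and user agent out of a combined-format line.
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_counts(lines):
    """Count Googlebot requests by status class ('2xx', '4xx', '5xx', ...)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match["agent"]:
            counts[match["status"][0] + "xx"] += 1
    return counts


def server_error_rate(counts):
    """Share of Googlebot requests that returned a 5xx response."""
    total = sum(counts.values())
    return counts["5xx"] / total if total else 0.0


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
# Made-up sample lines, not real log data.
sample_log = [
    f'66.249.66.1 - - [16/Jan/2017:10:00:00 +0000] "GET / HTTP/1.1" 200 5123 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [16/Jan/2017:10:00:05 +0000] "GET /dresses HTTP/1.1" 200 8411 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [16/Jan/2017:10:00:10 +0000] "GET /old-page HTTP/1.1" 502 312 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [16/Jan/2017:10:00:15 +0000] "GET /about HTTP/1.1" 200 2290 "-" "{GOOGLEBOT}"',
    f'203.0.113.9 - - [16/Jan/2017:10:00:20 +0000] "GET / HTTP/1.1" 200 5123 "-" "{BROWSER}"',
]

counts = googlebot_status_counts(sample_log)
print(f"{server_error_rate(counts):.0%} of Googlebot requests hit a server error")
# 25% of Googlebot requests hit a server error
```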
Often, I find that URLs listed with server errors in Google’s crawl error reports surface issues with server misconfiguration. For example, a bunch of URLs may be listed as returning a server timeout, but in reality, the server isn’t timing out, those URLs are just misconfigured to return a 504 response instead of a 404.
Generally, it’s solid SEO advice that a lot of crawl errors won’t hurt your site overall – they just mean that those specific URLs returning an error won’t be indexed. But in the case of server errors, your site overall might be hurt since not only will those specific URLs not get indexed, but if Google encounters a lot of server errors, they’ll reduce the crawl rate limit and crawl fewer URLs overall.
Requesting a Change in Crawl Rate Limit in Google Search Console
If Google is crawling your site too much, you can ask them to reduce the crawl by going to the site’s settings in Google Search Console.
Although the current iteration of the tool is positioned as a way to limit the crawl (the initial version I helped launch enabled site owners to request either slower or faster speeds), it’s unclear whether you can request a faster crawl.
The setting is labeled “Limit Google’s maximum crawl rate”, but when you choose that you see a slider with a white indicator positioned along it with information on the number of requests per second and the number of seconds between requests. It appears as though this setting is the current crawl rate limit and that you can either request a slower or faster crawl rate.
In the example below, it appears the crawl rate limit is set to 6 requests per second, with .167 seconds between requests:
But I can move the white indicator to “low” and the display changes to .5 requests per second and 2 seconds between requests:
And I can move the slider to “high” and the display changes to 10 requests per second and .1 seconds between requests:
What is Google’s Crawl Demand?
Google’s crawl rate limit is based on technical limitations, but Google may not crawl all of the URLs on your site even if Googlebot isn’t limited.
If your site contains tons of great pages (high quality, lots of external links, and so on), then Google will want to crawl them all and keep them up to date. If your site contains a lot of low quality spam, then Google won’t be as interested in crawling those pages.
Google’s post notes that more “popular” pages are crawled more often and that sitewide events like site moves may trigger an increase in crawl demand.
Of course, it’s all more nuanced than that. The scheduler orders the crawl of URLs of a site based on priority. The order is always changing as URLs get crawled and more information is aggregated about pages. Priority is based on numerous factors, such as:
- Is the URL the home page? Google likes to crawl the home page of sites often so that URL stays at the top of the list even once it’s recrawled. Looking at Crawl Analytics reports for our Keylime Toolbox customers, we find that home pages are generally crawled every day.
- Is the URL valuable/popular? Google will crawl URLs it considers “valuable” fairly often to keep them up to date in the index and will crawl URLs it considers low quality much less often. How Google’s algorithms determine “popularity” is always changing and they don’t provide specifics, but you can use metrics of your own to determine if visitors are finding your pages valuable. Does the page have a lot of PageRank, valuable external links, rank for a lot of queries, get clicked on a lot in search results, get shared a lot socially, have a low bounce rate? Does it have a lot of great quality content or is it mostly blank/duplicate/spam?
- Does the URL change often? If Google determines that a page is updated frequently, that page might get crawled more often. Sometimes marketers hear this and set up superficial or artificial updates for pages so that Google will recrawl them, but there’s no value to this. That a page changes a lot may get it crawled more often, but you don’t need a page to be recrawled if it hasn’t changed. A more frequent crawl doesn’t lead to better ranking. (Note that for some queries that look for freshness (such as topical news queries), new pages may outrank old pages in some instances, but that’s a different issue. Most queries don’t fall in this category.)
- Has it been a long time since the URL has been crawled? If a page hasn’t been crawled in a while (Google doesn’t consider it to be very important, so it’s been at the end of the priority list and Googlebot has run out of allotment due to the crawl rate limit during each crawl), it will eventually get bumped up to the top of the queue so Google’s index doesn’t get too stale.
The only way to know which URLs Googlebot is crawling most often and how often URLs are crawled is by looking at server logs (which again, you can analyze using lots of third party tools, including the Crawl Analytics reports in Keylime Toolbox).
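As a rough illustration of what that log analysis looks like, the sketch below takes the paths Googlebot requested (already extracted from your logs) and reports what share of the crawl each top-level section of the site received. Grouping by the first path segment is my own simplification, and the sample paths are invented.

```python
from collections import Counter
from urllib.parse import urlsplit


def section_of(path: str) -> str:
    """Top-level section of a requested path, ignoring any query string."""
    first_segment = urlsplit(path).path.strip("/").split("/")[0]
    return "/" + first_segment


def crawl_share_by_section(paths):
    """Map each section to its share of fetches, most-crawled section first."""
    counts = Counter(section_of(path) for path in paths)
    total = sum(counts.values())
    return {section: count / total for section, count in counts.most_common()}


# Invented sample of paths requested by Googlebot.
googlebot_paths = ["/", "/search?q=puppies", "/search?q=kittens", "/products/blue-dress"]
for section, share in crawl_share_by_section(googlebot_paths).items():
    print(f"{section}: {share:.0%}")
```

If a section you don't care about (internal search results, for instance) takes a large share, that is a sign the crawl budget is being spent in the wrong place.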
Working With Your Site’s Crawl Rate Limit and Crawl Demand: How to Ensure the Right Pages Are Crawled Quickly
No matter how robust your server and how valuable your pages, Googlebot doesn’t crawl any site infinitely. You want to make sure that Google is regularly crawling the pages you care about most.
Making URLs Discoverable
The first step is making sure the pages you care about are “discoverable”:
- Ensure no pages on the site are “orphaned”: a visitor should be able to browse to every page through the internal link structure (no page should be reachable only through a search box, for instance).
- Ensure all URLs are included in an XML Sitemap.
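If your platform doesn't generate an XML Sitemap for you, a basic one is simple to build. The sketch below uses Python's standard library and the element names from the sitemaps.org protocol; the `build_sitemap` helper and the example.com URLs are placeholders of my own, and in practice the URL list would come from your database or CMS.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls) -> bytes:
    """Return a minimal XML Sitemap (UTF-8 bytes) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        # Each URL gets its own <url><loc>...</loc></url> entry.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


sitemap = build_sitemap([
    "https://example.com/",
    "https://example.com/dresses?color=blue",
])
print(sitemap.decode("utf-8"))
```

List only the canonical URLs you want indexed, so the Sitemap reinforces (rather than contradicts) the rest of your crawl signals.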
Controlling What Googlebot Crawls
You don’t want Googlebot to spend valuable crawl budget on URLs you don’t want indexed. URLs you don’t want indexed include non-canonical URL variations and pages you don’t want indexed at all. You also don’t want to set up “spider traps”, which dynamically generate infinite URLs (a good example of this scenario is a “next” link for a calendar that infinitely creates pages with later dates).
Faceted Navigation
Google’s blog post lists “faceted navigation” as a type of “low value add URL” that may “negatively affect a site’s crawling and indexing” and “may cause a significant delay in discovering great content”. That’s a bit vague and alarming, since many high quality sites use faceted navigation and it’s not practical (or user friendly) to remove it.
The post links to a previous post about best practices for faceted navigation, which provides information about how best to handle the situation. I think what Google intended to say is that faceted navigation can lead to a large number of duplicate URLs that can take up a lot of the crawl budget, so it’s key to implement them in a way that prevents this.
In summary:
- Determine which facets you want to be indexed separately (you might want the “blue dresses” and “red dresses” facets indexed separately, but not the “blue size 10 dresses” and “red size 10 dresses” indexed separately).
- Use canonical attributes to point the facets you don’t want indexed to the variations you do (for instance, the canonical value for all size variations of “blue dresses” would be the URL for the blue dresses page). You could also set up the site to include all sizes on one page, as Google’s Maile Ohye outlines in step 6 of her post about SEO for ecommerce sites.
- If the facets use standard key/value pair parameters in URLs, set those in the Google Search Console parameter handling tool.
- Don’t provide facet options if the result produces 0 results (a blank page).
Using canonical attributes won’t keep faceted variations from being crawled at all, but it should cause them to be crawled less often (a non-canonical URL has a lower priority in the queue), and Google’s parameter handling tool lets you tell Google directly which parameters not to crawl.
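Here is one way the canonical logic for facets might look in code. It is a sketch that assumes facets are standard key/value query parameters and that only a `color` facet should be indexed separately; both the parameter names and the helper functions are hypothetical, so adapt them to your own URL structure.

```python
import html
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical choice: only the color facet gets its own indexed page.
INDEXED_FACETS = frozenset({"color"})


def canonical_url(url: str, indexed_facets=INDEXED_FACETS) -> str:
    """Drop facet parameters that shouldn't be indexed separately."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key in indexed_facets
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """The <link> element to place in the <head> of the faceted page."""
    href = html.escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


print(canonical_link_tag("https://example.com/dresses?size=10&color=blue"))
# <link rel="canonical" href="https://example.com/dresses?color=blue">
```

Sorting the kept parameters means that `?color=blue&size=10` and `?size=10&color=blue` both point at the same canonical URL.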
Infinite URLs
Infinite URLs can happen in all kinds of ways.
Infinite Next Links
This might happen with calendars, or pagination, for instance. If possible, dynamically stop providing a next link when no additional content exists. For instance, make sure that the pagination links don’t provide “next” forever.
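In code, that check can be as small as the sketch below. The `?page=` parameter, the 20-items-per-page default, and the function name are assumptions for illustration; the idea is simply that the template only renders a “next” link when this function returns a URL.

```python
import math
from typing import Optional


def next_page_url(base_url: str, page: int, total_items: int,
                  per_page: int = 20) -> Optional[str]:
    """Return the URL of the next page, or None when there is no more content."""
    last_page = max(1, math.ceil(total_items / per_page))
    if page >= last_page:
        return None  # no "next" link: the crawler has nowhere further to go
    return f"{base_url}?page={page + 1}"


print(next_page_url("/dresses", page=1, total_items=45))  # /dresses?page=2
print(next_page_url("/dresses", page=3, total_items=45))  # None
```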
Infinite Search Results
Google recommends blocking search results via robots.txt. This ensures that search results (which generally can always be infinite, because a visitor could search for anything) don’t use all of the crawl budget and also keeps the search results out of Google’s search results.
If you use search results for category pages, use a different URL structure for set category pages vs. free-form search results (if possible) or restrict searches to only keywords that match your existing taxonomy.
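Before deploying a robots.txt change, it's worth checking that the rule blocks what you intend and nothing more. The sketch below does that with Python's built-in `urllib.robotparser`. The `/search` and `/category/` paths are hypothetical, and Python's parser does simple prefix matching, so it is not guaranteed to behave exactly as Googlebot does; treat this as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: free-form search lives under /search,
# while set category pages live under /category/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Check a URL against robots.txt rules using Python's standard parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable(ROBOTS_TXT, "https://example.com/search?q=puppies"))  # False
print(is_crawlable(ROBOTS_TXT, "https://example.com/category/dresses"))  # True
```

Because the rule is a prefix match, `Disallow: /search` also blocks any other path that begins with `/search`, which is one more reason to give category pages a clearly different URL structure.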
Once again, you can look at server access logs to see if Google is spending a lot of crawl time on search results pages. Below is a sample from the Keylime Toolbox Crawl Analytics report, showing what Googlebot is crawling most often for a site:
For the example site below, about 4.6% of the crawl (28,079 of 605,244 fetches) goes to /search URLs.
You can see the specific URLs being crawled as well:
Soft 404s
If a page that isn’t found returns a 404, Google will crawl it much less often (lower its priority in the queue) and won’t index it. But if that “not found” page returns a 200, Google will keep crawling it. This can lead to infinite URLs since any non-existent page returns a 200.
The most obvious case is when the server is misconfigured to return a 200 status code for pages that aren’t found. Below is an example:
But other scenarios can be harder to spot. For example, the server might be set up to treat invalid URLs as searches, so example.com/puppies would do a search for “puppies” and return a search results page (that might not contain any relevant results). This setup may seem user friendly, but it generally isn’t. In these cases, it’s best to either just return a user-friendly custom error page (with a 404 response) or redirect to an actual search results page (that is blocked by robots.txt to avoid infinite crawl issues).
Another configuration is a setup that redirects all invalid URLs to the home page (or some other page that returns a 404). This also is intended to be user friendly but typically isn’t (the visitor gets a different page than expected and doesn’t know why) and can also lead to crawl issues. In these cases, it’s also best to return a custom error page.
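A quick way to test your own site for these scenarios is to request a URL that can't possibly exist and look at the status code that comes back. The sketch below does that with Python's standard library; the function names are my own, and it only checks the status code (it doesn't inspect the page content).

```python
import urllib.error
import urllib.request
import uuid


def status_for_missing_page(base_url: str, timeout: float = 10.0) -> int:
    """Request a random, nonexistent path and return the final HTTP status."""
    probe_url = f"{base_url.rstrip('/')}/{uuid.uuid4().hex}"
    try:
        with urllib.request.urlopen(probe_url, timeout=timeout) as response:
            # Redirects are followed, so a redirect to the home page
            # shows up here as the home page's own status.
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 404, 410, 5xx, and so on


def looks_like_soft_404(status: int) -> bool:
    """A success status for a page that can't exist suggests a soft 404."""
    return 200 <= status < 300
```

Running `looks_like_soft_404(status_for_missing_page("https://example.com"))` against your own domain should give `False`. If it gives `True`, invalid URLs are being served (or redirected to a page that is served) as if they were real pages.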
Infinite URL Variations
Lots of other scenarios might exist for infinite URLs. For example, the site might have a contact form (such as a “give us feedback” or “report a problem”) that autopopulates the subject line using the page the visitor came from and includes the referring page as a parameter.
In a case like this, you might not want the contact form indexed at all, but you’ve set up infinite variations of the contact form URL. The best solution for this is to block the contact form with robots.txt.
However, for other pages that may have infinite URL variations, you may want the foundation URL indexed (such as for parameters that are used for tracking). The solution here is the same as for faceted parameters. Use a combination of the canonical attribute and Google’s parameter handling tool to give the non-canonical URLs low priority.
What Should You Do Next?
A more efficient crawl not only can lead to more comprehensive indexing, but also more updated pages in Google’s index. And if you make improvements to your site (technical improvements, fixes to penalty issues, and so on), the more efficient the crawl, the more quickly Google will pick up those changes.
- Assess the current state:
- Submit a canonical, comprehensive XML Sitemap to Google Search Console. Google will report how many of those URLs are indexed. If most of them are, you likely don’t have to worry about this all that much.
- Check search results for recently updated pages and then check the cache of those pages. Has Google crawled them since you’ve made updates? If the cache dates are old, then making the crawl more efficient may help get the pages updated in Google’s index more quickly.
- Review server access logs to see what is being crawled. Is it the pages that are most important to be indexed? (Email us at support@keylimetoolbox.com if you want to set up log analysis for $49/month).
- Make sure every page on your site is accessible via at least one internal link
- Check Google Search Console Crawl Errors report for site errors and server errors:
- Fix any site-wide errors and server misconfigurations
- Improve server response time if possible
- Review server error URLs to determine if the specific URLs are misconfigured to return the wrong status code
- Look for infinite URL issues (and resolve them), using:
- Google Search Console Crawl Errors > Soft 404s report
- Google site: operator search
- Server logs
- Ask a question in the comments below!
8 thoughts on “Googlebot’s Crawl Budget and Crawl Demand: How It Impact’s Your Site’s Visibility in Search Results and How You Can Improve It”
Very good article – the most in-depth guide on the subject I have come across.
Just to clarify, if I have pages I don’t want Googlebot to spend time and resources on (for example, a news page with low or thin content, or almost duplicate pages), what is the best process to remove these pages from the crawl budget?
noindex, robots.txt, or simply a rel canonical to the closest version?
Thank you Vanessa!
It depends on the situation, but robots.txt keeps the pages from being crawled at all (so they don’t impact the crawl budget). I recommend using this for pages that redirect to a login, for example.
A page with a noindex tag still is crawled (so the noindex can be extracted) but is not indexed. So this option still impacts crawl budget (although a page with a noindex may be crawled less often, it still is crawled occasionally). I recommend this option when you want to leverage the value of the internal links on a page (for example, an HTML sitemap page).
A (non-canonical) page with a canonical attribute is also still crawled (although less often than a canonical page), so it also impacts crawl budget. I recommend this option when the page is duplicate or nearly duplicate and the page might accrue external links.
If a near-duplicate, low/thin content page is never meant to be indexed, doesn’t have internal links to other pages, and isn’t going to accrue any links, then robots.txt may be the best option.
But you also have to keep the site structure in mind. It’s not always feasible to block pages with robots.txt. For example, if you have a local directory and want to block individual pages until they have a certain number of user reviews, you can’t easily do this with robots.txt as there’s no pattern that differentiates the pages with reviews from those without. You could block each page individually, but that’s not realistic when you have thousands of pages to block.
So in those cases, a common solution is to programmatically insert a noindex on the pages with no reviews (that is removed automatically when the page does have reviews). Unfortunately, this doesn’t help crawl efficiency all that much (pages with noindex may be crawled slightly less often, but are still crawled). But there’s always going to be some amount of crawl inefficiency (which is why you want to look at all instances of crawl inefficiency and improve where you can).
Thank you Vanessa for the quick and very detailed answer. The issue I am having is that this site has a lot of UGC pages that are thin in content or near duplicates, as this is a photography website. Plus, these pages are located at the root of the server, which means I have no /folder/ to use as a path exclusion for robots.txt.
I guess my only choice is to use both noindex and rel canonical, since I can’t add robots.txt rules manually each time a user posts a news article.
Thank you !
If you can configure the site so that the pages are in a folder, that might be the best bet. But if you can’t, you’re right, noindex may be the way to keep them out of the index.
Would nofollowing links help control what pages get crawled by removing crawlable paths to them or how often they are crawled by lowering their priority?
Nofollowing the links probably won’t help that much. If you’re going to nofollow a link, you may as well also block it with robots.txt (if the URL pattern allows it) if your intent is to keep that URL from being crawled and indexed.
A nofollowed link probably will mean that the URL being linked to has slightly less overall value signals and may have slightly less priority than it otherwise would, but it doesn’t keep the URL from being crawled (especially as Google likely knows about the URL from other sources). So it’s not a great option for controlling the crawl.
Google talks about this a bit in their blog post: https://webmasters.googleblog.com/2017/01/what-crawl-budget-means-for-googlebot.html
“Q: Does the nofollow directive affect crawl budget?
A: It depends. Any URL that is crawled affects crawl budget, so even if your page marks a URL as nofollow it can still be crawled if another page on your site, or any page on the web, doesn’t label the link as nofollow.”
Generally, I recommend against nofollowing internal links. Instead, use other methods (robots.txt, canonical and pagination attributes, Google’s parameter handling, etc.) to influence the crawl based on the specific use case for the URL.
Nofollow was originally introduced to help combat spam (by nofollowing links in comments, for instance, your site is less attractive to spammers) and for the most part, that remains its best use.