
Crawl Budget Optimization for SEO

Understanding Crawl Budget Optimization


What is Crawl Budget?

Crawl budget in SEO is the number of pages a search engine bot can crawl and index during the time frame it spends visiting a particular website.

Naturally, the bigger the crawl budget the more pages the bot will visit and thus the more pages have a chance to get indexed in Google search.


Why Crawl Budget Matters in SEO

Crawl budget is that "touch point" where all of your SEO effort and all of the onsite and offsite optimizations have a chance to get noticed and evaluated by Googlebot. This is the moment where site owners come to terms with the reality of the SEO process. The website can be set up correctly and have all of the right content, meta tags and a favorable hierarchical structure to succeed in SEO. However, when Googlebot crawls the site, it visits and indexes just a few pages at a time. And this becomes a major moment of frustration for site owners.


Factors Influencing Crawl Budget Allocation

Among the main factors influencing crawl budget are (somewhat in priority order):


  • Quality of your content and how useful it is to your users. For ease of understanding, let's assume your content is very valuable, users spend a long time browsing/reading it and none of your pages are empty or thin on meaningful content.

  • Number and quality of backlinks. This can also be that old domain authority indicator. I like to use the Moz Domain Authority score from the Moz Chrome extension. Now, if the score is low it does not mean your site will not and cannot rank. It just means the website's signal is not strong enough yet compared to all of the other websites out there on the web. You can improve it with backlinks. The usual SEO advice is - use backlinks wisely.

  • Internal links and sitemaps. Googlebot can only crawl what it can discover. You can make sure all of the pages on your website link to one another: a hierarchical approach where the top "parent pages" live in your top navigation or footer and from there bots can discover all underlying "children pages". You can create "word clouds" with popular searches linking to main pages, or you can add a widget like "Related Searches" or "Related Articles" and thus interlink all of your pages with one another. And the simplest way is to add all the pages you want ranking to your XML sitemap. The best thing to do, though, is to use all of the above or a combination of them.

  • Load speed of your pages. Googlebot does not have all day to wait for pages to load. It parses your site with incredible speed and if it has to slow down and wait for any of the pages to load, or for those 301s to resolve, it will end up crawling a smaller number of pages. Basically, whether it crawls 500 pages or 50 in the amount of time it allocates to crawling your site depends on the site speed.

  • Amount of "junk" URLs living on the site. When it comes to SEO there are two types of pages: the pages you want ranking and the ones you don't. The ones you don't want ranking need to be kept away from the bot, either through a robots NOINDEX tag or by disallowing them in the robots.txt file. That way Google won't spend its valuable resources on pages that don't matter and that should not and often will never rank in Google (due to quality, duplication etc.). There is a difference between the two, though: a page with a NOINDEX tag still has to be crawled for the bot to see the tag, so to truly save crawl budget you want to block unwanted URLs through robots.txt. This becomes more evident when we look at how search engines crawl a website.


How Search Engines Crawl Your Site

To know how to do crawl budget optimization or even to know whether you need to do any crawl budget optimization, it is important to understand what Googlebot looks at when it crawls sites.


Crawling Process Overview

"Crawler" (sometimes also called a "robot" or "spider") is a generic term for any program that is used to automatically discover and scan websites by following links from one web page to another. Google's main crawler used for Google Search is called Googlebot.

In some cases Googlebot will queue pages for later parsing instead of fully crawling and indexing pages at the time of the crawl. This usually happens with websites that use a lot of JavaScript, and especially where content and links are hidden behind the JavaScript code. Googlebot reads HTML and if content or links are not immediately visible as page text in the HTML, the crawler will queue the pages and will come back to render the JavaScript later. This slows down crawling and indexing and negatively affects the site's SEO. Google's ability to parse JS has improved greatly over the years. However, in a universe where you want search engine bots to crawl as much of your site as possible and index it all, you don't want to create any obstacles for the crawlers on their journey.


The Importance of Crawl Efficiency

As reviewed above, crawl efficiency is key for site indexation and its ranking. The more "good" URLs a search engine bot can crawl, the more it can evaluate and rank. Bots won't rank pages they have not visited.


Identifying Crawl Budget Issues

Let's take a look at what can and often does go wrong when search engines are trying to access and crawl a website.


Common Crawl Budget Problems


Problem 1: Unwanted and duplicate URLs with parameters in them.

If you change even one character in a URL, Googlebot will see it as a separate, unique URL. This is true even for URLs whose parameters do not change the content of the page. For example, a homepage URL with "?keyword=1" appended to it serves the same content as the plain homepage URL.

The URL looks different though, so the bot will crawl it if found. And that would be time wasted, as it could instead have crawled and evaluated a different URL that is actually unique and can rank.


What to do

To spare bots from crawling that parameterized URL above (the one with "?keyword=1" in it), make sure none of the links on your site feature it. Link to the correct version of your homepage (if it's a homepage like in my example) from every place where you have a link to the homepage.


In some cases other websites might be linking to you with a parameterized URL - not much you can do here. To mitigate, add a rel canonical tag to the homepage pointing to itself (the correct and the only version of the homepage URL):

<link rel="canonical" href="https://www.seoseagull.com">

Do this for every page on your site, with each page's tag pointing to that page's own main URL. That way Googlebot will know which URL is the main (or "canonical") one and it will not waste its resources trying to index and rank all those other redundant parameterized versions.
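If your pages are generated from templates, the canonical tag can be built automatically. Here is a minimal Python sketch, assuming that a known set of parameters (here just "keyword") never changes page content. The names IGNORED_PARAMS, canonical_url and canonical_tag, and the /blog path in the usage note, are hypothetical, made up for illustration.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters listed here never change what the page shows.
IGNORED_PARAMS = {"keyword"}


def canonical_url(url: str) -> str:
    """Drop content-neutral parameters (and any #fragment) from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url: str) -> str:
    """Build the <link rel="canonical"> tag for the page served at `url`."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'


print(canonical_tag("https://www.seoseagull.com/?keyword=1"))
```

Parameters that are not in the ignore list are kept, so a URL like https://www.seoseagull.com/blog?keyword=1&page=2 would canonicalize to https://www.seoseagull.com/blog?page=2.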


If a parameter is prevalent though, for example, if it can be generated through a tool on your site in many versions, you can and should disallow (block) that parameter through the robots.txt file. That way you will truly spare Googlebot's crawl budget from being wasted on those duplicate parameterized pages. In my case I would add this line to the robots.txt file:

Disallow: /*keyword=
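Before shipping a rule like this, it helps to check which URLs it would actually catch. Below is a rough Python sketch that emulates the wildcard matching Google documents for robots.txt ("*" matches any run of characters, a trailing "$" anchors the end of the URL). The function name is_blocked is hypothetical, and the sketch is deliberately simplified: it tests one Disallow pattern only and ignores Allow rules and user-agent groups.

```python
import re
from urllib.parse import urlsplit


def is_blocked(url: str, disallow_pattern: str) -> bool:
    """Return True if the URL's path + query matches one Disallow pattern.

    Simplified: a single pattern, no Allow rules, no user-agent groups.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    # Translate the robots.txt pattern into a regular expression:
    # '*' becomes "any characters", a trailing '$' becomes "end of URL".
    anchored = disallow_pattern.endswith("$")
    body = disallow_pattern[:-1] if anchored else disallow_pattern
    regex = ".*".join(re.escape(chunk) for chunk in body.split("*"))
    if anchored:
        regex += "$"
    # Rules match from the start of the path, hence re.match.
    return re.match(regex, target) is not None


print(is_blocked("https://www.seoseagull.com/?keyword=1", "/*keyword="))  # True
print(is_blocked("https://www.seoseagull.com/", "/*keyword="))            # False
```

Keep in mind that a broad pattern like this also matches any other parameter whose name ends in "keyword", so test it against a sample of your real URLs first.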

Problem 2: Slow website

If the site is slow, bots will spend more time waiting for pages to load and less time crawling the pages. Crawl budget is not just the number of pages Googlebot is willing to crawl on a given site, it is the number of pages it is willing to crawl in a given time. So, let's say page load time goes up significantly due to some change - now bots need to spend more time to crawl the same number of pages, and they are more likely to drop some pages and not crawl them than to spend all the additional time.
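The arithmetic behind this can be sketched in a few lines of Python. This is a deliberately simplistic model, not how Googlebot actually schedules its crawling: it assumes a fixed time window and pages fetched one after another, and the 250-second window is an invented number chosen only to make the math easy.

```python
def pages_crawled(time_window_s: float, avg_fetch_s: float) -> int:
    """Pages that fit in a fixed crawl window if fetched one at a time."""
    return int(time_window_s // avg_fetch_s)


# Same window, different average page fetch times (illustrative numbers).
print(pages_crawled(250, 0.5))  # 500
print(pages_crawled(250, 5.0))  # 50
```

In this toy model a tenfold slowdown in page fetch time cuts the number of crawled pages tenfold.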


What to do


Monitor your page speed. You can do so within Google Search Console (GSC) in the Core Web Vitals section:


GSC core web vitals with failing LCP - screenshot

If GSC for your website looks like the above (has issues), then using Lighthouse tool or WebPageTest.org is the next step to narrow down on the speed inefficiencies and prioritize what to tackle first. This exercise is mostly technical and you will need support of your dev team to improve on page speed. Our team has done it dozens of times and will be happy to lead the process should you need any help. Contact Us Here.


Problem 3: Inability of crawlers to access your site.

This is more of a general problem with your website SEO, not just a crawl budget problem. However, sometimes, trying to analyze a niche problem we discover that the issue is much bigger and more encompassing. Having your "good" pages blocked from indexing can be that issue.


What to do

Check your robots.txt file to make sure that a set of pages (or even an entire website!) was not blocked or disallowed from crawling at some point, for example during a site migration, a release or through an SEO mistake.

If all looks good in robots.txt (which, by the way, can be found by adding /robots.txt after your domain name, like so: https://www.seoseagull.com/robots.txt), then check each of the pages that you think are not getting crawled or indexed enough by looking at the page head code. Look for the robots meta tag:

<meta name="robots" content="noindex">

Pages that you want indexed and ranking in Google search should not say "noindex".
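Checking page by page gets tedious, so here is a small Python sketch that looks for the robots meta tag in a page's HTML using only the standard library. The names RobotsMetaParser and has_noindex are hypothetical. It only inspects the generic "robots" meta tag and does not cover the X-Robots-Tag HTTP header or bot-specific meta tags, which can also carry a noindex.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # "noindex, nofollow" -> ["noindex", "nofollow"]
            self.directives += [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


def has_noindex(html: str) -> bool:
    """Return True if the page's robots meta tag contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex(page))  # True
```

You would feed it the HTML source of each priority page and flag any page where it returns True.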


If neither robots.txt nor the robots tag seems to have any issues but the bot still does not crawl all of your priority pages, make sure those pages are linked from other pages on the site. Have your top main pages linked from the homepage. A sticky header and/or footer link is best. Have some relevant content on other pages link to your priority pages as well. Googlebot takes into account how often a page is linked internally on a website and will consider frequently linked pages more important and thus put them in the priority queue for crawling, indexing and ranking.
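If you have a crawl export of your own site (most crawling tools can produce a list of pages and the links found on them), a quick inbound link count shows which priority pages are under-linked. The Python sketch below assumes that export has already been loaded into a dictionary of page -> linked pages. The function name inbound_link_counts and the sample pages are hypothetical.

```python
from collections import Counter


def inbound_link_counts(link_graph):
    """Count how many distinct pages link to each page (self-links ignored)."""
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):  # a page linking twice counts once
            if target != source:
                counts[target] += 1
    return counts


# Hypothetical site: each key is a page, each value the pages it links to.
site = {
    "/": ["/services", "/blog"],
    "/blog": ["/services", "/"],
    "/services": ["/"],
    "/orphan": [],
}
for page, count in sorted(inbound_link_counts(site).items()):
    print(page, count)
```

Pages with a count of zero are orphans that bots can only find through the sitemap or external links.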


Last but not least, a staple place to hold a list of all of the priority pages you want to see ranking in Google search (eventually) is the XML sitemap. The XML sitemap can be submitted directly to Google within the GSC Sitemaps section:


GSC Sitemaps section - screenshot

If you are not scared of your competitors crawling your pages and scraping your content (not until now at least - new fear unlocked 🙃), add your sitemap location to the robots.txt file. That way you don't have to submit it separately to GSC, and Bing, and DuckDuckGo and... the list goes on. Especially if you sell products or provide services internationally, the list of search engines that need access to your XML sitemap will be much longer than the one above.
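For a small site, the sitemap itself can be generated with a short script. Here is a minimal Python sketch that writes the basic urlset/url/loc structure from the sitemaps.org protocol and the "Sitemap:" line for robots.txt. The function names, the /blog page and the /sitemap.xml location are assumptions for illustration, so swap in your own URLs.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = page_url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def robots_sitemap_line(sitemap_url):
    """Return the line that advertises the sitemap in robots.txt."""
    return f"Sitemap: {sitemap_url}"


priority_pages = [
    "https://www.seoseagull.com/",
    "https://www.seoseagull.com/blog",  # hypothetical page
]
print(build_sitemap(priority_pages))
# Hypothetical sitemap location; use wherever your sitemap actually lives.
print(robots_sitemap_line("https://www.seoseagull.com/sitemap.xml"))
```

Only list the pages you actually want ranking, otherwise the sitemap points bots right back at the junk URLs.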


Tools for Analyzing Crawl Budget


Above we have discussed some useful tools that help analyze contributing factors to crawl budget optimization. Let's recap and add a few other ones:


Google Search Console. This is a free tool that is an SEO specialist's gift that keeps on giving. Page Indexing in GSC is probably the starting point of a crawl budget optimization journey:


page indexing in GSC - screenshot

While the chart is not conclusive, it will give an idea of whether the number of non-indexed pages is too large compared to the number of "good" URLs on the site. If it is, a page cleanup might be the best thing to do so as not to waste crawling resources.

The Sitemaps section of GSC is another place to look to ensure the health of the URLs submitted to Googlebot.


Having a domain authority tool installed in your Chrome browser, like the MozBar from Moz, is helpful if you are trying to understand why Googlebot won't index all of your pages even though you did everything right.


Moz bar stats for a new website - screenshot

If your site is brand new and does not have enough word of mouth and outside links pointing to it, it is unlikely that Googlebot will invest a lot of the crawling resources into it. Grow your domain and page authority, backlinks and brand recognition and the crawl allocation increase will come naturally.


If you have an engineering team, you can work with them on creating a Log File Analysis chart. You can build it internally by analyzing your log files (files that record each site visit, including bot visits) and filtering for the Googlebot visits. Then plot this data on a timeline chart so you can see if the number of pages Googlebot visits over time is decreasing or increasing. You can also group this data by page type so you can see if what Googlebot prioritizes matches what you prioritize for traffic and conversion.
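As a starting point for that conversation with your engineers, here is a rough Python sketch of the filtering and grouping step. It assumes the server writes the common "combined" access log format and uses the first path segment as a stand-in for page type. The function name googlebot_hits and the sample log lines are made up. Note that matching on the user agent string alone can be fooled by fake bots, so treat the numbers as an approximation unless the visits are also verified (for example via reverse DNS).

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: logs are in the "combined" access-log format.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count Googlebot requests per day and per top-level site section."""
    per_day, per_section = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        day = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z").date()
        per_day[day.isoformat()] += 1
        # First path segment as a rough "page type", e.g. /blog/post -> blog
        section = match.group("path").split("?")[0].strip("/").split("/")[0]
        per_section[section or "(home)"] += 1
    return per_day, per_section


sample = [
    '66.249.66.1 - - [10/Mar/2024:06:25:13 +0000] "GET /blog/crawl-budget HTTP/1.1" '
    '200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Mar/2024:06:26:00 +0000] "GET / HTTP/1.1" '
    '200 900 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
per_day, per_section = googlebot_hits(sample)
print(per_day, per_section)
```

The per-day counts are what you would plot on the timeline chart, and the per-section counts show where Googlebot spends its visits.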


An alternative to building log file analysis in-house would be purchasing a technical SEO tool like OnCrawl, Botify and many others.


Recap


  • Crawl budget is how many of the site pages a search engine bot is willing to crawl in a given time.

  • Relying on free tools like GSC you can analyze whether what Googlebot crawls is also what you consider important pages and content on your site.

  • If there is a mismatch between the pages crawled and the pages you prioritize, you can do some optimizations using levers like: xml sitemaps, robots.txt file, robots tags, internal linking and links from external sources.

  • Crawl budget optimization gets more comprehensive when you can leverage an engineering team.

  • Contact SEO Seagull specialists for hands-on crawl budget audit and optimization.
