Crawling and indexing are just the beginning of getting your pages to appear in search results. Although these are two basics of optimizing your site for search, many other considerations determine whether your site ranks from day one or has to "crawl" its way back to the top. The background and tips in this guide can help whether you're adding a single new page to your site or launching an entirely new site.
This comprehensive guide will cover:
- Googlebot crawling
- What crawl efficiency is
- Factors that negatively affect crawlability
- Tips to improve your site's crawlability
Since this guide addresses tactical questions related to crawl efficiency, we won't define crawling and indexing here. If you'd like more background before reading on, Google provides a brief introduction to how it discovers and indexes sites on its Webmasters Blog.
First, we'll cover the basic characteristics of crawling before reviewing methods to optimize crawl efficiency. Although this article refers to Googlebot, it can be used as a general guide for all search engines.
Crawling is Googlebot's main priority. Making your site easy to crawl ensures that your high-quality content, optimized metadata, linking strategy, and every other SEO effort on your site can actually be discovered.
Characteristics of Googlebot crawling:
- Site-wide events like a site move or relaunch will trigger increased crawl demand
- If you have a small site (fewer than several thousand URLs), your site will be crawled efficiently most of the time
- Crawling is not a direct ranking factor, but if pages on your site cannot be crawled, they will not rank
- The crawl-delay directive in your robots.txt file does nothing; Googlebot ignores it
- Google doesn’t index everything it crawls and doesn’t show everything it indexes
- Google works to serve its users, so it will crawl and index popular pages on well-known sites more regularly to keep its index up-to-date
What is crawl efficiency?
Crawl efficiency is how seamlessly bots are able to crawl all the pages on your site. A clean site structure, reliable servers, errorless sitemaps and robots.txt files, and optimized site speed all improve crawl efficiency. If your site checks off all these boxes, you will maximize your site’s crawl budget, rate, efficiency, crawlability, and indexation.
To optimize crawl efficiency on your site, you first need to understand the facets that make it up. Crawl efficiency depends heavily on the following:
Every site has a crawl budget. Crawl budget combines the number of pages Googlebot wants to crawl (crawl demand) with the number of pages Googlebot can crawl (crawl rate). Once Googlebot "spends" its crawl budget, it stops crawling a site, even if it hasn't reached every page you intend to have indexed. This is where meta tags come into play: they can keep pages that should not be served up in search results from wasting your crawl budget.
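For illustration, here is a minimal Python sketch of one way to audit that: it checks whether a handful of pages carry a noindex signal, either in an X-Robots-Tag response header or in a robots meta tag. The URLs and the use of the third-party requests library are assumptions for the example, not something from the guide itself:

```python
# Minimal sketch: check which URLs carry a noindex signal.
# Assumes the third-party "requests" library; the URLs are hypothetical.
import re
import requests

urls = [
    "https://www.example.com/checkout",
    "https://www.example.com/blog/crawl-efficiency",
]

# Simple pattern for <meta name="robots" content="...noindex...">;
# assumes the name attribute appears before content.
META_NOINDEX = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex',
    re.IGNORECASE,
)

for url in urls:
    response = requests.get(url, timeout=10)
    header_noindex = "noindex" in response.headers.get("X-Robots-Tag", "").lower()
    meta_noindex = bool(META_NOINDEX.search(response.text))
    print(f"{url}: {'noindex' if header_noindex or meta_noindex else 'indexable'}")
```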
As stated before, if your site is small, your crawl efficiency is already higher than that of a multi-million-page site because there are simply fewer pages to crawl. Any URL crawled by Googlebot, whether it's an alternate URL (e.g., an AMP or hreflang page), a URL with parameters, or embedded content (e.g., CSS or JavaScript), counts toward your site's crawl budget, which is why a clean sitemap and robots.txt file are so important.
Every site also has a crawl rate limit. This limit is the maximum fetching rate for your site and represents the maximum number of simultaneous parallel connections that can be used to crawl the site, along with the necessary waiting time between these fetches. The purpose of the crawl rate limit is to avoid inundating your server with requests, which could slow down your site for human users. A reliable server and quick page loading can increase your crawl rate limit, helping your site to get crawled more often by Googlebot.
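As a rough, back-of-the-envelope sketch of how those two pieces interact, consider the arithmetic below. The numbers are purely hypothetical (Google doesn't publish its actual values); the point is only that connection count, response time, and wait time set a ceiling on how many fetches are possible:

```python
# Back-of-the-envelope sketch of how parallel connections and wait time
# bound crawl capacity. All numbers here are hypothetical, not Google's.
parallel_connections = 5       # simultaneous connections the crawler opens
avg_fetch_seconds = 0.4        # how long your server takes to respond
wait_between_fetches = 0.6     # pause between fetches on each connection

fetches_per_connection_per_hour = 3600 / (avg_fetch_seconds + wait_between_fetches)
max_fetches_per_hour = parallel_connections * fetches_per_connection_per_hour
print(f"Upper bound: ~{max_fetches_per_hour:,.0f} fetches per hour")

# Halving avg_fetch_seconds raises this ceiling noticeably, which is why
# faster servers tend to earn a higher crawl rate limit.
```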
Factors that negatively affect crawlability
Factors that affect crawl rate and crawl demand impact your site’s crawl efficiency and, as a result, overall crawlability and indexation.
The following factors can negatively affect your site's crawlability:
- Slow server responses or a significant number of 5XX errors
- A significant number of low value–add pages. These can include:
  - Many versions of a page with URL parameters that offer useless filtering, faceted navigation, or session identifiers
  - Duplicate content within your site
  - Low-quality pages
  - Spammy pages
- Long redirect chains (see the sketch after this list)
- Long page-load times that may time out
- Nonstrategic use of noindex and nofollow tags
- Pages served up through AJAX without links in the page source
- Blocking bots from crawling JavaScript and CSS files
- “Dirt” in your sitemap
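One of those factors, long redirect chains, is straightforward to audit yourself. Below is a minimal Python sketch; it assumes the third-party requests library and a hypothetical list of starting URLs, and simply counts how many redirect hops each URL goes through before reaching its final destination:

```python
# Minimal sketch: count redirect hops for each URL.
# Assumes the third-party "requests" library; the URLs are hypothetical.
import requests

start_urls = [
    "http://example.com/old-page",
    "https://www.example.com/category/widgets",
]

MAX_ACCEPTABLE_HOPS = 2  # illustrative threshold, not a Google rule

for url in start_urls:
    response = requests.get(url, allow_redirects=True, timeout=10)
    hops = len(response.history)  # each entry in history is one redirect
    flag = " <-- long chain" if hops > MAX_ACCEPTABLE_HOPS else ""
    print(f"{url}: {hops} redirect(s) -> {response.url}{flag}")
```

Any URL that accumulates more than a hop or two is a good candidate for pointing internal links (and your sitemap) directly at the final destination.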
Tips to improve your site's crawlability
To optimize crawl efficiency, do (you guessed it!) the opposite of the factors that negatively affect crawling and indexing. Here's how to discover areas to improve.
Reliable servers for improved crawl health
Just as server errors or slow server responses can reduce your crawl rate limit, a quick-responding server can raise it. A higher crawl rate limit allows more simultaneous connections to crawl your site.
Note: You can adjust the crawl rate limit in Google Search Console. However, while lowering it reduces crawling of your site, raising it doesn't necessarily increase crawling.
Site speed to increase crawl rate limit
Tying into the previous factor, a faster site is a sign of reliable servers and increases crawl rate. In addition, having a faster site means fewer timeouts—and fewer pages wasting your site’s crawl budget.
Sitemap and robots.txt files to optimize crawl budget
Keeping a clean sitemap and updated directives in your robots.txt file reduces crawl budget waste on pages not intended for search results. Search parameter URLs count toward your crawl budget if you don't specify otherwise, which can keep bots from crawling more important pages on your site.
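As a quick sanity check on those directives, the Python standard library's urllib.robotparser can tell you whether a given URL is blocked by your robots.txt. The robots.txt location and parameter URLs below are hypothetical:

```python
# Minimal sketch: verify that parameter URLs you intend to block are
# actually disallowed by robots.txt. The URLs below are hypothetical.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # fetches and parses the live robots.txt file

test_urls = [
    "https://www.example.com/products?sessionid=abc123",
    "https://www.example.com/products?sort=price&color=red",
    "https://www.example.com/products/blue-widget",
]

for url in test_urls:
    allowed = parser.can_fetch("Googlebot", url)
    print(f"{'ALLOWED' if allowed else 'BLOCKED'}  {url}")
```

If a parameter URL comes back ALLOWED when you expected it to be blocked, the corresponding Disallow rule needs another look.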
Although Google won't necessarily disregard your sitemap if it contains more than 1% dirt (i.e., URLs that don't return a 200 response code), submitting these pages for crawling still wastes crawl budget. Bing, on the other hand, has not made clear whether it will "trust your sitemap less" if it contains more than 1% dirt. Even though Google is the preeminent search engine, we can't leave 33% of search traffic on the table over such a simple mistake.
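To find that dirt before search engines do, a small script can fetch your sitemap and flag every URL that doesn't return a 200. Here is a minimal Python sketch, assuming the third-party requests library and a hypothetical sitemap URL:

```python
# Minimal sketch: flag sitemap "dirt" (URLs that don't return a 200).
# Assumes the third-party "requests" library; the sitemap URL is hypothetical.
import xml.etree.ElementTree as ET
import requests

SITEMAP_URL = "https://www.example.com/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

response = requests.get(SITEMAP_URL, timeout=10)
tree = ET.fromstring(response.content)  # bytes, so the XML declaration is handled
urls = [loc.text for loc in tree.findall(".//sm:loc", NS)]

dirt = []
for url in urls:
    status = requests.head(url, allow_redirects=False, timeout=10).status_code
    if status != 200:
        dirt.append((url, status))

print(f"{len(dirt)} of {len(urls)} URLs are dirt "
      f"({len(dirt) / max(len(urls), 1):.1%} of the sitemap)")
for url, status in dirt:
    print(f"  {status}  {url}")
```

Anything flagged here is a candidate to fix at the source or remove from the sitemap before resubmitting it.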
The index ratio (the ratio of pages submitted to pages indexed) is a great indicator of how efficiently your site is being crawled. Aim for a 1:1 index ratio in Google Search Console. If you've submitted sitemaps and notice a low ratio, look for the following:
- Non-200 pages in your sitemap
- Non-canonical pages in your sitemap
- Spider traps not blocked in your robots.txt file
- Improper use of nofollow or noindex tags
Note: Breaking up a single sitemap into multiple sitemaps can help identify areas of your site that aren't getting indexed.
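As a sketch of that idea, the snippet below groups a hypothetical list of URLs by site section, writes one sitemap per section, and ties them together with a sitemap index. The section names, URLs, and file names are all assumptions for the example:

```python
# Minimal sketch: split one URL list into per-section sitemaps plus a
# sitemap index. The section names, URLs, and file names are hypothetical.
from collections import defaultdict

DOMAIN = "https://www.example.com/"
urls = [
    "https://www.example.com/blog/crawl-budget",
    "https://www.example.com/blog/site-speed",
    "https://www.example.com/products/blue-widget",
    "https://www.example.com/products/red-widget",
]

# Group URLs by their first path segment ("blog", "products", ...).
sections = defaultdict(list)
for url in urls:
    sections[url.removeprefix(DOMAIN).split("/")[0]].append(url)

XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"

for section, section_urls in sections.items():
    entries = "\n".join(f"  <url><loc>{u}</loc></url>" for u in section_urls)
    with open(f"sitemap-{section}.xml", "w") as f:
        f.write(f'<urlset xmlns="{XMLNS}">\n{entries}\n</urlset>\n')

index_entries = "\n".join(
    f"  <sitemap><loc>{DOMAIN}sitemap-{s}.xml</loc></sitemap>" for s in sections
)
with open("sitemap-index.xml", "w") as f:
    f.write(f'<sitemapindex xmlns="{XMLNS}">\n{index_entries}\n</sitemapindex>\n')
```

Submitting the index and each child sitemap in Google Search Console then gives you submitted-versus-indexed counts per section, which makes it much easier to see where indexation is lagging.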
Understanding these elements and their roles in your site's crawlability and indexation, although elementary, is essential for successful search optimization.
Are there other factors that should be included in this guide? Comment below and let us know!