Crawling and indexing are just the beginning of getting your pages to appear in search results. These are two basics of optimizing your site for search that can help you if you're adding a single page to your site or launching a whole new site.
Crawling is Googlebot's first task, and making your site easy to crawl ensures that your content, your optimized metadata, and all the other SEO work you implement on your site actually get discovered.
The basics of Googlebot crawling:
- If you have a small site (less than a few thousand URLs), your site will be crawled efficiently most of the time.
- Crawling is not a direct ranking factor, but if your site or some of its pages cannot be crawled, they will not rank.
- Google ignores the crawl-delay directive in robots.txt, so setting it does nothing.
- Google doesn’t index everything it crawls and doesn’t rank everything it indexes.
- Google will more regularly crawl and index popular pages on well-known sites to keep their index up-to-date.
Crawl efficiency, crawl budget, crawl rate, and crawl demand
Crawl efficiency is how easily bots can crawl all the pages on your site. Keeping a clean site structure, having reliable servers, maintaining sitemaps and robots.txt files, and optimizing site speed all improve crawl efficiency.
If your site checks all these boxes, you'll maximize your site’s crawl efficiency and make the most of your "crawl budget."
Your site’s crawl budget is the amount of time and energy a search engine commits to a single visit to your site. Your crawl budget = crawl rate + crawl demand (more below). Once Googlebot reaches its crawl budget, it will stop crawling your site, even if it hasn’t reached every page that you want indexed.
If your site is small, your crawl budget likely exceeds what Googlebot needs to crawl every page on every visit. There are simply fewer pages for Googlebot to process.
If, on the other hand, your site has millions of pages, improving crawl efficiency can help you get the most out of your crawl budget. On each visit, Googlebot will crawl more pages, which helps search engines index new pages and content updates more quickly.
Crawl rate is how quickly Googlebot can crawl pages on your site. There is a crawl rate limit for every site, which affects how many pages are crawled. This limit is the maximum number of simultaneous parallel connections that can be used to crawl the site, along with the necessary waiting time between these fetches.
Factors that affect crawl rate
- If a site responds really quickly for a while, the limit goes up, meaning more connections can be used to crawl. If a site slows down or responds with server errors, the limit goes down—and Googlebot crawls less.
- Owners can reduce Googlebot's crawling of their site by adjusting settings in Google Search Console, but a higher limit doesn't automatically increase crawling.
- If there’s no demand for indexing, there will be low activity from Googlebot, even if the crawl rate limit isn’t reached.
Crawl demand is how many pages Googlebot thinks it should crawl. Crawl demand depends partly on how many total pages exist on your site, but you can also steer Googlebot away from unimportant sections of your site with directives in your robots.txt file.
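For example, a robots.txt file that steers Googlebot away from low-value sections might look like the sketch below (the paths and domain are hypothetical, not a template to copy verbatim):

```
User-agent: Googlebot
# Keep the bot out of sections that generate endless low-value URLs
Disallow: /search/
Disallow: /cart/
Disallow: /*?sort=

Sitemap: https://example.com/sitemap.xml
```

Google supports the `*` wildcard in robots.txt paths, so the last rule blocks any URL containing a `?sort=` parameter.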
Factors that affect crawl demand
- URLs that are more popular on the Internet tend to be crawled more often to keep them "fresher." Google tries to prevent URLs from becoming "stale" in the index.
- Site-wide events (site moves) may trigger an increase in crawl demand to reindex the content under the new URLs.
Factors that negatively affect crawling and indexing
Inefficient crawling hurts your site’s indexation. For example, slow server responses or a significant number of 5XX errors (server errors, like a 503) can reduce your site’s crawl rate limit, as can a significant number of low-value pages:
- Many versions of a page with URL parameters;
- Duplicate content within your site;
- Low-quality pages;
- Spammy pages;
- A significant number of long redirect chains;
- Long page-loading times that can cause timeouts;
- Improper use of noindex and nofollow tags;
- Pages served via AJAX without links in the page source.
Any URL that doesn't return a 200 ("OK") status qualifies as "dirt" in your sitemap. Asking Googlebot to crawl non-200 pages wastes crawl budget, and Bing has indicated it may "trust your sitemap less", or even ignore it, if more than 1% of the URLs in it are dirt.
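As a quick sketch of how you might measure your own dirt ratio, the script below pulls every `<loc>` URL out of a sitemap and reports the share that don't return a 200. The sitemap and the status codes here are simulated stand-ins; in practice you would fetch your live sitemap and issue real HEAD requests.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap; in practice, fetch your live sitemap file.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/old-page</loc></url>
  <url><loc>https://example.com/moved</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Extract every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

def dirt_ratio(urls, status_for):
    """Return (share of non-200 URLs, list of those URLs).
    `status_for` maps a URL to its HTTP status code."""
    dirt = [u for u in urls if status_for(u) != 200]
    return len(dirt) / len(urls), dirt

# Simulated responses; replace with real HEAD requests against your site.
STATUSES = {"https://example.com/": 200,
            "https://example.com/old-page": 404,
            "https://example.com/moved": 301}

ratio, dirt = dirt_ratio(sitemap_urls(SITEMAP_XML), STATUSES.get)
print(f"dirt: {ratio:.0%} -> {dirt}")
```

Anything the ratio flags (the 404 and the 301 above) should be removed from the sitemap or fixed at the server.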
How to optimize crawl efficiency for new pages and site launches
The methods to optimize crawl efficiency are—you guessed it—to do the opposite of the factors that can negatively affect crawling and indexing.
- Ensure you have reliable servers. Just as server errors or slow server responses can reduce your crawl rate limit, a quick-responding server can improve your crawl rate limit. An increased crawl rate limit results in more simultaneous connections to crawl your site.
- Speed up your site. Having a faster site means fewer timeouts (fewer pages wasted in your site’s crawl budget).
- Streamline the crawl via your sitemap and robots.txt files. Keeping a clean sitemap and updated directives in your robots.txt file reduces wasted crawl budget on pages that you don't intend to serve up in search results. Search parameter URLs count toward your crawl budget, which can keep bots from crawling the important pages on your site.
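A "clean" sitemap in this sense lists only canonical, 200-status URLs you actually want indexed. A minimal sketch (the URL and date are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/blue-widget</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <!-- No redirects, no 404s, no ?sort= parameter variants -->
</urlset>
```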
The index ratio (pages indexed divided by pages submitted) is a good indication of how efficiently your site is being crawled. Aim for a 1-to-1 ratio, which you can check in Google Search Console.
If the ratio is quite low or certain pages aren’t being indexed, look for the following:
- Non-200 pages in your sitemap;
- Non-canonicalized pages in your sitemap;
- Spider traps that aren’t blocked in your robots.txt file;
- Improper use of nofollow or noindex tags.
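Two of the checks above, non-canonicalized pages and stray noindex tags, can be spot-checked from a page's `<head>` with nothing but the standard library. This is a sketch using a hypothetical page snippet, not a full auditing tool:

```python
from html.parser import HTMLParser

class HeadAudit(HTMLParser):
    """Collect rel=canonical and meta robots values from a page's <head>."""
    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.robots = a.get("content", "")

def audit(url, page_html):
    """Flag a sitemap URL whose page is non-canonical or noindexed."""
    p = HeadAudit()
    p.feed(page_html)
    issues = []
    if p.canonical and p.canonical != url:
        issues.append(f"canonical points elsewhere: {p.canonical}")
    if p.robots and "noindex" in p.robots.lower():
        issues.append("page is noindexed")
    return issues

# A hypothetical page that should never appear in a sitemap:
page = ('<head><link rel="canonical" href="https://example.com/a">'
        '<meta name="robots" content="noindex, follow"></head>')
print(audit("https://example.com/a?ref=nav", page))
```

A URL that triggers either flag is wasting crawl budget in your sitemap: the canonical version (without the parameter) is the one that belongs there, and noindexed pages shouldn't be submitted at all.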