Most of your site isn't indexed. Here's why.
Most B2B teams assume their site is fully indexed. It isn't. Here's what crawling and indexing actually mean, how to tell which problem you have, and what to do about it.
Published: 5 June 2026
Read time: 10 minutes
Filed under: Technical SEO · Crawling · Indexation
Every technical SEO conversation eventually comes back to the same pair of questions: is Google finding the page, and is Google keeping it? Those are crawling and indexing. Most teams treat them as one thing. They aren't, and conflating them leads to the wrong diagnosis and the wrong fix.
This post covers both processes in enough detail to diagnose real problems: what Googlebot actually does, when crawl budget matters, how canonicalisation works, and what Search Console is and isn't telling you about your site's indexed state.
Crawling vs. indexing: why the distinction matters operationally.
What each process actually is
Crawling is discovery and retrieval: Googlebot fetches a URL, downloads its content, and follows the links it finds. A crawled page has been visited. That's all. Crawling does not mean the page will appear in search results.
Indexing is the decision to store and serve a page. After crawling, Google's systems analyse the content, assess its quality and uniqueness, and decide whether to add it to the index. A page can be crawled repeatedly and never indexed - because it's thin, because it's a near-duplicate of another page, because a canonical tag points elsewhere, or because a quality signal is too low.
Why it matters for diagnosis
If a page isn't ranking, the cause could be: it's not being crawled, it's being crawled but not indexed, or it's indexed but not competitive. Each has a different fix. "Not ranking" is not a useful starting point. "Crawled but not indexed" points to content quality or canonicalisation. "Not crawled" points to internal linking, robots.txt, or crawl budget. Knowing which problem you have determines which tool you reach for.
| State | What it means | Where to look |
|---|---|---|
| Not crawled | Googlebot hasn't visited. Could be blocked by robots.txt, not linked from anywhere, or deprioritised due to crawl budget. | URL Inspection in Search Console; server logs |
| Crawled, not indexed | Google visited but chose not to store. Common causes: thin content, duplicate, low quality signal, canonical mismatch. | Page Indexing report in Search Console |
| Indexed, not ranking | In the index but not competitive for target queries. Content quality, authority, or relevance issue. | Search Console Performance; ranking tools |
| Indexed and ranking | Working as intended. Optimise for position and CTR. | Search Console Performance |
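The table reads as a short decision chain, and writing it out makes the order of the checks explicit. The following is a minimal sketch, not a tool: `PageStatus` and `diagnose` are hypothetical names, and the three flags are assumed to come from your own checks (URL Inspection, server logs, the Performance report).

```python
from dataclasses import dataclass


@dataclass
class PageStatus:
    crawled: bool  # Googlebot has fetched the URL
    indexed: bool  # Google has stored the page
    ranking: bool  # the page earns impressions for its target queries


def diagnose(status: PageStatus) -> str:
    """Map observed flags to one of the four states in the table."""
    if not status.crawled:
        return "Not crawled"
    if not status.indexed:
        return "Crawled, not indexed"
    if not status.ranking:
        return "Indexed, not ranking"
    return "Indexed and ranking"


print(diagnose(PageStatus(crawled=True, indexed=False, ranking=False)))
```

The checks run in that order on purpose: there is no point asking why a page isn't ranking until you know it has been crawled and indexed.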
How Googlebot actually works.
The crawl queue
Googlebot doesn't crawl URLs in a fixed order. It maintains a crawl queue: a prioritised list of URLs to fetch, ranked by signals including PageRank, freshness, sitemap submissions, and internal links. URLs with more inbound links and more recent change signals get crawled more frequently. A page with no inbound internal links may sit in the queue for weeks.
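Pages with no inbound internal links are easy to find once you have a link graph from a site crawl. Here is a rough sketch, assuming the graph is a plain dictionary of page path to outgoing links; the `find_orphans` name and the sample site are made up for illustration.

```python
def find_orphans(link_graph: dict[str, list[str]]) -> list[str]:
    """Return pages that no other page links to, sorted for stable output."""
    linked_to = set()
    for source, targets in link_graph.items():
        for target in targets:
            if target != source:  # a self-link does not count as inbound
                linked_to.add(target)
    return sorted(page for page in link_graph if page not in linked_to)


# Hypothetical site: every key is a known page, every value its outgoing links.
site = {
    "/": ["/pricing", "/blog"],
    "/pricing": ["/"],
    "/blog": ["/", "/blog/crawling"],
    "/blog/crawling": ["/blog"],
    "/landing/old-campaign": ["/pricing"],
}
orphans = find_orphans(site)
print(orphans)
```

In this sample only `/landing/old-campaign` comes back: it links out, but nothing links in, so the only route to discovery is a sitemap or an external link.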
Googlebot identifies itself with the user-agent token Googlebot/2.1. Google runs multiple crawlers for different purposes - Googlebot-Image, Googlebot-Video, AdsBot-Google - but for web indexing, the main crawlers are Googlebot (desktop) and Googlebot (smartphone). As of 2023, the smartphone variant is the primary one: it renders and evaluates pages as a mobile browser would.
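When reading server logs it helps to separate the two web crawlers from everything else. The sketch below does that from the user-agent string alone. It assumes the smartphone crawler's string contains the word "Mobile", and `classify_googlebot` is a hypothetical helper name. User-agent strings can be spoofed, so treat the result as a label, not as proof that the request came from Google.

```python
import re

# Matches the Googlebot/2.1 token, but not Googlebot-Image, Googlebot-Video,
# or AdsBot-Google, which use different tokens.
GOOGLEBOT_TOKEN = re.compile(r"\bGooglebot/2\.1\b")


def classify_googlebot(user_agent: str) -> str:
    """Return 'smartphone', 'desktop', or 'other' for a user-agent string."""
    if not GOOGLEBOT_TOKEN.search(user_agent):
        return "other"
    # Assumption: the smartphone crawler's string contains "Mobile".
    return "smartphone" if "Mobile" in user_agent else "desktop"


desktop_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
mobile_ua = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
print(classify_googlebot(desktop_ua), classify_googlebot(mobile_ua))
```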
Rendering
After fetching, Googlebot renders the page using a headless Chromium instance. This is where JavaScript-dependent content gets evaluated. The rendering is not instantaneous: Google processes rendering in a second wave, which can mean a delay between initial crawl and the indexed version reflecting JavaScript-generated content. For most static HTML pages this is irrelevant. For heavy JavaScript applications it's a material consideration.
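A quick way to see whether a page depends on that second wave is to check which key phrases are already present in the raw HTML, before any JavaScript runs. This is a minimal sketch using only the standard library; `VisibleText` and `phrases_missing_from_raw_html` are hypothetical names, and the sample markup is invented.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def phrases_missing_from_raw_html(raw_html, phrases):
    """Return the phrases that do not appear in the unrendered HTML text."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [phrase for phrase in phrases if phrase.lower() not in text]


raw = (
    "<html><body><h1>Pricing</h1><div id='app'></div>"
    "<script>render('Enterprise plan')</script></body></html>"
)
missing = phrases_missing_from_raw_html(raw, ["Pricing", "Enterprise plan"])
print(missing)
```

Anything in the returned list exists only after rendering, so it is the content most exposed to rendering delays.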
The crawl rate limit
Google deliberately limits how fast it crawls a site to avoid overloading servers. The crawl rate is set automatically, based largely on server response times and errors. It can't be raised on request, and the legacy Search Console setting for lowering it was retired in early 2024; to slow Googlebot temporarily, the documented route is to return 500, 503, or 429 responses. If Googlebot is causing server load, it will back off. If your server is consistently slow, Googlebot crawls less frequently.
Crawl budget: when it matters and when it doesn't.
Crawl budget is one of the most misused concepts in technical SEO. It's regularly cited as a problem on sites where it isn't one, and overlooked on sites where it is.
What crawl budget actually means
Crawl budget is the combination of crawl rate limit (how fast Google will crawl without overloading the server) and crawl demand (how much Google wants to crawl based on signals like PageRank and freshness). Together they determine how many pages Googlebot will fetch from a site in a given period.
When it matters
For most sites under 10,000 pages with clean URL structures, crawl budget is not a limiting factor. Google will crawl the whole site in a reasonable time. Crawl budget becomes a real concern when:
- The site has hundreds of thousands of URLs, many of them low-value: faceted navigation permutations, session ID parameters, pagination without canonical tags, near-duplicate product or listing pages.
- Important pages aren't being crawled despite being internally linked, as shown in server logs.
- Crawl stats in Search Console show high volumes of 4xx or 5xx responses, indicating Googlebot is wasting budget on broken URLs.
If none of those apply, optimising for crawl budget is premature. Fix the actual problem first.
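The third signal is easy to measure from server logs. The sketch below counts Googlebot requests by status class and reports the share that hit 4xx or 5xx responses. It assumes the common "combined" access log format; the regex, the function names, and the sample lines are all illustrative, and the agent check is a plain substring match that does not verify the requester.

```python
import re
from collections import Counter

# Assumption: combined log format, with the user agent as the last quoted field.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_breakdown(lines):
    """Count Googlebot requests per status class, e.g. '2xx', '4xx'."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("status")[0] + "xx"] += 1
    return counts


def wasted_share(counts):
    """Fraction of Googlebot requests that ended in a 4xx or 5xx response."""
    total = sum(counts.values())
    return (counts["4xx"] + counts["5xx"]) / total if total else 0.0


def sample_line(path, status, agent):
    # Invented log line for the example; 203.0.113.5 is a documentation address.
    return (
        f'203.0.113.5 - - [05/Jun/2026:10:00:00 +0000] "GET {path} HTTP/1.1" '
        f'{status} 512 "-" "{agent}"'
    )


BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
log = [
    sample_line("/pricing", 200, BOT),
    sample_line("/blog", 200, BOT),
    sample_line("/old-page", 404, BOT),
    sample_line("/search?q=x", 500, BOT),
    sample_line("/pricing", 200, "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"),
]
counts = googlebot_status_breakdown(log)
print(dict(counts), wasted_share(counts))
```

In this sample half of Googlebot's four requests ended in an error. On a real site, track that share over time and watch the trend.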
The canonicalisation layer.
How Google decides which URL to index
When multiple URLs serve the same or very similar content, Google selects one as the canonical - the version it will index and use for ranking signals. This selection is influenced by: the rel="canonical" tag, the sitemap, internal linking patterns, and Google's own assessment of which version is most useful.
The important nuance: the canonical tag is a hint, not a directive. Google may ignore it if it conflicts with other signals. A page with a self-referencing canonical but no internal links pointing to it, while a nearly identical page has hundreds of internal links, may result in Google indexing the more-linked version regardless of what the canonical tag says.
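Before debating which signals Google weighed, confirm what the page declares. Here is a small sketch that pulls rel="canonical" links out of a page's HTML with the standard library parser. `CanonicalFinder`, `declared_canonicals`, and `is_self_referencing` are hypothetical names, and the comparison is an exact string match with no URL normalisation.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonicals.append(attr["href"])


def declared_canonicals(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


def is_self_referencing(page_url, html):
    """True only if the page declares exactly one canonical: itself."""
    return declared_canonicals(html) == [page_url]


page = '<head><link rel="canonical" href="https://example.com/page"></head>'
print(declared_canonicals(page))
print(is_self_referencing("https://example.com/page", page))
```

More than one canonical tag on a page is itself a conflicting signal, which is why the helper returns a list.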
Common canonicalisation problems
- HTTP and HTTPS versions both accessible without a redirect or canonical. Google will usually pick HTTPS, but the duplicate signals dilute PageRank.
- Trailing slash inconsistency: /page and /page/ treated as separate URLs without canonicalisation.
- URL parameters not handled: /products?sort=price and /products?sort=name indexed as separate pages when they show the same content.
- Canonical points to a redirect: the canonical destination redirects elsewhere, creating an ambiguous chain Google has to resolve.
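The first three problems show up quickly if you normalise a list of crawled URLs and group the ones that collapse to the same address. The sketch below does that under stated assumptions: HTTPS is the preferred scheme, trailing slashes are insignificant, and the parameters in `IGNORED_PARAMS` never change the content. That last set is hypothetical and has to be decided per site.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters never change what the page shows.
IGNORED_PARAMS = {"sort", "sessionid"}


def normalise(url):
    """Collapse scheme, host case, trailing slash, and ignorable parameters."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit(("https", parts.netloc.lower(), path, urlencode(kept), ""))


def duplicate_groups(urls):
    """Group URLs that normalise to the same address; keep groups of 2+."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}


crawled = [
    "http://example.com/page",
    "https://example.com/page/",
    "https://example.com/products?sort=price",
    "https://example.com/products?sort=name",
    "https://example.com/about",
]
groups = duplicate_groups(crawled)
for canonical, variants in groups.items():
    print(canonical, variants)
```

Each group is a candidate for one redirect or canonical rule. The normalised key is only a suggestion for the target, not a statement of what Google has chosen.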
The canonical vs. noindex decision
Use rel="canonical" when you want a page's signals consolidated into another version. Use noindex when you want a page completely excluded from the index but still accessible to users. They are not interchangeable: a page with noindex still passes link equity to pages it links to; a page with a canonical pointing elsewhere passes its signals to the canonical URL.
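Checking which of the two a page actually carries is worth automating, because noindex can arrive through a meta tag or an HTTP header. This is a simplified sketch: `has_noindex` is a hypothetical helper, it reads the `robots` and `googlebot` meta tags plus a plain `X-Robots-Tag` header, and it does not handle header values scoped to a specific user agent.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in ("robots", "googlebot"):
            content = attr.get("content") or ""
            self.directives.update(
                token.strip().lower() for token in content.split(",") if token.strip()
            )


def has_noindex(html, headers=None):
    """True if the HTML or the response headers carry a noindex directive."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    header_value = ""
    for key, value in (headers or {}).items():
        if key.lower() == "x-robots-tag":
            header_value = value
    header_tokens = {token.strip().lower() for token in header_value.split(",")}
    return "noindex" in finder.directives or "noindex" in header_tokens


print(has_noindex('<meta name="robots" content="noindex, follow">'))
print(has_noindex("<p>No directives here</p>", {"X-Robots-Tag": "noindex"}))
```

A page that carries both noindex and a canonical pointing elsewhere is sending mixed instructions, so it is worth flagging whenever this check and a canonical check both fire.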
What Search Console is and isn't telling you.
Google Search Console is the primary tool for monitoring crawl and indexation state, but it has significant limitations that matter for diagnosis.
What it shows accurately
- The index status of specific URLs (via URL Inspection)
- Which URLs Google has detected as having issues (Page Indexing report)
- Crawl stats: requests per day, response codes, response times over the last 90 days
- Sitemap submission status and detected errors
What it doesn't show
- A complete count of indexed pages. The index coverage numbers are sampled and estimated. The actual indexed page count is not available.
- A full record of Googlebot's requests. Server logs show every bot request; Search Console shows a subset.
- Real-time status. There's typically a delay of several days between a crawl event and its reflection in Search Console data.
- Why a page was excluded. The "crawled, not indexed" classification tells you the outcome but not the underlying reason. That diagnosis requires reading the actual page.
Indexation diagnosis: where to start.
- Page not appearing in Search Console at all: check robots.txt, check whether the page is linked from any indexed page, use URL Inspection to fetch directly
- "Crawled, not indexed" status: read the page as a user - is it thin, a near-duplicate, or primarily navigation? That's your answer
- Indexed but not ranking: indexation is not the problem; content quality and authority are
- Crawl stats show high 404 rate: audit redirect chains and fix broken internal links first
- New pages taking weeks to appear: check internal linking - a new page with no inbound links from indexed pages will be deprioritised in the crawl queue
Frequently asked
Can Google read JavaScript?
Yes, but in two passes: it crawls the raw HTML first, then renders the JavaScript later in a separate, deferred step. Content that only appears after JavaScript runs can be indexed late or missed. If your site is built on a JS framework, see what actually gets indexed on JavaScript sites.
What is fetch and render in Search Console?
It's the URL Inspection tool's "Test live URL" function. It fetches a page as Googlebot and shows you the rendered HTML and a screenshot of what Google actually sees. Comparing that against your browser is the fastest way to catch content, links, or structured data that aren't making it into the index.
Does crawl budget affect small sites?
Rarely. For a site under 10,000 pages with a clean URL structure and reasonable server response times, Google will crawl the entire site within a normal cycle regardless of crawl budget. The concept becomes relevant at scale - large e-commerce sites, news sites with millions of URLs, or B2B SaaS platforms with auto-generated landing pages. If you're spending time optimising crawl budget on a 500-page site, that time is almost certainly better spent on content quality.
What's the difference between noindex and canonical for removing a page from search?
noindex tells Google not to show the page in search results, but Google still crawls it and the page can still pass link equity. A canonical pointing to another URL consolidates signals into that URL but doesn't prevent the page from being crawled either. Neither is a substitute for the other. If you want a page entirely excluded from the index and are comfortable with Googlebot still crawling it, use noindex. If you have duplicate content you want consolidated into one version, use canonical.
How do I tell if Googlebot is blocked from a specific page?
Use URL Inspection in Search Console - it shows the last crawl date, the rendered HTML, and any blocking signals Google detected. If the tool shows "URL is not on Google" and the page exists, check robots.txt for a disallow rule covering that path, check whether the page has a noindex meta tag, and check whether the pages that link to it are themselves indexed. Also check server logs if available: if Googlebot isn't appearing in the logs for that URL, either the block is upstream of the server (DNS, CDN, or robots.txt) or Google hasn't discovered the URL yet.
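The robots.txt part of that check can be scripted with Python's standard library. This is a minimal sketch: `googlebot_allowed` is a hypothetical helper and the rules are an invented example. Python's parser does simple prefix matching, so it may not agree with Google on wildcard patterns; treat a surprising result as a prompt to look at the file, not as a verdict.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt: str, url: str) -> bool:
    """Check a URL against robots.txt rules as the 'Googlebot' user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


# Invented rules for illustration.
rules = "User-agent: *\nDisallow: /private/\n"
print(googlebot_allowed(rules, "https://example.com/private/report"))
print(googlebot_allowed(rules, "https://example.com/pricing"))
```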
Does submitting a sitemap guarantee indexation?
No. A sitemap is a crawl hint, not an indexation instruction. It tells Google which URLs you want it to consider, and it speeds up discovery of new pages. But Google will still apply its quality and uniqueness assessments before indexing any URL. Submitting a sitemap full of thin or duplicate pages doesn't get them indexed - it just tells Google where they are. Only submit URLs in your sitemap that you genuinely want indexed and that meet Google's quality threshold.
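To see the gap between "submitted" and "indexed" for your own site, compare the URLs in the sitemap against a list of indexed URLs. The sketch below assumes a standard urlset sitemap using the sitemaps.org namespace. The `indexed` set is hypothetical, standing in for whatever export you have, and `sitemap_urls` is an illustrative name.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value from a urlset sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/pricing</loc></url>"
    "<url><loc>https://example.com/tag/misc</loc></url>"
    "</urlset>"
)
# Hypothetical: URLs you have confirmed as indexed, e.g. from an export.
indexed = {"https://example.com/", "https://example.com/pricing"}

submitted_not_indexed = sorted(set(sitemap_urls(sample_sitemap)) - indexed)
print(submitted_not_indexed)
```

Every URL in that difference is one to either improve or remove from the sitemap.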