Training Data Is the New Ranking Factor: The Common Crawl AI Visibility Audit Explained

Your site ranks number one. ChatGPT has never heard of it. Both things can be true at once, and the reason has nothing to do with your content.

Every article in this series so far has dealt with what happens after a crawler reaches a page: how LLMs process it, how RAG retrieves it, how Google's AI Overviews cite it. All of it assumes the page was reachable in the first place. For a large and growing share of the web, that assumption is simply wrong. In June 2026 the Common Crawl Foundation published The AI Visibility Audit, a free field guide that maps the layer sitting upstream of all of it: before on-page work, before technical SEO, before a single link is built, a site has to be reachable by the crawlers that feed AI training data. Miss that step and a page can sit at the top of Google while staying invisible to ChatGPT, Gemini, Claude, and Perplexity.

What Common Crawl is, and why it matters.

Common Crawl is a nonprofit, founded in 2007 by Gil Elbaz, with one job: keep an open, downloadable copy of the web that anyone can use. It runs a bot called CCBot that crawls the open web every month and publishes the results as free archives on Amazon S3.

Here is why that matters to AI search. Those archives became one of the core training sets for modern large language models. When OpenAI and the other labs started building, Common Crawl was one of the first datasets they reached for. So the chain is short and brutal: block CCBot and you block yourself out of the data these models learn from.

The scale tells you why a single blocked crawler costs so much.

Common Crawl has published more than 10 PB of open archive data since 2008. Each monthly crawl captures roughly 2 to 2.5 billion pages (350 to 400 TiB uncompressed), and the dataset is cited in more than 10,000 research papers. Page counts shift with seed lists and revisit scheduling, so treat any single figure as approximate; the scale is not in doubt.

From the web to a model that knows you.

Four steps separate a published page from a model that knows it exists. You can only act on the first one. Most SEOs have never looked at it.

Step 1: Crawlers fetch the page

CCBot and others follow links and respect robots.txt. This is the only step your configuration controls. Access is won or lost here.

Step 2: Data enters the archive

Captured pages are written to monthly snapshots (WARC, WAT, WET, CDX) and published openly. Not crawled means not in the snapshot. There is no backup route.

Step 3: Labs filter and train

Model builders pull from the archive, filter for quality, and train. To make the cut you first have to be in the archive, then survive the quality filter.

Step 4: The model knows you

The model can surface, cite, and recommend what it learned. You either make the training data or you don't.

The guide's central claim follows from that chain: training-data inclusion behaves like a ranking factor. It sits upstream of on-page work, technical SEO, and link building, and it is set by infrastructure decisions (CDN configuration, robots.txt rules, server-side rendering), not by content strategy.

Two kinds of AI visibility.

A model can know about you in two completely different ways, and they have nothing in common except the word "visibility". One is baked into the model's weights. The other is fetched live, on demand. They answer to different crawlers, on different clocks, and a site can win one while losing the other.

	Parametric memory	Retrieval (RAG)
What it is	What the model knows without looking anything up: content baked into its weights during training.	What the model fetches live at answer time, from the systems covered in Article 3.
Timing	Slow. Crawl cycle, then archive, then a training run. Months of lag, on the model's schedule, not yours.	Immediate. Publish today, eligible today, if the live bots can reach you.
Depends on	Being crawled and kept before the model's training cutoff.	Being reachable by live retrieval and search bots right now.
Access question	Was the site open to training crawlers like CCBot when the model trained?	Is the site open to today's retrieval bots, which are a separate set?
How to audit	CC index coverage and crawl history (Check 2).	Live fetch tests against the retrieval user agents (Check 1).

The trap is assuming one covers the other. A site can be wide open to CCBot for training and still block the live bots that serve RAG answers. A full picture checks both clocks.

The block you never set.

AI crawlers are now the most-blocked agents on the web. Most of those blocks were never a decision. They were a default.

A page can rank well in Google and still be unreachable by training crawlers, because the block lives in the infrastructure, not the content. Two mechanisms cause nearly all of it, and both are invisible from inside your own files.

Managed robots.txt. Some CDNs inject AI-crawler disallow rules into your robots.txt for you. They create the file if it doesn't exist, or prepend their own rules to the top of one that does, disallowing GPTBot, ClaudeBot, Google-Extended, and CCBot. Your repo looks clean. The block is happening at the edge.
WAF and bot-management blocks. A "block AI bots" toggle or a bot-fight rule rejects requests by user agent at the firewall, before they reach the server at all. Again, nothing in your files. The bot just sees a 403 and leaves.

This is the default state, not a fringe case. One major CDN sits in front of a huge slice of the web, and for new domains its out-of-the-box setting blocks AI crawlers. Nobody opted in. The setting opted in for them.

75% of the top 100 US and UK news sites blocked CCBot in a January 2026 analysis, making it the single most-blocked training bot: ahead of Anthropic-ai (72%), ClaudeBot (69%), and GPTBot (62%). Across reputable news sites overall, AI-blocking climbed from 23% in September 2023 to nearly 60% by May 2025.

The practical takeaway is blunt. For any publisher or brand, assume CCBot access is already gone and check it first.

Harmonic Centrality: why some sites get crawled deeply.

Open access gets you crawled. How deeply, and how often, comes down to where you sit in the link graph. Common Crawl publishes a Web Graph mapping the web's link structure at host and domain level, and from it derives a measure called Harmonic Centrality.

PageRank and Harmonic Centrality ask different questions. PageRank counts how many important sites link to you: popularity. Harmonic Centrality measures how close you sit to the core of the web's link topology: proximity. A domain near that core gets prioritised in the crawl budget. A domain stranded at the edge gets crawled shallowly and rarely, even with respectable PageRank and a wide-open robots.txt.

"One link from a genuinely core site can lift your centrality more than dozens from isolated ones."
The AI Visibility Audit, on what link building is really for now

That changes the brief for link building. The old question was "how many authoritative links does this site have?" The new one runs alongside it: "how connected is this site to the core of the graph the training crawler prioritises?" A site can score well on the first and badly on the second. Higher centrality means crawled more often, which means more pages in the archive, more training data, and a better chance the models know you and recommend you. Each crawl compounds the last.
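To make the proximity idea concrete, here is a minimal sketch of the textbook harmonic centrality formula on a toy link graph: the score of a node is the sum of 1/d over every other node that can reach it, where d is the shortest link distance. The host names, the graph, and the function name are invented for illustration. Common Crawl derives its published values from the full host- and domain-level Web Graph, so this shows only the direction of the effect, not real scores.

```python
from collections import deque


def harmonic_centrality(graph, target):
    """Sum of 1/d(y, target) over every node y with a link path to target."""
    # Reverse the edges so a breadth-first search from the target
    # walks links backwards and finds every node that can reach it.
    incoming = {}
    for source, destinations in graph.items():
        for destination in destinations:
            incoming.setdefault(destination, []).append(source)

    distance = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for linker in incoming.get(node, []):
            if linker not in distance:
                distance[linker] = distance[node] + 1
                queue.append(linker)

    return sum(1.0 / d for node, d in distance.items() if node != target)


# Hypothetical toy web: four tightly interlinked "core" hosts.
CORE = {
    "core1": ["core2", "core3", "core4"],
    "core2": ["core1", "core3", "core4"],
    "core3": ["core1", "core2", "core4"],
    "core4": ["core1", "core2", "core3"],
}

# Scenario A: two isolated blogs link to the site.
isolated_links = {**CORE, "blog1": ["site"], "blog2": ["site"], "site": []}

# Scenario B: a single core host links to the site instead.
core_link = {
    **CORE,
    "core1": CORE["core1"] + ["site"],
    "blog1": [],
    "blog2": [],
    "site": [],
}

print(harmonic_centrality(isolated_links, "site"))  # 2.0
print(harmonic_centrality(core_link, "site"))       # 2.5
```

In this toy, one link from a core host (2.5) outscores two links from isolated blogs (2.0), because the core host brings every host near it one hop closer to the site. With a realistic core of millions of hosts the gap would be far larger.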

The language bias: AI leans towards English.

If you operate in a non-English market, here is an uncomfortable one: a page can rank well at home and still lose the AI citation to its English equivalent. Two biases stack to make it happen.

Start with the raw mix. English makes up roughly 41% of the latest crawl (40.85% of CC-MAIN-2026-21, from Common Crawl's own language distribution). After quality filtering the effective share is higher still, because the filter favours well-structured, high-authority pages, and those skew English.

On top of that base, two effects compound. Retrieval bias comes first: AI retrievers rank English and high-authority pages above equivalent localised ones, and it gets worse when the query itself is in or normalised to English. Across multilingual RAG systems the retriever, not the generator, is the bottleneck in cross-lingual ranking. Model bias comes second: large models appear to compute in a shared concept space that tilts toward English, so even parametric answers lean that way.

So pull the lever that actually moves: publish a strong English edition of your cornerstone pages. The retriever is where the bias is most fixable, and a genuinely useful English version of your highest-value pages widens your AI footprint. That means proper hreflang, distinct crawlable URLs, and real content rather than machine-translated filler. Both versions still have to clear the access checks below.

The five-check AI Visibility Audit.

Five checks, in order, most decisive first. Free tools throughout, about 90 minutes start to finish. Run them in sequence: an early failure usually explains the ones that follow.

Check 1: CCBot access

Confirm nothing is disallowing AI crawlers, in robots.txt or at the edge. Read the live robots.txt for CCBot disallow lines (and GPTBot, ClaudeBot, Google-Extended), then test the firewall separately: a clean robots.txt means nothing if the WAF returns a 403 by user agent. A single curl request with the CCBot user-agent string against the homepage tells you whether you're blocked at the edge.

Tools: curl · Screaming Frog with custom user agent
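The two halves of this check can be sketched in Python with the standard library alone. The first function parses robots.txt text and reports which AI agents it disallows; the second fetches a URL while identifying as CCBot and returns the HTTP status, which is the scripted equivalent of the curl test. The function names and the sample robots.txt are illustrative, and the user-agent string is an assumption to confirm against Common Crawl's CCBot documentation before relying on it.

```python
from urllib import error, request, robotparser

AI_AGENTS = ["CCBot", "GPTBot", "ClaudeBot", "Google-Extended"]

# Assumed CCBot user-agent string; verify against commoncrawl.org/ccbot.
CCBOT_UA = "CCBot/2.0 (https://commoncrawl.org/faq/)"


def blocked_agents(robots_txt, path="/", agents=AI_AGENTS):
    """Return the agents that the given robots.txt text disallows for path."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


def edge_status(url, user_agent=CCBOT_UA):
    """Fetch url with the given user agent and return the HTTP status code."""
    req = request.Request(url, headers={"User-Agent": user_agent})
    try:
        with request.urlopen(req, timeout=10) as response:
            return response.status
    except error.HTTPError as exc:
        return exc.code  # e.g. 403 when a WAF rejects the user agent


# Illustrative robots.txt, similar in shape to an injected AI-crawler block.
SAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_agents(SAMPLE_ROBOTS))  # ['CCBot', 'GPTBot']
```

Run `blocked_agents` against the robots.txt served live from the domain, not the copy in the repository, since edge-injected rules only show up in the live response. A 200 from `edge_status("https://example.com/")` alongside a clean robots.txt suggests the edge is open to that user agent; a 403 points at the firewall. Because a WAF may also filter by IP, treat a pass from your own machine as indicative rather than conclusive.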

Check 2: CC index coverage

Check whether the domain is actually in the archive, when it was last crawled, and how many pages are indexed. You can be open in robots.txt yet barely crawled if centrality is low or your pages were never discovered. The Common Crawl Index Server returns all of this from a free public API.

Tool: index.commoncrawl.org
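As a rough sketch of what that lookup involves, the snippet below builds a query URL for one crawl's index and summarises the response, which the index server returns as one JSON record per line. The helper names and the sample records are hypothetical; the crawl ID is the one cited earlier in this article, and the current list of crawl IDs is published at index.commoncrawl.org/collinfo.json.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"


def index_query_url(crawl_id, domain, limit=50):
    """Build a CDX index query for every captured URL under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json", "limit": limit})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def summarise_captures(ndjson_text):
    """Count captures, successful fetches, and the latest capture timestamp."""
    records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    ok = [record for record in records if record.get("status") == "200"]
    latest = max((record["timestamp"] for record in records), default=None)
    return {"captures": len(records), "ok": len(ok), "latest": latest}


# Hypothetical response body: two captures, one of them a redirect.
SAMPLE_RESPONSE = (
    '{"url": "https://example.com/", "timestamp": "20260512101500", "status": "200"}\n'
    '{"url": "https://example.com/old", "timestamp": "20260514080000", "status": "301"}\n'
)

print(index_query_url("CC-MAIN-2026-21", "example.com"))
print(summarise_captures(SAMPLE_RESPONSE))
```

Fetch the generated URL with any HTTP client and pass the response body to `summarise_captures`. Repeating the query across several recent crawl IDs gives the crawl history: steady presence is the signal to look for, and a domain that returns no captures in crawl after crawl has an access or discovery problem.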

Check 3: Harmonic Centrality

Look up the domain's centrality and rank in the Web Graph. A low rank means deprioritised crawl budget: shallow, infrequent crawling even with open access. Treat low centrality as a strategic risk and a link-building target aimed at core-connected sites, not just high-PageRank ones.

Tool: community CC Rank Checker (Common Crawl's own is in development)

Check 4: Structured data completeness

Entities without structured data are harder to represent cleanly in training data and harder for models to attribute. Audit Schema.org markup on key pages: Organization, Article or Product, Person (for authors), and BreadcrumbList. Missing or invalid markup is a low-effort, high-leverage fix.

Tool: Google Rich Results Test
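For a quick first pass before the Rich Results Test, a short script can list which of these types are absent from a page's JSON-LD. This is a sketch under narrow assumptions: it reads only JSON-LD script blocks (not Microdata or RDFa), it matches exact type names (so a subtype such as NewsArticle would not count as Article), and the class and function names are invented for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect every parseable <script type="application/ld+json"> payload."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._capturing = True
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._capturing:
            self._capturing = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # invalid JSON-LD is treated the same as missing markup


def collect_types(node, found):
    """Walk nested JSON-LD and record every @type value."""
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            found.add(declared)
        elif isinstance(declared, list):
            found.update(t for t in declared if isinstance(t, str))
        for value in node.values():
            collect_types(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_types(item, found)


def missing_types(html):
    """Return the audited Schema.org types that the page does not declare."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    found = set()
    collect_types(extractor.blocks, found)
    gaps = [t for t in ("Organization", "Person", "BreadcrumbList") if t not in found]
    if not found & {"Article", "Product"}:
        gaps.append("Article or Product")
    return gaps


# Hypothetical page: an Article with an author, but no Organization or breadcrumbs.
SAMPLE_PAGE = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Example",
 "author": {"@type": "Person", "name": "Jane Doe"}}
</script></head><body></body></html>"""

print(missing_types(SAMPLE_PAGE))  # ['Organization', 'BreadcrumbList']
```

The returned list doubles as the structured-data gap list for the page. It says nothing about whether the markup that is present is valid, which is what the Rich Results Test is for.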

Check 5: Server-side rendering

Many AI crawlers behave like early Googlebot: they fetch HTML but never run JavaScript. If your content only appears after JS executes, the crawler captures an empty shell. Compare the raw HTML fetch against the rendered page. If the key content isn't in the raw HTML, it's invisible to anything that doesn't execute JavaScript.

Tool: curl
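One way to automate the comparison is to pick a phrase you can see on the rendered page and test whether it appears in the visible text of the raw HTML. The sketch below does that with the standard library, ignoring anything inside script and style tags so that a string buried in JavaScript does not count as content. The class name, function name, and sample pages are made up for the example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if not self._hidden_depth:
            self.parts.append(data)


def phrase_in_raw_html(html, phrase):
    """True if phrase appears in the text a non-JavaScript crawler would see."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split()).casefold()
    return " ".join(phrase.split()).casefold() in text


# Hypothetical client-rendered shell: the phrase exists only inside JavaScript.
JS_SHELL = '<html><body><div id="root"></div><script>render("Pricing plans")</script></body></html>'

# Hypothetical server-rendered page: the phrase is in the HTML itself.
SERVER_RENDERED = "<html><body><h1>Pricing plans</h1></body></html>"

print(phrase_in_raw_html(JS_SHELL, "Pricing plans"))         # False
print(phrase_in_raw_html(SERVER_RENDERED, "Pricing plans"))  # True
```

Feed it the body returned by a plain HTTP fetch (curl, or `urllib.request`) of each key template, using a phrase taken from the page's main content rather than its navigation or footer.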

The five results collapse into a one-page scorecard: pass or fail on access and rendering, present or absent in the archive, a centrality grade, and a structured-data gap list, each with its own fix. Most agencies don't offer that deliverable yet. That gap is the opportunity.
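If the scorecard is being produced for more than one site, it can help to hold the five results in a small data structure. The following is only a sketch of one possible shape: the field names, the letter-grade scale for centrality, and the cut-off at grade D are all assumptions, not part of the guide.

```python
from dataclasses import dataclass, field


@dataclass
class VisibilityScorecard:
    ccbot_access: bool        # pass or fail on crawler access
    in_archive: bool          # present or absent in the Common Crawl index
    centrality_grade: str     # hypothetical A-F grade for Harmonic Centrality
    structured_data_gaps: list = field(default_factory=list)
    server_rendered: bool = True  # pass or fail on rendering

    def first_failure(self):
        """Return the earliest failing check, since early failures explain later ones."""
        if not self.ccbot_access:
            return "CCBot access"
        if not self.in_archive:
            return "CC index coverage"
        if self.centrality_grade in ("D", "E", "F"):  # assumed threshold
            return "Harmonic Centrality"
        if self.structured_data_gaps:
            return "Structured data completeness"
        if not self.server_rendered:
            return "Server-side rendering"
        return None


# Hypothetical site: open to CCBot but absent from the archive.
card = VisibilityScorecard(
    ccbot_access=True,
    in_archive=False,
    centrality_grade="C",
    structured_data_gaps=["BreadcrumbList"],
)
print(card.first_failure())  # CC index coverage
```

Reporting the first failure rather than all of them mirrors the order of the audit: there is little point fixing breadcrumbs on a site that is not in the archive yet.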

Verifying CCBot is real: the impersonation problem.

If you read server logs, know this: the CCBot user-agent string proves nothing. Common Crawl says so itself. Other crawlers routinely identify as CCBot when they aren't.

The real CCBot runs from dedicated IP ranges with reverse DNS, so a forward-confirmed reverse DNS check settles it. The IP should resolve to a hostname ending in .crawl.commoncrawl.org, and that hostname should resolve back to the same IP. An impostor fails the round trip. Common Crawl even publishes its verified ranges as JSON at index.commoncrawl.org/ccbot.json. The rule is simple: verify by IP, never by user-agent string, before you touch an access rule or trust a crawl-volume number.
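The round trip described above can be sketched in a few lines of Python. The function name and its injectable lookup parameters are illustrative choices that let the logic be tested without DNS; by default it uses the standard `socket` calls, which in this form cover IPv4 only. The addresses in the example come from reserved documentation ranges and are not real CCBot IPs.

```python
import socket

CCBOT_SUFFIX = ".crawl.commoncrawl.org"


def is_verified_ccbot(ip, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS: the PTR name must match and resolve back to ip."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        hostname = reverse_lookup(ip)
    except OSError:
        return False  # no reverse DNS record at all

    if not hostname.rstrip(".").lower().endswith(CCBOT_SUFFIX):
        return False  # resolves, but not to a Common Crawl hostname

    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False  # hostname does not resolve forward


# Hypothetical DNS data for a dry run (documentation-range addresses).
FAKE_PTR = {
    "192.0.2.10": "192-0-2-10.crawl.commoncrawl.org",    # genuine-looking
    "198.51.100.7": "bot.example.net",                   # wrong domain
    "203.0.113.5": "203-0-113-5.crawl.commoncrawl.org",  # spoofed PTR record
}
FAKE_A = {
    "192-0-2-10.crawl.commoncrawl.org": ["192.0.2.10"],
    "203-0-113-5.crawl.commoncrawl.org": ["192.0.2.99"],  # does not round-trip
}

for address in FAKE_PTR:
    print(address, is_verified_ccbot(address, FAKE_PTR.__getitem__, FAKE_A.__getitem__))
```

Against real logs, call `is_verified_ccbot(ip)` with no extra arguments for each distinct IP that claimed to be CCBot, and cache the answers, since DNS lookups are slow. Checking the IP against the published JSON ranges is a complementary test that avoids DNS entirely.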

It cuts both ways. Impersonation can inflate apparent CCBot traffic until a site over-blocks and locks out the real training crawler. Or the real CCBot gets banned at the WAF, blamed for load an impostor generated.

The opt-out question, honestly.

The audit is written for sites that want into the training data. It's also honest that plenty of sites don't, and for good reasons: principle, commercial strategy, or outright objection to how the content would be used. The same checks run in reverse for them. They confirm you're genuinely excluded rather than just assuming you are.

The tooling for opting out is maturing. Common Crawl published an Opt-Out Registry in September 2025, a single place to register preferences (including for content already sitting in older crawls), and it's an active member of the IETF AIPREF working group building a richer standard vocabulary for crawler preferences over robots.txt and HTTP headers.

Now the harder truth. Opting out is less reliable than most publishers think. Researchers at ACM CCS 2025 found several thousand domains where the owner believed they had opted out and hadn't. The mechanisms are inconsistent across crawlers: a rule that stops one bot does nothing to another, and intent quietly drifts from outcome. It's the most credible criticism of the current crawling ecosystem, and Common Crawl is working on it directly.

Why this belongs in every GEO workflow.

Most GEO frameworks skip a step. They start with content and structure and links, and assume the systems doing the training and retrieving can actually reach the page. The AI Visibility Audit puts that assumption first, where it belongs.

This series has covered how LLMs generate responses (Article 2), how RAG retrieves content (Article 3), how Google's AI Overviews and AI Mode work (Article 5), and what each platform says about getting cited (Article 6). Every one of those depends on access. The Common Crawl audit is the check that confirms it, and for a meaningful share of sites it will come back negative. CDN-injected blocks, WAF rules, JavaScript-only rendering, low centrality: none of these are content problems, and no amount of content strategy fixes them.

If you're not in the crawl, you're not in the model; if you're not in the model, you may not be in the answer. The audit is the free, 90-minute way to find out which side of that line you're on, before you spend another quarter optimising content a crawler never sees.

The old world was index and rank. The new world is train and retrieve.

Frequently asked

What is the Common Crawl AI visibility audit?

It's a check of whether your site is present in Common Crawl, the open web archive that feeds most AI training datasets. If a site isn't in the crawl, it can't be in the training data, which is one of the upstream ways a brand becomes known to a model. The audit confirms reachability before you worry about RAG or on-page work.

What is Common Crawl?

Common Crawl is a nonprofit that crawls the web and publishes the archive for free. It's one of the largest single sources in the training data of major language models, which is why being included in it matters for long-term AI visibility.

How do I check if my site is in Common Crawl?

Query the Common Crawl index for your domain (the post walks through the exact method). If your pages appear across recent monthly crawls, you're reachable by CCBot. If they don't, that's an upstream visibility problem no amount of on-page optimisation will fix.

Does training-data inclusion affect AI citations?

It affects what a model knows about you by default, which shapes how it talks about your brand even without live retrieval. It's distinct from RAG, which fetches pages at query time. Both routes matter, and the strongest position is to be present in both. For the retrieval side, see how to rank in Google AI Mode.

How is training data different from RAG retrieval?

Training data is baked in before the model ships and has a cutoff date. RAG retrieval happens live at query time. A brand-new page can be cited via RAG immediately, but it won't reach training data until a future crawl and training run. They're two separate routes to being used by AI.

How often is Common Crawl updated?

Roughly monthly. Each crawl captures a large sample of the web rather than every page, so consistent presence across multiple crawls is a better signal of reachability than appearing in any single one.

Sources

Common Crawl Foundation, "Introducing the AI Visibility Audit," Stephen Burns, June 2026 - commoncrawl.org/blog/introducing-the-ai-visibility-audit
Common Crawl Foundation, The AI Visibility Audit (PDF field guide), June 2026 - Common-Crawl-AI-Visibility-Guide.pdf
Common Crawl Foundation, CCBot documentation - commoncrawl.org/ccbot
Common Crawl Foundation, Web Graphs - commoncrawl.org/web-graphs
Common Crawl Foundation, Opt-Out Registry, September 2025 - commoncrawl.org/blog/common-crawl-foundation-opt-out-registry
Common Crawl Foundation, Crawl Statistics - commoncrawl.github.io/cc-crawl-statistics
Common Crawl Index Server (free, public) - index.commoncrawl.org
Common Crawl Foundation, FAQ - commoncrawl.org/faq
Wikipedia: Large language model - en.wikipedia.org/wiki/Large_language_model
Wikipedia: Retrieval-augmented generation - en.wikipedia.org/wiki/Retrieval-augmented_generation
