What Is RAG: the Only Thing That Matters for AI Search

RAG is the topic I get asked about more than any other in this series. And I understand why: once you see how Retrieval-Augmented Generation works, the rest of AI search visibility clicks into place. It's the mechanism that connects your content to every answer these systems produce.

Every trained language model has a structural problem: it doesn't know what happened after its training ended. Ask GPT-4 about an event that occurred after its knowledge cutoff and it'll either decline to answer or (more problematically) generate a confident, plausible-sounding response with no basis in reality. This isn't a failure of intelligence. Training fixes the model's weights, and fixed weights mean fixed knowledge. No amount of clever prompting changes what's stored in those weights; new information has to be supplied alongside the question. RAG is the architecture that does exactly that, and every major AI search product uses it. If you want to understand why your content gets cited or ignored, this is where you start.

The problem RAG solves.

Before RAG, AI systems faced a binary choice: either the model knew something from training, or it didn't. For static knowledge (scientific principles, historical facts, stable reference material) training data was sufficient. For anything that changes (recent events, current pricing, new research, live market data) it wasn't.

The response without RAG is a "closed book" inference: the model answers purely from its learned parameters. The response with RAG is an "open book" inference: the model has access to retrieved documents it can read and cite before answering. Think of it as the difference between a closed-book exam and an open-book exam. The student's underlying intelligence and reasoning capability are the same in both cases. The quality and recency of their answers differ dramatically.

For AI search specifically, RAG solves three distinct problems.

Recency. Training data has a cutoff. RAG retrieves current content at query time, so responses can reflect information published after training ended.

Hallucination. By grounding responses in retrieved documents, RAG gives the model factual anchors that constrain generation toward accurate claims. A model that hallucinates a fact when answering from memory is far less likely to hallucinate when it has a retrieved document containing the correct fact right in front of it.

Attribution. RAG-grounded responses can cite their sources in a way that closed-book responses can't. This is why AI search products display citations alongside answers.

The four-step RAG pipeline.

RAG systems typically follow a four-step pipeline. Different content signals matter at each step, and understanding which signal matters where is what separates a content strategy that works from one that doesn't.

Step 1: Chunking and indexing

Before any query happens, the system has to prepare your content. It splits documents (web pages, PDFs, articles, knowledge base entries) into passages, typically a few hundred to a few thousand tokens each. It then converts each passage into a vector embedding: a high-dimensional numerical representation of the passage's semantic meaning, generated by a separate embedding model.

Those embeddings live in a vector database. That database is the indexed corpus the RAG system retrieves from. For Google AI Overviews and AI Mode, it's Google's Search index. For Microsoft Copilot, it's Microsoft's semantic index (built on SharePoint and OneDrive content for organisational queries) and Bing's index for web queries. For Perplexity, retrieval happens live at query time against the public web.

This chunking step is where content structure has its first significant effect. A well-structured document (clear headings, self-contained paragraphs, each passage making a coherent claim) chunks into retrievable, usable units. A document with long paragraphs that mix multiple ideas, or key claims buried after preamble, produces chunks that are harder to retrieve accurately and harder to use once retrieved.
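To make the chunking step concrete, here is a minimal sketch in Python. None of the platforms publish their chunking logic, so treat everything here as an assumption: the function name `chunk_document` is hypothetical, paragraphs are assumed to be separated by blank lines, and tokens are approximated by whitespace-separated words rather than a real tokenizer.

```python
def chunk_document(text: str, max_tokens: int = 200) -> list[str]:
    """Split text into passages along paragraph boundaries.

    Assumption: tokens are approximated by whitespace-separated words.
    A production system would use the embedding model's own tokenizer.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A single paragraph longer than max_tokens becomes its own
        # oversized chunk; this sketch never splits inside a paragraph.
        current.append(para)
        current_len += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because the split only ever happens between paragraphs, a paragraph that mixes several ideas stays mixed in the resulting chunk. That is the structural effect described above: the chunker can't repair a passage that wasn't self-contained to begin with.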

Step 2: Retrieval

When you submit a query, the system converts it into a vector embedding using the same model used to index the content. It then performs a similarity search (typically cosine similarity) to find the stored passages whose vectors are most similar to the query vector.

This is the critical technical distinction from traditional search. Traditional search ranking is driven by keyword matching (does the page contain the query terms?), link authority (how many authoritative sites link to this page?), and behavioural signals (do users click and stay?). RAG retrieval is driven by semantic similarity: how close is the meaning of this passage to the meaning of this query, in a high-dimensional vector space?

These are genuinely different things. A page that ranks well in traditional search because it's accumulated links over years may not retrieve well in RAG if its content isn't semantically precise about what you're asking. A newer page with lower link authority may retrieve well if it directly and precisely addresses the query's intent.
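The similarity search itself is simple enough to sketch. The example below is an illustration only: the three-dimensional vectors are invented toy values (real embeddings have hundreds or thousands of dimensions and come from an embedding model), and the names `cosine_similarity` and `top_k` are my own, not any platform's API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2):
    """Return the k passage ids most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), pid) for pid, vec in index.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(pid, score) for score, pid in scored[:k]]


# Toy vectors, invented for illustration only.
index = {
    "passage-a": [0.9, 0.1, 0.0],
    "passage-b": [0.0, 1.0, 0.2],
    "passage-c": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]
for pid, score in top_k(query, index, k=2):
    print(pid, round(score, 3))
```

Notice what is absent from the scoring: links, click data, and keyword counts. In this sketch the only thing that decides rank is the angle between the query vector and each passage vector. Real systems typically layer other signals on top, but this is the core of the retrieval step.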

Step 3: Prompt augmentation

The retrieved passages are injected into the LLM's context window alongside the original user query. A typical augmented prompt looks something like: "Based on the following sources, answer this question: [user query]. Sources: [retrieved passage 1] [retrieved passage 2] [retrieved passage 3]..."

How many passages get injected depends on the context window size and the retrieval system's configuration. More retrieved passages mean more coverage but also more content for the model to process, and more exposure to the "lost in the middle" problem I described in Article 2 of this series.
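Here is a rough sketch of how an augmented prompt might be assembled under a size budget. The template follows the example wording above; the function name `build_augmented_prompt`, the word-count budget, and the numbered source labels are assumptions for illustration, not a documented format from any platform.

```python
def build_augmented_prompt(query: str, passages: list[str], max_words: int = 300) -> str:
    """Assemble a prompt from the query and as many passages as fit the budget.

    Assumptions: passages arrive ordered best match first, and the budget is
    counted in whitespace-separated words rather than real tokens.
    """
    selected: list[str] = []
    used = 0
    for passage in passages:
        n = len(passage.split())
        # Stop at the first passage that would overflow the budget.
        if used + n > max_words:
            break
        selected.append(passage)
        used += n

    sources = "\n".join(f"[{i}] {p}" for i, p in enumerate(selected, start=1))
    return (
        "Based on the following sources, answer this question: "
        f"{query}\n\nSources:\n{sources}"
    )
```

The budget is the point of the sketch: a passage that ranks below the cut-off never reaches the model at all, however good it is. Shorter, self-contained passages also leave room for more sources in the same window.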

Step 4: Generation

The model generates a response grounded in the retrieved passages. Where the model's training data and the retrieved content agree, the response is confident. Where they conflict, the model has to reconcile the discrepancy. Well-designed RAG systems are configured to prefer retrieved content over training data when there's a conflict, since retrieved content is typically more recent and more specifically relevant to the query.

Citations in AI search responses (the links under AI Overviews, the sources panel in ChatGPT Search, the citations in Perplexity) are generated at this step, extracted from the retrieved passages that contributed to the response.
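How each product maps a generated sentence back to a source isn't public, so the following is only one plausible convention: the model is asked to tag claims with numbered markers like [1], and the system resolves those markers against the passages it injected. The function name `extract_citations` and the marker format are assumptions.

```python
import re


def extract_citations(answer: str, sources: dict[int, str]) -> list[str]:
    """Return URLs referenced by [n] markers in the answer, in first-use order.

    Markers that don't correspond to an injected source are ignored, so the
    model can't cite something that was never retrieved.
    """
    cited: list[str] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        url = sources.get(int(match.group(1)))
        if url is not None and url not in cited:
            cited.append(url)
    return cited
```

Under this convention, a source can only be cited if it was retrieved and injected in the earlier steps, which is why retrieval is the gate for visibility.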

Figure: The four-step RAG pipeline, left to right. Step 1 "Chunk & Index" (document → passages → vectors → database) → Step 2 "Retrieve" (query vector → similarity search → top-K passages) → Step 3 "Augment" (passages + query → context window) → Step 4 "Generate" (LLM → cited response).

Google's term for RAG: grounding.

Google officially confirmed in its May 2026 AI search documentation that AI Overviews and AI Mode both use Retrieval-Augmented Generation, which Google calls "grounding."

Grounding, in Google's usage, means anchoring an AI response to retrieved real-world sources rather than relying on training data alone. The retrieval source for Google's AI features is Google's own Search index: the same index used for traditional blue-link results.

This has a direct and important implication: there's no separate AI index to optimise for. Getting into AI Overviews starts with being indexed and crawlable in Google's standard Search index. All the baseline technical requirements of traditional SEO (indexation, crawlability, snippet eligibility, technical health) still apply before any AI-specific considerations. Yes, even now.

Google's documentation is explicit on this point: foundational SEO is the prerequisite for AI search visibility, not an alternative to it.

How each platform implements RAG differently.

The four-step pipeline is universal. The implementation varies significantly by platform, and those differences have direct implications for your GEO strategy.

Google AI Overviews. Single-pass RAG. Google's systems retrieve relevant passages from the Search index using its standard ranking signals as a first-pass filter, inject them into Gemini's context window, and generate a synthesised answer. Passage-level retrieval means individual paragraphs can be cited independently of overall page ranking.

Google AI Mode. Multi-stage RAG with query fan-out. A single user query is decomposed into multiple sub-queries, each retrieving independently, with results synthesised across all of them. Your page can be cited in AI Mode even if it doesn't rank in the top 10 for the primary query. It just needs to rank for one of the sub-queries. (See Article 5 for a detailed treatment of AI Mode and query fan-out.)

Microsoft Copilot. Dual-layer retrieval. For organisational queries, Copilot retrieves from Microsoft Graph (SharePoint, OneDrive, Teams, email) using a semantic index built over the organisation's content. For web queries, Copilot generates targeted search queries, sends them to Bing, retrieves results, performs grounding checks and semantic similarity cross-checks, then synthesises a response. The query Copilot sends to Bing isn't your original prompt. It's a distilled set of terms the system determines will retrieve relevant information.

ChatGPT Search. Hybrid retrieval. ChatGPT Search rewrites user queries into one or more targeted sub-queries and sends them to search partners. When web search is enabled, it behaves like a RAG system; when it's not, responses draw purely from training data. Deep Research mode is agentic: it plans and carries out a multi-step research process, searching, evaluating sources, and refining queries iteratively.

Perplexity. Real-time RAG on every single query via the Sonar model. There's no pre-existing ranked index that Perplexity retrieves from. Retrieval happens live at query time against the public web, using a low-latency hybrid search combining semantic methods, LLM ranking, and human feedback signals. Every answer is retrieved and synthesised fresh. This means traditional link-based ranking has no inherent advantage in Perplexity. What matters is crawlability, freshness, and factual authority.

Claude. When web search is enabled, Claude generates targeted search queries, retrieves relevant results, analyses them for key information, and provides a response with citations. Claude can conduct multiple progressive searches, using earlier results to inform subsequent queries. When web search isn't enabled, Claude draws purely from training data up to its knowledge cutoff.
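Several of these platforms rewrite one query into multiple sub-queries and merge the results. A minimal sketch of that merge step is below. The sub-queries are hard-coded here because the decomposition is done by each platform's own model, and `fan_out` and the `search` callable are hypothetical stand-ins, not real APIs.

```python
from collections.abc import Callable


def fan_out(
    sub_queries: list[str],
    search: Callable[[str], list[str]],
    per_query: int = 3,
) -> list[str]:
    """Run each sub-query and merge the results.

    Keeps first-seen order and drops duplicates, so a URL returned by
    several sub-queries appears once.
    """
    merged: list[str] = []
    for sub_query in sub_queries:
        for url in search(sub_query)[:per_query]:
            if url not in merged:
                merged.append(url)
    return merged


# A fake search backend, invented for illustration only.
fake_results = {
    "what is rag": ["https://example.com/rag-basics", "https://example.com/glossary"],
    "rag vs fine-tuning": ["https://example.com/comparison", "https://example.com/rag-basics"],
}
candidates = fan_out(list(fake_results), lambda q: fake_results.get(q, []))
print(candidates)
```

In the fake data, the comparison page never appears for the first query, yet it still lands in the merged candidate set because it surfaced for the second. That is the fan-out effect: a page needs to be retrievable for one sub-query, not for the original phrasing.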

Figure: Citation interfaces on three AI search platforms: (1) ChatGPT Search, the "Sources" panel showing numbered inline citations; (2) Perplexity, the right-hand source card panel with favicons and titles; (3) Google AI Overviews, the collapsible "Show more" citation links.

What semantic similarity means for your content.

The shift from keyword matching to semantic similarity is the most important technical change in information retrieval in decades.

In traditional keyword-based retrieval, the question was: does this document contain the query terms? Ranking then determined which keyword-matching documents to surface first. SEO accordingly optimised for keyword presence, keyword density, and ranking signals.

In semantic vector retrieval, the question is different: is the meaning of this passage similar to the meaning of this query? The passage doesn't need to contain the exact query terms. It needs to express ideas that are semantically close to the query's intent.

This has practical consequences. A passage that answers "how do transformers handle long sequences?" will be retrieved for queries about "Transformer attention mechanism long context" even if it uses different phrasing. A passage stuffed with exact-match keywords that doesn't clearly address the underlying question retrieves poorly despite its keyword density. Write for the question the reader is trying to answer, not for the keywords you want to rank for. In a semantic retrieval world, those two things converge more than in a keyword world, but the framing still matters.

Figure: Semantic proximity in vector space. A query point ("how do transformers handle long sequences?") sits among nearby passage embeddings, some containing the exact query keywords and some without them but semantically similar. Proximity ≠ keyword match.

The universal requirements.

The RAG architecture drives the same content requirements regardless of which platform you're targeting.

Crawlability. If the platform's retrieval system can't access your content, it can't retrieve it. Technical barriers to crawling (blocked resources, JavaScript-only rendering, noindex directives on canonical pages) prevent entry into the pipeline entirely.

Passage-level structure. Because retrieval operates at the passage level, not the page level, each passage should be a coherent, self-contained unit that makes a clear claim.

Semantic precision. Your content needs to precisely address the questions it's intended to answer. Vague, hedged, or densely jargon-laden content retrieves poorly.

Factual attribution. Well-designed RAG systems perform grounding checks. Content that makes unattributed assertions is harder to use as a citation anchor than content with clear, attributable claims.

Non-commodity value. If the model's training data already contains a reliable answer to a query, the RAG system has less incentive to retrieve external content. Content that provides information the model can't generate from training data alone (proprietary data, original research, expert analysis, first-hand experience) is structurally more valuable to RAG systems.
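Of these requirements, crawlability is the easiest to check mechanically. The sketch below uses Python's standard `urllib.robotparser` to test whether a robots.txt file lets a given crawler fetch a URL. The helper name `allowed_agents` is my own, and the user-agent strings are examples only: confirm the current crawler names in each vendor's documentation before relying on them. It also only covers robots.txt, not JavaScript rendering or noindex directives.

```python
from urllib.robotparser import RobotFileParser


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Report, per user agent, whether robots.txt permits fetching the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Example robots.txt, invented for illustration only.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

report = allowed_agents(
    robots_txt,
    "https://example.com/guide",
    ["GPTBot", "PerplexityBot", "Googlebot"],
)
for agent, allowed in report.items():
    print(agent, "allowed" if allowed else "blocked")
```

A rule like the one in this example shuts a single crawler out of the whole site while leaving everything else open, which is an easy way to disappear from one platform's answers without noticing.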

Article 6 in this series covers what each platform has actually published about these requirements where the guidance exists, and what the architecture implies where it doesn't. The requirements above aren't speculation. They follow directly from the pipeline I've described here.

Next in the series.

Article 4 is the definitive technical timeline of the AI search race: every significant model release and architectural shift from November 2022 to May 2026.

Sources.

Google Cloud, "What is Retrieval-Augmented Generation?" — cloud.google.com/use-cases/retrieval-augmented-generation
Google Cloud, "RAG and Grounding on Vertex AI," 2024 — cloud.google.com/blog/...
Google Search Central, "Optimizing for Generative AI Features on Google Search," 2026 — developers.google.com/search/docs/...
IBM, "What is Retrieval-Augmented Generation?" — ibm.com/think/topics/retrieval-augmented-generation
AWS, "What is Retrieval-Augmented Generation?" — aws.amazon.com/what-is/retrieval-augmented-generation
Anthropic, "Introducing Web Search on the Anthropic API," 2025 — anthropic.com/news/web-search-api
Anthropic, "Introducing Citations on the Anthropic API," 2025 — anthropic.com/news/introducing-citations-api
OpenAI, "Introducing ChatGPT Search," 2024 — openai.com/index/introducing-chatgpt-search
OpenAI API Documentation, "Web Search" — platform.openai.com/docs/guides/tools-web-search
OpenAI Help Centre, "ChatGPT Search" — help.openai.com/en/articles/9237897-chatgpt-search
Microsoft Learn, "Microsoft 365 Copilot Architecture" — learn.microsoft.com/en-us/microsoft-365/copilot/...
Microsoft Learn, "Semantic Indexing for Microsoft 365 Copilot" — learn.microsoft.com/en-us/microsoftsearch/...
Microsoft Learn, "Use Public Websites to Improve Generative Answers" — learn.microsoft.com/en-us/microsoft-copilot-studio/...
Perplexity, "Sonar Models Documentation" — docs.perplexity.ai/docs/sonar/models
Perplexity, "Meet New Sonar," 2025 — perplexity.ai/hub/blog/meet-new-sonar
Wikipedia: Retrieval-augmented generation — en.wikipedia.org/wiki/Retrieval-augmented_generation
Wikipedia: Large language model — en.wikipedia.org/wiki/Large_language_model

Thomas Cox
