Writing · Series: How AI Search Works · Part 2 of 6
How Large Language Models Actually Work: Tokens, Context Windows, and Why Your Content Gets Ignored.
The mechanics beneath the interface. What tokenisation means for how models read your content, why context windows create hard limits on what gets considered, and the practical implications for content that earns citations vs. content that doesn't.
Published 1 June 2026
By Thomas Cox
Read time 10 minutes
Filed under Series · LLM Mechanics · Part 2 of 6
The question I get most is some version of this: how do I get my content cited by AI search? The answer is technical, not strategic. It's rooted in how large language models actually process information, and most of the GEO (generative engine optimisation) advice floating around skips this entirely in favour of surface-level tips that will date badly.
I've run enough audits to know that the gap isn't effort or content quality. It's understanding. If you don't know why putting your key claim first matters at a mechanical level, you can't adapt when the platforms change (which they do, constantly). This article explains the mechanics so you can.
Ch 1: Two Completely Different Processes: Training and Inference.
The first distinction to internalise is that there are two fundamentally different phases in an LLM's life: training and inference. They're not versions of the same thing. They're different processes that happen at different times, at different costs, for different reasons.
Training is when the model learns. A language model is exposed to vast text corpora (books, websites, academic papers, code, conversations) and is trained to predict the next token in a sequence. For every text sample, the model makes a prediction. The prediction is compared to the actual next token. The error is used to adjust billions of parameters — the numerical weights that define how the model processes information. This happens billions of times, across billions of text samples, on clusters of thousands of GPUs or TPUs running for weeks or months.
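The predict, compare, adjust loop can be sketched at toy scale. This is an illustration under heavy assumptions, not how any production model is trained: the three-word vocabulary is made up, a single list of logits stands in for billions of parameters, and the update rule is plain gradient descent on cross-entropy loss.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def training_step(logits, target_index, learning_rate=0.5):
    """One predict -> compare -> adjust cycle on a single example."""
    probs = softmax(logits)                      # predict
    loss = -math.log(probs[target_index])        # compare (cross-entropy)
    # Gradient of cross-entropy w.r.t. the logits is probs - one_hot(target).
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    new_logits = [w - learning_rate * g for w, g in zip(logits, grads)]  # adjust
    return new_logits, loss

VOCAB = ["cat", "sat", "mat"]   # hypothetical vocabulary
logits = [0.0, 0.0, 0.0]        # stand-in for the model's parameters
target = VOCAB.index("sat")     # the "actual next token" in the sample

losses = []
for _ in range(20):
    logits, loss = training_step(logits, target)
    losses.append(loss)
```

The loss falls on every step as the probability assigned to the correct token rises. Real training does the same thing across billions of parameters and samples, which is where the cost comes from.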
Training a frontier model costs tens of millions of dollars. It happens once, or infrequently. The result is a model with fixed weights: a snapshot of learned patterns from the training data.
Inference is what happens when you use the model. You send a prompt. The model generates a response, token by token. Each token is sampled from a probability distribution over the model's vocabulary — the model outputs a probability for every possible next token, and one is selected. Then that token is added to the context, and the process repeats for the next token. This continues until the model produces an end-of-sequence token or reaches a length limit.
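The generation loop can be sketched in a few lines. The "model" below is a hypothetical lookup table I've invented for illustration: it conditions only on the last token, whereas a real LLM conditions on the entire context. The loop structure (sample, append, repeat until an end token or a length limit) is the part that carries over.

```python
import random

# Hypothetical toy model: last token -> probabilities for the next token.
TOY_MODEL = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"model": 0.7, "token": 0.3},
    "a": {"model": 0.5, "token": 0.5},
    "model": {"predicts": 0.8, "<end>": 0.2},
    "token": {"<end>": 1.0},
    "predicts": {"<end>": 1.0},
}

def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Sample one token at a time, feeding each back into the context."""
    rng = random.Random(seed)
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):               # length limit
        distribution = TOY_MODEL[context[-1]]     # probabilities for next token
        candidates = list(distribution)
        weights = [distribution[t] for t in candidates]
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<end>":                 # end-of-sequence token
            break
        context.append(next_token)                # token joins the context
    return context
```

Note that nothing in `generate` modifies `TOY_MODEL`. The weights are read, never written, which is the whole point of the training/inference distinction.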
Inference is fast: milliseconds to seconds per response. It's also entirely determined by the fixed weights established during training and the tokens currently in the context. The model can't update its weights from your prompt, and nothing you ask it changes what it knows for the next request. At the level of the weights, it's stateless.
This has a direct implication for your GEO strategy: the model doesn't know things the way humans do. It has learned statistical patterns that produce plausible token sequences given an input. What it outputs is shaped by what it was trained on and what is in the prompt, and nothing else, unless an external retrieval system (retrieval-augmented generation, or RAG) injects additional content into the context window at inference time.
Figure: the two-phase LLM lifecycle. Training (GPU cluster, parameter weights updating; weeks, $10M+) versus inference (fixed weights, prompt in, response out; milliseconds, fractions of a cent).
Ch 2: What a Token Is and Why It Matters.
Language models don't process words. They process tokens.
A token is a unit of text drawn from the model's fixed vocabulary: typically a word, a word fragment, or a punctuation mark. Common words are single tokens. Less common words may be split into multiple tokens. As a rough rule of thumb, one token is approximately 0.75 words in English, meaning 1,000 words is approximately 1,333 tokens.
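The rule of thumb is simple enough to turn into a back-of-envelope converter. This is an estimate only: the 0.75 ratio is the approximation quoted above for English, and actual counts depend on the specific tokenizer and the text.

```python
WORDS_PER_TOKEN = 0.75  # rough English-only rule of thumb, not a tokenizer

def estimate_tokens(word_count):
    """Approximate how many tokens a given number of words will occupy."""
    return round(word_count / WORDS_PER_TOKEN)

def estimate_words(token_count):
    """Approximate how many words fit in a given number of tokens."""
    return round(token_count * WORDS_PER_TOKEN)

print(estimate_tokens(1_000))    # 1333
print(estimate_words(128_000))   # 96000
```

For anything that matters (billing, hard limits), count with the tokenizer of the model you're targeting rather than relying on the ratio.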
Why does this matter? Because everything in an LLM (the context window, the cost of inference, the attention mechanism) is measured in tokens, not words or characters.
When a RAG system feeds your content to a model, it doesn't retrieve your whole page. It retrieves chunks of your page (passages of a few hundred to a few thousand tokens) and injects them into the model's context window. The model then generates a response based on those retrieved tokens plus the user's query.
The chunking process is where structure matters: a well-structured passage that contains a clear claim in its first sentence is a retrievable, usable chunk. A passage that buries its central point in the third paragraph, after contextual preamble, produces a chunk that may be retrieved but whose key claim sits in the middle — precisely where the attention mechanism struggles most.
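To make chunking concrete, here is a minimal sketch of one possible strategy: split on blank lines and pack whole paragraphs into chunks under a token budget. Everything about it is an assumption for illustration. Real retrieval systems use their own splitters and real tokenizers, and the function name, the 300-token default, and the word-count proxy for tokens are all mine.

```python
def chunk_paragraphs(text, max_tokens=300, words_per_token=0.75):
    """Pack whole paragraphs into chunks that stay under a token budget.

    Token counts are approximated from word counts. A single paragraph
    larger than the budget is kept whole as its own oversized chunk.
    """
    max_words = int(max_tokens * words_per_token)
    chunks, current, current_words = [], [], 0
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        n_words = len(paragraph.split())
        # Close the current chunk if this paragraph would overflow it.
        if current and current_words + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(paragraph)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Whatever the splitter, the chunk boundary is arbitrary from your point of view. A passage that opens with its claim still makes sense when it is cut loose from the paragraphs around it; one that depends on earlier preamble doesn't.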
Figure: a paragraph of text in the OpenAI Tokenizer tool (platform.openai.com/tokenizer), split into coloured token fragments, with common words as single tokens and a less common word split across 2 to 3 tokens.
Ch 3: The Context Window: What the Model Can See.
The context window is the total amount of text the model can process in a single inference pass. It's measured in tokens. Everything the model uses to generate a response (the system prompt, conversation history, retrieved documents, user query, and model output so far) must fit within this window.
Content outside the context window doesn't exist to the model. It can't be referenced, reasoned about, or cited. It's simply not there.
Context window sizes have grown dramatically:
- GPT-3.5: 4,096 tokens (roughly 3,000 words)
- GPT-4o: approximately 128,000 tokens (roughly 96,000 words)
- Claude Opus 4.6: 1,000,000 tokens (roughly 750,000 words)
- Gemini 2.5 Pro: 1,000,000+ tokens (over 750,000 words)
For RAG systems, context window size determines how many retrieved passages can be injected before generation. A model with a 128K token window can hold significantly more retrieved content than one with a 4K window, which means it can draw from more sources and synthesise more comprehensively.
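The difference is easy to quantify with a rough budget calculation. The sketch below assumes a greedy packer that takes passages in rank order until the window is full. The function name, the 500-token passage size, and the overhead and output reservations are hypothetical numbers chosen for illustration, not figures from any platform.

```python
def pack_passages(passage_token_counts, window_tokens,
                  overhead_tokens, reserved_output_tokens):
    """Return indices of the ranked passages that fit in the window.

    overhead_tokens covers the system prompt, history, and user query;
    reserved_output_tokens leaves room for the model's own response.
    """
    budget = window_tokens - overhead_tokens - reserved_output_tokens
    packed, used = [], 0
    for index, count in enumerate(passage_token_counts):
        if used + count > budget:
            break  # the next passage would overflow the window
        packed.append(index)
        used += count
    return packed

# Hypothetical scenario: 400 candidate passages of 500 tokens each.
candidates = [500] * 400
small = pack_passages(candidates, 4_096, overhead_tokens=600, reserved_output_tokens=1_000)
large = pack_passages(candidates, 128_000, overhead_tokens=600, reserved_output_tokens=1_000)
print(len(small), len(large))  # 4 252
```

Under these assumptions the small window holds four passages and the large one holds 252. In the small window, ranking fifth means you don't exist.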
But a larger context window doesn't solve every problem.
Figure: horizontal bar comparison of the context window sizes listed above, each bar proportional to its token count and labelled with the model name and approximate word-count equivalent.
Ch 4: The "Lost in the Middle" Problem.
In 2023, researchers at Stanford, UC Berkeley, and Samaya AI (Liu et al.) published research documenting a property of large language models that has significant practical implications for anyone trying to get their content cited.
The finding: LLMs exhibit a U-shaped performance curve when retrieving information from long contexts. Information placed at the very beginning of the context window is retrieved reliably. Information placed at the very end is retrieved reliably. Information placed in the middle is systematically underweighted — the model performs noticeably worse at using it, even when that information is directly relevant to the query.
This isn't a bug in one product. It appears to be an emergent property of the attention mechanism and training dynamics. The pattern resembles the primacy and recency effects in human memory, where items in the middle of a sequence are recalled worst.
The practical consequence for RAG systems: when retrieved passages are injected into a prompt, where your content lands within that prompt affects whether the model actually uses it. You can't control where a platform places your passage, but you can control where the claim sits inside it. A passage that states its key claim in the first sentence puts that claim at an edge of the chunk, so it is more likely to survive middle-of-context degradation than one where the claim appears after several sentences of preamble. This is the technical justification for every piece of advice you've heard about putting your main point first. It's not a writing preference. It's alignment with how the attention mechanism distributes its resources.
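One mitigation a RAG pipeline could apply on its side is to reorder retrieved passages so the strongest ones sit at the edges of the prompt and the weakest in the middle. The sketch below is my illustration of that idea, with a hypothetical function name. I'm not claiming any particular AI search platform does this.

```python
def reorder_for_edges(ranked_passages):
    """Place the best-ranked passages at the start and end of the list.

    Input is ordered best-first. Even-ranked items fill from the front,
    odd-ranked items fill from the back, so the weakest land mid-list.
    """
    front, back = [], []
    for rank, passage in enumerate(ranked_passages):
        if rank % 2 == 0:
            front.append(passage)
        else:
            back.append(passage)
    return front + back[::-1]

print(reorder_for_edges(["p1", "p2", "p3", "p4", "p5"]))
# ['p1', 'p3', 'p5', 'p4', 'p2']
```

The best passage (`p1`) opens the prompt, the second best (`p2`) closes it, and the weakest (`p5`) takes the middle slot. The lesson for publishers is the same either way: rank matters, and so does what the model sees first inside your passage.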
/ 2025 update
Google's own research published in 2025 found that Gemini 2.5 Flash demonstrates substantially reduced lost-in-the-middle degradation compared to earlier models, suggesting the effect is shrinking with architectural advances. It has not been eliminated across all models, and it remains a relevant factor when targeting platforms running older or smaller model variants. But it is becoming less of a constraint as the frontier models improve.
Figure: the U-shaped "lost in the middle" performance curve. X-axis: position in context window (start, middle, end). Y-axis: retrieval accuracy (%). Accuracy dips significantly in the middle, the degraded attention zone. Reference: Liu et al., 2023 (Stanford/Samaya AI).
Ch 5: Hallucination: What Happens Without Retrieval.
Because inference is entirely determined by the fixed weights from training, a model asked about something outside its training data (recent events, proprietary information, data after its knowledge cutoff) has no reliable mechanism to say "I don't know." Instead, it generates the most statistically plausible continuation of the prompt, which may bear no relationship to the truth.
This is hallucination: the model confidently generates false information because it's doing exactly what it was trained to do (predict the next token) without any grounding signal to constrain it to factual claims.
Retrieval-Augmented Generation (RAG), covered in depth in Article 3 of this series, is the architectural response to this problem. By injecting retrieved real-world content into the context window before generation, RAG systems give the model a factual grounding signal — retrieved passages that anchor the response to actual sources.
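At its simplest, the injection step is string assembly: retrieved passages go into the prompt ahead of the question. The sketch below shows the shape of it. The prompt wording, the passage dictionary fields, and the example.com URLs are all hypothetical, and each platform's real prompt format is its own.

```python
def build_grounded_prompt(question, passages):
    """Assemble a prompt that grounds the answer in retrieved passages.

    Each passage is a dict with 'text' and 'url' keys (an assumed shape).
    """
    lines = [
        "Answer using only the numbered sources below.",
        "If the sources do not contain the answer, say so.",
        "",
    ]
    for number, passage in enumerate(passages, start=1):
        lines.append(f"[{number}] {passage['text']} (source: {passage['url']})")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)

prompt = build_grounded_prompt(
    "What is a context window measured in?",
    [
        {"text": "The context window is measured in tokens.",
         "url": "https://example.com/context-windows"},
        {"text": "One token is roughly 0.75 English words.",
         "url": "https://example.com/tokens"},
    ],
)
```

What reaches the model is the passage text, not your page. If the passage doesn't carry its claim and its attribution on its own, neither makes it into the answer.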
The implication for your content: your pages need to be retrievable and injectable into context windows in order to influence AI responses. A page that's not indexed, not crawlable, or not structured for passage-level retrieval is invisible to the RAG systems that power the major AI search products.
Ch 6: What This Means for Getting Cited.
Pulling these mechanics together: the model generates responses based on what's in its context window at inference time. RAG systems populate that context window with retrieved passages from indexed content. The attention mechanism distributes its resources unevenly, favouring the beginning and end of the context and underweighting the middle. The model has a knowledge cutoff and can't reliably answer questions outside it without retrieval.
From these four facts, the content requirements follow directly:
- Your content must be indexed and crawlable (to enter the retrieval pipeline at all)
- Your content must chunk well (clear, self-contained passages, each making a coherent claim)
- Your key claims must appear at the start of passages, not buried after preamble (alignment with the U-shaped attention curve)
- Your content must be factually precise and clearly attributed (to survive whatever relevance and quality filtering a RAG system applies before injecting content into a prompt)
- Your content must provide information the model couldn't generate from training data alone (non-commodity, original, expert-led)
None of these are new content principles. All of them have a technical explanation that most GEO advice doesn't provide. The mechanism is what makes the advice durable: when the platforms update their retrieval systems, understanding the underlying mechanics is what lets you adapt rather than start over.
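The "key claim first" requirement is the easiest one to spot-check in an audit. The sketch below is a crude heuristic of my own, not a measure any platform uses: it tests whether the terms you consider central to a passage appear in its opening sentence. The function names and the naive sentence splitter are assumptions.

```python
import re

def first_sentence(passage):
    """Return the text up to the first sentence-ending punctuation mark."""
    parts = re.split(r"(?<=[.!?])\s+", passage.strip(), maxsplit=1)
    return parts[0]

def claim_leads(passage, key_terms):
    """True if every key term appears in the passage's opening sentence."""
    opening = first_sentence(passage).lower()
    return all(term.lower() in opening for term in key_terms)

good = "Context windows are measured in tokens. Here is why that matters."
buried = "People often ask about limits. Context windows are measured in tokens."

print(claim_leads(good, ["context window", "tokens"]))    # True
print(claim_leads(buried, ["context window", "tokens"]))  # False
```

It won't tell you whether the claim is any good, only whether it arrives before the preamble does.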
/ Next in the series
Article 3 explains RAG in full: the four-step pipeline, how each major platform implements it differently, and why semantic similarity replaced keyword matching.