Constitutional AI vs RLHF: Why the Alignment Method Affects What Gets Cited

When a practitioner asks how to get their content cited by Claude versus how to get it cited by ChatGPT, they are typically assuming the answer is the same. Write clearly, demonstrate expertise, provide value. Fine.

But the models are aligned differently. ChatGPT's alignment rests primarily on Reinforcement Learning from Human Feedback (RLHF). Claude's rests on Constitutional AI, which Anthropic applies alongside human feedback rather than entirely in place of it. These are not interchangeable approaches to the same problem. They encode different values into the model's behaviour, and those values affect, at the margin, what each model treats as trustworthy, citable, and well-sourced. This is the most technically ambitious article in this series, and the one most likely to be cited by practitioners who want to demonstrate genuine depth of understanding.

The alignment problem.

Pretraining a large language model on internet-scale text produces a model that is very good at predicting the next token. It does not produce a model that is reliably helpful, honest, or safe. The statistical patterns learned from that text include harmful content, misinformation, biased reasoning, and writing optimised to be engaging rather than accurate.

The alignment problem is the challenge of making a powerful, capable model behave in ways that are beneficial rather than harmful — reliable, honest, appropriately uncertain, and respectful of the user's interests. No training technique has solved this completely. But different approaches encode different values with different emphases, and those differences are meaningful for anyone trying to understand what each model favours when it cites sources.

RLHF: learning from human preferences.

Reinforcement Learning from Human Feedback is the alignment technique used to produce ChatGPT from GPT-3.5, and the primary alignment technique underlying the GPT model family. The process has three stages.

Stage 1: Supervised Fine-Tuning. Human trainers write examples of ideal responses to prompts. The model is fine-tuned on these examples, learning what good responses look like from human-authored demonstrations.

Stage 2: Reward Model Training. The model generates multiple responses to the same prompt. Human raters rank these responses by quality. A separate reward model is trained to predict which responses humans will prefer — essentially learning a human preference function from the ranking data.

Stage 3: RL Optimisation. The language model is further trained using reinforcement learning, with the reward model providing the reward signal. The model learns to generate responses that the reward model predicts humans will prefer.
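The reward-model step in Stage 2 can be made concrete with a small sketch. It assumes the common pairwise (Bradley-Terry style) formulation, in which the reward model is penalised when it scores the response raters rejected above the one they preferred. The function names and the reward values are illustrative assumptions, not taken from any specific implementation.

```python
import math


def preference_probability(reward_preferred: float, reward_rejected: float) -> float:
    """Probability, under the reward model, that raters pick the preferred response."""
    return 1.0 / (1.0 + math.exp(-(reward_preferred - reward_rejected)))


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human ranking given the two reward scores."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))


# Hypothetical reward scores for two responses to the same prompt.
agrees = pairwise_loss(reward_preferred=2.0, reward_rejected=0.5)
disagrees = pairwise_loss(reward_preferred=0.5, reward_rejected=2.0)

print(f"loss when the reward model agrees with raters:    {agrees:.3f}")
print(f"loss when the reward model disagrees with raters: {disagrees:.3f}")
```

The loss is small when the reward model already ranks the pair the way the raters did and large when it does not. Minimising it over many ranked pairs is what turns ranking data into a learned preference function.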

The result is a model that has learned to produce responses satisfying a human preference function. That preference function reflects what a particular group of human raters — typically English-speaking, from a specific demographic and cultural context — found preferable in training data gathered at a specific point in time.

RLHF is empirical: it encodes what humans prefer, as measured. It does not encode what humans should prefer, as reasoned. The distinction matters for edge cases — situations where raters' preferences are inconsistent, culturally specific, or reflect biases in the rater pool.

For content citation specifically: a model trained with RLHF is optimised toward responses that satisfy the human reader, meaning responses that feel helpful, authoritative, and clear. Content that is well-written, confident, and accessible tends to be cited by RLHF-aligned models because those are the properties that produce responses the reward model scores highly. Content that is heavily hedged, technically complex but inaccessible, or uncertain in ways that make the resulting answer less useful is less likely to be cited.

Constitutional AI: learning from principles.

Constitutional AI is the alignment technique developed by Anthropic for Claude. It was introduced in Anthropic's 2022 research paper and has been refined across subsequent model generations. The approach differs fundamentally from RLHF in its starting point: rather than learning from human preference ratings, Constitutional AI trains the model against an explicit set of principles — a constitution — that articulates how the model should behave, what values it should uphold, and what tradeoffs it should make when values conflict.

The process has two key stages.

Stage 1: Supervised Learning from Self-Critique. The model is given a prompt and generates an initial response. It is then given the constitution and asked to critique its own response against the constitutional principles — does the response violate any principles? Is it harmful, deceptive, or unhelpful in ways the constitution identifies? The model then revises its response based on its own critique. This revised response is used as training data.

Stage 2: Reinforcement Learning from AI Feedback (RLAIF). Rather than using human raters to rank model outputs, an AI feedback model compares candidate responses against the constitutional principles and picks the one that fits them better. These AI-generated preference labels train a preference model, which then supplies the reward signal for reinforcement learning. The result is alignment based on explicit, articulable principles rather than implicit human preferences.
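The self-critique loop in Stage 1 can be sketched as follows. This is a simplified illustration, not Anthropic's implementation: the two-item constitution, the prompt wording, and the `stub_model` stand-in are all hypothetical, and the loop applies every principle in turn where a real pipeline might sample one at a time.

```python
from typing import Callable, Dict

# Hypothetical two-principle constitution, for illustration only.
CONSTITUTION = [
    "Do not overstate certainty.",
    "Attribute factual claims to a source.",
]


def critique_and_revise(prompt: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    # The prompt paired with the final revision becomes supervised training data.
    return {"prompt": prompt, "completion": response}


def stub_model(text: str) -> str:
    """Stand-in for a language model call; returns canned placeholder strings."""
    if text.startswith("Critique"):
        return "placeholder critique"
    if text.startswith("Rewrite"):
        return "placeholder revised response"
    return "placeholder first draft"


example = critique_and_revise("An example user prompt", stub_model)
print(example)
```

The point of the structure is that only the revised response is kept. The first draft and the critiques are intermediate steps that never reach the training set.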

The constitution itself articulates values: the model should be helpful, honest, and harmless. It should avoid deception. It should acknowledge uncertainty. It should be direct rather than sycophantic. It should treat users as intelligent adults capable of handling honest answers.

For content citation specifically: a model trained with Constitutional AI places explicit weight on honesty and accuracy. Claude is specifically trained to acknowledge uncertainty, to avoid claiming knowledge it does not have, and to be honest about the limitations of its information. Content that is factually precise, appropriately attributed, and honest about what it does and does not claim is more aligned with Claude's training values than content that is overconfident, poorly sourced, or that overstates certainty.

The practical difference for content.

The difference between RLHF and Constitutional AI is most visible in how each model handles uncertainty and attribution.

An RLHF-aligned model is trained to produce responses that human raters prefer. Human raters often prefer confident, clear responses to hedged, uncertain ones — even when uncertainty is the epistemically correct stance. This can produce a slight systematic preference for confident-sounding sources, even when the confidence is not fully warranted.

A Constitutional AI-aligned model is trained against explicit principles that include appropriate expression of uncertainty. Claude is trained to say "I don't know" when it doesn't know, to express uncertainty when it is genuinely uncertain, and to prefer sources that are honest about the limits of their claims. This produces a slight systematic preference for content that is epistemically honest — content that says "this data is from Q1 2025 and may have been updated" rather than presenting potentially outdated information with unwarranted confidence.

For practitioners, the implication is small but real. Content written for Claude citation should be especially precise about the provenance and currency of its claims. Attributing statistics to specific sources with specific dates, acknowledging the limits of the data, and being honest about what is known versus inferred — these are not just good content principles. They are properties that Constitutional AI-aligned models are specifically trained to value.
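One way to make provenance and currency routine is to treat each statistic as a small record that cannot be published without a source and a date. The sketch below is only an illustration of that habit: the `AttributedClaim` class, its field names, and the sample values are hypothetical, and nothing here is a format that either model is known to require.

```python
from dataclasses import dataclass


@dataclass
class AttributedClaim:
    statement: str          # the claim itself, written as a full sentence
    source: str             # who published the underlying data
    as_of: str              # period the data covers, e.g. "Q1 2025"
    inferred: bool = False  # True when this is the author's inference, not reported data

    def render(self) -> str:
        """Return the claim with its source, date, and a currency caveat."""
        basis = "Our inference from" if self.inferred else "According to"
        return (
            f"{basis} {self.source} ({self.as_of}): {self.statement} "
            f"This data is from {self.as_of} and may have been updated."
        )


# Placeholder values; substitute a real statistic and its real source.
reported = AttributedClaim(
    statement="[statistic goes here].",
    source="[publisher name]",
    as_of="Q1 2025",
)
estimate = AttributedClaim(
    statement="[derived estimate goes here].",
    source="[publisher name]",
    as_of="Q1 2025",
    inferred=True,
)

print(reported.render())
print(estimate.render())
```

Separating reported figures from inferences in the record itself keeps the distinction between what is known and what is inferred visible in the published sentence.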

Neither approach is perfect — and both converge.

It would be misleading to conclude that one alignment approach is straightforwardly superior. RLHF has produced models that are exceptionally well-calibrated to human preferences across a vast range of tasks. The human preference data, even with the rater-pool limits noted earlier, captures a broad range of what makes responses good in practice. Models trained with RLHF are often better at matching the register and tone of responses to context, because human raters naturally prefer tonally appropriate responses.

Constitutional AI produces models that are more transparent about their values — you can read Claude's constitution and understand what principles the model was trained against. The explicit principle set means Constitutional AI alignment is more interpretable than RLHF, where the preference function is learned rather than stated. But constitutions are written by humans and reflect the values of their authors. The claim that Constitutional AI is more "objective" than RLHF overstates the case.

In practice, both approaches are continually refined. GPT-5 incorporates alignment techniques that have evolved significantly beyond the original InstructGPT RLHF implementation. Claude's constitution is updated periodically. The models converge in many respects even as their fundamental alignment approaches differ.

The honest answer for practitioners is that the alignment method is a secondary signal compared to the universal RAG requirements — crawlability, factual precision, clear structure, non-commodity value. A site that is not indexed cannot be cited by either model. A site with vague, poorly structured content will not be cited well by either. But at the margin, where two roughly equivalent pieces of content compete to be cited:

For ChatGPT: content that is confident, clear, well-written, and satisfying to read has a slight systematic advantage from RLHF alignment.
For Claude: content that is epistemically careful, precisely attributed, and honest about the limits of its claims has a slight systematic advantage from Constitutional AI alignment.

The deepest practical implication is this: the best content for both models is the same. Content that is factually precise, clearly structured, appropriately attributed, honest about uncertainty, and genuinely valuable to the reader satisfies the requirements of both alignment approaches. They converge on the same content properties because they are both attempts to solve the same underlying problem — making AI systems reliably helpful and honest.

Write content that is true. Write content that is clear. Write content that acknowledges what it does not know. Attribute your claims. Provide value that the model cannot generate without you. That is the content that gets cited. The alignment method tells you why.

This is the final article in the How AI Search Works series.

Sources

Anthropic, "Claude's Constitution," 2023 — anthropic.com/news/claudes-constitution
Anthropic, Research — anthropic.com/research
OpenAI, "Training Language Models to Follow Instructions with Human Feedback (InstructGPT)," 2022 — openai.com/index/instruction-following
Wikipedia: Reinforcement learning from human feedback — en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback
Wikipedia: Claude (language model) — en.wikipedia.org/wiki/Claude_(language_model)
Wikipedia: Anthropic — en.wikipedia.org/wiki/Anthropic
Wikipedia: ChatGPT — en.wikipedia.org/wiki/ChatGPT
Wikipedia: Large language model — en.wikipedia.org/wiki/Large_language_model
Wikipedia: Generative pre-trained transformer — en.wikipedia.org/wiki/Generative_pre-trained_transformer
Vaswani et al., "Attention Is All You Need," Google Brain / Google Research, 2017 — research.google/pubs/attention-is-all-you-need

Thomas Cox
