
TruthLens: A Cognitive Immune System for Real-Time Hallucination Detection and Mitigation in Large Language Models

Version: 1.0 | Authors: vLLM Semantic Router Team | Date: December 2025


Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their tendency to generate hallucinations—fluent but factually incorrect or ungrounded content—remains a critical barrier to enterprise AI adoption. Industry surveys consistently show that hallucination risks are among the top concerns preventing organizations from deploying LLM-powered applications in production environments, particularly in high-stakes domains such as healthcare, finance, and legal services.

We propose TruthLens, a real-time hallucination detection and mitigation framework integrated into the vLLM Semantic Router. By positioning hallucination control at the inference gateway layer, TruthLens provides a model-agnostic, centralized solution that addresses the "accuracy-latency-cost" triangle through configurable mitigation strategies. Users can select from three operational modes based on their tolerance for cost and accuracy trade-offs: (1) Lightweight Mode—single-round detection with warning injection, (2) Standard Mode—iterative self-refinement with the same model, and (3) Premium Mode—multi-model cross-verification and collaborative correction. This design enables organizations to deploy trustworthy AI systems while maintaining control over operational costs and response latency.


1. Introduction: The Hallucination Crisis in Enterprise AI

1.1 The Core Problem

Hallucinations represent the most significant barrier to enterprise AI adoption today. Unlike traditional software bugs, LLM hallucinations are:

  • Unpredictable: They occur randomly across different queries and contexts
  • Convincing: Hallucinated content often appears fluent, confident, and plausible
  • High-stakes: A single hallucination in medical, legal, or financial domains can cause irreversible harm
  • Invisible: Without specialized detection, users cannot distinguish hallucinations from accurate responses

Industry Impact by Domain:

| Domain | Hallucination Risk Tolerance | Typical Mitigation Approach |
|---|---|---|
| Healthcare | Near-zero (life-critical) | Mandatory human verification, liability concerns |
| Financial Services | Very low (regulatory) | Compliance-driven review processes |
| Legal | Very low (liability) | Restricted to internal research and drafting |
| Customer Support | Moderate | Escalation protocols for uncertain responses |
| Creative/Marketing | High tolerance | Minimal intervention required |

Note: Based on enterprise deployment patterns observed across industry surveys (McKinsey 2024, Gartner 2024, Menlo Ventures 2024).

1.2 Why Existing Solutions Fall Short

Current approaches to hallucination mitigation operate at the wrong layer of the AI stack:

1.3 Why vLLM Semantic Router is the Ideal Solution Point

The vLLM Semantic Router occupies a unique position in the AI infrastructure stack that makes it ideally suited for hallucination mitigation:

Key Advantages of Gateway-Level Hallucination Control:

| Advantage | Description |
|---|---|
| Model-Agnostic | Works with any LLM backend without modification |
| Centralized Policy | Single configuration point for all applications |
| Cost Control | Organization-wide visibility into accuracy vs. cost trade-offs |
| Incremental Adoption | Enable per-decision, per-domain policies |
| Observability | Unified metrics, logging, and alerting for hallucination events |
| Defense in Depth | Complements (not replaces) RAG and prompt engineering |

1.4 Formal Problem Definition

We formalize hallucination detection in Retrieval-Augmented Generation (RAG) systems as a token-level sequence labeling problem.

Definition 1 (RAG Context). Let a RAG interaction be defined as a tuple (C, Q, R) where:

  • C = {c₁, c₂, ..., cₘ} is the retrieved context (set of documents/passages)
  • Q is the user query
  • R = (r₁, r₂, ..., rₙ) is the generated response as a sequence of n tokens

Definition 2 (Grounded vs. Hallucinated Tokens). A token rᵢ in response R is:

  • Grounded if there exists evidence in C that supports the claim containing rᵢ
  • Hallucinated if rᵢ contributes to a claim that:
    • (a) Contradicts information in C (contradiction hallucination), or
    • (b) Cannot be verified from C and is not common knowledge (ungrounded hallucination)

Definition 3 (Hallucination Detection Function). The detection task is to learn a function:

f: (C, Q, R) → Y

where Y = (y₁, y₂, ..., yₙ) and yᵢ ∈ {0, 1} indicates whether token rᵢ is hallucinated.

Definition 4 (Hallucination Score). Given predictions Y and confidence scores P = (p₁, ..., pₙ) where pᵢ = P(yᵢ = 1), we define:

  • Token-level score: s_token(rᵢ) = pᵢ
  • Span-level score: For a contiguous span S = (rᵢ, ..., rⱼ), s_span(S) = max(pᵢ, ..., pⱼ)
  • Response-level score: s_response(R) = 1 - ∏(1 - pᵢ) for all i where pᵢ > τ_token

Definition 5 (Mitigation Decision). Given threshold τ, the system takes action:

Action(R) =
PASS if s_response(R) < τ
MITIGATE if s_response(R) ≥ τ
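The scoring and decision rules above can be made concrete with a short sketch. The code below implements the noisy-OR aggregation of Definition 4 and the threshold rule of Definition 5; the thresholds shown are illustrative defaults, not mandated values.

```python
# Sketch of Definitions 4-5: noisy-OR aggregation of token-level probabilities
# into a response-level score, followed by the PASS/MITIGATE decision.
def response_score(token_probs: list[float], tau_token: float = 0.5) -> float:
    """s_response(R) = 1 - prod(1 - p_i) over tokens with p_i > tau_token."""
    keep = 1.0
    for p in token_probs:
        if p > tau_token:
            keep *= (1.0 - p)
    return 1.0 - keep

def action(token_probs: list[float], tau: float = 0.6) -> str:
    """Definition 5: MITIGATE when the response-level score reaches the threshold."""
    return "MITIGATE" if response_score(token_probs) >= tau else "PASS"

# A single high-confidence hallucinated token is enough to trigger mitigation.
print(action([0.1, 0.2, 0.9]))  # -> MITIGATE
```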

2. Related Work

2.1 Taxonomy of Hallucination Types

Before reviewing detection methods, we establish a taxonomy of hallucination types:

Type 1: Intrinsic Hallucination — Generated content contradicts the provided context.

Example: Context says "The meeting is on Tuesday." Response says "The meeting is scheduled for Wednesday."

Type 2: Extrinsic Hallucination — Generated content cannot be verified from the context and is not common knowledge.

Example: Context discusses a company's Q3 earnings. Response includes Q4 projections not mentioned anywhere.

Type 3: Fabrication — Entirely invented entities, citations, or facts.

Example: "According to Smith et al. (2023)..." where no such paper exists.

| Type | Detection Difficulty | Mitigation Approach |
|---|---|---|
| Intrinsic | Easier (direct contradiction) | Context re-grounding |
| Extrinsic | Medium (requires knowledge boundary) | Uncertainty expression |
| Fabrication | Harder (requires external verification) | Cross-reference checking |

2.2 Detection Methods

| Category | Representative Work | Mechanism | Accuracy | Latency | Cost |
|---|---|---|---|---|---|
| Encoder-Based | LettuceDetect (2025), Luna (2025) | Token classification with ModernBERT/DeBERTa | F1: 75-79% | 15-35ms | Low |
| Self-Consistency | SelfCheckGPT (2023) | Multiple sampling + consistency check | Varies | Nx base | High |
| Cross-Model | Finch-Zk (2025) | Multi-model response comparison | F1: +6-39% | 2-3x base | High |
| Internal States | MIND (ACL 2024) | Hidden layer activation analysis | High | <10ms | Requires instrumentation |

2.2.1 Encoder-Based Detection (Deep Dive)

LettuceDetect (Kovács et al., 2025) frames hallucination detection as token-level sequence labeling:

  • Architecture: ModernBERT-large (395M parameters) with classification head
  • Input: Concatenated [Context, Query, Response] with special tokens
  • Output: Per-token probability of hallucination
  • Training: Fine-tuned on RAGTruth dataset (18K examples)
  • Key Innovation: Long-context handling (8K tokens) enables full RAG context inclusion

Performance on RAGTruth Benchmark:

| Model | Token F1 | Example F1 | Latency |
|---|---|---|---|
| LettuceDetect-large | 79.22% | 74.8% | ~30ms |
| LettuceDetect-base | 76.5% | 71.2% | ~15ms |
| Luna (DeBERTa) | 73.1% | 68.9% | ~25ms |
| GPT-4 (zero-shot) | 61.2% | 58.4% | ~2s |

Why Encoder-Based for TruthLens: The combination of high accuracy, low latency, and fixed cost makes encoder-based detection ideal for gateway-level deployment.
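As an illustration of how an encoder-based detector slots into a gateway, the sketch below runs a generic HuggingFace token-classification checkpoint over the packed [Context, Query, Response] input. The model path, separator scheme, and the assumption that label index 1 means "hallucinated" are placeholders, not the actual TruthLens artifacts.

```python
# Illustrative gateway-side detection pass with a token-classification encoder.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_PATH = "path/to/hallucination-token-classifier"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForTokenClassification.from_pretrained(MODEL_PATH).eval()

def detect(context: str, query: str, response: str) -> list[tuple[str, float]]:
    """Return (token, P(hallucinated)) for every token in the packed input;
    a production detector would keep only the response segment."""
    text = f"{context} {tokenizer.sep_token} {query} {tokenizer.sep_token} {response}"
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        logits = model(**enc).logits                # shape: (1, L, num_labels)
    probs = torch.softmax(logits, dim=-1)[0, :, 1]  # assumed: label 1 = hallucinated
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, probs.tolist()))
```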

2.2.2 Self-Consistency Methods

SelfCheckGPT (Manakul et al., 2023) exploits the observation that hallucinations are inconsistent across samples:

  • Mechanism: Generate N responses, measure consistency
  • Intuition: Factual content is reproducible; hallucinations vary
  • Limitation: Requires N LLM calls (typically N=5-10)

Theoretical Basis: Factual content has high generation probability and therefore reappears across most samples, while any particular hallucination has low per-sample probability and rarely repeats.
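A simplified sketch of this sampling-and-consistency idea follows; `sample_llm` and `supports` are hypothetical helpers standing in for the re-sampling call and for whichever consistency scorer (NLI, QA, or n-gram overlap) is used.

```python
# Simplified SelfCheckGPT-style consistency scorer.
from typing import Callable

def selfcheck_scores(
    sentences: list[str],
    sample_llm: Callable[[], str],
    supports: Callable[[str, str], bool],
    n_samples: int = 5,
) -> list[float]:
    """Per sentence, return the fraction of samples that fail to support it;
    higher values suggest the sentence is more likely hallucinated."""
    samples = [sample_llm() for _ in range(n_samples)]
    return [
        sum(1 for s in samples if not supports(s, sent)) / n_samples
        for sent in sentences
    ]
```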

2.2.3 Cross-Model Verification

Finch-Zk (2025) leverages model diversity:

  • Mechanism: Compare responses from different model families
  • Key Insight: Different models hallucinate differently
  • Segment-Level Correction: Replace inconsistent segments with higher-confidence version

2.3 Mitigation Strategies

| Strategy | Representative Work | Mechanism | Effectiveness | Overhead |
|---|---|---|---|---|
| Self-Refinement | Self-Refine (NeurIPS 2023) | Iterative feedback loop | 40-60% reduction | 2-4x latency |
| Chain-of-Verification | CoVe (ACL 2024) | Generate verification questions | 50-70% reduction | 3-5x latency |
| Multi-Agent Debate | MAD (2024) | Multiple agents argue and converge | 60-80% reduction | 5-10x latency |
| Cross-Model Correction | Finch-Zk (2025) | Targeted segment replacement | Up to 9% accuracy gain | 3x latency |

2.3.1 Self-Refinement (Deep Dive)

Self-Refine (Madaan et al., NeurIPS 2023) demonstrates that LLMs can improve their own outputs:

Loop:
1. Generate initial response R₀
2. Generate feedback F on R₀ (same model)
3. Generate refined response R₁ using F
4. Repeat until convergence or max iterations

Key Findings:

  • Works best when feedback is specific (not just "improve this")
  • Diminishing returns after 2-3 iterations
  • Requires the model to have the knowledge to correct itself

Limitation for Hallucination: If the model lacks the correct knowledge, self-refinement may not help or may introduce new errors.

2.3.2 Chain-of-Verification (CoVe)

CoVe (Dhuliawala et al., ACL 2024) generates verification questions:

1. Generate response R
2. Extract factual claims from R
3. For each claim, generate verification question
4. Answer verification questions using context
5. Revise R based on verification results

Advantage: Explicit verification step catches subtle errors. Disadvantage: High latency (3-5x) due to multi-step process.
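A minimal sketch of the CoVe pipeline is shown below, assuming a hypothetical text-in/text-out `llm` helper; the prompts are illustrative rather than the paper's exact wording.

```python
# Minimal Chain-of-Verification flow (illustrative prompts, hypothetical llm()).
from typing import Callable

def chain_of_verification(llm: Callable[[str], str], context: str, query: str) -> str:
    response = llm(f"Context:\n{context}\n\nQuestion: {query}")
    claims = [c for c in llm(f"List the factual claims in:\n{response}").splitlines() if c.strip()]
    findings = []
    for claim in claims:
        question = llm(f"Write one question that verifies: {claim}")
        answer = llm(f"Answer strictly from the context.\nContext:\n{context}\nQuestion: {question}")
        findings.append(f"Claim: {claim}\nVerification: {answer}")
    return llm(
        "Revise the response so it is consistent with the verification results.\n"
        f"Response:\n{response}\n\nVerification results:\n" + "\n".join(findings)
    )
```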

2.3.3 Multi-Agent Debate

Multi-Agent Debate (Du et al., 2024) uses multiple LLM instances:

1. Multiple agents generate responses
2. Agents critique each other's responses
3. Agents revise based on critiques
4. Repeat for N rounds
5. Synthesize final response

Theoretical Advantage: Diverse perspectives catch blind spots. Practical Challenge: High cost (5-10x) and latency.

2.4 The Accuracy-Latency-Cost Triangle

Research consistently shows a fundamental trade-off:

Key Insight: No single approach optimizes all three dimensions. TruthLens addresses this by offering user-selectable operational modes that let organizations choose their position on this trade-off triangle.


3. Theoretical Foundations

This section establishes the theoretical basis for TruthLens's three-mode architecture, drawing from sequence labeling, iterative optimization, ensemble learning, and multi-agent systems theory.

3.1 Hallucination Detection as Sequence Labeling

3.1.1 Token Classification Architecture

Modern hallucination detection leverages transformer-based encoders fine-tuned for token classification. Given input sequence X = [CLS] C [SEP] Q [SEP] R [SEP], the encoder produces contextualized representations:

H = Encoder(X) ∈ ℝ^(L×d)

where L is sequence length and d is hidden dimension. For each token rᵢ in the response, we compute:

P(yᵢ = 1|X) = σ(W · hᵢ + b)

where W ∈ ℝ^d, b ∈ ℝ are learned parameters and σ is the sigmoid function.

3.1.2 Why ModernBERT for Detection

The choice of encoder architecture significantly impacts detection quality. We adopt ModernBERT (Warner et al., 2024) for the following theoretical advantages:

| Property | ModernBERT | Traditional BERT | Impact on Detection |
|---|---|---|---|
| Context Length | 8,192 tokens | 512 tokens | Handles full RAG context without truncation |
| Attention | Rotary Position Embeddings (RoPE) | Absolute positional | Better long-range dependency modeling |
| Architecture | GeGLU activations, no biases | GELU, with biases | Improved gradient flow for fine-grained classification |
| Efficiency | Flash Attention, Unpadding | Standard attention | 2x inference speedup enables real-time detection |

3.1.3 Scoring Function Design

The aggregation from token-level to response-level scores requires careful design. We propose a noisy-OR aggregation model:

s_response(R) = 1 - ∏ᵢ(1 - pᵢ · 𝟙[pᵢ > τ_token])

Theoretical Justification: The noisy-OR model assumes independence between hallucination events at different tokens. While this is an approximation, it provides:

  1. Monotonicity: Adding a hallucinated token never decreases the response score
  2. Sensitivity: Single high-confidence hallucination triggers detection
  3. Calibration: Score approximates P(∃ hallucination in R)

Alternative: Span-Based Aggregation

For correlated hallucinations (common in fabricated entities), we first group contiguous hallucinated tokens into spans, then aggregate:

s_response(R) = max{s_span(S₁), s_span(S₂), ..., s_span(Sₖ)}

This reduces sensitivity to tokenization artifacts and focuses on semantic units.
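A small sketch of this span-based aggregation, grouping contiguous flagged tokens into spans and scoring the response by its strongest span:

```python
# Span-based aggregation: group contiguous flagged tokens into spans, score
# each span by its maximum probability, and take the strongest span.
def hallucinated_spans(token_probs: list[float], tau_token: float = 0.5):
    """Return (start, end, score) triples for contiguous runs with p_i > tau_token."""
    spans, start = [], None
    for i, p in enumerate(token_probs):
        if p > tau_token and start is None:
            start = i
        elif p <= tau_token and start is not None:
            spans.append((start, i, max(token_probs[start:i])))
            start = None
    if start is not None:
        spans.append((start, len(token_probs), max(token_probs[start:])))
    return spans

def span_level_response_score(token_probs: list[float], tau_token: float = 0.5) -> float:
    spans = hallucinated_spans(token_probs, tau_token)
    return max((score for _, _, score in spans), default=0.0)
```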

3.1.4 Threshold Selection Theory

The detection threshold τ controls the precision-recall trade-off. From decision theory:

Proposition 1 (Optimal Threshold). Given cost ratio λ = C_FN / C_FP (cost of false negative vs. false positive), the optimal threshold satisfies:

τ* = 1 / (1 + λ · (1-π)/π)

where π is the prior probability of hallucination.

Practical Implications:

| Domain | λ (Cost Ratio) | Recommended τ | Rationale |
|---|---|---|---|
| Medical | 10-100 | 0.3-0.5 | Missing a hallucination is catastrophic |
| Financial | 5-20 | 0.4-0.6 | Regulatory risk from false information |
| Customer Support | 1-2 | 0.6-0.7 | Balance user experience and accuracy |
| Creative | 0.1-0.5 | 0.8-0.9 | Over-flagging harms creativity |

3.2 Self-Refinement Theory

3.2.1 Iterative Refinement as Fixed-Point Iteration

Standard Mode employs iterative self-refinement, which can be formalized as seeking a fixed point of a refinement operator.

Definition 6 (Refinement Operator). Let T: R → R be the refinement operator where:

T(Rₜ) = LLM(Prompt_refine(C, Q, Rₜ, Detect(Rₜ)))

The iteration proceeds as: R₀ → R₁ → R₂ → ... → R*

Theorem 1 (Convergence Conditions). The refinement sequence {Rₜ} converges to a fixed point R* if:

  1. The hallucination score sequence {s(Rₜ)} is monotonically non-increasing
  2. The score is bounded below (s(R) ≥ 0)
  3. The LLM exhibits consistency: similar prompts yield similar outputs

Proof Sketch: Conditions 1 and 2 ensure the score sequence converges by the Monotone Convergence Theorem. Condition 3 (LLM consistency) ensures the response sequence itself converges, not just the scores.
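The fixed-point iteration can be sketched as a bounded loop with an early stop on convergence or on a monotonicity violation. `llm`, `detect`, and `build_refine_prompt` are hypothetical helpers; the defaults mirror the Standard Mode settings in Section 6.

```python
# Bounded fixed-point refinement loop (sketch under the stated assumptions).
def refine(llm, detect, build_refine_prompt, context, query,
           max_iterations: int = 3, convergence_threshold: float = 0.4):
    response = llm(f"Context:\n{context}\n\nQuestion: {query}")
    score, spans = detect(context, query, response)
    for _ in range(max_iterations):
        if score < convergence_threshold:
            break  # low enough: treat the current response as the fixed point
        candidate = llm(build_refine_prompt(context, query, response, spans))
        new_score, new_spans = detect(context, query, candidate)
        if new_score >= score:
            break  # monotonicity violated: keep the previous response
        response, score, spans = candidate, new_score, new_spans
    return response, score
```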

3.2.2 Convergence Rate Analysis

Empirical Observation: Self-refinement typically exhibits sublinear convergence:

s(Rₜ) - s(R*) ≤ O(1/t)

This is because:

  1. Easy hallucinations (explicit contradictions) are corrected in early iterations
  2. Hard hallucinations (subtle ungrounded claims) may persist or oscillate
  3. Diminishing returns after 2-3 iterations in practice

3.2.3 Prompt Engineering Principles for Correction

Effective refinement prompts must satisfy several theoretical properties:

Principle 1 (Specificity): The prompt must identify which spans are hallucinated, not just that hallucination exists.

Principle 2 (Grounding): The prompt must provide the original context C to enable fact-checking.

Principle 3 (Preservation): The prompt must instruct the model to preserve accurate content.

Principle 4 (Uncertainty): When correction is not possible, the model should express uncertainty rather than fabricate alternatives.

Refinement Prompt Template Structure:

Given:
- Context: [Retrieved passages C]
- Query: [User question Q]
- Response: [Current response Rₜ with hallucinated spans marked]

The following spans may be hallucinated: [List of (span, confidence)]

Instructions:
1. For each flagged span, verify against the context
2. If contradicted: correct using context evidence
3. If unverifiable and not common knowledge: remove or qualify with uncertainty
4. Preserve all accurate, well-grounded content
5. Maintain coherent narrative flow
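One possible rendering of this template as a helper function is shown below; the exact wording used in production may differ.

```python
# Illustrative rendering of the refinement prompt template above.
def build_refine_prompt(context: str, query: str, response: str,
                        flagged_spans: list[tuple[str, float]]) -> str:
    spans = "\n".join(f'- "{text}" (confidence {conf:.2f})' for text, conf in flagged_spans)
    return (
        f"Given:\n- Context: {context}\n- Query: {query}\n- Response: {response}\n\n"
        f"The following spans may be hallucinated:\n{spans}\n\n"
        "Instructions:\n"
        "1. For each flagged span, verify against the context\n"
        "2. If contradicted: correct using context evidence\n"
        "3. If unverifiable and not common knowledge: remove or qualify with uncertainty\n"
        "4. Preserve all accurate, well-grounded content\n"
        "5. Maintain coherent narrative flow"
    )
```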

3.3 Multi-Model Collaboration Theory

Premium Mode leverages multiple LLMs for cross-verification. We ground this in ensemble learning and multi-agent debate theory.

3.3.1 Ensemble Learning Perspective

Theorem 2 (Diversity-Accuracy Trade-off). For an ensemble of M models with individual error rate ε and pairwise correlation ρ, the ensemble error rate under majority voting is:

ε_ensemble ≈ ε · (1 + (M-1)ρ) / M when ε < 0.5

Corollary: Ensemble error approaches zero as M → ∞ only if ρ < 1 (models are diverse).
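A quick numeric check of Theorem 2's approximation (illustrative values only):

```python
# Worked example of the ensemble error approximation from Theorem 2.
def ensemble_error(eps: float, rho: float, m: int) -> float:
    return eps * (1 + (m - 1) * rho) / m

print(round(ensemble_error(0.20, 0.3, 3), 3))  # three diverse models: ~0.107
print(round(ensemble_error(0.20, 0.9, 3), 3))  # three highly correlated models: ~0.187
```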

Implications for TruthLens:

| Model Combination | Expected Diversity (1-ρ) | Error Reduction |
|---|---|---|
| Same model family (GPT-4 variants) | Low (0.2-0.4) | 10-20% |
| Different families (GPT-4 + Claude) | Medium (0.4-0.6) | 30-50% |
| Different architectures (Transformer + other) | High (0.6-0.8) | 50-70% |

3.3.2 Multi-Agent Debate Framework

Beyond simple voting, multi-agent debate enables models to argue about factual claims and converge on truth.

Definition 7 (Argumentation Framework). An argumentation framework is a pair AF = (A, →) where:

  • A is a set of arguments (factual claims from each model)
  • → ⊆ A × A is an attack relation (contradictions between claims)

Definition 8 (Grounded Extension). The grounded extension E of AF is the minimal complete extension: starting from the unattacked arguments, it iteratively adds every argument that the current set defends against all attacks.

Multi-Agent Debate Protocol:

3.3.3 Consensus Mechanisms

Mechanism 1: Majority Voting

y_final(token) = argmax_y |{m : f_m(token) = y}|

  • Simple, fast
  • Requires odd number of models
  • Does not account for model confidence

Mechanism 2: Weighted Confidence Aggregation

p_final(token) = Σₘ wₘ · pₘ(token) / Σₘ wₘ

where wₘ is model m's calibrated reliability weight.

  • Accounts for varying model expertise
  • Requires calibrated confidence scores

Mechanism 3: Segment-Level Replacement (Finch-Zk)

For each claim segment S in response R₁:

  1. Check if S appears (semantically) in R₂
  2. If consistent: keep S
  3. If inconsistent: replace with version from more reliable model
  4. If only in R₁: flag as potentially hallucinated

This mechanism achieves fine-grained correction without full response regeneration.
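Mechanisms 1 and 2 reduce to a few lines; the sketch below assumes calibrated reliability weights are available.

```python
# Sketch of Mechanism 1 (per-token majority vote) and Mechanism 2
# (confidence-weighted aggregation).
from collections import Counter

def majority_vote(labels: list[int]) -> int:
    """labels[m] is model m's 0/1 prediction for one token (odd M recommended)."""
    return Counter(labels).most_common(1)[0][0]

def weighted_confidence(probs: list[float], weights: list[float]) -> float:
    """p_final = sum(w_m * p_m) / sum(w_m)."""
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)

print(majority_vote([1, 0, 1]))                     # -> 1
print(weighted_confidence([0.9, 0.2], [0.7, 0.3]))  # -> 0.69
```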

3.4 Theoretical Justification for Three-Mode Architecture

3.4.1 Pareto Frontier Analysis

The Accuracy-Latency-Cost space admits a Pareto frontier: points where improving one dimension requires sacrificing another.

Proposition 2 (Three Operating Points). The Pareto frontier in the A-L-C space has three natural "knee points" corresponding to:

  1. Cost-dominated regime (Lightweight): Minimal intervention, detection-only
  2. Balanced regime (Standard): Moderate refinement, single-model
  3. Accuracy-dominated regime (Premium): Maximum verification, multi-model

3.4.2 Why Not Continuous Control?

One might ask: why discrete modes rather than continuous parameters?

Argument 1 (Cognitive Load): Users cannot effectively reason about continuous trade-offs. Three discrete modes map to intuitive concepts: "fast/cheap," "balanced," "best quality."

Argument 2 (Operational Complexity): Each mode involves qualitatively different mechanisms (detection-only vs. iteration vs. multi-model). Intermediate points would require complex interpolation.

Argument 3 (Empirical Gaps): The Pareto frontier is not smooth—there are natural gaps where intermediate configurations offer little benefit over the nearest discrete mode.

3.4.3 Mode Selection as Online Learning

In production, mode selection can be formulated as a multi-armed bandit problem:

  • Arms: {Lightweight, Standard, Premium}
  • Reward: User satisfaction (proxy: no negative feedback)
  • Cost: Latency + API costs

Thompson Sampling approach: Maintain Beta distributions over success probability for each mode, sample and select, update based on outcome. This enables adaptive mode selection per query type.
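A minimal Thompson Sampling sketch for this bandit formulation follows; the per-mode cost penalties are illustrative assumptions rather than measured values.

```python
# Thompson Sampling over the three modes. Reward is a binary success signal
# (e.g. "no negative user feedback"); cost penalties below are assumed.
import random

class ModeSelector:
    COSTS = {"lightweight": 0.0, "standard": 0.3, "premium": 0.6}  # assumed penalties

    def __init__(self) -> None:
        # Beta(1, 1) prior over each mode's success probability
        self.params = {mode: [1.0, 1.0] for mode in self.COSTS}

    def select(self) -> str:
        def sampled_value(mode: str) -> float:
            alpha, beta = self.params[mode]
            return random.betavariate(alpha, beta) - self.COSTS[mode]
        return max(self.params, key=sampled_value)

    def update(self, mode: str, success: bool) -> None:
        self.params[mode][0 if success else 1] += 1.0

selector = ModeSelector()
mode = selector.select()
selector.update(mode, success=True)
```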


4. System Architecture

4.1 High-Level Architecture

TruthLens integrates into the vLLM Semantic Router's ExtProc pipeline, creating a comprehensive request-response security boundary:

4.2 Detection Flow

The hallucination detection process operates on the complete context-query-response triple:


5. User Strategy Options: The Cost-Accuracy Spectrum

TruthLens provides three operational modes that allow organizations to position themselves on the cost-accuracy trade-off spectrum based on their specific requirements.

5.1 Strategy Overview

5.2 Mode Comparison Matrix

| Dimension | 🟢 Lightweight | 🟡 Standard | 🔴 Premium |
|---|---|---|---|
| Primary Goal | Cost efficiency | Balanced | Maximum accuracy |
| Detection Method | Single encoder pass | Encoder + self-check | Multi-model cross-verification |
| Mitigation Action | Warning injection | Iterative self-refinement | Multi-model collaborative correction |
| Latency Overhead | +15-35ms | +200-500ms (2-4x) | +1-3s (5-10x) |
| Cost Multiplier | 1.0x (detection only) | 1.5-2.5x | 3-5x |
| Hallucination Reduction | Awareness only | 40-60% | 70-85% |
| Best For | Internal tools, chatbots | Business applications | Medical, legal, financial |

5.3 Lightweight Mode: Cost-Optimized Detection

Philosophy: Minimize operational cost while providing hallucination awareness. This mode treats hallucination detection as an information service rather than an intervention system.

5.3.1 Theoretical Basis

Lightweight Mode is grounded in Bounded Rationality Theory (Simon, 1955): when optimization costs exceed benefits, satisficing (accepting "good enough") is rational.

Cost-Benefit Analysis:

Let C_detect = cost of detection, C_mitigate = cost of mitigation, p = probability of hallucination, L = expected loss from undetected hallucination.

Lightweight Mode is optimal when: C_detect < p · L but C_detect + C_mitigate > p · L

In other words: detection is worth the cost, but full mitigation is not.
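Illustrative arithmetic (assumed numbers): with a 1% hallucination rate and a $5 expected loss per undetected hallucination, p · L = $0.05 per request. A $0.001 detection pass clears this bar, while adding a $0.10 mitigation pass does not ($0.101 > $0.05), so flagging without correcting is the rational policy.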

5.3.2 Mechanism

Characteristics:

  • No additional LLM calls after initial generation
  • Fixed detection cost regardless of response length
  • User-facing warning empowers human verification
  • Rich metadata for downstream analytics

5.3.3 Theoretical Guarantees

Proposition 3 (Detection Latency Bound). For ModernBERT-large with sequence length L ≤ 8192:

T_detect ≤ O(L²/chunk_size) + O(L · d)

In practice: T_detect ≤ 35ms for L ≤ 4096 on modern GPUs.

Proposition 4 (No False Negatives on Pass-Through). In Lightweight Mode, all hallucinations above threshold τ are flagged. The mode never suppresses detection results.

5.3.4 Ideal Use Cases

  • Internal knowledge bases (users can verify)
  • Developer assistants (technical users)
  • Creative writing tools (hallucination may be desired)
  • Low-stakes customer interactions (human escalation available)

5.4 Standard Mode: Balanced Self-Refinement

Philosophy: Leverage the same model to self-correct detected hallucinations through iterative refinement. This mode implements a closed-loop feedback system where the LLM serves as both generator and corrector.

5.4.1 Theoretical Basis

Standard Mode is grounded in Self-Consistency Theory and Iterative Refinement:

Theorem 3 (Self-Refinement Effectiveness). If an LLM has learned the correct answer distribution for a query class, then prompting with explicit error feedback increases the probability of correct output:

P(correct | feedback on error) > P(correct | no feedback)

provided the feedback is accurate and actionable.

Intuition: LLMs often "know" the right answer but fail to produce it on first attempt due to:

  • Sampling noise (temperature > 0)
  • Attention to wrong context regions
  • Competing patterns in weights

Explicit error feedback redirects attention and suppresses incorrect patterns.

5.4.2 Convergence Analysis

Definition 9 (Refinement Sequence). The sequence {Rₜ} for t = 0, 1, 2, ... where:

R₀ = LLM(Q, C) (initial response)
Rₜ₊₁ = LLM(Prompt_refine(Q, C, Rₜ, Detect(Rₜ))) (refined response)

Lemma 1 (Monotonic Score Decrease). Under mild assumptions (consistent LLM, accurate detection), the hallucination score sequence is non-increasing:

s(Rₜ₊₁) ≤ s(Rₜ) with high probability

Empirical Convergence Pattern:

| Iteration | Typical Score Reduction | Marginal Improvement |
|---|---|---|
| 1 → 2 | 30-50% | High |
| 2 → 3 | 15-25% | Medium |
| 3 → 4 | 5-15% | Low |
| 4+ | <5% | Diminishing |

This motivates the default max_iterations = 3 setting.

5.4.3 Mechanism

Characteristics:

  • Iterative improvement through self-reflection
  • Same model maintains consistency
  • Bounded iterations control costs
  • Graceful degradation if convergence fails

Research Foundation: Based on Self-Refine (NeurIPS 2023) and Chain-of-Verification (ACL 2024) principles.

Ideal Use Cases:

  • Business intelligence reports
  • Customer support (escalated queries)
  • Educational content
  • Technical documentation

5.5 Premium Mode: Multi-Model Collaborative Verification

Philosophy: Maximum accuracy through diverse model perspectives and collaborative error correction. This mode implements ensemble verification and adversarial debate mechanisms.

5.5.1 Theoretical Basis: Ensemble Learning

Premium Mode is grounded in Condorcet's Jury Theorem (1785) and modern ensemble learning theory:

Theorem 4 (Condorcet's Jury Theorem, adapted). For M independent models each with accuracy p > 0.5 on a binary decision, the majority vote accuracy approaches 1 as M → ∞:

P(majority correct) = Σ(k=⌈M/2⌉ to M) C(M,k) · pᵏ · (1-p)^(M-k) → 1

Corollary (Diversity Requirement): The theorem requires independence. Correlated models (same training data, architecture) provide diminishing returns.
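A worked computation of the theorem's majority-vote probability (strict majority, odd M) with an assumed per-model accuracy of 0.8:

```python
# Majority-vote accuracy for M independent models with per-model accuracy p.
from math import comb

def majority_accuracy(p: float, m: int) -> float:
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(m // 2 + 1, m + 1))

for m in (1, 3, 5):
    print(m, round(majority_accuracy(0.8, m), 3))  # 0.8 -> 0.896 -> 0.942
```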

Practical Diversity Sources:

| Diversity Type | Example | Independence Level |
|---|---|---|
| Training data | GPT vs Claude | High |
| Architecture | Transformer vs Mamba | Very High |
| Fine-tuning | Base vs Instruct | Medium |
| Prompting | Different system prompts | Low |

5.5.2 Theoretical Basis: Multi-Agent Debate

Beyond voting, debate enables models to refine each other's reasoning:

Definition 10 (Debate Protocol). A debate between models M₁, M₂ with judge J consists of:

  1. Generation Phase: Both models produce responses R₁, R₂
  2. Critique Phase: Each model critiques the other's response
  3. Defense Phase: Models defend their claims with evidence
  4. Synthesis Phase: Judge J produces final response based on arguments

Theorem 5 (Debate Improves Grounding). When models must justify claims with evidence from context C, the debate process filters ungrounded claims:

An ungrounded claim in R₁ will be challenged by M₂ if M₂ cannot find supporting evidence in C.

Information-Theoretic View: Debate acts as a lossy compression of the argument space, preserving only claims that survive cross-examination.
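The Definition 10 protocol can be sketched as four prompting phases; `m1`, `m2`, and `judge` are hypothetical text-in/text-out callables and the prompts are illustrative.

```python
# Four-phase debate per Definition 10 (generation, critique, defense, synthesis).
from typing import Callable

LLM = Callable[[str], str]

def debate(m1: LLM, m2: LLM, judge: LLM, context: str, query: str) -> str:
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    r1, r2 = m1(prompt), m2(prompt)  # Generation phase
    # Critique phase: each model challenges claims it cannot ground in C
    c1 = m2(f"Using only the context, list unsupported claims in:\n{r1}\n\nContext:\n{context}")
    c2 = m1(f"Using only the context, list unsupported claims in:\n{r2}\n\nContext:\n{context}")
    # Defense phase: each model defends or concedes the challenged claims
    d1 = m1(f"Defend or concede these critiques of your answer, citing the context:\n{c1}")
    d2 = m2(f"Defend or concede these critiques of your answer, citing the context:\n{c2}")
    # Synthesis phase: the judge keeps only claims that survived cross-examination
    return judge(
        "Produce a final answer containing only claims supported by the context.\n"
        f"Context:\n{context}\n\nAnswer A:\n{r1}\n\nAnswer B:\n{r2}\n\n"
        f"Critiques:\n{c1}\n{c2}\n\nDefenses:\n{d1}\n{d2}"
    )
```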

5.5.3 Mechanism

5.5.4 Consensus Mechanisms

Mechanism 1: Segment-Level Voting

For each claim segment S:

vote(S) = Σₘ 𝟙[S ∈ Rₘ] / M

Accept S if vote(S) > 0.5 (majority agreement).

Mechanism 2: Confidence-Weighted Fusion

R_final = argmax_R Σₘ wₘ · sim(R, Rₘ)

where wₘ is model m's calibrated confidence and sim is semantic similarity.

Mechanism 3: Fine-Grained Replacement (Finch-Zk)

  1. Segment R₁ into claims {S₁, S₂, ..., Sₖ}
  2. For each Sᵢ, check consistency with R₂
  3. If inconsistent: replace Sᵢ with version from more reliable model
  4. Output: hybrid response with highest-confidence segments
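A compact sketch of this replacement step, with `segment`, `consistent_with`, and `preferred_version` as hypothetical helpers (e.g., claim splitting plus an embedding- or NLI-based consistency check):

```python
# Sketch of fine-grained segment replacement across two responses.
def fuse(r1: str, r2: str, segment, consistent_with, preferred_version) -> str:
    fused = []
    for seg in segment(r1):
        if consistent_with(seg, r2):
            fused.append(seg)                         # models agree: keep the segment
        else:
            fused.append(preferred_version(seg, r2))  # disagree: take the higher-confidence variant
    return " ".join(fused)
```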

5.5.5 Cost-Accuracy Trade-off Analysis

| Configuration | Models | Expected Accuracy Gain | Cost Multiplier |
|---|---|---|---|
| Dual-model voting | 2 | +15-25% | 2x |
| Triple-model voting | 3 | +25-35% | 3x |
| Dual + Judge | 2+1 | +30-40% | 3x |
| Full debate (3 rounds) | 2+1 | +40-50% | 5-6x |

5.5.6 Ideal Use Cases

  • Medical diagnosis assistance: Life-critical decisions
  • Legal document analysis: Liability implications
  • Financial advisory: Regulatory compliance required
  • Safety-critical systems: Aerospace, nuclear, etc.

5.6 Mode Selection Decision Tree


6. Configuration Design

6.1 Global Configuration

# Global hallucination detection settings
hallucination:
  enabled: true

  # Detection model (ModernBERT-based)
  model_id: "models/lettucedetect-large-modernbert-en-v1"
  use_cpu: false

  # Default operational mode
  default_mode: "standard"  # lightweight | standard | premium

  # Detection threshold (0.0 - 1.0)
  # Lower = more strict, Higher = more lenient
  threshold: 0.6

  # Warning template for lightweight mode
  warning_template: |
    ⚠️ **Notice**: This response may contain information that could not be
    fully verified against the provided context. Please verify critical facts
    before taking action.

  # Standard mode settings
  standard:
    max_iterations: 3
    convergence_threshold: 0.4  # Stop if score drops below this

  # Premium mode settings
  premium:
    verification_models:
      - "claude-3-sonnet"
      - "gpt-4-turbo"
    judge_model: "llama-3.1-70b"
    max_iterations: 5
    require_consensus: true

6.2 Per-Decision Plugin Configuration

decisions:
  # Healthcare domain - Maximum accuracy required
  - name: "medical_assistant"
    description: "Medical information queries"
    priority: 100
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "healthcare"
        - type: "keyword"
          name: "medical_terms"
    modelRefs:
      - model: "gpt-4-turbo"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: true
          mode: "premium"
          threshold: 0.3  # Very strict
          max_iterations: 5
          require_disclaimer: true

  # Financial services - High accuracy
  - name: "financial_advisor"
    description: "Financial analysis and advice"
    priority: 90
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "finance"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: true
          mode: "standard"
          threshold: 0.5
          max_iterations: 4

  # General customer support - Balanced
  - name: "customer_support"
    description: "General customer inquiries"
    priority: 50
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "support"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: true
          mode: "standard"
          threshold: 0.6
          max_iterations: 2

  # Internal tools - Cost optimized
  - name: "internal_assistant"
    description: "Internal knowledge base queries"
    priority: 30
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "internal"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: true
          mode: "lightweight"
          threshold: 0.7

  # Creative writing - Detection disabled
  - name: "creative_writing"
    description: "Creative content generation"
    priority: 20
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "creative"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: false  # "Hallucination" is a feature here

6.3 Response Headers

The following headers are added to all responses when hallucination detection is enabled:

| Header | Description | Example Values |
|---|---|---|
| X-TruthLens-Enabled | Whether detection was performed | true, false |
| X-TruthLens-Mode | Operational mode used | lightweight, standard, premium |
| X-TruthLens-Score | Hallucination confidence score | 0.0 - 1.0 |
| X-TruthLens-Detected | Whether the score exceeded the threshold | true, false |
| X-TruthLens-Iterations | Number of refinement iterations | 0, 1, 2, ... |
| X-TruthLens-Latency-Ms | Detection/mitigation latency in milliseconds | 35, 450, 2100 |
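Clients can read these headers directly; for example, with Python's requests library (the gateway URL and model name below are placeholders):

```python
# Reading the X-TruthLens-* headers from a client.
import requests

resp = requests.post(
    "http://localhost:8801/v1/chat/completions",  # placeholder gateway endpoint
    json={"model": "gpt-4-turbo", "messages": [{"role": "user", "content": "..."}]},
)
if resp.headers.get("X-TruthLens-Detected") == "true":
    print("Mode:", resp.headers.get("X-TruthLens-Mode"))
    print("Score:", resp.headers.get("X-TruthLens-Score"))
    print("Iterations:", resp.headers.get("X-TruthLens-Iterations"))
```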

6.4 Metrics and Observability

Prometheus Metrics:

| Metric | Type | Labels | Description |
|---|---|---|---|
| truthlens_detections_total | Counter | decision, mode, detected | Total detection operations |
| truthlens_score | Histogram | decision, mode | Score distribution |
| truthlens_latency_seconds | Histogram | mode, operation | Processing latency |
| truthlens_iterations | Histogram | decision, mode | Refinement iteration count |
| truthlens_models_used | Counter | model, role | Model usage in premium mode |

7. References

  1. Kovács, Á., & Recski, G. (2025). LettuceDetect: A Hallucination Detection Framework for RAG Applications. arXiv:2502.17125

  2. Goel, A., Schwartz, D., & Qi, Y. (2025). Finch-Zk: Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency. arXiv:2508.14314

  3. Lin, Z., Niu, Z., Wang, Z., & Xu, Y. (2024). Interpreting and Mitigating Hallucination in MLLMs through Multi-agent Debate. arXiv:2407.20505

  4. Tran, K.T., et al. (2025). Multi-Agent Collaboration Mechanisms: A Survey of LLMs. arXiv:2501.06322

  5. Manakul, P., Liusie, A., & Gales, M.J. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. arXiv:2303.08896

  6. Tang, L., et al. (2024). MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. EMNLP 2024

  7. Madaan, A., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023

  8. Dhuliawala, S., et al. (2024). Chain-of-Verification Reduces Hallucination in Large Language Models. ACL Findings 2024

  9. Su, W., et al. (2024). Unsupervised Real-Time Hallucination Detection based on LLM Internal States (MIND). ACL Findings 2024

  10. Belyi, M., et al. (2025). Luna: A Lightweight Evaluation Model to Catch Language Model Hallucinations. COLING 2025


Appendix A: Full System Flow Diagram


Appendix B: Glossary

| Term | Definition |
|---|---|
| Hallucination | LLM-generated content that is factually incorrect or unsupported by context |
| Intrinsic Hallucination | Generated content that contradicts the provided context |
| Extrinsic Hallucination | Generated content that cannot be verified from the provided context and is not common knowledge |
| ExtProc | Envoy External Processor - enables request/response modification at the gateway |
| Token-Level Detection | Identifying specific tokens/spans that are hallucinated |
| Self-Refinement | Iterative process where the same model corrects its own hallucinations |
| Cross-Model Verification | Using multiple different models to verify factual consistency |
| Multi-Agent Debate | Multiple LLM agents argue positions to converge on factual truth |
| RAG | Retrieval-Augmented Generation - grounding LLMs with retrieved documents |
| ModernBERT | State-of-the-art encoder architecture with 8K context support |
| Accuracy-Latency-Cost Triangle | Fundamental trade-off in hallucination mitigation strategies |
| Convergence Threshold | Score below which hallucination is considered resolved |

Document Version: 1.0 | Last Updated: December 2025