TruthLens: A Cognitive Immune System for Real-Time Hallucination Detection and Mitigation in Large Language Models
Version: 1.0 | Authors: vLLM Semantic Router Team | Date: December 2025
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their tendency to generate hallucinations—fluent but factually incorrect or ungrounded content—remains a critical barrier to enterprise AI adoption. Industry surveys consistently show that hallucination risks are among the top concerns preventing organizations from deploying LLM-powered applications in production environments, particularly in high-stakes domains such as healthcare, finance, and legal services.
We propose TruthLens, a real-time hallucination detection and mitigation framework integrated into the vLLM Semantic Router. By positioning hallucination control at the inference gateway layer, TruthLens provides a model-agnostic, centralized solution that addresses the "accuracy-latency-cost" triangle through configurable mitigation strategies. Users can select from three operational modes based on their tolerance for cost and accuracy trade-offs: (1) Lightweight Mode—single-round detection with warning injection, (2) Standard Mode—iterative self-refinement with the same model, and (3) Premium Mode—multi-model cross-verification and collaborative correction. This design enables organizations to deploy trustworthy AI systems while maintaining control over operational costs and response latency.
1. Introduction: The Hallucination Crisis in Enterprise AI
1.1 The Core Problem
Hallucinations represent the most significant barrier to enterprise AI adoption today. Unlike traditional software bugs, LLM hallucinations are:
- Unpredictable: They occur randomly across different queries and contexts
- Convincing: Hallucinated content often appears fluent, confident, and plausible
- High-stakes: A single hallucination in medical, legal, or financial domains can cause irreversible harm
- Invisible: Without specialized detection, users cannot distinguish hallucinations from accurate responses
Industry Impact by Domain:
| Domain | Hallucination Risk Tolerance | Typical Mitigation Approach |
|---|---|---|
| Healthcare | Near-zero (life-critical) | Mandatory human verification, liability concerns |
| Financial Services | Very low (regulatory) | Compliance-driven review processes |
| Legal | Very low (liability) | Restricted to internal research and drafting |
| Customer Support | Moderate | Escalation protocols for uncertain responses |
| Creative/Marketing | High tolerance | Minimal intervention required |
Note: Based on enterprise deployment patterns observed across industry surveys (McKinsey 2024, Gartner 2024, Menlo Ventures 2024).
1.2 Why Existing Solutions Fall Short
Current approaches to hallucination mitigation operate at the wrong layer of the AI stack: model-level techniques (fine-tuning, alignment) must be repeated for every backend model, while application-level checks (per-app validators, prompt guards) fragment policy and duplicate engineering effort across teams. Neither provides a single point of control, observation, or cost management for the organization.
1.3 Why vLLM Semantic Router is the Ideal Solution Point
The vLLM Semantic Router occupies a unique position in the AI infrastructure stack that makes it ideally suited for hallucination mitigation:
Key Advantages of Gateway-Level Hallucination Control:
| Advantage | Description |
|---|---|
| Model-Agnostic | Works with any LLM backend without modification |
| Centralized Policy | Single configuration point for all applications |
| Cost Control | Organization-wide visibility into accuracy vs. cost trade-offs |
| Incremental Adoption | Enable per-decision, per-domain policies |
| Observability | Unified metrics, logging, and alerting for hallucination events |
| Defense in Depth | Complements (not replaces) RAG and prompt engineering |
1.4 Formal Problem Definition
We formalize hallucination detection in Retrieval-Augmented Generation (RAG) systems as a token-level sequence labeling problem.
Definition 1 (RAG Context). Let a RAG interaction be defined as a tuple (C, Q, R) where:
- C = {c₁, c₂, ..., cₘ} is the retrieved context (set of documents/passages)
- Q is the user query
- R = (r₁, r₂, ..., rₙ) is the generated response as a sequence of n tokens
Definition 2 (Grounded vs. Hallucinated Tokens). A token rᵢ in response R is:
- Grounded if there exists evidence in C that supports the claim containing rᵢ
- Hallucinated if rᵢ contributes to a claim that:
- (a) Contradicts information in C (contradiction hallucination), or
- (b) Cannot be verified from C and is not common knowledge (ungrounded hallucination)
Definition 3 (Hallucination Detection Function). The detection task is to learn a function:
f: (C, Q, R) → Y
where Y = (y₁, y₂, ..., yₙ) and yᵢ ∈ {0, 1} indicates whether token rᵢ is hallucinated.
Definition 4 (Hallucination Score). Given predictions Y and confidence scores P = (p₁, ..., pₙ) where pᵢ = P(yᵢ = 1), we define:
- Token-level score: s_token(rᵢ) = pᵢ
- Span-level score: For a contiguous span S = (rᵢ, ..., rⱼ), s_span(S) = max(pᵢ, ..., pⱼ)
- Response-level score: s_response(R) = 1 - ∏(1 - pᵢ) for all i where pᵢ > τ_token
Definition 5 (Mitigation Decision). Given threshold τ, the system takes action:
Action(R) =
PASS if s_response(R) < τ
MITIGATE if s_response(R) ≥ τ
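To make Definitions 3–5 concrete, the following minimal sketch aggregates pre-computed token probabilities into span- and response-level scores and applies the PASS/MITIGATE rule. The helper names (`group_spans`, `response_score`, `decide`) and the example thresholds are illustrative, not part of any existing implementation.

```python
from dataclasses import dataclass
from typing import List, Literal

TOKEN_THRESHOLD = 0.5     # tau_token: minimum per-token probability counted toward the response score
RESPONSE_THRESHOLD = 0.6  # tau: mitigation trigger (Definition 5)

@dataclass
class Span:
    start: int    # index of the first token in the span
    end: int      # index one past the last token
    score: float  # s_span = max of member token probabilities

def group_spans(probs: List[float], tau_token: float = TOKEN_THRESHOLD) -> List[Span]:
    """Group contiguous tokens with p_i > tau_token into spans (span-based aggregation)."""
    spans, start = [], None
    for i, p in enumerate(probs):
        if p > tau_token and start is None:
            start = i
        elif p <= tau_token and start is not None:
            spans.append(Span(start, i, max(probs[start:i])))
            start = None
    if start is not None:
        spans.append(Span(start, len(probs), max(probs[start:])))
    return spans

def response_score(probs: List[float], tau_token: float = TOKEN_THRESHOLD) -> float:
    """Noisy-OR aggregation: s_response = 1 - prod(1 - p_i) over tokens with p_i > tau_token."""
    score = 1.0
    for p in probs:
        if p > tau_token:
            score *= (1.0 - p)
    return 1.0 - score

def decide(probs: List[float], tau: float = RESPONSE_THRESHOLD) -> Literal["PASS", "MITIGATE"]:
    """Definition 5: PASS if s_response < tau, otherwise MITIGATE."""
    return "PASS" if response_score(probs) < tau else "MITIGATE"

# Example: one high-confidence hallucinated span is enough to trigger mitigation.
token_probs = [0.05, 0.02, 0.91, 0.88, 0.10]
print(group_spans(token_probs))                                     # [Span(start=2, end=4, score=0.91)]
print(round(response_score(token_probs), 3), decide(token_probs))   # 0.989 MITIGATE
```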
2. Related Work: State-of-the-Art in Hallucination Mitigation
2.1 Taxonomy of Hallucination Types
Before reviewing detection methods, we establish a taxonomy of hallucination types:
Type 1: Intrinsic Hallucination — Generated content contradicts the provided context.
Example: Context says "The meeting is on Tuesday." Response says "The meeting is scheduled for Wednesday."
Type 2: Extrinsic Hallucination — Generated content cannot be verified from the context and is not common knowledge.
Example: Context discusses a company's Q3 earnings. Response includes Q4 projections not mentioned anywhere.
Type 3: Fabrication — Entirely invented entities, citations, or facts.
Example: "According to Smith et al. (2023)..." where no such paper exists.
| Type | Detection Difficulty | Mitigation Approach |
|---|---|---|
| Intrinsic | Easier (direct contradiction) | Context re-grounding |
| Extrinsic | Medium (requires knowledge boundary) | Uncertainty expression |
| Fabrication | Harder (requires external verification) | Cross-reference checking |
2.2 Detection Methods
| Category | Representative Work | Mechanism | Accuracy | Latency | Cost |
|---|---|---|---|---|---|
| Encoder-Based | LettuceDetect (2025), Luna (2025) | Token classification with ModernBERT/DeBERTa | F1: 75-79% | 15-35ms | Low |
| Self-Consistency | SelfCheckGPT (2023) | Multiple sampling + consistency check | Varies | Nx base | High |
| Cross-Model | Finch-Zk (2025) | Multi-model response comparison | F1: +6-39% | 2-3x base | High |
| Internal States | MIND (ACL 2024) | Hidden layer activation analysis | High | <10ms | Requires instrumentation |
2.2.1 Encoder-Based Detection (Deep Dive)
LettuceDetect (Kovács et al., 2025) frames hallucination detection as token-level sequence labeling:
- Architecture: ModernBERT-large (395M parameters) with classification head
- Input: Concatenated [Context, Query, Response] with special tokens
- Output: Per-token probability of hallucination
- Training: Fine-tuned on RAGTruth dataset (18K examples)
- Key Innovation: Long-context handling (8K tokens) enables full RAG context inclusion
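The sketch below shows what gateway-side inference with such an encoder could look like using Hugging Face Transformers. It assumes the checkpoint path from Section 6.1, a standard token-classification head, and that label index 1 corresponds to "hallucinated"; verify all three against the detector you actually deploy.

```python
# Minimal sketch of encoder-based token-level detection (assumptions noted above).
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_PATH = "models/lettucedetect-large-modernbert-en-v1"  # mirrors model_id in Section 6.1

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForTokenClassification.from_pretrained(MODEL_PATH)
model.eval()

def hallucination_probs(context: str, query: str, response: str):
    """Return (token, P(hallucinated)) pairs; assumes label index 1 means 'hallucinated'."""
    # Encode [context + query] and [response] as a sentence pair so response tokens can be located.
    enc = tokenizer(context + "\n" + query, response,
                    return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        logits = model(**enc).logits                  # shape (1, L, num_labels)
    probs = torch.softmax(logits, dim=-1)[0, :, 1]    # per-token P(label == 1)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    type_ids = enc["token_type_ids"][0] if "token_type_ids" in enc else None
    pairs = []
    for i, tok in enumerate(tokens):
        # If segment ids are available, keep only second-segment (response) tokens;
        # otherwise return all tokens and let the caller slice out the response span.
        if type_ids is None or int(type_ids[i]) == 1:
            pairs.append((tok, float(probs[i])))
    return pairs
```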
Performance on RAGTruth Benchmark:
| Model | Token F1 | Example F1 | Latency |
|---|---|---|---|
| LettuceDetect-large | 79.22% | 74.8% | ~30ms |
| LettuceDetect-base | 76.5% | 71.2% | ~15ms |
| Luna (DeBERTa) | 73.1% | 68.9% | ~25ms |
| GPT-4 (zero-shot) | 61.2% | 58.4% | ~2s |
Why Encoder-Based for TruthLens: The combination of high accuracy, low latency, and fixed cost makes encoder-based detection ideal for gateway-level deployment.
2.2.2 Self-Consistency Methods
SelfCheckGPT (Manakul et al., 2023) exploits the observation that hallucinations are inconsistent across samples:
- Mechanism: Generate N responses, measure consistency
- Intuition: Factual content is reproducible; hallucinations vary
- Limitation: Requires N LLM calls (typically N=5-10)
Theoretical Basis: If P(fact) is high, the fact appears in most samples. If P(hallucination) is low per-sample, it rarely repeats.
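A minimal sketch of the sampling-and-consistency idea follows. For brevity it compares sentences by lexical overlap, whereas SelfCheckGPT itself uses stronger comparators (NLI, QA, or LLM prompting); the `generate` callable is a placeholder for re-querying the backend model.

```python
# Sample-based consistency checking in the spirit of SelfCheckGPT (simplified comparator).
from typing import Callable, List, Tuple

def sentence_consistency(sentence: str, samples: List[str]) -> float:
    """Fraction of resampled responses that (roughly) support this sentence."""
    words = {w.lower().strip(".,") for w in sentence.split()}
    support = 0
    for sample in samples:
        sample_words = {w.lower().strip(".,") for w in sample.split()}
        overlap = len(words & sample_words) / max(len(words), 1)
        support += overlap >= 0.6   # crude support test; replace with NLI or QA in practice
    return support / max(len(samples), 1)

def selfcheck(response: str, generate: Callable[[], str], n_samples: int = 5) -> List[Tuple[str, float]]:
    """Return (sentence, inconsistency score) pairs; high scores suggest hallucination."""
    samples = [generate() for _ in range(n_samples)]           # N extra stochastic generations
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    return [(s, 1.0 - sentence_consistency(s, samples)) for s in sentences]
```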
2.2.3 Cross-Model Verification
Finch-Zk (2025) leverages model diversity:
- Mechanism: Compare responses from different model families
- Key Insight: Different models hallucinate differently
- Segment-Level Correction: Replace inconsistent segments with higher-confidence version
2.3 Mitigation Strategies
| Strategy | Representative Work | Mechanism | Effectiveness | Overhead |
|---|---|---|---|---|
| Self-Refinement | Self-Refine (NeurIPS 2023) | Iterative feedback loop | 40-60% reduction | 2-4x latency |
| Chain-of-Verification | CoVe (ACL 2024) | Generate verification questions | 50-70% reduction | 3-5x latency |
| Multi-Agent Debate | MAD (2024) | Multiple agents argue and converge | 60-80% reduction | 5-10x latency |
| Cross-Model Correction | Finch-Zk (2025) | Targeted segment replacement | Up to 9% accuracy gain | 3x latency |
2.3.1 Self-Refinement (Deep Dive)
Self-Refine (Madaan et al., NeurIPS 2023) demonstrates that LLMs can improve their own outputs:
Loop:
1. Generate initial response R₀
2. Generate feedback F on R₀ (same model)
3. Generate refined response R₁ using F
4. Repeat until convergence or max iterations
Key Findings:
- Works best when feedback is specific (not just "improve this")
- Diminishing returns after 2-3 iterations
- Requires the model to have the knowledge to correct itself
Limitation for Hallucination: If the model lacks the correct knowledge, self-refinement may not help or may introduce new errors.
2.3.2 Chain-of-Verification (CoVe)
CoVe (Dhuliawala et al., ACL 2024) generates verification questions:
1. Generate response R
2. Extract factual claims from R
3. For each claim, generate verification question
4. Answer verification questions using context
5. Revise R based on verification results
Advantage: Explicit verification step catches subtle errors. Disadvantage: High latency (3-5x) due to multi-step process.
2.3.3 Multi-Agent Debate
Multi-Agent Debate (Du et al., 2024) uses multiple LLM instances:
1. Multiple agents generate responses
2. Agents critique each other's responses
3. Agents revise based on critiques
4. Repeat for N rounds
5. Synthesize final response
Theoretical Advantage: Diverse perspectives catch blind spots. Practical Challenge: High cost (5-10x) and latency.
2.4 The Accuracy-Latency-Cost Triangle
Research consistently shows a fundamental trade-off among detection accuracy, added latency, and operational cost: improving any one dimension tends to degrade at least one of the others.
Key Insight: No single approach optimizes all three dimensions. TruthLens addresses this by offering user-selectable operational modes that let organizations choose their position on this trade-off triangle.
3. Theoretical Foundations
This section establishes the theoretical basis for TruthLens's three-mode architecture, drawing from sequence labeling, iterative optimization, ensemble learning, and multi-agent systems theory.
3.1 Hallucination Detection as Sequence Labeling
3.1.1 Token Classification Architecture
Modern hallucination detection leverages transformer-based encoders fine-tuned for token classification. Given input sequence X = [CLS] C [SEP] Q [SEP] R [SEP], the encoder produces contextualized representations:
H = Encoder(X) ∈ ℝ^(L×d)
where L is sequence length and d is hidden dimension. For each token rᵢ in the response, we compute:
P(yᵢ = 1|X) = σ(W · hᵢ + b)
where W ∈ ℝ^d, b ∈ ℝ are learned parameters and σ is the sigmoid function.
3.1.2 Why ModernBERT for Detection
The choice of encoder architecture significantly impacts detection quality. We adopt ModernBERT (Warner et al., 2024) for the following theoretical advantages:
| Property | ModernBERT | Traditional BERT | Impact on Detection |
|---|---|---|---|
| Context Length | 8,192 tokens | 512 tokens | Handles full RAG context without truncation |
| Attention | Rotary Position Embeddings (RoPE) | Absolute positional | Better long-range dependency modeling |
| Architecture | GeGLU activations, no biases | GELU, with biases | Improved gradient flow for fine-grained classification |
| Efficiency | Flash Attention, Unpadding | Standard attention | 2x inference speedup enables real-time detection |
3.1.3 Scoring Function Design
The aggregation from token-level to response-level scores requires careful design. We propose a noisy-OR aggregation model:
s_response(R) = 1 - ∏ᵢ(1 - pᵢ · 𝟙[pᵢ > τ_token])
Theoretical Justification: The noisy-OR model assumes independence between hallucination events at different tokens. While this is an approximation, it provides:
- Monotonicity: Adding a hallucinated token never decreases the response score
- Sensitivity: Single high-confidence hallucination triggers detection
- Calibration: Score approximates P(∃ hallucination in R)
Alternative: Span-Based Aggregation
For correlated hallucinations (common in fabricated entities), we first group contiguous hallucinated tokens into spans, then aggregate:
s_response(R) = max{s_span(S₁), s_span(S₂), ..., s_span(Sₖ)}
This reduces sensitivity to tokenization artifacts and focuses on semantic units.
3.1.4 Threshold Selection Theory
The detection threshold τ controls the precision-recall trade-off. From decision theory:
Proposition 1 (Optimal Threshold). Given cost ratio λ = C_FN / C_FP (cost of false negative vs. false positive), the optimal threshold satisfies:
τ* = 1 / (1 + λ · (1-π)/π)
where π is the prior probability of hallucination.
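A quick numerical reading of Proposition 1 with illustrative numbers follows. Note that the raw decision-theoretic optimum is usually stricter (lower) than the hand-tuned recommendations in the table below, which also fold in detector calibration and user-experience considerations.

```python
# Worked example for Proposition 1 (illustrative numbers only).
def optimal_threshold(cost_ratio: float, prior: float) -> float:
    """tau* = 1 / (1 + lambda * (1 - pi) / pi)."""
    return 1.0 / (1.0 + cost_ratio * (1.0 - prior) / prior)

# Medical-style setting: missing a hallucination is 20x worse than a false alarm,
# and roughly 30% of responses contain at least one hallucinated span.
print(round(optimal_threshold(cost_ratio=20, prior=0.3), 3))   # 0.021 -> flag aggressively
# Creative-writing setting: false alarms are 5x worse than misses, prior 10%.
print(round(optimal_threshold(cost_ratio=0.2, prior=0.1), 3))  # 0.357
```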
Practical Implications:
| Domain | λ (Cost Ratio) | Recommended τ | Rationale |
|---|---|---|---|
| Medical | 10-100 | 0.3-0.5 | Missing hallucination is catastrophic |
| Financial | 5-20 | 0.4-0.6 | Regulatory risk from false information |
| Customer Support | 1-2 | 0.6-0.7 | Balance user experience and accuracy |
| Creative | 0.1-0.5 | 0.8-0.9 | Over-flagging harms creativity |
3.2 Self-Refinement Theory
3.2.1 Iterative Refinement as Fixed-Point Iteration
Standard Mode employs iterative self-refinement, which can be formalized as seeking a fixed point of a refinement operator.
Definition 6 (Refinement Operator). Let T: R → R be the refinement operator where:
T(Rₜ) = LLM(Prompt_refine(C, Q, Rₜ, Detect(Rₜ)))
The iteration proceeds as: R₀ → R₁ → R₂ → ... → R*
Theorem 1 (Convergence Conditions). The refinement sequence {Rₜ} converges to a fixed point R* if:
- The hallucination score sequence {s(Rₜ)} is monotonically non-increasing
- The score is bounded below (s(R) ≥ 0)
- The LLM exhibits consistency: similar prompts yield similar outputs
Proof Sketch: Conditions 1 and 2 ensure the score sequence converges by the Monotone Convergence Theorem. Condition 3 (LLM consistency) ensures the response sequence itself converges, not just the scores.
3.2.2 Convergence Rate Analysis
Empirical Observation: Self-refinement typically exhibits sublinear convergence:
s(Rₜ) - s(R*) ≤ O(1/t)
This is because:
- Easy hallucinations (explicit contradictions) are corrected in early iterations
- Hard hallucinations (subtle ungrounded claims) may persist or oscillate
- Diminishing returns after 2-3 iterations in practice
3.2.3 Prompt Engineering Principles for Correction
Effective refinement prompts must satisfy several theoretical properties:
Principle 1 (Specificity): The prompt must identify which spans are hallucinated, not just that hallucination exists.
Principle 2 (Grounding): The prompt must provide the original context C to enable fact-checking.
Principle 3 (Preservation): The prompt must instruct the model to preserve accurate content.
Principle 4 (Uncertainty): When correction is not possible, the model should express uncertainty rather than fabricate alternatives.
Refinement Prompt Template Structure:
Given:
- Context: [Retrieved passages C]
- Query: [User question Q]
- Response: [Current response Rₜ with hallucinated spans marked]
The following spans may be hallucinated: [List of (span, confidence)]
Instructions:
1. For each flagged span, verify against the context
2. If contradicted: correct using context evidence
3. If unverifiable and not common knowledge: remove or qualify with uncertainty
4. Preserve all accurate, well-grounded content
5. Maintain coherent narrative flow
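A small sketch of how this template could be assembled into Prompt_refine from Definition 6. The `call_llm` callable and the span format are placeholders for whatever client and detector output the gateway actually uses.

```python
# Sketch of Prompt_refine (Section 3.2.3); template wording follows the structure above.
from typing import Callable, List, Tuple

REFINE_TEMPLATE = """Given:
- Context: {context}
- Query: {query}
- Response: {response}

The following spans may be hallucinated:
{flagged}

Instructions:
1. For each flagged span, verify against the context
2. If contradicted: correct using context evidence
3. If unverifiable and not common knowledge: remove or qualify with uncertainty
4. Preserve all accurate, well-grounded content
5. Maintain coherent narrative flow
"""

def build_refine_prompt(context: str, query: str, response: str,
                        flagged_spans: List[Tuple[str, float]]) -> str:
    flagged = "\n".join(f"- \"{span}\" (confidence {conf:.2f})" for span, conf in flagged_spans)
    return REFINE_TEMPLATE.format(context=context, query=query,
                                  response=response, flagged=flagged)

def refine_once(call_llm: Callable[[str], str], context: str, query: str,
                response: str, flagged_spans: List[Tuple[str, float]]) -> str:
    """One application of the refinement operator T: R_t -> R_{t+1}."""
    return call_llm(build_refine_prompt(context, query, response, flagged_spans))
```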
3.3 Multi-Model Collaboration Theory
Premium Mode leverages multiple LLMs for cross-verification. We ground this in ensemble learning and multi-agent debate theory.
3.3.1 Ensemble Learning Perspective
Theorem 2 (Diversity-Accuracy Trade-off). For an ensemble of M models with individual error rate ε and pairwise correlation ρ, the ensemble error rate under majority voting is:
ε_ensemble ≈ ε · (1 + (M-1)ρ) / M when ε < 0.5
Corollary: Ensemble error approaches zero as M → ∞ only if ρ < 1 (models are diverse).
Implications for TruthLens:
| Model Combination | Expected Diversity (1-ρ) | Error Reduction |
|---|---|---|
| Same model family (GPT-4 variants) | Low (0.2-0.4) | 10-20% |
| Different families (GPT-4 + Claude) | Medium (0.4-0.6) | 30-50% |
| Different architectures (Transformer + other) | High (0.6-0.8) | 50-70% |
3.3.2 Multi-Agent Debate Framework
Beyond simple voting, multi-agent debate enables models to argue about factual claims and converge on truth.
Definition 7 (Argumentation Framework). An argumentation framework is a pair AF = (A, →) where:
- A is a set of arguments (factual claims from each model)
- → ⊆ A × A is an attack relation (contradictions between claims)
Definition 8 (Grounded Extension). The grounded extension E of AF is the least fixed point of the defense (characteristic) function: the unique minimal complete set of arguments that is conflict-free and defends each of its members against all attacks.
Multi-Agent Debate Protocol:
3.3.3 Consensus Mechanisms
Mechanism 1: Majority Voting
y_final(token) = argmax_y |{m : f_m(token) = y}|
- Simple, fast
- Requires odd number of models
- Does not account for model confidence
Mechanism 2: Weighted Confidence Aggregation
p_final(token) = Σₘ wₘ · pₘ(token) / Σₘ wₘ
where wₘ is model m's calibrated reliability weight.
- Accounts for varying model expertise
- Requires calibrated confidence scores
Mechanism 3: Segment-Level Replacement (Finch-Zk)
For each claim segment S in response R₁:
- Check if S appears (semantically) in R₂
- If consistent: keep S
- If inconsistent: replace with version from more reliable model
- If only in R₁: flag as potentially hallucinated
This mechanism achieves fine-grained correction without full response regeneration.
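Minimal sketches of Mechanisms 1 and 2 follow, assuming per-model token predictions and pre-calibrated reliability weights are already available; the model names in the example are illustrative.

```python
# Consensus Mechanisms 1 and 2 over per-token detector outputs.
from typing import Dict, List

def majority_vote(predictions: List[int]) -> int:
    """Mechanism 1: y_final = argmax_y |{m : f_m = y}| (use an odd M to avoid ties)."""
    return int(sum(predictions) * 2 > len(predictions))

def weighted_confidence(probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Mechanism 2: p_final = sum_m w_m * p_m / sum_m w_m."""
    total_weight = sum(weights[m] for m in probs)
    return sum(weights[m] * probs[m] for m in probs) / total_weight

# Example: three detectors disagree on one token.
print(majority_vote([1, 0, 1]))  # 1 -> token flagged as hallucinated
print(round(weighted_confidence(
    {"gpt-4-turbo": 0.82, "claude-3-sonnet": 0.40, "llama-3.1-70b": 0.65},
    {"gpt-4-turbo": 1.0, "claude-3-sonnet": 0.8, "llama-3.1-70b": 0.6}), 3))  # 0.638
```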
3.4 Theoretical Justification for Three-Mode Architecture
3.4.1 Pareto Frontier Analysis
The Accuracy-Latency-Cost space admits a Pareto frontier: points where improving one dimension requires sacrificing another.
Proposition 2 (Three Operating Points). The Pareto frontier in the A-L-C space has three natural "knee points" corresponding to:
- Cost-dominated regime (Lightweight): Minimal intervention, detection-only
- Balanced regime (Standard): Moderate refinement, single-model
- Accuracy-dominated regime (Premium): Maximum verification, multi-model
3.4.2 Why Not Continuous Control?
One might ask: why discrete modes rather than continuous parameters?
Argument 1 (Cognitive Load): Users cannot effectively reason about continuous trade-offs. Three discrete modes map to intuitive concepts: "fast/cheap," "balanced," "best quality."
Argument 2 (Operational Complexity): Each mode involves qualitatively different mechanisms (detection-only vs. iteration vs. multi-model). Intermediate points would require complex interpolation.
Argument 3 (Empirical Gaps): The Pareto frontier is not smooth—there are natural gaps where intermediate configurations offer little benefit over the nearest discrete mode.
3.4.3 Mode Selection as Online Learning
In production, mode selection can be formulated as a multi-armed bandit problem:
- Arms: {Lightweight, Standard, Premium}
- Reward: User satisfaction (proxy: no negative feedback)
- Cost: Latency + API costs
Thompson Sampling approach: Maintain Beta distributions over success probability for each mode, sample and select, update based on outcome. This enables adaptive mode selection per query type.
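A minimal Thompson Sampling sketch for this bandit formulation is shown below. The definition of "success" (no negative user feedback) and the absence of an explicit cost penalty are simplifying assumptions about how a deployment might define the reward.

```python
# Thompson Sampling over the three operational modes.
import random

class ModeBandit:
    def __init__(self, modes=("lightweight", "standard", "premium")):
        # Beta(1, 1) prior over each mode's success probability.
        self.alpha = {m: 1.0 for m in modes}
        self.beta = {m: 1.0 for m in modes}

    def select(self) -> str:
        """Sample a success rate for each mode and pick the argmax."""
        samples = {m: random.betavariate(self.alpha[m], self.beta[m]) for m in self.alpha}
        return max(samples, key=samples.get)

    def update(self, mode: str, success: bool) -> None:
        """Record the observed outcome for the chosen mode."""
        if success:
            self.alpha[mode] += 1
        else:
            self.beta[mode] += 1

bandit = ModeBandit()
mode = bandit.select()              # choose a mode for the next request
bandit.update(mode, success=True)   # e.g. no negative user feedback observed
```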
4. System Architecture
4.1 High-Level Architecture
TruthLens integrates into the vLLM Semantic Router's ExtProc pipeline, creating a comprehensive request-response security boundary:
4.2 Detection Flow
The hallucination detection process operates on the complete context-query-response triple:
5. User Strategy Options: The Cost-Accuracy Spectrum
TruthLens provides three operational modes that allow organizations to position themselves on the cost-accuracy trade-off spectrum based on their specific requirements.
5.1 Strategy Overview
5.2 Mode Comparison Matrix
| Dimension | 🟢 Lightweight | 🟡 Standard | 🔴 Premium |
|---|---|---|---|
| Primary Goal | Cost efficiency | Balanced | Maximum accuracy |
| Detection Method | Single encoder pass | Encoder + self-check | Multi-model cross-verification |
| Mitigation Action | Warning injection | Iterative self-refinement | Multi-model collaborative correction |
| Latency Overhead | +15-35ms | +200-500ms (2-4x) | +1-3s (5-10x) |
| Cost Multiplier | 1.0x (detection only) | 1.5-2.5x | 3-5x |
| Hallucination Reduction | Awareness only | 40-60% | 70-85% |
| Best For | Internal tools, chatbots | Business applications | Medical, legal, financial |
5.3 Lightweight Mode: Cost-Optimized Detection
Philosophy: Minimize operational cost while providing hallucination awareness. This mode treats hallucination detection as an information service rather than an intervention system.
5.3.1 Theoretical Basis
Lightweight Mode is grounded in Bounded Rationality Theory (Simon, 1955): when optimization costs exceed benefits, satisficing (accepting "good enough") is rational.
Cost-Benefit Analysis:
Let C_detect = cost of detection, C_mitigate = cost of mitigation, p = probability of hallucination, L = expected loss from undetected hallucination.
Lightweight Mode is optimal when: C_detect < p · L but C_detect + C_mitigate > p · L
In other words: detection is worth the cost, but full mitigation is not.
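A quick numeric check of this condition with illustrative (not measured) per-request figures:

```python
# Illustrative numbers for the bounded-rationality condition above (not measurements).
p = 0.15           # probability a response contains a hallucination
loss = 2.00        # expected loss (in $) from an undetected hallucination
c_detect = 0.001   # encoder pass at the gateway
c_mitigate = 0.50  # extra LLM calls for refinement/verification

print(c_detect < p * loss)               # True  -> detection pays for itself
print(c_detect + c_mitigate > p * loss)  # True  -> full mitigation does not
```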
5.3.2 Mechanism
Characteristics:
- No additional LLM calls after initial generation
- Fixed detection cost regardless of response length
- User-facing warning empowers human verification
- Rich metadata for downstream analytics
5.3.3 Theoretical Guarantees
Proposition 3 (Detection Latency Bound). For ModernBERT-large with sequence length L ≤ 8192:
T_detect ≤ O(L²/chunk_size) + O(L · d)
In practice: T_detect ≤ 35ms for L ≤ 4096 on modern GPUs.
Proposition 4 (No Suppressed Detections). In Lightweight Mode, every span whose score exceeds threshold τ is surfaced in the response metadata; the mode never suppresses detection results, it only declines to mitigate them.
5.3.4 Ideal Use Cases
- Internal knowledge bases (users can verify)
- Developer assistants (technical users)
- Creative writing tools (hallucination may be desired)
- Low-stakes customer interactions (human escalation available)
5.4 Standard Mode: Balanced Self-Refinement
Philosophy: Leverage the same model to self-correct detected hallucinations through iterative refinement. This mode implements a closed-loop feedback system where the LLM serves as both generator and corrector.
5.4.1 Theoretical Basis
Standard Mode is grounded in Self-Consistency Theory and Iterative Refinement:
Theorem 3 (Self-Refinement Effectiveness). If an LLM has learned the correct answer distribution for a query class, then prompting with explicit error feedback increases the probability of correct output:
P(correct | feedback on error) > P(correct | no feedback)
provided the feedback is accurate and actionable.
Intuition: LLMs often "know" the right answer but fail to produce it on first attempt due to:
- Sampling noise (temperature > 0)
- Attention to wrong context regions
- Competing patterns in weights
Explicit error feedback redirects attention and suppresses incorrect patterns.
5.4.2 Convergence Analysis
Definition 9 (Refinement Sequence). The sequence {Rₜ} for t = 0, 1, 2, ... where:
R₀ = LLM(Q, C) (initial response)
Rₜ₊₁ = LLM(Prompt_refine(Q, C, Rₜ, Detect(Rₜ))) (refined response)
Lemma 1 (Monotonic Score Decrease). Under mild assumptions (consistent LLM, accurate detection), the hallucination score sequence is non-increasing:
s(Rₜ₊₁) ≤ s(Rₜ) with high probability
Empirical Convergence Pattern:
| Iteration | Typical Score Reduction | Marginal Improvement |
|---|---|---|
| 1 → 2 | 30-50% | High |
| 2 → 3 | 15-25% | Medium |
| 3 → 4 | 5-15% | Low |
| 4+ | <5% | Diminishing |
This motivates the default max_iterations = 3 setting.
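The outer Standard Mode loop can then be sketched as follows, using the max_iterations and convergence_threshold defaults from Section 6.1; `detect` and `refine_once` are placeholders for the encoder detector and the refinement step of Section 3.2.3.

```python
# Outer Standard Mode loop with iteration cap and convergence check (sketch).
from typing import Callable, List, Tuple

DetectFn = Callable[[str, str, str], Tuple[float, List[Tuple[str, float]]]]
RefineFn = Callable[[str, str, str, List[Tuple[str, float]]], str]

def standard_mode(context: str, query: str, response: str,
                  detect: DetectFn, refine_once: RefineFn,
                  max_iterations: int = 3, convergence_threshold: float = 0.4) -> str:
    score, spans = detect(context, query, response)
    best_response, best_score = response, score
    for _ in range(max_iterations):
        if score < convergence_threshold or not spans:
            break                                       # converged: score is low enough
        response = refine_once(context, query, response, spans)
        score, spans = detect(context, query, response)
        if score < best_score:                          # guard against oscillation
            best_response, best_score = response, score
    return best_response
```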
5.4.3 Mechanism
Characteristics:
- Iterative improvement through self-reflection
- Same model maintains consistency
- Bounded iterations control costs
- Graceful degradation if convergence fails
Research Foundation: Based on Self-Refine (NeurIPS 2023) and Chain-of-Verification (ACL 2024) principles.
Ideal Use Cases:
- Business intelligence reports
- Customer support (escalated queries)
- Educational content
- Technical documentation
5.5 Premium Mode: Multi-Model Collaborative Verification
Philosophy: Maximum accuracy through diverse model perspectives and collaborative error correction. This mode implements ensemble verification and adversarial debate mechanisms.
5.5.1 Theoretical Basis: Ensemble Learning
Premium Mode is grounded in Condorcet's Jury Theorem (1785) and modern ensemble learning theory:
Theorem 4 (Condorcet's Jury Theorem, adapted). For M independent models each with accuracy p > 0.5 on a binary decision, the majority vote accuracy approaches 1 as M → ∞:
P(majority correct) = Σ(k=⌈M/2⌉ to M) C(M,k) · pᵏ · (1-p)^(M-k) → 1
Corollary (Diversity Requirement): The theorem requires independence. Correlated models (same training data, architecture) provide diminishing returns.
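A quick numerical check of the theorem for independent voters (real models are correlated, which is exactly why the diversity sources below matter):

```python
# Numerical check of Theorem 4 for independent voters (illustrative only).
from math import comb, ceil

def majority_accuracy(m: int, p: float) -> float:
    """P(majority correct) = sum_{k=ceil(M/2)}^{M} C(M,k) p^k (1-p)^(M-k)."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(ceil(m / 2), m + 1))

for m in (1, 3, 5):
    print(m, round(majority_accuracy(m, p=0.7), 3))  # 1 -> 0.7, 3 -> 0.784, 5 -> 0.837
```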
Practical Diversity Sources:
| Diversity Type | Example | Independence Level |
|---|---|---|
| Training data | GPT vs Claude | High |
| Architecture | Transformer vs Mamba | Very High |
| Fine-tuning | Base vs Instruct | Medium |
| Prompting | Different system prompts | Low |
5.5.2 Theoretical Basis: Multi-Agent Debate
Beyond voting, debate enables models to refine each other's reasoning:
Definition 10 (Debate Protocol). A debate between models M₁, M₂ with judge J consists of:
- Generation Phase: Both models produce responses R₁, R₂
- Critique Phase: Each model critiques the other's response
- Defense Phase: Models defend their claims with evidence
- Synthesis Phase: Judge J produces final response based on arguments
Theorem 5 (Debate Improves Grounding). When models must justify claims with evidence from context C, the debate process filters ungrounded claims:
An ungrounded claim in R₁ will be challenged by M₂ if M₂ cannot find supporting evidence in C.
Information-Theoretic View: Debate acts as a lossy compression of the argument space, preserving only claims that survive cross-examination.
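A skeleton of this protocol is sketched below; `call` stands in for any chat-completion client, and the prompts are illustrative rather than the router's actual templates.

```python
# Skeleton of the debate protocol in Definition 10 (generation, critique/defense, synthesis).
from typing import Callable

def debate(call: Callable[[str, str], str], context: str, query: str,
           model_a: str, model_b: str, judge: str, rounds: int = 2) -> str:
    # Generation phase: both models answer from the shared context.
    ask = f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."
    r_a, r_b = call(model_a, ask), call(model_b, ask)

    for _ in range(rounds):
        # Critique + defense phase: each model challenges unsupported claims in the other's answer.
        critique_a = call(model_a, f"{ask}\n\nOpponent answer:\n{r_b}\nList claims lacking context support.")
        critique_b = call(model_b, f"{ask}\n\nOpponent answer:\n{r_a}\nList claims lacking context support.")
        r_a = call(model_a, f"{ask}\n\nYour answer:\n{r_a}\nCritique received:\n{critique_b}\nRevise, keeping only grounded claims.")
        r_b = call(model_b, f"{ask}\n\nYour answer:\n{r_b}\nCritique received:\n{critique_a}\nRevise, keeping only grounded claims.")

    # Synthesis phase: the judge produces the final grounded response.
    return call(judge, f"{ask}\n\nCandidate A:\n{r_a}\n\nCandidate B:\n{r_b}\n"
                       "Produce a final answer containing only claims supported by the context.")
```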
5.5.3 Mechanism
5.5.4 Consensus Mechanisms
Mechanism 1: Segment-Level Voting
For each claim segment S:
vote(S) = Σₘ 𝟙[S ∈ Rₘ] / M
Accept S if vote(S) > 0.5 (majority agreement).
Mechanism 2: Confidence-Weighted Fusion
R_final = argmax_R Σₘ wₘ · sim(R, Rₘ)
where wₘ is model m's calibrated confidence and sim is semantic similarity.
Mechanism 3: Fine-Grained Replacement (Finch-Zk)
- Segment R₁ into claims {S₁, S₂, ..., Sₖ}
- For each Sᵢ, check consistency with R₂
- If inconsistent: replace Sᵢ with version from more reliable model
- Output: hybrid response with highest-confidence segments
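The sketch below illustrates this replacement logic. `similarity` stands in for a semantic-similarity model (e.g., embedding cosine), the `reliability` weights are assumed to be pre-calibrated, and this is our illustration of the mechanism, not Finch-Zk's actual code.

```python
# Fine-grained segment replacement (Mechanism 3), illustrative only.
from typing import Callable, Dict, List

def replace_segments(segments_r1: List[str], segments_r2: List[str],
                     similarity: Callable[[str, str], float],
                     reliability: Dict[str, float],
                     consistency_threshold: float = 0.8) -> List[str]:
    final = []
    for seg in segments_r1:
        # Find the closest segment in the second model's response.
        best_match = max(segments_r2, key=lambda s: similarity(seg, s), default=None)
        best_sim = similarity(seg, best_match) if best_match is not None else 0.0
        if best_sim >= consistency_threshold:
            final.append(seg)                           # consistent: keep as-is
        elif best_match is not None and reliability["model_2"] > reliability["model_1"]:
            final.append(best_match)                    # inconsistent: take the more reliable model's version
        else:
            final.append(f"[unverified] {seg}")         # only in R1: flag as potentially hallucinated
    return final
```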
5.5.5 Cost-Accuracy Trade-off Analysis
| Configuration | Models | Expected Accuracy Gain | Cost Multiplier |
|---|---|---|---|
| Dual-model voting | 2 | +15-25% | 2x |
| Triple-model voting | 3 | +25-35% | 3x |
| Dual + Judge | 2+1 | +30-40% | 3x |
| Full debate (3 rounds) | 2+1 | +40-50% | 5-6x |
5.5.6 Ideal Use Cases
- Medical diagnosis assistance: Life-critical decisions
- Legal document analysis: Liability implications
- Financial advisory: Regulatory compliance required
- Safety-critical systems: Aerospace, nuclear, etc.
5.6 Mode Selection Decision Tree
6. Configuration Design
6.1 Global Configuration
```yaml
# Global hallucination detection settings
hallucination:
  enabled: true

  # Detection model (ModernBERT-based)
  model_id: "models/lettucedetect-large-modernbert-en-v1"
  use_cpu: false

  # Default operational mode
  default_mode: "standard"   # lightweight | standard | premium

  # Detection threshold (0.0 - 1.0)
  # Lower = more strict, Higher = more lenient
  threshold: 0.6

  # Warning template for lightweight mode
  warning_template: |
    ⚠️ **Notice**: This response may contain information that could not be
    fully verified against the provided context. Please verify critical facts
    before taking action.

  # Standard mode settings
  standard:
    max_iterations: 3
    convergence_threshold: 0.4   # Stop if score drops below this

  # Premium mode settings
  premium:
    verification_models:
      - "claude-3-sonnet"
      - "gpt-4-turbo"
    judge_model: "llama-3.1-70b"
    max_iterations: 5
    require_consensus: true
```
6.2 Per-Decision Plugin Configuration
```yaml
decisions:
  # Healthcare domain - Maximum accuracy required
  - name: "medical_assistant"
    description: "Medical information queries"
    priority: 100
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "healthcare"
        - type: "keyword"
          name: "medical_terms"
    modelRefs:
      - model: "gpt-4-turbo"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: true
          mode: "premium"
          threshold: 0.3   # Very strict
          max_iterations: 5
          require_disclaimer: true

  # Financial services - High accuracy
  - name: "financial_advisor"
    description: "Financial analysis and advice"
    priority: 90
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "finance"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: true
          mode: "standard"
          threshold: 0.5
          max_iterations: 4

  # General customer support - Balanced
  - name: "customer_support"
    description: "General customer inquiries"
    priority: 50
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "support"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: true
          mode: "standard"
          threshold: 0.6
          max_iterations: 2

  # Internal tools - Cost optimized
  - name: "internal_assistant"
    description: "Internal knowledge base queries"
    priority: 30
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "internal"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: true
          mode: "lightweight"
          threshold: 0.7

  # Creative writing - Detection disabled
  - name: "creative_writing"
    description: "Creative content generation"
    priority: 20
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "creative"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: false   # "Hallucination" is a feature here
```
6.3 Response Headers
The following headers are added to all responses when hallucination detection is enabled:
| Header | Description | Example Values |
|---|---|---|
| X-TruthLens-Enabled | Whether detection was performed | true, false |
| X-TruthLens-Mode | Operational mode used | lightweight, standard, premium |
| X-TruthLens-Score | Hallucination confidence score | 0.0 - 1.0 |
| X-TruthLens-Detected | Whether hallucination exceeded threshold | true, false |
| X-TruthLens-Iterations | Number of refinement iterations | 0, 1, 2, ... |
| X-TruthLens-Latency-Ms | Detection/mitigation latency | 35, 450, 2100 |
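For illustration, a client talking to the gateway's OpenAI-compatible endpoint could inspect these headers as follows; the URL, model name, and payload are placeholders for your deployment.

```python
# Example client-side inspection of TruthLens response headers (placeholder endpoint).
import requests

resp = requests.post(
    "http://semantic-router.local/v1/chat/completions",
    json={
        "model": "gpt-4-turbo",
        "messages": [{"role": "user", "content": "Summarize the attached policy."}],
    },
    timeout=30,
)

if resp.headers.get("X-TruthLens-Enabled") == "true":
    score = float(resp.headers.get("X-TruthLens-Score", "0"))
    detected = resp.headers.get("X-TruthLens-Detected") == "true"
    mode = resp.headers.get("X-TruthLens-Mode")
    print(f"mode={mode} score={score:.2f} detected={detected}")
    if detected and mode == "lightweight":
        # Lightweight mode only warns; surface the caveat to the end user.
        print("Response may contain unverified claims; display the injected warning.")
```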
6.4 Metrics and Observability
Prometheus Metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
| truthlens_detections_total | Counter | decision, mode, detected | Total detection operations |
| truthlens_score | Histogram | decision, mode | Score distribution |
| truthlens_latency_seconds | Histogram | mode, operation | Processing latency |
| truthlens_iterations | Histogram | decision, mode | Refinement iteration count |
| truthlens_models_used | Counter | model, role | Model usage in premium mode |
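As an illustration, these metrics could be registered with the Python prometheus_client as follows; the router itself may register them in another language, but the names and labels follow the table above.

```python
# Metric registration sketch mirroring the table above (Python prometheus_client).
from prometheus_client import Counter, Histogram

DETECTIONS = Counter(
    "truthlens_detections_total", "Total detection operations",
    ["decision", "mode", "detected"],
)
SCORE = Histogram(
    "truthlens_score", "Hallucination score distribution",
    ["decision", "mode"], buckets=[0.1 * i for i in range(11)],
)
LATENCY = Histogram(
    "truthlens_latency_seconds", "Processing latency",
    ["mode", "operation"],
)
ITERATIONS = Histogram(
    "truthlens_iterations", "Refinement iteration count",
    ["decision", "mode"], buckets=[0, 1, 2, 3, 4, 5],
)
MODELS_USED = Counter(
    "truthlens_models_used", "Model usage in premium mode", ["model", "role"],
)

# Example: record one standard-mode detection that flagged a hallucination.
DETECTIONS.labels(decision="customer_support", mode="standard", detected="true").inc()
SCORE.labels(decision="customer_support", mode="standard").observe(0.72)
LATENCY.labels(mode="standard", operation="detect").observe(0.031)
```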
7. References
- Kovács, Á., & Recski, G. (2025). LettuceDetect: A Hallucination Detection Framework for RAG Applications. arXiv:2502.17125.
- Goel, A., Schwartz, D., & Qi, Y. (2025). Finch-Zk: Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency. arXiv:2508.14314.
- Lin, Z., Niu, Z., Wang, Z., & Xu, Y. (2024). Interpreting and Mitigating Hallucination in MLLMs through Multi-agent Debate. arXiv:2407.20505.
- Tran, K. T., et al. (2025). Multi-Agent Collaboration Mechanisms: A Survey of LLMs. arXiv:2501.06322.
- Manakul, P., Liusie, A., & Gales, M. J. (2023). SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. arXiv:2303.08896.
- Tang, L., et al. (2024). MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. EMNLP 2024.
- Madaan, A., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.
- Dhuliawala, S., et al. (2024). Chain-of-Verification Reduces Hallucination in Large Language Models. Findings of ACL 2024.
- Su, W., et al. (2024). Unsupervised Real-Time Hallucination Detection based on LLM Internal States (MIND). Findings of ACL 2024.
- Belyi, M., et al. (2025). Luna: A Lightweight Evaluation Model to Catch Language Model Hallucinations. COLING 2025.
Appendix A: Full System Flow Diagram
Appendix B: Glossary
| Term | Definition |
|---|---|
| Hallucination | LLM-generated content that is factually incorrect or unsupported by context |
| Intrinsic Hallucination | Content that contradicts the provided context |
| Extrinsic Hallucination | Content that cannot be verified from the provided context and is not common knowledge (common in RAG) |
| ExtProc | Envoy External Processor - enables request/response modification at the gateway |
| Token-Level Detection | Identifying specific tokens/spans that are hallucinated |
| Self-Refinement | Iterative process where the same model corrects its own hallucinations |
| Cross-Model Verification | Using multiple different models to verify factual consistency |
| Multi-Agent Debate | Multiple LLM agents argue positions to converge on factual truth |
| RAG | Retrieval-Augmented Generation - grounding LLMs with retrieved documents |
| ModernBERT | State-of-the-art encoder architecture with 8K context support |
| Accuracy-Latency-Cost Triangle | Fundamental trade-off in hallucination mitigation strategies |
| Convergence Threshold | Score below which hallucination is considered resolved |
Document Version: 1.0 | Last Updated: December 2025