
Reducing LLM Hallucinations During Test-Time Compute: A Coherence-Based Approach with Application to DeepSeek-R1

Writer: H. Peter Alesso

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing. However, they remain susceptible to "hallucinations," generating outputs that are factually incorrect, logically inconsistent, or nonsensical. Test-time compute offers a potential avenue for improving LLM performance by allowing additional computation during inference. This paper explores the limitations of current test-time compute methods in addressing hallucinations, highlighting the critical role of coherence in the LLM's reasoning process. We propose a novel approach that leverages logic graph representations of LLM reasoning to identify and mitigate non-coherent structures, which are a significant source of hallucinations. We then demonstrate how these principles can be integrated into the learning framework of DeepSeek-R1, a state-of-the-art reasoning model, by modifying its Group Relative Policy Optimization (GRPO) objective to explicitly incentivize coherence. This integration offers a pathway toward more reliable and trustworthy LLMs.


1. Introduction

Large Language Models (LLMs) have revolutionized natural language processing, exhibiting impressive performance across a wide range of tasks. Despite these advancements, LLMs are prone to generating "hallucinations" – outputs that are factually incorrect, logically inconsistent, or otherwise detached from reality. These hallucinations pose a significant challenge to the reliable deployment of LLMs in real-world applications.


Test-time compute, which involves performing additional computation during inference, has emerged as a promising technique for enhancing LLM performance. Unlike traditional inference, where a model produces a single output based on its learned parameters, test-time compute allows the model to refine its response through iterative processes such as self-revision, candidate verification, and reinforcement-learning-based policy optimization. However, while test-time compute can improve overall accuracy, it does not inherently eliminate the problem of hallucinations.


This paper argues that a key factor contributing to hallucinations is the lack of coherence in the LLM's underlying reasoning process. We propose a novel approach that focuses on identifying and mitigating non-coherent structure functions within the LLM's reasoning, represented as a logic graph. We then demonstrate how this coherence-based approach can be integrated into the learning framework of DeepSeek-R1, a state-of-the-art reasoning model, to reduce hallucinations and improve the reliability of its outputs.


2. Scaling Test-Time Compute and Its Limitations

Test-time compute refers to the computational resources used by an LLM during the inference stage, after the model has been trained. This is distinct from traditional inference, which relies solely on the model's pre-trained parameters. Test-time compute allows for dynamic refinement of the model's output. Common mechanisms include:

  • Refining the Proposal Distribution: The LLM iteratively improves its answer through guided self-revision, generating a sequence of revised outputs [1]. Each revision builds on insights from previous attempts, ideally leading to a more accurate and refined response.

  • Verifying Candidate Solutions:  The LLM generates multiple candidate outputs and evaluates them based on predefined criteria or a reward model [2]. This allows the selection of the output that best aligns with the desired objective.

  • Reinforcement-Learning-Based Policy Optimization: As used in DeepSeek-R1 [3] in the form of Group Relative Policy Optimization (GRPO), this involves optimizing the policy to maximize the expected reward.


While test-time compute empowers LLMs to "think harder" [4] and can improve performance, particularly in the face of "peak data" limitations [5], it is not a panacea for hallucinations. The model may still converge on incorrect or inconsistent solutions, even with increased computational effort.
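To make the candidate-verification mechanism above concrete, the following is a minimal sketch in Python of best-of-N selection against a verifier. The `generate` and `score` callables are hypothetical stand-ins for an LLM sampling call and a reward model or verifier; they are not part of any particular library's API.

```python
from typing import Callable, List, Tuple

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> Tuple[str, float]:
    """Sample n candidate answers and return the one the verifier scores highest.

    `generate` stands in for an LLM sampling call (one candidate per call);
    `score` stands in for a reward model / verifier rating a (prompt, candidate) pair.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scored = [(candidate, score(prompt, candidate)) for candidate in candidates]
    # Pick the candidate that best aligns with the verifier's objective.
    return max(scored, key=lambda pair: pair[1])
```

Spending more test-time compute here simply means increasing n; as noted above, a larger n does not by itself guarantee a coherent or factually correct winner.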


3. Knowledge Graphs, LLMs, and Coherence

Knowledge Graphs (KGs) provide a structured representation of knowledge, using nodes (entities) and edges (relationships). Integrating KGs with LLMs can improve coherence, reduce hallucinations, and enhance reasoning capabilities [6, 7, 8]. KGs supply contextual information and constraints that guide LLMs towards more logically sound outputs. Different KG types (general vs. domain-specific) offer distinct advantages, and graph databases are essential for efficient storage and retrieval of this graph-structured data.


There are two primary approaches to integrating KGs and LLMs:

  • LLM-assisted GNNs: LLMs enhance the performance of Graph Neural Networks (GNNs) by generating richer node features and improving graph perception [9].

  • GNN-assisted LLMs: GNNs incorporate structural knowledge into LLMs, improving reasoning and coherence. Examples include Graph Retrieval-Augmented Generation (GraphRAG) [11] and Graph of Thoughts (GoT) [9].
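As a simplified illustration of the GNN-assisted direction (closer in spirit to GraphRAG [11] than to a full GNN pipeline), the sketch below retrieves the one-hop neighborhood of entities mentioned in a query from a small triple store and serializes it as grounding context for the prompt. The triple store, helper names, and matching rule are illustrative assumptions, not any library's API.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def retrieve_subgraph(kg: List[Triple], query: str) -> List[Triple]:
    """Return triples whose subject or object is mentioned in the query (one-hop retrieval)."""
    q = query.lower()
    return [t for t in kg if t[0].lower() in q or t[2].lower() in q]

def build_grounded_prompt(kg: List[Triple], query: str) -> str:
    """Serialize the retrieved subgraph as context the LLM can condition on."""
    facts = "\n".join(f"- {s} {r} {o}" for s, r, o in retrieve_subgraph(kg, query))
    return f"Known facts:\n{facts}\n\nQuestion: {query}\nAnswer using only the facts above."

# Toy knowledge graph and query.
kg = [("Marie Curie", "won", "Nobel Prize in Physics"),
      ("Marie Curie", "born in", "Warsaw"),
      ("Warsaw", "capital of", "Poland")]
print(build_grounded_prompt(kg, "Where was Marie Curie born?"))
```

Constraining generation to retrieved facts is one way KG structure reduces the model's freedom to hallucinate unsupported claims.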


4. Causes of LLM Hallucinations

Hallucinations arise from a confluence of factors:

  • Data Quality: LLMs are trained on massive datasets that may contain inaccuracies, biases, or inconsistencies [6]. These imperfections can be learned and reflected in the model's output.

  • Model Design Limitations:  The architecture and algorithms of LLMs inherently involve a trade-off between fluency and accuracy [6, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21].

  • Prompt Issues: Ambiguous or poorly defined prompts can increase the likelihood of hallucinations [6].

  • Algorithmic Shortcomings: Test-time compute algorithms, such as beam search, can contribute to hallucinations by converging on suboptimal solutions [6].

  • Hallucinations During Test-Time Compute Reasoning: Test-time compute can exacerbate hallucinations through mechanisms such as process reward model (PRM) exploitation, difficulty-dependent issues, scaling problems, and verification challenges [1, 20, 21]. Repetitive step generation, overly short solutions, and overfitting to PRM signals are commonly observed failure modes.


5. Detecting and Mitigating Hallucinations: Existing Approaches

Current research focuses on detecting and mitigating hallucinations. Methods like semantic entropy [7] and semantic entropy probes (SEPs) [8] quantify the model's uncertainty, helping to identify potential hallucinations. Other mitigation strategies include:

  • Non-coherent Structure Functions: Identifying and eliminating non-coherent structure functions, which cause loops in logic diagrams, is crucial [22]. Decomposing these functions into coherent segments is a key mitigation strategy.
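As a concrete illustration of the semantic-entropy idea mentioned above [7, 8], the sketch below estimates uncertainty from a set of sampled answers. The `cluster_id` function is a hypothetical stand-in for the semantic-equivalence clustering used in that line of work (typically bidirectional entailment with an NLI model); high entropy over clusters signals uncertainty and elevated hallucination risk.

```python
import math
from collections import Counter
from typing import Callable, List

def semantic_entropy(answers: List[str],
                     cluster_id: Callable[[str], str]) -> float:
    """Estimate entropy over semantic-equivalence clusters of sampled answers.

    `cluster_id` maps an answer to a cluster label; here it is a stand-in for
    an entailment-based clustering step. Higher entropy means the model's
    samples disagree semantically, i.e. the model is uncertain.
    """
    counts = Counter(cluster_id(a) for a in answers)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log(p) for p in probs)

# Example: three paraphrases of one answer plus one contradicting answer,
# clustered by a trivial stand-in rule (lowercased, punctuation stripped).
samples = ["Paris", "paris", "Paris.", "Lyon"]
print(semantic_entropy(samples, cluster_id=lambda a: a.strip(".").lower()))
```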


6. Mitigating Hallucinations Through Coherence: A Logic Graph Approach

We propose a novel approach centered on the coherence of the LLM's reasoning process. We focus on identifying and eliminating non-coherent structure functions, which, like tangled wires in a logical circuit, lead to inconsistencies and errors [23].


6.1 Logic Graphs and Reachability Analysis

We represent the LLM's reasoning process using a logic graph. Nodes represent statements, concepts, or variables, and directed edges represent relationships (e.g., implication, dependence). Analyzing the connections and pathways in this graph allows us to uncover flaws in the reasoning.

The adjacency matrix represents the connections in the logic graph. A '1' indicates a direct relationship, and a '0' indicates no direct relationship. The reachability matrix (R) indicates all possible paths between nodes, derived through algorithms like Warshall's or Warren's. For graphs with logical connectives, the reachability solution is a linear combination of the reachability of the nodes (Rn) and the reachability of the logical AND connectives (Rl) [22]:

R = Rn + Rl (1)

Reachability analysis helps identify non-coherent structures, such as cycles or unreachable nodes, which are indicative of logical flaws.
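A minimal sketch of this reachability analysis in Python, using Warshall's algorithm on a Boolean adjacency matrix; the decomposition into Rn and Rl from [22] is not modeled here, and the toy graph is purely illustrative.

```python
from typing import List

def reachability(adj: List[List[int]]) -> List[List[int]]:
    """Warshall's algorithm: reach[i][j] = 1 iff there is a directed path i -> j."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return reach

def non_coherent_nodes(adj: List[List[int]], root: int = 0) -> dict:
    """Flag indicators of non-coherent structure: cycles and unreachable nodes."""
    reach = reachability(adj)
    n = len(adj)
    cycles = [i for i in range(n) if reach[i][i]]            # node can reach itself
    unreachable = [j for j in range(n) if j != root and not reach[root][j]]
    return {"cycles": cycles, "unreachable_from_root": unreachable}

# Toy logic graph: 0 -> 1 -> 2 -> 1 (a cycle); node 3 is never reached from node 0.
adj = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(non_coherent_nodes(adj))  # {'cycles': [1, 2], 'unreachable_from_root': [3]}
```

In the terminology above, a node that can reach itself sits on a cycle, and a node unreachable from the root corresponds to a claim with no supporting path in the reasoning graph.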


7. Beyond Identification: Active Coherence Enforcement

Identifying non-coherent structures is the first step. The next, crucial step is active coherence enforcement, which involves intervening in the model's reasoning process to restructure its internal logic. This is a more proactive approach than simply detecting hallucinations after they occur.


8. DeepSeek-R1: A Case Study

DeepSeek-R1 [3] is a reasoning model that uses reinforcement learning (specifically, Group Relative Policy Optimization or GRPO) and chain-of-thought reasoning. It learns through trial and error, guided by a reward signal. This makes it an ideal candidate for integrating our coherence-based approach.


9. Integrating Coherence into DeepSeek-R1's Learning Equation

The core of DeepSeek-R1's learning is the GRPO objective function, which optimizes the policy (πθ) to maximize the expected reward. The complete formulation and its derivation are given in the DeepSeek-R1 report [3].
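For reference, the GRPO objective as stated in [3] samples, for each question q, a group of G outputs {o1, ..., oG} from the old policy and optimizes (notation lightly adapted here; see [3] for the authoritative form):

\[
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\, A_i,\ \operatorname{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\ 1-\varepsilon,\ 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \right) \right]
\]

where ε and β are hyperparameters and the advantage Ai is computed group-relatively from the rewards {r1, ..., rG} of the sampled outputs:

\[
A_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}.
\]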

We propose modifying this objective to explicitly incentivize coherence. This involves:


9.1 Coherence-Enhanced Advantage Estimation (Âi)

  • Logic Graph Construction:  As DeepSeek-R1 generates its chain of thought, we construct a corresponding logic graph (G).

  • Reachability and Coherence Scoring:  After each generated token, we update G and compute its reachability matrix (R). We analyze R to identify non-coherent structures and derive a coherence score (Ci). Cycles or contradictions lead to a lower Ci.

  • Augmented Advantage: We combine the original advantage (Ai) with the coherence score (Ci):

    Âi = Ai + γCi (5)

    where γ is a hyperparameter controlling the weight of the coherence score; a minimal sketch of this computation follows below.
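The sketch assumes the group-relative advantage normalization of [3] and coherence scores Ci already computed from the logic graph (for example, from the cycle and reachability checks of Section 6.1). The helper names are illustrative, not DeepSeek-R1's implementation.

```python
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: normalize each reward against its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 0 else 1.0      # guard against constant-reward groups
    return [(r - mean) / std for r in rewards]

def coherence_augmented_advantages(rewards: List[float],
                                   coherence_scores: List[float],
                                   gamma: float = 0.1) -> List[float]:
    """Augmented advantage from Equation (5): A_hat_i = A_i + gamma * C_i."""
    advantages = group_relative_advantages(rewards)
    return [a + gamma * c for a, c in zip(advantages, coherence_scores)]

# Example: four sampled chains of thought with task rewards and coherence scores
# (a coherence score might be, e.g., the fraction of graph nodes free of cycles).
rewards = [1.0, 0.0, 1.0, 0.0]
coherence = [0.9, 0.2, 0.4, 0.8]
print(coherence_augmented_advantages(rewards, coherence, gamma=0.1))
# approximately [1.09, -0.98, 1.04, -0.92]
```

With γ > 0, two outputs earning the same task reward are no longer treated identically: the one whose logic graph is more coherent receives the larger advantage.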


9.2 Coherence Regularization Term

We introduce a "coherence divergence" (DC(πθ)) that measures the average incoherence of the logic graphs generated by the policy. The modified GRPO objective becomes:

The modified objective augments the standard GRPO objective with a penalty term −λ·DC(πθ), where the hyperparameter λ controls the strength of the coherence regularization.
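One natural form of the modified objective, written as a sketch under the assumption that the coherence divergence is estimated as the expected incoherence (one minus the coherence score C(o)) of outputs sampled from the policy:

\[
\mathcal{J}_{\mathrm{coh}}(\theta) = \mathcal{J}_{\mathrm{GRPO}}(\theta) - \lambda\, D_C(\pi_\theta), \qquad D_C(\pi_\theta) = \mathbb{E}_{o \sim \pi_\theta}\!\left[\, 1 - C(o) \,\right],
\]

so that increasing λ trades raw task reward against the logical consistency of the generated reasoning chains.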


10. Impact on Learning

These modifications incentivize the model to generate coherent reasoning chains. The coherence-enhanced advantage function guides the policy towards logically sound graphs, while the coherence regularization term penalizes deviations from coherence. This creates a feedback loop where the model learns to maximize both task-specific rewards and the internal consistency of its reasoning.


11. Conclusion

Hallucinations in LLMs remain a significant challenge. Test-time compute, while beneficial, is not a complete solution. This paper proposes a novel approach focusing on the coherence of the LLM's reasoning process, represented as a logic graph. By identifying and mitigating non-coherent structures, we can significantly reduce hallucinations.

We demonstrated how these principles can be integrated into DeepSeek-R1's learning framework, modifying its GRPO objective to explicitly incentivize coherence. This involves enhancing the advantage estimation with a coherence score and adding a coherence regularization term.

Future research should focus on:

  • Automating the identification and correction of non-coherent structures.

  • Investigating the relationship between different types of non-coherent structures and specific types of hallucinations.

  • Developing new metrics that assess the logical consistency of LLM outputs.

By focusing on the underlying logical structure of LLM reasoning and actively enforcing coherence, we can move towards creating more reliable, trustworthy, and insightful AI systems. The integration of coherence-based approaches, as exemplified by the potential application to DeepSeek-R1, represents a significant step towards this goal.


References

[1] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters,” arXiv preprint arXiv:2408.03314, 2024. https://arxiv.org/abs/2408.03314

[2] Cloud Security Alliance, “What is Test Time Compute?,” Cloud Security Alliance Blog, 2024. https://cloudsecurityalliance.org/blog/2024/12/13/test-time-compute

[3] DeepSeek-AI, “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025. https://arxiv.org/abs/2501.12948

[4] R. S. Munthe, “Understanding Test-Time Compute: A New Mechanism Allowing AI to ‘Think Harder’,” Medium, 2024. https://medium.com/@rendysatriadalimunthe/understanding-test-time-compute-a-new-mechanism-allowing-ai-to-think-harder-19e017abc540

[5] Forward Future, “What Is Test-Time Compute? Revolutionizing AI with Prolonged Thinking,” Forward Future, 2024. https://www.forwardfuture.ai/p/the-magic-of-prolonged-thinking-test-time-compute-part-1

[6] T. Balarabe, “Large Language Model Hallucinations,” Medium, 2024. https://medium.com/@tahirbalarabe2/large-language-model-hallucinations-14aad4ccc78e

[7] University of Oxford, “Major Research into ‘Hallucinating’ Generative Models Advances Understanding of How to Improve Their Safety,” University of Oxford News, 2024. https://www.ox.ac.uk/news/2024-01-10-major-research-hallucinating-generative-models-advances-understanding-how-improve

[8] J. Kossen et al., “Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs,” arXiv preprint arXiv:2406.15927, 2024. https://arxiv.org/abs/2406.15927

[9] X. Bresson et al., “Integrating Graphs with Large Language Models,” arXiv preprint arXiv:2402.05894, 2024. https://arxiv.org/abs/2402.05894

[10] NIST, “Hybrid-LLM-GNN: Integrating Large Language Models and Graph Neural Networks for Enhanced Materials Property Prediction,” NIST Publications, 2024. https://www.nist.gov/publications/hybrid-llm-gnn-integrating-large-language-models-and-graph-neural-networks-enhanced

[11] MarkTechPost, “Integrating Graph Structures into Language Models: A Comprehensive Study of GraphRAG,” MarkTechPost, 2024. https://www.marktechpost.com/2024/08/24/integrating-graph-structures-into-language-models-a-comprehensive-study-of-graphrag/

[12] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of Hallucination in Natural Language Generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1-38, 2023. https://dl.acm.org/doi/abs/10.1145/3571730

[13] N. McKenna et al., “Sources of Hallucination by Large Language Models on Inference Tasks,” arXiv preprint arXiv:2305.14552, 2023. https://arxiv.org/abs/2305.14552

[14] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv preprint arXiv:2201.11903, 2022. https://arxiv.org/abs/2201.11903

[15] A. Madaan et al., “Self-Refine: Iterative Refinement with Self-Feedback,” arXiv preprint arXiv:2303.17651, 2023. https://arxiv.org/abs/2303.17651

[16] D. Gosmar and D. A. Dahl, “Hallucination Mitigation Using Agentic AI Natural Language-Based Frameworks,” arXiv preprint arXiv:2501.13946, 2025. https://arxiv.org/abs/2501.13946

[17] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, and S. Riedel, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” arXiv preprint arXiv:2005.11401, 2020. https://arxiv.org/abs/2005.11401

[18] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language Models Can Teach Themselves to Use Tools,” arXiv preprint arXiv:2302.04761, 2023. https://arxiv.org/abs/2302.04761

[19] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, “Unifying Large Language Models and Knowledge Graphs: A Roadmap,” arXiv preprint arXiv:2306.08302, 2023. https://arxiv.org/abs/2306.08302

[20] S. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” arXiv preprint arXiv:2210.03629, 2022. https://arxiv.org/abs/2210.03629

[21] A. Azaria and T. Mitchell, “The Internal State of an LLM Knows When It’s Lying,” arXiv preprint arXiv:2304.13734, 2023. https://arxiv.org/abs/2304.13734

[22] H. P. Alesso, “Some Algebraic Aspects of Decomposed Non-Coherent Structure Functions,” Reliability Engineering, vol. 6, no. 2, pp. 105-117, 1983. https://www.sciencedirect.com/science/article/abs/pii/0143817483900094

[23] H. P. Alesso, P. Prassinos, and C. F. Smith, “Beyond Fault Trees to Fault Graphs,” Reliability Engineering, vol. 8, no. 3, pp. 173-184, 1984. https://www.sciencedirect.com/science/article/abs/pii/0143817485900861

[24] H. P. Alesso and H. J. Benson, “Fault Tree and Reliability Relationships for Analyzing Noncoherent Two-State Systems,” Nuclear Engineering and Design, vol. 56, no. 2, pp. 309-320, 1980. https://www.sciencedirect.com/science/article/abs/pii/0029549380901326

[25] H. P. Alesso, “On the Relationship of Digraph Matrix Analysis, Petri Net Theory and Fault Trees,” Reliability Engineering, vol. 10, no. 2, pp. 93-103, 1985. https://www.sciencedirect.com/science/article/abs/pii/0143817485900034

[26] H. P. Alesso and C. F. Smith, Thinking on the Web: Berners-Lee, Gödel and Turing, Wiley-Interscience, 2006. https://www.amazon.com/Thinking-Web-Berners-Lee-G%C3%B6del-Turing/dp/0471768146


 
 
 
