Recursive Think-Answer Process for LLMs and VLMs
Abstract
Recursive Think-Answer Process enables iterative reasoning cycles that improve accuracy and reduce self-reflective errors in language and vision-language models through confidence-based reinforcement learning.
Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursively Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way for more efficient and refined methods of improving the reasoning processes of future AI.
Community
🧠 Can models know when they are wrong—and try again?
Think–Answer models such as DeepSeek-R1 and OpenAI o1 sometimes produce self-reflective cues like “Oops” or “Let me reconsider,” suggesting internal uncertainty. However, even when this uncertainty is evident, the model does not actually revisit its reasoning: it still commits to a final answer, possibly an incorrect one, after a single reasoning pass.
💡 Core Idea - R-TAP (Recursive Think-Answer Process)
Instead of stopping after one Think–Answer pair, we enable models to:
1️⃣ Generate a Think–Answer
2️⃣ Estimate its own confidence via a dedicated Confidence Generator
3️⃣ Re-run reasoning if confidence is low
4️⃣ Stop early if confidence is sufficiently high
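The four steps above can be sketched as a simple inference loop. This is a minimal illustration, not the paper's implementation: the `think_answer` and `confidence` callables, the `threshold` of 0.8, and the `max_rounds` cap are all hypothetical stand-ins for the model, the Confidence Generator, and whatever stopping rule the authors actually use.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   think_answer(question, history) -> (thought, answer)
#   confidence(question, thought, answer) -> score in [0, 1]
History = List[Tuple[str, str]]
ThinkAnswerFn = Callable[[str, History], Tuple[str, str]]
ConfidenceFn = Callable[[str, str, str], float]


def recursive_think_answer(
    question: str,
    think_answer: ThinkAnswerFn,
    confidence: ConfidenceFn,
    threshold: float = 0.8,   # assumed stopping threshold
    max_rounds: int = 4,      # assumed cap on reasoning cycles
) -> Tuple[str, List[float]]:
    """Repeat Think-Answer cycles until confidence is high enough."""
    history: History = []
    confidences: List[float] = []
    for _ in range(max_rounds):
        # Step 1: generate a Think-Answer pair, conditioned on earlier attempts.
        thought, answer = think_answer(question, history)
        # Step 2: estimate confidence in this answer.
        score = confidence(question, thought, answer)
        history.append((thought, answer))
        confidences.append(score)
        # Step 4: stop early once confidence is sufficiently high.
        if score >= threshold:
            break
        # Step 3: otherwise fall through and reason again.
    # Return the most recent answer and the per-round confidence trace.
    return history[-1][1], confidences
```

Returning the confidence trace alongside the answer makes it easy to inspect how many cycles a question needed and whether confidence actually rose between them.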
🎁 During training, we introduce two confidence-driven rewards:
1️⃣ Recursive Confidence Increase Reward
→ Encourages confidence to improve across iterations
2️⃣ Final Answer Confidence Reward
→ Encourages high-confidence termination
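To make the two rewards concrete, here is one possible way to score a per-round confidence trace. The exact reward formulas are not given in this summary, so both functions below are assumptions chosen for illustration: the first rewards the fraction of rounds in which confidence rose, and the second rewards ending above an assumed threshold of 0.8.

```python
from typing import List


def confidence_increase_reward(confidences: List[float]) -> float:
    """Fraction of consecutive rounds in which confidence rose.

    Assumed form: 1.0 if confidence increased at every step,
    0.0 if it never increased or there was only one round.
    """
    if len(confidences) < 2:
        return 0.0
    pairs = zip(confidences, confidences[1:])
    increases = sum(1 for prev, cur in pairs if cur > prev)
    return increases / (len(confidences) - 1)


def final_confidence_reward(confidences: List[float], threshold: float = 0.8) -> float:
    """1.0 if the last round ended at or above the assumed threshold, else 0.0."""
    if not confidences:
        return 0.0
    return 1.0 if confidences[-1] >= threshold else 0.0
```

For example, a trace of `[0.5, 0.4, 0.9]` rose in one of its two steps and ended above the threshold, so under these assumed definitions it would earn 0.5 from the first reward and 1.0 from the second.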