Total relevant papers: 4
Paper selection prompt and criteria at the bottom
Table of contents with paper titles:
Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers
Title: Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving
arXiv ID: 2507.23726
Authors: Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu
Abstract: LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose \textbf{Seed-Prover}, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves $78.1\%$ of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine \textbf{Seed-Geometry}, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.
Comment: This paper proposes Seed-Prover, a model for automated theorem proving using long chain-of-thought and formal verification. It matches criterion 5 (long chains of thought and long-term reasoning in LLMs), as it discusses reinforcement learning with long chain-of-thought and strategies for deep and broad reasoning in mathematical proofs. A hedged sketch of the refinement loop follows this entry.
Relevance: 7 Novelty: 7
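The abstract describes a loop in which the prover iteratively refines a whole proof using Lean feedback, previously proved lemmas, and a self-summary. As a rough illustration only, here is a minimal Python sketch of such a refinement loop; the `prover_llm` and `lean_check` interfaces, the `ProofState` container, and the round budget are assumptions, not the paper's implementation.

```python
# Hedged sketch of a lemma-style, feedback-driven proving loop.
# `prover_llm`, `lean_check`, and the data shapes below are assumptions,
# not the paper's actual API.

from dataclasses import dataclass, field


@dataclass
class ProofState:
    theorem: str                                        # Lean statement to prove
    proved_lemmas: list = field(default_factory=list)   # verified sub-lemmas
    summary: str = ""                                    # model's self-summary of progress


def refine_proof(prover_llm, lean_check, theorem: str, max_rounds: int = 8):
    """Iteratively propose a whole proof, verify it in Lean, and feed the
    error messages, proved lemmas, and a self-summary back to the model."""
    state = ProofState(theorem=theorem)
    for _ in range(max_rounds):
        # 1) Propose a whole proof plus candidate lemmas (long chain-of-thought).
        attempt = prover_llm.propose(
            theorem=state.theorem,
            lemmas=state.proved_lemmas,
            summary=state.summary,
        )
        # 2) Verify the main proof and each candidate lemma with Lean.
        ok, errors = lean_check(attempt.proof)
        if ok:
            return attempt.proof
        for lemma in attempt.lemmas:
            lemma_ok, _ = lean_check(lemma)
            if lemma_ok:
                state.proved_lemmas.append(lemma)        # keep verified lemmas
        # 3) Self-summarize what worked and what Lean rejected.
        state.summary = prover_llm.summarize(attempt, errors, state.proved_lemmas)
    return None  # no verified proof within the budget
```

The paper's three test-time strategies presumably control how many such loops run and how lemmas are shared between them (depth vs. breadth); that scheduling is not modeled here.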
Title: Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
arXiv ID: 2507.22940
Authors: Rui Jiao, Yue Zhang, Jinku Li
Abstract: We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
Comment: This paper matches criterion 5, as it analyzes and improves the factual accuracy of intermediate reasoning steps in LLMs, with empirical studies on reasoning chains and activation-level analysis. It provides methodological insights into long chains of thought and reasoning robustness, which is directly relevant to inference-time reasoning and long CoT. A hedged sketch of the reward shaping follows this entry.
Relevance: 7 Novelty: 6
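The RELIANCE abstract combines a fact-checking classifier with GRPO training under multi-dimensional rewards (factuality, coherence, structural correctness). The following is a hedged Python sketch of how such a combined reward and group-relative advantage might be computed; the scorer callables, the weights, and the normalization details are illustrative assumptions rather than the paper's specification.

```python
# Hedged sketch: combine factuality/coherence/structure scores into one
# reward and compute GRPO-style group-relative advantages.
# The scorer callables and weights are illustrative assumptions.

from typing import Callable, Sequence


def combined_reward(chain: str,
                    fact_checker: Callable[[str], float],   # 0..1 factuality score
                    coherence: Callable[[str], float],       # 0..1 coherence score
                    structure: Callable[[str], float],       # 0..1 well-formedness score
                    weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted sum of per-dimension scores for one reasoning chain."""
    scores = (fact_checker(chain), coherence(chain), structure(chain))
    return sum(w * s for w, s in zip(weights, scores))


def group_relative_advantages(rewards: Sequence[float]) -> list:
    """GRPO normalizes each sampled completion's reward against the mean and
    standard deviation of its own sampling group (no learned value function)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards in the group are equal
    return [(r - mean) / std for r in rewards]
```

In a GRPO-style setup, each group of reasoning chains sampled for the same prompt would be scored with `combined_reward` and normalized with `group_relative_advantages` before the policy-gradient update.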
Title: How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding
arXiv ID: 2507.22928
Authors: Xi Chen, Aske Plaat, Niki van Stein
Abstract: Chain-of-thought (CoT) prompting boosts Large Language Models' accuracy on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
Comment: This paper provides a mechanistic interpretability study of Chain-of-Thought (CoT) reasoning in LLMs, analyzing how CoT affects internal model features and reasoning. This matches criterion 5 (long chains of thought, long-term reasoning, and empirical studies on reasoning in LLMs). A hedged sketch of the feature-patching step follows this entry.
Relevance: 7 Novelty: 6
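The core intervention in this paper is swapping a small set of sparse-autoencoder features from a CoT run into a noCoT run and measuring the change in the answer's log-probability. Below is a hedged PyTorch-style sketch of that feature swap; the `sae` encode/decode interface, the top-k selection rule, and the tensor shapes (residual-stream activations, `[seq_len, vocab]` logits) are assumptions, not the authors' code.

```python
# Hedged sketch of the feature swap at the core of SAE activation patching.
# The tensors stand in for residual-stream activations already captured from
# a CoT run and a noCoT run at the same layer; `sae` is an assumed interface
# with encode/decode methods, not the paper's code.

import torch


def swap_cot_features(sae, cot_resid: torch.Tensor, nocot_resid: torch.Tensor,
                      k: int = 16) -> torch.Tensor:
    """Copy the k strongest CoT SAE features into the noCoT activations and
    return the patched residual stream to write back into the model."""
    cot_feats = sae.encode(cot_resid)                        # monosemantic feature activations
    nocot_feats = sae.encode(nocot_resid)
    top = torch.topk(cot_feats.abs(), k, dim=-1).indices     # k strongest CoT features
    patched = nocot_feats.clone()
    patched.scatter_(-1, top, cot_feats.gather(-1, top))     # swap in the CoT features
    return sae.decode(patched)                               # decode back to residual space


def logprob_gain(patched_logits: torch.Tensor, baseline_logits: torch.Tensor,
                 answer_id: int) -> float:
    """Effect size in this setting: change in the log-probability the model
    assigns to the correct answer token after patching (logits: [seq, vocab])."""
    lp_patched = torch.log_softmax(patched_logits[-1], dim=-1)[answer_id]
    lp_base = torch.log_softmax(baseline_logits[-1], dim=-1)[answer_id]
    return (lp_patched - lp_base).item()
```

The paper's patch-curves and random-feature baselines would correspond to sweeping `k` and replacing `top` with randomly chosen feature indices, respectively.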
Title: Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers
arXiv ID: 2507.22921
Authors: Lee Harris
Abstract: Language models can capture complex relationships in given text, but these are notorious for being costly and for producing information that does not exist (i.e., hallucinations). Furthermore, the resources invested into producing this information would be wasted if it were incorrect. We address these issues by proposing, implementing, and applying the Language Model Chain (LMC) algorithm. In this, a language model's response to a given prompt about given text is only correct if it exists in the collection of possible (i.e., candidate) answers, and text corresponding to incorrect responses is fed into a more predictive (but slower) language model. This process is repeated for a collection of language models, or until all predictions about the text are correct. We used the LMC algorithm to extract patient dates of birth from medical documents, and combining a collection of language models in a multi-stage cascade significantly increased prediction speed and accuracy over individual language models, while greatly reducing the number of corresponding hallucinations. We believe that the novel LMC algorithm significantly contributes to the knowledge extraction field, and that this should be explored much further in the future.
Comment: This paper proposes a cascading language model chain (LMC) algorithm to accelerate knowledge extraction by routing between fast and slow models based on candidate answers. This matches criterion 6 (hybrid or adaptive reasoning in LLMs), as it dynamically switches between models for efficiency and accuracy. A hedged sketch of the cascade follows this entry.
Relevance: 6 Novelty: 6
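The LMC algorithm accepts a cheap model's answer only when it lies in a set of candidate answers, and escalates unresolved documents to slower but more predictive models. Here is a hedged Python sketch of that cascade; the callable model interface, the per-document candidate sets, and the return format are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch of a cascading Language Model Chain (LMC).
# `models` is an ordered list of callables from fastest to most predictive;
# each takes (prompt, text) and returns a string answer. These interfaces
# are illustrative assumptions, not the paper's implementation.

from typing import Callable, Dict, List, Optional, Set


def lmc_extract(models: List[Callable[[str, str], str]],
                prompt: str,
                documents: Dict[str, str],
                candidates: Dict[str, Set[str]]) -> Dict[str, Optional[str]]:
    """Run the fastest model on every document, keep only answers that appear
    in that document's candidate set, and escalate the rest to the next model."""
    results: Dict[str, Optional[str]] = {doc_id: None for doc_id in documents}
    pending = dict(documents)

    for model in models:                       # ordered fastest -> most predictive
        if not pending:
            break
        unresolved = {}
        for doc_id, text in pending.items():
            answer = model(prompt, text)
            if answer in candidates[doc_id]:   # accept only verifiable answers
                results[doc_id] = answer
            else:                              # likely hallucination: escalate
                unresolved[doc_id] = text
        pending = unresolved

    return results                             # None = the whole chain was exhausted
```

In the paper's date-of-birth task, the candidate set for a document would plausibly be the dates detected in its text; rejecting out-of-set answers is what suppresses hallucinations while letting most documents stop at the cheapest model.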
Paper selection prompt: In suggesting papers to your friend, remember that he enjoys papers on statistical machine learning and generative modeling in natural language processing. Your friend also likes learning about surprising empirical results in language models, as well as clever statistical tricks.