Personalized Daily Arxiv Papers 08/02/2025

Total relevant papers: 4

Paper selection prompt and criteria at the bottom

Table of contents with paper titles:

  1. Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

  2. Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

  3. How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

  4. Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers


1. Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving

arXiv ID: 2507.23726

Authors: Luoxin Chen, Jinming Gu, Liankai Huang, Wenhao Huang, Zhicheng Jiang, Allan Jie, Xiaoran Jin, Xing Jin, Chenggang Li, Kaijing Ma, Cheng Ren, Jiawei Shen, Wenlei Shi, Tong Sun, He Sun, Jiahui Wang, Siran Wang, Zhihong Wang, Chenrui Wei, Shufa Wei, Yonghui Wu, Yuchen Wu, Yihang Xia, Huajian Xin, Fan Yang, Huaiyuan Ying, Hongyi Yuan, Zheng Yuan, Tianyang Zhan, Chi Zhang, Yue Zhang, Ge Zhang, Tianyun Zhao, Jianqiu Zhao, Yichi Zhou, Thomas Hanwen Zhu

Abstract: LLMs have demonstrated strong mathematical reasoning abilities by leveraging reinforcement learning with long chain-of-thought, yet they continue to struggle with theorem proving due to the lack of clear supervision signals when solely using natural language. Dedicated domain-specific languages like Lean provide clear supervision via formal verification of proofs, enabling effective training through reinforcement learning. In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. Seed-Prover can iteratively refine its proof based on Lean feedback, proved lemmas, and self-summarization. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state-of-the-art by a large margin. To address the lack of geometry support in Lean, we introduce a geometry reasoning engine Seed-Geometry, which outperforms previous formal geometry engines. We use these two systems to participate in IMO 2025 and fully prove 5 out of 6 problems. This work represents a significant advancement in automated mathematical reasoning, demonstrating the effectiveness of formal verification with long chain-of-thought reasoning.
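
To make the loop in the abstract concrete, below is a minimal Python sketch of a lemma-style refinement loop of the kind Seed-Prover describes: propose a whole proof, verify it with Lean, and feed verifier errors, proved lemmas, and a self-summary into the next attempt. The callables (generate, verify, summarize) and the LeanResult container are hypothetical stand-ins, not the authors' API.

    # Minimal sketch of a lemma-style refinement loop (hypothetical helpers,
    # not the authors' implementation).
    from dataclasses import dataclass, field
    from typing import Callable, List, Optional

    @dataclass
    class LeanResult:
        success: bool
        errors: List[str] = field(default_factory=list)           # Lean compiler messages
        verified_lemmas: List[str] = field(default_factory=list)  # sub-lemmas Lean accepted

    def refine_proof(statement: str,
                     generate: Callable[[str, List[str], str], str],
                     verify: Callable[[str, str], LeanResult],
                     summarize: Callable[[str, List[str], str], str],
                     max_rounds: int = 8) -> Optional[str]:
        proved_lemmas: List[str] = []   # lemmas verified in earlier rounds, reusable later
        summary = ""                    # the model's own recap of prior attempts
        for _ in range(max_rounds):
            proof = generate(statement, proved_lemmas, summary)
            result = verify(statement, proof)      # formal check in Lean
            if result.success:
                return proof
            proved_lemmas.extend(result.verified_lemmas)
            summary = summarize(proof, result.errors, summary)
        return None  # no verified proof within the budget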

Comment: This paper proposes Seed-Prover, a model for automated theorem proving using long chain-of-thought and formal verification. It matches criterion 5 (long chains of thought and long-term reasoning in LLMs), as it discusses reinforcement learning with long chain-of-thought and strategies for deep and broad reasoning in mathematical proofs.

Relevance: 7 Novelty: 7


2. Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

arXiv ID: 2507.22940

Authors: Rui Jiao, Yue Zhang, Jinku Li

Abstract: We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
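
As a rough illustration of the GRPO component described above, the sketch below computes group-relative advantages from a weighted combination of factuality, coherence, and structure scores. The scorer callables and the weights are assumptions for illustration, not the paper's configuration.

    # Sketch of a multi-dimensional, group-relative reward (assumed weights and
    # scorers; not the RELIANCE implementation).
    import statistics
    from typing import Callable, List, Sequence, Tuple

    def group_relative_advantages(
            chains: Sequence[str],
            score_factuality: Callable[[str], float],
            score_coherence: Callable[[str], float],
            score_structure: Callable[[str], float],
            weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),  # hypothetical weighting
    ) -> List[float]:
        w_f, w_c, w_s = weights
        rewards = [w_f * score_factuality(c) + w_c * score_coherence(c) + w_s * score_structure(c)
                   for c in chains]
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group
        # each sampled chain is scored relative to its own sampling group (the "GR" in GRPO)
        return [(r - mean) / std for r in rewards]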

Comment: This paper matches criterion 5, as it analyzes and improves the factual accuracy of intermediate reasoning steps in LLMs, with empirical studies on reasoning chains and activation-level analysis. It provides methodological insights into long chains of thought and reasoning robustness, which is directly relevant to inference-time reasoning and long CoT.

Relevance: 7 Novelty: 6


3. How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

arXiv ID: 2507.22928

Authors: Xi Chen, Aske Plaat, Niki van Stein

Abstract: Chain-of-thought (CoT) prompting boosts Large Language Models' accuracy on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but also widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.
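
For intuition, the sketch below shows the core of a feature-level patching step of the kind the abstract describes: encode activations from the CoT and noCoT runs with a sparse autoencoder, copy the chosen feature activations from the CoT run into the noCoT run, and decode back. The sae.encode/sae.decode interface is an assumption, not the paper's code; the result would then be written back into the model via a forward hook.

    # Sketch of SAE feature patching (assumed encode/decode interface, not the paper's code).
    import torch

    def patch_sae_features(sae, h_nocot: torch.Tensor, h_cot: torch.Tensor,
                           feature_ids: list) -> torch.Tensor:
        """Return a noCoT activation in which the selected SAE features take
        their values from the CoT run."""
        f_nocot = sae.encode(h_nocot)   # sparse feature activations, noCoT run
        f_cot = sae.encode(h_cot)       # the same features under CoT prompting
        f_patched = f_nocot.clone()
        f_patched[..., feature_ids] = f_cot[..., feature_ids]  # swap the chosen features
        return sae.decode(f_patched)    # map back to the residual stream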

Comment: This paper provides a mechanistic interpretability study of Chain-of-Thought (CoT) reasoning in LLMs, analyzing how CoT affects internal model features and reasoning. This matches criterion 5 (long chains of thought, long-term reasoning, and empirical studies on reasoning in LLMs).

Relevance: 7 Novelty: 6


4. Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers

arXiv ID: 2507.22921

Authors: Lee Harris

Abstract: Language models can capture complex relationships in given text, but these are notorious for being costly and for producing information that does not exist (i.e., hallucinations). Furthermore, the resources invested into producing this information would be wasted if it were incorrect. We address these issues by proposing, implementing, and applying the Language Model Chain (LMC) algorithm. In this, a language model's response to a given prompt about given text is only correct if it exists in the collection of possible (i.e., candidate) answers, and text corresponding to incorrect responses is fed into a more predictive (but slower) language model. This process is repeated for a collection of language models, or until all predictions about the text are correct. We used the LMC algorithm to extract patient dates of birth from medical documents, and combining a collection of language models in a multi-stage cascade significantly increased prediction speed and accuracy over individual language models, while greatly reducing the number of corresponding hallucinations. We believe that the novel LMC algorithm significantly contributes to the knowledge extraction field, and that this should be explored much further in the future.
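
A minimal sketch of the cascade as the abstract describes it: try the fastest model first, accept its answer only if it appears in the candidate set, and escalate unresolved inputs to slower but more accurate models. The model callables and the candidate set are placeholders, not the author's implementation.

    # Sketch of a cascading Language Model Chain (hypothetical model callables).
    from typing import Callable, Iterable, Optional, Sequence

    def lmc_extract(prompt: str, text: str,
                    models: Sequence[Callable[[str, str], str]],  # ordered fast -> slow
                    candidates: Iterable[str]) -> Optional[str]:
        candidate_set = set(candidates)        # e.g. date strings found by a cheap first pass
        for model in models:
            answer = model(prompt, text)
            if answer in candidate_set:        # only answers in the candidate set count as valid
                return answer                  # stop early: a cheaper model sufficed
        return None                            # escalated through every model without a valid answer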

Comment: This paper proposes a cascading language model chain (LMC) algorithm to accelerate knowledge extraction by routing between fast and slow models based on candidate answers. This matches criterion 6 (hybrid or adaptive reasoning in LLMs), as it dynamically switches between models for efficiency and accuracy.

Relevance: 6 Novelty: 6



Paper selection prompt

  1. New methodological improvements or theoretical analysis about inference acceleration of large language models, especially about speculative decoding. Speculative decoding is a specific method for accelerating the inference of auto-regressive models, which drafts several future tokens efficiently and then verifies them in parallel.
    • Relevant: papers that discuss specific methods about inference acceleration of LLMs, or improving these methods, or analyzing them theoretically. Papers that explicitly mention speculative decoding are highly appreciated.
    • Not relevant: papers about adaptation to some task. Simply following instructions or inputs is not sufficient.
  2. New papers about GFlowNet. GFlowNet samples a diverse set of candidates in an active learning context, with a training objective that makes them approximately sample in proportion to a given reward function.
    • Relevant: papers that discuss specific methods like GFlowNet, or improving these methods, or analyzing them. Particularly those about combining GFlowNet with large language models.
    • Not relevant: any papers that do not consider GFlowNet.
  3. New improvements in tokenization or integrating tokenization with language models. Tokenization is the process of converting text into tokens, which are the smallest units of meaning for a language model. Improvements in tokenization can significantly impact the efficiency and performance of language models.
    • Relevant: papers that discuss novel methods for tokenization, improvements in existing tokenization techniques, or integrating tokenization with language models. This includes research on optimizing tokenizers for better model performance, new algorithms for tokenization, or case studies demonstrating the impact of tokenization on language model efficiency and accuracy. Papers that explore "token-free" or "tokenization-free" approaches are highly appreciated.
    • Not relevant: papers that primarily focus on applications of language models without discussing the underlying tokenization process, or papers that only mention tokenization in passing without contributing new methods or significant improvements.
  4. New methodological improvements or theoretical analysis in decoding techniques for large language models (LLMs). Decoding is the process used by language models to generate text based on the input sequence and the model's learned parameters.
    • Relevant: papers that discuss novel decoding methods, improvements to existing decoding techniques, or theoretical analyses of decoding processes in LLMs. This includes research on optimizing decoding strategies for better model performance, new algorithms for decoding, or studies demonstrating the impact of different decoding methods on language model outputs. Approaches that combine decoding with other model components to enhance overall efficiency and accuracy are also relevant. Papers that explicitly mention speculative decoding are highly appreciated.
    • Not relevant: papers that primarily focus on the applications of language models without discussing the decoding process in detail, or papers that only mention decoding in passing without contributing new methods or significant improvements.
  5. New methodological insights or empirical studies on inference-time scaling laws, long chains of thought (long CoT), or long-term reasoning in large language models. These areas explore how language models manage complex, multi-step reasoning tasks, particularly when generating extended outputs, and how their inference-time behavior scales with output length. The focus is on generating long, coherent outputs rather than handling long inputs or general long-context settings.
    • Relevant: papers that analyze inference-time scaling laws, propose or evaluate methods for improving long-term reasoning in LLMs (e.g., long CoT), or present surprising empirical findings about generating extended outputs. Papers that explicitly mention ChatGPT-o1 or similar concepts in their abstracts are especially worth attention. Studies exploring techniques to make long CoT reasoning more efficient, scalable, or coherent over extended outputs are highly appreciated.
    • Not relevant: papers focusing primarily on handling long input contexts, general long-context adaptation, or task-specific applications of reasoning without insights into inference-time scaling or extended output generation.
  6. New methods or analysis for hybrid or adaptive reasoning in large language models. Hybrid reasoning (also known as adaptive reasoning) refers to the ability of a single large language model to dynamically switch between different reasoning modes, typically a "fast" intuitive path (System 1) for simple queries and a "slow" deliberate, step-by-step reasoning path (System 2, e.g., Chain of Thought) for complex problems. The goal is to optimize the trade-off between computational cost, latency, and response quality. This switching can be triggered by explicit instructions, learned by the model based on query complexity, or guided by a separate router model.
    • Relevant: Papers that propose novel architectures or methods that enable a model to choose between different computational paths (e.g., direct generation vs. multi-step reasoning). Theoretical or empirical analyses of the trade-offs between different reasoning strategies and when one should be preferred over another.
    • Not relevant: Papers that simply apply a single reasoning strategy (like Chain of Thought) to a task without exploring the dynamic selection between multiple strategies.

In suggesting papers to your friend, remember that he enjoys papers on statistical machine learning, and generative modeling in natural language processing. Your friend also likes learning about surprising empirical results in language models, as well as clever statistical tricks.