Deliberate Reasoning in Language Models as Structure-Aware
Planning with an Accurate World Model
Abstract
Enhancing the reasoning capabilities of language models (LMs) remains a key challenge, especially for tasks that require complex, multi-step decision-making, where existing Chain-of-Thought (CoT) approaches struggle with consistency and verification. In this paper, we propose a novel reasoning framework, referred to as Structure-aware Planning with an Accurate World Model (SWAP), that integrates structured knowledge representation with learned planning. Unlike prior methods that rely purely on natural-language reasoning, SWAP leverages entailment graphs to encode structured dependencies and enable symbolic verification of intermediate steps. To systematically construct and update the graph, SWAP employs a policy model to propose candidate expansions and a world model to predict structural updates. To improve accuracy, the world model generates multiple alternative updates, and a discriminator re-ranks them based on plausibility. To encourage diverse exploration, we introduce Diversity-based Modelling (DM), which samples candidates from the remaining probability mass after removing previously sampled candidates from the original policy distribution. Additionally, SWAP improves discrimination accuracy through Contrastive Ranking (CR), which directly compares candidates within prompts and incorporates meta-knowledge to improve ranking quality. We evaluate SWAP across diverse reasoning-intensive benchmarks, including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP significantly improves upon the base models and consistently outperforms existing reasoning methods. Code and data are available at https://github.com/xiongsiheng/SWAP.
Siheng Xiong¹, Ali Payani², Yuan Yang¹* & Faramarz Fekri¹
¹Georgia Institute of Technology  ²Cisco Research
sxiong45@gatech.edu, apayani@cisco.com, mblackout@hotmail.com*, fekri@ece.gatech.edu
1 Introduction
Achieving human-level problem solving is regarded as the next milestone in Artificial General Intelligence (AGI) (Jaech et al., 2024). To enhance the reasoning capabilities of language models (LMs), the Chain-of-Thought (CoT) approach (Wei et al., 2022) is widely adopted due to its scalability and flexibility. However, CoT relies solely on natural language reasoning and lacks an effective verification mechanism, leading to inconsistencies and hallucinations in complex reasoning tasks. To address this limitation, formal methods such as first-order logic (Pan et al., 2023) and program-based (Chen et al., 2022) reasoning have been proposed. Despite their advantages, these methods often lack the expressiveness needed to generalize across diverse reasoning tasks (Yang et al., 2024). In this paper, we propose a semi-formal reasoning framework that integrates structured knowledge representation into the reasoning process, enabling symbolic verification of intermediate steps while maintaining flexibility. Our framework represents reasoning as the construction of entailment graphs (Dalvi et al., 2021), which explicitly capture how premises lead to intermediate conclusions, facilitating validation of claims. Each node in the entailment graph represents a statement, such as evidence, an assumption, or a lemma/rule, while each (hyper)edge represents an entailment relation, mapping a set of premises to a conclusion. For example, the statements "All people who regularly drink coffee are dependent on caffeine" and "Rina is a student who regularly drinks coffee" together entail the conclusion "Rina is dependent on caffeine".

Similar to multi-step reasoning, entailment graph construction can be framed as a sequential decision-making process. To systematically construct and update the graph, we introduce a planning framework, Structure-aware Planning with an Accurate World Model (SWAP), which consists of a policy model and a world model. In our formulation, the world state corresponds to the graph structure. Given the current state, the policy model proposes candidate expansions, while the world model predicts state updates. The accuracy of the LM-based world model is crucial, as it directly impacts the effectiveness of the policy model. Existing methods (Hao et al., 2023) implement the world model by prompting the same LM with in-context demonstrations, but this approach struggles with complex tasks. To address this limitation, we enhance the world model by sampling multiple alternative updates and using a discriminator to re-rank them based on plausibility. Beyond this, we identify two fundamental bottlenecks in planning-based reasoning: generation diversity and discrimination accuracy. To encourage diverse exploration, we introduce Diversity-based Modelling (DM), which samples candidates from the remaining probability mass after removing previously sampled candidates from the original policy distribution. Additionally, SWAP improves discrimination accuracy through Contrastive Ranking (CR), which directly compares candidates within prompts and incorporates meta-knowledge to improve ranking quality. We evaluate SWAP across a range of reasoning-intensive benchmarks, including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP not only significantly improves upon base models but also consistently outperforms existing reasoning methods.
Specifically, our main contributions include:
- We introduce structure-aware planning, integrating entailment graphs into multi-step reasoning tasks. These graphs explicitly capture how premises lead to intermediate conclusions, enabling symbolic verification and improving coherence in the reasoning process.
- We propose SWAP, a framework that enhances the policy model with an accurate world model. Additionally, we address two fundamental bottlenecks in planning-based reasoning, generation diversity and discrimination accuracy, through Diversity-based Modeling (DM) and Contrastive Ranking (CR), respectively.
- Extensive experiments on diverse benchmarks, including math reasoning, logical reasoning, and coding tasks, demonstrate that SWAP is a generalizable framework that consistently outperforms recent reasoning methods for LMs.
2 Preliminaries
2.1 Task formulation
We formulate the planning task as a Markov Decision Process (MDP), represented as a tuple $(\mathcal{S}, \mathcal{A}, P, R)$, where:
- State $s \in \mathcal{S}$: represents the current graph, capturing all known or inferred information along with entailment relations.
- Action $a \in \mathcal{A}$: corresponds to an expansion step, in which new information is derived or inferred from the current state, leading to a state transition.
- Transition probability $P(s_{t+1} \mid s_t, a_t)$: defines the probability of transitioning to state $s_{t+1}$ after taking action $a_t$ in state $s_t$.
- Reward $R(s_t, a_t)$: measures the quality of action $a_t$ given state $s_t$. While reward functions are typically used to select the best action, our framework instead directly compares different actions using a discriminator.
This MDP formulation establishes a foundation for applying planning-based methods to enhance the sequential decision-making capabilities of LMs. By iteratively updating their parameters, the models progressively learn an optimal policy, thereby improving their reasoning performance.
2.2 Structured reasoning as entailment graph construction
The key innovation that distinguishes our approach from related work is conceptualizing the multi-step reasoning process as entailment graph construction, which outlines how the premises in the question lead to intermediate conclusions, ultimately validating the final answer. Formally, let $G = (V, E)$ represent the structure, where $V$ is the set of nodes, with each node representing a statement, e.g., evidence, an assumption, or a lemma/rule, and $E$ is the set of directed (hyper)edges, where each (hyper)edge represents an entailment relation from a source node set (the premises) to a target node set (the conclusions).
Given the question, the world model first builds the initial graph $G_0$ by extracting key statements and their relations. During reasoning, the framework incrementally grows the graph by adding new nodes and edges, ultimately forming the final graph, which includes the final answer. Additionally, symbolic verification is introduced to ensure graph quality. For simplicity, we denote the state (in natural language) together with its structural information as $s_t$. Incorporating this structure provides two main benefits: 1) the policy model can make more informed decisions using the structural information; and 2) the world model can predict the next state more accurately.
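To make this representation concrete, below is a minimal Python sketch of an entailment graph with statement nodes and hyperedges from premise sets to conclusions. The class and field names are illustrative, not the paper's implementation, and the target side is simplified to a single conclusion node.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    node_id: str
    statement: str          # natural-language statement (evidence, assumption, lemma/rule)

@dataclass(frozen=True)
class HyperEdge:
    premises: frozenset     # ids of the source nodes (the premises)
    conclusion: str         # id of the target node (the derived conclusion)

@dataclass
class EntailmentGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # list of HyperEdge

    def add_statement(self, node_id: str, statement: str) -> None:
        self.nodes[node_id] = Node(node_id, statement)

    def add_entailment(self, premise_ids, conclusion_id: str) -> None:
        # every premise and the conclusion must already be statements in the graph
        assert conclusion_id in self.nodes and all(p in self.nodes for p in premise_ids)
        self.edges.append(HyperEdge(frozenset(premise_ids), conclusion_id))

# Example from Section 1: two premises jointly entail a conclusion.
g = EntailmentGraph()
g.add_statement("p1", "All people who regularly drink coffee are dependent on caffeine.")
g.add_statement("p2", "Rina is a student who regularly drinks coffee.")
g.add_statement("c1", "Rina is dependent on caffeine.")
g.add_entailment({"p1", "p2"}, "c1")
```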
3 Structure-aware Planning with an Accurate World Model
3.1 Framework
In this section, we present our framework, SWAP, which enables LMs to systematically construct and utilize an entailment graph for solving a wide range of reasoning tasks. We use $\pi_\theta$ to denote the policy, $M_{\mathrm{wm}}$ the world model, $M_{\mathrm{dis}}$ the discriminator, and $M_{\mathrm{ctrl}}$ the controller, all based on pre-trained LMs. We treat plans, actions, and states as language sequences, i.e., $a = (a_1, \ldots, a_{|a|})$ where each $a_i$ is a token, so that $\pi_\theta(a \mid s) = \prod_i \pi_\theta(a_i \mid s, a_{<i})$. We use $s_t$ to denote the state (in natural language) with structural information, and $c_t$ to denote the context consisting of the goal $g$, plan $p$, and state $s_t$.
For notational convenience, we define the generation process as $\mathrm{gen}(M, x, n)$, where $n$ is the number of generations; the symbolic verification process as $\mathrm{ver}(\cdot)$; and the discrimination process as $\mathrm{dis}(M, x, b)$, where $b$ is the number of preserved candidates. To search over potential plans and actions, we simulate future situations of the current state using the world model. Specifically, we use $\mathrm{sim}(s_t, t')$ to denote the simulation starting from $s_t$ up to step $t'$, given the goal $g$ and plan $p$ from the context $c_t$.
Algorithm 1 outlines the workflow. Given a reasoning question $q$, the world model first extracts the goal $g$ and the initial state $s_0$. The policy then proposes a set of candidate plans by sampling $n$ times from $\pi_\theta(p \mid g, s_0)$. The top $b$ candidate plans are selected by the discriminator based on the simulation results under each plan. Given the goal $g$, a selected plan $p$, and the current state $s_t$, expansion at step $t$ begins with the policy sampling $n$ times from $\pi_\theta(a_t \mid g, p, s_t)$ to form the next-action pool. The discriminator then evaluates these candidates and selects the top $b$ contexts (each comprising the goal, plan, current state, and proposed action) based on the simulated next states.
Accurate state prediction (Algorithm 2) is then performed in parallel for each selected context. Specifically, the world model predicts the next state $s_{t+1}$ by sampling $n$ times from $M_{\mathrm{wm}}(s_{t+1} \mid g, p, s_t, a_t)$, and the discriminator selects the top candidate state. Based on the selected $s_{t+1}$, the controller determines whether to continue reasoning. If the goal is achieved, the controller generates the final answer, stores the completed state in the completed pool, and reduces the breadth limit $b$ by 1. Otherwise, the updated context is added to the context pool for the next step. The process continues until the step limit is reached or $b$ reaches 0. Finally, the top answer is selected by the discriminator based on the completed states in the pool.
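To make the control flow of Algorithms 1 and 2 concrete, here is a simplified Python sketch of the search loop, assuming callable stand-ins `policy`, `world_model`, `discriminator`, and `controller` (hypothetical interfaces; the actual components are fine-tuned LMs, and the full algorithms additionally perform simulation-based plan ranking and symbolic verification).

```python
def swap_reasoning(question, policy, world_model, discriminator, controller,
                   n_samples=3, breadth=3, max_steps=6):
    """Simplified sketch of the SWAP loop: plan selection followed by iterative expansion."""
    goal, state = world_model.extract_goal_and_state(question)

    # Propose n candidate plans and keep the top-b judged by the discriminator.
    plans = [policy.propose_plan(goal, state) for _ in range(n_samples)]
    plans = discriminator.rank_plans(goal, state, plans)[:breadth]

    contexts = [(goal, plan, state) for plan in plans]   # active search frontier
    completed = []                                       # finished trajectories

    for _ in range(max_steps):
        if breadth == 0 or not contexts:
            break
        candidates = []
        for goal, plan, state in contexts:
            for _ in range(n_samples):
                action = policy.propose_action(goal, plan, state)
                candidates.append((goal, plan, state, action))
        # Keep the top-b contexts according to their simulated next states.
        candidates = discriminator.rank_actions(candidates)[:breadth]

        contexts = []
        for goal, plan, state, action in candidates:
            # World model samples several next states; discriminator keeps the best.
            next_states = [world_model.predict(goal, plan, state, action)
                           for _ in range(n_samples)]
            next_state = discriminator.rank_states(next_states)[0]
            if controller.goal_achieved(goal, next_state):
                completed.append(controller.final_answer(goal, next_state))
                breadth -= 1
            else:
                contexts.append((goal, plan, next_state))

    return discriminator.rank_answers(completed)[0] if completed else None
```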
3.2 Seeking diversity in action generation
We identify two fundamental bottlenecks in planning-based reasoning: generation diversity and discrimination accuracy. Enhancing diversity is crucial for expanding the solution space, increasing the likelihood of discovering globally optimal solutions. To address this, we propose a Diversity-based Modelling (DM) approach (Figure 2). The core idea is to encourage the policy to generate multiple candidate options for the same step, ensuring that each option differs from the previous ones, thereby mitigating self-bias.

Given the current state $s$, we define $\pi_\theta(\cdot \mid s)$ as the original distribution learned from positive trajectories during training. For the $k$-th generation, we aim to introduce diversity by considering an additional distribution $\pi_{\mathrm{eq}}$, which captures steps that are semantically equivalent to those generated previously, $a^{1}, \ldots, a^{k-1}$. Specifically, the adjusted probability of the $j$-th token in the $k$-th generation is

$$\tilde{\pi}\left(a^{k}_{j} \mid s, a^{k}_{<j}\right) = \pi_\theta\left(a^{k}_{j} \mid s, a^{k}_{<j}\right) - \alpha_j \, \pi_{\mathrm{eq}}\left(a^{k}_{j} \mid s, a^{<k}, a^{k}_{<j}\right), \qquad (1)$$

where $a^{k}$ denotes the $k$-th generation and, for notational simplicity, we write the token index as a subscript, so that $a^{k}_{<j}$ denotes the tokens preceding the $j$-th token $a^{k}_{j}$. The distribution $\pi_{\mathrm{eq}}$ is learned from training data generated by GPT-4o, where each pair consists of semantically equivalent actions.

The final distribution is given by:

$$\pi_{\mathrm{DM}}\left(a^{k}_{j} \mid s, a^{k}_{<j}\right) = \mathrm{Norm}\left(\tilde{\pi}\left(a^{k}_{j} \mid s, a^{k}_{<j}\right)\right), \qquad (2)$$

where the decay factor $\alpha_j$, which decreases as generation proceeds, is introduced to emphasize diversity in the early stages of generation while gradually reducing this effect. This ensures that the deduplication effect is stronger initially, encouraging exploration of different paths, but weakens over time to avoid drifting too far from plausible solutions, thereby maintaining accuracy.

The normalization function

$$\mathrm{Norm}(x)_v = \frac{\max(x_v, 0)}{\sum_{v'} \max(x_{v'}, 0)} \qquad (3)$$

is applied to discard negative-valued tokens (those that either resemble previous responses or deviate from the intended progression of reasoning) and to ensure a diverse and relevant response. The selection of this function is further discussed in Appendix A.
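As a sanity check on the reconstruction above, the following NumPy sketch implements the DM adjustment and clipping normalization for a single token position and reproduces the worked example in Appendix A; the function name and inputs are illustrative.

```python
import numpy as np

def dm_token_distribution(p_policy, p_equiv, alpha):
    """Diversity-adjusted next-token distribution (sketch of Eqs. 1-3).

    p_policy : np.ndarray, original next-token probabilities over the vocabulary
    p_equiv  : np.ndarray, probability of being semantically equivalent to
               previously sampled candidates (same shape as p_policy)
    alpha    : float, decayed diversity weight for the current token position
    """
    adjusted = p_policy - alpha * p_equiv        # Eq. (1): penalize near-duplicates
    clipped = np.maximum(adjusted, 0.0)          # Eq. (3): drop negative-valued tokens
    total = clipped.sum()
    # If everything is clipped, fall back to the original policy distribution.
    return clipped / total if total > 0 else p_policy

# Worked example from Appendix A: vocabulary ["Let", "Simplify", "Consider", "Set"]
p_policy = np.array([0.6, 0.2, 0.2, 0.0])
p_equiv  = np.array([0.5, 0.0, 0.3, 0.2])
print(dm_token_distribution(p_policy, p_equiv, alpha=1.0))  # ~[0.33, 0.67, 0., 0.]
```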
3.3 Improving discrimination accuracy in reasoning
As highlighted in recent works (Huang et al., 2023; Chen et al., 2024b), discrimination accuracy is a critical aspect of planning-based methods. However, training an effective process reward model (PRM) is challenging, as it reduces each candidate option to a single numerical score, oversimplifying complex decision-making aspects. Moreover, those raw scores may not be inherently comparable without proper calibration and additional context.
To address this issue, our discriminator employs Contrastive Ranking (CR) to evaluate multiple candidate options within a single prompt. By directly comparing these options, the discriminator can distinguish nuanced differences, particularly identifying erroneous components, thereby simplifying the evaluation process. To illustrate the automatic annotation process (Figure 3), consider a positive trajectory that leads to the correct final answer. We randomly select an intermediate step $a_t$ and generate alternative reasoning trajectories $(a_t^{(i)}, \ldots, a_{T_i}^{(i)})$, where $T_i$ represents the length of the $i$-th trajectory. Among these trajectories, we first filter out invalid ones through symbolic verification. The first erroneous step in each remaining negative trajectory is then identified by semantically comparing its steps with the corresponding steps in the positive trajectory. We further perform completions for outcome verification: if none of the completions results in the correct answer, we confirm the identified step as erroneous.
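The following Python sketch outlines this annotation procedure; all callables are hypothetical stand-ins for the generator, symbolic verifier, semantic-equivalence check, and outcome-verification rollouts.

```python
import random

def annotate_process(positive_traj, generate_alternative, verify_graph,
                     semantically_equal, any_completion_correct,
                     num_alternatives=4):
    """Sketch of the automatic process-annotation procedure described above.

    generate_alternative(prefix)       -> full alternative trajectory from a prefix
    verify_graph(traj)                 -> symbolic verification of a trajectory
    semantically_equal(step_a, step_b) -> semantic equivalence of two steps
    any_completion_correct(prefix)     -> True if some rollout from the prefix
                                          still reaches the correct final answer
    Returns (index, erroneous_step) pairs usable as contrastive training data.
    """
    t = random.randrange(len(positive_traj))            # pick an intermediate step
    annotations = []
    for _ in range(num_alternatives):
        alt = generate_alternative(positive_traj[:t])
        if not verify_graph(alt):                        # filter invalid trajectories
            continue
        # First erroneous step: earliest step not equivalent to the positive one.
        first_err = next(
            (i for i in range(t, len(alt))
             if i >= len(positive_traj)
             or not semantically_equal(alt[i], positive_traj[i])),
            None,
        )
        # Outcome verification: confirm only if no completion recovers the answer.
        if first_err is not None and not any_completion_correct(alt[:first_err + 1]):
            annotations.append((first_err, alt[first_err]))
    return annotations
```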
Table 1: Overall results (accuracy %, mean ± standard deviation) on math reasoning (GSM8K, MATH500), logical reasoning (FOLIO, ReClor), and coding (HumanEval, MBPP) benchmarks with two base models.

Math Reasoning | Logical Reasoning | Coding | | | |
---|---|---|---|---|---|---
Method | GSM8K | MATH500 | FOLIO | ReClor | HumanEval | MBPP
LLaMA3-8B-Instruct | | | | | |
Zero-shot CoT | 76.9 ± 0.2 | 27.8 ± 0.1 | 60.4 ± 0.6 | 57.8 ± 0.3 | 60.3 ± 0.3 | 58.8 ± 0.1
Few-shot CoT (4-shot) | 77.6 ± 0.2 | 25.8 ± 0.3 | 64.6 ± 0.4 | 63.9 ± 0.2 | 59.6 ± 0.1 | 59.8 ± 0.2
SFT-CoT | 77.2 ± 0.1 | 27.0 ± 0.2 | 66.0 ± 0.4 | 64.2 ± 0.3 | 58.6 ± 0.2 | 60.0 ± 0.2
Self-consistency (@maj32) | 87.3 ± 0.2 | 30.6 ± 0.3 | 69.0 ± 0.8 | 71.1 ± 0.5 | - | -
ToT | 83.0 ± 0.2 | 25.0 ± 0.3 | 68.7 ± 0.8 | 69.0 ± 0.4 | - | -
RAP | 86.2 ± 0.3 | 26.2 ± 0.3 | 70.8 ± 0.6 | 70.4 ± 0.4 | - | -
PRM (PRM800K) | 89.0 ± 0.2 | 34.6 ± 0.2 | - | - | - | -
PRM (Math-Shepherd) | 90.2 ± 0.1 | 33.4 ± 0.3 | - | - | - | -
SWAP (w/o planning) | 80.1 ± 0.3 | 32.0 ± 0.2 | 69.1 ± 0.4 | 67.4 ± 0.3 | 62.2 ± 0.2 | 61.4 ± 0.3
SWAP | 92.6 ± 0.2 | 43.2 ± 0.3 | 79.2 ± 0.5 | 78.1 ± 0.4 | 68.8 ± 0.4 | 68.0 ± 0.3
Mistral-7B-Instruct | | | | | |
Zero-shot CoT | 25.6 ± 0.1 | 12.8 ± 0.6 | 49.5 ± 0.6 | 54.0 ± 0.1 | 42.3 ± 1.1 | 38.8 ± 0.4
Few-shot CoT (4-shot) | 51.7 ± 0.4 | 15.0 ± 0.7 | 57.0 ± 1.1 | 43.1 ± 0.6 | 43.6 ± 0.6 | 44.8 ± 0.6
SFT-CoT | 52.4 ± 0.2 | 16.4 ± 0.4 | 59.0 ± 0.5 | 56.2 ± 0.2 | 43.8 ± 0.4 | 45.8 ± 0.4
Self-consistency (@maj32) | 69.9 ± 0.3 | 20.6 ± 0.6 | 61.5 ± 0.7 | 53.4 ± 0.2 | - | -
ToT | 58.5 ± 0.6 | 17.3 ± 0.4 | 58.7 ± 0.6 | 50.2 ± 0.4 | - | -
RAP | 72.4 ± 0.4 | 18.8 ± 0.5 | 62.0 ± 0.4 | 54.6 ± 0.3 | - | -
PRM (PRM800K) | 74.2 ± 0.3 | 24.1 ± 0.4 | - | - | - | -
PRM (Math-Shepherd) | 76.0 ± 0.4 | 22.8 ± 0.6 | - | - | - | -
SWAP (w/o planning) | 57.0 ± 0.2 | 19.4 ± 0.4 | 62.0 ± 0.6 | 58.2 ± 0.2 | 45.0 ± 0.3 | 46.2 ± 0.4
SWAP | 83.2 ± 0.4 | 28.0 ± 0.5 | 71.2 ± 0.4 | 67.9 ± 0.4 | 52.2 ± 0.4 | 51.6 ± 0.3



Given the process annotations, we define the inputs and outputs of the discriminator while incorporating meta knowledge $K$, such as common pitfalls and errors, to improve ranking quality. Specifically,

$$\mathrm{input} = \left(K, \; g, \; p, \; s_t, \; \left\{\left(a_t^{(i)}, \hat{s}^{(i)}\right)\right\}_{i}\right), \qquad (4)$$

$$\mathrm{output} = \left(e, \; a_t^{\mathrm{best}}, \; \hat{s}^{\mathrm{best}}\right), \qquad (5)$$

where $e$ denotes an explanation highlighting differences among the future states before a decision is made. The superscript 'best' indicates the final selected option, i.e., $a_t^{\mathrm{best}}$ and $\hat{s}^{\mathrm{best}}$. For action ranking, we use the simulated immediate next state, i.e., $\hat{s}^{(i)} = \hat{s}_{t+1}^{(i)}$, as the action only needs to align with the plan $p$. For plan ranking, we use the simulated terminal state, i.e., $\hat{s}^{(i)} = \hat{s}_{T}^{(i)}$. For training data, we select a pair of correct and incorrect options from the positive and negative trajectories of the same question. We then fine-tune the discriminator using this data, along with meta-knowledge and explanations bootstrapped from GPT-4o. Further discussion of the methodology and implementation details can be found in Appendices A and D.
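For illustration, here is a minimal sketch of how such a contrastive query might be assembled into a prompt; the template wording and field layout are assumptions rather than the paper's released format.

```python
def build_ranking_prompt(meta_knowledge, goal, plan, state, candidates):
    """Assemble a contrastive-ranking prompt (illustrative template only).

    candidates: list of (action, simulated_next_state) pairs to compare.
    The discriminator is asked for an explanation of the differences
    followed by the index of the best option, mirroring Eqs. (4)-(5).
    """
    lines = [
        "You are comparing candidate reasoning steps.",
        f"Common pitfalls for this problem type:\n{meta_knowledge}",
        f"Goal: {goal}",
        f"Plan: {plan}",
        f"Current state: {state}",
        "Candidate options:",
    ]
    for i, (action, next_state) in enumerate(candidates, start=1):
        lines.append(f"Option {i}: action = {action}; simulated next state = {next_state}")
    lines.append("First explain the key differences among the simulated states, "
                 "then answer with the number of the best option.")
    return "\n".join(lines)
```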
During inference, given a set of candidate options, we apply points-based ranking to select the top $b$ candidates, denoted as $\mathrm{dis}(\mathrm{model}, \mathrm{input}, b)$ in Algorithms 1 and 2. To manage computational complexity, we randomly sample comparison groups, ensuring that the total number of comparisons remains linear in the number of candidates. To further enhance robustness, we randomly reorder group members before evaluation. Additionally, we perform symbolic verification prior to invoking the discriminator. Further details on these strategies are provided in Appendix A.
4 Experiments
4.1 Experimental setup
In this section, we demonstrate the versatility and effectiveness of our framework by applying it to a diverse set of reasoning tasks. Dataset statistics and examples are provided in Appendix B. The key hyperparameters of SWAP are the DM diversity weight and decay factor, and the CR comparison group sizes for training and inference. To enhance training effectiveness, we employ advanced training strategies such as curriculum learning and self-improving training for supervised fine-tuning, followed by preference learning for further optimization. We evaluate SWAP against popular reasoning strategies, including CoT and Self-consistency (Wang et al., 2023b), ToT (Yao et al., 2023) and RAP (Hao et al., 2023), supervised fine-tuning (SFT) on CoTs, and verification with PRMs (Lightman et al., 2023; Wang et al., 2023a). We conduct experiments using two different base models: LLaMA3-8B-Instruct (Dubey et al., 2024) and Mistral-7B-Instruct (Jiang et al., 2023). Implementation settings are as follows: Self-consistency and PRMs use 32 rollouts. For SWAP, ToT, and RAP, the generation limit is 3, the breadth limit is 3, and the total number of rollouts is 32. The step limit is 10 for MATH500 and 6 for all other datasets. The temperature for these methods is set to 0.7. More implementation details are provided in Appendices C and D.
Table 2: Ablation study of SWAP with LLaMA3-8B-Instruct (accuracy %).

Math Reasoning | Logical Reasoning | Coding | | | |
---|---|---|---|---|---|---|
Method | GSM8K | MATH500 | FOLIO | ReClor | HumanEval | MBPP |
SWAP (Ours) | 92.6 | 43.2 | 79.2 | 78.1 | 68.8 | 68.0 |
w/o structure info | 90.2 | 41.6 | 75.8 | 76.2 | 67.4 | 67.0 |
w/o state pred refinement | 88.0 | 38.4 | 76.3 | 74.2 | 67.0 | 66.8 |
w/o DM | 87.0 | 36.8 | 74.0 | 75.6 | 65.2 | 64.6 |
w/o meta knowledge | 91.0 | 42.1 | 78.0 | 77.2 | 67.6 | 67.0 |
w/o planning | 80.1 | 32.0 | 69.1 | 67.4 | 62.2 | 61.4 |
4.2 Main results
The overall performance is shown in Table 1, with fine-grained results and example analyses provided in Appendices E and F, respectively. The key findings are summarized as follows:
SWAP consistently outperforms all other reasoning methods. Verification approaches, such as self-consistency and PRMs, do not search through intermediate steps during reasoning. In contrast, SWAP enables the model to engage in structured, conscious planning, similar to human reasoning, which is particularly crucial for avoiding intermediate errors.
Structured representation and an accurate world model further enhance planning effectiveness. Planning-based methods such as ToT and RAP fall short in performance compared to our approach. They lack the structured understanding and precise state modelling that SWAP provides. SWAP explicitly incorporates structure to capture relationships between key statements, providing priors and enabling symbolic verification. Additionally, Diversity-based Modeling (DM) enables the framework to explore a broader solution space, increasing the likelihood of discovering optimal steps. Meanwhile, Contrastive Ranking (CR) significantly enhances discrimination accuracy by focusing on nuanced differences between candidate options.
4.3 Analysis
We examine the impact of total rollouts and breadth limit on overall accuracy (Figure 4), providing insights into optimal parameter selection and demonstrating inference-time scaling.
Multiple rollouts improve accuracy but with diminishing returns. Increasing the number of rollouts generally enhances accuracy across all datasets, but the gains taper off as rollouts increase. Beyond a certain point, additional rollouts offer minimal improvement, making computational efficiency a key consideration.
Expanding breadth improves accuracy, but with diminishing returns at higher limits. Accuracy increases for both models as the breadth limit grows across all datasets, enhancing candidate selection. However, the gains are steepest at lower breadth values, with improvements becoming incremental beyond a certain point.
Table 3: Efficiency comparison on GSM8K with LLaMA3-8B-Instruct: average number of generated tokens per question versus accuracy.

GSM8K | |
---|---|---|
Method | Avg. generated tokens | Acc (%) |
Zero-shot CoT | 0.17k | 76.9 |
Few-shot CoT (4-shot) | 0.15k | 77.6 |
Self-consistency (@maj32) | 5.2k | 87.3 |
ToT | 10.8k | 83.0 |
RAP | 18.6k | 86.2 |
SWAP (w/o planning) | 0.31k | 80.1 |
SWAP | 12.4k | 92.6 |
4.4 Ablation study
We analyze the impact of the key components introduced in this paper (Table 2). The complete framework achieves the highest performance across all tasks, demonstrating that each component positively contributes to overall accuracy. Notably, planning has the most significant impact by effectively selecting optimal actions. These improvements are consistently observed across different benchmarks, highlighting the robustness of SWAP.
4.5 Efficiency study
We evaluate the average number of tokens generated by different methods on the GSM8K dataset using LLaMA3-8B-Instruct. The results are summarized in Table 3. We observe that while the theoretical time complexity of SWAP is comparable to that of ToT (BFS with pruning), SWAP generates more tokens in practice due to the incorporation of a world model and the construction of entailment graphs. On the other hand, SWAP is significantly more efficient than RAP, which requires simulation to terminal states in order to estimate expected future rewards.
5 Related Work
Advanced planning methods for enhancing the multi-step problem-solving capabilities of language models (LMs) can be categorized into three main approaches: re-ranking (Wang et al., 2023b; Li et al., 2023; Lei et al., 2024), iterative correction (Madaan et al., 2023; Shinn et al., 2023; Yao et al., 2022; Chen et al., 2024a) and tree search (Gu et al., 2023; Hao et al., 2023; Yao et al., 2023; Zhou et al., 2023). Despite differences in design, these methods fundamentally rely on a discriminator to evaluate planning steps. Recent studies (Huang et al., 2023; Chen et al., 2024b) have demonstrated that the discriminator plays a crucial role in planning-based reasoning. As a result, using in-context learning to prompt the same LM as both the generator and discriminator is often insufficient for improving model performance on complex reasoning tasks.
To address this limitation, prior research has explored various discriminator (reward model) designs. There are two primary types of reward models. The Outcome Reward Model (ORM) evaluates the fully generated solution by assigning a single scalar confidence score and is trained through outcome supervision, comparing generated answers with the ground truth. The Process Reward Model (PRM) provides stepwise rewards throughout the reasoning process, assigning a scalar confidence score to each intermediate step (Lightman et al., 2023; Yuan et al., 2024; Tian et al., 2024). Empirical evidence suggests that process supervision offers greater benefits to multi-step reasoning than outcome supervision, as it ensures the correctness of each intermediate step (Lightman et al., 2023). However, training a PRM requires process supervision, which is difficult to obtain, as manual process annotations are inherently not scalable. Although recent research (Wang et al., 2023a; Luo et al., 2024) has explored automatic process annotation via tree search, training an effective PRM remains challenging. A key limitation is that PRMs reduce each candidate option to a single numerical score, which oversimplifies complex decision-making. Moreover, such raw scores may not be directly comparable without proper calibration and additional context.
To address these challenges, we introduce Contrastive Ranking (CR), which compares multiple candidate options within prompts rather than evaluating them in isolation. By directly comparing options, the model can: distinguish nuanced differences between candidates; identify erroneous components more effectively; and simplify the evaluation process, leading to improved discrimination accuracy. Additionally, we enhance the automatic annotation process by incorporating semantic equivalence checks and symbolic verification.
6 Conclusion
In this paper, we propose SWAP, a novel framework that enhances the reasoning capabilities of LMs through structure-aware planning with an accurate world model. Extensive experiments demonstrate that SWAP consistently outperforms existing methods, achieving significant improvements across reasoning-intensive benchmarks. Our approach primarily adopts a re-ranking strategy, which balances computational efficiency and model performance. For future work, exploring reinforcement learning to enable dynamic interaction between the policy model and the world model, along with tool integration, could further optimize LMs for solving complex, long-horizon real-world problems. Additionally, teaching models to self-identify and correct mistakes presents a promising direction for enhancing robustness within our framework.
Limitations
Planning-based methods, including SWAP, enhance deliberative reasoning in language models. However, generating multiple candidate options and comparing them for each intermediate step can be resource-intensive. Future work could focus on optimizing the procedure or exploring the trade-offs between accuracy and efficiency. One promising direction is to develop adaptive strategies that dynamically adjust the number of candidates based on task difficulty or the model’s confidence at each reasoning step. Additionally, resource-aware algorithms could be explored to explicitly balance computational cost and accuracy. To further improve efficiency, lightweight model architectures could be leveraged to enhance the discriminator, enabling it to quickly and accurately compare candidates.
Ethics Statement
In this paper, we fine-tune language models using publicly available benchmarks intended for research purposes. Additionally, we employ multiple language models to generate text based on these benchmarks. While the authors of these models are committed to addressing ethical considerations, generated content may still contain improper or harmful material. None of such content reflects the opinions of the authors.
Acknowledgments
This work is supported in part by the DARPA SciFy program, Award No. HR001125C0302, and by Cisco Systems, Inc.
References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. Preprint, arXiv:2107.03374.
- Chen et al. (2022) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588.
- Chen et al. (2024a) Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024a. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations.
- Chen et al. (2024b) Ziru Chen, Michael White, Raymond Mooney, Ali Payani, Yu Su, and Huan Sun. 2024b. When is tree search useful for llm planning? it depends on the discriminator. arXiv preprint arXiv:2402.10890.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Dalvi et al. (2021) Bhavana Dalvi, Peter Jansen, Oyvind Tafjord, Zhengnan Xie, Hannah Smith, Leighanna Pipatanangkura, and Peter Clark. 2021. Explaining answers with entailment trees. arXiv preprint arXiv:2104.08661.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Gu et al. (2023) Yu Gu, Xiang Deng, and Yu Su. 2023. Don’t generate, discriminate: A proposal for grounding language models to real-world environments. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4928–4949.
- Han et al. (2022) Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Luke Benson, Lucy Sun, Ekaterina Zubova, Yujie Qiao, Matthew Burtell, David Peng, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Shafiq Joty, Alexander R. Fabbri, Wojciech Kryscinski, Xi Victoria Lin, Caiming Xiong, and Dragomir Radev. 2022. Folio: Natural language reasoning with first-order logic. arXiv preprint arXiv:2209.00840.
- Hao et al. (2023) Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. 2023. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. NeurIPS.
- Hu et al. (2023) Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, and Nikolay Malkin. 2023. Amortizing intractable inference in large language models. arXiv preprint arXiv:2310.04363.
- Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
- Huang et al. (2023) Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2023. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798.
- Jaech et al. (2024) Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720.
- Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
- Lei et al. (2024) Bin Lei, Yi Zhang, Shan Zuo, Ali Payani, and Caiwen Ding. 2024. Macm: Utilizing a multi-agent system for condition mining in solving complex mathematical problems. Preprint, arXiv:2404.04735.
- Li et al. (2023) Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023. Making language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5315–5333, Toronto, Canada. Association for Computational Linguistics.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. arXiv preprint arXiv:2305.20050.
- Liu et al. (2024a) Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. 2024a. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.
- Liu et al. (2024b) Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024b. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
- Luo et al. (2024) Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, et al. 2024. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592.
- Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-refine: Iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Information Processing Systems.
- Pan et al. (2023) Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. 2023. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295.
- Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. 2023. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems.
- Tian et al. (2024) Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. 2024. Toward self-improvement of llms via imagination, searching, and criticizing. arXiv preprint arXiv:2404.12253.
- Vijayakumar et al. (2016) Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
- Wang et al. (2023a) Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, and Zhifang Sui. 2023a. Math-shepherd: A label-free step-by-step verifier for llms in mathematical reasoning. arXiv preprint arXiv:2312.08935.
- Wang et al. (2023b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
- Yang et al. (2024) Yuan Yang, Siheng Xiong, Ali Payani, Ehsan Shareghi, and Faramarz Fekri. 2024. Can LLMs reason in the wild with programs? In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9806–9829, Miami, Florida, USA. Association for Computational Linguistics.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: deliberate problem solving with large language models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 11809–11822.
- Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
- Yu et al. (2020) Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. 2020. Reclor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations (ICLR).
- Yuan et al. (2024) Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, et al. 2024. Advancing llm reasoning generalists with preference trees. arXiv preprint arXiv:2404.02078.
- Zhou et al. (2023) Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. 2023. Language agent tree search unifies reasoning acting and planning in language models. Preprint, arXiv:2310.04406.
Appendix A Further Methodological Details
Normalization function selection. The normalization function in Eq. (3) is applied to discard negative-valued tokens (those that either resemble previous responses or deviate from the intended progression of reasoning) and to ensure a diverse and relevant response.
Negative values in the adjusted probability of Eq. (1) typically arise in two cases:
(1) Repeated tokens: The token’s original probability is high, but its semantic equivalence probability is even higher. This suggests that the token closely resembles previous responses, making it less desirable for diversity.
(2) Irrelevant tokens: The token’s original probability is not high, indicating that it likely represents a step deviating from the intended reasoning process.
In both cases, clipping negative values to zero eliminates undesirable tokens while preserving diverse and contextually relevant outputs. Alternative normalization methods, such as Softmax, redistribute probability mass across all tokens, including those that should ideally have very low or zero probability. This distorts the original probability distribution, allowing irrelevant tokens to persist and ultimately degrading model performance.
Consider an intermediate step in a math problem with the original probability distribution $\pi_\theta$ = {"Let": 0.6, "Simplify": 0.2, "Consider": 0.2, other tokens: 0}. Given the previous response "Let $x$ be 4.", the semantic equivalence distribution becomes $\pi_{\mathrm{eq}}$ = {"Let": 0.5, "Set": 0.2, "Consider": 0.3, other tokens: 0}. With $\alpha_j = 1$, the adjusted distribution is {"Let": 0.1, "Simplify": 0.2, "Consider": -0.1, "Set": -0.2, other tokens: 0}. Applying the proposed normalization yields {"Let": 0.33, "Simplify": 0.67, every other token: 0}, ensuring that only high-relevance tokens are preserved. Applying Softmax instead results in {"Let": 0.00011, "Simplify": 0.00012, "Consider": 0.00009, "Set": 0.00008, other tokens: 0.0001}: probability mass is spread thinly across all tokens, diluting the relevance of the original distribution and harming model performance. Thus, our proposed normalization method effectively filters out irrelevant tokens while preserving diversity and relevance, ensuring a more accurate and controlled reasoning process.
State prediction refinement with diversity. Encouraging the world model to generate diverse predictions increases the likelihood of overcoming self-biases and discovering a more accurate future state. To ensure robustness, we select the top prediction from the diverse options generated. To achieve this, we apply a similar strategy to enhance diversity for state prediction, that is,
$$\tilde{M}_{\mathrm{wm}}\left(s^{k}_{j} \mid c, s^{k}_{<j}\right) = \mathrm{Norm}\left(M_{\mathrm{wm}}\left(s^{k}_{j} \mid c, s^{k}_{<j}\right) - \alpha_j \, M_{\mathrm{eq}}\left(s^{k}_{j} \mid c, s^{<k}, s^{k}_{<j}\right)\right), \qquad (6)$$

where $s^{k}$ denotes the $k$-th response and $s^{k}_{<j}$ denotes the tokens preceding the $j$-th token $s^{k}_{j}$. Once the state is generated, the corresponding graph is extracted, ensuring that the model maintains a consistent representation of entailment relationships throughout the reasoning process.
Contextual reframing strategy. In addition to Diversity-based Modelling (DM), we employ a contextual reframing strategy to further enhance generation diversity. This approach randomly reinterprets the current state to create an alternative context at each step. For example, given the original state $s_t$, we generate an alternative state $\tilde{s}_t$ sampled from the semantic equivalence distribution. The corresponding graph is then regenerated from $\tilde{s}_t$ to maintain consistency in the reasoning process. Our experiments demonstrate that this strategy also significantly enhances output diversity, improving the robustness and performance of the model on reasoning tasks.
Comparison with related work. Compared to existing approaches (Vijayakumar et al., 2016; Hu et al., 2023), Diversity-based Modeling (DM) offers several advantages: (1) End-to-end learning: DM operates as a post-training strategy for LMs, enabling seamless integration and scalability to large datasets. (2) Leveraging pre-trained knowledge: It utilizes the extensive world knowledge embedded in pre-trained LMs, enhancing generation diversity without requiring additional domain-specific training. Diverse beam search (Vijayakumar et al., 2016) performs beam search in groups using a diversity-augmented objective, but it has several limitations: (1) Computational complexity: Beam search at the token level becomes computationally intractable for long-form reasoning. (2) Hyperparameter sensitivity: Identifying optimal strategies and hyperparameters for similarity calculation is time-consuming. (3) Reliability issues in specialized tasks: In reasoning tasks involving special tokens (e.g., math reasoning or first-order logic reasoning), embedding-based similarity calculations may be unreliable. GFlowNets fine-tuning (Hu et al., 2023) is a diversity-seeking reinforcement learning algorithm based on amortized Bayesian inference. Although it demonstrates better performance compared to SFT with limited training data, it is unclear whether it can scale to large-scale datasets and complex reasoning tasks. As a reinforcement learning method, GFlowNets fine-tuning can be significantly more challenging and costly to train when dealing with large-scale datasets.
Contrastive ranking for candidate selection. During inference, given a set of candidate options, we apply points-based ranking to select the top $b$ candidates, denoted as $\mathrm{dis}(\mathrm{model}, \mathrm{input}, b)$. Specifically, the ranking procedure includes: (1) Comparison group formation: we consider all possible option combinations for a fixed comparison group size. To manage computational complexity, we randomly sample comparison groups, ensuring that the total number of comparisons remains linear in the number of candidates. (2) Point assignment: within each comparison group, the discriminator selects the best option, which receives 1 point, while the remaining options receive 0 points. (3) Robustness enhancement: to reduce positional bias, we randomly reorder group members before evaluation. (4) Final ranking: after all comparisons, candidates are ranked based on their total points. Additionally, before invoking the discriminator, we apply symbolic verification to discard invalid options; options that fail symbolic verification are excluded from ranking.
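A minimal Python sketch of this points-based ranking procedure follows; `pick_best` and `verify` are hypothetical stand-ins for the fine-tuned discriminator and the symbolic verifier.

```python
import random
from itertools import combinations

def contrastive_rank(candidates, pick_best, verify, group_size=3, top_b=3):
    """Points-based contrastive ranking (sketch of the procedure above)."""
    valid = [c for c in candidates if verify(c)]          # symbolic pre-filter
    if len(valid) <= top_b:
        return valid
    points = {i: 0 for i in range(len(valid))}

    # Sample comparison groups so that the number of comparisons stays
    # roughly linear in the number of candidates.
    all_groups = list(combinations(range(len(valid)), min(group_size, len(valid))))
    sampled = random.sample(all_groups, min(len(all_groups), 2 * len(valid)))

    for group in sampled:
        group = list(group)
        random.shuffle(group)                             # reduce positional bias
        best_local = pick_best([valid[i] for i in group]) # index within the group
        points[group[best_local]] += 1                    # 1 point for the winner

    ranked = sorted(points, key=points.get, reverse=True)
    return [valid[i] for i in ranked[:top_b]]
```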
Meta knowledge construction. The meta knowledge , which aids in answer verification and error identification, is derived from training questions. We employ GPT-4o with in-context learning to extract meta knowledge from training data, focusing on common pitfalls and errors associated with the same question type.
Formally, the meta knowledge is constructed as

$$K = \left\{ k_{(i)} \;\middle|\; i \in \mathcal{I}_{\mathrm{top}\text{-}m} \right\}, \qquad (7)$$

where $k_{(i)}$ represents the stored knowledge from the $i$-th training sample and $\mathcal{I}_{\mathrm{top}\text{-}m}$ indexes the top-$m$ training samples, selected based on the cosine similarity between each training query embedding $e_i$ and the test query embedding $e_q$:

$$\mathrm{sim}(e_i, e_q) = \frac{e_i \cdot e_q}{\lVert e_i \rVert \, \lVert e_q \rVert}. \qquad (8)$$
In practice, when the retrieved meta-knowledge is large, we reduce the context length by employing an LM-based compressor, which condenses it into a shorter sequence of embeddings while preserving the essential information.
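A minimal sketch of this retrieval step, assuming question embeddings have already been computed (e.g., with a DPR encoder, as described in Appendix D); the function and argument names are illustrative.

```python
import numpy as np

def retrieve_meta_knowledge(test_embedding, train_embeddings, train_knowledge, top_m=5):
    """Select the top-m stored knowledge entries by cosine similarity (sketch of Eqs. 7-8).

    test_embedding   : (d,) embedding of the test query
    train_embeddings : (N, d) embeddings of the training queries
    train_knowledge  : list of N meta-knowledge strings (common pitfalls/errors)
    """
    # Cosine similarity between the test query and every training query.
    norms = np.linalg.norm(train_embeddings, axis=1) * np.linalg.norm(test_embedding)
    sims = train_embeddings @ test_embedding / np.maximum(norms, 1e-12)
    top_idx = np.argsort(-sims)[:top_m]
    return [train_knowledge[i] for i in top_idx]
```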
Symbolic verification. To further enhance discrimination, we introduce symbolic verification for generated entailment graphs . The verification process consists of the following key steps: (1) Syntax verification: Ensures that nodes and edges adhere to the correct format. (2) Node dependency analysis: Examines the dependencies between nodes (given conditions, assumptions, facts, or inferences derived from prior nodes). (3) Cycle detection: Verifies that the graph remains acyclic, preventing logical contradictions. (4) Redundancy check: Detects redundant or disconnected nodes. Each step is implemented using standard graph algorithms, ensuring efficient and reliable verification.
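The sketch below illustrates checks (1)-(4) on the graph representation from Section 2.2 using a simple depth-first search; it is an illustrative stand-in rather than the paper's implementation.

```python
def verify_entailment_graph(nodes, edges):
    """Lightweight checks on an entailment graph (sketch).

    nodes : dict mapping node_id -> statement text
    edges : list of (premise_ids, conclusion_id) pairs
    Returns a list of detected problems (an empty list means the graph passes).
    """
    problems = []

    # (1) Syntax / (2) dependency: every edge must reference existing nodes.
    for premises, conclusion in edges:
        missing = [p for p in list(premises) + [conclusion] if p not in nodes]
        if missing:
            problems.append(f"edge references unknown nodes: {missing}")

    # (3) Cycle detection via DFS over the induced premise -> conclusion relation.
    succ = {n: set() for n in nodes}
    for premises, conclusion in edges:
        for p in premises:
            if p in succ and conclusion in nodes:
                succ[p].add(conclusion)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def has_cycle(u):
        color[u] = GRAY
        for v in succ[u]:
            if color[v] == GRAY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False

    if any(color[n] == WHITE and has_cycle(n) for n in nodes):
        problems.append("graph contains a cycle")

    # (4) Redundancy: nodes that appear in no edge are flagged as disconnected.
    used = {x for premises, c in edges for x in list(premises) + [c]}
    for n in nodes:
        if n not in used:
            problems.append(f"disconnected node: {n}")

    return problems
```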
Appendix B Dataset Overview
In this section, we provide statistics and examples for all benchmark datasets used in our study. We consider GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021) for math reasoning, FOLIO (Han et al., 2022), ReClor (Yu et al., 2020) for logical reasoning, and HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021) for coding. For GSM8K, there are 7,473 training samples and 1,319 test samples. For MATH, there are 7,500 training samples and 5,000 test samples. MATH500 is a subset of 500 representative test samples extracted by Lightman et al. (2023), with the remaining test samples added to the training set. For FOLIO, the training and validation sets consist of 1,001 and 203 samples, respectively. For ReClor, we use 4,638 training samples, 500 validation samples (used as test set, as the original test set answers are not publicly available), and 1,000 test samples. HumanEval contains 164 test samples, and since it lacks a training set, we use the entire MBPP dataset (after format transformation) for training. MBPP consists of 374 training samples, 90 validation samples, and 500 test samples. This dataset selection ensures a comprehensive evaluation across diverse reasoning tasks.
Appendix C Prompts for Data Generation
In this section, we present all the prompts used in our data generation process. These prompts include those for plan generation, action generation, state generation, final answer generation, semantic equivalence data generation, semantic equivalence evaluation, meta knowledge generation, and contrastive process supervision for plan, action, and state generation.
Appendix D Implementation Details
Training data creation. For each dataset, we use multiple models (GPT-4o (Achiam et al., 2023), DeepSeek (Liu et al., 2024a, b), LLaMA3 (Dubey et al., 2024)) to generate diverse trajectories for the training and validation sets. Before labeling, we filter out trajectories that fail symbolic verification. We then label trajectories as positive or negative based on their final answers. To improve model stability, we augment training samples using GPT-4o. Given the positive and negative trajectories of the same question, we automatically generate process annotations (as discussed in Section 3.3) using DeepSeek. Additionally, we observed an order bias in contrastive ranking data, where the model tends to prefer the first option when presented with similar choices. To address this, we apply pre-processing and post-processing techniques: (1) Pre-processing: We randomly shuffle the correct option’s position before training. Only responses that select the correct option are accepted. (2) Post-processing: During inference, we randomly reorder options to mitigate bias and improve robustness. These strategies ensure more stable training and fairer ranking evaluations.
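As a small illustration of the shuffling used in both pre- and post-processing (names are illustrative):

```python
import random

def shuffle_options(options, correct_idx):
    """Randomly reorder candidate options and track where the correct one lands.

    Used both when building contrastive-ranking training examples (so the correct
    option is not systematically first) and at inference time before querying
    the discriminator.
    """
    order = list(range(len(options)))
    random.shuffle(order)
    shuffled = [options[i] for i in order]
    new_correct_idx = order.index(correct_idx)
    return shuffled, new_correct_idx
```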
Model training overview. SWAP is fine-tuned from a base model using LoRA (Hu et al., 2021). The primary objective is to train the LM to understand the format and structure of entailment graphs. We choose LoRA over full fine-tuning due to the limited availability of training data. To ensure scalability and generalization, we fine-tune a single generator, which is then repurposed to serve as the policy model, world model, and controller. For each dataset, the generator is fine-tuned on all positive trajectories from the training set. As illustrated in Figure 2, the generator contains two LoRAs: the original LoRA is fine-tuned on positive trajectories as usual, while the SemEquiv-LoRA is fine-tuned on semantic equivalence data for plans, actions, and states, bootstrapped using GPT-4o. Specifically, the total numbers of trajectories used for generator fine-tuning are: GSM8K (28.3k), MATH500 (49.3k), FOLIO (7.3k), ReClor (14.5k), HumanEval (6.0k), and MBPP (3.4k). For each positive trajectory, we randomly sampled several steps and generated two alternatives for each step as semantically equivalent pairs. The total numbers of semantically equivalent pairs obtained are: GSM8K (8.1k), MATH500 (24.2k), FOLIO (3.8k), ReClor (7.1k), HumanEval (2.8k), and MBPP (2.0k).
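For concreteness, here is a sketch of how a generator with two LoRA adapters could be loaded and switched using Hugging Face `peft`; the adapter paths are placeholders and the exact training setup in the paper may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Original LoRA: fine-tuned on positive trajectories (policy / world model / controller).
model = PeftModel.from_pretrained(base, "path/to/original-lora", adapter_name="original")
# SemEquiv-LoRA: fine-tuned on semantically equivalent pairs for the DM penalty term.
model.load_adapter("path/to/semequiv-lora", adapter_name="semequiv")

# Switch adapters depending on which distribution is needed at decoding time.
model.set_adapter("original")   # scores tokens under the original distribution (Eq. 1)
# ... score candidate tokens with the original adapter ...
model.set_adapter("semequiv")   # scores tokens under the semantic-equivalence distribution
# ... score the same tokens with the equivalence adapter ...
```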
As a critical component of our framework, the discriminator is fine-tuned from a base model using contrastive ranking data for each dataset. Specifically, the total numbers of contrastive ranking pairs used for training are: GSM8K (48.0k), MATH500 (100.2k), FOLIO (14.1k), ReClor (28.7k), HumanEval (10.4k), and MBPP (8.2k). We explore two approaches for discriminator fine-tuning: full fine-tuning and LoRA. We find that with sufficient training data, full fine-tuning provides better performance, yet a discriminator fine-tuned with LoRA still yields significant benefits in our framework at a lower computational cost.
Meta knowledge retrieval. We use a DPR model (Karpukhin et al., 2020) to generate embeddings for both training questions and the test query. We then compute cosine similarity to select the top 5 matches during inference. Once the relevant meta-knowledge is extracted, we explore two approaches: (1) Using the original text directly. (2) Compressing the text into a shorter version. Our experiments show that Approach 1 yields higher accuracy, while Approach 2 offers slightly lower accuracy but faster inference speed.
Future state depth determination. The depth of the future state (mentioned in Section 3.3) is determined experimentally. Our findings indicate that: For plan ranking, the terminal state is the most effective. For action ranking, the immediate next state is sufficient, while using deeper future states leads to a decline in accuracy. After analyzing error cases, we attribute this decline to new errors introduced by subsequent actions, which negatively impact evaluation.
Table 4: Fine-grained results on MATH500 by problem type (accuracy %) with LLaMA3-8B-Instruct. ALG: Algebra; CP: Counting & Probability; GEO: Geometry; IA: Intermediate Algebra; NT: Number Theory; PRE: Precalculus; PALG: Prealgebra. # denotes the number of test problems per category.

MATH500 | | | | | | | |
---|---|---|---|---|---|---|---|---|
ALG | CP | GEO | IA | NT | PRE | PALG | Total | |
Method | (# 124) | (# 38) | (# 41) | (# 97) | (# 62) | (# 56) | (# 82) | (# 500) |
Zero-shot CoT | 46.8 | 21.6 | 20.0 | 7.4 | 17.7 | 16.4 | 45.1 | 27.8 |
Few-shot CoT (4-shot) | 41.9 | 15.3 | 22.4 | 12.8 | 24.5 | 11.1 | 34.4 | 25.8 |
SFT-CoT | 42.9 | 16.7 | 21.7 | 12.8 | 24.5 | 12.7 | 38.8 | 27.0 |
Self-consistency (@maj32) | 45.4 | 25.7 | 25.4 | 15.2 | 31.4 | 9.1 | 45.4 | 30.6 |
SWAP (w/o planning) | 46.8 | 23.0 | 24.6 | 15.9 | 31.8 | 15.8 | 47.8 | 32.0 |
SWAP | 52.3 | 49.3 | 35.6 | 22.1 | 43.4 | 25.9 | 66.8 | 43.2 |
Table 5: Fine-grained results on MATH500 by difficulty level (accuracy %) with LLaMA3-8B-Instruct, where Level 1 (L1) is the easiest and Level 5 (L5) the hardest. # denotes the number of test problems per level.

MATH500 | | | | | |
---|---|---|---|---|---|---|
L1 | L2 | L3 | L4 | L5 | Total | |
Method | (# 43) | (# 90) | (# 105) | (# 128) | (# 134) | (# 500) |
Zero-shot CoT | 74.4 | 44.9 | 35.4 | 18.0 | 4.6 | 27.8 |
Few-shot CoT (4-shot) | 51.2 | 48.9 | 29.5 | 15.3 | 9.3 | 25.8 |
SFT-CoT | 69.9 | 48.0 | 30.9 | 14.5 | 7.9 | 27.0 |
Self-consistency (@maj32) | 67.9 | 50.7 | 34.5 | 24.5 | 7.7 | 30.6 |
SWAP (w/o planning) | 75.6 | 52.4 | 36.3 | 20.1 | 12.4 | 32.0 |
SWAP | 83.0 | 68.5 | 46.6 | 36.2 | 17.2 | 43.2 |
Advanced training strategies. To ensure effective training of the discriminator, we employ specialized strategies, including curriculum learning and self-improving training for supervised fine-tuning. We apply curriculum learning primarily to the MATH500 dataset: in round 1 we train on Level 1 problems; in round 2 on Level 1 and Level 2 problems; and so on, until all five levels are included. In each round, we train the model until convergence, using early stopping to prevent overfitting. To further refine the discriminator's accuracy, we implement self-improving training: we run the trained system on the training samples and collect errors made by the generator, fine-tune the discriminator on these errors while keeping the generator fixed, and repeat the process until convergence. We also employ DPO (Rafailov et al., 2024) to enhance the discriminator: given a prompt, the discriminator generates multiple responses; we select response pairs that rank different options as the best; and GPT-4o serves as an expert, providing preference labels for these pairs, which are then used as DPO training data. Note that we keep the generator's reasoning capability fixed after conventional supervised fine-tuning, as our goal is to highlight the power of planning within our framework. However, higher overall system performance could be achieved by revisiting and updating the generator as well.
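A minimal sketch of the curriculum schedule described above; the training utilities are hypothetical stand-ins.

```python
def curriculum_train(discriminator, data_by_level, train_until_convergence, max_level=5):
    """Train in rounds, adding one difficulty level per round (sketch).

    data_by_level           : dict mapping level (1..5) to its training examples
    train_until_convergence : callable that fine-tunes on a dataset with early stopping
    """
    for round_idx in range(1, max_level + 1):
        # Round k trains on all problems up to and including Level k.
        current_data = [ex for lvl in range(1, round_idx + 1) for ex in data_by_level[lvl]]
        discriminator = train_until_convergence(discriminator, current_data)
    return discriminator
```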
Appendix E Fine-grained Results
To gain a comprehensive understanding of the model's strengths and weaknesses, we provide fine-grained results on MATH500 (Tables 4 and 5). We choose MATH500 for this analysis since it categorizes the test set by both problem type and difficulty level, facilitating a more detailed evaluation of model performance across different dimensions. From Table 4, we observe that SWAP consistently outperforms the other methods across all subsets and overall, demonstrating that SWAP significantly enhances overall mathematical reasoning capability compared to the baselines. The inclusion of the planning mechanism enables more accurate reasoning and selection, improving performance across different subsets. Similarly, in Table 5, SWAP achieves the best performance across all difficulty levels, particularly excelling on the most challenging Level 5 problems. The planning mechanism contributes to improved accuracy on high-difficulty problems, demonstrating its effectiveness in enhancing reasoning capabilities. As difficulty increases, all methods show a significant decline in performance, particularly at Levels 4 and 5, indicating the increased complexity of reasoning required for these problems. Overall, SWAP consistently outperforms the baselines, especially on higher-difficulty problems, highlighting its advantage in handling complex reasoning tasks.
Appendix F Output Examples of SWAP
In this section, we provide example outputs generated using LLaMA3-8B-Instruct with SWAP for all benchmarks used in our paper, including GSM8K, MATH500, FOLIO, ReClor, HumanEval, and MBPP.