
Seer: Facilitating Structured Reasoning and Explanation via Reinforcement Learning

Guoxin Chen†§, Kexin Tang\ast, Chao Yang†🖂, Fuying Ye, Yu Qiao, Yiming Qian‡🖂
†Shanghai Artificial Intelligence Laboratory
§Institute of Computing Technology, Chinese Academy of Sciences
\astShanghai Jiao Tong University
‡Agency for Science, Technology and Research (A*STAR)
chenguoxin22s@ict.ac.cn, {tkx94china, fuyingye.work}@gmail.com,
{yangchao,qiaoyu}@pjlab.org.cn, qiany@ihpc.a-star.edu.sg
Abstract

Elucidating the reasoning process with structured explanations from question to answer is crucial, as it significantly enhances the interpretability, traceability, and trustworthiness of question-answering (QA) systems. However, structured explanations require models to perform intricately structured reasoning, which poses great challenges. Most existing methods focus on single-step reasoning through supervised learning, ignoring logical dependencies between steps. Moreover, existing reinforcement learning (RL) based methods overlook the structured relationships, underutilizing the potential of RL in structured reasoning. In this paper, we propose Seer, a novel method that maximizes a structure-based return to facilitate structured reasoning and explanation. Our proposed structure-based return precisely describes the hierarchical and branching structure inherent in structured reasoning, effectively capturing the intricate relationships between different reasoning steps. In addition, we introduce a fine-grained reward function to meticulously delineate diverse reasoning steps. Extensive experiments show that Seer significantly outperforms state-of-the-art methods, achieving an absolute improvement of 6.9% over RL-based methods on EntailmentBank and a 4.4% average improvement on the STREET benchmark, while exhibiting outstanding efficiency and cross-dataset generalization performance. Our code is available at https://github.com/Chen-GX/SEER.

1 Introduction

Navigating machines to understand and articulate the thought process from posing a question to arriving at an answer has been a long-term pursuit in the AI community (McCarthy, 1959; Yu et al., 2023). Current explainable QA systems adeptly furnish brief supporting evidence (Rajani et al., 2019; DeYoung et al., 2020), but they often fail to clarify the reasoning process from prior knowledge to the derived answer. By elucidating how answers are derived by language models, we can greatly improve interpretability, trustworthiness, and debuggability (Dalvi et al., 2021; Ribeiro et al., 2023). As illustrated in Figure 1, when generating an answer for the question "Which natural material is best for making a table?", the reasoning process with structured explanations, such as entailment trees (Dalvi et al., 2021) or reasoning graphs (Ribeiro et al., 2023), explains why "sturdy wood" is the best answer.

Figure 1: An example of a structured explanation. Given a hypothesis $h$ (a declarative sentence derived from a question-answer pair) and a set of facts (or a corpus), the goal is to generate a structured explanation, which delineates the reasoning process from the facts to the hypothesis.
Method Training Emphasis Runtime Return
RLET multi-step reasoning 9.34s chained
FAME single-step reasoning 30.77s /
Ours structured reasoning 3.91s structured
Table 1: Comparative analysis of different methods: the RL-based method RLET (Liu et al., 2022), the supervised method FAME (Hong et al., 2023), and our approach.

Deriving such complex structured explanations poses a great challenge. Previous methods (Dalvi et al., 2021; Tafjord et al., 2021) treat structured explanations as linearized sequences and generate the entire reasoning process in one go. However, these methods lack controllability and may hallucinate unreliable reasoning steps. To address these concerns, recent studies (Hong et al., 2022; Neves Ribeiro et al., 2022; Yang et al., 2022) decompose structured explanations and focus on single-step reasoning via supervised learning. Nevertheless, this kind of approach may not always yield optimal results, as it fails to consider the interdependencies between different steps. FAME (Hong et al., 2023) attempts to compensate for these shortcomings by leveraging Monte-Carlo planning (Kocsis and Szepesvári, 2006), which significantly increases the running time and inadvertently explores numerous ineffective steps (as shown in Table 1). Furthermore, FAME still concentrates on isolated single-step reasoning, which lacks support for structured reasoning. As a general framework for solving sequential decision-making problems, reinforcement learning (RL) is employed in RLET (Liu et al., 2022) to enhance multi-step reasoning. However, RLET defines the return (a.k.a. cumulative reward) over a standard chain structure, and thus lacks the ability to represent the tree (Dalvi et al., 2021) or graph (Ribeiro et al., 2023) logical structures inherent in structured reasoning. As a result, the potential of RL for structured reasoning is not fully exploited.

To address the above issues, we propose Seer, a novel method that facilitates Structured rEasoning and Explanation via Reinforcement learning. In structured reasoning, we observe that the logical dependencies between different steps no longer follow a chained trajectory but instead adhere to the inherent tree or graph structure. Therefore, we propose the structure-based return to precisely describe a tree or graph logical structure, effectively capturing the complex interdependencies between different steps. Additionally, we refine the reward function to meticulously delineate diverse reasoning steps, specifically targeting redundant ones that do not contribute to the final structured explanations. Through experiments in Sec. 5.4, we find that redundant steps represent the exploration in the environment, and appropriate penalization contributes to improved reasoning performance.

Our contributions are summarized as follows:

\bullet We propose Seer, a novel RL-based method that facilitates structured reasoning and explanation. To our knowledge, Seer is the first general framework that accommodates scenarios of chained, tree-based, and graph-based structured reasoning.

\bullet We propose the structure-based return to address the intricate interdependencies among different reasoning steps, effectively stimulating the potential of RL in structured reasoning.

\bullet We conduct extensive experiments to demonstrate the superiority of Seer over state-of-the-art methods. Our method facilitates the effectiveness and efficiency of structured reasoning and exhibits outstanding cross-dataset generalization performance.

2 Related Work

2.1 Explanation for Question Answering

Extensive research has delved into various forms of interpretability in QA systems (Thayaparan et al., 2020; Wiegreffe and Marasovic, 2021; Lamm et al., 2021; Chen et al., 2023). Unlike free-form texts, which are susceptible to hallucinations (Rajani et al., 2019; Wei et al., 2022), or rationales that only provide supporting evidence (DeYoung et al., 2020; Valentino et al., 2021), structured explanations, such as entailment trees (Dalvi et al., 2021) and reasoning graphs (Ribeiro et al., 2023), offer a novel way to generate explanations. These structured methods use tree or graph formats to clearly outline what information is used and how it is combined to reach the answer. Despite their remarkable interpretability, such intricately structured reasoning also poses significant challenges (Yu et al., 2023; Xu et al., 2023).

2.2 Natural Language Reasoning

Natural language reasoning, a process that integrates multiple pieces of knowledge to derive new conclusions, has attracted significant attention (Saha et al., 2020; Tafjord et al., 2021; Sanyal et al., 2022; Chen et al., 2024). Among these tasks, entailment trees and reasoning graphs, which involve structured reasoning and reasoning path generation, present considerable challenges (Yu et al., 2023). Dalvi et al. (2021) attempt to transform structured reasoning into a linearized sequence to fit generative models, which may produce hallucinations and invalid reasoning. To alleviate this issue, recent studies (Neves Ribeiro et al., 2022; Hong et al., 2022; Hong et al., 2023) perform premise selection and reasoning in a step-by-step manner. Nevertheless, these methods decompose structured reasoning and rely solely on isolated single-step supervision to train models. This kind of approach neglects the interdependencies between different steps and may not always yield optimal results. Therefore, in light of the advancements of RL in various reasoning tasks (Poesia et al., 2021; Le et al., 2022), RLET (Liu et al., 2022) attempts to incorporate RL into entailment tree generation. However, it has to enumerate all potential actions, which is impractical in real-world scenarios. Furthermore, RLET still defines returns over chained trajectories to facilitate multi-step reasoning, which is not suitable for tree- or graph-based structured reasoning. In contrast, our Seer adapts to chained, tree-based, and graph-based structured reasoning via the structure-based return, which significantly enhances both reasoning performance and efficiency.

3 Method

3.1 Task Formulation

As illustrated in Figure 1, the input of the task comprises a set of facts $X=\{x_{1},x_{2},\ldots,x_{n}\}$ and a hypothesis $h$. The output of the task is the reasoning steps in a structured form, such as an entailment tree $T$ or a reasoning graph.[1] The entailment tree $T$ consists of tree-structured reasoning, whose leaf nodes are selected from the relevant facts ($x_{*}$) and whose intermediate nodes represent the derived intermediate conclusions ($i_{*}$). We represent the annotated ground-truth entailment tree as $T_{\text{gold}}$, with its leaf nodes signifying $X_{\text{gold}}$.

[1] Although the reasoning graph (Ribeiro et al., 2023) is a more general structure, to be consistent with the majority of previous work, we use the entailment tree (Dalvi et al., 2021) as an example to formalize the task and illustrate our method. Our proposed method is also applicable to the task described in the form of a reasoning graph.

Figure 2: Overall framework of Seer. For trajectory rollout, action generation (Policy) and conclusion generation (Entailment) are performed alternately. The orange area details the reasoning process from $s_{t}$ to $s_{t+1}$. For policy optimization, the reward module assigns rewards and updates the policy and critic based on tree or graph structures.

3.2 Overview

We model the structured reasoning as a reinforcement learning (RL) task, the goal of which is to learn the optimal reasoning policy. Figure 2 illustrates the overall framework of Seer, which mainly includes trajectory rollout and policy optimization. For trajectory rollout, we generate trajectories based on the current policy, and each trajectory is produced iteratively until the stopping criteria are satisfied (Appendix C.1). For policy optimization, we assign rewards to the collected trajectories and update both the policy and critic using the structure-based return. Algorithm 1 (Appendix A) outlines our proposed method for further reference.

3.3 Fine-grained Component of Seer

State

At reasoning step $t$, we define the state $s_{t}=\{h,P_{t},C_{t}\}$ as a combination of the hypothesis $h$, the existing reasoning steps $P_{t}$, and the candidate sentences $C_{t}$. $P_{t}$ contains the reasoning steps so far, and $C_{t}$ is the set of sentences that can be selected as premises. Each sentence in $C_{t}$ is either an unused fact or an intermediate conclusion in $I_{t}$ generated by previous steps, i.e., $C_{t}=\{X\cup I_{t}\backslash U_{t}\}$, where $U_{t}$ is the set of used sentences. For the initial state, $s_{1}=\{h,P_{1}=\varnothing,C_{1}=X\}$.

Action

Given the state $s_{t}$, we consider two types of actions $a_{t}\in\mathcal{A}(s_{t})$: (1) "Reason: <premises>": the entailment module is invoked to generate a new intermediate conclusion $i_{t}$ based on the given <premises>. Here, <premises> are selected from $C_{t}$. Then, the state is updated as follows: $P_{t+1}=P_{t}\cup\{\text{<premises>}\rightarrow i_{t}\}$, $U_{t+1}=U_{t}\cup\{\text{<premises>}\}$, and $I_{t+1}=I_{t}\cup\{i_{t}\}$. (2) "End": this action signifies the end of the reasoning process and returns the trajectory $\tau$.
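
To make this transition concrete, below is a minimal Python sketch (our own illustration, not the authors' released code) of the state $s_{t}=\{h,P_{t},C_{t}\}$ together with the used-set $U_{t}$ and its update under a "Reason: <premises>" action; the State fields and function names are assumptions.

from dataclasses import dataclass, field

# Minimal sketch of the state s_t = {h, P_t, C_t} plus the used-set U_t, and its
# update when a "Reason: <premises>" action is taken. Field and function names
# are illustrative assumptions, not the authors' implementation.
@dataclass
class State:
    hypothesis: str                                   # h
    steps: list = field(default_factory=list)         # P_t: reasoning steps so far
    candidates: set = field(default_factory=set)      # C_t: selectable sentences
    used: set = field(default_factory=set)            # U_t: already-used sentences

def apply_reason(state: State, premises: list, conclusion: str) -> State:
    steps = state.steps + [(tuple(premises), conclusion)]   # P_{t+1} = P_t ∪ {premises -> i_t}
    used = state.used | set(premises)                       # U_{t+1} = U_t ∪ premises
    candidates = (state.candidates | {conclusion}) - used   # C_{t+1} = (X ∪ I_{t+1}) \ U_{t+1}
    return State(state.hypothesis, steps, candidates, used)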

Policy

The action type "Reason: <premises>" induces a large action space, since the premises can be any combination of sentences from the candidate set $C$. To enumerate the probabilities of all potential actions and then sample an action to execute, previous studies (Liu et al., 2022; Hong et al., 2022) limit combinations to pairwise premises, so that the action space is reduced to $\binom{n}{2}$, where $n$ is the size of the set $C$. However, such a simplification incurs potential drawbacks. First, as the number of candidate sentences increases, the number of potential actions still grows quadratically, which renders these methods impractical for complex reasoning tasks with limited computational resources. Second, by restricting combinations to pairs only, the interdependencies among multiple premises are ignored, which may limit the effectiveness and richness of the derived conclusions.

To address this issue, we adopt a generative model to represent the policy $\pi$, which can directly sample from the action space $\mathcal{A}(s_{t})$. Using the generative model essentially expands the action space so that the combinations of premises can be arbitrary. This enables the policy to extensively explore better actions during RL training, not limited to paired premises. Further, to speed up RL training, we first generate the top-$k$ actions using the policy $\pi$:

$a_{t}^{1},a_{t}^{2},\ldots,a_{t}^{k}\sim\pi(a|s_{t}),\quad a\in\mathcal{A}(s_{t}),$   (1)

where the input is the linearized state $s_{t}$ (i.e., the concatenation of $h$, $P_{t}$, and $C_{t}$). Then, we proceed with re-normalization to form an appropriate probability distribution over the top-$k$ actions, and sample from it to select the action $a_{t}$ to be performed in the current reasoning step, that is,

$\pi^{\prime}(a_{t}^{i}|s_{t})=\frac{\pi(a_{t}^{i}|s_{t})}{\sum_{j=1}^{k}\pi(a_{t}^{j}|s_{t})},\quad i=1,\ldots,k,$   (2)

$a_{t}\sim\pi^{\prime}(a|s_{t}),\quad a\in\{a_{t}^{1},a_{t}^{2},\ldots,a_{t}^{k}\}.$   (3)
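
As a concrete illustration of Equations 1-3, the following Python sketch uses the HuggingFace generation API to obtain top-$k$ candidate actions from a T5-based policy, re-normalize their beam scores with a softmax, and sample one action. The model name, decoding settings, and the use of sequences_scores as (length-normalized) sequence log-probabilities are assumptions for exposition, not the authors' exact implementation.

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-large")
policy = T5ForConditionalGeneration.from_pretrained("t5-large")

def sample_action(state_text: str, k: int = 5):
    """state_text is the linearized state (hypothesis; steps so far; candidate sentences)."""
    inputs = tokenizer(state_text, return_tensors="pt", truncation=True)
    out = policy.generate(
        **inputs,
        num_beams=k,
        num_return_sequences=k,        # top-k candidate actions (Eq. 1)
        max_new_tokens=64,
        output_scores=True,
        return_dict_in_generate=True,
    )
    actions = tokenizer.batch_decode(out.sequences, skip_special_tokens=True)
    # out.sequences_scores holds the (length-normalized) log-probability of each beam;
    # a softmax over them re-normalizes the distribution over the top-k actions (Eq. 2).
    probs = torch.softmax(out.sequences_scores, dim=0)
    idx = torch.multinomial(probs, num_samples=1).item()   # sample a_t (Eq. 3)
    return actions[idx], probs[idx]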

Entailment Module

If the action $a_{t}$ is "Reason: <premises>", we invoke the entailment module to derive the intermediate conclusion and obtain the next state. The entailment module is also a generative model, with its input being <premises>. Following Hong et al. (2022) and Liu et al. (2022), we fine-tune the entailment model in a supervised manner and freeze its parameters during the reinforcement learning process, as shown in Figure 2.

Reward

To evaluate the correctness of an entailment tree, Dalvi et al. (2021) propose an alignment algorithm based on Jaccard similarity that aligns each intermediate node of the predicted tree $T_{\text{pred}}$ with $T_{\text{gold}}$. However, unlike fully supervised methods, we observe that during the RL process the policy explores different actions to identify the optimal reasoning process, inevitably attempting some redundant steps that do not contribute to reaching the final hypothesis. Existing RL-based work (Liu et al., 2022) simply treats redundant steps with the same penalty as erroneous steps. This simplification may harm the learning process by discouraging necessary exploration of the action space. Furthermore, it lacks the detailed feedback needed to guide the policy toward the optimum, as it fails to differentiate between innocuous actions (redundant steps) and incorrect actions (erroneous steps).

To this end, we propose a fine-grained reward function that assigns different reward values to correct steps, erroneous steps, and redundant steps, as shown in Equation 4. For a trajectory $\tau$, we assume that the last intermediate conclusion is the predicted hypothesis, since the policy deems that it should End there. We then backtrack to construct the predicted entailment tree $T_{\text{pred}}$ (see Appendix C.6 for more details). Note that some steps might not participate in $T_{\text{pred}}$; these are regarded as redundant steps. Then, as illustrated in Figure 5, we consider steps that perfectly match via the alignment algorithm (Dalvi et al., 2021) as correct steps and regard all others as erroneous steps.

$r_{t}=\begin{cases}1,&\text{if perfectly matched},\\ -0.5,&\text{if }i_{t}\notin T_{\text{pred}},\\ -1,&\text{otherwise}.\end{cases}$   (4)
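
A minimal Python sketch of Equation 4, assuming the alignment against $T_{\text{gold}}$ (Appendix C.2) and the backtracking that builds $T_{\text{pred}}$ (Appendix C.6) have already been computed upstream; the argument names are our own:

def step_reward(perfect_match: bool, in_pred_tree: bool) -> float:
    # Fine-grained reward of Eq. (4), distinguishing correct, redundant,
    # and erroneous steps.
    if perfect_match:        # the step aligns exactly with a gold step
        return 1.0
    if not in_pred_tree:     # redundant step: explored but not part of T_pred
        return -0.5
    return -1.0              # erroneous step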

Critic

To enhance training stability, we introduce the critic to estimate the state-value function $V(s_{t})$. The input of $V(s_{t})$ is the linearized state, and its output is a scalar representing the return (i.e., cumulative reward) when starting from state $s_{t}$. In the simplest case, the return is the chained sum of the rewards. Accordingly, one-step temporal difference (TD) (Sutton, 1988) is often used to estimate $V(s_{t})$, which is updated by the TD-target:

$G_{t}=r_{t}+\gamma V(s_{t+1}),$   (5)

where $\gamma$ is the discount factor. However, in structured reasoning, reasoning steps typically adhere to inherent tree (Dalvi et al., 2021) or graph (Ribeiro et al., 2023) structures, with the chained structure being merely a special case. Thus, Equation 5 only describes chained multi-step reasoning and may not effectively capture the intricate logical dependencies between steps in structured reasoning.

Therefore, we propose the structure-based return, where the TD-target is expressed in a more general formulation:

$\hat{G}_{t}=r_{t}+\gamma\frac{1}{|\mathcal{P}(s_{t})|}\sum_{s_{j}\in\mathcal{P}(s_{t})}V(s_{j}),$   (6)

where $\mathcal{P}(s_{t})$ denotes the set of parent nodes of state $s_{t}$ in the entailment tree $T_{\text{pred}}$ or the reasoning graph. When $s_{t}\notin T_{\text{pred}}$, we set $\mathcal{P}(s_{t})=\{s_{t+1}\}$. Our structure-based return (Equation 6) thus adapts to structured reasoning in chained, tree-based, and graph-based scenarios. In particular, the entailment tree is a special case of the reasoning graph in which each state typically has only one parent node, so Equation 6 degenerates to $\hat{G}_{t}=r_{t}+\gamma V(\mathcal{P}(s_{t}))$. Furthermore, as shown in Figure 6 (Appendix E), for the equivalent trajectories $s_{1}\rightarrow s_{2}\rightarrow s_{3}\rightarrow s_{4}$ and $s_{2}\rightarrow s_{1}\rightarrow s_{3}\rightarrow s_{4}$, the previous method (Liu et al., 2022) would assign different returns to states $s_{1}$ and $s_{2}$, even though they yield the same tree in the end. Conversely, by precisely delineating the intricate interdependencies between reasoning steps, our method consistently allocates the same return to any equivalent trajectories, thereby enhancing both stability and effectiveness.
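
The following Python sketch shows one way to compute the structure-based TD-targets of Equation 6 for a collected trajectory. The parents list (indices of parent states in $T_{\text{pred}}$ or the reasoning graph, falling back to the next state for redundant steps) and the treatment of the terminal step are our assumptions.

def structure_based_targets(rewards, values, parents, gamma=0.95):
    """rewards[t]: r_t;  values[t]: critic estimate V(s_t);
    parents[t]: list of indices of the parent state(s) of s_t (Eq. 6)."""
    targets = []
    for t, r_t in enumerate(rewards):
        pa = parents[t]
        if pa:                                                   # average over parent values
            bootstrap = sum(values[j] for j in pa) / len(pa)
        else:                                                    # terminal step: nothing to bootstrap from (assumption)
            bootstrap = 0.0
        targets.append(r_t + gamma * bootstrap)                  # G_hat_t = r_t + gamma * mean_j V(s_j)
    return targets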

3.4 Optimization

Our objective is to enhance the structured reasoning capabilities of the policy through RL. To alleviate issues of training instability and sample inefficiency in RL (Zhou et al., 2023; Roit et al., 2023), we employ the proximal policy optimization (PPO) algorithm (Schulman et al., 2017) to train the policy $\pi$ (parameterized by $\theta$), as follows:

$\mathcal{L}_{\pi}=\mathbb{E}_{t}\Big[\min\Big(\frac{\pi^{\prime}_{\theta}(a_{t}|s_{t})}{\pi^{\prime}_{\theta_{old}}(a_{t}|s_{t})}\hat{A}_{t},\;\text{clip}\Big(\frac{\pi^{\prime}_{\theta}(a_{t}|s_{t})}{\pi^{\prime}_{\theta_{old}}(a_{t}|s_{t})},1-\epsilon,1+\epsilon\Big)\hat{A}_{t}\Big)+\beta\,\mathcal{E}(\pi^{\prime}_{\theta})\Big],$   (7)

where $\pi^{\prime}$ represents the probabilities normalized by Equation 2, $\theta$ and $\theta_{old}$ are the parameters of the new and old policies, $\epsilon$ is a hyperparameter defining the clipping range, $\beta$ is the entropy exploration coefficient, and $\mathcal{E}$ is the entropy bonus, which encourages sufficient exploration:

$\mathcal{E}(\pi^{\prime}_{\theta})=\mathbb{E}_{a_{t}\sim\pi_{\theta}}[-\log\pi^{\prime}_{\theta}(a_{t}|s_{t})].$   (8)

Furthermore, $\hat{A}_{t}$ is the estimate of the advantage function for state $s_{t}$, defined as follows:

$\hat{A}_{t}=\hat{G}_{t}-V(s_{t}).$   (9)

To accurately estimate the return and guide the policy toward better updates, we train the critic by minimizing the difference between its prediction and the TD-target:

$\mathcal{L}_{V}=\mathbb{E}_{t}\left[(V(s_{t})-\hat{G}_{t})^{2}\right].$   (10)
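
As a compact illustration, the sketch below wires Equations 7-10 together with PyTorch: a clipped surrogate objective with an entropy bonus for the policy and a squared TD-error for the critic. Tensor shapes, detaching choices, and variable names are assumptions for exposition, not the authors' exact code.

import torch

def ppo_losses(new_logp, old_logp, values, td_targets, eps=0.2, beta=1e-4):
    """new_logp/old_logp: log pi'(a_t|s_t) under the new/old policy;
    values: critic outputs V(s_t); td_targets: structure-based G_hat_t."""
    advantages = (td_targets - values).detach()                  # Eq. (9), A_hat_t
    ratio = torch.exp(new_logp - old_logp.detach())              # pi'_theta / pi'_theta_old
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
    entropy = -new_logp.mean()                                   # sample estimate of Eq. (8)
    policy_loss = -(surrogate.mean() + beta * entropy)           # negate to maximize Eq. (7)
    critic_loss = ((values - td_targets.detach()) ** 2).mean()   # Eq. (10)
    return policy_loss, critic_loss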

Supervised Warm-up

Incorporating a supervised warm-up strategy before RL offers a relatively stable initial policy, which facilitates faster adaptation to the environment, particularly for complex reasoning tasks (Ramamurthy et al., 2023; Wu et al., 2023). Therefore, we convert the structured reasoning data into single-step supervised data to warm up the policy as follows:

$\mathcal{L}_{\text{warmup}}=-\sum_{i}\log p(y_{i}|s_{t},y_{<i}),$   (11)

where $y$ is the gold action at state $s_{t}$.
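
A brief sketch of the warm-up objective (Equation 11) as standard teacher-forced cross-entropy over (linearized state, gold action) pairs with a HuggingFace seq2seq policy; the data handling and function signature are illustrative assumptions.

def warmup_step(policy, tokenizer, state_text, gold_action, optimizer):
    # One supervised warm-up update: the seq2seq loss is exactly
    # -sum_i log p(y_i | s_t, y_<i) from Eq. (11).
    inputs = tokenizer(state_text, return_tensors="pt", truncation=True)
    labels = tokenizer(gold_action, return_tensors="pt").input_ids
    loss = policy(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()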

4 Experiments

4.1 Datasets

Tree-structured reasoning

We conduct experiments on EntailmentBank (Dalvi et al., 2021), the first dataset supporting structured explanations with entailment trees. Following Hong et al. (2023), we also conduct experiments on EntailmentBankQA (Tafjord et al., 2022), whose objective is to answer the question based on the entailment tree.

Graph-structured reasoning

We conduct experiments on the STREET benchmark (Ribeiro et al., 2023) to assess the performance of graph-structured reasoning. Please refer to Appendix B for more details about the dataset statistics.

4.2 Baselines

For EntailmentBank, we compare with single-pass methods, such as EntailmentWriter (Dalvi et al., 2021), and step-by-step methods including METGEN (Hong et al., 2022), IRGR (Neves Ribeiro et al., 2022), RLET (Liu et al., 2022), NLProofs (Yang et al., 2022) and FAME (Hong et al., 2023). For EntailmentBankQA, we compare with Selection-Inference (SI) (Creswell and Shanahan, 2022) and FAME (Hong et al., 2023). For the STREET benchmark, we compare with the method proposed in (Ribeiro et al., 2023). Furthermore, we conduct comparisons with GPT-4 (OpenAI, 2023) equipped with Chain-of-Thought (CoT) (Wei et al., 2022), Tree of Thought (ToT) (Yao et al., 2023a) and ReAct (Yao et al., 2023b).

4.3 Implementation Details

For a fair comparison,[2] the policy is built upon a T5-large model (Raffel et al., 2020), while the critic is the encoder of T5-large combined with an MLP ($\tanh$ as the activation function). For the supervised warm-up, we set a learning rate of 1e-5 and a batch size of 16, and train the model for 20 epochs. For RL training, we set the learning rate to 2e-6 for both policy and critic, the discount factor $\gamma$ to 0.95, the batch size to 3, the buffer size to 12, the buffer training epochs $N_{K}$ to 2, $\epsilon$ to 0.2, and $\beta$ to 1e-4. More implementation details can be found in Appendix C.

[2] Previous studies have consistently utilized T5-large as the base model. Despite the existence of more advanced generative models (Du et al., 2022; Touvron et al., 2023), using T5-large enables us to maintain a fair comparison.

4.4 Evaluation Metrics

For EntailmentBank, we evaluate $T_{\text{pred}}$ along the following dimensions: Leaves, Steps, Intermediates, and Overall AllCorrect. For the STREET benchmark, we evaluate the reasoning graphs along two dimensions: Answer Accuracy and Reasoning Graph Accuracy. Note that Overall AllCorrect and Reasoning Graph Accuracy are extremely strict metrics, where any deviation results in a score of 0. More details on the metrics can be found in Appendix D.

Task Method | Leaves: F1, AllCorrect | Steps: F1, AllCorrect | Intermediates: F1, AllCorrect | Overall: AllCorrect
Task1 EntailmentWriter 98.7 84.1 50.0 38.5 67.6 35.9 34.4
METGEN 100.0 100.0 57.9 42.1 71.3 39.2 37.0
IRGR 97.6 89.4 50.2 36.8 62.1 31.8 32.4
RLET 100.0 100.0 54.6 40.7 66.9 36.3 34.8
NLProofS 97.8 90.1 55.6 42.3 72.4 40.6 38.9
Seer (Ours) 100.0 100.0 67.6 52.6 70.3 42.6 40.6
Task2 EntailmentWriter 83.2 35.0 39.5 24.7 62.2 28.2 23.2
METGEN 83.7 48.6 41.7 30.4 62.7 32.7 28.0
IRGR 69.9 23.8 30.5 22.3 47.7 26.5 21.8
RLET 81.0 39.0 38.5 28.4 56.3 28.6 25.7
NLProofS 90.3 58.8 47.2 34.4 70.2 37.8 33.3
Seer (Ours) 86.4 53.5 56.8 39.7 66.3 38.3 34.7
Task3 EntailmentWriter 35.7 2.9 6.1 2.4 33.4 7.7 2.4
METGEN 34.8 8.7 9.8 8.6 36.7 20.4 8.6
IRGR 45.6 11.8 16.1 11.4 38.8 20.9 11.5
RLET 38.3 9.1 11.5 7.1 34.2 12.1 6.9
NLProofS 43.2 8.2 11.2 6.9 42.9 17.3 6.9
FAME 43.4 13.8 16.6 12.4 40.6 19.9 11.9
GPT4-CoT 44.1 12.1 15.4 10.8 43.1 20.6 10.8
GPT4-ToT 43.3 12.0 15.8 11.0 43.9 20.0 11.0
GPT4-ReAct 45.8 12.9 14.1 10.5 43.5 21.5 10.5
Seer (Ours) 47.1 13.8 17.4 12.9 45.1 18.8 12.9
Table 2: Experiment results on EntailmentBank. Bold and underlined texts highlight the best method and the runner-up. RLET is based on DeBERTa-large (He et al., 2023), while all other methods are based on T5-large. All baseline results come from published papers. We use the gpt-4-1106-preview version for GPT-4.

5 Result Analysis

5.1 Structured Reasoning

EntailmentBank

As shown in Table 2, our Seer outperforms all baseline methods on the strictest metric, "Overall AllCorrect", across all three tasks. Specifically, our method achieves an absolute improvement of 1.7%/1.4%/1.0% in Task 1/2/3 over the strongest baseline. The Steps dimension, i.e., premise selection, is the core of EntailmentBank,[3] as it contributes to the accuracy of both the Leaves and Intermediates dimensions and thereby improves the Overall AllCorrect metric. (1) Compared to SOTA supervised methods such as NLProofS and FAME, our method exhibits significant advantages in the Steps dimension. This demonstrates that focusing solely on isolated single-step reasoning through supervised learning may yield suboptimal solutions in intricate structured reasoning tasks, even when employing advanced planning algorithms such as the Monte-Carlo planning in FAME. (2) Compared to the SOTA RL-based method, our method outperforms RLET by 5.8%/9.0%/6.0% in Task 1/2/3. Our method employs a generative model as the policy to circumvent the need to enumerate actions, facilitating the policy's understanding of structured reasoning tasks (it generates potential actions by itself). Moreover, our proposed structure-based return more effectively captures the tree-structured logical dependencies between steps and assigns stable returns to equivalent trajectories, which significantly improves reasoning ability. The subsequent ablation studies further demonstrate this. (3) Compared to GPT-4 with CoT, ToT, and ReAct, our method achieves an absolute improvement of 1.9% in Task 3. Although GPT-4 exhibits outstanding reasoning capabilities, surpassing many other baselines, its performance relies on a vast number of parameters. Details about the GPT-4 prompts can be found in Appendix F.

[3] A comprehensive error analysis is detailed in Appendix G.

Method Task 1 Task 2
SI+Halter 72.4 55.9
SI+Halter+Search 83.2 72.9
FAME 91.5 78.2
Seer (Ours) 92.7 85.6
Table 3: Experiment results on the EntailmentBankQA. SI is based on Chinchilla-7B (Hoffmann et al., 2022).

EntailmentBankQA

Following Creswell and Shanahan (2022), we introduce a halter module to generate answers based on $T_{\text{pred}}$ and substitute the hypothesis with the question and options during the reasoning process. As illustrated in Table 3, our method surpasses FAME by an absolute margin of 1.2%/7.4% in Task 1/2. While both FAME and SI are supervised methods, FAME significantly outperforms SI by enhancing the model's reasoning and exploration capabilities through Monte-Carlo planning. In contrast, our method enhances the structured reasoning capabilities of the policy rather than focusing solely on single-step reasoning, which significantly improves the quality of the entailment trees used for answering, especially in complex reasoning environments.

Method SCONE GSM8K AQUA-RAT AR-LSAT
Answer Accuracy
STREET 69.6 10.4 28.7 28.0
GPT4 † 66.0 94.0 78.0 32.0
Seer (Ours) 72.4 21.4 37.6 33.5
Reasoning Graph Accuracy
STREET 60.0 0.7 0.0 0.0
GPT4 † 32.0 10.0 4.0 2.0
Seer (Ours) 64.8 13.4 8.1 7.2
Table 4: Experiment results on STREET benchmark. † indicates we recorded the best results in CoT, ToT, and ReAct for brevity.

STREET

As shown in Table 4, compared to GPT-4, our method achieves absolute improvements of 4.8%/3.4%/4.1%/5.2% across the datasets, even though Reasoning Graph Accuracy is a very strict metric (Ribeiro et al., 2023). While GPT-4 excels at answering questions (far surpassing other methods), its parameter count is thousands of times greater than that of the other methods. Moreover, during the reasoning process, GPT-4 is prone to hallucinations (Rawte et al., 2023), resulting in poor performance in structured reasoning, which is particularly evident in the "Reasoning Graph Accuracy" metric. Since SCONE contains sufficient data as well as similar QA and reasoning patterns, the STREET method outperforms GPT-4 on SCONE. However, by obtaining high-quality reasoning graphs, our method achieves absolute improvements of 2.8%/11.0%/8.9%/5.5% over the STREET method, significantly improving answer accuracy and trustworthiness. In reasoning graphs, a state may have multiple parent nodes; our structure-based return (Equation 6) still precisely describes the cumulative reward for each state, thereby facilitating reasoning performance in graph-structured reasoning.

5.2 Cross-dataset Performance

Method eQASC eOBQA
P@1 NDCG P@1 NDCG
EntailmentWriter 52.48 73.14 69.07 89.05
EntailmentWriter-Iter 52.56 73.28 72.15 90.19
METGEN 55.81 74.19 74.89 90.50
FAME 53.36 79.64 73.09 89.32
GPT-4 54.00 88.82 85.36 91.19
Seer (Ours) 60.33 89.76 77.50 94.62
Table 5: Cross-dataset performance on the eQASC and eOBQA.

To evaluate generalization performance, we conduct cross-dataset experiments on eQASC and eOBQA (Jhamtani and Clark, 2020).[4] We apply the policy trained on Task 2 for selection without any training on eQASC or eOBQA. As illustrated in Table 5, our method exhibits significant superiority in cross-dataset generalization. Compared to supervised methods, our Seer, by following the inherent structural nature of entailment trees, better captures the logical dependencies between reasoning steps, which effectively promotes the generalization ability of the policy. The experimental results further validate the effectiveness of our method.

[4] More details about the setup of eQASC and eOBQA can be found in Appendix B.

5.3 Ablation Studies

Method Leaves Steps Intermediates Overall
Seer (Ours) 13.8 12.9 18.8 12.9
w/o redundant 13.2 12.6 18.5 12.3
w/o structure-based return 12.9 11.7 18.5 11.1
w/o RL 10.2 9.4 17.1 9.1
Table 6: Ablation study of each component.
Figure 3: Parameter sensitivity analysis. (a) $r_{\text{redundant}}$; (b) $\beta$.

To evaluate the contribution of each component, we conduct extensive ablation studies. As shown in Table 6, we investigate three variations of Seer on Task 3 of EntailmentBank: (1) w/o redundant neglects redundant steps by assigning them a reward of -1. (2) w/o structure-based return removes the structure-based return and instead computes the return as the chained sum of rewards (Equation 5). (3) w/o RL removes the RL phase, relying solely on supervised warm-up. We find that overlooking redundant steps may inhibit the policy's exploration, leading to a performance decline. In addition, the results in Table 6 show that removing the structure-based return severely degrades performance. The structure-based return not only addresses the equivalent-trajectory problem but also elegantly captures the logical relationships inherent in entailment trees, which is crucial for structured reasoning. Furthermore, removing the RL phase reduces Overall AllCorrect by 3.8%, a significant drop for this strict metric. This indicates that relying solely on supervised learning may overlook the logical relationships in structured reasoning, thereby falling into suboptimal solutions.

5.4 Parameter Sensitivity Analysis

As illustrated in Figure 3, we further investigate the impact of $r_{\text{redundant}}$ and $\beta$ on performance in Task 3. We observe that, compared to treating redundant and erroneous steps equally ($r_{\text{redundant}}=-1$), not penalizing redundant steps at all ($r_{\text{redundant}}=0$) is even more detrimental, as it allows unrestricted exploration. Moreover, a suitable $\beta$ (the coefficient of the entropy bonus) is crucial for performance, as it encourages the policy to break away from the "stereotypes" of supervised warm-up.

6 Conclusions

We propose Seer, a novel approach that facilitates structured reasoning and explanation via RL. To our knowledge, Seer is the first general framework capable of enhancing chained, tree-based, and graph-based structured reasoning. Our structure-based return precisely delineates the hierarchical and branching structure inherent in structured reasoning, effectively facilitating reasoning ability. Furthermore, Seer employs a generative model to represent the policy and refines the reward function, ingeniously circumventing the limitations of existing works. Comprehensive experimental results demonstrate that Seer significantly outperforms state-of-the-art methods and exhibits outstanding cross-dataset generalization performance.

Limitations

Although our method has achieved excellent performance in structured reasoning and explanation, there remains one issue that deserves further exploration for future work: how to perform structured reasoning in the context of multimodal data. This includes combining content from images, tables, or audio data, a form of multimodal structured reasoning that is increasingly prevalent and demanding in real-world scenarios. In future work, we plan to extend our Seer to accommodate multimodal scenarios.

Ethics Statement

This work focuses primarily on structured reasoning and explanation problems, and its contributions are entirely methodological. Therefore, this work does not have direct negative social impacts. For the experiments, we have open-sourced the code and utilized openly available datasets commonly used in previous research, without any sensitive information to our knowledge. The authors of this work adhere to the ACL ethical guidelines, and the application of this work does not present any apparent issues that may lead to ethical risks.

Acknowledgements

This work is supported by the National Key R&D Program of China (NO.2022ZD0160102). Chao Yang is supported by the Shanghai Post-doctoral Excellent Program (Grant No. 2022234).

References

Appendix A Algorithm Details

Input: Structured reasoning dataset $\mathcal{D}$; training epochs $N_{\text{warmup}}$, $N$, and $N_{K}$; batch sizes $b_{\text{warmup}}$ and $b_{\text{mini}}$
Output: The optimal parameters of the policy
/* (1) Supervised warm-up phase */
1  Initialize policy parameters $\pi_{\theta}$
2  Convert $\mathcal{D}$ into single-step data $\mathcal{D}_{\text{step}}$
3  for epoch = 1 to $N_{\text{warmup}}$ do
4      for i = 1 to $|\mathcal{D}_{\text{step}}|/b_{\text{warmup}}$ do
5          sample a minibatch from $\mathcal{D}_{\text{step}}$
6          update parameters $\pi_{\theta}$ by Eq. 11
/* (2) RL phase */
7  Initialize the critic parameters $V$
8  for epoch = 1 to $N$ do
9      Initialize training buffer $\mathcal{B}\leftarrow\varnothing$
       // Fill the replay buffer
10     while $\mathcal{B}$ not full do
11         sample $\{h,X,T_{\text{gold}}\}$ from $\mathcal{D}$
12         collect trajectory $\tau$ via $\pi_{\theta}$
13         assign a reward to each step in $\tau$ and fill buffer $\mathcal{B}$ with $\{s_{t},a_{t},r_{t}\}$ from $\tau$
       // Perform $N_{K}$ epochs of updates per buffer
14     for epoch_k = 1 to $N_{K}$ do
15         sample $\{(s_{t},a_{t},r_{t})\}_{b_{\text{mini}}}$ from $\mathcal{B}$
16         compute $\mathcal{E}(\pi^{\prime}_{\theta})$ and $\hat{A}_{t}$ by Eqs. 8 and 9
17         update policy $\pi_{\theta}$ by Eq. 7
18         update critic $V$ by Eq. 10

Algorithm 1: The training process of Seer

Algorithm 1 describes the overall training process of our proposed Seer in detail, which consists of two phases: supervised warm-up and reinforcement learning (RL). In the supervised warm-up phase, structured reasoning is first decomposed into single-step reasoning data (Line 2). We then employ supervised learning to help the policy $\pi_{\theta}$ quickly adapt to the complex reasoning environment (Lines 3-6). This is particularly beneficial when the number of parameters in the policy is relatively small (Akyurek et al., 2023; Liu et al., 2023). In the RL phase, we first populate the replay buffer $\mathcal{B}$ using the policy $\pi_{\theta}$ (Lines 9-13). We then update the parameters of the policy and the critic using the buffer data. To improve sample efficiency, $N_{K}$ updates are performed for each replay buffer (Lines 14-18).

For inference, we only need the policy (without the critic) for structured reasoning. Specifically, as illustrated in the trajectory rollout of Figure 2, we update the state using the policy and the entailment module, and end the reasoning process once the stopping criteria are satisfied. Finally, we backtrack to construct the entire structured explanation, taking the last intermediate conclusion as the hypothesis for the entailment tree (or the answer for the STREET benchmark).

Appendix B Datasets Details

Task Name Task Domain # Questions # Reasoning Steps Reasoning Type Answer Type
EntailmentBank Science 1,840 5,881 Tree-structured /
EntailmentBankQA (ARC) Science 1,840 5,881 Tree-structured 4-Way MC
SCONE Processes 14,574 130,482 Graph-structured State Pred.
GSM8K Math 1,030 4,666 Graph-structured Number
AQUA-RAT Math 1,152 7,179 Graph-structured 5-Way MC
AR-LSAT Logic 500 2,885 Graph-structured 5-Way MC
Table 7: Datasets Statistics of Structured Reasoning.

Datasets of Structured Reasoning

Table 7 describes the statistics of the datasets in detail. In the answer types, "K-Way MC" stands for a multiple-choice answer with K options.

EntailmentBank (Dalvi et al., 2021) comprises 1,840 expert-annotated entailment trees with an average of 7.6 nodes spanning 3.2 entailment steps. The facts are derived from the WorldTree V2 corpus (Xie et al., 2020). Based on the different fact sets $X$, there are three progressively more challenging tasks: Task 1 (no-distractor), Task 2 (distractor), and Task 3 (full-corpus). For GPT-4, we employ all the data in Task 3 of EntailmentBank to evaluate its performance. Since EntailmentBank was originally designed for post-hoc tree reconstruction rather than QA, Tafjord et al. (2022) converted it into EntailmentBankQA, where the task is to choose the correct answer from multiple-choice options rather than to derive the hypothesis $h$.

To construct the STREET benchmark, Ribeiro et al. (2023) standardized several QA datasets, such as ARC (Clark et al., 2018), SCONE (Long et al., 2016), GSM8K (Cobbe et al., 2021), AQUA-RAT (Ling et al., 2017), and AR-LSAT (Zhong et al., 2021), into a graph-structured explanation format, where the tasks are converted into answering the question based on the predicted reasoning graph. Please note that ARC in STREET is congruent with Task 1 of EntailmentBankQA (Ribeiro et al., 2023); hence, we do not repeat this experiment in Table 4. Due to the high cost of GPT-4, we randomly sample 50 instances from each dataset in the STREET benchmark to evaluate GPT-4's performance.

Datasets of Cross-dataset Experiments

To evaluate the generalization performance of our method, following Hong et al. (2022), we conduct cross-dataset experiments on eQASC and eOBQA (Jhamtani and Clark, 2020), which collect one-step entailment trees for questions from QASC (Khot et al., 2020) and OpenBookQA (Mihaylov et al., 2018), respectively. The goal of this task is to select valid one-step trees from the candidate set and evaluate the results with P@1 and NDCG metrics (Hong et al., 2022). Questions with no valid tree are filtered. The candidate sets for eQASC and eOBQA are composed of 10 and 3 sentences respectively.

Appendix C Implementation Details

C.1 Stopping criteria

For a fair comparison, we use the T5-large model to represent the policy. However, we observe that T5-large tends to perform "Reason" actions more frequently, which is caused by its smaller number of parameters and the scarcity of "End" instances. Moreover, unlike GPT-4, T5-large is less able to recognize when the hypothesis has been inferred and reasoning should stop. Therefore, we attach two extra stopping criteria in addition to the "End" action: (1) the semantic similarity between the intermediate conclusion and the hypothesis exceeds a predefined threshold, i.e., BLEURT($i_{*}$, $h$) > 1; (2) the maximum number of reasoning steps (set to 20 in this paper) is exceeded.
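
A hedged sketch of how these stopping criteria might be combined in code; the similarity callable stands in for the BLEURT scorer, and the constant names are our own:

MAX_STEPS = 20           # maximum number of reasoning steps (Appendix C.1)
BLEURT_THRESHOLD = 1.0   # predefined similarity threshold

def should_stop(action: str, conclusion: str, hypothesis: str,
                step: int, similarity) -> bool:
    # Stop on an explicit "End" action, when the latest conclusion is close
    # enough to the hypothesis, or when the step budget is exhausted.
    if action.strip() == "End":
        return True
    if conclusion and similarity(conclusion, hypothesis) > BLEURT_THRESHOLD:
        return True
    return step >= MAX_STEPS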

C.2 Alignment algorithm

Following Dalvi et al. (2021), we evaluate the intermediate steps based on Jaccard similarity. Specifically, each intermediate node $i_{*}$ in $T_{\text{pred}}$ is aligned with the intermediate node in $T_{\text{gold}}$ that has the maximum Jaccard similarity. If the Jaccard similarity between an intermediate node in $T_{\text{pred}}$ and all intermediate nodes in $T_{\text{gold}}$ is zero, it is aligned with "NULL". Note that only an intermediate node that perfectly matches a node in $T_{\text{gold}}$, i.e., with a Jaccard similarity of 1, is considered a correct step. Figure 5 provides a detailed illustration of this process. The alignment process is similar for reasoning graphs (Ribeiro et al., 2023).
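
A simplified sketch of this alignment rule, representing each intermediate node by the set of leaf sentences it covers (an assumption about the representation; the full algorithm follows Dalvi et al., 2021):

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def align_nodes(pred_nodes: dict, gold_nodes: dict):
    """pred_nodes / gold_nodes map node ids to the sets of leaf sentences they use."""
    alignment = {}
    for p_id, p_leaves in pred_nodes.items():
        best_id, best_sim = "NULL", 0.0
        for g_id, g_leaves in gold_nodes.items():
            sim = jaccard(p_leaves, g_leaves)
            if sim > best_sim:
                best_id, best_sim = g_id, sim
        alignment[p_id] = (best_id, best_sim)   # perfect match iff best_sim == 1
    return alignment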

C.3 Retriever for Task 3

In Task 3 of EntailmentBank, relevant sentences must first be retrieved from the corpus (Dalvi et al., 2021). The research focus of this paper is to enhance the structured reasoning ability of the policy; therefore, we directly adopt the retriever and its associated parameters proposed in previous work (Hong et al., 2023), which is based on the all-mpnet-base-v2 model (Reimers and Gurevych, 2019). For a fair comparison, we retrieve the top 25 most relevant sentences as $X$ for Task 3.

C.4 Entailment Module

The entailment module is also based on the T5-large model, taking premises as input and generating intermediate conclusions. Our primary focus is to enhance the structured reasoning ability of the policy through RL; therefore, we directly employ the entailment module already trained in previous work (Hong et al., 2023), which also aids in a fair comparison.

C.5 Halter Module

In EntailmentBankQA, we employ the Halter module (Creswell and Shanahan, 2022) to answer questions based on the predicted entailment trees. In this paper, the Halter module is built upon the T5-large model. The module is trained with a learning rate of 1e-5 and a batch size of 16.

C.6 Entailment Tree Construction

To evaluate the correctness of each reasoning step, we reconstruct the trajectory into an entailment tree $T_{\text{pred}}$ and compare it with $T_{\text{gold}}$. Figure 5 illustrates this reconstruction process. We consider the last intermediate conclusion as the hypothesis and then construct the predicted entailment tree based on the reasoning relationship of each step. The reconstruction process is similar for reasoning graphs (Ribeiro et al., 2023).

C.7 Running time

In our experimental setting, the average training time per entailment tree for Seer is 6.98 seconds, and the average inference time per entailment tree is 3.91 seconds. As reported in their papers, the inference times per entailment tree for RLET (Liu et al., 2022) and FAME (Hong et al., 2023) are 9.34 seconds and 30.77 seconds, respectively. FAME leverages Monte-Carlo planning, which necessitates exploring numerous nodes to enhance the reasoning capability of the policy and thus requires considerable computational time. Our proposed Seer significantly surpasses FAME in terms of both efficiency and effectiveness.

C.8 Experiment Environments

All experiments were conducted on Ubuntu 22.04 with NVIDIA A100 GPUs. Our code mainly depends on Python 3.10 (https://www.python.org/) and PyTorch 2.0.1 (https://pytorch.org/). The pre-trained language models are obtained from HuggingFace Transformers (https://huggingface.co/).

C.9 Details of Reasoning Graphs

For the reasoning graphs in the STREET benchmark, the implementation details differ slightly from those for entailment trees. In a reasoning graph, a reasoning step may have multiple parent nodes, and a fact ($x_{*}$) or intermediate conclusion ($i_{*}$) may be used multiple times (Ribeiro et al., 2023). Therefore, in the reasoning graph setting, we do not add previously used premises to $U_{t}$; instead, we continually expand the candidate sentence set $C_{t}$ with newly derived intermediate conclusions. In other words, the state in reasoning graphs is updated according to the following rules: $P_{t+1}=P_{t}\cup\{\text{<premises>}\rightarrow i_{t}\}$, $C_{t+1}=\{X\cup I_{t+1}\}$, and $I_{t+1}=I_{t}\cup\{i_{t}\}$.
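
Reusing the State sketch from Section 3.3, the graph-specific update can be illustrated as follows (again an assumption for exposition, not the released implementation):

def apply_reason_graph(state: "State", premises: list, conclusion: str) -> "State":
    # Unlike the entailment-tree setting, used premises stay selectable,
    # so the candidate set only grows with new intermediate conclusions.
    steps = state.steps + [(tuple(premises), conclusion)]   # P_{t+1}
    candidates = state.candidates | {conclusion}            # C_{t+1} = X ∪ I_{t+1}
    return State(state.hypothesis, steps, candidates, state.used)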

C.10 Other Implementation Details

For GPT-4, we set the temperature to 0.7. For Tree of Thought, we generate $b=5$ candidates at each step and then vote to select the optimal action. Details regarding the prompts of CoT, ToT, and ReAct can be found in Appendix F. For all baselines, we obtain the optimal results based on the experimental results or hyperparameter settings reported in the original papers. For our method, we initialize the critic with the encoder of the warmed-up policy to expedite the convergence of the critic and facilitate policy updates. The hidden-layer dimension of the MLP in the critic is set to 512.

Appendix D Metrics Details

For EntailmentBank, we follow Dalvi et al. (2021) and evaluate the entailment tree $T_{\text{pred}}$ along three dimensions:

\bullet Leaves: To evaluate the leaf nodes of $T_{\text{pred}}$, we compute F1 by comparing $X_{\text{pred}}$ with $X_{\text{gold}}$.

\bullet Steps: To evaluate the structural correctness of each step, we compare all steps between $T_{\text{pred}}$ and $T_{\text{gold}}$ and then compute F1. A predicted step is considered structurally correct if its premises (e.g., $x_{*}$, $i_{*}$) exactly match the gold premises.

\bullet Intermediates: To evaluate the intermediate conclusions, we compare the aligned intermediate conclusions and then compute F1. A predicted intermediate conclusion is deemed correct if its BLEURT score (Sellam et al., 2020) exceeds 0.28.

For each dimension, the AllCorrect score is 1 if the F1 is 1, and 0 otherwise. Given the above scores, we employ the Overall AllCorrect metric to comprehensively evaluate $T_{\text{pred}}$, which takes the value 1 if and only if all leaves, steps, and intermediates are correct. Note that this is an extremely strict metric, where any deviation in $T_{\text{pred}}$ results in a score of 0. A small sketch of how these scores combine follows.
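
The sketch below shows how the per-dimension F1 scores and the strict Overall AllCorrect metric combine; the alignment and BLEURT steps are assumed to be handled upstream, and the helper names are our own:

def f1(pred: set, gold: set) -> float:
    # Set-based F1 between predicted and gold elements (leaves or steps).
    if not pred or not gold:
        return 1.0 if pred == gold else 0.0
    p = len(pred & gold) / len(pred)
    r = len(pred & gold) / len(gold)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def overall_all_correct(leaves_f1: float, steps_f1: float, intermediates_f1: float) -> int:
    # AllCorrect per dimension is 1 iff its F1 equals 1; Overall requires all three.
    return int(leaves_f1 == 1.0 and steps_f1 == 1.0 and intermediates_f1 == 1.0)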

For the STREET benchmark, we follow Ribeiro et al. (2023) and adopt two metrics, namely the correctness of the answer and the quality of the reasoning graph, to evaluate the different methods.

\bullet Answer Accuracy: This metric measures the ability to correctly answer questions. The answer accuracy serves as an upper bound for other metrics, as any reasoning graph generated with incorrect answers is also labeled as incorrect.

\bullet Reasoning Graph Accuracy: This metric compares the predicted reasoning graph and the golden reasoning graph from the aspects of the graph structure and the content of intermediate conclusions. Please note that this is a stringent metric, with minor deviations from the golden reasoning graph resulting in the prediction being incorrect.

Appendix E Illustrations and Case Study of Seer

Given a hypothesis $h$ and initial facts $X$, we first obtain a trajectory through the reasoning process, as shown in Figure 4. The state update follows a Markov decision process (Bellman, 1957), meaning the current state depends only on the previous state. Figure 4 shows an erroneous reasoning example to better illustrate the following steps. We then convert the trajectory $\tau$ into an entailment tree $T_{\text{pred}}$ and align it with $T_{\text{gold}}$ to assign a reward to each intermediate conclusion (as presented in Figure 5). Furthermore, Figure 6 elucidates the issue of equivalent trajectories, where previous work cannot accurately describe the logical relationships between different states in entailment trees.

Figure 4: An illustration of the reasoning process of Seer. Note that $a_{1}$ is a correct step, $a_{2}$ and $a_{4}$ are erroneous steps, and $a_{3}$ is a redundant step. We start from the initial state $s_{1}$, where the existing entailment steps $P_{1}=\varnothing$ and the candidate sentences $C_{1}=X$. In each step, we sample an action and update the state until the reasoning is done. For the "Reason" action, we send the premises to the entailment module; the new conclusion is added to $C$, the premises are removed from $C$, and the entailment step is added to $P$. For the "End" action, we end the reasoning process and output the trajectory.
Figure 5: An illustration of the reward and alignment process of Seer. Each reasoning step is a subtree (similarly, each reasoning step is a subgraph in the reasoning graph (Ribeiro et al., 2023)). (1) We construct $T_{\text{pred}}$ using the last intermediate conclusion ($i_{4}$ in this example) as the hypothesis. (2) We calculate the Jaccard similarity between each intermediate node ($i_{*}$) in $T_{\text{pred}}$ and each gold intermediate node in $T_{\text{gold}}$ ($\hat{i}_{1}$ and $h$ in this example), and align it with the node of maximum Jaccard similarity. In this example, $i_{1}$ is aligned with $\hat{i}_{1}$ because $\text{JS}(i_{1},\hat{i}_{1})=1$; $i_{2}$ is aligned with "NULL"; $i_{4}$ is aligned with $\hat{i}_{1}$ because $\text{JS}(i_{4},\hat{i}_{1})=0.5$ and $\text{JS}(i_{4},h)=0.4$. (3) We assign rewards based on the alignment results. Note that $i_{3}$ ($s_{3}$) is a redundant step. $r_{1}=1$, $r_{2}=-1$, $r_{3}=-0.5$, and $r_{4}=-1$. The reward for each state originates from the tree structure rather than the chained trajectory; therefore, the return of each state should also follow the tree structure (or the graph structure in reasoning graphs) rather than the chained trajectory.
Figure 6: An illustration of equivalent trajectories and the definition of return. Since the reasoning steps of $s_{1}$, $s_{2}$, and $s_{3}$ are mutually independent, the execution order among these steps can be arbitrary. Thus, $\tau$ and $\tau^{\prime}$ are equivalent trajectories because they can be converted to the same entailment tree (Dalvi et al., 2021). As shown in the blue box, previous work (Liu et al., 2022) defines the return (a.k.a. cumulative reward) over a chained trajectory and would assign different returns to $s_{1}$ and $s_{2}$ in these equivalent trajectories. In contrast, as shown in the red box, our structure-based return is defined on the tree or graph structure inherent in structured reasoning, the same structure from which the rewards originate. Our structure-based return consistently allocates stable returns to equivalent trajectories, thereby promoting training stability and convergence. Furthermore, maintaining consistency between the sources of rewards and returns significantly enhances the effectiveness of the policy.

Appendix F Prompts for GPT-4

Figure 7: A Chain-of-Thought prompt for GPT-4.
Figure 8: A ReAct prompt for GPT-4. "Thought" and "Action" query GPT-4 in two rounds.
Figure 9: A Tree of Thought prompt (Thought Generator) for GPT-4.
Figure 10: A Tree of Thought prompt (State Evaluator) for GPT-4.

Figures 7 and 8 show the Chain-of-Thought (CoT) (Wei et al., 2022) and ReAct (Yao et al., 2023b) prompts for GPT-4, and Figures 9 and 10 show the prompts of the thought generator and the state evaluator in Tree of Thought (ToT) (Yao et al., 2023a), respectively. We provide a detailed introduction to the task definition and guide the model to respond in the required format. We randomly select three examples for in-context learning. For a fair comparison, we use the same examples for CoT and ReAct, attributing similar thoughts to them. ReAct divides the dialogue into two rounds, "Thought" and "Action", to query GPT-4. For ToT, following Yao et al. (2023a), we generate candidate actions using a thought generator and subsequently select and execute the optimal action through a state evaluator. Due to its exceptional reasoning capability and self-evaluation strategy, ToT achieves superior results compared to CoT and ReAct, as shown in Table 2. However, ToT incurs higher costs than CoT and ReAct.

Appendix G Error Analysis

We conduct a comprehensive error analysis on Task2 and Task3 of EntailmentBank.

G.1 Error Analysis of Task2

We randomly sample 50 entailment trees for which Seer made reasoning errors and identify the following four types of errors.

(1) Reasoning Step Error (62%). This is the main source of errors and predominantly depends on whether the policy selects the correct premises. We observe that a small portion of these errors (accounting for 12.9% of this error type) use all the gold leaves but combine them in the wrong order. Compared to other reasoning step errors, the policy has at least identified the correct premises. For example, the gold steps are "$x_{24}\;\&\;x_{5}\rightarrow i_{1}$; $i_{1}\;\&\;x_{23}\rightarrow h$", whereas the erroneous steps predicted by Seer are "$x_{23}\;\&\;x_{5}\rightarrow i_{1}$; $i_{1}\;\&\;x_{24}\rightarrow i_{2}$".

(2) Early Termination Error (18%). We observe that the reasoning process may terminate prematurely even though the existing entailment steps are all correct. On the one hand, T5-large outputs "End" prematurely, unlike GPT-4, which can accurately judge when to stop. On the other hand, the entailment module might erroneously infer the hypothesis, leading to premature termination.

(3) Intermediate Conclusion Error (10%). This error type differs from the above error (where the entailment module prematurely infers the hypothesis). An intermediate conclusion error is triggered by incorrect entailment in the intermediate conclusions, despite correct leaves and steps. For a fair comparison, we use the entailment module already trained in previous work (Hong et al., 2023). Note that the reasoning part, which is the focus of our paper, is completely correct in this error type, and this type of error can be mitigated by training a better entailment module.

(4) Imperfect Evaluation (10%). We discover that some trees deemed invalid are in fact valid, indicating that current automated metrics underestimate the validity of the trees. The most common reason is that there are multiple valid ways to construct an entailment tree. For example, a gold tree with the structure "$x_{1}\;\&\;x_{2}\;\&\;x_{3}\rightarrow h$" may be predicted as "$x_{1}\;\&\;x_{2}\rightarrow i_{1}$; $i_{1}\;\&\;x_{3}\rightarrow i_{2}$".

G.2 Error Analysis of Task3

Task 3 requires retrieving an initial set of facts $X$ from the corpus. Therefore, in addition to the Task 2 errors described above, Task 3 has its own unique errors.

(1) Missing Gold Leaves Error (58%). This error refers to cases where the gold leaves are not included in the facts $X$ retrieved from the corpus. Such cases inevitably lead to an error in the predicted entailment tree, regardless of how powerful the policy is. The bottleneck for this error lies in the retrieval model. For a fair comparison, we directly use the retrieval model provided in previous work (Hong et al., 2023).

(2) Reasoning Errors (42%). The four error types described in Appendix G.1 account for 42% of the errors in Task 3.

We also find that the reasoning graphs contain error types similar to those found in entailment trees, as both belong to structured reasoning.