Think about it! Improving defeasible reasoning by first modeling the question scenario
Abstract
Defeasible reasoning is the mode of reasoning where conclusions can be overturned by taking into account new evidence. The cognitive science literature on defeasible reasoning suggests that a person forms a “mental model” of the scenario before answering a question about it. We ask whether neural models can similarly benefit from envisioning the question scenario before answering a defeasible query. Our approach is, given a question, to have a model first create a graph of relevant influences, and then effectively leverage that graph as an additional input when answering the question. We find that \ours achieves a new state-of-the-art on three different defeasible reasoning datasets, while also producing justifications for its answers. This result is significant as it illustrates that performance can be improved by guiding a system to “think about” a question and explicitly model the scenario, rather than answering reflexively.
1 Introduction
Defeasible inference is a mode of reasoning where conclusions may be modified given additional information [sep-reasoning-defeasible]. Here we consider the specific formulation and challenge of [rudinger-etal-2020-thinking]: given that some premise \pre plausibly implies a hypothesis \hypo, does new information \upd about the situation weaken or strengthen the conclusion \hypo? For example, given that “The drinking glass fell” weakly implies “The glass broke”, being told “The glass fell on a pillow” weakens the implication.
Figure 1: An example defeasible query (“Given that ‘The drinking glass fell’, does ‘The glass fell on a pillow’ strengthen or weaken the hypothesis ‘The glass broke’?”; answer: weakens). Unlike prior work, \ours also generates an inference graph for the question scenario.
A desirable approach is to jointly consider \pre, \hypo, and \upd, so as to stay on topic and avoid repetition errors. We borrow an idea from the cognitive science literature, where defeasible reasoning in humans is supported by an inference graph \citepPollock2009ARS. Inference graphs draw connections between \pre, \hypo, and \upd through mediating events, and can be seen as a mental model of the question scenario formed before answering the question. This paper asks the natural question of whether modeling the question scenario with inference graphs can also help machines in defeasible reasoning.
Proposed system: Our approach is, given a question, to have a model first create an inference graph, and then use that graph as an additional input when answering the defeasible reasoning query. Our proposed system, \ours, comprises a graph generation module and a graph encoding module that uses the generated graph for the query.
To generate inference graphs, we build on past work that uses a sequence-to-sequence approach [madaan2021improving]. Their model uses data augmentation to create gold-like graphs. However, the resulting graphs can often be erroneous, so they include an error correction module to generate higher-quality inference graphs. This is important because, as we show, better graphs are more helpful in the downstream QA task.
The generated inference graph is used for the QA task on three existing defeasible inference datasets from diverse domains, viz., \snli (natural language inference) [bowman2015large], \social (reasoning about social norms) [forbes2020social], and \atomic (commonsense reasoning) [sap2019atomic]. We show that simply augmenting the question with our generated graphs yields some gains on all datasets. Importantly, we show that a more judicious encoding of the graph-augmented question, one that accounts for interactions between the graph nodes, improves QA accuracy substantially across all datasets. To achieve this, we include mixture-of-experts layers [mixture-of-experts-paper] during encoding, enabling the model to selectively attend to certain nodes while capturing their interactions.
In summary, our contribution is to borrow the inference graph idea from the cognitive science literature and show its benefits on the defeasible inference QA task. Using an error correction module in the graph generation process and a judicious encoding of the graph-augmented question, \ours achieves a new state-of-the-art over three defeasible datasets, while also producing justifications for its answers. This result is significant also because it illustrates that performance can be improved by guiding a system to think about a question. This encourages future studies on other tasks where the question scenario can be modeled first, rather than answered reflexively.
2 Related work
Mental Models
Cognitive science has long promoted mental models, i.e., coherent, constructed representations of the world, as central to understanding, communication, and problem-solving [JohnsonLaird1983MentalM, mental-models, Hilton1996MentalMA]. Our work draws on these ideas, using inference graphs to represent the machine’s “mental model” of the problem at hand. Building the inference graph can be viewed as first asking clarification questions about the context before answering. This is similar to self-talk [selftalk], but directed towards eliciting chains of influence, with the inference graph template acting as a schema for a mental model.
Generating graphs for commonsense reasoning
The use of generated knowledge to aid question answering has recently gained much attention [bosselut2019dynamic, yang2020g, madaan2020eigen, rajagopal2021curie]. comet [Bosselut2019COMETCT] uses gpt \citepradford2018improving fine-tuned over commonsense knowledge graphs like atomic [sap2019atomic] and conceptnet \citepspeer2017conceptnet for KB completion. Similarly, \citetrajagopal2021curie aim to generate event influences grounded in a situation using pre-trained language models. Like these works, we adopt large-scale language models for conditional text generation for commonsense reasoning. However, our method generates complete graphs, as opposed to the short events or phrases generated by these methods.
Information retrieval is another way of collecting the necessary knowledge. The influences captured in inference graphs can also be generated disjointly via this approach, but it is non-trivial to preserve the joint semantics of \pre, \hypo, and \upd. \citetbrahman2020learning generate rationales for defeasible inference, but they do so post hoc, i.e., given the answer they generate a rationale using a model trained on the e-SNLI corpus, which is related to the δ-NLI dataset. They found that their system mostly generates very trivial rationales. We address the more realistic setup of predicting an inference graph jointly rather than post hoc, which that paper suggested as an important future direction.
Our work is closest in spirit to \citetbosselut2019dynamic, who generate a tree that connects a question to its answer choices via an intermediate layer of generated nodes. As opposed to \citetbosselut2019dynamic, we do not perform on-the-fly generation of graph nodes, and instead directly consume a pre-generated graph.
Graph Neural Networks for commonsense QA
Several existing methods use graphs as an additional input for commonsense reasoning [lin2019kagnet, lv2020graph, feng2020scalable]. These methods first retrieve a graph relevant to a question using information retrieval techniques, and then encode the graph using graph representation techniques such as \gcn [kipf2016semi]. Different from these works, we use a generated graph grounded in the query for answering commonsense questions. While we experiment with state-of-the-art \gcn models, our model improves over \gcn-based representations by using a mixture of experts (\moe) [jacobs1991adaptive] to pool multi-faceted input. Prior work has used \moe for graph classification tasks [zhouexplore], and \citetchen2019multi use \moe to mix features from various languages for downstream POS and NER tagging tasks. To the best of our knowledge, we are the first to use \moe to pool graph representations for a downstream \qa task.

3 \ours
3.1 Task: Defeasible Inference
Defeasible inference [rudinger-etal-2020-thinking] is a mode of reasoning in which, given a premise \pre, a hypothesis \hypo may be strengthened or weakened in light of new evidence \upd. For example, given the premise “The drinking glass fell”, the hypothesis “The glass broke” is weakened by the update “The glass fell on the pillow”, and strengthened by the update “The glass fell on the rocks”. We define the input as \x = (\pre, \hypo, \upd) and the output as a label $y \in \{\text{strengthener}, \text{weakener}\}$.
Prior research in psychology [Pollock2009ARS] and NLP [defeasible-human-helps-paper] has found that inference graphs can greatly aid human performance at defeasible inference. Inference graphs capture the mediating events and the context to facilitate critical thinking, acting as a mental model of the scenario posed by the defeasible query. A sample inference graph is shown in Figure 2.
We investigate whether neural models can benefit from envisioning the question scenario with an inference graph before answering a defeasible query.
Inference graphs
As inference graphs are central to our work, we give a brief description of their structure next. Inference graphs were introduced in philosophy by \citetPollock2009ARS to aid defeasible reasoning for humans, and in NLP by \citettandon2019wiqa for a counterfactual reasoning task.
Inference graphs have four kinds of nodes [Pollock2009ARS, tandon2019wiqa, defeasible-human-helps-paper] (an illustrative data-structure sketch follows this list): \squishlist
Contextualizers (C): these nodes set the context around a situation and connect to the \pre in some way.
Situations (S): these nodes are new situations that emerge and might overturn an inference.
Hypothesis (H): hypothesis nodes describe the outcome/conclusion of the situation.
Mediators (M): mediators are nodes that bridge the knowledge gap between a situation node and a hypothesis node by explaining their connection explicitly. These nodes can act as either weakeners or strengtheners. \squishend
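To make the node taxonomy concrete, the following is a minimal, purely illustrative data structure for an inference graph; the field names and serialization are our own assumptions, not the schema used by the graph generator.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str      # one of "C" (contextualizer), "S" (situation), "M" (mediator), "H" (hypothesis)
    polarity: str  # "+" strengthens the hypothesis, "-" weakens it
    text: str

@dataclass
class InferenceGraph:
    nodes: list[Node] = field(default_factory=list)
    edges: list[tuple[int, int]] = field(default_factory=list)  # directed (src, dst) node indices

# Example for the falling-glass query.
graph = InferenceGraph(
    nodes=[
        Node("C", "-", "the floor is soft"),
        Node("S", "-", "the glass fell on a pillow"),
        Node("M", "-", "the pillow absorbs the impact"),
        Node("H", "", "the glass broke"),
    ],
    edges=[(0, 1), (1, 2), (2, 3)],
)
```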
Dataset of inference graphs for defeasible reasoning
Recently, [madaan2021improving] proposed a sequence-to-sequence approach for obtaining higher-quality graphs for each defeasible query. They release two datasets: i) a noisy version of the graphs (\mnoisy), and ii) a cleaner version (\mclean). This gives, for each defeasible query \x, an inference graph \gengraph that provides additional context for the query. We use these graphs in this work.
3.2 Graph augmented defeasible reasoning
Following \citetrudinger-etal-2020-thinking, we first concatenate the components (\pre, \hypo, \upd) of the defeasible query into a single sequence of tokens $\dques = \pre \oplus \hypo \oplus \upd$, where $\oplus$ denotes concatenation. Thus, each sample for our graph-augmented binary-classification task takes the form $((\dques, \gengraph), y)$, where $y \in \{\text{strengthener}, \text{weakener}\}$.
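To make the input format concrete, here is a minimal sketch of how one graph-augmented sample could be assembled; the separator token and label encoding are assumptions for illustration, not the exact preprocessing of \ours.

```python
# Hypothetical sketch of assembling one defeasible sample (P, H, U) -> ((query, graph nodes), label).
# The "</s>" separator mirrors RoBERTa-style sequence-pair formatting; the actual
# delimiters used in preprocessing may differ.
premise = "The drinking glass fell"
hypothesis = "The glass broke"
update = "The glass fell on a pillow"

query = f"{premise} </s> {hypothesis} </s> {update}"
label = 0  # 0 = weakener, 1 = strengthener (assumed label encoding)

# A graph-augmented sample pairs the query with the generated inference graph's node texts.
graph_nodes = [
    "glass is fragile",             # contextualizer (C)
    "the glass fell on a pillow",   # situation (S)
    "the pillow cushions the fall", # mediator (M)
]
sample = ((query, graph_nodes), label)
```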
Outline:
We first use a language model \lm to obtain a dense representation \hq for the defeasible query, and a dense representation \hv for each node $v_i$ of the inference graph \gengraph. The node representations \hv are then pooled using a hierarchical mixture of experts (\moe) to obtain a graph representation \hg. The query representation \hq and the graph representation \hg are combined to solve the defeasible task. Next, we provide details on obtaining \hq, \hv, and \hg.
3.2.1 Encoding query and nodes
Let \lm be a pre-trained language model (in our case \roberta [liu2019roberta]). We use $\lm(s)$ to denote the dense representation of a sequence of tokens $s$ returned by the language model \lm. In our case, we use the pooled encoding of the \roberta encoder (the representation of the <s> token) as the sequence representation.
We encode the defeasible query and the nodes of the graph using \lm. The query representation is computed as $\hq = \lm(\dques)$. We similarly obtain a matrix of node representations by encoding each node $v_i$ in \gengraph with \lm as follows:

$h_{v_i} = \lm(v_i)$   (1)

where $h_{v_i}$ refers to the dense representation of the $i$-th node $v_i$ of \gengraph derived from \lm, and $V = [h_{v_1}; \dots; h_{v_n}]$ refers to the matrix of node representations.
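As a concrete illustration, the pooled <s> encodings could be obtained with the HuggingFace Transformers library roughly as follows; this is a minimal sketch assuming roberta-base and is not the exact implementation.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
lm = RobertaModel.from_pretrained("roberta-base")

def encode(texts):
    """Return one dense vector per input string, taken from the <s> (pooled) position."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # no_grad only for this illustration; training would backprop through the LM
        out = lm(**batch)
    return out.last_hidden_state[:, 0, :]  # <s> token representation, shape (len(texts), 768)

# h_q: query representation; V: matrix of node representations (one row per node).
h_q = encode(["The drinking glass fell </s> The glass broke </s> The glass fell on a pillow"])
V = encode(["glass is fragile", "the pillow cushions the fall"])
```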
3.2.2 Learning graph representations using Mixture-of-experts
Recently, mixture-of-experts [jacobs1991adaptive, shazeer2017outrageously, fedus2021switch] has emerged as a promising method for combining multiple feature types. Mixture-of-experts (\moe) is especially useful when the input consists of multiple facets, each carrying a distinct semantic meaning. Previously, \citetgu2018universal,chen2019multi used the ability of \moe to pool disparate features on low-resource and cross-lingual language tasks. Since each node in our inference graphs plays a specific role in defeasible reasoning (contextualizer, situation, mediator, etc.), we take inspiration from these works and design a hierarchical \moe model to pool node representations into a graph representation \hg.
An \moe consists of $n$ expert networks $E_1, \dots, E_n$ and a gating network $G$. Given an input $x$, each expert network $E_i$ learns a transform $E_i(x)$ of the input. The gating network produces the weights $w_1, \dots, w_n = G(x)$ used to combine the expert outputs for the input $x$. Finally, the output $y$ is returned as a convex combination of the expert outputs:

$y = \sum_{i=1}^{n} w_i \, E_i(x), \qquad \sum_{i=1}^{n} w_i = 1$   (2)

The output $y$ can either be the logits for an end task [shazeer2017outrageously, fedus2021switch] or pooled features that are passed to a downstream learner [chen2019multi, gu2018universal]. The gating network and the expert networks are trained end-to-end. During learning, the gradients help the gating network learn to correctly mix the outputs of the experts by generating a distribution over the experts that favors the best expert for a given input.
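The following PyTorch sketch illustrates the generic \moe of Eq. (2), with linear experts and a softmax gate; it is an illustrative sketch rather than the exact architecture used in \ours.

```python
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    """y = sum_i w_i * E_i(x), with w = softmax(G(x)) a convex combination (Eq. 2)."""
    def __init__(self, in_dim: int, out_dim: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(num_experts))
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)               # (batch, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], 1)   # (batch, num_experts, out_dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)      # convex combination

# Example: pool a 768-d feature with 5 experts.
moe = MixtureOfExperts(768, 768, num_experts=5)
y = moe(torch.randn(4, 768))  # shape (4, 768)
```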
Designing hierarchical \moefor defeasible reasoning
Different parts of the inference graph might be useful for answering a query to different degrees. Further, for certain queries the graph might not be useful, or might even be distracting, and the model should rely primarily on the input query alone [defeasible-human-helps-paper]. This motivates a two-level architecture that has: (i) the ability to select a subset of the nodes in the graph, and (ii) the ability to jointly reason over the query and the graph to varying degrees.
Given these requirements, a hierarchical \moe model presents itself as a natural choice for modeling this task. The first \moe (\nodemoe) creates a graph representation by taking a convex combination of the node representations. The second \moe (\graphmoe) then takes a convex combination of the graph representation returned by \nodemoe and the query representation, and passes it to an MLP for the downstream task.
\nodemoe consists of five node-experts $E_1, \dots, E_5$ and a gating network $G_{\text{node}}$ that selectively pools the node representations \hv into the graph representation \hg:

$\hg = \sum_{i=1}^{5} w_i \, E_i(h_{v_i}), \qquad w_1, \dots, w_5 = G_{\text{node}}(V)$   (3)

\graphmoe contains two experts (a graph expert $E_g$ and a question expert $E_q$) and a gating network $G_{\text{graph}}$ that combines the graph representation \hg returned by \nodemoe and the query representation \hq:

$h = w_g \, E_g(\hg) + w_q \, E_q(\hq), \qquad w_g, w_q = G_{\text{graph}}(\hg, \hq)$   (4)

The combined representation $h$ is then passed to a 1-layer MLP to perform classification.
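A possible PyTorch sketch of this hierarchical design is shown below; the gate inputs and layer shapes are illustrative assumptions about the wiring rather than a specification of the exact \ours architecture.

```python
import torch
import torch.nn as nn

class NodeMoE(nn.Module):
    """Pools node representations h_v (one per node role) into a graph vector h_g (Eq. 3).
    One expert per node; the gate produces a convex combination over the nodes."""
    def __init__(self, dim: int, num_nodes: int = 5):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_nodes))
        self.gate = nn.Linear(num_nodes * dim, num_nodes)  # gate input (flattened nodes) is an assumption

    def forward(self, node_reps: torch.Tensor) -> torch.Tensor:   # (batch, num_nodes, dim)
        batch = node_reps.size(0)
        w = torch.softmax(self.gate(node_reps.reshape(batch, -1)), dim=-1)          # (batch, num_nodes)
        out = torch.stack([e(node_reps[:, i]) for i, e in enumerate(self.experts)], dim=1)
        return (w.unsqueeze(-1) * out).sum(dim=1)                  # h_g: (batch, dim)

class GraphMoE(nn.Module):
    """Mixes the graph representation h_g and query representation h_q (Eq. 4),
    then classifies with a 1-layer MLP."""
    def __init__(self, dim: int, num_classes: int = 2):
        super().__init__()
        self.graph_expert = nn.Linear(dim, dim)
        self.query_expert = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, 2)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, h_g: torch.Tensor, h_q: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(torch.cat([h_g, h_q], dim=-1)), dim=-1)         # (batch, 2)
        mixed = w[:, :1] * self.graph_expert(h_g) + w[:, 1:] * self.query_expert(h_q)
        return self.classifier(mixed)  # logits over {weakener, strengthener}

# Example wiring: 5 node vectors + 1 query vector per sample, hidden size 768.
node_moe, graph_moe = NodeMoE(768), GraphMoE(768)
logits = graph_moe(node_moe(torch.randn(4, 5, 768)), torch.randn(4, 768))  # shape (4, 2)
```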
4 Experiments
In this section, we empirically study whether the defeasible inference task can be improved by first modeling the question scenario using inference graphs. We also study the reasons for any improvements and perform an error analysis.
4.1 Experimental setup
Dataset | Split | # Samples | Total
---|---|---|---
\atomic | train | 35,001 | 42,977
\atomic | test | 4,137 |
\atomic | dev | 3,839 |
\social | train | 88,675 | 92,295
\social | test | 1,836 |
\social | dev | 1,784 |
\snli | train | 77,015 | 95,795
\snli | test | 9,438 |
\snli | dev | 9,342 |
Datasets
Our end-task performance is measured on three existing datasets for defeasible inference (\urlgithub.com/rudinger/defeasible-nli): \atomic, \snli, and \social. Table 1 presents the statistics. These datasets exhibit substantial diversity because of their domains: \snli (natural language inference), \social (reasoning about social norms), and \atomic (commonsense reasoning). Performing well across all of them requires a general model.
Baselines and setup
The previous state-of-the-art (SOTA) is the \roberta [liu2019roberta] model presented in \citetrudinger-etal-2020-thinking, and we report the published numbers for this baseline. We adhere to the same hyperparameters used in that paper for the \lm used in \ours. For further hyperparameter details of \ours, we refer the reader to the Appendix.
4.2 Results
Table 2 compares QA accuracy on these datasets without and with modeling the question scenario using our generated graphs. The results show consistent gains across all datasets, with \snli seeing the highest gains. \ours achieves a new state-of-the-art on all three datasets, while also producing justifications for its answers.
Model | \atomic | \snli | \social
---|---|---|---
Prev-SOTA | 78.3 ± 1.3 | 81.6 ± 1.8 | 86.2 ± 0.7
\ours | 80.2 ± 1.2 | 85.6 ± 1.7 | 88.6 ± 0.7
We study the reasons behind these gains by analyzing the contributions of the graph quality and the graph encoder.
4.3 Understanding \oursgains
In this section, we study the contribution of the main components of the \ours pipeline.
4.3.1 Impact of graph correctness:
As previously mentioned, the authors of \mercurie provide two datasets: \mclean and \mnoisy. To study whether graph correctness is important, we replace \mclean with \mnoisy as the input to \ours's graph encoder. Table 3 shows that this consistently hurts performance across all the datasets, indicating that noisier graphs lead to worse task performance while better graphs improve it.
Graphs | \atomic | \snli | \social
---|---|---|---
\GEN | 78.5 ± 1.3 | 83.8 ± 1.8 | 88.2 ± 0.7
\CORWF | 80.2* ± 1.2 | 85.6* ± 1.7 | 88.6 ± 0.7
4.3.2 Impact of graph encoder:
We compare our \moe approach against two alternative approaches to graph encoding:
1. Graph convolutional networks: We follow the approach of \citetlv2020graph, who use a \gcn [kipf2016semi] to learn rich node representations from graphs. Broadly, node representations are initialized by \lm and then refined using a \gcn. Finally, multi-headed attention [vaswani2017attention] between the question representation \hq and the node representations is used to yield \hg. A detailed description of this method is given in Appendix \secrefsec:gcn_pooling.
2. String-based representation: Another popular approach [proscript, madaan2020neural] is to concatenate the string representations of the nodes, $v_1 \oplus \dots \oplus v_n$, and then use \lm to obtain the graph representation $\hg = \lm(v_1 \oplus \dots \oplus v_n)$, where $\oplus$ denotes string concatenation.
Table 5 shows that the \moe graph encoder is instrumental in enhancing end-task performance, while introducing only 5% additional parameters relative to the \str baseline (Table 6). Next, we study the reasons for these gains.
We suspect that the weaker performance of \gcn stems from its inability to ignore the noise present in the graphs. The graphs augmented with each query are not human-curated; they are generated by a language model in a zero-shot inference setting. Thus, \gcn-style message passing may amplify the noise in the graph representations. In contrast, \nodemoe first selects the nodes that are most useful for answering the query when forming the graph representation \hg. Further, \graphmoe can completely discard the graph representation, as it does in many cases where the true answer for the defeasible query is weakener.
To further establish that message passing can hamper downstream task performance, we experiment with a \gcn-\moe hybrid: we first refine the node representations using a 2-layer \gcn as in [lv2020graph], and then pool the node representations using an \moe. The results (Table 4) are about the same as those obtained with \gcn, indicating that poor node representations are indeed the root cause of \gcn's weak performance. This is also supported by [Shi2019FeatureAttentionGCNUnderNoise], which found that noise propagation directly deteriorates network embeddings and that \gcn is not robust to such noise.
Dataset | accuracy |
---|---|
\atomic | 78.69 |
\snli | 84.3 |
\social | 87.76 |
Interestingly, graphs help the end task even when encoded using the relatively simple \str-based encoding scheme, further establishing their utility.
Encoder | \atomic | \snli | \social
---|---|---|---
\str | 79.5 ± 1.3 | 83.1 ± 1.8 | 87.2 ± 0.7
\gcn | 78.9 ± 1.3 | 82.4 ± 1.8 | 88.1 ± 0.7
\moe | 80.2 ± 1.2 | 85.6 ± 1.7 | 88.6 ± 0.7
Method | #Params | Runtime |
---|---|---|
\str | 124M | 0.17 |
\gcn | 131M | 0.47 |
\moe | 133M | 0.40 |
4.3.3 Detailed \moeanalysis
We now analyze the \moes used in \ours: (i) the \moe over the nodes, i.e., \nodemoe, and (ii) the \moe over the graph and the input query, i.e., \graphmoe.
\graphmoe performs better for strengtheners:
Figure 4 shows that the graph makes a comparatively larger contribution than the input query when the label is strengthener. In instances where the label is weakener, the gate of \graphmoe gives a higher weight to the question. This trend holds across all the datasets. We conjecture this happens because language models are tuned to generate events that happen rather than events that do not: in the case of a weakener, the nodes must express that event1 leads to less of event2, whereas language models are naturally trained on event1 leads to event2. This requires further investigation in future work.
\nodemoe relies more on certain nodes:
We study the distribution over the types of nodes and their contribution to \nodemoe. Recall from Figure 8 that the C- and C+ nodes are contextualizers that provide more background context to the question, the S- node is typically an inverse situation (i.e., the inverse of \upd), and M- and M+ are the mediator nodes leading to the hypothesis. Figure 5 shows that the situation node S- is the most important, followed by the contextualizers and then the mediators. Notably, our analysis shows that mediators are less important for machines than they were for humans in the experiments conducted by [defeasible-human-helps-paper]. This is possibly because humans and machines use different pieces of information. As our error analysis in §5 will show, the mediators can be redundant given \hypo and \upd. Humans might have used this redundancy to reinforce their beliefs, whereas machines leverage the unique signals present in S- and the contextualizers to solve the task.
\nodemoe learns the node semantics:
The network learns the semantic grouping of the nodes (contextualizers, situations, mediators), which becomes evident when plotting the correlation between the gate weights. There is a strong negative correlation between the situation nodes and the context nodes, indicating that only one of them is activated at a time.
\nodemoe and \graphmoe have peaky distributions:
Ideally, a peaky (low-entropy) distribution over the gate values is desired, as it implies that the network judiciously selects the right expert for a given input. We compute the average entropy of the \nodemoe and \graphmoe gates and find the entropy values to be low (Table 7), indicating peaky distributions. These peaky distributions over nodes can be regarded as explanations through supporting nodes, analogous to scene-graph-based explanations in visual QA [Ghosh2019VQAExplanations].
Gate | Uniform entropy | \moe entropy
---|---|---
\nodemoe | 2.32 | 0.52
\graphmoe | 1.00 | 0.08
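For reference, these values are consistent with base-2 entropy: a uniform gate over the five node experts has entropy log2(5) ≈ 2.32 bits and a uniform gate over two experts has 1 bit. A minimal sketch of the computation, with made-up gate values:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a gate distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([0.2] * 5))                        # uniform over 5 node experts -> ~2.32
print(entropy_bits([0.90, 0.04, 0.03, 0.02, 0.01]))   # a peaky gate -> low entropy
print(entropy_bits([0.5, 0.5]))                       # uniform over 2 experts -> 1.0
```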
5 Error analysis
 | now fail | now succeed
---|---|---
prev fail | \atomic 615, \snli 197, \social 772 | \atomic 294, \snli 124, \social 398
prev succeed | \atomic 207, \snli 68, \social 302 | \atomic 3022, \snli 1448, \social 7967
Table 8 shows that \ours is able to correct several previously wrong examples. We observe that when \ours corrects previously failing cases, \nodemoe relies more on mediators: the average mediator probability goes up from 0.09 to 0.13, averaged over the datasets. \ours still fails in some cases, and more concerning are the cases where the model previously succeeded but fails once the graph is appended. To study this, we annotate 50 random dev samples in total across all three datasets. For each sample, a human annotator marked the errors in the graph, if any. We observe the following error categories (for more examples, see the error-analysis table in the Appendix): \squishlist
All nodes off topic (4%): The graph nodes were not on topic. This happens rarely, when \ours is unable to disambiguate the sense of a word in the input question. For instance, for \upd = “there is a water fountain in the center”, \ours generated a graph based on the incorrect sense of a natural water spring. In another instance, for \upd = “personX's toy is a car”, \ours generated a graph about a real car.
Repeated nodes (20%): These may be exact or near-exact matches. Node pairs with similar effects tend to be repeated in some samples; e.g., the S- node is often repeated as the contextualizer C-, perhaps because these nodes indirectly affect the graph nodes in a similar way.
Mediators are uninformative (34%): The mediating nodes are incorrect or uninformative. One source of these errors is when \hypo and \upd are nearly connected by a single hop, e.g., \hypo = “personx pees” and \upd = “personx drank a lot of water previously”.
Good graphs are ineffective (42%): These graphs contained the information required to answer the question, but the gating \moe mostly ignored the graph. This can be attributed in part to the observation in Figure 4 that samples with the weakener label disproportionately ignore the graph. \squishend
The largest percentage of errors was in \atomic, in part due to low question quality; this is in accordance with the dataset-quality results reported in the original paper [rudinger-etal-2020-thinking].
6 Conclusion
We present \ours, a system that achieves a new state-of-the-art on three different defeasible reasoning datasets. We find that the primary reason for these gains is that \ours models the question scenario with an inference graph and judiciously encodes that graph to answer the question. This result is significant because it shows that performance can be improved by guiding a system to “think about” a question and explicitly model the scenario, rather than answering reflexively.
Appendix A All results
Dataset | Baseline | \GEN | \CORR | \CORWF
---|---|---|---|---|
\atomic | 78.3 | 78.78 | 78.35 | 79.48 |
\snli | 81.6 | 82.16 | 83.88 | 83.11 |
\social | 86.2 | 86.75 | 87.75 | 87.24 |
average | 82.03 | 82.56 | 83.32 | 83.27 |
Dataset | \GEN | \CORR | \CORWF
---|---|---|---|
\atomic | 78.54 | 78.78 | 79.14 |
\snli | 83.83 | 83.44 | 85.59 |
\social | 88.23 | 88.21 | 88.62 |
average | 83.53 | 83.48 | 84.5 |
Dataset | \GEN | \CORR | \CORWF
---|---|---|---|
\atomic | 78.25 | 78.9 | 78.85 |
\snli | 82.63 | 83.56 | 82.36 |
\social | 87.92 | 88.08 | 88.12 |
average | 82.93 | 83.51 | 83.11 |

Appendix B Graph-augmented defeasible reasoning algorithm
In Algorithm 1, we outline our graph-augmented defeasible learning process.
Appendix C Description of \gcn
We now describe our adaptation of the method of \citetlv2020graph to pool the node representations into \hg using a \gcn.
Refining node representations
The representation of each node $v_i$ is first initialized using:

$h_{v_i}^{(0)} = \lm(v_i)$

where $\lm(v_i)$ is the node representation returned by the \lm, and $h_{v_i}^{(0)} \in \mathbb{R}^{d}$. This initial representation is then refined by running $L$ layers of a GCN [kipf2016semi], where each layer $l$ is updated using representations from layer $l-1$ as follows:

$h_{v_i}^{(l)} = \sigma\!\left( \sum_{u \in \mathcal{N}(v_i)} W^{(l)} h_{u}^{(l-1)} \right)$   (5)

where $\sigma$ is a non-linear activation function, $W^{(l)}$ is the GCN weight matrix for the $l$-th layer, $\mathcal{N}(v_i)$ is the list of neighbors of vertex $v_i$, and $H^{(l)}$ is the matrix of layer-$l$ representations of the nodes such that $H^{(l)}_i = h_{v_i}^{(l)}$.
Learning graph representation
We use multi-headed attention [vaswani2017attention] to combine the query representation \hq and the node representations $H^{(L)}$ into a graph representation \hg. The multi-head attention operation is defined as follows:

$\mathrm{MultiHead}(\hq, H^{(L)}) = [\mathrm{head}_1; \dots; \mathrm{head}_k]\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(\hq W_i^{Q},\, H^{(L)} W_i^{K},\, H^{(L)} W_i^{V})$   (7)

where $k$ is the number of attention heads, and $W^{O}$, $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ are learned projection matrices.
Finally, the graph representation \hg generated by the multi-head attention is concatenated with the question representation \hq to obtain the prediction:

$y = f([\hg; \hq])$

where $f$ is a single-layer MLP.
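A compact PyTorch sketch of this \gcn baseline (LM-initialized node vectors, two layers of neighborhood aggregation, then multi-headed attention pooling against the query) is given below; the adjacency normalization and layer sizes are illustrative assumptions rather than the exact configuration of \citetlv2020graph.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One GCN layer: aggregate neighbor features through a normalized adjacency (Eq. 5)."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Linear(dim, dim)

    def forward(self, H: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: (num_nodes, num_nodes) with self-loops; row-normalize as a simple choice.
        norm_adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.weight(norm_adj @ H))

class GCNAttentionPooler(nn.Module):
    """Refine LM-initialized node vectors with a 2-layer GCN, then pool them into h_g
    via multi-headed attention with the query vector h_q, and classify [h_g; h_q]."""
    def __init__(self, dim: int = 768, heads: int = 8, num_classes: int = 2):
        super().__init__()
        self.gcn = nn.ModuleList([SimpleGCNLayer(dim), SimpleGCNLayer(dim)])
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, node_reps: torch.Tensor, adj: torch.Tensor, h_q: torch.Tensor):
        H = node_reps                                # (batch, num_nodes, dim), from the LM
        for layer in self.gcn:
            H = layer(H, adj)
        h_g, _ = self.attn(h_q.unsqueeze(1), H, H)   # query attends over the refined nodes
        h_g = h_g.squeeze(1)
        return self.classifier(torch.cat([h_g, h_q], dim=-1))

# Example: 5 nodes per graph, fully-connected adjacency with self-loops.
adj = torch.ones(5, 5)
logits = GCNAttentionPooler()(torch.randn(4, 5, 768), adj, torch.randn(4, 768))  # shape (4, 2)
```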
Appendix D Significance Tests
We perform two statistical tests to verify our results: i) the micro sign test (s-test) [yang1999re], and ii) McNemar's test [dror2018hitchhiker].
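A minimal sketch of how such tests could be run from per-example correctness indicators, using scipy for the sign test and statsmodels for McNemar's test, is shown below; the data here are synthetic and the exact test variants used for the reported p-values may differ.

```python
import numpy as np
from scipy.stats import binomtest
from statsmodels.stats.contingency_tables import mcnemar

# Per-example correctness of the baseline and of our system on the same test set (toy data).
rng = np.random.default_rng(0)
baseline_correct = rng.random(1000) < 0.80
ours_correct = rng.random(1000) < 0.84

# Micro sign test (s-test): among examples where the systems disagree, is ours
# right significantly more often than a fair coin would predict?
ours_only = int(np.sum(ours_correct & ~baseline_correct))
base_only = int(np.sum(baseline_correct & ~ours_correct))
print("s-test p-value:", binomtest(ours_only, ours_only + base_only, p=0.5).pvalue)

# McNemar's test on the 2x2 table of (baseline correct?, ours correct?) counts.
table = [[int(np.sum(baseline_correct & ours_correct)), base_only],
         [ours_only, int(np.sum(~baseline_correct & ~ours_correct))]]
print("McNemar p-value:", mcnemar(table, exact=False, correction=True).pvalue)
```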
Dataset | s-test | McNemar’s test |
---|---|---|
\atomic | 5.07e-05 | 1.1e-04 |
\snli | 2.65e-05 | 6.5e-05 |
\social | 1.4e-04 | 3.2e-04 |
Dataset | s-test | McNemar’s test |
---|---|---|
\atomic | 0.001 | 0.003676 |
\snli | 0.01 | 0.026556 |
\social | 0.06 | 0.146536 |
Encoder | \atomic | \snli | \social
---|---|---|---|
\str | 0.13 | 1.8e-06 | 8.7e-06 |
\gcn | 0.006 | 1.31e-05 | 0.03 |
Encoder | \atomic | \snli | \social
---|---|---|---|
\str | 0.28 | 4e-06 | 2e-05 |
\gcn | 0.015127 | 3.2e-05 | 0.06 |