
Think about it! Improving defeasible reasoning by first modeling the question scenario

Aman Madaan*, Niket Tandon*, Dheeraj Rajagopal*, Yiming Yang,
Peter Clark, Eduard Hovy
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Allen Institute for Artificial Intelligence, Seattle, WA, USA
{dheeraj,amadaan,yiming,hovy}@cs.cmu.edu
{nikett, peterc}@allenai.org
* Authors contributed equally to this work. Ordering determined by dice rolling.
Abstract

Defeasible reasoning is the mode of reasoning where conclusions can be overturned by taking into account new evidence. Existing cognitive science literature on defeasible reasoning suggests that a person forms a “mental model” of the scenario before answering a question about it. Our research asks whether neural models can similarly benefit from envisioning the question scenario before answering a defeasible query. Our approach is, given a question, to have a model first create a graph of relevant influences, and then effectively leverage that graph as an additional input when answering the question. We find that \ours achieves a new state-of-the-art on three different defeasible reasoning datasets, as well as now producing justifications for its answers. This result is significant as it illustrates that performance can be improved by guiding a system to “think about” a question and explicitly model the scenario, rather than answering reflexively.

1 Introduction

Defeasible inference is a mode of reasoning where conclusions may be modified given additional information [sep-reasoning-defeasible]. Here we consider the specific formulation and challenge in [rudinger-etal-2020-thinking]: given that some premise \pre plausibly implies a hypothesis \hypo, does new information that the situation is \upd weaken or strengthen the conclusion \hypo? For example, given that “The drinking glass fell” weakly implies “The glass broke”, being told “The glass fell on a pillow” weakens the implication.

Prior work: $x \rightarrow y$
\ours: $h(x, g(x)) \rightarrow y$
[Figure 1 example: Given that “The drinking glass fell”, will the update “The glass fell on a pillow” strengthen or weaken the hypothesis “The glass broke”? Answer: weakens. The figure also shows the generated inference graph for the question scenario $x$.]

Figure 1: \ours improves defeasible reasoning by modeling the question scenario \x with an inference graph $g(x)$ (available from the \mclean dataset). The graph is encoded judiciously using our graph encoder $h(\cdot)$, resulting in end task performance improvement.

A desirable approach is to jointly consider \pre, \hypo, and \upd to avoid deviating from the topic and to avoid repetition errors. We can borrow ideas from the cognitive science literature, which supports defeasible reasoning for humans with an inference graph [Pollock2009ARS]. Inference graphs draw connections between \pre, \hypo, and \upd through mediating events. This can be seen as a mental model of the question scenario formed before answering the question. This paper asks the natural question: can modeling the question scenario with inference graphs help machines in defeasible reasoning?

Proposed system: Our approach is, given a question, to have a model first create an inference graph, and then use that graph as an additional input when answering the defeasible reasoning query. Our proposed system, \ours, comprises a graph generation module and a graph encoding module that uses the generated graph for the query.

To generate inference graphs, we use the sequence-to-sequence approach of past work [madaan2021improving]. Their model uses data augmentation to create gold-like graphs. However, the resulting graphs can often be erroneous, so they included an error correction module to generate higher-quality inference graphs. This was important because we found that better graphs are more helpful in the downstream QA task.

The generated inference graph is used for the QA task on three existing defeasible inference datasets from diverse domains, viz., \snli (natural language inference) [bowman2015large], \social (reasoning about social norms) [forbes2020social], and \atomic (commonsense reasoning) [sap2019atomic]. We show that simply augmenting the question with our generated graphs yields some gains on all datasets. Importantly, we show that with a more judicious encoding of the graph-augmented question, one that accounts for interactions between the graph nodes, the QA accuracy improves substantially across all datasets. To achieve this, we include mixture-of-experts layers [mixture-of-experts-paper] during encoding, enabling the model to selectively attend to certain nodes while capturing their interactions.

In summary, our contribution is to borrow the inference graph idea from the cognitive science literature and show its benefits on the defeasible inference QA task. Using an error correction module in the graph generation process, and a judicious encoding of the graph-augmented question, \ours achieves a new state-of-the-art over three defeasible datasets, as well as now producing justifications for its answers. This result is also significant because it illustrates that performance can be improved by guiding a system to think about a question. This encourages future studies on other tasks where the question scenario can be modeled first, rather than answered reflexively.

2 Related work

Mental Models

Cognitive science has long promoted mental models - coherent, constructed representations of the world - as central to understanding, communication, and problem-solving [JohnsonLaird1983MentalM, mental-models, Hilton1996MentalMA]. Our work draws on these ideas, using inference graphs to represent the machine’s “mental model” of the problem at hand. Building the inference graph can be viewed as first asking clarification questions about the context before answering. This is similar to self-talk [selftalk] but directed towards eliciting chains of influence, the inference graph template acting as a schema for a mental model.

Generating graphs for commonsense reasoning

The use of generated knowledge to aid question answering has recently gained much attention [bosselut2019dynamic, yang2020g, madaan2020eigen, rajagopal2021curie]. comet [Bosselut2019COMETCT] uses gpt [radford2018improving] fine-tuned over commonsense knowledge graphs like atomic [sap2019atomic] and conceptnet [speer2017conceptnet] for KB completion. Similarly, \citet{rajagopal2021curie} aim to generate event influences grounded in a situation using pre-trained language models. Similar to these works, we adopt large-scale language models for such conditional generation of text for commonsense reasoning. However, our graph generation method generates complete graphs, as opposed to the short events or phrases generated by these methods.

Information retrieval is another way of collecting the necessary knowledge. Influences captured in inference graphs can also be generated disjointly via this approach, but it is non-trivial to preserve the joint semantics of \pre, \hypo, and \upd. [brahman2020learning] generate rationales for defeasible inference, but do so post hoc: given the answer, they generate a rationale based on training over the e-SNLI corpus, which is related to the δ-NLI dataset. They found that their system mostly generates very trivial rationales. We address the more realistic setup of predicting an inference graph jointly rather than post hoc, which that paper suggested as an important future direction.

Our work is closest in spirit to \citet{bosselut2019dynamic}, who generate a tree that connects a question to its answer choices via an intermediate layer of generated nodes. As opposed to \citet{bosselut2019dynamic}, we do not perform on-the-fly generation of graph nodes, and instead directly consume a pre-generated graph.

Graph Neural Networks for commonsense QA

Several existing methods use graphs as an additional input for commonsense reasoning [lin2019kagnet, lv2020graph, feng2020scalable]. These methods first retrieve a graph relevant to a question using information retrieval techniques, and then encode the graph using graph representation techniques like \gcn [kipf2016semi]. Different from these works, we use a generated graph grounded in the query for answering commonsense questions. While we experiment with state-of-the-art \gcn models, our model improves over the \gcn-based representations by using a mixture-of-experts (\moe) [jacobs1991adaptive] to pool multi-faceted input. Prior work has looked at using \moe for graph classification tasks [zhouexplore]. \citet{chen2019multi} use \moe to mix features from various languages for downstream POS and NER tagging tasks. To the best of our knowledge, we are the first to use \moe to pool graph representations for a downstream \qa task.

Figure 2: A sample inference graph for the running example.

3 \ours

3.1 Task: Defeasible Inference

Defeasible inference [rudinger-etal-2020-thinking] is a mode of reasoning in which, given a premise \pre, a hypothesis \hypo may be strengthened or weakened in light of new evidence \upd. For example, given a premise The drinking glass fell, the hypothesis The glass broke will be weakened by the situation the glass fell on the pillow, and strengthened by the situation \upd the glass fell on the rocks. We define the input $\V{x} = (\pre, \hypo, \upd)$ and output $\V{y} \in \{\text{strengthens}, \text{weakens}\}$.
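For concreteness, the input-output interface of the task can be sketched as a small container populated with the running example (a hypothetical illustration, not the dataset's actual schema):

```python
# Hypothetical container for one defeasible query (P, H, U) and its binary label;
# field names are illustrative, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class DefeasibleQuery:
    premise: str      # P
    hypothesis: str   # H
    update: str       # U, the new evidence
    label: str        # "strengthens" or "weakens"

example = DefeasibleQuery(
    premise="The drinking glass fell",
    hypothesis="The glass broke",
    update="The glass fell on a pillow",
    label="weakens",
)
```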

Prior research in psychology [Pollock2009ARS] and NLP [defeasible-human-helps-paper] has found that inference graphs can greatly aid human performance on defeasible inference. Inference graphs capture the mediating events and the context to facilitate critical thinking, acting as a mental model of the scenario posed by the defeasible query. A sample inference graph is shown in Figure 2.

We ask whether neural models can benefit from envisioning the question scenario with an inference graph before answering a defeasible inference query.

Inference graphs

As inference graphs are central to our work, we give a brief description of their structure next. Inference graphs were introduced in philosophy by \citet{Pollock2009ARS} to aid defeasible reasoning for humans, and in NLP by \citet{tandon2019wiqa} for a counterfactual reasoning task.

Inference graphs have four kinds of nodes [Pollock2009ARS, tandon2019wiqa, defeasible-human-helps-paper]: \squishlist

Contextualizers (C): these nodes set the context around a situation and connect to the \pre in some way.

Situations (S): these nodes are new situations that emerge which might overturn an inference.

Hypothesis (H): hypothesis nodes describe the outcome/conclusion of the situation.

Mediators (M): mediators are nodes that help bridge the knowledge gap between a situation node and a hypothesis node by explaining their connection explicitly. These nodes can act as either weakeners or strengtheners. \squishend

Dataset of inference graphs for defeasible reasoning

Recently, [madaan2021improving] proposed a sequence-to-sequence approach for obtaining higher-quality graphs for each defeasible query. They release two datasets: i) a noisy version of the graphs (\mnoisy), and ii) a cleaner version (\mclean). This gives, for each defeasible query \x, an inference graph $\V{G}_{\V{x}}(\V{V}, \V{E})$ that provides additional context for the query. We use these graphs in this work.

3.2 Graph augmented defeasible reasoning

Figure 3: An overview of our method to perform graph-augmented defeasible reasoning using a hierarchical mixture of experts. First, \nodemoe selectively pools the node representations to generate a graph representation \hg. Then, \graphmoe pools the query representation \hq and the graph representation generated by \nodemoe, and passes the result to the upstream classifier.

Following \citet{rudinger-etal-2020-thinking}, we first concatenate the components \dques of the defeasible query into a single sequence of tokens $\V{x} = (\mathbf{P} \| \mathbf{H} \| \mathbf{S})$, where $\|$ denotes concatenation. Thus, each sample for our graph-augmented binary-classification task takes the form $((\V{x}, \V{G}), \V{y})$, where $\V{y} \in \{\text{strengthener}, \text{weakener}\}$.

Outline:

We first use a language model \lm to obtain a dense representation \hq for the defeasible query, and a dense representation \hv for each node $\V{v} \in \V{G}$. The node representations \hv are then pooled using a hierarchical mixture-of-experts (\moe) to obtain a graph representation \hg. The query representation \hq and the graph representation \hg are combined to solve the defeasible task. Next, we provide details on obtaining \hq, \hv, and \hg.

3.2.1 Encoding query and nodes

Let \lm be a pre-trained language model (in our case \roberta [liu2019roberta]). We use $\V{h}_{\V{S}} = \mathcal{L}(\V{S}) \in \real{d}$ to denote the dense representation of a sequence of tokens $\V{S}$ returned by the language model \lm. In our case, we use the pooled encoding of the <s> token from the \roberta encoder as the sequence representation.

We encode the defeasible query $\V{x}$ and the nodes of the graph using \lm. The query representation is computed as $\V{h}_{\V{x}} = \mathcal{L}(\V{x})$. We similarly obtain a matrix of node representations \nodeset by encoding each node $\V{v}$ in \gengraph with \lm as follows:

$\V{h}_{\V{V}} = [\V{h}_{v_{1}}; \V{h}_{v_{2}}; \ldots; \V{h}_{v_{|\V{V}|}}] \quad (1)$

where $\V{h}_{v_{i}} \in \real{d}$ refers to the dense representation for the $i^{th}$ node of $\V{G}$ derived from \lm (i.e., $\V{h}_{v_{i}} = \mathcal{L}(v_{i})$), and $\V{h}_{\V{V}} \in \real{|\V{V}| \times d}$ refers to the matrix of node representations.
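As a rough sketch of this encoding step (illustrative only, not our exact pipeline; it assumes the HuggingFace transformers library, and the helper name and example node texts are hypothetical):

```python
# Minimal sketch: encode the defeasible query and graph nodes with RoBERTa and
# use the <s> (first-token) embedding as the pooled representation.
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

def encode(texts):
    """Return one d-dimensional vector per input string (the <s> embedding)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :]               # shape: (len(texts), d)

# Defeasible query x = P || H || S, concatenated into a single sequence.
premise = "The drinking glass fell."
hypothesis = "The glass broke."
update = "The glass fell on a pillow."
h_q = encode([" ".join([premise, hypothesis, update])])  # (1, d) query representation

# Hypothetical inference-graph nodes for the same query.
nodes = ["glass is fragile", "pillow is soft", "pillow cushions the fall"]
h_V = encode(nodes)                                      # (|V|, d) node matrix (Eq. 1)
```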

3.2.2 Learning graph representations using Mixture-of-experts

Recently, mixture-of-experts [jacobs1991adaptive, shazeer2017outrageously, fedus2021switch] has emerged as a promising method for combining multiple feature types. Mixture-of-experts (\moe) models are especially useful when the input consists of multiple facets, where each facet has a certain semantic meaning. Previously, \citet{gu2018universal} and \citet{chen2019multi} have used the ability of \moe to pool disparate features on low-resource and cross-lingual language tasks. Since each node in our inference graphs plays a certain role in defeasible reasoning (contextualizer, situation node, mediator, etc.), we take inspiration from these works to design a hierarchical \moe model that pools the node representations \nodeset into a graph representation \hg.

An \moe consists of $n$ expert networks $\V{E}_{1}, \V{E}_{2}, \ldots, \V{E}_{n}$ and a gating network $\V{M}$. Given an input $\V{x} \in \real{d}$, each expert network $\V{E}_{i}: \real{d} \rightarrow \real{k}$ learns a transform over the input. The gating network $\V{M}: \real{d} \rightarrow \Delta^{n}$ gives the weights $\V{p} = \{p_{1}, p_{2}, \ldots, p_{n}\}$ for combining the expert outputs for the input $\V{x}$. Finally, the output $\V{y}$ is returned as a convex combination of the expert outputs:

$\V{p} = \V{M}(\V{x})$
$\V{y} = \sum_{i=1}^{n} p_{i} \V{E}_{i}(\V{x}) \quad (2)$

The output $\V{y}$ can either be the logits for an end task [shazeer2017outrageously, fedus2021switch] or pooled features that are passed to a downstream learner [chen2019multi, gu2018universal]. The gating network $\V{M}$ and the expert networks $\V{E}_{1}, \V{E}_{2}, \ldots, \V{E}_{n}$ are trained end-to-end. During learning, the gradients to $\V{M}$ help it learn to correctly mix the output of the experts, by generating a distribution over the experts that favors the best expert for the given input.
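For concreteness, a minimal PyTorch sketch of such an \moe layer, matching Equation 2 (the class name, expert architecture, and sizes are illustrative assumptions, not our exact implementation):

```python
# Illustrative mixture-of-experts layer (Eq. 2): n expert MLPs and a gating
# network whose softmax weights give a convex combination of the expert outputs.
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU()) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(d_in, n_experts)     # gating network M

    def forward(self, x):                          # x: (batch, d_in)
        p = torch.softmax(self.gate(x), dim=-1)    # (batch, n_experts), weights p_i
        expert_out = torch.stack([E(x) for E in self.experts], dim=1)  # (batch, n, d_out)
        y = (p.unsqueeze(-1) * expert_out).sum(dim=1)                  # convex combination
        return y, p
```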

Designing a hierarchical \moe for defeasible reasoning

Different parts of the inference graph might be useful for answering a query to different degrees. Further, for certain queries the graph might not be useful, or might even be distracting, and the model could rely primarily on the input query alone [defeasible-human-helps-paper]. This motivates a two-level architecture that has: (i) the ability to select a subset of the nodes in the graph, and (ii) the ability to jointly reason over the query and the graph to varying degrees.

Given these requirements, a hierarchical \moe model presents itself as a natural choice for this task. The first \moe (\nodemoe) creates a graph representation by taking a convex combination of the node representations. The second \moe (\graphmoe) then takes a convex combination of the graph representation returned by \nodemoe and the query representation, and passes it to an MLP for the downstream task.

\squishlist
\nodemoe

consists of five node-experts and a gating network that selectively pool the node representations \hv into a graph representation \hg:

$\V{p} = \V{M}(\V{h}_{\V{V}})$
$\V{h}_{\V{G}} = \sum_{v \in \V{V}} p_{v} E_{v}(\V{h}_{v}) \quad (3)$
\graphmoe

contains two experts (a graph expert and a question expert) and a gating network to combine the graph representation \hg returned by \nodemoe and the query representation \hq:

$\V{p} = \V{M}([\V{h}_{\V{G}}; \V{h}_{\V{Q}}])$
$\V{h}_{\V{y}} = p_{\V{G}} E_{\V{G}}(\V{h}_{\V{G}}) + p_{\V{Q}} E_{\V{Q}}(\V{h}_{\V{Q}}) \quad (4)$
\squishend

$\V{h}_{\V{y}}$ is then passed to a 1-layer MLP to perform classification.
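The two levels can be wired together roughly as in the following PyTorch sketch (illustrative only; the gating inputs, layer sizes, and class names are our assumptions rather than the exact implementation):

```python
# Illustrative hierarchical MoE for graph-augmented defeasible reasoning.
import torch
import torch.nn as nn

class NodeMoE(nn.Module):
    """Eq. 3: five node-experts plus a gate that pools node representations into h_G."""
    def __init__(self, d: int, n_nodes: int = 5):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_nodes)])
        # Assumption: the gate sees the concatenated node matrix.
        self.gate = nn.Linear(n_nodes * d, n_nodes)

    def forward(self, h_V):                                    # h_V: (batch, n_nodes, d)
        p = torch.softmax(self.gate(h_V.flatten(1)), dim=-1)   # (batch, n_nodes)
        outs = torch.stack([E(h_V[:, i]) for i, E in enumerate(self.experts)], dim=1)
        h_G = (p.unsqueeze(-1) * outs).sum(dim=1)               # convex combination
        return h_G, p

class GraphMoE(nn.Module):
    """Eq. 4: mixes the graph representation h_G and the query representation h_Q."""
    def __init__(self, d: int):
        super().__init__()
        self.expert_G = nn.Linear(d, d)
        self.expert_Q = nn.Linear(d, d)
        self.gate = nn.Linear(2 * d, 2)
        self.classifier = nn.Linear(d, 2)    # 1-layer MLP: strengthener vs. weakener

    def forward(self, h_G, h_Q):             # h_G, h_Q: (batch, d)
        p = torch.softmax(self.gate(torch.cat([h_G, h_Q], dim=-1)), dim=-1)
        h_y = p[:, :1] * self.expert_G(h_G) + p[:, 1:] * self.expert_Q(h_Q)
        return self.classifier(h_y), p        # logits for the defeasible label
```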

4 Experiments

In this section we empirically study whether we can improve defeasible inference task by first modeling the question scenario using inference graphs. We also study the reasons for any improvements, and perform error analysis.

4.1 Experimental setup

Dataset Split # Samples Total
\atomic train 35,001 42,977
test 4137
dev 3839
\social train 88,675 92,295
test 1836
dev 1784
\snli train 77,015 95,795
test 9438
dev 9342
Table 1: Number of samples in each dataset by split.
Datasets

Our end task performance is measured on three existing datasets for defeasible inference (github.com/rudinger/defeasible-nli): \atomic, \snli, and \social. Table 1 presents the statistics. These datasets exhibit substantial diversity because of their domains: \snli (natural language inference), \social (reasoning about social norms), and \atomic (commonsense reasoning). Performing well across these diverse datasets requires a general model.

Baselines and setup

The previous state-of-the-art (SOTA) is the \roberta [liu2019roberta] model presented in \citet{rudinger-etal-2020-thinking}, and we report the published numbers for this baseline. We adhere to the same hyperparameters used in that paper for the \lm used in \ours. For further hyperparameter details of \ours, we refer the reader to the Appendix.

4.2 Results

Table 2 compares QA accuracy on these datasets with and without modeling the question scenario with our generated graphs. The results show consistent gains across all datasets, with \snli seeing the highest gains. \ours achieves a new state-of-the-art on all three datasets, as well as now producing justifications for its answers.

\atomic \snli \social
Prev-SOTA 78.3 ± 1.3 81.6 ± 1.8 86.2 ± 0.7
\ours 80.2 ± 1.2 85.6 ± 1.7 88.6 ± 0.7
Table 2: \ours is better on multiple domains of defeasible inference, demonstrating that modeling the question scenario by generating an inference graph helps. All results are statistically significant with $p < 0.005$ using the sign-test and McNemar's test (p-values in Appendix Table 12).

We study the reasons behind these gains by analyzing the contributions of the graph quality and the graph encoder.

4.3 Understanding \oursgains

In this section, we study the contribution of the main components of the \ourspipeline.

4.3.1 Impact of graph correctness:

As previously mentioned, the authors of \mercurie provide two datasets: \mclean and \mnoisy. To study whether graph correctness is important, we switch \mclean with \mnoisy as input to \ours's graph encoder. Table 3 shows that this consistently hurts across all the datasets. This indicates that noisier graphs lead to worse task performance, and better graphs improve performance.

\atomic \snli \social
\GEN 78.5  ± 1.3 83.8  ± 1.8 88.2 ± 0.7
\CORWF 80.2* ± 1.2 85.6* ± 1.7 88.6 ± 0.7
Table 3: Better graphs lead to better task performance. * indicates statistical significance with $p < 0.02$.

4.3.2 Impact of graph encoder:

We experiment with two alternative approaches to graph encoding to compare our \moeapproach:

1. Graph convolutional networks: We follow the approach of \citet{lv2020graph}, who use \gcn [kipf2016semi] to learn rich node representations from graphs. Broadly, node representations are initialized by \lm and then refined using a \gcn. Finally, multi-headed attention [vaswani2017attention] between the question representation \hq and the node representations is used to yield \hg. We provide a detailed description of this method in Appendix C.

2. String-based representation: Another popular approach [proscript, madaan2020neural] is to concatenate the string representations of the nodes, and then use \lm to obtain the graph representation $\V{h}_{\V{G}} = \mathcal{L}(v_{1} \| v_{2} \| \ldots)$, where $\|$ indicates string concatenation.

Table 5 shows that the \moe graph encoder is instrumental in enhancing end task performance, with only 5% additional parameters compared to the \str baseline (see Table 6). Next, we study the reasons for these gains.

We suspect that the weaker performance of \gcn is due to its inability to ignore the noise present in the graphs. The graphs augmented with each query are not human-curated; they are generated by a language model in a zero-shot inference setting. Thus, \gcn-style message passing might instead amplify the noise in the graph representations. On the other hand, \nodemoe first selects the nodes that are most useful for answering the query in order to form the graph representation \hg. Further, \graphmoe can decide to completely discard the graph representation, as it does in many of the cases where the true answer for the defeasible query is weakener.

To further establish the possibility of message passing hampering downstream task performance, we experiment with a \gcn-\moe hybrid, wherein we first refine the node representations using a 2-layer \gcn as used by [lv2020graph], and then pool the node representations using an \moe. We found the results to be about the same as the ones we obtained with \gcn (Table 4), indicating that bad node representations are indeed the root cause of the poor performance of \gcn. This is also supported by [Shi2019FeatureAttentionGCNUnderNoise], which found that noise propagation directly deteriorates network embeddings and that \gcn is not robust to noise.

Dataset accuracy
\atomic 78.69
\snli 84.3
\social 87.76
Table 4: Ablation: using an \moe on top of \gcn-refined representations hurts performance.

Interestingly, graphs help the end task even when encoded using the relatively simple \str-based encoding scheme, further establishing their utility.

\atomic \snli \social
\str 79.5 ± 1.3 83.1 ± 1.8 87.2 ± 0.7
\gcn 78.9 ± 1.3 82.4 ± 1.8 88.1 ± 0.7
\moe 80.2 ± 1.2 85.6 ± 1.7 88.6 ± 0.7
Table 5: Contribution of \moe-based graph encoding compared with alternative graph encoding methods. The gains of \moe over \gcn are statistically significant for all the datasets with $p < 0.02$, and the gains over \str are significant for \snli and \social with $p < 1e{-}05$.
Method #Params Runtime
\str 124M 0.17
\gcn 131M 0.47
\moe 133M 0.40
Table 6: Number of parameters in the different encoding methods. Runtime reports the number of seconds to process one training example.

4.3.3 Detailed \moeanalysis

We now analyze the \moe modules used in \ours: (i) the \moe over the nodes, i.e., \nodemoe, and (ii) the \moe over $\V{G}$ and the input $x$, i.e., \graphmoe.

Figure 4: \graphmoe gate values for the classes strengthens and weakens, averaged over the three datasets.
\graphmoe performs better for $y =$ strengthens:

Figure 4 shows that the graph makes a comparatively larger contribution than the input when the label is strengthens. In instances where the label is weakener, the gate of \graphmoe gives a higher weight to the question. This trend was present across all the datasets. We assume this happens because language models are tuned to generate events that happen rather than events that do not. In the case of a weakener, the nodes must be of the type event1 leads to less of event2, whereas language models are naturally trained for event1 leads to event2. This requires further investigation in the future.

\nodemoe relies more on certain nodes:

We study the distribution over the types of nodes and their contribution to \nodemoe. Recall from the inference graph schema (Figure 7) that C- and C+ nodes are contextualizers that provide more background context to the question, the S- node is typically an inverse situation (i.e., inverse \upd), while M- and M+ are the mediator nodes leading to the hypothesis. Figure 5 shows that the situation node S- was the most important, followed by the contextualizers and the mediators. Notably, our analysis shows that mediators are less important for machines than they were for humans in the experiments conducted by [defeasible-human-helps-paper]. This is possibly because humans and machines may use different pieces of information. As our error analysis will later show in §5, the mediators can be redundant given $x$. Humans might have used the redundancy to reinforce their beliefs, whereas machines leverage the unique signals present in S- and the contextualizers to solve the task.

Figure 5: \nodemoe gate values for the three datasets.
\nodemoe learns the node semantics:

The network learns the semantic grouping of the nodes (contextualizers, situation, mediators), which becomes evident when plotting the correlation between the gate weights (Figure 6). There is a strong negative correlation between the situation nodes and the context nodes, indicating that only one of them is activated at a time.

Figure 6: Correlation between the probability assigned to each semantic node type by \nodemoe.
\nodemoe and \graphmoe have peaky distributions:

Ideally, a peaky distribution over the gate values (low entropy) is desired, as it implies that the network is judiciously selecting the right expert for a given input. We compute the average entropy of \nodemoe and \graphmoe and find the entropy values to be low (Table 7), which indicates a peaky distribution. These peaky distributions over nodes can be considered as an explanation through supporting nodes. This is analogous to scene-graph-based explanations in visual QA [Ghosh2019VQAExplanations].

Gate Uniform entropy \moe Entropy
\nodemoe 2.32 0.52
\graphmoe 1.0 0.08
Table 7: Entropy values for \nodemoe and \graphmoe calculated using $\log_{2}$. Low entropy values are desired, and indicate higher selectivity.
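For reference, a rough sketch of how such average gate entropies can be computed from the collected gate distributions (illustrative; the array name and shape are assumptions):

```python
# Sketch: average gate entropy (in bits) for an MoE, given softmax gate outputs
# collected over a dataset. `gate_probs` has shape (num_examples, num_experts).
import numpy as np

def mean_gate_entropy(gate_probs: np.ndarray) -> float:
    eps = 1e-12                                          # avoid log2(0)
    ent = -(gate_probs * np.log2(gate_probs + eps)).sum(axis=1)
    return float(ent.mean())

# The uniform baseline for n experts is log2(n): log2(5) ≈ 2.32 for NodeMoE and
# log2(2) = 1.0 for GraphMoE, matching the "Uniform entropy" column in Table 7.
```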

5 Error analysis

now fail | now succ
prev fail: \atomic 615, \snli 197, \social 772 | \atomic 294, \snli 124, \social 398
prev succ: \atomic 207, \snli 68, \social 302 | \atomic 3022, \snli 1448, \social 7967
Table 8: Confusion matrix on the test split of the three datasets (rows: previous model; columns: \ours). The differences are highly significant, with p-values of 1.1e-04 on \atomic, 6.5e-05 on \snli, and 3.2e-04 on \social using McNemar's test.

Table 8 shows that \ours is able to correct several previously wrong examples. We observe that when \ours corrected previously failing cases, \nodemoe relied more on mediators: the average mediator probabilities go up from 0.09 to 0.13, averaged over the datasets. \ours still fails on some cases, and more concerning are the cases where the model previously succeeded but fails once the graph is appended. To study this, we annotate 50 random dev samples in total over all three datasets. For each sample, a human annotator marked the errors in the graph, if any. We observe the following error categories (for more examples, see Appendix Table LABEL:tab:error-analysis-examples): \squishlist

All nodes off topic (4%): The graph nodes were not on topic. This rarely happens, when \ours is unable to distinguish the sense of a word in the input question. For instance, for \upd = there is a water fountain in the center, \ours generated a graph based on an incorrect sense of natural water spring. In another instance, for \upd = personX's toy is a car, \ours generated a graph about a real car.

Repeated nodes (20%): These may be exact or near-exact matches. Node pairs with similar effects tend to be repeated in some samples, e.g., the S- node is often repeated with the contextualizer C-, perhaps because these nodes indirectly affect the graph nodes in a similar way.

Mediators are uninformative (34%): The mediating nodes are not correct or informative. One source of these errors is when \hypo and \upd are nearly connected by a single hop, e.g., \hypo = personx pees and \upd = personx drank a lot of water previously.

Good graphs are ineffective (42%): These graphs contained the information required to answer the question, but the gating \moe mostly ignored the graph. This can be attributed in part to the observation in Figure 4 that samples with the weakener label disproportionately ignore the graph. \squishend

The highest percentage of errors was in \atomic, in part due to low question quality; this is in accordance with the dataset quality results in the original paper [rudinger-etal-2020-thinking].

6 Conclusion

We present \ours, a system that achieves a new state-of-the-art on three different defeasible reasoning datasets. We find that the primary reason for these gains is that \ours models the question scenario with an inference graph and judiciously encodes the graph to answer the question. This result is significant because it shows that performance can be improved by guiding a system to “think about” a question and explicitly model the scenario, rather than answering reflexively.

Appendix A All results

Baseline \GEN \CORR \CORWF
\atomic 78.3 78.78 78.35 79.48
\snli 81.6 82.16 83.88 83.11
\social 86.2 86.75 87.75 87.24
average 82.03 82.56 83.32 83.27
Table 9: All results with the \str graph encoder.
\GEN \CORR \CORWF
\atomic 78.54 78.78 79.14
\snli 83.83 83.44 85.59
\social 88.23 88.21 88.62
average 83.53 83.48 84.5
Table 10: All results with the \moe graph encoder.
\GEN \CORR \CORWF
\atomic 78.25 78.9 78.85
\snli 82.63 83.56 82.36
\social 87.92 88.08 88.12
average 82.93 83.51 83.11
Table 11: All results with the \gcn graph encoder.
Figure 7: Schema of an inference graph.

Appendix B Graph-augmented defeasible reasoning algorithm

In Algorithm 1, we outline our graph-augmented defeasible learning process.

Given: A language model \lm, defeasible query with graph \dquesgra.
Result: Result for the query.
// Encode query
$\V{h}_{\V{Q}} \leftarrow$ query encoding (§3.2.1);
// Encode nodes of \gengraph
$\{\V{h}_{\V{v}}\}_{\V{v} \in \V{G}} \leftarrow$ Equation 1;
// MOE1: Combine nodes
$\V{h}_{\V{G}} \leftarrow$ Equation 3;
// MOE2: Combine $Q$, $G$
$\V{h}_{\V{y}} \leftarrow$ Equation 4;
return softmax(MLP($\V{h}_{\V{y}}$))
Algorithm 1 Graph-augmented defeasible reasoning using \moe.

Appendix C Description of \gcn

Figure 8: GCN

We now describe our adaptation of the method by \citet{lv2020graph} to pool \nodeset into \hg using \gcn.

Refining node representations

The representation for each node $v \in \V{V}$ is first initialized using:

$\V{h}_{\V{v}}^{0} = \V{W}^{0} \V{h}_{\V{v}}$

where $\V{h}_{\V{v}} \in \real{d}$ is the node representation returned by \lm, and $\V{W}^{0} \rdim{d \times k}$. This initial representation is then refined by running $L$ layers of a \gcn [kipf2016semi], where each layer $l+1$ is updated using representations from the $l^{th}$ layer as follows:

$\V{h}_{v}^{(l+1)} = \sigma\left(\frac{1}{|\V{A}(v)|}\sum_{w \in \V{A}(v)} \V{W}^{l} \V{h}_{w}^{l} + \V{W}^{l} \V{h}_{v}^{l}\right)$
$\V{H}^{L} = [\V{h}_{0}^{L}; \V{h}_{1}^{L}; \ldots; \V{h}_{|\V{V}|-1}^{L}] \quad (5)$

where $\sigma$ is a non-linear activation function, $\V{W}^{l} \rdim{k \times k}$ is the \gcn weight matrix for the $l^{th}$ layer, $\V{A}(v)$ is the list of neighbors of a vertex $v$, and $\V{H}^{L} \rdim{|\V{V}| \times k}$ is the matrix of the $L^{th}$-layer representations of the $|\V{V}|$ nodes, such that $\V{H}^{L}_{i} = \V{h}_{i}^{L}$.
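A minimal PyTorch sketch of this refinement step under our reading of Equation 5 (illustrative; the dense adjacency input and class names are assumptions):

```python
# Sketch of the GCN refinement in Eq. 5: each layer averages the neighbors'
# representations, adds a self term, applies a shared weight matrix W^l and a
# non-linearity. `adj` is a dense 0/1 adjacency matrix (an assumption).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, k: int):
        super().__init__()
        self.W = nn.Linear(k, k, bias=False)              # W^l, shared across nodes

    def forward(self, H, adj):                            # H: (|V|, k), adj: (|V|, |V|)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # |A(v)|
        neighbor_mean = adj @ H / deg                     # mean over neighbors
        return torch.relu(self.W(neighbor_mean) + self.W(H))

class GCNEncoder(nn.Module):
    def __init__(self, d: int, k: int, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(d, k, bias=False)           # W^0 initialization
        self.layers = nn.ModuleList([GCNLayer(k) for _ in range(n_layers)])

    def forward(self, h_V, adj):                          # h_V: (|V|, d) LM node embeddings
        H = self.proj(h_V)
        for layer in self.layers:
            H = layer(H, adj)
        return H                                          # H^L: (|V|, k)
```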

Learning graph representation

We use multi-headed attention [vaswani2017attention] to combine the query representation $\V{h}_{\V{Q}}$ and the node representations $\V{H}^{L}$ to learn a graph representation $\V{h}_{\V{G}}$. The multi-headed attention operation is defined as follows:

$\V{a}_{i} = \text{softmax}\left(\frac{(\V{W}^{q}_{i}\V{h}_{\V{Q}})(\V{W}^{k}_{i}\V{H}^{L})^{T}}{\sqrt{d}}\right)$
$\text{head}_{i} = \V{a}_{i}(\V{W}^{v}_{i}\V{H}^{L})$
$\V{h}_{\V{G}} = \text{Concat}(\text{head}_{1}, \ldots, \text{head}_{h})\V{W}^{O} = \text{MultiHead}(\V{h}_{\V{Q}}, \V{H}^{L}) \quad (7)$

where $h$ is the number of attention heads, $\V{W}^{q}_{i}, \V{W}^{k}_{i}, \V{W}^{v}_{i} \rdim{k \times d}$, and $\V{W}^{O} \rdim{hd \times d}$.

Finally, the graph representation generated by the MultiHead attention, $\V{h}_{\V{G}} \rdim{n}$, is concatenated with the question representation $\V{h}_{\V{Q}}$ to get the prediction:

$\hat{y} = \text{softmax}([\V{h}_{\V{G}}, \V{h}_{\V{Q}}]\V{W}_{out})$

where $\V{W}_{out} \rdim{d \times 2}$ is a single linear layer MLP.
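A rough PyTorch sketch of this attention pooling and classification step (illustrative only; it assumes the node and query representations share a common dimension and uses PyTorch's built-in multi-head attention rather than the exact projections above):

```python
# Sketch of Eq. 7 plus the final prediction: the query representation h_Q attends
# over the refined node matrix H^L; the pooled graph vector is concatenated with
# h_Q and passed through a linear layer with softmax.
import torch
import torch.nn as nn

class AttentionPoolClassifier(nn.Module):
    def __init__(self, d: int, n_heads: int = 4):        # d must be divisible by n_heads
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=n_heads, batch_first=True)
        self.out = nn.Linear(2 * d, 2)                    # W_out over [h_G; h_Q]

    def forward(self, h_Q, H_L):                          # h_Q: (batch, d), H_L: (batch, |V|, d)
        h_G, _ = self.attn(query=h_Q.unsqueeze(1), key=H_L, value=H_L)  # (batch, 1, d)
        h_G = h_G.squeeze(1)
        logits = self.out(torch.cat([h_G, h_Q], dim=-1))
        return torch.softmax(logits, dim=-1)              # predicted label distribution
```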

Appendix D Significance Tests

We perform two statistical tests to verify our results: i) the micro sign test (s-test) [yang1999re], and ii) McNemar's test [dror2018hitchhiker].
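A minimal sketch of how such paired tests can be run on per-example correctness vectors (illustrative; it assumes scipy and statsmodels rather than the exact scripts used here):

```python
# Sketch: compare two systems via per-example correctness. The micro sign test uses a
# binomial test on examples where the systems disagree; McNemar's test uses the 2x2
# disagreement table. `correct_a` and `correct_b` are boolean arrays of equal length.
import numpy as np
from scipy.stats import binomtest
from statsmodels.stats.contingency_tables import mcnemar

def paired_tests(correct_a: np.ndarray, correct_b: np.ndarray):
    b_only = int(np.sum(~correct_a & correct_b))          # only system B correct
    a_only = int(np.sum(correct_a & ~correct_b))          # only system A correct
    sign_p = binomtest(b_only, b_only + a_only, 0.5, alternative="greater").pvalue
    table = [[int(np.sum(correct_a & correct_b)), a_only],
             [b_only, int(np.sum(~correct_a & ~correct_b))]]
    mcnemar_p = mcnemar(table, exact=False, correction=True).pvalue
    return sign_p, mcnemar_p
```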

Dataset s-test McNemar’s test
\atomic 5.07e-05 1.1e-04
\snli 2.65e-05 6.5e-05
\social 1.4e-04 3.2e-04
Table 12: p-values for the three datasets and two different statistical tests when comparing the results with and without graphs (Table 2).
Dataset s-test McNemar’s test
\atomic 0.001 0.003676
\snli 0.01 0.026556
\social 0.06 0.146536
Table 13: p-values for the three datasets and two different statistical tests while comparing the results with noisy vs. cleaned graphs (Table 3).
\atomic \snli \social
\str 0.13 1.8e-06 8.7e-06
\gcn 0.006 1.31e-05 0.03
Table 14: p-values for the s-test for Table 5.
\atomic \snli \social
\str 0.28 4e-06 2e-05
\gcn 0.015127 3.2e-05 0.06
Table 15: p-values for the McNemar’s for Table 5.