
An Empirical Study of Causal Relation Extraction Transfer: Design and Data

Sydney Anuyah, Jack Vanschaik, Palak Jain, Sawyer Lehman, Sunandan Chakraborty. Indiana University, Indianapolis, Indiana, USA. sanuyah@iu.edu, jainpalak3286@gmail.com, sblehman@iu.edu, sunchak@iu.edu
Abstract.

We conduct an empirical analysis of neural network architectures and data transfer strategies for causal relation extraction. Through experiments with various contextual embedding layers and architectural components, we show that a relatively straightforward BioBERT-BiGRU relation extraction model generalizes better than other architectures across varying web-based sources and annotation strategies. Furthermore, we introduce a metric for evaluating transfer performance, $F1_{phrase}$, that emphasizes noun phrase localization rather than exact matching of target tags. Using this metric, we conduct data transfer experiments, ultimately revealing that augmentation with data from varying domains and annotation styles can improve performance. Data augmentation is especially beneficial when an adequate proportion of implicitly and explicitly causal sentences is included.

Causal Relation Extraction, Neural Networks, Transfer Learning, Natural Language Processing (NLP), BioBERT, Data Augmentation, $F1_{phrase}$ Metric, Open-Domain Information Extraction

1. Introduction

Causality is essential for understanding the relationships between variables, enabling us to identify actual cause-and-effect dynamics rather than mere correlations. This understanding is crucial in fields such as medicine (Bollegala et al., 2018), where identifying causal factors can lead to effective treatments, and finance (Mariko et al., 2021), where it informs decisions with wide-reaching impacts. In technology and AI, causality improves the reliability and interpretability of predictive models. Establishing causality leads to better outcomes across various domains, often requiring the extraction of causal relationships from free-text sources like scholarly articles and news reports, which is considered a key task in the field of Natural Language Processing (NLP).

Although the extraction of causal relationships may have direct applications to many downstream tasks (Zhou et al., 2024), a major goal of causal relation extraction is to construct knowledge graphs that can facilitate downstream uses. ConceptNet (Speer et al., 2017) and CauseNet (Heindorf et al., 2020) are examples of this; however, the rule-based extraction methods used for these datasets limit the type of causal relations present in the knowledge graph. CauseNet, for example, identified causal sentences in Wikipedia and ClueWeb12 with syntactic templates, then used a neural sequence tagger to label causes and effects in the identified sentences. Though this method created the largest available causal knowledge base, it could be missing many implicit causal relations. In this sentence from CauseNet, "Insomnia is often caused by fear, stress, and anxiety," the relationship is made explicit by the marker "caused". Other markers (e.g., "leads to", "associates", "trigger") comprise the source sentences underlying CauseNet. By contrast, take "The clock struck twelve with a loud chime that made me jump." from (UzZaman et al., 2013). Causal semantics are present, but the lack of a clear marker means such sentences would have been excluded from CauseNet. This under-representation of implicit sentences could bias knowledge graphs and knowledge bases when rule-based extraction is used on a large scale.

We have yet to see a large-scale causal knowledge graph extracted using both implicit and explicit source sentences. This paper addresses the root cause of that problem: causal relation extraction models are trained on disparate benchmark datasets that vary significantly in lexical composition and annotation style. Recent efforts have employed increasingly complicated cause/effect/relation annotation schemes (Li et al., 2021), which produce over-specified models that are not well suited for open-domain extraction. To address this, our analysis examines transfer across six unique causal relation datasets that span varying domains, annotation styles, and implicit/explicit causality makeups. An empirical analysis of architectural components, including choice of embedding, recurrent unit, and usage of conditional random fields (CRF), is conducted. In particular, we find that transformer-based contextual word embeddings (Devlin et al., 2018) with specialized pre-training corpora, such as BioBERT (Lee et al., 2020) and RoBERTa (Liu et al., 2019), work best across all six target datasets. Experiments further reveal that, across domains, components common in relation extraction architectures, namely the BiLSTM and the Conditional Random Field (CRF) (Li et al., 2021), are perhaps unnecessary for an open-domain causal relation extraction model, as they barely improve performance while costing speed and generalizability compared with a simpler BiGRU. Our current focus is on extracting intra-sentence causal relations, as they are easier to identify and evaluate using existing benchmarks, providing a foundation for more complex tasks. While newer models like Mistral 7B, Llama, and GPT may offer better performance, this work is limited to BERT and Flair embeddings due to their balance of computational efficiency and robust performance in domain-specific tasks. These embeddings are also well-studied and integrated into existing NLP pipelines, making them a practical choice for understanding causality.

The introduction of our $F1_{phrase}$ metric allows for a fair comparison of transfer performance by recognizing the target of causal relation extraction as the cause/effect phrase subtrees of the dependency tree. The application of this metric reveals that many target datasets can benefit from data augmentation, regardless of domain and annotation style. Using $F1_{phrase}$ and the best transfer models, we further see that improved transfer performance relates to the implicit/explicit sentence makeup of a dataset and its noun phrase coverage, rather than training data size. Causal relation extraction from web sources is a challenging task, due to noise in the open domain and the lack of consistency across viable training datasets. Through the lens of phrasal evaluation, our results imply that models like BioBERT-BiGRU are particularly well suited to causal relation extraction in this setting, since they can best incorporate disparate data domains and annotation styles. Large causal knowledge bases have relied on rule-based methods for localizing causal relations (Heindorf et al., 2020), creating a bias for sentences with explicit expressions of causality. This work shows that BioBERT-BiGRU, trained on a collection of disparate data and evaluated with $F1_{phrase}$, offers a path towards richer extraction of causal knowledge from noisy web-based sources. Our work contributes:

  (1) An empirical evaluation of causal relation extraction models across a combination of datasets, showing the strong performance of specially pre-trained BERT embeddings in the transfer setting, as well as the limited benefit of BiLSTM and CRF layers.

  (2) The $F1_{phrase}$ metric, which facilitates the empirical transfer evaluation.

  (3) An additional qualitative analysis to determine what data characteristics contribute to transfer performance. We show that the proportion of implicit causal sentences in the data improves transfer more than merely increasing dataset size. Our code files and supplementary materials will be made available after the review process.

2. Related Work

An early example of machine learning applied to causal extraction is Blanco et al. (2008), later modified in Zhang et al. (2023), where Bagging and Decision Trees were used to classify cause-and-effect information. A study by Egami et al. (2018) provides a conceptual framework for text-based causal inference, generalizing a codebook function to fit several models and linking higher-dimensional text to lower-dimensional representations for making inferences and estimating treatment effects. While these classical machine learning methods are tried-and-true, they cannot learn incrementally in real time and require supervision, limiting their usefulness in real-world applications.

Deep learning has driven recent advances in causal extraction. For instance, Roemmele and Gordon (2018) used the Choice of Plausible Alternatives (COPA) framework to predict causally related events in stories. CausalTriad, a minimally supervised method for determining causality that relies on discourse connectives and focused distributional similarity techniques, was created by Zhao et al. (2018). Transferred embeddings have been employed in more recent research, such as the BiLSTM-CRF model developed by Li et al. (2021); their tests revealed that models with transferred embeddings (Flair, ELMo, and BERT) performed better than those without. As demonstrated by the BERT architecture utilized by Khetan et al. (2022), transformer-based architectures have also excelled in causal extraction: training on data other than the target data provided results that were equivalent to or better than training and testing on the target data. We distinguish our work by using transferred embeddings across many architectures and testing a larger number of datasets. Our study also evaluates general-purpose architectures without assuming domain-specific information.

Not much work has investigated transferable causal relation extraction models specifically. Moghimifar et al. (2020) combined a graph convolutional network for encoding dependency trees (GCE) with an adversarial domain adaptive module (ACE) for cross-domain causality classification and relation extraction. This approach is a step towards general-purpose, large-scale causal relation extraction, showing increased transfer performance across domains. However, it requires the model to explicitly learn different domains, which might not always be available in open-domain causal relation extraction.

Table 1. Macro-averaged F1 Scores for 10 iterations of Embedding Choice Experiments
Dataset BERTBase BioBERT RoBERTa FlairMulti FlairNews FlairFineTuned
CausalTimeBank 0.636 0.590 0.600 0.567 0.543 0.573
CauseNet-cause 0.847 0.848 0.841 0.843 0.805 0.826
CauseNet-noncause 0.833 0.835 0.828 0.826 0.797 0.814
FinCausal2020 0.598 0.596 0.580 0.558 0.561 0.579
MedCaus 0.773 0.777 0.778 0.753 0.752 0.756
SemEval 2018 0.853 0.850 0.861 0.820 0.783 0.822
Table 2. Macro-averaged Test F1 Scores for 10 iterations of RNN Experiments and Hypothesis Testing Results
Dataset BaseGRU BaseLSTM BioBERTGRU BioBERTLSTM RoBERTaGRU RoBERTaLSTM
CausalTimeBank 0.636 0.645 (FTR) 0.590 0.609 (BiLSTM) 0.600 0.605 (FTR)
CauseNet-cause 0.847 0.846 (FTR) 0.848 0.843 (BiGRU) 0.841 0.836 (BiGRU)
CauseNet-noncause 0.833 0.833 (FTR) 0.835 0.833 (FTR) 0.828 0.822 (BiGRU)
FinCausal2020 0.598 0.601 (FTR) 0.596 0.594 (FTR) 0.580 0.578 (FTR)
MedCaus 0.773 0.770 (BiGRU) 0.777 0.773 (BiGRU) 0.778 0.773 (BiGRU)
SemEval 2018 0.853 0.844 (BiGRU) 0.850 0.838 (BiGRU) 0.861 0.848 (BiGRU)

3. Datasets

We focus on model design and data augmentation for sentence-level causal relation extraction, since most recent efforts in mining causal knowledge from the web operate at the sentence level (Moghimifar et al., 2020; Heindorf et al., 2020). Unlike previous work, however, we distinguish sentences that express causality implicitly or explicitly and make sure to analyze both. The results from sentence-level explicit causal relation extraction could inform future work on multi-sentence and even document-level causal relation extraction.

Causal-TimeBank (Mirza et al., 2014) consists of causal annotations of the TempEval-3 corpus (UzZaman et al., 2013), which consists of news articles from the web. For uniformity across other datasets, we only consider sentence-level relations, although Causal-TimeBank also contains document-level relations.

CauseNet (Heindorf et al., 2020) is a large graph of explicit causal relations and supporting sentences from ClueWeb12 and Wikipedia, with a high-precision subset that we utilized. For our purposes, we subsampled a collection of 5,000 sentences containing explicit markers such as "cause," "caused," and "causing" (CauseNet-cause). Another subsample from CauseNet includes 5,000 sentences without variants of the "cause" marker (CauseNet-noncause) but with other explicit causal markers like "leads to" and "due to."

The FinCausal2020 dataset (Mariko et al., 2021) is a benchmark for the detection and extraction of causal relations in financial text. It was mined from financial news articles from various economics and finance websites. For our purposes, FinCausal was limited to relations contained in single sentences.

MedCaus (Moghimifar et al., 2020) is a dataset consisting of causal sentences mined from "medical articles" in Wikipedia that matched certain seed patterns. While we found that many sentences in this dataset are medical or biological, some general sentences (e.g., "The eastern water is saltier because of its proximity to Mediterranean Water") seem to be captured as well, so we have labeled it as a "General" domain dataset.

SemEval 2010 Task 8 (Hendrickx et al., 2019) is a multi-way classification dataset. It has been widely used as a general-domain benchmark for evaluating relation extraction tasks. The causal relation extraction literature has particularly focused on the Cause-Effect relations in this data, which represent 12.4% of the entire dataset. We use only the Cause-Effect relations for our analysis.

4. Experiments on Model Architecture

A first step towards open-domain, large-scale causal relation extraction is finding a model that works well across different datasets. We explore this through experiments on embeddings, recurrent units, and conditional random fields (CRFs), analyzing components used by high-performing models like SCITE (Li et al., 2021). Since most models are developed on a single dataset, comparing them in the transfer setting can be misleading. For example, SCITE’s (Li et al., 2021) annotation scheme is specific to SemEval, so some unifying simplifications should be made across datasets for fair comparison. Yang et al. (2022) analyze recent model performance outside the transfer setting. Our experiments cover these models using combinations of tested components (e.g., SCITE is Flair-BiLSTM-CRF).

We are focused on extracting causal linkages at the sentence level, emphasizing intra-sentence relationships. "Single sentence-level" extraction, as described in this paper, involves locating and identifying causal linkages within one sentence. Our reasoning is that intra-sentence relations are labeled consistently across the various datasets, which justifies this focus given our aim of evaluating models on multiple datasets. Furthermore, this allows us to take advantage of robust assessment measures and available annotated datasets to systematically improve our models before expanding our scope to more complex inter-sentence causal relations, which involve identifying causal connections spanning multiple sentences and possibly entire documents.

Sentences for all datasets were tokenized at the word level (Akbik et al., 2019), dependency parsed (Honnibal and Montani, 2017), and each token was labeled as C (cause), E (effect), or O (other). Each dataset was split into 70%-30% train-test subsets. All models used an embedding layer, followed by a recurrent NN (RNN) layer, and either a CRF or linear layer for final prediction. Cross-entropy was used as the loss function. The RNN hidden size was fixed at 256 (empirically determined; see Table 7 in the Appendix). Early stopping was used in training with a minimum of ten epochs. For Sections 5 and 6, twelve training epochs were used, as this was found to be consistently sufficient. An Adam optimizer with a learning rate of 0.001 and a minibatch size of 16 was used.
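
As a concrete reference for this setup, the sketch below shows a minimal PyTorch version of the embedding → BiGRU → linear tagger with the hyperparameters listed above. The class and function names are ours, the token embeddings are assumed to be produced upstream (e.g., by BioBERT), and this is an illustrative sketch under those assumptions rather than the authors' released code.

```python
# Minimal sketch of the sequence tagger described above, assuming token
# embeddings (e.g., from BioBERT, dimension 768) are computed upstream.
# Hyperparameters follow the paper: hidden size 256, Adam, lr 0.001,
# cross-entropy over the C/E/O tag set, minibatch size 16.
import torch
import torch.nn as nn

TAGS = {"C": 0, "E": 1, "O": 2}

class CausalTagger(nn.Module):
    def __init__(self, embed_dim=768, hidden=256, num_tags=len(TAGS)):
        super().__init__()
        # bidirectional GRU, so the output dimension is 2 * hidden
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)  # linear decoder (no CRF)

    def forward(self, embeddings):          # (batch, seq_len, embed_dim)
        out, _ = self.rnn(embeddings)       # (batch, seq_len, 2 * hidden)
        return self.classifier(out)         # (batch, seq_len, num_tags)

model = CausalTagger()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def training_step(batch_embeddings, batch_tags):
    """One minibatch update; batch_tags holds C/E/O indices per token."""
    logits = model(batch_embeddings)
    loss = loss_fn(logits.reshape(-1, len(TAGS)), batch_tags.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```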

4.1. Embedding Input

The trend in causal relation extraction models is to encode tokens as vectors using contextual word embeddings like BERT and Flair (Li et al., 2021). However, an empirical analysis of the ideal embedding choice has not been conducted. We compare base BERT (Devlin et al., 2018), BioBERT (Lee et al., 2020), and RoBERTa (Liu et al., 2019) in the BERT category, and three Flair embeddings (Akbik et al., 2018): one trained on general-purpose data, one trained on news data, and one fine-tuned on CauseNet. Table 1 shows that BERT embeddings, specifically BioBERT, performed the best overall. For all tables, each experiment was repeated 10 times with different seed values. RoBERTa outperformed BioBERT on the SemEval 2018 dataset, and base BERT outperformed BioBERT on the CausalTimeBank dataset. Additionally, a p-value analysis over the 10 iterations showed no significant difference between BioBERT and RoBERTa embeddings on MedCaus, nor between BioBERT and base BERT on FinCausal2020. BioBERT's strong performance can be attributed to its training on a vast biomedical corpus, enabling it to capture scientific language nuances more effectively than its counterparts (Lee et al., 2020); notably, BioBERT remained competitive even on the non-biomedical FinCausal2020 data.
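
The embedding variants in Table 1 can be instantiated through the Flair framework cited above (Akbik et al., 2019); the sketch below illustrates one plausible setup. The specific HuggingFace model identifiers and the commented-out fine-tuned checkpoint path are assumptions for illustration, not the exact configurations used in the experiments.

```python
# Sketch of the embedding variants compared in Table 1, using the Flair
# library. Model identifiers are illustrative assumptions; the fine-tuned
# Flair embedding (on CauseNet) would be a local checkpoint.
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings, FlairEmbeddings, StackedEmbeddings

embedding_variants = {
    "BERTBase":   TransformerWordEmbeddings("bert-base-uncased"),
    "BioBERT":    TransformerWordEmbeddings("dmis-lab/biobert-v1.1"),
    "RoBERTa":    TransformerWordEmbeddings("roberta-base"),
    "FlairMulti": StackedEmbeddings([FlairEmbeddings("multi-forward"),
                                     FlairEmbeddings("multi-backward")]),
    "FlairNews":  StackedEmbeddings([FlairEmbeddings("news-forward"),
                                     FlairEmbeddings("news-backward")]),
    # "FlairFineTuned": FlairEmbeddings("path/to/causenet-finetuned.pt"),  # hypothetical path
}

sentence = Sentence("Insomnia is often caused by fear, stress, and anxiety.")
embedding_variants["BioBERT"].embed(sentence)
token_vectors = [token.embedding for token in sentence]  # one vector per token, fed to the BiGRU
```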

Table 3. Macro-averaged F1 Scores for 10 iterations of Relation Extraction Models with and without CRF, and Hypothesis Testing Results
Dataset Base Base+CRF BioBERT BioBERT+CRF RoBERTa RoBERTa+CRF
CausalTimeBank 0.636 0.636 (FTR) 0.590 0.597 (FTR) 0.600 0.591 (NoCRF)
CauseNet-cause 0.847 0.847 (FTR) 0.848 0.849 (FTR) 0.841 0.842 (FTR)
CauseNet-noncause 0.833 0.836 (CRF) 0.835 0.837 (FTR) 0.828 0.829 (FTR)
FinCausal2020 0.598 0.591 (FTR) 0.596 0.587 (FTR) 0.580 0.578 (FTR)
MedCaus 0.773 0.771 (FTR) 0.777 0.776 (FTR) 0.778 0.777 (FTR)
SemEval 2018 0.853 0.850 (FTR) 0.850 0.849 (FTR) 0.861 0.862 (FTR)

4.2. Recurrent Unit

Many successful causal relation extraction models rely on LSTM (Hochreiter and Schmidhuber, 1997) recurrent units, particularly BiLSTMs (Li et al., 2021), to encode sequential information and parse causal relations in both text directions. Despite the common use of BiLSTMs, the optimal recurrent unit, especially compared to BiGRUs (Cho et al., 2014), is underexplored. Our experiment addressed this by keeping all hyperparameters constant except for the recurrent unit. The results, detailed in Table 2, indicate that BiGRUs generally match or outperform BiLSTMs. The analysis was done at a 5% significance level over 10 iterations; FTR means Fail to Reject, i.e., no significant difference between the two units. This suggests that while the choice of recurrent unit only slightly impacts overall performance, BiGRUs' efficiency and reduced complexity, with fewer parameters and gates, offer subtle advantages in processing causal relations. Our analysis, supported by p-value assessments across various datasets, reveals no significant performance difference between the BioBERT, RoBERTa, and base BERT embeddings in most cases. These findings align with literature suggesting that advanced contextual embeddings, such as those from BERT and Flair, proficiently capture long-term dependencies, potentially diminishing the impact of additional LSTM gates. This empirical evidence challenges the assumption of a one-size-fits-all approach to recurrent units in causal relation extraction, highlighting the importance of embedding choice and its compatibility with the model architecture.
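
The hypothesis-test annotations in Table 2 ((FTR), (BiGRU), (BiLSTM)) can be reproduced with a paired test over the 10 per-seed F1 scores. The sketch below assumes a two-sided paired t-test at the 5% level; this is our illustrative reading of the procedure, with made-up scores, not a statement of the exact test implementation used.

```python
# Sketch of the significance testing behind Table 2's annotations,
# assuming a paired two-sided t-test over 10 matched per-seed F1 scores.
# "FTR" = fail to reject the null hypothesis of equal performance at 5%.
from scipy import stats

def compare_units(f1_a, f1_b, label_a="BiGRU", label_b="BiLSTM", alpha=0.05):
    """f1_a, f1_b: lists of 10 F1 scores from matched random seeds."""
    t_stat, p_value = stats.ttest_rel(f1_a, f1_b)
    if p_value >= alpha:
        return "FTR", p_value
    winner = label_a if sum(f1_a) > sum(f1_b) else label_b
    return winner, p_value

# Illustrative (made-up) per-seed scores for one dataset:
verdict, p = compare_units(
    [0.852, 0.861, 0.848, 0.855, 0.859, 0.850, 0.847, 0.856, 0.853, 0.851],
    [0.845, 0.852, 0.840, 0.846, 0.851, 0.842, 0.839, 0.848, 0.844, 0.843],
)
print(verdict, round(p, 4))
```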

4.3. Conditional Random Fields

Conditional random fields (CRF) (Lafferty et al., 2001) are probabilistic graphical models that can improve segmentation when integrated with neural sequence classifiers (Zheng et al., 2015). While some causal relation extraction models have successfully used CRFs (Li et al., 2021), others have opted not to (Moghimifar et al., 2020). Some evidence suggests that a well-encoded NN may not need a CRF layer (Lample et al., 2016). We conducted experiments to test the utility of CRFs across different causal relation extraction datasets.

A comparison of models with and without CRF is shown in Table 3, along with the corresponding p-value analysis. FTR means Fail to Reject, i.e., there was no significant difference between the CRF and no-CRF experiments. The p-value analysis indicated no significant difference across all datasets, especially for the BioBERT embedding. Only in the case of CauseNet-noncause did we see a significant difference at the 5% level, favoring the CRF when using the BaseBERT embedding, but overall the BioBERT embedding still performed better. Given this, and the potential for over-fitting in open-domain settings due to varying entity lengths, we excluded the CRF from further analysis.
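
For reference, the CRF variant tested here simply swaps the linear decoder's cross-entropy loss for a CRF likelihood over the emission scores. The sketch below uses the third-party pytorch-crf package as one plausible implementation; the package choice and class names are our assumptions, not the authors' code.

```python
# Sketch of adding an optional CRF decoding layer on top of the BiGRU
# emissions, using the third-party pytorch-crf package (pip install pytorch-crf).
# This is an assumed tooling choice for illustration.
import torch.nn as nn
from torchcrf import CRF

class CausalTaggerCRF(nn.Module):
    def __init__(self, embed_dim=768, hidden=256, num_tags=3):
        super().__init__()
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, embeddings, tags, mask=None):
        out, _ = self.rnn(embeddings)
        # negative log-likelihood of the gold tag sequence under the CRF
        return -self.crf(self.emissions(out), tags, mask=mask)

    def decode(self, embeddings, mask=None):
        out, _ = self.rnn(embeddings)
        # Viterbi decoding: list of best tag-index sequences, one per sentence
        return self.crf.decode(self.emissions(out), mask=mask)
```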

5. Comparing Transfer Performance

5.1. Phrasal Loss: $F1_{phrase}$

Figure 1. The $F1_{phrase}$ metric measures a model's ability to locate these noun phrases. If a candidate model were to predict "hammer" as the only cause token, the entire phrase "water hammer pressure" would be counted as entirely correct, because the phrase can be completely recovered from "hammer" using the dependency tree.

A roadblock in causal transfer is the lack of annotation consistency across datasets. For example, SemEval consists of mostly single-word cause/effect annotations, with very few multi-word causes and effects. MedCaus has lengthier cause/effect phrases that span more words on average. On the extreme end, FinCausal annotations mark most tokens in a sentence as C or E, with very few O. This makes transfer difficult, as the objective varies across target data. Naturally, we would expect a model trained on single-word annotations to identify only single words in target data. Since root nouns can be recovered with dependency parsing, penalizing this behavior is excessively harsh. To remedy this, we introduce $F1_{phrase}$, defined as follows.

$$F1_{phrase}(\tilde{\mathbf{y}},\mathbf{y}) = F1\big(\lambda(\tilde{\mathbf{y}}),\lambda(\mathbf{y})\big) = \frac{2}{\mathrm{precision}\big(\lambda(\tilde{\mathbf{y}}),\lambda(\mathbf{y})\big)^{-1} + \mathrm{recall}\big(\lambda(\tilde{\mathbf{y}}),\lambda(\mathbf{y})\big)^{-1}}$$

where $\mathbf{y}=(y_{1},y_{2},\ldots,y_{t})$ is a target sequence of token labels and $\tilde{\mathbf{y}}=(\tilde{y}_{1},\tilde{y}_{2},\ldots,\tilde{y}_{t})$ is the predicted sequence of token labels. The function $\lambda$ is an element-wise function with oracle knowledge of cause/effect phrases, defined as follows:

$$\lambda(z_{i})=\begin{cases}z_{i}&\text{if }z_{i}\text{ is not in a cause/effect phrase}\\ \mathrm{parent}(z_{i})&\text{if }z_{i}\text{ is in a cause/effect phrase}\end{cases}\qquad(1)$$

In our case, the $\mathrm{parent}(z_{i})$ function is based on dependency parsing, considering the "compound" or "amod" dependency relations (Honnibal and Montani, 2017). This is, in essence, the F1 score modified so that if any token in a cause/effect phrase is identified correctly, then that entire phrase is counted as correct. This is acceptable because, after relations are mined, root nouns can be recovered from noun phrases (and vice versa) via dependency parsing, as illustrated in Figure 1.
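
To make the metric concrete, the sketch below shows one way to approximate $F1_{phrase}$ with spaCy dependency parses and scikit-learn's F1: a correct prediction on any token of a gold cause/effect phrase is credited to the whole phrase, where phrases are grown over "compound"/"amod" links. The helper names are ours, the gold tags are assumed to align with spaCy's tokenization, and this is a simplified approximation of the oracle $\lambda$ rather than the authors' exact scorer.

```python
# Simplified approximation of F1_phrase: predicted labels are credited to
# the whole gold cause/effect phrase (token plus its compound/amod children)
# if any token in the phrase is tagged correctly, then a standard F1 is
# computed over the adjusted labels. Illustrative only.
import spacy
from sklearn.metrics import f1_score

nlp = spacy.load("en_core_web_sm")

def phrase_tokens(token):
    """The token plus its compound/amod children (its noun-phrase core)."""
    return {token.i} | {c.i for c in token.children if c.dep_ in ("compound", "amod")}

def f1_phrase(sentence, gold_tags, pred_tags):
    doc = nlp(sentence)  # gold_tags/pred_tags assumed aligned with this tokenization
    adjusted = list(pred_tags)
    for token in doc:
        gold = gold_tags[token.i]
        if gold in ("C", "E"):
            phrase = phrase_tokens(token)
            # credit the whole phrase if any of its tokens was tagged correctly
            if any(pred_tags[i] == gold for i in phrase):
                for i in phrase:
                    adjusted[i] = gold
    return f1_score(gold_tags, adjusted, labels=["C", "E"], average="macro")

# Illustrative sentence in the spirit of Figure 1: only "pressure" is
# predicted as a cause token, yet the full phrase receives credit.
sent = "Water hammer pressure caused the pipe to burst"
gold = ["C", "C", "C", "O", "O", "E", "O", "O"]
pred = ["O", "O", "C", "O", "O", "E", "O", "O"]
score = f1_phrase(sent, gold, pred)
```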

To illustrate the utility of $F1_{phrase}$ for evaluating transfer performance, consider the specific case of transferring from CauseNet to SemEval. CauseNet has a longer mean entity length (cause: 1.6 tokens, effect: 1.5 tokens) than SemEval (cause: 1.06 tokens, effect: 1.02 tokens). Furthermore, the part-of-speech distributions vary between the two datasets with respect to token labels: CauseNet has disproportionately more adjectives than SemEval. Training a RoBERTa-based sequence classifier on CauseNet and evaluating on SemEval yielded an unimpressive F1 score of 67.4% (precision 0.632, recall 0.721). However, this is an unfair evaluation. In the case of the sentence in Figure 1, the model would likely predict "pressure" as the only cause token, but such a prediction is in essence correct. $F1_{phrase}$ gives a more honest score of 83.4% (precision 0.978, recall 0.727).

Table 4. Comparison of F1 Phrase Score and F1 Score across Datasets
Dataset F1 Phrase Score F1 Score % Change
CausalTimeBank 0.719 0.590 21.86%
CauseNet_cause 0.874 0.848 3.07%
CauseNet_noncause 0.877 0.835 5.03%
FinCausal2020 0.844 0.596 41.61%
MedCaus 0.985 0.777 26.77%
SemEval_2018 0.921 0.850 8.35%

In Table 4, the F1 score is compared with $F1_{phrase}$. CausalTimeBank shows a 21.86% improvement, highlighting the challenge of capturing implicit causality within temporal and event-driven texts, where understanding context is key. In contrast, CauseNet-cause exhibits a smaller 3.07% change, reflecting its focus on explicit causal relationships that are more straightforward to identify. CauseNet-noncause, with a 5.03% change, reveals the difficulty of distinguishing causal from non-causal associations, despite consisting of relatively simple sentences. FinCausal2020, with a substantial 41.61% change, underscores the complexity of identifying causality in financial texts, where implicit relationships are often embedded in technical language. MedCaus, showing a 26.77% change, reflects the mix of explicit and implicit causality in medical texts, where complex terms and conditional statements are common. Lastly, SemEval 2018 demonstrated an 8.35% change, reflecting the diverse nature of its text types, which require a nuanced understanding of various causal expressions.

5.2. Pairwise transfer and Combined Data Transfer

Pairwise transfer was conducted by training a BioBERT-BiGRU model on one dataset and then evaluating it on another dataset's test set, creating the matrix of transfer performance values shown in Table 5. Each value is averaged over 10 iterations. Baseline model performance was also assessed, as shown on the main diagonal of Table 5. $F1_{phrase}$ was used for all evaluations, calculated using dependency trees generated with spaCy (Honnibal and Montani, 2017). Results in Table 5 show that transfer from MedCaus outperforms training on the test set's own data for the CauseNet-cause and CausalTimeBank datasets; for the CauseNet-noncause dataset, a p-value analysis shows no significant difference. Notably, MedCaus, the largest dataset studied, has longer cause/effect phrase annotations. We investigate the impact of transfer data size in the following section. Longer annotations seem advantageous when using $F1_{phrase}$, which focuses on noun phrase localization. Thus, a good approach for general-purpose causal relation extraction models could be training on datasets with longer annotations. If only the root nouns are needed after extraction, they can be recovered with dependency parsing.

Dataset CausalTimeBank CauseNet_cause CauseNet_noncause FinCausal2020 MedCaus SemEval_2018
CausalTimeBank 0.719 0.445 0.509 0.374 0.534 0.569
CauseNet_cause 0.446 0.874 0.790 0.244 0.666 0.809
CauseNet_noncause 0.507 0.802 0.877 0.375 0.693 0.796
FinCausal2020 0.809 0.767 0.771 0.844 0.813 0.778
MedCaus 0.847 0.881 0.874 0.755 0.985 0.881
SemEval_2018 0.498 0.731 0.703 0.350 0.624 0.921
Table 5. Matrix of $F1_{phrase}$ scores for transferred relation extraction models; rows correspond to the training data and columns to the test set used. For the main diagonal, the training data is split from the same dataset; otherwise, the entire training dataset is used. The scores shown in bold correspond to transferred training datasets that outperformed the test set's own training data.

We should also examine whether including multiple datasets from different domains improves or detracts from transfer performance. For example, if SemEval was the target dataset, then SemEval's training data plus the entirety of all other datasets was used for training (cf. Table 5).

In all cases, including additional datasets consistently improved performance above the baseline. For example, the CauseNet-cause dataset saw an increase in $F1_{phrase}$ from 0.874 to 0.882, a modest improvement of 0.92%. Similarly, MedCaus improved from 0.985 to 0.995, marking a 1.01% increase. CauseNet-noncause also benefited, with an improvement from 0.877 to 0.886, translating to a 1.03% increase. Notably, CausalTimeBank and FinCausal2020 showed substantial improvements, with $F1_{phrase}$ scores increasing from 0.719 to 0.747 (3.89%) and from 0.844 to 0.883 (4.62%), respectively. This is particularly notable given that both datasets differ markedly in annotation length from most of the data used for training. Even though CausalTimeBank is the smallest dataset, it achieved the second-largest improvement. SemEval 2018 also saw a boost in performance, with the $F1_{phrase}$ score increasing from 0.921 to 0.929, a 0.87% improvement. These results indicate that regardless of the dataset's lexical composition, domain, or size, incorporating a more varied training dataset leads to better causal relation extraction.

6. Qualitative Analysis

The larger size of the MedCaus dataset, which contains 8,682 sentences, may be a contributing factor to its effective transfer. To explore this hypothesis, training was conducted using progressively larger subsets of MedCaus, starting with 10% and increasing in increments of 10% up to the full dataset. The outcomes of these trials suggest that expanding the training subset beyond 40% of the MedCaus dataset does not significantly enhance the $F1_{phrase}$ score for the majority of datasets, with the notable exception of the CausalTimeBank dataset. Additionally, there appears to be no negative impact from using larger amounts of data; however, it is a superfluous and potentially inefficient use of computational resources, considering the plateau in average performance observed when training with 40% to 60% of the data. This indicates that simply increasing the size of the dataset used for transfer does not necessarily correlate with better model performance.

Another consideration for causal transfer is the composition of implicitly and explicitly causal sentences in the training data (Chen, 2023). While explicit relations like those in CauseNet are more readily available, a relation extraction model trained on explicit relations alone might merely learn the syntactic templates that generated CauseNet, and therefore fail to generalize to implicit relations. To test this, the BioBERT-BiGRU model was trained on implicit and explicit subsets of MedCaus of increasing size, then tested on SemEval, Causal-TimeBank, and CauseNet. Training with only explicit sentences from MedCaus leads to better performance on the CauseNet datasets, since these are all based on explicit syntactic templates. For SemEval, equal-sized subsets of explicit and implicit sentences provided roughly equal transfer performance. For CausalTimeBank, training with implicit sentences was better, likely because CausalTimeBank is over half implicit (54.7%, versus SemEval's 34%). Thus, the implicit/explicit composition of the training data should reflect the composition of the target data in the transfer setting.
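
As a rough illustration of how such implicit/explicit subsets can be constructed, the sketch below flags a sentence as explicitly causal if it contains a surface causal marker. The marker list is illustrative only; the paper's working definition of "implicit" (Table 6) is a sentence that does not match CauseNet's syntactic patterns, which is stricter than this simple keyword check.

```python
# Rough sketch of splitting a corpus into explicitly and implicitly causal
# sentences by checking for surface causal markers. The marker list is
# illustrative and not the CauseNet pattern set used to define "implicit"
# in the paper.
import re

EXPLICIT_MARKERS = [
    r"\bcaus(e|es|ed|ing)\b", r"\bleads? to\b", r"\bled to\b",
    r"\bdue to\b", r"\bresults? in\b", r"\btriggers?\b",
]

def is_explicit(sentence: str) -> bool:
    return any(re.search(pattern, sentence.lower()) for pattern in EXPLICIT_MARKERS)

sentences = [
    "Insomnia is often caused by fear, stress, and anxiety.",        # explicit marker
    "The clock struck twelve with a loud chime that made me jump.",  # implicit causality
]
explicit_subset = [s for s in sentences if is_explicit(s)]
implicit_subset = [s for s in sentences if not is_explicit(s)]
```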

7. Discussion

Our experiments show that a BioBERT-BiGRU sequence tagger is a strong candidate for large-scale causal relation extraction in the open domain. BioBERT's domain-specific masked language model pretraining appears to enhance model performance, suggesting that masking strategies targeted at causal phrases are worth investigating. The value of causality-specific pretraining and fine-tuning, demonstrated by Khetan et al. (2022) and Li et al. (2021), is further supported by our results. Future research could explore pretraining for open-domain relation extraction.

While our current study focuses on intra-sentence causal relation extraction, this approach allows us to fine-tune our models and metrics in a more manageable context. Addressing intra-sentence relations first ensures that our methods are well-grounded and effective within simpler boundaries. This step-by-step approach is crucial for developing the sophisticated techniques required for inter-sentence causal relation extraction, which will also be the focus of future research.

The $F1_{phrase}$ metric reveals that combining disparate data from varying annotation schemes and domains can significantly improve transfer performance in causal relation extraction. No single dataset is likely to be the key to large-scale causal relation extraction, but combining diverse data has proven beneficial. Future work could benefit from combining even more causal datasets than we used here. Our results from Section 6 show that data size matters up to a point, but increasing training data beyond a few thousand sentences does not always improve transfer performance. Instead, attention should be given to the composition of implicit and explicit sentences in the dataset.

8. Limitations

This study did not consider advanced large language models like GPT, Mistral-7B, and Llama. Although these recent models may offer enhanced performance, our study is confined to BERT and Flair embeddings, which provide a practical balance between computational efficiency and robust performance in domain-specific tasks. Moreover, BERT and Flair embeddings are well-documented and widely integrated into existing NLP pipelines, making them a reliable choice for investigating causality in this context.

Also, this study was limited to sentence-level causal relation extraction, even though inter-sentence and document-level relations offer a rich source of causal information for open-domain applications. While some of the datasets used, such as FinCausal and CausalTimeBank, do contain inter-sentence relations, we limited our analysis to sentence-level relations for compatibility with the other datasets.

Although this study includes a range of datasets with varying domains and annotation styles, the selection may still not fully capture the diversity of causal relations present in real-world texts. Additional datasets from different domains and with different annotation styles could further validate the findings. The models and methods were primarily tested on specific benchmark datasets; while efforts were made to ensure transferability, generalizability to other, non-tested datasets and domains cannot be assumed and therefore remains uncertain.

While our goal was to use a simple model like BioBERT-BiGRU, which showed promising results, this study did not explore more advanced architectures that might offer improved performance or interpretability. Future work could explore the trade-offs between model complexity and performance in more detail.

9. Conclusion

We conducted an empirical study of model design and data augmentation strategies for building causal relation extraction models that transfer effectively. We showed that BioBERT embeddings paired with a relatively simple BiGRU form a strong base causal relation extraction model across varying datasets. To better facilitate data transfer experiments, we introduced $F1_{phrase}$, a variant of the F1 measure that prioritizes noun phrase localization. This revealed that longer annotations transfer well, cross-domain data augmentation is consistently beneficial, and the implicit/explicit composition of a dataset matters more than its size for transfer performance.

10. Ethical Considerations

  (1) All data used in this experiment is open-source, and most datasets are licensed under Creative Commons; the others have their own individual licenses.

  (2) Only non-sensitive data was used during this research, so no privacy violations were made.

  (3) Methodologies and algorithms were clearly documented to enable transparency, reproducibility, and accountability.

  (4) For each of the datasets, we followed the individual terms of usage.

References

  • Akbik et al. (2019) Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). 54–59.
  • Akbik et al. (2018) Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In COLING 2018, 27th International Conference on Computational Linguistics. 1638–1649.
  • Blanco et al. (2008) Eduardo Blanco, Nuria Castell, and Dan I Moldovan. 2008. Causal Relation Extraction.. In Lrec.
  • Bollegala et al. (2018) Danushka Bollegala, Simon Maskell, Richard Sloane, Joanna Hajne, and Munir Pirmohamed. 2018. Causality Patterns for Detecting Adverse Drug Reactions From Social Media: Text Mining Approach. JMIR public health and surveillance 4, 2 (2018).
  • Chen (2023) Siyuan Chen. 2023. Deep neural networks for identifying causal relations in texts. (2023).
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259 (2014).
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Egami et al. (2018) Naoki Egami, Christian J Fong, Justin Grimmer, Margaret E Roberts, and Brandon M Stewart. 2018. How to make causal inferences using texts. arXiv preprint arXiv:1802.02163 (2018).
  • Heindorf et al. (2020) Stefan Heindorf, Yan Scholten, Henning Wachsmuth, Axel-Cyrille Ngonga Ngomo, and Martin Potthast. 2020. CauseNet: Towards a Causality Graph Extracted from the Web. In CIKM. ACM.
  • Hendrickx et al. (2019) Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2019. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. arXiv preprint arXiv:1911.10422 (2019).
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. (2017). To appear.
  • Khetan et al. (2022) Vivek Khetan, Roshni Ramnani, Mayuresh Anand, Subhashis Sengupta, and Andrew E Fano. 2022. Causal BERT: Language models for causality detection between events expressed in text. In Intelligent Computing. Springer, 965–980.
  • Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. (2001).
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016).
  • Lee et al. (2020) Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 4 (2020), 1234–1240.
  • Li et al. (2021) Zhaoning Li, Qi Li, Xiaotian Zou, and Jiangtao Ren. 2021. Causality extraction based on self-attentive BiLSTM-CRF with transferred embeddings. Neurocomputing 423 (2021), 207–219.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Mariko et al. (2021) Dominique Mariko, Hanna Abi-Akl, Estelle Labidurie, Stephane Durfort, Hugues De Mazancourt, and Mahmoud El-Haj. 2021. The Financial Document Causality Detection Shared Task (FinCausal 2021). In Proceedings of the 3rd Financial Narrative Processing Workshop. 58–60.
  • Mirza et al. (2014) Paramita Mirza, Rachele Sprugnoli, Sara Tonelli, and Manuela Speranza. 2014. Annotating causality in the tempeval-3 corpus. In Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL). 10–19.
  • Moghimifar et al. (2020) Farhad Moghimifar, Gholamreza Haffari, and Mahsa Baktashmotlagh. 2020. Domain adaptative causality encoder. arXiv preprint arXiv:2011.13549 (2020).
  • Roemmele and Gordon (2018) Melissa Roemmele and Andrew Gordon. 2018. An Encoder-decoder Approach to Predicting Causal Relations in Stories. In Proceedings of the First Workshop on Storytelling. 50–59.
  • Speer et al. (2017) Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Thirty-first AAAI conference on artificial intelligence.
  • UzZaman et al. (2013) Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. Semeval-2013 task 1: Tempeval-3: Evaluating time expressions, events, and temporal relations. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). 1–9.
  • Yang et al. (2022) Jie Yang, Soyeon Caren Han, and Josiah Poon. 2022. A survey on extraction of causal relations from natural language text. Knowledge and Information Systems (2022), 1–26.
  • Zhang et al. (2023) Chong Zhang, Jiagao Lyu, and Ke Xu. 2023. A storytree-based model for inter-document causal relation extraction from news articles. Knowledge and Information Systems 65, 2 (2023), 827–853.
  • Zhao et al. (2018) Sendong Zhao, Meng Jiang, Ming Liu, Bing Qin, and Ting Liu. 2018. Causaltriad: toward pseudo causal relation discovery and hypotheses generation from medical text data. In Proceedings of the 2018 ACM international conference on bioinformatics, computational biology, and health informatics. 184–193.
  • Zheng et al. (2015) Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. 2015. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE international conference on computer vision. 1529–1537.
  • Zhou et al. (2024) Bin Zhou, Xinyu Li, Tianyuan Liu, Kaizhou Xu, Wei Liu, and Jinsong Bao. 2024. CausalKGPT: industrial structure causal knowledge-enhanced large language model for cause analysis of quality problems in aerospace product manufacturing. Advanced Engineering Informatics 59 (2024), 102333.

11. License

  (1) The Causal-TimeBank corpus is provided under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

  (2) CauseNet is available for non-commercial academic use under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

  (3) The FinCausal2020 dataset is provided under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

  (4) The MedCaus dataset, derived from Wikipedia and medical articles, is available under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

  (5) The SemEval 2010 Task 8 dataset is available for research purposes under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Appendix A Appendix

Figure 2. Part-of-speech distributions for CauseNet and SemEval after dependency parsing. The difference in annotation schemes between CauseNet and SemEval is clear, as disproportionately more adjectives fall under the "C" (cause) and "E" (effect) labels in CauseNet. The $F1_{phrase}$ metric accounts for this discrepancy.
Dataset Sentences Implicit Domain Mean tokens per C/E Reference
MedCaus 8682 17% General 8.41 / 7.68 Moghimifar et al.,2020
CauseNet-noncause 5000 0% General 1.61 / 1.5 Heindorf et al.,2020
CauseNet-cause 5000 0% General 1.53 / 1.46 Heindorf et al.,2020
SemEval 2010 1003 34% General 1.06 / 1.02 Hendrickx et al.,2019
CausalTimeBank 298 54.7% News 1 / 0.99 Mirza et al.,2014
FinCausal2020 1719 78.7% Financial 23.72 / 10.26 Mariko et al.,2021
Table 6. Sentence-level causal relation extraction datasets used in this paper. The percentage of implicit sentences in each dataset is shown; implicit is defined as a sentence that does not fit the syntactic patterns mined by CauseNet. "Mean tokens per C/E" gives the average length, in tokens, of cause and effect annotations across the dataset.
Model Type Train F1 Test F1 Epochs Split Params
Fine Tuned Flair-BiGRU 0.945 0.811 8 2000/1000 hidden = 32
Fine Tuned Flair-BiGRU 0.982 0.808 8 2000/1000 hidden = 64
Fine Tuned Flair-BiGRU 0.991 0.819 8 2000/1000 hidden = 128
Fine Tuned Flair-BiGRU 0.995 0.820 8 2000/1000 hidden = 256
Fine Tuned Flair-BiGRU 0.997 0.811 8 2000/1000 hidden = 512
Fine Tuned Flair-BiGRU 0.998 0.815 8 2000/1000 hidden = 1024
Table 7. Performance of a Flair-BiGRU causal relation extractor across varying hidden dimensions. A hidden dimension of 256 leads to the best test F1.
Model Type Train F1 Test F1 SemEval F1 Epochs Split
BioBERT-BiGRU 0.995 0.724 0.676 12 200/100/1003
BioBERT-BiGRU 0.999 0.632 0.695 12 2000/1000/1003
BioBERT-BiGRU 0.987 0.873 0.69 12 20000/1000/1003
Table 8. This table shows performance of a BioBERT-BiGRU model trained on subsamples of CauseNet that accumulate in size, with transfer performance on SemEval. Increasing the training data by an order of magnitude beyond 2,000 does not improve generalization performance. The information in the training data that can be used meaningfully on the transfer target is learned within a few thousand samples.