Causality-aware Concept Extraction
based on Knowledge-guided Prompting
Abstract
Concepts benefit natural language understanding but are far from complete in existing knowledge graphs (KGs). Recently, pre-trained language models (PLMs) have been widely used in text-based concept extraction (CE). However, PLMs tend to mine co-occurrence associations from massive corpora as pre-trained knowledge rather than the real causal effect between tokens. As a result, the pre-trained knowledge confounds PLMs to extract biased concepts based on spurious co-occurrence correlations, inevitably resulting in low precision. In this paper, through the lens of a Structural Causal Model (SCM), we propose equipping the PLM-based extractor with a knowledge-guided prompt as an intervention to alleviate concept bias. The prompt adopts the topic of the given entity from the existing knowledge in KGs to mitigate the spurious co-occurrence correlations between entities and biased concepts. Our extensive experiments on representative multilingual KG datasets justify that our proposed prompt can effectively alleviate concept bias and improve the performance of PLM-based CE models. The code has been released at https://github.com/siyuyuan/KPCE.
Siyu Yuan, Deqing Yang††thanks: Corresponding authors., Jinxi Liu, Shuyu Tian, Jiaqing Liang, Yanghua Xiao11footnotemark: 1, Rui Xie School of Data Science, Fudan University, Shanghai, China Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University Fudan-Aishu Cognitive Intelligence Joint Research Center, Meituan, Beijing, China {syyuan21,jxliu22, sytian21}@m.fudan.edu.cn, {yangdeqing,liangjiaqing,shawyh}@fudan.edu.cn, rui.xie@meituan.com
1 Introduction
The concepts in knowledge graphs (KGs) enable machines to understand natural language better, and thus benefit many downstream tasks, such as question answering Han et al. (2020), commonsense reasoning Zhong et al. (2021) and entity typing Yuan et al. (2022). However, the concepts in existing KGs, especially the fine-grained ones, are still far from complete. For example, in the widely used Chinese KG CN-DBpedia Xu et al. (2017), there are nearly 17 million entities but only 0.27 million concepts in total, and more than 20% of entities have no concepts at all. Although Probase Wu et al. (2012) is a large-scale English KG, fine-grained concepts with two or more modifiers account for only 30% of it Li et al. (2021). We focus on extracting multi-grained concepts from texts to complete existing KGs.
Most of the existing text-based concept acquisition approaches adopt the extraction scheme, which can be divided into two categories: 1) pattern-matching approaches Auer et al. (2007); Wu et al. (2012); Xu et al. (2017), which can obtain high-quality concepts but only have low recall due to poor generalization; 2) learning-based approaches Luo et al. (2020); Ji et al. (2020); Yuan et al. (2021a), which employ pre-trained language models (PLMs) fine-tuned with labeled data to extract concepts.

However, an unignorable drawback of these learning-based approaches based on PLMs is concept bias. Concept bias means the concepts are extracted based on their contextual (co-occurrence) associations rather than the real causal effect between the entities and concepts, resulting in low extraction precision. For example, in Figure 1, PLMs tend to extract novel and writer together as concepts for the entity Louisa May Alcott even if we explicitly input the entity Louisa May Alcott to the model. Previous work demonstrates that causal inference is a promising technique for bias analysis Lu et al. (2022). To analyze the reasons behind concept bias, we devise a Structural Causal Model (SCM) Pearl (2009) to investigate the causal effect in the PLM-based concept extraction (CE) system, and show that pre-trained knowledge in PLMs confounds PLMs to extract biased concepts. During the pre-training, the entities and biased concepts (e.g., Louisa May Alcott and novel) often co-occur in many texts. Thus, PLMs tend to mine statistical associations from a massive corpus rather than the real causal effect between them Li et al. (2022), which induces spurious co-occurrence correlations between entities (i.e., Louisa May Alcott) and biased concepts (i.e., novel). Since we cannot directly observe the prior distribution of pre-trained knowledge, the backdoor adjustment is intractable for our problem Pearl (2009). Alternatively, the frontdoor adjustment Peters et al. (2017) can apply a mediator as an intervention to mitigate bias.
In this paper, we adopt language prompting Gao et al. (2021); Li and Liang (2021) as a mediator for the frontdoor adjustment to handle concept bias. We propose a novel Concept Extraction framework with Knowledge-guided Prompt, namely KPCE to extract concepts for given entities from text. Specifically, we construct a knowledge-guided prompt by obtaining the topic of the given entity (e.g., person for Louisa May Alcott) from the knowledge in the existing KGs. Our proposed knowledge-guided prompt is independent of pre-trained knowledge and fulfills the frontdoor criterion. Thus, it can be used as a mediator to guide PLMs to focus on the right cause and alleviate spurious correlations. Although adopting our knowledge-guided prompt to construct the mediator is straightforward, it has been proven effective in addressing concept bias and improving the extraction performance of PLM-based extractors in the CE task.
In summary, our contributions include: 1) To the best of our knowledge, we are the first to identify the concept bias problem in the PLM-based CE system. 2) We define a Structural Causal Model to analyze the concept bias from a causal perspective and propose adopting a knowledge-guided prompt as a mediator to alleviate the bias via frontdoor adjustment. 3) Experimental results demonstrate the effectiveness of the proposed knowledge-guided prompt, which significantly mitigates the bias and achieves a new state-of-the-art for CE task.
2 Related Work
Concept Acquisition
Most of the existing text-based concept acquisition approaches adopt the extraction scheme, which can be divided into two categories: 1) Pattern-matching Approaches: extract concepts from free texts with hand-crafted patterns Auer et al. (2007); Wu et al. (2012); Xu et al. (2017). Although they can obtain high-quality concepts, they have low recall due to their poor generalization ability; 2) Learning-based Approaches: mostly employ the PLM-based extraction models from other extraction tasks, such as the Named Entity Recognition (NER) models Li et al. (2020); Luo et al. (2021); Lange et al. (2022) and Information Extraction models Fang et al. (2021); Yuan et al. (2021a) in the CE task. Although they can extract many concepts from a large corpus, the concept bias cannot be well handled.
Causality for Language Processing
Several recent works study causal inference combined with language models for natural language processing (NLP) Schölkopf (2022), such as controllable text generation Hu and Li (2021); Goyal et al. (2022) and counterfactual reasoning Chen et al. (2022); Paranjape et al. (2022). In addition, causal inference can recognize spurious correlations via the Structural Causal Model (SCM) Pearl (2009) for bias analysis and eliminate biases using causal intervention techniques Weber et al. (2020); Lu et al. (2022). There are also studies showing that causal inference is a promising technique to identify undesirable biases in NLP datasets Feder et al. (2022) and pre-trained language models (PLMs) Li et al. (2022). In this paper, we adopt causal inference to identify, understand, and alleviate concept bias in concept extraction.
Language Prompting
Language prompting can distill knowledge from PLMs to improve the model performance in the downstream task. Language prompt construction methods can be divided into two categories Liu et al. (2021a): 1) Hand-crafted Prompts, which are created manually based on human insights into the tasks Brown et al. (2020); Schick and Schütze (2021); Schick and Schutze (2021). Although they obtain high-quality results, how to construct optimal prompts for a certain downstream task is an intractable challenge; 2) Automated Constructed Prompts, which are generated automatically from natural language phrases Jiang et al. (2020); Yuan et al. (2021b) or vector space Li and Liang (2021); Liu et al. (2021b). Although previous work analyzes the prompt from a causal perspective Cao et al. (2022), relatively little attention has been paid to adopting the prompt to alleviate the bias in the downstream task.
3 Concept Bias Analysis
In this section, we first formally define our task. Then we investigate the concept bias exhibited by PLMs through empirical studies. Finally, we devise a Structural Causal Model (SCM) to analyze the bias and alleviate it via causal inference.
3.1 Preliminary
Task Definition
Our CE task addressed in this paper can be formulated as follows. Given an entity $E$ and its relevant text $T=\{w_1, w_2, \dots, w_{|T|}\}$ where each $w_i$ is a word token, our framework aims to extract one or multiple spans from $T$ as the concept(s) of $E$.
Data Selection
We must guarantee that the given text contains concepts. The abstract text of an entity expresses the concepts of the entity explicitly and can be obtained from online encyclopedias or knowledge bases. In this paper, we therefore take the abstract text of an entity as its relevant text $T$. The details of dataset construction will be introduced in §5.1. Since we aim to extract concepts from $T$ for $E$, it is reasonable to concatenate $E$ and $T$ to form the input text $X$.

3.2 Empirical Studies on Concept Bias
To demonstrate the presence of concept bias, we conduct empirical studies on the CN-DBpedia dataset Xu et al. (2017). First, we randomly sample 1 million entities with their concepts from CN-DBpedia and select the top 100 concepts with the most entities as the typical concept set. Then we randomly select 100 entities with their abstracts for each typical concept to construct the input texts and run a BERT-based extractor to extract concepts. Details of the extraction process will be introduced in §4.2. We invite volunteers to assess whether the extracted concepts are biased. To quantify the degree of concept bias, we calculate the bias rate of concept A to another concept B, defined as the number of entities of A for which B or a sub-concept of B is mistakenly extracted by the extractor, divided by the total number of entities of A.
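For illustration, a minimal sketch of this bias-rate computation; the `extract_concepts` helper and the entity/concept containers are hypothetical placeholders, not part of our released code:

```python
from typing import Callable, Dict, List, Set

def bias_rate(entities_of_a: List[str],
              abstracts: Dict[str, str],
              biased_concepts_b: Set[str],
              extract_concepts: Callable[[str, str], List[str]]) -> float:
    """Share of A's entities for which the extractor mistakenly returns
    concept B or one of its sub-concepts (assumed to be listed in
    biased_concepts_b)."""
    biased = 0
    for entity in entities_of_a:
        predicted = extract_concepts(entity, abstracts[entity])
        if any(c in biased_concepts_b for c in predicted):
            biased += 1
    return biased / len(entities_of_a)
```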
The bias rates among 26 typical concepts are shown in Figure 2, where the concepts (dots) of the same topic are clustered in one rectangle. The construction of concept topics will be introduced in §4.1. From the figure, we can conclude that concept bias is widespread in the PLM-based CE system and negatively affects the quality of the results. Previous studies have proven that causal inference can analyze bias via SCM and eliminate bias with causal intervention techniques Cao et al. (2022). Next, we will analyze concept bias from a causal perspective.

3.3 The Causal Framework for Concept Bias Analysis
The Structural Causal Model
We devise a Structural Causal Model (SCM) to identify the causal effect between the input text $X$ of a given entity and the concept span $Y$ that can be extracted from $X$. As shown in Figure 3, our CE task aims to extract one or multiple spans from $X$ as the concept(s) of the entity, where the causal effect can be denoted as $X \rightarrow Y$.
During pre-training, the contextual embedding of one token depends on the tokens that frequently appear nearby in the corpus. We extrapolate that the high co-occurrence, in the pre-training corpus, between entities of a true concept (e.g., writer) and a biased concept (e.g., novel) induces spurious correlations between such entities (e.g., Louisa May Alcott) and the biased concepts (e.g., novel). Therefore, PLM-based CE models can mistakenly extract biased concepts even if the entity is explicitly mentioned in $X$. The experiments in §5.4 also support this rationale. Based on the foregoing analysis, we define the pre-trained knowledge $K$ in PLM-based extraction models as a confounder between $X$ and $Y$.
Causal Intervention
To mitigate the concept bias, we construct a prompt $P$ as a mediator for $X \rightarrow Y$, and then the frontdoor adjustment can apply the do-operation.
Specifically, to make the PLMs attend to the right cause and alleviate the spurious co-occurrence correlations (e.g., between novel and Louisa May Alcott), we assign a topic (i.e., person) as a knowledge-guided prompt to the input text $X$ (the detailed operation is elaborated in §4.1). The topics obtained from KGs are independent of the pre-trained knowledge and thus fulfill the frontdoor criterion.
For the causal effect $X \rightarrow P$, we can observe that $Y$ is a collider that blocks the backdoor path $X \leftarrow K \rightarrow Y \leftarrow P$, so no backdoor path is available from $X$ to $P$. Therefore, we can directly rely on the conditional probability after applying the do-operator to $X$:
$$P(P=p \mid do(X=x)) = P(P=p \mid X=x) \quad (1)$$
Next, for the causal effect $P \rightarrow Y$, there is a backdoor path $P \leftarrow X \leftarrow K \rightarrow Y$ from $P$ to $Y$, which we need to cut off. Since $K$ is an unobserved variable, we can instead block the backdoor path through $X$:
$$P(Y \mid do(P=p)) = \sum_{x'} P(Y \mid P=p, X=x')\,P(X=x') \quad (2)$$
Therefore, the underlying causal mechanism of our CE task is a combination of Eq.1 and Eq.2, which can be formulated as:
$$P(Y \mid do(X=x)) = \sum_{p} P(P=p \mid X=x) \sum_{x'} P(Y \mid P=p, X=x')\,P(X=x') \quad (3)$$
The theoretical details of the frontdoor adjustment are introduced in Appendix A.
We make the assumption of strong ignorability, i.e., $K$ is the only confounder between $X$ and $Y$. One assumption of the frontdoor criterion is that the only way the input text $X$ influences $Y$ is through the mediator $P$; thus, $X \rightarrow P \rightarrow Y$ must be the only path, otherwise the frontdoor adjustment cannot stand. Notice that $K$ already represents all the knowledge from pre-trained data in PLMs. Therefore, it is reasonable to make the strong ignorability assumption that $K$ already includes all possible confounders.
Through the frontdoor adjustment, we can block the backdoor path from the input text to the concepts and alleviate the spurious correlation caused by the confounder, i.e., the pre-trained knowledge. In practice, we train a topic classifier to estimate Eq.1 (§4.1) and train a concept extractor on our training data to estimate Eq.2 (§4.2). Next, we will introduce the implementation of the frontdoor adjustment in detail.
4 Methodology

In this section, we present our CE framework KPCE and discuss how to perform prompting to alleviate concept bias. The overall framework of KPCE is illustrated in Figure 4, which consists of two major modules: 1) Prompt Constructor, which assigns the topic obtained from KGs to each entity as a knowledge-guided prompt to estimate Eq.1; 2) Concept Extractor, which trains a BERT-based extractor with the constructed prompt to estimate Eq.2 and extract multi-grained concepts from the input text. Next, we will introduce the two modules of KPCE.
4.1 Prompt Constructor
Knowledge-guided Prompt Construction
To reduce the concept bias, we use the topic of a given entity as a knowledge-guided prompt, which is identified based on the external knowledge of the existing KGs. Take CN-DBpedia Xu et al. (2017) as an example (the concepts of CN-DBpedia are inherited from Probase, so the typical topics are the same for CN-DBpedia and Probase). We randomly sample one million entities from this KG and obtain their existing concepts. Then, we select the top 100 concepts having the most entities to constitute the typical concept set, which covers more than 99.80% of the entities in the KG. Next, we use spectral clustering Von Luxburg (2007) with the adaptive K-means Bhatia et al. (2004) algorithm to cluster these typical concepts into several groups, each of which corresponds to a topic. To achieve the spectral clustering, we use the following overlap coefficient Vijaymeena and Kavitha (2016) to measure the similarity between two concepts,
$$\mathrm{Overlap}(A, B) = \frac{|E_A \cap E_B|}{\min(|E_A|, |E_B|)} \quad (4)$$
where $E_A$ and $E_B$ are the entity sets of concept $A$ and concept $B$, respectively. We then construct a similarity matrix of typical concepts to achieve spectral clustering. To determine the best number of clusters, we calculate the Silhouette Coefficient (SC) Aranganayagi and Thangavel (2007) and the Calinski-Harabasz Index (CHI) Maulik and Bandyopadhyay (2002) for 3 to 30 clusters. The scores are shown in Figure 5, from which we find that the best number of clusters is 17. As a result, we cluster the typical concepts into 17 groups and define a topic name for each group. The 17 typical topics and their corresponding concepts are listed in Appendix B.1.
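A sketch of this clustering step, assuming the entity set of each typical concept is available as a Python set and that scikit-learn's SpectralClustering over the precomputed overlap-coefficient affinity matrix is an acceptable stand-in for the procedure described above:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def overlap(a: set, b: set) -> float:
    # Overlap coefficient (Eq.4): |E_A ∩ E_B| / min(|E_A|, |E_B|)
    return len(a & b) / min(len(a), len(b))

def cluster_concepts(entity_sets: dict):
    """entity_sets: {concept_name: set of entity names}."""
    names = list(entity_sets)
    n = len(names)
    sim = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = overlap(entity_sets[names[i]],
                                            entity_sets[names[j]])
    best = None
    for k in range(3, 31):  # candidate numbers of clusters: 3..30
        labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                    random_state=0).fit_predict(sim)
        sc = silhouette_score(1.0 - sim, labels, metric="precomputed")
        chi = calinski_harabasz_score(sim, labels)  # sim rows as rough features
        if best is None or sc > best[0]:
            best = (sc, chi, k, dict(zip(names, labels)))
    return best  # (SC, CHI, best k, concept -> cluster id)
```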
Identifying Topic Prompt for Each Entity
We adopt a topic classifier to assign one of the 17 typical topics in Table 6 as the topic prompt for the input text $X$. To construct the training data, we randomly fetch 40,000 entities together with their abstract texts and existing concepts in the KG. According to the concept clustering results, we can assign each topic to the entities. We adopt a transformer encoder Vaswani et al. (2017) followed by a two-layer perceptron (MLP) Gardner and Dorling (1998) activated by ReLU as our topic classifier (we do not employ a PLM-based topic classifier since it would bring a direct path from $K$ to $P$ in Figure 3). We train the topic classifier to predict the topic prompt $tp$ for $X$, which is calculated as follows (the detailed training operation of the topic classifier can be found in Appendix B.1):
$$tp = \arg\max_{tp_i} P(tp_i \mid X) \quad (5)$$
where $tp_i$ is the $i$-th topic among the 17 typical topics.
In our experiments, the topic classifier achieves more than 97.8% accuracy on 500 samples under human assessment. By training the topic classifier, we can estimate Eq.1 and thus identify the causal effect $X \rightarrow P$.
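A minimal PyTorch sketch of such a topic classifier: a lightweight transformer encoder over randomly initialized token embeddings followed by a two-layer ReLU MLP. The number of encoder layers and attention heads, and placing [CLS] at index 0, are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TopicClassifier(nn.Module):
    def __init__(self, vocab_size, num_topics=17, d=768, n_layers=2, n_heads=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)  # randomly initialized, not from a PLM
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                 nn.Linear(d, num_topics))

    def forward(self, input_ids):               # (batch, seq_len), [CLS] at index 0
        h = self.encoder(self.emb(input_ids))   # (batch, seq_len, d)
        return self.mlp(h[:, 0])                # topic logits from the [CLS] state

# Training uses cross-entropy over the 17 topics; at inference the prompt
# is the argmax topic (Eq.5): tp = logits.argmax(dim=-1)
```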
4.2 Concept Extractor
Prompt-based BERT
The concept extractor is a BERT equipped with our proposed prompt followed by a pointer network Vinyals et al. (2015). The pointer network is adopted for extracting multi-grained concepts.
We first concatenate the token sequence of the topic prompt $tp$ with the tokens of $X$ to constitute the input, i.e., [CLS] $tp$ [SEP] $X$ [SEP], where [CLS] and [SEP] are the special tokens in BERT. With multi-headed self-attention operations over the above input, BERT outputs the final hidden state (matrix) $\mathbf{H}^{N}$, where $d$ is the vector dimension and $N$ is the total number of layers. Then the pointer network predicts the probability of each token being the start position and the end position of an extracted span. We use $\mathbf{p}^{s}$ and $\mathbf{p}^{e}$ to denote the vectors storing the probabilities of all tokens to be the start position and end position, which are calculated as
$$\mathbf{p}^{s} = \mathrm{sigmoid}(\mathbf{H}^{N}\mathbf{w}_{s} + b_{s}), \quad \mathbf{p}^{e} = \mathrm{sigmoid}(\mathbf{H}^{N}\mathbf{w}_{e} + b_{e}) \quad (6)$$
where $\mathbf{w}_{s}, \mathbf{w}_{e} \in \mathbb{R}^{d}$ and $b_{s}, b_{e} \in \mathbb{R}$ are trainable parameters. We only consider the probabilities of the tokens in the abstract $T$. Given a span with $w_i$ and $w_j$ as the start token and the end token, its confidence score can be calculated as
$$CS_{ij} = p^{s}_{i} \times p^{e}_{j} \quad (7)$$
Accordingly, the model outputs a ranking list of candidate concepts (spans) with their confidence scores. We only reserve the concepts with confidence scores greater than the selection threshold. An example illustrating how the pointer network works is provided in Appendix B.2.
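A sketch of the span-scoring head, assuming the sigmoid start/end probabilities and product-style confidence score reconstructed in Eq.6 and Eq.7 (the exact parameterization in the released code may differ); the threshold 0.12 and maximum span length m = 30 follow Appendix C.2:

```python
import torch
import torch.nn as nn

class PointerHead(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        self.start = nn.Linear(d, 1)  # w_s, b_s in Eq.6
        self.end = nn.Linear(d, 1)    # w_e, b_e in Eq.6

    def forward(self, hidden):        # hidden: (seq_len, d) from BERT's last layer
        p_start = torch.sigmoid(self.start(hidden)).squeeze(-1)
        p_end = torch.sigmoid(self.end(hidden)).squeeze(-1)
        return p_start, p_end

def candidate_spans(p_start, p_end, abstract_mask, threshold=0.12, max_len=30):
    """Enumerate spans inside the abstract and keep those whose confidence
    score p_start[i] * p_end[j] (Eq.7) exceeds the selection threshold."""
    positions = abstract_mask.nonzero(as_tuple=True)[0].tolist()
    spans = []
    for i in positions:
        for j in positions:
            if i <= j <= i + max_len:
                score = float(p_start[i] * p_end[j])
                if score > threshold:
                    spans.append((i, j, score))
    return sorted(spans, key=lambda s: -s[2])   # ranking list of candidate spans
```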
During training, the concept extractor is fed with the input texts with topic prompts and outputs the probability (confidence scores) of the spans, and thus can estimate the causal effect in Eq.2.
Model Training
We adopt the cross-entropy function as the loss function of our model. Specifically, suppose that $\mathbf{y}^{s}$ (or $\mathbf{y}^{e}$) contains the real label (0 or 1) of each input token being the start (or end) position of a concept. Then, we have the following two training losses for the predictions:
$$\mathcal{L}_{s} = -\frac{1}{|X|}\sum_{i=1}^{|X|}\big[y^{s}_{i}\log p^{s}_{i} + (1-y^{s}_{i})\log(1-p^{s}_{i})\big] \quad (8)$$
$$\mathcal{L}_{e} = -\frac{1}{|X|}\sum_{i=1}^{|X|}\big[y^{e}_{i}\log p^{e}_{i} + (1-y^{e}_{i})\log(1-p^{e}_{i})\big] \quad (9)$$
Then, the overall training loss is
$$\mathcal{L} = \lambda\,\mathcal{L}_{s} + (1-\lambda)\,\mathcal{L}_{e} \quad (10)$$
where $\lambda$ is the control parameter. We use Adam Kingma and Ba (2015) to optimize $\mathcal{L}$.
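A short sketch of the training objective, assuming the token-level binary cross-entropy and λ-weighted combination reconstructed in Eq.8–10 (λ = 0.3 as reported in Appendix C.2):

```python
import torch.nn.functional as F

def pointer_loss(p_start, p_end, y_start, y_end, lam=0.3):
    # Eq.8 / Eq.9: binary cross-entropy over start and end labels
    loss_start = F.binary_cross_entropy(p_start, y_start.float())
    loss_end = F.binary_cross_entropy(p_end, y_end.float())
    # Eq.10: combination controlled by lambda (weighting form is an assumption)
    return lam * loss_start + (1.0 - lam) * loss_end
```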
5 Experiments
5.1 Datasets
CN-DBpedia
From the latest version of the Chinese KG CN-DBpedia Xu et al. (2017) and Wikipedia, we randomly sample 100,000 instances to construct our sample pool. Each instance in the sample pool consists of an entity with its concept and abstract text (if one entity has multiple concepts, we randomly select one as the golden label). Then, we sample 500 instances from the pool as our test set and divide the remaining instances into the training set and validation set at a ratio of 9:1.
Probase
We obtain the English sample pool of 50,000 instances from Probase Wu et al. (2012) and Wikipedia. The training, validation and test set construction are the same as the Chinese dataset.
5.2 Evaluation Metrics
We compare KPCE with seven baselines, including a pattern-matching approach, i.e., Hearst patterns. Detailed information on the baselines and the experiment settings is given in Appendix C.1 and C.2. Some extracted concepts do not exist in the KG and cannot be assessed automatically. Therefore, we invite annotators to assess whether the extracted concepts are correct. The annotation details are given in Appendix C.3.
Please note that some extracted concepts may already exist in the KG for the given entity, which we denote as ECs (existing concepts). However, our work aims to extract correct but new concepts (that do not exist in the KG) to complete the KGs, which we denote as NCs (new concepts). Therefore, we record the number of new concepts (NC #) and report the ratio of correct concepts (ECs and NCs) as precision (Prec.). Since it is difficult to know all the correct concepts in the input text, we report the relative recall (RecallR). Specifically, suppose NCs # is the total number of new concepts extracted by all compared models. Then, the relative recall is calculated as NC # divided by NCs # (note that NCs # is counted over all models in one comparison; therefore, RecallR can differ for one model when the compared models change). Accordingly, the relative F1 (F1R) can be calculated from Prec. and RecallR. In addition, we also record the average length of new concepts (LenNC) to investigate the effectiveness of the pointer network.
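To make the metric definitions concrete, a small sketch of how they can be computed from the human labels of Appendix C.3 (0/1/2 for wrong/EC/NC); treating NCs # as the plain sum over models is a simplification that ignores de-duplication across models:

```python
def ce_metrics(labels_per_model, model):
    """labels_per_model: {model_name: list of 0/1/2 labels for its outputs}."""
    labels = labels_per_model[model]
    nc = sum(1 for l in labels if l == 2)             # NC #
    correct = sum(1 for l in labels if l in (1, 2))   # ECs + NCs
    precision = correct / len(labels)                 # Prec.
    ncs_all = sum(sum(1 for l in ls if l == 2)        # NCs # over all models
                  for ls in labels_per_model.values())
    recall_r = nc / ncs_all                           # RecallR
    f1_r = 2 * precision * recall_r / (precision + recall_r)
    return nc, precision, recall_r, f1_r
```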
5.3 Overall Performance
We present the main results in Table 1. Generally, we have the following findings:
Our method outperforms previous baselines by large margins, including the previous state of the art (MRC-CE, Yuan et al., 2021a). However, the pattern-based approach still beats the learning-based ones in precision, leaving room for improvement. We find that KPCE achieves a more significant improvement in extracting new concepts, indicating that KPCE can be applied to KG completion (§5.5). We also compare KPCE with its ablated variant (w/o P), and the results show that adding a knowledge-guided prompt guides BERT to more accurate CE results.
We notice that almost all models have higher extraction precision on the Chinese dataset than that on the English dataset. This is because the modifiers are usually placed before nouns in Chinese syntactic structure, and thus it is easier to identify these modifiers and extract them with the coarse-grained concepts together to form the fine-grained ones. However, for the English dataset, not only adjectives but also subordinate clauses modify coarse-grained concepts, and thus identifying these modifiers is more difficult.
Compared with the learning-based baselines, KPCE can extract more fine-grained concepts. Although the Hearst patterns can also extract fine-grained concepts, they cannot simultaneously extract multi-grained concepts when a coarse-grained concept term is a subsequence of another fine-grained concept term. For example, in Figure 4, if the Hearst pattern extracts American novelist as a concept, it cannot extract novelist simultaneously. KPCE solves this problem well with the aid of the pointer network and achieves a much higher recall.
Model | NC # | LenNC | Prec. | RecallR | F1R
Trained on CN-DBpedia
Hearst | 222 | 5.95 | 95.24% | 21.66% | 35.29%
FLAIR | 64 | 3.09 | 95.71% | 6.24% | 11.72%
XLNet | 47 | 2.66 | 88.48% | 4.68% | 8.90%
KVMN | 254 | 4.03 | 64.45% | 26.02% | 37.08%
XLM-R | 255 | 5.35 | 76.82% | 24.78% | 37.47%
BBF | 26 | 4.34 | 88.28% | 2.54% | 4.93%
GACEN | 346 | 3.58 | 84.89% | 36.73% | 51.27%
MRC-CE | 323 | 5.33 | 92.12% | 31.51% | 46.96%
KPCE | 482 | 5.52 | 94.20% | 44.38% | 60.33%
w/o P | 338 | 5.21 | 72.07% | 34.05% | 46.25%
Trained on Probase
Hearst | 287 | 2.43 | 89.04% | 17.10% | 28.69%
FLAIR | 140 | 1.68 | 84.31% | 7.73% | 14.16%
XLNet | 342 | 1.51 | 79.30% | 18.87% | 30.49%
KVMN | 403 | 1.97 | 47.39% | 22.24% | 30.27%
XLM-R | 322 | 2.28 | 81.73% | 17.77% | 29.19%
BBF | 154 | 1.68 | 81.13% | 8.44% | 15.30%
GACEN | 486 | 1.75 | 76.93% | 31.82% | 45.02%
MRC-CE | 598 | 2.23 | 88.59% | 33.00% | 48.09%
KPCE | 752 | 2.31 | 88.69% | 46.83% | 61.30%
w/o P | 691 | 2.26 | 78.64% | 40.62% | 53.57%
5.4 Analysis
In response to the motivations of KPCE, we conduct detailed analyses to further understand KPCE and why it works.
ConceptO (original concept) | ConceptB (biased concept) | KPCE w/o P | KPCE
writer | book | 20% | 7% |
plant | doctor | 8% | 0% |
illness | medicine | 12% | 6% |
singer | music | 19% | 2% |
poem | poet | 25% | 1% |
Topic: Technology. Entity: Korean alphabet. Abstract: The Korean alphabet is a writing system for the Korean language created by King Sejong the Great in 1443.
Output Results | |||
KPCE w/o P | KPCE
Span | C.S. | Span | C.S. |
language | 0.238 | system | 0.240 |
alphabet | 0.213 | writing system | 0.219 |
system | 0.209 | system for the Korean language | 0.130 |
How does KPCE alleviate the concept bias?
As mentioned in §3.2, the concept bias occurs primarily among 26 concepts in CN-DBpedia. To justify that KPCE can alleviate concept bias with the aid of prompts, we randomly select five concepts and run KPCE and its ablated variant to extract concepts for 100 entities randomly selected from each of the five concepts. Then we calculate the bias rates of each concept, and the results in Table 2 show that KPCE has a much lower bias rate than the vanilla BERT-based concept extractor. Thus, the knowledge-guided prompt can significantly mitigate the concept bias.
Furthermore, a case study for an entity Korean alphabet is shown in Table 3. We find that the proposed prompts can mitigate the spurious co-occurrence correlation between entities and biased concepts by decreasing the confidence scores of biased concepts (i.e., language and alphabet) and increasing the scores of correct concepts (i.e., system and writing system). Thus, the knowledge-guided prompt can significantly alleviate the concept bias and result in more accurate CE results.

How does the prompt affect the spurious co-occurrence correlations?
To explore the rationale behind the prompt-based mediator, we focus on the attention distribution of the special token [CLS], since it is an aggregate representation of the sequence and can capture the sentence-level semantic meaning Devlin et al. (2019); Chang et al. (2022). Following previous work Clark et al. (2019), we calculate the attention probabilities of [CLS] to other tokens by averaging and normalizing the attention values of the 12 attention heads in the last layer. The attention distributions of KPCE and its ablated variant are visualized in Figure 6. We find that the tokens of writer and novel both receive high attention in the vanilla BERT-based concept extractor. However, after adopting our knowledge-guided prompt, the attention probability of novel is lower than before, which helps the model reduce the spurious co-occurrence correlations derived from pre-trained knowledge.
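A sketch of this attention analysis with HuggingFace Transformers; the model name and input text below are placeholders, whereas our analysis uses the fine-tuned extractor and the prompted input:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

def cls_attention(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    last_layer = out.attentions[-1][0]         # (num_heads, seq_len, seq_len)
    cls_att = last_layer[:, 0, :].mean(dim=0)  # average the heads, [CLS] row
    cls_att = cls_att / cls_att.sum()          # normalize over tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return list(zip(tokens, cls_att.tolist()))
```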
What if other knowledge injection methods are adopted?
We claim that the topics obtained from external KGs are better than keyword-based topics derived from the text at guiding BERT to achieve our CE task. To justify this, we compare KPCE with another variant, namely KPCE LDA, where the topics are the keywords obtained by running Latent Dirichlet Allocation (LDA) Blei et al. (2001) over the abstracts of all entities. Besides, we also compare KPCE with ERNIE Zhang et al. (2019), which implicitly learns the knowledge of entities during pre-training. Details about LDA and ERNIE are given in Appendix C.4. The comparison results are listed in Table 4. They show that our design of the knowledge-guided prompt in KPCE exploits the value of external knowledge more thoroughly than the two other schemes, thus achieving better CE performance.
Model | NC # | Prec. | RecallR | F1R |
Trained on CN-DBpedia
KPCE | 482 | 94.20% | 85.23% | 89.49% |
KPCE LDA | 308 | 93.08% | 82.13% | 87.26% |
ERNIE | 302 | 93.86% | 80.53% | 86.69% |
Trained on Probase
KPCE | 752 | 88.69% | 80.85% | 84.59% |
KPCE LDA | 381 | 68.23% | 61.45% | 64.66% |
ERNIE | 286 | 77.96% | 46.13% | 57.97% |
5.5 Applications
Model | TS # | NC # | Prec. | RecallR | F1R |
KPCE | 0 | 62 | 82.66% | 48.44% | 61.08% |
w/o P | 0 | 55 | 69.62% | 42.97% | 53.14% |
KPCE | 300 | 107 | 82.95% | 83.59% | 83.27%
w/o P | 300 | 89 | 81.65% | 69.53% | 75.10% |
KG Completion
We run KPCE for all entities existing in CN-DBpedia to supplement new concepts. KPCE extracts 7,623,111 new concepts for 6 million entities. Thus, our framework can achieve large-scale concept completion for existing KGs.
Domain Concept Acquisition
We collect 117,489 Food & Delight entities with their descriptive texts from Meituan (http://www.meituan.com, a Chinese e-business platform) and explore two application approaches. The first is to apply KPCE directly, and the second is to randomly select 300 samples as a small training set to fine-tune KPCE. The results in Table 5 show that: 1) the transfer ability of KPCE is greatly improved with the aid of prompts; 2) KPCE can extract high-quality concepts in the new domain with only a small number of training samples. Furthermore, when applied directly, KPCE extracts 81,800 new concepts with 82.66% precision. Thus, our knowledge-guided prompt can significantly improve the transfer ability of PLMs on the domain CE task.
6 Conclusion
In this paper, we identify the concept bias in the PLM-based CE system and devise a Structural Causal Model to analyze it. To alleviate the bias, we propose a novel CE framework with knowledge-guided prompting that mitigates the spurious co-occurrence correlations between entities and biased concepts. We conduct extensive experiments to justify that our prompt-based learning framework can significantly mitigate the bias and achieves excellent performance in concept acquisition.
7 Limitations
Although we have proven that our work can significantly alleviate concept bias and extract high-quality and new concepts, it also has some limitations. In this section, we analyze three limitations and hope to advance future work.
Model Novelty
Although KPCE can effectively mitigate the spurious co-occurrence correlations between entities and biased concepts, the proposed framework is not entirely novel. The novelty of our work is to conduct the first thorough causal analysis that reveals the spurious correlations between entities and biased concepts in the concept extraction task. After defining the problem and the SCM of concept extraction in §3.1, we propose a prompt-based approach to implement the interventions on the SCM and elicit unbiased knowledge from PLMs. Previous work on language prompting mostly guides the PLMs with prompts but is unaware of the cause-effect relations in its task, which may hinder the effectiveness of prompts. We hope our work can inspire future work to utilize language prompting from a causal perspective.
Topic Classification
Although the topics obtained by clustering are mostly mutually exclusive, there are still cases where an entity can be classified into multiple topics. Therefore, considering only one topic for an entity may exclude some correct concepts.
Threshold Selection
We only reserve concepts with confidence scores greater than the selection threshold (§4.2), which makes it hard to achieve a satisfactory balance between precision and recall. If we select a relatively large threshold, we obtain more accurate concepts but may lose some correct ones; if recall is preferred, precision might be hurt.
We suggest that future work consider these three limitations to achieve better performance in the CE task.
Acknowledgement
We would like to thank the anonymous reviewers for their valuable comments and suggestions for this work. This work is supported by the Chinese NSF Major Research Plan (No.92270121), Shanghai Science and Technology Innovation Action Plan (No.21511100401) and the Science and Technology Commission of Shanghai Municipality Grant (No. 22511105902).
References
- Akbik et al. (2019) Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. Flair: An easy-to-use framework for state-of-the-art nlp. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics (demonstrations), pages 54–59.
- Aranganayagi and Thangavel (2007) S Aranganayagi and Kuttiyannan Thangavel. 2007. Clustering categorical data using silhouette coefficient as a relocating measure. In International conference on computational intelligence and multimedia applications (ICCIMA 2007), volume 2, pages 13–17. IEEE.
- Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer.
- Bhatia et al. (2004) Sanjiv K Bhatia et al. 2004. Adaptive k-means clustering. In FLAIRS conference, pages 695–699.
- Blei et al. (2001) D. M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2001. Latent dirichlet allocation. The Annals of Applied Statistics.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Cao et al. (2022) Boxi Cao, Hongyu Lin, Xianpei Han, Fangchao Liu, and Le Sun. 2022. Can prompt probe pretrained language models? understanding the invisible risks from a causal view. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5796–5808, Dublin, Ireland. Association for Computational Linguistics.
- Chang et al. (2022) Haw-Shiuan Chang, Ruei-Yao Sun, Kathryn Ricci, and Andrew McCallum. 2022. Multi-cls bert: An efficient alternative to traditional ensembling. arXiv preprint arXiv:2210.05043.
- Chen et al. (2022) Jiangjie Chen, Chun Gan, Sijie Cheng, Hao Zhou, Yanghua Xiao, and Lei Li. 2022. Unsupervised editing for counterfactual stories. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):10473–10481.
- Clark et al. (2019) Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? an analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.
- Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
- Conneau and Lample (2019) Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. Advances in neural information processing systems, 32.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Fang et al. (2021) Songtao Fang, Zhenya Huang, Ming He, Shiwei Tong, Xiaoqing Huang, Ye Liu, Jie Huang, and Qi Liu. 2021. Guided attention network for concept extraction. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 1449–1455. International Joint Conferences on Artificial Intelligence Organization. Main Track.
- Feder et al. (2022) Amir Feder, Katherine A. Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E. Roberts, Brandon M. Stewart, Victor Veitch, and Diyi Yang. 2022. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. Transactions of the Association for Computational Linguistics, 10:1138–1158.
- Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.
- Gardner and Dorling (1998) Matt W Gardner and SR Dorling. 1998. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric environment, 32(14-15):2627–2636.
- Goyal et al. (2022) Navita Goyal, Roodram Paneri, Ayush Agarwal, Udit Kalani, Abhilasha Sancheti, and Niyati Chhaya. 2022. CaM-Gen: Causally aware metric-guided text generation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2047–2060, Dublin, Ireland. Association for Computational Linguistics.
- Han et al. (2020) Fred X Han, Di Niu, Haolan Chen, Weidong Guo, Shengli Yan, and Bowei Long. 2020. Meta-learning for query conceptualization at web scale. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3064–3073.
- Hu and Li (2021) Zhiting Hu and Li Erran Li. 2021. A causal lens for controllable text generation. In Advances in Neural Information Processing Systems.
- Ji et al. (2020) Jie Ji, Bairui Chen, and Hongcheng Jiang. 2020. Fully-connected lstm–crf on medical concept extraction. International Journal of Machine Learning and Cybernetics, 11(9):1971–1979.
- Jiang et al. (2017) Meng Jiang, Jingbo Shang, Taylor Cassidy, Xiang Ren, Lance M Kaplan, Timothy P Hanratty, and Jiawei Han. 2017. Metapad: Meta pattern discovery from massive text corpora. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 877–886.
- Jiang et al. (2020) Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
- Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR (Poster).
- Lange et al. (2022) Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2022. Clin-x: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain. Bioinformatics, 38(12):3267–3274.
- Li et al. (2021) Chenguang Li, Jiaqing Liang, Yanghua Xiao, and Haiyun Jiang. 2021. Towards fine-grained concept generation. IEEE Transactions on Knowledge and Data Engineering.
- Li et al. (2020) Lantian Li, Weizhi Xu, and Hui Yu. 2020. Character-level neural network model based on nadam optimization and its application in clinical concept extraction. Neurocomputing, 414:182–190.
- Li et al. (2022) Shaobo Li, Xiaoguang Li, Lifeng Shang, Zhenhua Dong, Chengjie Sun, Bingquan Liu, Zhenzhou Ji, Xin Jiang, and Qun Liu. 2022. How pre-trained language models capture factual knowledge? a causal-inspired analysis. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1720–1732, Dublin, Ireland. Association for Computational Linguistics.
- Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.
- Liu et al. (2021a) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.
- Liu et al. (2021b) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. Gpt understands, too. arXiv preprint arXiv:2103.10385.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Lu et al. (2022) Yujie Lu, Weixi Feng, Wanrong Zhu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, and William Yang Wang. 2022. Neuro-symbolic causal language planning with commonsense prompting. arXiv preprint arXiv:2206.02928.
- Luo et al. (2021) Xuan Luo, Yiping Yin, Yice Zhang, and Ruifeng Xu. 2021. A privacy knowledge transfer method for clinical concept extraction. In International Conference on AI and Mobile Services, pages 18–28. Springer.
- Luo et al. (2020) Xusheng Luo, Luxin Liu, Yonghua Yang, Le Bo, Yuanpeng Cao, Jinghang Wu, Qiang Li, Keping Yang, and Kenny Q Zhu. 2020. Alicoco: Alibaba e-commerce cognitive concept net. In Proceedings of the 2020 ACM SIGMOD international conference on management of data, pages 313–327.
- Maulik and Bandyopadhyay (2002) Ujjwal Maulik and Sanghamitra Bandyopadhyay. 2002. Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on pattern analysis and machine intelligence, 24(12):1650–1654.
- Nie et al. (2020) Yuyang Nie, Yuanhe Tian, Yan Song, Xiang Ao, and Xiang Wan. 2020. Improving named entity recognition with attentive ensemble of syntactic information. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4231–4245, Online. Association for Computational Linguistics.
- Paranjape et al. (2022) Bhargavi Paranjape, Matthew Lamm, and Ian Tenney. 2022. Retrieval-guided counterfactual generation for QA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1670–1686, Dublin, Ireland. Association for Computational Linguistics.
- Pearl (2009) Judea Pearl. 2009. Causality. Cambridge university press.
- Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2017. Elements of causal inference: foundations and learning algorithms. The MIT Press.
- Schick and Schutze (2021) Timo Schick and Hinrich Schutze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269.
- Schick and Schütze (2021) Timo Schick and Hinrich Schütze. 2021. Few-shot text generation with natural language instructions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 390–402, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Schölkopf (2022) Bernhard Schölkopf. 2022. Causality for machine learning. In Probabilistic and Causal Inference: The Works of Judea Pearl, pages 765–804.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Vijaymeena and Kavitha (2016) MK Vijaymeena and K Kavitha. 2016. A survey on similarity measures in text mining. Machine Learning and Applications: An International Journal, 3(2):19–28.
- Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. Advances in neural information processing systems, 28.
- Von Luxburg (2007) Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416.
- Weber et al. (2020) Noah Weber, Rachel Rudinger, and Benjamin Van Durme. 2020. Causal inference of script knowledge. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7583–7596, Online. Association for Computational Linguistics.
- Wu et al. (2012) Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q Zhu. 2012. Probase: A probabilistic taxonomy for text understanding. In Proceedings of the 2012 ACM SIGMOD international conference on management of data, pages 481–492.
- Xu et al. (2017) Bo Xu, Yong Xu, Jiaqing Liang, Chenhao Xie, Bin Liang, Wanyun Cui, and Yanghua Xiao. 2017. Cn-dbpedia: A never-ending chinese knowledge extraction system. In Proc. of IEA-AIE, pages 428–438. Springer.
- Yang et al. (2020) Xi Yang, Jiang Bian, William R Hogan, and Yonghui Wu. 2020. Clinical concept extraction using transformers. Journal of the American Medical Informatics Association, 27(12):1935–1942.
- Yuan et al. (2022) Siyu Yuan, Deqing Yang, Jiaqing Liang, Zhixu Li, Jinxi Liu, Jingyue Huang, and Yanghua Xiao. 2022. Generative entity typing with curriculum learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3061–3073, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Yuan et al. (2021a) Siyu Yuan, Deqing Yang, Jiaqing Liang, Jilun Sun, Jingyue Huang, Kaiyan Cao, Yanghua Xiao, and Rui Xie. 2021a. Large-scale multi-granular concept extraction based on machine reading comprehension. In International Semantic Web Conference, pages 93–110. Springer.
- Yuan et al. (2021b) Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021b. Bartscore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277.
- Zhang et al. (2019) Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.
- Zhong et al. (2021) Peixiang Zhong, Di Wang, Pengfei Li, Chen Zhang, Hao Wang, and Chunyan Miao. 2021. Care: Commonsense-aware emotional response generation with latent concepts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 14577–14585.
Appendix A Theoretical Details of Causal Framework
A.1 Preliminaries
SCM
The Structural Causal Model (SCM) is associated with a graphical causal model that describes the relevant variables in a system and how they interact with each other. An SCM consists of a set of nodes representing the variables and a set of edges between the nodes as functions describing the causal relations. Figure 3 shows the SCM that describes the PLM-based CE system. Here the input text $X$ serves as the treatment, and the extracted concept span $Y$ is the outcome. In our SCM, the pre-trained knowledge $K$ is a cause of both $X$ and $Y$, and thus is a confounder. A confounder can open backdoor paths (i.e., $X \leftarrow K \rightarrow Y$) and cause a spurious correlation between $X$ and $Y$. To control the confounding bias, intervention techniques with the do-operator can be applied to cut off the backdoor paths.
Causal Intervention
To identify the true causal effect of $X \rightarrow Y$, we can adopt the causal intervention, which fixes the input $X=x$ and removes the correlation between $X$ and its precedents, denoted as $do(X=x)$. In this way, the true causal effect of $X \rightarrow Y$ can be represented as $P(Y \mid do(X=x))$. The backdoor adjustment and the frontdoor adjustment are two operations to implement interventions and obtain $P(Y \mid do(X=x))$.
Next, we will elaborate on the details of the two operations.
A.2 The Backdoor Adjustment
The backdoor adjustment is an essential tool for causal intervention. For our SCM, conditioning on the pre-trained knowledge $K$ blocks the backdoor path between $X$ and $Y$; the causal effect of $X$ on $Y$ can then be calculated by:
$$P(Y \mid do(X=x)) = \sum_{k} P(Y \mid X=x, K=k)\,P(K=k) \quad (11)$$
where $P(Y \mid do(X=x))$ is the probability after applying the do-operator, and $P(K=k)$ needs to be estimated from data or given as a prior. However, it is intractable to observe the pre-training data and obtain the prior distribution of the pre-trained knowledge. Therefore, the backdoor adjustment is not applicable in our case.
A.3 The Frontdoor Adjustment
The frontdoor adjustment is a complementary approach to applying the intervention when we cannot identify any set of variables that obey the backdoor adjustment.
In our SCM, we aim to estimate the direct effect of $X$ on $Y$ while being unable to directly measure the pre-trained knowledge $K$. Thus, we introduce a topic prompt $P$ as a mediator, and then the frontdoor adjustment can adopt a two-step do-operation to mitigate the bias.
Step 1
As illustrated in §3.3, we first analyze the causal effect $X \rightarrow P$. Since the collider $Y$ blocks the path $X \leftarrow K \rightarrow Y \leftarrow P$, there is no backdoor path from $X$ to $P$. Thus we can obtain the conditional probability as (same as Eq.1):
$$P(P=p \mid do(X=x)) = P(P=p \mid X=x) \quad (12)$$
To explain Step 1, we take the entity Louisa May Alcott with its abstract as an example. We can assign the topic person as a prompt to make the PLM-based extractor alleviate the spurious correlation between Louisa May Alcott and novel, and concentrate on extracting person-type concepts.
Step 2
In this step, we investigate the causal effect $P \rightarrow Y$, which contains a backdoor path $P \leftarrow X \leftarrow K \rightarrow Y$ from $P$ to $Y$. Since the data distribution of $X$ can be observed, we can block the backdoor path through $X$:
$$P(Y \mid do(P=p)) = \sum_{x'} P(Y \mid P=p, X=x')\,P(X=x') \quad (13)$$
where $P(X=x')$ can be obtained from the distribution of the input data, and $P(Y \mid P=p, X=x')$ is the conditional probability of the extracted span given the abstract with a topic prompt, which can be estimated by the concept extractor.
Now we can chain the two steps to obtain the causal effect $X \rightarrow Y$:
$$P(Y \mid do(X=x)) = \sum_{p} P(P=p \mid X=x) \sum_{x'} P(Y \mid P=p, X=x')\,P(X=x') \quad (14)$$
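A minimal sketch of how the two estimators can be chained at inference time to approximate Eq.14; `topic_probs` and `extract_with_prompt` stand for the topic classifier and the prompt-based extractor and are hypothetical names, and collapsing the sum over x' to the single observed input is a simplifying assumption:

```python
from collections import defaultdict

def frontdoor_concept_scores(x, topics, topic_probs, extract_with_prompt):
    """Approximate P(Y | do(X=x)) ≈ sum_p P(p|x) * P(Y | p, x),
    treating P(X=x) as concentrated on the observed input text."""
    p_topic = topic_probs(x)                      # {topic: P(topic | x)}
    scores = defaultdict(float)
    for tp in topics:
        for span, conf in extract_with_prompt(tp, x).items():
            scores[span] += p_topic[tp] * conf
    return dict(scores)                           # span -> adjusted score
```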
Appendix B Detailed Information about KPCE
Topic | Corresponding Concept Examples |
Person | politicians, teachers, doctors |
Book | romance novels, art books, fantasy novels |
Location | towns, villages, scenic spots |
Film and TV | movies, animation, TV dramas |
Language | idioms, medical terms, cultural terms |
Game | electronic games, web games, mobile games |
Creature | plants, animals, bacteria |
Food | Indian cuisine, Japanese cuisine |
Website | government websites, corporate websites |
Music | singles, songs, pop music |
Software | application software, system software |
Folklore | social folklore, belief folklore |
Organization | companies, brands, universities |
History | ancient history, modern history |
Disease | digestive system disease, oral disease |
Technology | technology products, electrical appliances |
Medicine | Chinese medicine, Western medicine |
B.1 Identifying Topic for Each Entity
The 17 typical topics and their corresponding concepts are listed in Table 6. We predict the topic of the entity as one of the 17 typical topics using a transformer encoder-based topic classifier. We randomly fetch 40,000 entities together with their existing concepts in the KG. According to the concept clustering results, we can assign each topic to the entities. Specifically, we concatenate $E$ and $T$ as the input to the classifier. With multi-headed self-attention operations over the input token sequence, the classifier takes the final hidden state (vector) of the token [CLS], i.e., $\mathbf{h}_{[CLS]}^{N} \in \mathbb{R}^{d}$, to compute the topic probability distribution, where $N$ is the total number of layers and $d$ is the vector dimension. Then, we identify the topic with the highest probability as the topic of $E$, which is calculated as follows,
$$\mathbf{H}^{0} = \mathbf{W}_{emb} \quad (15)$$
$$\mathbf{H}^{n} = \mathrm{TransformerEncoder}(\mathbf{H}^{n-1}), \; 1 \le n \le N \quad (16)$$
$$P(tp_i \mid X) = \mathrm{softmax}\big(\mathbf{W}_{2}\,\mathrm{ReLU}(\mathbf{W}_{1}\mathbf{h}_{[CLS]}^{N})\big)_i \quad (17)$$
$$tp = \arg\max_{tp_i} P(tp_i \mid X) \quad (18)$$
where $\mathbf{W}_{emb}$ is the random initial embedding matrix of all input tokens and $d$ is the embedding size. $\mathbf{H}^{n}$ is the hidden matrix of the $n$-th layer. $\mathbf{h}_{[CLS]}^{N}$ is obtained from $\mathbf{H}^{N}$. Furthermore, $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ are both training parameters.

B.2 An Example for Point Network
As mentioned in §4.2, we adopt a pointer network to achieve multi-grained concept extraction Yuan et al. (2021a). The model generates a ranking list of candidate concepts (spans) along with their confidence scores and outputs the concepts whose confidence scores are greater than the selection threshold. Note that one span may be output repeatedly as a subsequence of multiple extracted concepts under an appropriate selection threshold.
For example, as shown in Figure 7, writer is extracted multiple times as the subsequence of three concepts of different granularity when the confidence score threshold is set to 0.30. Therefore, the pointer network enables our framework to extract multi-grained concepts.
Appendix C Experiment Detail
C.1 Baselines
We compare KPCE with seven baselines. Most of the compared models are extraction models designed for other extraction tasks, including Named Entity Recognition (NER), Relation Extraction (RE), and Open Information Extraction (Open IE). In addition, we also compare a pattern-based approach. However, we do not compare ontology extension models or generation models, since neither fits our scenario. Since entity typing models cannot find new concepts, they are also excluded from our comparison. Please note that, except for MRC-CE, the other baselines applied in concept extraction cannot extract multi-grained concepts.
- Hearst Patterns: a pattern-matching approach that extracts concepts from free text with hand-crafted patterns; the patterns we use for each dataset are listed in the table below.
- FLAIR Akbik et al. (2019): It is a novel NLP framework that combines different word and document embeddings to achieve excellent results. FLAIR can also be employed for concept extraction since it can extract spans from the text.
- XLNet Yang et al. (2020): With the capability of modeling bi-directional contexts, this model can extract clinical concepts effectively.
- KVMN Nie et al. (2020): As a sequence labeling model, KVMN is proposed to handle NER by leveraging different types of syntactic information through an attentive ensemble.
- XLM-R Conneau et al. (2020): It is a cross-lingual PLM pre-trained on a large multilingual corpus, which can be fine-tuned to extract concept spans from the text.
- BBF Luo et al. (2021): It is an advanced version of BERT built with Bi-LSTM and CRF. With optimal token embeddings, it can extract high-quality medical and clinical concepts.
- GACEN Fang et al. (2021): The model incorporates topic information into feature representations and adopts a neural network to pre-train a soft matching module to capture semantically similar tokens.
- MRC-CE Yuan et al. (2021a): MRC-CE handles the concept extraction problem as a Machine Reading Comprehension (MRC) task built with an MRC model based on BERT. It can find abundant new concepts and handle the problem of concept overlap well with a pointer network.
Dataset | Pattern
CN-DBpedia | X is Y
 | X is one of Y
 | X is a type/kind of Y
 | X belongs to Y
 | Y is located/founded in …
Probase | X is a Y that/which/who
 | X is one of Y
 | X refers to Y
 | X is a member/part/form … of Y
 | As Y, X is …
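As a rough illustration of how such patterns can be applied, a simplified regular-expression stand-in for one English pattern (real patterns are richer and language-specific):

```python
import re

# Simplified "X is a/an/one of Y that/which/who ..." pattern
PATTERN = re.compile(r"\bis (?:an|a|one of) ([\w\- ]+?)(?: that| which| who|[,.])")

def hearst_extract(abstract: str):
    return [m.strip() for m in PATTERN.findall(abstract)]

print(hearst_extract("Louisa May Alcott is an American novelist who wrote Little Women."))
# -> ['American novelist']
```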
C.2 Experiment Settings
Our experiments are conducted on a workstation with dual GeForce GTX 1080 Ti GPUs and 32GB memory, under torch 1.7.1. We adopt a BERT-base model with 12 layers and 12 self-attention heads as the topic classifier and concept extractor in KPCE. The training settings of our topic classifier are: d = 768, batch size = 16, learning rate = 3e-5, dropout rate = 0.1 and training epochs = 2. The training settings of our concept extractor are: d = 768, m = 30, batch size = 4, learning rate = 3e-5, dropout rate = 0.1 and training epochs = 2. The control parameter $\lambda$ in Eq.10 is set to 0.3, and the selection threshold of candidate spans in the concept extractor is set to 0.12 based on our parameter tuning.
C.3 Human Assessment
Some extracted concepts do not exist in the KG and thus cannot be automatically assessed. Therefore, we invite volunteers to assess whether the extracted concepts are correct for the given entities. An extracted concept is denoted as an EC (existing concept) if it already exists in the KG for the given entity, and as an NC (new concept) if it is a correct concept that does not exist in the KG for the given entity. We employ four annotators in total to ensure the quality of the assessment. All annotators are native Chinese speakers and proficient in English. Each concept is labeled with 0, 1 or 2 by three annotators, where 0 means a wrong concept for the given entity, while 1 and 2 represent EC and NC, respectively. If the results from the three annotators differ, the fourth annotator is hired for a final check. We protect the privacy rights of the annotators and pay them above the local minimum wage.
C.4 Other Knowledge Injection Methods
As mentioned before, the topics of the knowledge-guided prompt come from external KGs, which are better than keyword-based topics derived from the text at guiding BERT to achieve accurate concept extraction.
To justify this, we compared KPCE with another variant, namely KPCE LDA, where the topics are the keywords obtained by running Latent Dirichlet Allocation (LDA) Blei et al. (2001) over all entities' abstracts. Specifically, the optimal number of LDA topic classes was also determined as 17 through our tuning study. For a given entity, its topic is identified as the keyword with the highest probability in its topic class. Besides, we also compared KPCE with ERNIE. ERNIE Zhang et al. (2019) adopts entity-level masking and phrase-level masking to learn language representations. During the pre-training of ERNIE, all words of the same entity mention or phrase are masked. In this way, ERNIE can implicitly learn prior knowledge of phrases and entities, such as relationships between entities and types of entities, and thus has better generalization and adaptability.
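A sketch of how the keyword topics for KPCE LDA can be derived with scikit-learn (17 topic classes as above; the vectorizer settings are assumptions):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_keyword_topics(abstracts, n_topics=17, top_k=1):
    vec = CountVectorizer(max_features=50000, stop_words="english")
    counts = vec.fit_transform(abstracts)
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    random_state=0).fit(counts)
    vocab = vec.get_feature_names_out()
    # keyword(s) with the highest probability in each topic class
    keywords = [[vocab[i] for i in comp.argsort()[-top_k:][::-1]]
                for comp in lda.components_]
    doc_topic = lda.transform(counts).argmax(axis=1)  # topic class per abstract
    return keywords, doc_topic
```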
The comparison results are listed in Table 4, which shows that our design of the knowledge-guided prompt in KPCE exploits the value of external knowledge more thoroughly than the other two schemes.