
Efficient Relation-aware Scoring Function Search for Knowledge Graph Embedding

Shimin DI¹, Quanming YAO²,³, Yongqi ZHANG², Lei CHEN¹
¹The Hong Kong University of Science and Technology, Hong Kong SAR, China
²4Paradigm Inc., Hong Kong SAR, China
³Department of Electronic Engineering, Tsinghua University, Beijing, China
{sdiaa,leichen}@cse.ust.hk, {yaoquanming,zhangyongqi}@4paradigm.com
Abstract

The scoring function, which measures the plausibility of triplets in knowledge graphs (KGs), is the key to ensuring the excellent performance of KG embedding, and its design is an important problem in the literature. Automated machine learning (AutoML) techniques have recently been introduced into KG to design task-aware scoring functions, which achieve state-of-the-art performance in KG embedding. However, the effectiveness of the searched scoring functions is still not as good as desired. In this paper, observing that existing scoring functions can exhibit distinct performance on different semantic patterns, we are motivated to explore such semantics by searching relation-aware scoring functions. However, relation-aware search requires a much larger search space than the previous one. Hence, we propose to encode the space as a supernet and propose an efficient alternating minimization algorithm to search through the supernet in a one-shot manner. Finally, experimental results on benchmark datasets demonstrate that the proposed method can efficiently search relation-aware scoring functions and achieve better embedding performance than state-of-the-art methods.
¹The work was done when S. Di was an intern at 4Paradigm Inc., mentored by Q. Yao. Correspondence is to Q. Yao.

Index Terms:
Knowledge Graph, Knowledge Graph Embedding, Neural Architecture Search, Automated Machine Learning

I Introduction

Knowledge Graph (KG) [1, 2], as one of the most effective ways to explore and organize a knowledge base, applies to various problems, such as question answering [3], recommendation [4], and few-shot learning [5]. In KGs, every edge represents a knowledge triplet in the form of (head entity, relation, tail entity), or $(h,r,t)$ for simplicity. Several crucial tasks in KGs, such as link prediction and triplet classification [1, 2], aim to verify whether a given triplet represents a true fact. KG embedding has been proposed to address this issue. Basically, KG embedding aims to embed entities $h,t$ and relations $r$ into a low-dimensional vector space, such as $\bm{h},\bm{r},\bm{t}\in\mathbb{R}^{d}$. Then, based on the embeddings, a scoring function $f$ is employed to compute a score $f(\bm{h},\bm{r},\bm{t})$ that indicates whether a triplet $(h,r,t)$ is a fact. Triplets with higher scores are more likely to be facts.

In the last decade, various scoring functions have been proposed to significantly improve the quality of embeddings [6, 1, 2]. TransE [7], as a representative scoring function, interprets the relation $r$ as a translation from head entity $h$ to tail entity $t$, and optimizes the embeddings to satisfy $\bm{h}+\bm{r}=\bm{t}$. However, TransE [7] and its variants [8, 9] are not fully expressive and their empirical performance is inferior to the others, as mentioned in [10]. Recently, some works [11, 12, 13] employ neural networks to design universal scoring functions. But these scoring functions are not well-regularized for the KG properties and are expensive when making predictions [14]. Furthermore, bilinear models (BLMs) [15, 16, 17, 18, 19] have been proposed to compute the score as the weighted sum of pairwise interactions of embeddings. Currently, BLMs are the most powerful in terms of both empirical results and theoretical guarantees [19] on expressiveness [10, 17].

Generally, the models in BLMs share the form $f(\bm{h},\bm{r},\bm{t})=\bm{h}^{\top}g(\bm{r})\bm{t}$, where $g(\bm{r})$ returns a square matrix determined by the relation $\bm{r}$. DistMult [15] regularizes $g(\bm{r})$ to be diagonal, i.e., $g(\bm{r})=\text{diag}(\bm{r})$, in order to alleviate the overfitting problem. ComplEx [16] extends the embeddings to complex values. SimplE [17] is another variant that regularizes the matrix $g(\bm{r})$ with a simpler but valid constraint. More recently, TuckER [20] extends BLMs to tensor models for KG embedding. However, designing scoring functions is still challenging because of the diversity of KGs [2]. A scoring function that performs well on one task may not adapt well to other tasks, since different KGs usually exhibit distinct patterns [14], especially relation patterns [2].
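For concreteness, the following minimal sketch (not the authors' code) shows how two such BLMs score a triplet from its embeddings; the embedding dimension and the random inputs are purely illustrative:

```python
import torch

def distmult_score(h, r, t):
    # DistMult: f(h, r, t) = <h, r, t> = sum_i h_i * r_i * t_i,
    # i.e. g(r) = diag(r) in the bilinear form h^T g(r) t.
    return (h * r * t).sum(-1)

def complex_score(h, r, t):
    # ComplEx: treat the first/second half of each embedding as the
    # real/imaginary part and score with Re(<h, r, conj(t)>).
    h_re, h_im = h.chunk(2, dim=-1)
    r_re, r_im = r.chunk(2, dim=-1)
    t_re, t_im = t.chunk(2, dim=-1)
    return (h_re * r_re * t_re + h_im * r_re * t_im
            + h_re * r_im * t_im - h_im * r_im * t_re).sum(-1)

# toy usage on random embeddings (d = 8)
h, r, t = torch.randn(3, 8)
print(distmult_score(h, r, t), complex_score(h, r, t))
```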

Recently, automated machine learning (AutoML) [21, 22], as demonstrated via automated hyperparameter tuning [23] and neural architecture search (NAS) [24], has shown to be very useful in the design of machine learning models. Inspired by such a success, a pioneered work, AutoSF [14], has proposed to search an appropriate scoring function for any given KG data using AutoML techniques. AutoSF first defines the manual scoring functions design problem as a scoring function search problem and then proposes a search algorithm. It empirically shows that the searched scoring functions are KG dependent and outperform the state-of-the-art ones designed by human experts. In short, AutoSF can search for proper scoring functions, which depend on the given KG data and evaluation metric. In other words, AutoSF is task-aware, while traditional scoring functions are not.

TABLE I: Summary of existing scoring functions. Expressiveness measures whether a scoring function can handle the common patterns in KGs: symmetry, anti-symmetry, general asymmetry, and inversion. The inference cost on a single triplet is given w.r.t. the embedding dimension $d$; $N_{e}$ and $N_{r}$ denote the number of entities and relations, respectively. Expressive/Task-aware/Relation-aware reflect effectiveness; Inference time/Model complexity reflect efficiency.
Type | Scoring function | Expressive | Task-aware | Relation-aware | Inference time | Model complexity
Hand-designed | TransE [7] | × | × | × | $O(d)$ | $O(N_{e}d+N_{r}d)$
Hand-designed | DistMult [15] | × | × | × | $O(d)$ | $O(N_{e}d+N_{r}d)$
Hand-designed | NTN [11] | √ | × | × | $O(d^{2})$ | $O(N_{e}d+N_{r}d^{2})$
Hand-designed | TuckER [20] | √ | × | × | $O(d^{3})$ | $O(d^{3}+N_{e}d+N_{r}d)$
Hand-designed | ComplEx [16] | √ | × | × | $O(d)$ | $O(N_{e}d+N_{r}d)$
Hand-designed | HypER [25] | √ | × | × | $O(d^{2})$ | $O(N_{e}d+N_{r}d)$
Searched (by AutoML) | AutoSF [14] | √ | √ | × | $O(d)$ | $O(N_{e}d+N_{r}d)$
Searched (by AutoML) | ERAS | √ | √ | √ | $O(d)$ | $O(N_{e}d+N_{r}d)$

However, the task-aware method AutoSF still follows the classic approach of forcing all relations to share one scoring function. This is not relation-aware and may cause effectiveness issues. Generally, common KG relations can be roughly categorized into different patterns based on their semantic properties, such as symmetry [15], anti-symmetry [16, 26], general asymmetry [18], and inversion [17]. Traditional models design universal scoring functions to cover more and more relation patterns. For example, DistMult [15] only handles symmetric relations, while TransE [7] covers the other three kinds of relations but not symmetric ones. Furthermore, ComplEx [16] and SimplE [17] are able to cover all four common relation patterns. Intuitively, the more patterns a scoring function covers, the stronger its ability to learn KGs. However, there are potential risks in the pursuit of universal scoring functions. A universal scoring function may not perform well on certain relation patterns even though it can handle all kinds of relations [27]. For instance, it has been reported in [28] that HolEX [29] achieves unsatisfactory performance on symmetric relations in the FB15k data set [7], despite HolEX being a universal scoring function. This indicates that forcing all relations to share one scoring function may not fully express the interactions between entities under different relations. As compared in Table I, none of the existing methods covers all aspects of effectiveness.

Unfortunately, it is hard to extend the task-aware method (i.e., AutoSF [14]) to be relation-aware due to the efficiency issue. AutoSF adopts a progressive greedy search approach to find a universal scoring function. It requires separately training hundreds of scoring functions to convergence, which incurs substantial computational overhead. In general, AutoSF takes more than one GPU day to search on the smallest benchmark data set WN18RR, and more than 9 GPU days on the larger data set YAGO. Moreover, the relation-aware search problem requires a much larger search space than that of AutoSF. Therefore, a much more efficient search algorithm is needed for relation-aware search.

In this paper, to address the issues mentioned above, we propose the Efficient search method for Relation-Aware Scoring functions (ERAS) in KG embedding. We propose to search multiple scoring functions, which are expressive, task-aware, and relation-aware, for the common relation patterns in any given KG data. More concretely, we propose a supernet to model the relation-aware search space and introduce an efficient search approach over the supernet. We suggest sharing KG embeddings on the supernet, so as to avoid training hundreds of candidate scoring functions from scratch as AutoSF does. In summary, our contributions are as follows:

  • Previous works mainly emphasize the expressiveness of scoring functions, which also motivates AutoSF to design task-aware scoring functions. However, they ignore that scoring functions should also be relation-aware as they model the semantics of relations. In this paper, to address such a problem, we propose an AutoML-based method to design relation-aware scoring functions.

  • We define a novel supernet representation to model the relation-aware search space, where relations are assigned to different groups and each group has a unique scoring function. The simple but effective supernet not only enables us to share KG embeddings to significantly accelerate the search, but also protects the search from the negative effects of parameter sharing (i.e., the biased evaluation problem discussed in Section IV-D2).

  • Inspired by one-shot architecture search (OAS) algorithms, we propose a stochastic algorithm, i.e., ERAS, that is efficient and suitable for the automated scoring function search task. It optimizes the search problem through alternating minimization, where embeddings are stochastically updated in the supernet, groups are assigned by Expectation-Maximization clustering, and scoring functions are updated by reinforcement learning.

  • We conduct extensive experiments on five popular benchmark data sets for the link prediction and triplet classification tasks. Experimental results demonstrate that ERAS achieves state-of-the-art performance by designing relation-aware scoring functions. In particular, ERAS consistently outperforms existing methods at the relation-type level of a given KG. Besides, the search is much more efficient than AutoSF and other popular automated methods.

II Related Works

TABLE II: Notations.
Symbols | Meanings
$E, R$ | The entity set and relation set.
$S$ | The KG triplets $\{(h,r,t)\}$, where $h,t\in E$ and $r\in R$.
$\langle\bm{h},\bm{r},\bm{t}\rangle$ | The triple-dot product $\langle\bm{h},\bm{r},\bm{t}\rangle=\sum_{i}h_{i}\cdot r_{i}\cdot t_{i}$.
$\bm{\omega}=\{\bm{E},\bm{R}\}$ | The set of embeddings $\bm{E}\in\mathbb{R}^{N_{e}\times d}$ and $\bm{R}\in\mathbb{R}^{N_{r}\times d}$.
$f$ | The scoring function, e.g., $f(\bm{h},\bm{r},\bm{t})$.
$\mathcal{M}$ | The performance measurement.
$M, N$ | The number of relation blocks and relation groups.
$\mathcal{O}$ | The operation set $\mathcal{O}\equiv\{\mathbf{0},\pm\bm{r}_{1},\cdots,\pm\bm{r}_{M}\}$.
$\bm{A}$ | The architecture weights.
$\bm{B}$ | The relation-assignment weights.

II-A Neural Architecture Search (NAS)

II-A1 General Principles

Generally, Neural Architecture Search (NAS) [21, 24, 22] is formulated as a bi-level optimization problem [30], where the neural architectures are updated in the upper level and the model parameters are trained in the lower level. Accordingly, three important aspects should be considered in NAS:

  • Search space: it defines what network architectures can in principle be searched, e.g., Convolutional Neural Networks (CNNs) [31] and Recurrent Neural Networks (RNNs) [32]. A well-defined search space should be expressive enough to contain powerful models, but it should not be too large to search.

  • Search algorithm: it aims to search efficiently in the above space, e.g., Bayesian optimization [33], reinforcement learning [24], or evolutionary algorithms [34]. A search algorithm is required to perform an efficient search over the search space and to find architectures that achieve good performance.

  • Evaluation mechanism: it determines how to evaluate the searched architectures in the search strategy. Fast and accurate evaluation of candidate architectures can significantly boost the search efficiency.

Unfortunately, classic NAS methods [34, 24] are computationally expensive since candidate architectures are trained in a stand-alone manner, i.e., many architectures are trained from scratch to convergence and evaluated separately.

II-A2 One-shot Architecture Search (OAS)

More recently, One-shot Architecture Search (OAS) methods [35, 36, 37] have been proposed to significantly reduce the search cost of NAS. OAS first represents the whole search space by a supernet [35], which is formed as a directed acyclic graph (DAG) whose nodes are the operations in neural networks (e.g., $3\times 3$ convolutions in CNNs). Every neural architecture in the space can be represented by a path in the DAG. Then, instead of training independent model weights for each candidate architecture as in the stand-alone approach, OAS keeps weights for the supernet and forces different architectures to share the same weights whenever they share the same edges in the DAG (i.e., parameter sharing [35, 38]). In this way, architectures can be searched by training the supernet once (i.e., in a one-shot manner), which makes NAS much faster.

Generally, OAS methods unify the search space in the form of a DAG but differ in how they search for the optimal subgraph of the DAG. Sampling-based OAS (e.g., ENAS [35]) employs a controller to sample architectures and searches for an optimal subgraph of the whole DAG that maximizes the expected reward on the validation set. Instead of involving controllers, differentiable OAS (e.g., DARTS [36] and NASP [37]) relaxes the search space to be continuous so that the architectures can be optimized by gradient descent. However, differentiable OAS may not be able to derive an architecture with high evaluation performance when the evaluation metric is not differentiable. In comparison, sampling-based OAS is more suitable for the non-differentiable scenario since it utilizes the policy gradient [39] to optimize the controller.

TABLE III: Hit@1 (in %) results for existing scoring functions on the link prediction task.
SF Type Methods Symmetric relations Anti-symmetric relations
WN18 WN18RR FB15k FB15k237 WN18 WN18RR FB15k FB15k237
Non-universal TransE [7] 0.0 0.0 0.0 5.0 51.0 3.0 55.0 27.0
DistMult [15] 93.0 90.0 73.0 7.0 65.0 9.0 74.0 25.0
Universal ConvE [12] 93.0 93.0 42.0 1.0 94.0 6.0 61.0 25.0
TuckER [20] 94.0 93.0 67.0 2.0 95.0 12.0 73.0 22.0
ComplEx [16] 94.0 94.0 88.0 2.0 95.0 11.0 80.0 23.0
SimplE [17] 92.0 93.0 74.0 5.0 94.0 5.0 64.0 13.0
Analogy [18] 93.0 92.0 52.0 6.0 93.0 2.0 66.0 27.0
AutoSF [14] 93.2 93.5 85.8 5.7 94.8 11.5 81.1 26.7

II-B AutoSF: Searching Task-aware Scoring Functions

Given a KG, choosing a suitable scoring function from the above manually designed methods is largely empirical. To better adapt to different KG tasks, AutoSF [14] leverages AutoML techniques to design and customize a proper scoring function for the given KG.

II-B1 Search Problem

Motivated by the expressiveness guarantee and computational efficiency of BLMs, AutoSF proposes to partition the embeddings $\bm{h},\bm{r},\bm{t}\in\mathbb{R}^{d}$ into $M$ blocks (e.g., $\bm{h}=[\bm{h}_{1};\cdots;\bm{h}_{M}]$ with $\bm{h}_{i}\in\mathbb{R}^{d/M}$), and represents scoring functions as:

$$f(\bm{h},\bm{r},\bm{t})=\sum\nolimits_{i=1}^{M}\sum\nolimits_{j=1}^{M}\langle\bm{h}_{i},\bm{o},\bm{t}_{j}\rangle, \quad (1)$$

where $\bm{o}\in\mathcal{O}$ with $\mathcal{O}\equiv\{\mathbf{0},\pm\bm{r}_{1},\cdots,\pm\bm{r}_{M}\}$. Note that $\langle\bm{h}_{i},\bm{o},\bm{t}_{j}\rangle$ computes the triple-dot product and is named a multiplicative item. Previous successful scoring functions [40, 15, 18, 16, 17] can then be unified in $f$ with different choices of $\bm{o}$ [14], which indicates that $f$ is general enough to represent good manually designed scoring functions. In this way, AutoSF generalizes from human wisdom and allows the discovery of better scoring functions that have not been explored in the literature. Subsequently, the search problem is defined as:

Definition 1 (AutoSF problem [14])

Let $\bar{f}$ denote the desired scoring function and $\mathcal{F}_{a}$ denote the set of all possible scoring functions expressible by (1). Then the scoring function search problem is defined as follows:

$$\bar{f}=\arg\max\nolimits_{f\in\mathcal{F}_{a}}\mathcal{M}_{\text{val}}\left(f,\bar{\bm{\omega}};S_{\text{val}}\right),\quad\text{s.t. }\ \bar{\bm{\omega}}=\arg\max\nolimits_{\bm{\omega}}\mathcal{M}_{\text{tra}}\left(f,\bm{\omega};S_{\text{tra}}\right),$$

where $\mathcal{M}_{\text{tra}}$ and $\mathcal{M}_{\text{val}}$ measure the performance of the scoring function $f$ and KG embeddings $\bm{\omega}$ on the corresponding KG data $S$ (i.e., the training set $S_{\text{tra}}$ and the validation set $S_{\text{val}}$), respectively.

Given the embeddings $\bar{\bm{\omega}}$ learned on the training data $S_{\text{tra}}$, AutoSF aims to search for a better scoring function $f$ that leads to higher performance on the validation set $S_{\text{val}}$. Hence, AutoSF can find task-aware scoring functions that achieve impressive performance on different KG tasks. However, it is non-trivial to efficiently search for a proper scoring function in AutoSF's search space due to its size $O((2M+1)^{M^{2}})$.
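To make the structure of this space concrete, here is a small illustrative sketch of the unified form (1), assuming $M=4$ blocks; the `structure` matrix encoding which (signed) relation block, if any, is placed at each position $(i,j)$ is a hypothetical example rather than a searched result:

```python
import torch

M, d = 4, 16                      # number of blocks, total embedding dimension
def blocks(x):                    # split an embedding into M equal blocks
    return x.chunk(M, dim=-1)

def unified_score(h, r, t, structure):
    # structure[i][j] is 0 (skip the (i, j) multiplicative item) or
    # +/-(k+1), meaning the operation +/- r_k is selected for that item.
    hb, rb, tb = blocks(h), blocks(r), blocks(t)
    score = torch.zeros(())
    for i in range(M):
        for j in range(M):
            o = structure[i][j]
            if o != 0:
                sign, k = (1.0 if o > 0 else -1.0), abs(o) - 1
                score = score + sign * (hb[i] * rb[k] * tb[j]).sum(-1)
    return score

# DistMult corresponds to the diagonal structure below; other choices of
# `structure` recover ComplEx, SimplE, Analogy, or entirely new scoring functions.
distmult_like = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
h, r, t = torch.randn(3, d)
print(unified_score(h, r, t, distmult_like))
```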

II-B2 Search Algorithm

Since a large number of possible scoring functions exist in the unified search space, AutoSF proposes a progressive greedy search algorithm that finds a proper scoring function according to the inductive rule:

$$f^{b}=f^{b-1}+\langle\bm{h}_{i},\bm{o},\bm{t}_{j}\rangle, \quad (2)$$

where $b$ is the budget of non-zero multiplicative items (i.e., $\bm{o}\neq\mathbf{0}$) in (1). The intuition behind (2) is to gradually add non-zero multiplicative items $\langle\bm{h}_{i},\bm{o},\bm{t}_{j}\rangle$ until the desired final $f$ is reached. Each greedy step of the search algorithm mainly contains two parts: sampling scoring functions and evaluating the sampled scoring functions. We summarize the search algorithm in Algorithm 1.

Algorithm 1 AutoSF: Progressive Greedy Search of Task-aware Scoring Functions
Require: $B$: the number of non-zero blocks.
1: for $b$ in $4,\cdots,B$ do
2:    Randomly select $N$ scoring functions $\{f^{b-1}\}$;
3:    Sample $N_{1}$ scoring functions $\{f^{b}\}$ by adding relation blocks to $\{f^{b-1}\}$ as $f^{b}=f^{b-1}+\langle\bm{h}_{i},\bm{o},\bm{t}_{j}\rangle$;
4:    Select the top-$K$ $\{f^{b}\}$ based on the predictor;
5:    Train the top-$K$ $\{f^{b}\}$ to convergence separately and update the predictor with the evaluated performance;
6: end for
7: return the scoring function $f^{B}$ with the highest performance.

However, there are several issues with AutoSF. First, the evaluation mechanism in AutoSF is inefficient. In every greedy step, AutoSF trains all candidate scoring functions under budget $b$ to convergence for performance evaluation, i.e., step 5 of Algorithm 1. The well-trained embeddings of all these scoring functions are then discarded in the next greedy step, wasting a lot of effort spent on training KG embeddings for performance evaluation. Moreover, within budget $b$, the predictor in the AutoSF search algorithm can only leverage prior experience from budgets smaller than $b$, i.e., step 4 of Algorithm 1, which may also incur unnecessary KG embedding training due to inaccurate prediction. Second, AutoSF pursues a universal scoring function over a given KG. A universal scoring function that can handle certain relation patterns does not necessarily perform well on them [28, 27]. This will be further discussed in Section III-A.

III Problem Definition

In this section, to further illustrate the motivation of relation-aware scoring functions, we first discuss the performance of existing scoring functions at the relation pattern level. Then, we formulate a relation-aware scoring function search problem.

III-A Motivation of Relation-aware Scoring Functions

As introduced in Section I, traditional scoring functions and AutoSF try to design universal scoring functions to cover as many relation patterns as possible. However, being expressive does not guarantee good performance, as relations exhibit different patterns [28, 27]. We summarize the experimental results reported in Figures 14 and 15 of [27] in Table III, which shows the Hit@1 performance (the higher the better) of popular scoring functions on symmetric and anti-symmetric relations (see Table IV for exemplar relations).

TABLE IV: Exemplar relations corresponding to relation patterns in the benchmark data sets (see Section V-A1 for details).
Relation Patterns WN18/WN18RR FB15k/FB15k237
Symmetric similar_to, synset_of spouse_of
Anti-symmetric hypernym, hyponym child_of

From Table III, it is worth noting that universal scoring functions may perform even worse than non-universal ones at the relation pattern level. First, TransE performs badly on symmetric relations on all benchmark data sets [7, 41, 12] since it cannot handle symmetric relations. Nevertheless, it achieves better performance on symmetric relations of FB15k237 [41] than several universal scoring functions, such as ConvE, TuckER, and ComplEx.

Second, DistMult only covers symmetric relations. Therefore, DistMult achieves good performance on symmetric relations, while it performs unsatisfactorily on anti-symmetric relations. However, as reported in Table III, there are several cases in which universal scoring functions perform worse than DistMult on anti-symmetric relations:

  1. ConvE, SimplE, and Analogy on WN18RR [12].
  2. ConvE, SimplE, and Analogy on FB15k [7].
  3. TuckER, ComplEx, SimplE, and AutoSF on FB15k237 [41].

In summary, universal scoring functions may achieve unsatisfactory performance on specific relation patterns of certain KGs. Such observations motivate us to design relation-aware scoring functions.

III-B Problem Formulation

Inspired by the observation in Section III-A and the task-aware method AutoSF, we here propose to search relation-aware scoring functions for different relation patterns over any given KG data.

Recall that AutoSF aims to find a scoring function $f$ that achieves high $\mathcal{M}(f,\bm{\omega};S)$ for given triplets $S$. In relation-aware search, it is also important to assign relations to appropriate groups to better cover relation patterns. Let $\bm{B}\in\{0,1\}^{N_{r}\times N}$ record the relation assignment, where $B_{rn}=1$ if $r\in R$ is assigned to the $n$-th group and $B_{rn}=0$ otherwise, and let $f_{n}$ denote the scoring function for relations in the $n$-th group. Then, relation-aware search aims to find a set of scoring functions $\{f_{n}\}$ and relation assignments $\bm{B}$ that achieve high $\mathcal{M}(\{f_{n}\},\bm{B},\bm{\omega};S)$. Formally, the problem studied in this paper is defined as:

Definition 2 (ERAS problem)

Let $N$ denote the total number of relation groups, and $\mathcal{F}_{e}$ denote the set of all possible relation-aware scoring functions. Then the relation-aware scoring function search problem is defined as:

$$\{\bar{f}_{n}\}_{n=1}^{N}=\arg\max\nolimits_{f_{n}\in\mathcal{F}_{e}}\ \mathcal{M}_{\text{val}}\left(\{f_{n}\}_{n=1}^{N},\bar{\bm{B}},\bar{\bm{\omega}};S_{\text{val}}\right), \quad (3)$$

$$\text{s.t. }\begin{cases}\bar{\bm{\omega}}=\arg\max\nolimits_{\bm{\omega}}\;\mathcal{M}_{\text{tra}}\left(\{f_{n}\}_{n=1}^{N},\bar{\bm{B}},\bm{\omega};S_{\text{tra}}\right)\\ \bar{\bm{B}}=\arg\min\nolimits_{\bm{B}}\;\mathcal{L}_{B}\left(\bm{B},\bm{\omega}\right)\end{cases}, \quad (4)$$

where $\mathcal{L}_{B}$ is the loss of relation assignments, and $\mathcal{M}_{\text{tra}}$, $\mathcal{M}_{\text{val}}$ measure the performance on the training set $S_{\text{tra}}$ and validation set $S_{\text{val}}$, respectively.

Generally, the relation-aware scoring function search problem is formulated as the bi-level optimization problem in Definition 2. Compared with a single-level objective, bi-level optimization allows the model to optimize parameters in different ways, which is more suitable for deriving scoring functions, embeddings, and relation assignments. This definition looks similar to Definition 1 for AutoSF, but it is quite different in essence. First, we need to assign relations to proper relation groups in the lower-level objective (4). Second, there are multiple target scoring functions $\{f_{n}\}_{n=1}^{N}$ for handling the $N$ relation groups. Therefore, the search space $\mathcal{F}_{e}$ of ERAS with size $O((2M+1)^{NM^{2}})$ is much larger than that of AutoSF with size $O((2M+1)^{M^{2}})$. The larger search space requires ERAS to have a more efficient search algorithm.

IV Search Algorithm

Here, we propose a new algorithm to solve the search problem in Definition 2. There are three types of parameters that need to be optimized simultaneously in (3), i.e.,

  • Group assignments: the relation assignment mechanism should be flexible enough to update $\bm{B}$ during the whole search procedure.

  • Architectures: we pursue $\{f_{n}\}_{n=1}^{N}$ with high performance, but it is difficult to maximize (3) because performance evaluation in KG embedding is usually non-differentiable.

  • Embeddings: training KG embeddings $\bm{\omega}$ to evaluate candidate scoring functions consumes substantial computation in AutoSF. It is essential to tackle this issue since ERAS has a much larger search space than AutoSF.

Due to these challenges, neither AutoSF nor other existing NAS algorithms can be directly applied (see the discussions in Sections IV-D1 and IV-D2). Thus, we propose to deal with the above challenges using alternating minimization. Specifically, we incorporate three key components into the search algorithm: Expectation-Maximization (EM) clustering for updating $\bm{B}$, policy gradient for searching $\{f_{n}\}_{n=1}^{N}$, and embedding sharing for updating $\bm{\omega}$. Details are given in the sequel.

Figure 1: Examples with $M,N=2$. (a) Modelling the generation of $f$ as a multi-step decision process: the relation-aware scoring functions $\{f_{1},f_{2}\}$ with $f_{1}(\bm{h},\bm{r},\bm{t})=\langle\bm{h}_{1},\bm{r}_{1},\bm{t}_{1}\rangle+\langle\bm{h}_{2},\bm{r}_{2},\bm{t}_{2}\rangle$ and $f_{2}(\bm{h},\bm{r},\bm{t})=-\langle\bm{h}_{1},\bm{r}_{1},\bm{t}_{2}\rangle-\langle\bm{h}_{2},\bm{r}_{2},\bm{t}_{1}\rangle$ are generated recurrently; (b) the architecture of the supernet as a bipartite graph with architecture weights $\bm{A}$; (c) an illustration of embedding sharing between two sampled relation-aware scoring functions.

IV-A Update Group Assignments by EM Clustering

In this part, we illustrate how to assign relations to proper groups during the search procedure. Intuitively, relations with similar semantic meanings should be grouped together. Since KG embeddings are designed to encode semantic meanings [7, 2], we propose to assign relations based on the given KG embeddings $\bm{\omega}$, i.e., by minimizing $\mathcal{L}_{B}(\bm{B},\bm{\omega})$ in Definition 2. Specifically, given the set of relations $R$ and $N$ groups, let $\bm{c}_{n}$ denote the vector representation of the $n$-th group (with $\bm{C}$ collecting all groups), and let $B_{rn}$ define the degree of membership of relation $r$ in the $n$-th group. Then, the objective $\mathcal{L}_{B}$ for relation clustering in Definition 2 is defined as:

$$\min\nolimits_{\bm{B}}\mathcal{L}_{B}(\bm{B},\bm{\omega})=\min\nolimits_{\bm{B},\bm{C}}\sum\nolimits_{r}\sum\nolimits_{n}B_{rn}\left\|\bm{r}-\bm{c}_{n}\right\|^{2}, \quad (5)$$

which can be solved by the Expectation-Maximization (EM) algorithm [42].
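A minimal sketch of this grouping step is given below, assuming hard assignments so that the E- and M-steps reduce to a k-means-style update over the relation embedding matrix; the number of iterations and the initialization are illustrative choices:

```python
import torch

def assign_relations(R, N, iters=10):
    """Cluster the relation embeddings R (N_r x d) into N groups, returning a
    0/1 assignment matrix B (N_r x N) and the group centers C (N x d)."""
    R = R.detach()                              # cluster on a detached copy
    Nr = R.size(0)
    C = R[torch.randperm(Nr)[:N]].clone()       # initialize centers from relations
    for _ in range(iters):
        # E-step: assign every relation to its nearest group center
        groups = torch.cdist(R, C).argmin(dim=1)
        # M-step: move each center to the mean of its assigned relations
        for n in range(N):
            members = R[groups == n]
            if len(members) > 0:
                C[n] = members.mean(dim=0)
    B = torch.zeros(Nr, N)
    B[torch.arange(Nr), groups] = 1.0
    return B, C

# e.g., B, C = assign_relations(relation_embeddings, N=3)
```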

IV-B Updating Architectures by Reinforcement Learning

Our target in (3) is to derive optimal relation-aware scoring functions $\{\bar{f}_{n}\}_{n=1}^{N}$ that maximize the evaluation performance on the given KG data. It is natural to use the Mean Reciprocal Rank (MRR) on the validation data as the evaluation measurement $\mathcal{M}_{\text{val}}$. However, MRR is non-differentiable, which means that directly optimizing (3) by gradient descent (e.g., differentiable OAS [36, 37]) is not suitable (see the discussion in Section V-E1).

IV-B1 Reinforcement Learning Formulation

To handle the non-differentiable MRR, we first formulate the search for $\{f_{n}\}$ as a multi-step decision problem. Then we adopt reinforcement learning to solve (3).

Recall that each $f_{n}$ is a summation of multiplicative items $\langle\bm{h}_{i},\bm{o},\bm{t}_{j}\rangle$ as in (1). We can sequentially determine which operation $\bm{o}\in\mathcal{O}$ is selected for the $(i,j)$-th multiplicative item in $f_{n}$. Let $v$ index the multiplicative items across $\{f_{n}\}$ and let $\alpha_{v}$ denote the operation selected for the $v$-th multiplicative item. Then, as shown in Figure 1(a), the search for $\{f_{n}\}$ can be viewed as a multi-step decision problem: a list of tokens $\{\alpha_{v}\}_{v=1}^{V}$ (with $V=NM^{2}$) needs to be predicted. It is natural to adopt reinforcement learning to solve this problem. Let $\bm{A}\in\{0,1\}^{NM^{2}\times(2M+1)}$, where $A_{vk}=1$ if $\alpha_{v}$ chooses the $k$-th operation $\bm{o}_{k}$ in $\mathcal{O}$ and $A_{vk}=0$ otherwise. Then (3) can be reformulated as:

$$\max\nolimits_{\theta}\mathcal{J}(\theta)\equiv\mathbb{E}_{\bm{A}\sim\pi(\bm{A};\theta)}[Q\left(\bm{A},\bm{B},\bm{\omega};S_{\text{val}}\right)], \quad (6)$$

where $\pi(\bm{A};\theta)$ is a policy parameterized by $\theta$ for generating $\bm{A}$, and $Q\left(\bm{A},\bm{B},\bm{\omega};S_{\text{val}}\right)$ measures the MRR performance as:

$$Q\left(\bm{A},\bm{B},\bm{\omega};S_{\text{val}}\right)=\sum\nolimits_{n}\sum\nolimits_{(h,r,t)\in S_{\text{val}}}B_{rn}\cdot q(f_{n}(\bm{h},\bm{r},\bm{t})).$$

Note that $q(\cdot)$ measures the MRR of a triplet and $f_{n}$ is the $n$-th scoring function determined by $\bm{A}$. Then, the gradient of $\theta$ in (6) can be computed by the REINFORCE gradient [39] as

$$\nabla_{\theta}\mathcal{J}(\theta)=\mathbb{E}_{\bm{A}\sim\pi(\bm{A};\theta)}\Big[\sum\nolimits_{v=1}^{V}\nabla_{\theta}\log P(\alpha_{v}\,|\,\alpha_{(v-1):1};\theta)\,Q\Big]\approx\frac{1}{U}\sum\nolimits_{u=1}^{U}\sum\nolimits_{v=1}^{V}\nabla_{\theta}\log P(\alpha_{v}\,|\,\alpha_{(v-1):1};\theta)\,(Q_{u}-b), \quad (7)$$

where $Q_{u}$ denotes $Q\left(\bm{A}^{u},\bm{B},\bm{\omega};S_{\text{val}}\right)$, $\bm{A}^{u}$ is the $u$-th architecture sampled from $\pi(\bm{A};\theta)$, $b$ is a moving-average baseline used to reduce variance, and $U$ denotes the number of sampled scoring functions. Solving (6) is thus converted to optimizing $\theta$ with (7), and whether $Q$ is differentiable does not affect the gradient computation w.r.t. $\theta$.

Inspired by [24, 35], we adopt a Long Short-Term Memory (LSTM) network [43] to parameterize $\theta$ for learning the policy $\pi(\bm{A};\theta)$. Specifically, the controller samples decisions in an autoregressive way: the operation $\alpha_{v}$ decided for the previous multiplicative item is fed into the next step to predict $\alpha_{v+1}$ (see Figure 1(a)). Finally, (7) is used to update the LSTM policy network $\theta$.
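The following sketch illustrates such a controller and the policy-gradient update in the spirit of (7); the hidden size, the reward callback `reward_fn` (e.g., one-shot validation MRR on a mini-batch), and the baseline decay are assumptions made for illustration, not the authors' exact settings:

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """LSTM controller that autoregressively emits V = N*M*M operation
    choices, each drawn from the (2M+1)-way set O = {0, +/-r_1, ..., +/-r_M}."""
    def __init__(self, V, num_ops, hidden=64):
        super().__init__()
        self.V, self.num_ops, self.hidden = V, num_ops, hidden
        self.embed = nn.Embedding(num_ops + 1, hidden)    # +1 for a <start> token
        self.cell = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, num_ops)

    def sample(self):
        tok = torch.tensor([self.num_ops])                # <start> token
        hx = cx = torch.zeros(1, self.hidden)
        choices, log_probs = [], []
        for _ in range(self.V):                           # one decision per item
            hx, cx = self.cell(self.embed(tok), (hx, cx))
            dist = torch.distributions.Categorical(logits=self.out(hx))
            tok = dist.sample()                           # alpha_v for this item
            choices.append(tok.item())
            log_probs.append(dist.log_prob(tok))
        return choices, torch.stack(log_probs).sum()

def reinforce_step(controller, opt, reward_fn, baseline, U=4, decay=0.9):
    # Sample U architectures, evaluate their one-shot reward (e.g. validation
    # MRR on a mini-batch), and apply the REINFORCE gradient of (7).
    opt.zero_grad()
    for _ in range(U):
        arch, log_prob = controller.sample()
        reward = reward_fn(arch)              # a non-differentiable reward is fine
        baseline = decay * baseline + (1 - decay) * reward
        (-(reward - baseline) * log_prob / U).backward()
    opt.step()
    return baseline

# usage sketch: ctrl = Controller(V=N * M * M, num_ops=2 * M + 1)
#               opt = torch.optim.Adam(ctrl.parameters(), lr=3e-4)
```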

IV-B2 Encoding Prior of Architectures

To fully explore the search space, we expect every relation segment $\bm{r}_{i}\in\mathcal{O}\backslash\{\mathbf{0}\}$ to be selected at least once in the searched scoring functions, which we call the exploitative constraint. Thus, we constrain $\bm{A}$ with this exploitative constraint: if a sampled $\bm{A}$ does not satisfy it, we directly set the reward $Q$ to 0.
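A small sketch of how this constraint could be checked when computing the reward is shown below; the integer encoding of the sampled operations (0 for the zero operation, k for $+\bm{r}_{k}$ and M+k for $-\bm{r}_{k}$) is an assumption for illustration:

```python
def satisfies_exploitative_constraint(arch, M):
    # Every relation block r_1, ..., r_M must appear (with either sign)
    # in at least one selected multiplicative item of the sampled SFs.
    used = {(op - 1) % M for op in arch if op != 0}   # map +/-r_k back to its block
    return len(used) == M

def constrained_reward(arch, M, reward_fn):
    # The reward is zeroed out for architectures violating the constraint.
    return reward_fn(arch) if satisfies_exploitative_constraint(arch, M) else 0.0
```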

IV-C Update Embedding in Shared Supernet

AutoSF follows the classic NAS approach of evaluating the stand-alone performance of candidate scoring functions, which requires separately training KG embeddings hundreds of times. As discussed in Section II-A2, OAS methods provide a more efficient evaluation mechanism by forcing all candidates to share parameters. Inspired by OAS, we design a simple but effective supernet that enables candidate scoring functions to share KG embeddings, thereby accelerating the search.

TABLE V: Comparison of AutoSF (Algorithm 1) and ERAS (Algorithm 2) in terms of NAS principles (Section II-A1).
Method | Space size | Space property | Search algorithm | Evaluation
AutoSF | $O((2M+1)^{M^{2}})$ | task-aware | progressive greedy search | stand-alone
ERAS | $O((2M+1)^{NM^{2}})$ | task-/relation-aware | alternating minimization (EM clustering + reinforcement learning) | embedding-shared supernet

IV-C1 Design of Supernet

To enable fast evaluation, we propose a supernet view of the relation-aware scoring function search space. Specifically, $f_{n}$ is represented as:

$$f_{n}(\bm{h},\bm{r},\bm{t})=\sum\nolimits_{i}\sum\nolimits_{j}\sum\nolimits_{k}A_{vk}\cdot\langle\bm{h}_{i},\bm{o}_{k},\bm{t}_{j}\rangle. \quad (8)$$

Recall that $v$ indexes the multiplicative items in $\{f_{n}\}$ and $k$ indexes the operations in $\mathcal{O}$. Then, as shown in Figure 1(b), $\bm{A}$ in (8) can be regarded as the adjacency matrix of a bipartite graph (i.e., the supernet), where multiplicative items and operations are nodes and $A_{vk}$ records the edge weight between them. Based on this supernet design, Figure 1(c) illustrates that any relation-aware $\{f_{n}\}$ can be realized as a subgraph of the supernet. ERAS then forces all subgraphs to share embeddings, so that different scoring functions can be evaluated on the same KG embeddings. This enables faster evaluation of candidate scoring functions by avoiding repetitive embedding training.
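The sketch below illustrates how a sampled subgraph of this supernet could be evaluated: the 0/1 matrix A picks one operation per multiplicative item, while B routes each triplet's relation to its group's scoring function. Tensor shapes, the operation encoding, and the batching scheme are assumptions for illustration, not the authors' implementation:

```python
import torch

def supernet_score(h, r, t, rel_idx, A, B, M):
    """Score a batch of triplets under one sampled subgraph of the supernet.
    h, r, t: (batch, d) embeddings; rel_idx: (batch,) relation ids;
    A: (N*M*M, 2M+1) 0/1 operation choices; B: (N_r, N) 0/1 group assignment."""
    N = B.size(1)
    hb, rb, tb = h.chunk(M, -1), r.chunk(M, -1), t.chunk(M, -1)
    group = B[rel_idx].argmax(dim=1)                 # group index of each triplet
    scores = torch.zeros(h.size(0))
    for n in range(N):
        in_group = (group == n).float()              # mask routing triplets to f_n
        for i in range(M):
            for j in range(M):
                v = n * M * M + i * M + j            # multiplicative item index
                k = int(A[v].argmax())               # selected operation o_k
                if k == 0:                           # o_0 is the zero operation
                    continue
                sign = 1.0 if k <= M else -1.0       # +r_k for k <= M, -r_k otherwise
                blk = (k - 1) % M
                term = (hb[i] * rb[blk] * tb[j]).sum(-1)
                scores = scores + in_group * sign * term
    return scores
```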

IV-C2 Update Embeddings

Given a fixed controller policy $\pi(\bm{A};\theta)$ and relation assignment $\bm{B}$, we solve for $\mathcal{M}_{\text{tra}}$ in objective (4) by minimizing the expected loss $\mathcal{L}$ on the training data, i.e., $\mathbb{E}_{\bm{A}\sim\pi(\bm{A};\theta)}[\mathcal{L}\left(\bm{A},\bm{B},\bm{\omega};S_{\text{tra}}\right)]$. Stochastic gradient descent (SGD) can then be performed to optimize $\bm{\omega}$. We approximate the gradient $\nabla_{\bm{\omega}}$ as:

$$\nabla_{\bm{\omega}}\mathbb{E}_{\bm{A}\sim\pi(\bm{A};\theta)}[\mathcal{L}]\approx\frac{1}{U}\sum\nolimits_{u=1}^{U}\nabla_{\bm{\omega}}\mathcal{L}\left(\bm{A}^{u},\bm{B},\bm{\omega};S_{\text{tra}}\right), \quad (9)$$

where $U$ is the number of sampled scoring functions, and

$$\mathcal{L}\left(\bm{A}^{u},\bm{B},\bm{\omega};S_{\text{tra}}\right)=\sum\nolimits_{n}\sum\nolimits_{(h,r,t)\in S_{\text{tra}}}B_{rn}\cdot\ell(f_{n}(\bm{h},\bm{r},\bm{t})).$$

Note that $\ell(\cdot)$ is the multiclass log-loss [19] and $f_{n}$ is the $n$-th scoring function based on the sampled $\bm{A}^{u}$. Hence, (9) can be rewritten as:

$$\nabla_{\bm{\omega}}\mathbb{E}_{\bm{A}\sim\pi(\bm{A};\theta)}[\mathcal{L}]\approx\frac{1}{U}\sum\nolimits_{u=1}^{U}\sum\nolimits_{n}\sum\nolimits_{(h,r,t)\in S_{\text{tra}}}\nabla_{\bm{\omega}}\,B_{rn}\cdot\ell(f_{n}(\bm{h},\bm{r},\bm{t})).$$
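A sketch of one shared-embedding update implementing (9) on a training mini-batch is given below; the `score_fn` callback, which scores every candidate tail entity under a sampled architecture, and the use of Adagrad (Section V-A2) are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def embedding_step(E, R, opt, batch, sampled_archs, score_fn):
    """One update of the shared embeddings E (N_e x d) and R (N_r x d), averaging
    the multiclass log-loss over the U sampled architectures as in (9).
    `score_fn(A, h, r, rel_idx, E)` is assumed to return a (batch, N_e) matrix
    scoring every candidate tail entity under architecture A."""
    h_idx, r_idx, t_idx = batch                 # index tensors of a training mini-batch
    opt.zero_grad()
    loss = 0.0
    for A in sampled_archs:                     # U subgraphs sampled from the controller
        scores = score_fn(A, E[h_idx], R[r_idx], r_idx, E)
        loss = loss + F.cross_entropy(scores, t_idx)      # multiclass log-loss
    (loss / len(sampled_archs)).backward()                # Monte-Carlo average of (9)
    opt.step()

# usage sketch: E = torch.randn(num_entities, d, requires_grad=True)
#               R = torch.randn(num_relations, d, requires_grad=True)
#               opt = torch.optim.Adagrad([E, R], lr=0.1)
```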

IV-C3 Performance Evaluation

Recall that $Q_{u}$ denotes the reward of the $u$-th sampled scoring function in (7). In the supernet, every sampled scoring function corresponds to a subgraph in which some edges are activated. Hence, we can evaluate $Q_{u}$ by activating the corresponding subgraph of the supernet.

IV-D Complete Algorithm

The proposed algorithm ERAS is summarized in Algorithm 2, where the KG embeddings $\bm{\omega}$, relation assignments $\bm{B}$, and architectures $\bm{A}$ are alternately updated in every epoch. To improve the efficiency of scoring function search, we represent the search space as a supernet and share KG embeddings across different scoring functions in step 3 (see Section IV-C). Thus, ERAS avoids wasting computation on training embeddings from scratch. To enable relation-aware scoring function search, we introduce EM clustering in step 4 to dynamically assign relations $\bm{B}$ based on the learned embeddings (see Section IV-A). To handle the non-differentiable measurement of scoring functions, we use reinforcement learning and perform policy gradient updates in steps 5-6 (see Section IV-B). After searching, we derive several sampled scoring functions with the well-trained controller and compute their rewards on a mini-batch of the validation data in steps 9-11. Finally, we take the scoring function with the highest reward and re-train it from scratch.

Algorithm 2 ERAS: Efficient Relation-aware Scoring Function Search.
Require: initialized embeddings $\bm{\omega}$, relation groups $\bm{B}$, and controller parameters $\theta$.
// Search relation- and task-aware scoring functions
1: while not converged do
2:    Sample a set of scoring functions from $\pi(\bm{A};\theta)$;
3:    Update the shared embeddings $\bm{\omega}$ with (9);
4:    Update the relation assignments $\bm{B}$ according to (5);
5:    Sample a mini-batch $B_{\text{val}}$ from the validation data $S_{\text{val}}$;
6:    Update the architecture policy $\pi(\bm{A};\theta)$ with (7);
7: end while
// Derive the final scoring function $\bar{\bm{A}}$
8: Sample $K$ scoring functions $\mathcal{A}^{K}$ from $\pi(\bm{A};\theta)$;
9: for $\bm{A}\in\mathcal{A}^{K}$ do
10:    Compute $Q\left(\bm{A},\bm{B},\bm{\omega};S_{\text{val}}\right)$;
11: end for
12: return the scoring function $\bar{\bm{A}}$ with the highest reward on the validation set, and train it from scratch to convergence.

IV-D1 Comparison with AutoSF

To compare ERAS with the pioneering work AutoSF, we summarize both from the perspective of NAS principles in Table V. First, to capture the inherent properties of relation patterns in KGs, ERAS targets the relation-aware scoring function search problem, which leads to a larger search space than that of AutoSF. Second, ERAS adopts an alternating minimization scheme to handle the non-differentiable $\mathcal{M}_{\text{val}}$. Finally, AutoSF evaluates every scoring function based on its stand-alone performance, which requires repeatedly training KG embeddings hundreds of times. On the contrary, ERAS shares KG embeddings across different scoring functions, which greatly reduces the time cost of evaluating candidate scoring functions.

IV-D2 Comparison with Existing OAS Algorithms

Inspired by parameter sharing in OAS works, we propose to share KG embeddings in the supernet, so that ERAS avoids repeatedly training KG embeddings hundreds of times. However, there are some concerns about parameter sharing in OAS [44, 38]. Specifically, while parameter sharing enables all architectures in the search space to be trained in a cheaper way, it can lead to a biased evaluation problem: the evaluation of candidate architectures in the search phase (i.e., $\mathcal{M}_{\text{val}}$) may be inconsistent with the stand-alone training phase. In particular, the correlation between one-shot and stand-alone performance is likely to become unstable as the supernet grows deep and complex.

Therefore, to make parameter sharing work for our problem, we first construct a search space that can be represented by the supernet in (8). Then, to avoid the supernet being deep and complex, we design a shallow and simple supernet in the form of a bipartite graph, which differs from the complex DAG supernets in classic OAS works. We demonstrate in Section V-E1 that ERAS's supernet design does not suffer from the biased evaluation problem. In summary, we leverage the domain knowledge of KG embedding to make embedding sharing work.

TABLE VI: Comparison of the best scoring functions identified by ERAS and the state-of-the-art methods on the link prediction task. Bold numbers indicate the best performance and underlined numbers the second best. [\clubsuit]: results taken from [14]; [\dagger]: from [45]; [\ddagger]: from [11]; [\S]: from [25]; [\diamondsuit]: from [12]; [*]: from [20]; [+]: from [29]; [\spadesuit]: from [46].
type model WN18 WN18RR FB15k FB15k237 YAGO3-10
MRR Hit@1 Hit@10 MRR Hit@1 Hit@10 MRR Hit@1 Hit@10 MRR Hit@1 Hit@10 MRR Hit@1 Hit@10
TransE 0.500 - 94.1 0.178 - 45.1 0.495 - 77.4 0.256 - 41.9 - - -
TDMs TransH 0.521 - 94.5 0.186 - 45.1 0.452 - 76.6 0.233 - 40.1 - - -
RotatE 0.949 94.4 95.9 0.476 42.8 57.1 0.797 74.6 88.4 0.338 24.1 53.3 - - -
NTN‡ 0.53 - 66.1 - - - 0.25 - 41.4 - - - - - -
NNMs ConvE 0.942 93.5 95.5 0.460 39.0 48.0 0.745 67.0 87.3 0.316 23.9 49.1 0.520 45.0 66.0
HypER§ 0.951 94.7 95.8 0.465 43.6 52.2 0.790 73.4 88.5 0.341 25.2 52.0 0.533 45.5 67.8
TBMs TuckER * 0.953 94.9 95.8 0.470 44.3 52.6 0.795 74.1 89.2 0.358 26.6 54.4 - - -
HolEX+ 0.938 93.0 94.9 - - - 0.800 75.0 88.6 - - - - - -
QuatE 0.950 94.5 95.9 0.488 43.8 58.2 0.782 71.1 90.0 0.348 24.8 55.0 - - -
DistMult 0.821 71.7 95.2 0.443 40.4 50.7 0.817 77.7 89.5 0.349 25.7 53.7 0.552 47.6 69.4
ComplEx 0.951 94.5 95.7 0.471 43.0 55.1 0.831 79.6 90.5 0.347 25.4 54.1 0.566 49.1 70.9
Analogy 0.950 94.6 95.7 0.472 43.3 55.8 0.829 79.3 90.5 0.348 25.6 54.7 0.565 49.0 71.3
SimplE 0.950 94.5 95.9 0.468 42.9 55.2 0.830 79.8 90.3 0.350 26.0 54.4 0.565 49.1 71.0
Rule-based AnyBURL 0.950 94.6 95.9 0.480 44.6 55.5 0.830 80.8 87.6 0.310 23.3 48.6 0.540 47.7 67.3
AutoML AutoSF 0.952 94.7 96.1 0.490 45.1 56.7 0.853 82.1 91.0 0.360 26.7 55.2 0.571 50.1 71.5
ERAS$^{N=1}$ 0.951 94.7 96.0 0.490 45.0 56.8 0.853 82.0 91.2 0.361 26.6 55.2 0.570 50.2 71.5
ERAS 0.953 95.0 96.2 0.492 45.2 56.8 0.855 82.3 91.4 0.365 26.8 55.5 0.577 50.3 71.7

V Empirical Study

Here we mainly show that ERAS can improve the effectiveness of KG embedding with high efficiency, and provide some insights. All code is implemented with PyTorch [47] and experiments are run on a single TITAN Xp GPU.

V-A Experiment Setup

V-A1 Data Sets

In our experiments, we mainly use five public benchmark data sets: WN18 [7], WN18RR [12], FB15k [7], FB15k237 [41], and YAGO3-10 [12], which have been employed to compare KG embedding models in [7, 17, 18, 16, 15]. Note that WN18RR and FB15k237 remove duplicate and inverse-duplicate relations from WN18 and FB15k, respectively. The statistics of the five data sets are summarized in Table VII.

TABLE VII: Summary of KG benchmark data sets.
Data set #relation #entity #training #validation #testing
WN18 18 40,943 141,442 5,000 5,000
WN18RR 11 40,943 86,835 3,034 3,134
FB15k 1,345 14,951 484,142 50,000 59,071
FB15k237 237 14,541 272,115 17,535 20,466
YAGO3-10 37 123,188 1,079,040 5,000 5,000

V-A2 Hyperparameter Settings

The hyperparameters in this work can be mainly categorized into searching and evaluation parameters. To fairly compare existing scoring functions, including human-designed and searched ones, we tune a stand-alone parameter set on SimplE with the help of HyperOpt, a hyperparameter optimization framework [33]. The tuned parameter set includes the learning rate, L2 penalty, decay rate, batch size, and embedding dimension. We then compare the stand-alone performance of different scoring functions under this tuned parameter set. Besides, the searching parameters of ERAS are the number of blocks $M$, the number of relation groups $N$, and the number of sampled scoring functions $U$ in (7) and (9). Moreover, we optimize the embeddings $\bm{\omega}$ with the Adagrad [48] algorithm and the controller $\theta$ with the Adam [49] algorithm.
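As an illustration of such a tuning loop, here is a sketch using HyperOpt's fmin with TPE; the search ranges and the placeholder `train_and_validate` routine (which would train SimplE under a configuration and return its validation MRR) are assumptions, not the settings used in the paper:

```python
import random
from hyperopt import fmin, hp, tpe

def train_and_validate(cfg):
    # Placeholder for training SimplE under `cfg` and returning its validation
    # MRR; replaced by a dummy value here so the sketch runs end-to-end.
    return random.random()

def objective(cfg):
    return -train_and_validate(cfg)            # HyperOpt minimizes the objective

space = {
    "lr": hp.loguniform("lr", -7, 0),          # learning rate
    "l2": hp.loguniform("l2", -12, -3),        # L2 penalty
    "decay": hp.uniform("decay", 0.9, 1.0),    # decay rate
    "batch_size": hp.choice("batch_size", [256, 512, 1024]),
    "dim": hp.choice("dim", [256, 512]),       # embedding dimension
}
best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)
```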

Figure 2: Search efficiency comparison of ERAS with other popular AutoML search algorithms on (a) WN18RR, (b) FB15k237, and (c) YAGO3-10.

V-B Comparison with KG Embedding Methods

As in [14, 20, 46], we perform experiments on the link prediction and triplet classification tasks, as they are important test beds for scoring functions.

V-B1 Link Prediction

We first test the performance of our proposed method on the link prediction task. This is the standard test bed for KG embedding models and an important task in KG completion. Given a triplet $(h,r,t)\in S_{\text{val}}\cup S_{\text{test}}$, the KG embedding model obtains the rank of $h$ by computing the scores of $(h^{\prime},r,t)$ for all entities $h^{\prime}$, and likewise for $t$. We adopt the classic metrics [7, 8] (a short computational sketch follows the list):

  • Mean Reciprocal Rank (MRR): $\frac{1}{|S|}\sum_{i=1}^{|S|}\frac{1}{\text{rank}_{i}}$, where $\text{rank}_{i}$ is the ranking result; and

  • Hit@1, i.e., $\frac{1}{|S|}\sum_{i=1}^{|S|}\mathbb{I}(\text{rank}_{i}\leq 1)$, and Hit@10, i.e., $\frac{1}{|S|}\sum_{i=1}^{|S|}\mathbb{I}(\text{rank}_{i}\leq 10)$, where $\mathbb{I}(\cdot)$ is the indicator function.
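A minimal sketch of computing these metrics from the (filtered) ranks of test triplets, with a toy usage example:

```python
import torch

def ranking_metrics(ranks):
    """Compute MRR, Hit@1 and Hit@10 from the (filtered) ranks of test triplets."""
    ranks = torch.as_tensor(ranks, dtype=torch.float)
    return {
        "MRR": (1.0 / ranks).mean().item(),
        "Hit@1": (ranks <= 1).float().mean().item(),
        "Hit@10": (ranks <= 10).float().mean().item(),
    }

print(ranking_metrics([1, 3, 12, 2]))   # toy example: MRR ~ 0.479, Hit@1 = 0.25, Hit@10 = 0.75
```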

Note that higher MRR, Hit@1, and Hit@10 values indicate better embedding quality. We compare the proposed ERAS (Algorithm 2) with the popular KG embedding models mentioned in Section I:

  • Translational models (TDMs): TransE [7], TransH [8], and RotatE [45];

  • Neural network models (NNMs): NTN [11], ConvE [12], and HypER [25];

  • Tensor-based models (TBMs): TuckER [20], HolEX [29], QuatE [46], DistMult [15], ComplEx [16], Analogy [18] and SimplE [17];

  • The rule-based model: AnyBURL [50];

  • The scoring function search method: AutoSF [14].

TABLE VIII: Hit@1 results of ERAS$^{N=1}$ and ERAS on the link prediction task at the relation pattern level.
Methods Symmetric relations Anti-symmetric relations
WN18RR FB15k FB15k237 WN18RR FB15k FB15k237
Best in Table III 94.0 88.0 7.0 12.0 81.0 27.0
ERAS$^{N=1}$ 93.2 86.5 5.3 11.6 80.4 26.9
ERAS 94.3 90.0 8.8 13.2 82.1 27.9
TABLE IX: Running time analysis of the automated approaches on a single GPU, in hours.
Data set | AutoSF (greedy search / evaluation) | ERAS$^{N=1}$ (supernet training / evaluation) | ERAS (supernet training / evaluation) | Hand-designed (DistMult / QuatE)
WN18 | 65.7±3.0 / 5.5±0.5 | 3.29±0.2 / 2.1±0.1 | 3.54±0.1 / 2.2±0.1 | 1.9±0.1 / 2.0±0.1
FB15k | 127.1±5.2 / 20.5±1.3 | 4.55±0.2 / 19.0±0.2 | 4.86±0.2 / 19.49±0.3 | 8.36±0.2 / 11.1±0.4
WN18RR | 38.6±1.9 / 3.72±0.6 | 2.97±0.2 / 0.50±0.1 | 3.19±0.1 / 0.52±0.1 | 0.42±0.1 / 0.95±0.1
FB15k237 | 61.1±2.8 / 8.5±0.4 | 3.22±0.1 / 4.7±0.1 | 3.54±0.1 / 4.8±0.2 | 2.6±0.1 / 5.0±0.3
YAGO | 219.9±5.1 / 18.9±2.0 | 17.5±0.3 / 29.5±1.1 | 19.8±0.3 / 30.3±1.9 | 26.4±1.5 / 32.6±2.0

The comparison of global effectiveness between ERAS and the other methods is shown in Table VI. First, it is clear that traditional scoring functions, such as TDMs, NNMs, and TBMs, are not task-aware, since no single scoring function performs consistently well across the benchmark data sets. This indicates that a single scoring function is hard to adapt to different KGs even if it is universal, as discussed in Table III. The task-aware method AutoSF can search KG-dependent scoring functions on the five data sets and performs consistently better than traditional scoring functions. We then compare AutoSF with a variant of ERAS, i.e., ERAS$^{N=1}$, which searches a universal scoring function since all relations are assigned to one group (i.e., it is only task-aware, like AutoSF). On the five benchmark data sets, ERAS$^{N=1}$ shows almost the same performance as AutoSF. Moreover, as a task-aware and relation-aware method, ERAS performs better than AutoSF and the manually designed scoring functions.

As discussed in Section III-A, existing scoring functions may achieve unsatisfactory performance on specific relation types of certain KG data. Corresponding to Table III, we investigate the performance of ERAS$^{N=1}$ and ERAS at the relation type level in Table VIII. Clearly, the relation-aware method ERAS consistently achieves outstanding performance on the various relation types of each KG. In particular, ERAS improves the performance on symmetric relations of FB15k and FB15k237. However, since symmetric facts account for only 3% of the test data in FB15k and FB15k237 [27], the global improvement is not as notable.

V-B2 Triplet Classification

To further demonstrate the effectiveness of ERAS, we also conduct triplet classification experiments on FB15k, WN18RR, and FB15k237, where positive and negative triplets are provided. We compare our method with those that have reported results in published papers. This task aims to answer whether a given $(h,r,t)$ exists or not, and we use accuracy to evaluate the scoring functions. We follow the same classification decision rule as in the literature [14]: a triplet $(h,r,t)$ is predicted positive if $f(h,r,t)>\theta_{r}$ and negative otherwise, where $\theta_{r}$ is a relation-specific threshold inferred by maximizing the accuracy on $S_{\text{val}}$. As shown in Table X, the scoring function searched by the relation-aware ERAS consistently outperforms the other BLMs and searched scoring functions.
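A small sketch of how such relation-specific thresholds could be fitted on the validation scores is given below; the function names and the midpoint candidate thresholds are illustrative assumptions:

```python
import torch

def fit_thresholds(scores, labels, rel_idx, num_rel):
    """Pick a per-relation threshold theta_r maximizing validation accuracy;
    a triplet is then predicted positive iff its score exceeds theta_r.
    `scores` are validation scores, `labels` are 0/1 floats, `rel_idx` are relation ids."""
    thresholds = torch.zeros(num_rel)
    for r in range(num_rel):
        mask = rel_idx == r
        if not mask.any():
            continue
        s, y = scores[mask], labels[mask]
        srt = s.sort().values
        # candidate thresholds: midpoints between consecutive sorted scores
        cands = (srt[:-1] + srt[1:]) / 2 if len(s) > 1 else srt
        accs = torch.stack([((s > c).float() == y).float().mean() for c in cands])
        thresholds[r] = cands[accs.argmax()]
    return thresholds
```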

TABLE X: Comparison of the best scoring functions identified by ERAS and the state-of-the-art scoring functions for triplet classification. [\clubsuit]: results are taken from [14].
Data set FB15k WN18RR FB15k237
DistMult 80.8 84.6 79.8
Analogy 82.1 86.1 79.7
ComplEx 81.8 86.6 79.6
SimplE 81.5 85.7 79.6
AutoSF 82.7 87.7 81.2
ERAS 82.9 88.0 81.4
Figure 3: An example of the relation-aware scoring functions searched by ERAS on WN18.
Figure 4: An example of the relation-aware scoring functions searched by ERAS on WN18RR.

V-C Comparison with AutoML Search Methods

ERAS enables embedding sharing to improve the efficiency of scoring function search. Hence, we also compare the search efficiency of ERAS and ERAS$^{N=1}$ with other automated search algorithms, i.e., AutoSF [14], random search [44], and the Bayes algorithm [33], over three benchmark data sets. As shown in Figure 2, both ERAS and ERAS$^{N=1}$ complete the search very quickly. This is because the other AutoML methods have to train hundreds of candidate scoring functions to convergence, whereas ERAS and ERAS$^{N=1}$ avoid this time-consuming training via the one-shot approach. Compared with ERAS, ERAS$^{N=1}$ converges faster during the search since it does not dynamically assign relations or search corresponding scoring functions for different groups. It is worth noting that ERAS has unstable performance at the beginning of the search on FB15k237, because FB15k237 has many more relations than the other two data sets and it takes some time to find proper relation assignments at the start. Furthermore, the other automated methods can achieve higher performance than ERAS during the search phase: ERAS continually updates the candidate scoring functions on every mini-batch, so the searched scoring functions cannot be fully trained with one or a few mini-batches, whereas the other automated methods train the searched scoring functions on all training data until convergence.

We also summarize the running time on the five data sets in Table IX. AutoSF sets the embedding dimension $d$ to 64 for all data sets, while ERAS enables a much faster search and hence sets $d=512$. The larger dimensionality allows us to evaluate scoring functions more accurately. As shown in Table IX, the search time of AutoSF is significantly reduced by ERAS$^{N=1}$. Recall from Table VI that the effectiveness of ERAS$^{N=1}$ matches that of AutoSF; this indicates that ERAS$^{N=1}$ can drastically shorten the task-aware search time while maintaining the same effectiveness as AutoSF. Moreover, although ERAS searches a larger, relation-aware search space, it reduces the search time of AutoSF by an order of magnitude with a large dimension size $d=512$ and improves effectiveness, as shown in Table VI. In summary, ERAS can search for more effective scoring functions more efficiently.

V-D Case Study: The Searched scoring functions

To show that the scoring functions searched by ERAS are relation-aware, we use the scoring functions searched on WN18 and WN18RR as examples and plot them in Figures 3 and 4. The relations are grouped into general asymmetry, symmetry, and anti-symmetry, and the three searched scoring functions have distinct patterns, each handling its corresponding relations.

V-E Ablation Study

To investigate the influence of different components of ERAS, we conduct several ablation studies.

V-E1 Impact of Evaluation Measurement and Optimization Algorithm

As discussed in Section IV-D2, a deep and complex supernet design is likely to lead to the biased evaluation problem. In Figure 5(a), we first examine the correlation between the stand-alone validation MRR and the one-shot validation MRR (i.e., $\mathcal{M}_{\text{val}}$) of various scoring functions in ERAS. The one-shot validation MRR is clearly positively correlated with the stand-alone validation MRR. Therefore, the simplified supernet design makes embedding sharing work, and there is no biased evaluation problem.

To further investigate the impact of $\mathcal{M}_{\text{val}}$ and the optimization algorithm, we compare ERAS with the following variants:

  • ERAS$^{\text{los}}$ uses the validation loss $\mathcal{L}$ to replace $\mathcal{M}_{\text{val}}$. The other steps are the same as in ERAS.

  • ERAS$^{\text{dif}}$ first replaces $\mathcal{M}_{\text{val}}$ with $\mathcal{L}$ as ERAS$^{\text{los}}$ does. The differentiable measurement $\mathcal{L}$ then enables a differentiable optimization algorithm [36, 37] for the search. The detailed implementation of ERAS$^{\text{dif}}$ is presented in the Appendix.

We show the correlation between the stand-alone validation MRR and different $\mathcal{M}_{\text{val}}$ settings in Figure 5. Compared with using MRR as $\mathcal{M}_{\text{val}}$ in Figure 5(a), Figure 5(b) shows that using $\mathcal{L}$ as $\mathcal{M}_{\text{val}}$ correlates poorly with the stand-alone validation MRR. This indicates that the validation loss cannot properly evaluate the stand-alone performance of scoring functions during the search. Consequently, in Table XI, we observe that ERAS using MRR as $\mathcal{M}_{\text{val}}$ performs better than the two variants ERAS$^{\text{los}}$ and ERAS$^{\text{dif}}$. Moreover, ERAS$^{\text{los}}$ is worse than ERAS$^{\text{dif}}$ because optimizing the loss with the RL approach cannot make use of the differentiable nature of $\mathcal{L}$.

Figure 5: The correlation between the stand-alone validation MRR and different $\mathcal{M}_{\text{val}}$ settings of various searched scoring functions on WN18RR. (a) Validation MRR as $\mathcal{M}_{\text{val}}$; (b) validation loss as $\mathcal{M}_{\text{val}}$.
Figure 6: Comparison of model training time (sec) vs. testing MRR with different numbers of groups $N$ in ERAS. (a) WN18RR; (b) FB15k237.
TABLE XI: Comparison of the variants of ERAS on the link prediction task (test MRR).

Section | Variant                     | WN18  | WN18RR | FB15K | FB15k237 | YAGO3-10
V-E1    | $\text{ERAS}^{\text{los}}$  | 0.944 | 0.485  | 0.840 | 0.344    | 0.560
V-E1    | $\text{ERAS}^{\text{dif}}$  | 0.949 | 0.485  | 0.848 | 0.355    | 0.565
V-E2    | $\text{ERAS}^{\text{sig}}$  | 0.945 | 0.480  | 0.844 | 0.338    | 0.559
V-E3    | $\text{ERAS}^{\text{pde}}$  | 0.950 | 0.489  | 0.850 | 0.349    | 0.570
V-E3    | $\text{ERAS}^{\text{smt}}$  | 0.948 | 0.485  | 0.845 | 0.347    | 0.565
--      | ERAS                        | 0.953 | 0.492  | 0.855 | 0.365    | 0.577

V-E2 Impact of Optimization Level

In this paper, Definition 2 formulates the problem as a bi-level optimization objective. As stated in Section III-B, bi-level optimization benefits ERAS by updating the scoring functions and the embeddings separately. To investigate the impact of the optimization level, we add another variant, $\text{ERAS}^{\text{sig}}$, which uses the training set to update (3) in Definition 2 (i.e., a single-level problem). In Table XI, ERAS outperforms $\text{ERAS}^{\text{sig}}$, which shows that bi-level optimization is needed: optimizing scoring functions on the validation set encourages ERAS to select scoring functions that generalize well rather than ones that overfit the training data.
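A toy sketch of this alternation is given below. It only illustrates which data each level uses and which variables each level updates: for brevity, the upper level here takes a gradient step on a differentiable surrogate (as in the $\text{ERAS}^{\text{dif}}$ variant), whereas ERAS itself updates the search variables with the MRR-based strategy described earlier, and the toy scoring components below are not the actual ERAS supernet.

import torch

# Toy bi-level alternation: embeddings are fit on the training batch (lower level),
# architecture weights A are updated on the validation batch (upper level).
emb = torch.nn.Embedding(100, 8)                 # toy entity/relation embeddings
A = torch.zeros(4, requires_grad=True)           # toy architecture weights over 4 components
emb_opt = torch.optim.Adagrad(emb.parameters(), lr=0.1)
arch_opt = torch.optim.Adam([A], lr=0.01)

def toy_loss(batch):
    h, r, t = batch
    e_h, e_r, e_t = emb(h), emb(r), emb(t)
    components = torch.stack([(e_h * e_r * e_t).sum(-1),   # four toy interaction terms
                              (e_h * e_t).sum(-1),
                              (e_h * e_r).sum(-1),
                              (e_r * e_t).sum(-1)], dim=-1)
    score = components @ torch.softmax(A, dim=0)           # mixture weighted by A
    return torch.nn.functional.softplus(-score).mean()     # loss on positive triplets only

train_batch = (torch.tensor([1]), torch.tensor([2]), torch.tensor([3]))
valid_batch = (torch.tensor([4]), torch.tensor([5]), torch.tensor([6]))

for step in range(10):
    emb_opt.zero_grad(); arch_opt.zero_grad()
    toy_loss(train_batch).backward()   # lower level: only the embeddings are stepped
    emb_opt.step()
    emb_opt.zero_grad(); arch_opt.zero_grad()
    toy_loss(valid_batch).backward()   # upper level: only A is stepped, on validation data
    arch_opt.step()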

V-E3 Impact of Grouping Approaches

To further explore the influence of different grouping approaches, we consider two variants of ERAS:

  • $\text{ERAS}^{\text{pde}}$ does not update $\bm{B}$ during the search. Instead, it fixes the grouping based on embeddings pre-trained with SimplE.

  • $\text{ERAS}^{\text{smt}}$ groups relations by their semantic patterns (i.e., symmetric, anti-symmetric, asymmetric, and inverse).

We compare these two variants with ERAS in Table XI. The performance of $\text{ERAS}^{\text{smt}}$ is unsatisfactory because of the gap between human-defined relation groups and the proper groups derived from the data. In short, the comparison also indicates the importance of dynamically assigning relation groups during the search, which encourages each relation to be assigned to an appropriate scoring function.
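As a concrete illustration of the fixed-grouping baseline $\text{ERAS}^{\text{pde}}$, relations can be clustered once by their pre-trained embeddings before the search starts. The sketch below uses k-means on randomly generated stand-ins for SimplE relation embeddings; the clustering algorithm, the toy sizes, and the one-hot construction of $\bm{B}$ are our assumptions for illustration only.

import numpy as np
from sklearn.cluster import KMeans

# Stand-ins for pre-trained relation embeddings (e.g., from SimplE): one row per relation.
rng = np.random.default_rng(0)
relation_embeddings = rng.standard_normal((18, 64))    # 18 relations, toy dimension 64

N = 3                                                  # number of groups / scoring functions
groups = KMeans(n_clusters=N, n_init=10, random_state=0).fit_predict(relation_embeddings)
B = np.eye(N)[groups]                                  # fixed one-hot assignment, shape (18, N)
print(B.sum(axis=0))                                   # how many relations land in each group
# ERAS itself instead keeps updating this assignment during the search.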

V-E4 Impact of Grouping Numbers

To investigate the impact of the number of groups in ERAS, Figure 6 summarizes the performance of the scoring functions searched by AutoSF and by different settings of ERAS, i.e., $\text{ERAS}^{N}$ with $N\in\{1,2,\dots,5\}$, on WN18RR and FB15k237. Generally, the more groups there are, the longer the running time. ERAS achieves its best performance when $N=3$ or $4$. AutoSF and $\text{ERAS}^{N=1}$ show similar learning curves since their model complexities are the same.

V-E5 Impact of Block Numbers

AutoSF fixes the block number $M$ to 4 for efficiency reasons: once $M$ is changed (e.g., to $M=3$ or $M=5$), all of AutoSF's design details must be redone. In contrast, the efficiency of ERAS allows more flexible settings of $M$. Here we try $M\in\{3,4,5\}$ to study how the block number influences the performance of ERAS. As shown in Figure 7, $M=4$ performs best among $\{3,4,5\}$.

Figure 7: Comparison of model training time (sec) vs. testing MRR with different numbers of blocks $M$ in ERAS. (a) WN18RR; (b) FB15k237.

VI Conclusion

In this paper, we propose a new automated machine learning (AutoML) method for designing scoring functions in knowledge graph embedding. First, we design a relation-aware search space, motivated by our analysis of how existing scoring functions adapt to different relations. Then, we represent the new search space as a supernet in the form of a graph and search through the supernet with one-shot architecture search methods. Experimental results on benchmark data sets demonstrate both the efficiency of our approach and its competitive effectiveness.

For future work, one interesting direction is to connect ERAS with graph neural networks [51]; another direction worth exploring is to use paths instead of triplets to exploit higher-order information in KGs [52].

VII Acknowledgements

This work is partially supported by the National Key Research and Development Program of China (Grant No. 2018AAA0101100), the Hong Kong RGC GRF Project 16202218, CRF Projects C6030-18G, C1031-18G, and C5026-18G, AOE Project AoE/E-603/18, China NSFC No. 61729201, Guangdong Basic and Applied Basic Research Foundation 2019B151530001, Hong Kong ITC ITF grants ITS/044/18FX and ITS/470/18FX, the Microsoft Research Asia Collaborative Research Grant, the Didi-HKUST joint research lab project, and WeChat and WeBank Research Grants.

References

  • [1] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich, “A review of relational machine learning for knowledge graphs,” Proceedings of the IEEE, vol. 104, no. 1, pp. 11–33, 2015.
  • [2] Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applications,” TKDE, vol. 29, no. 12, pp. 2724–2743, 2017.
  • [3] D. Lukovnikov, A. Fischer, J. Lehmann, and S. Auer, “Neural network-based question answering over knowledge graphs on word and character level,” in WWW, pp. 1211–1220, 2017.
  • [4] F. Zhang, N. J. Yuan, D. Lian, X. Xie, and W.-Y. Ma, “Collaborative knowledge base embedding for recommender systems,” in SIGKDD, pp. 353–362, 2016.
  • [5] X. Wang, Y. Ye, and A. Gupta, “Zero-shot recognition via semantic embeddings and knowledge graphs,” in CVPR, pp. 6857–6866, 2018.
  • [6] Y. Lin, X. Han, R. Xie, Z. Liu, and M. Sun, “Knowledge representation learning: A quantitative review,” tech. rep., 2018.
  • [7] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in NIPS, pp. 2787–2795, 2013.
  • [8] Z. Wang, J. Zhang, J. Feng, and Z. Chen, “Knowledge graph embedding by translating on hyperplanes,” in AAAI, 2014.
  • [9] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu, “Learning entity and relation embeddings for knowledge graph completion,” in AAAI, 2015.
  • [10] Y. Wang, R. Gemulla, and H. Li, “On multi-relational link prediction with bilinear models,” in AAAI, 2018.
  • [11] R. Socher, D. Chen, C. Manning, and A. Ng, “Reasoning with neural tensor networks for knowledge base completion,” in NIPS, 2013.
  • [12] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, “Convolutional 2D knowledge graph embeddings,” in AAAI, 2018.
  • [13] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, “Knowledge vault: A web-scale approach to probabilistic knowledge fusion,” in SIGKDD, 2014.
  • [14] Y. Zhang, Q. Yao, W. Dai, and L. Chen, “AutoSF: Searching scoring functions for knowledge graph embedding,” in ICDE, pp. 433–444, 2020.
  • [15] B. Yang, W. Yih, X. He, J. Gao, and L. Deng, “Embedding entities and relations for learning and inference in knowledge bases,” in ICLR, 2015.
  • [16] T. Trouillon, C. R. Dance, É. Gaussier, J. Welbl, S. Riedel, and G. Bouchard, “Knowledge graph completion via complex tensor factorization,” JMLR, vol. 18, no. 1, pp. 4735–4772, 2017.
  • [17] S. Kazemi and D. Poole, “Simple embedding for link prediction in knowledge graphs,” in NeurIPS, pp. 4284–4295, 2018.
  • [18] H. Liu, Y. Wu, and Y. Yang, “Analogical inference for multi-relational embeddings,” in ICML, pp. 2168–2178, 2017.
  • [19] T. Lacroix, N. Usunier, and G. Obozinski, “Canonical tensor decomposition for knowledge base completion,” in ICML, pp. 2863–2872, 2018.
  • [20] I. Balazevic, C. Allen, and T. Hospedales, “Tucker: Tensor factorization for knowledge graph completion,” in EMNLP-IJCNLP, pp. 5188–5197, 2019.
  • [21] F. Hutter, L. Kotthoff, and J. Vanschoren, Automated Machine Learning: Methods, Systems, Challenges. Springer, 2018.
  • [22] Q. Yao and M. Wang, “Taking human out of learning applications: A survey on automated machine learning,” tech. rep., arXiv:1810.13306, 2018.
  • [23] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, “Efficient and robust automated machine learning,” in NIPS, pp. 2962–2970, 2015.
  • [24] B. Zoph and Q. Le, “Neural architecture search with reinforcement learning,” in ICLR, 2016.
  • [25] I. Balažević, C. Allen, and T. Hospedales, “Hypernetwork knowledge graph embeddings,” in ICANN, pp. 553–565, Springer, 2019.
  • [26] M. Nickel, L. Rosasco, and T. Poggio, “Holographic embeddings of knowledge graphs,” in AAAI, 2016.
  • [27] A. Rossi, D. Firmani, A. Matinata, P. Merialdo, and D. Barbosa, “Knowledge graph embedding for link prediction: A comparative analysis,” tech. rep., 2020.
  • [28] C. Meilicke, M. Fink, Y. Wang, D. Ruffinelli, R. Gemulla, and H. Stuckenschmidt, “Fine-grained evaluation of rule-and embedding-based systems for knowledge graph completion,” in ISWC, pp. 3–20, 2018.
  • [29] Y. Xue, Y. Yuan, Z. Xu, and A. Sabharwal, “Expanding holographic embeddings for knowledge completion,” in NeurIPS, pp. 4491–4501, 2018.
  • [30] B. Colson, P. Marcotte, and G. Savard, “An overview of bilevel optimization,” Annals of Operations Research, vol. 153, no. 1, pp. 235–256, 2007.
  • [31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [32] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
  • [33] J. Bergstra, D. Yamins, and D. D. Cox, “Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures,” in ICML, 2013.
  • [34] L. Xie and A. Yuille, “Genetic CNN,” in ICCV, pp. 1388–1397, 2017.
  • [35] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” in ICML, pp. 4092–4101, 2018.
  • [36] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” in ICLR, 2018.
  • [37] Q. Yao, J. Xu, W.-W. Tu, and Z. Zhu, “Efficient neural architecture search via proximal iterations,” in AAAI, 2020.
  • [38] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, “Understanding and simplifying one-shot architecture search,” in ICML, pp. 549–558, 2018.
  • [39] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” ML, vol. 8, no. 3-4, pp. 229–256, 1992.
  • [40] M. Nickel, V. Tresp, and H. Kriegel, “A three-way model for collective learning on multi-relational data.,” in ICML, vol. 11, pp. 809–816, 2011.
  • [41] K. Toutanova and D. Chen, “Observed versus latent features for knowledge base and text inference,” in Workshop on CVSMC, pp. 57–66, 2015.
  • [42] A. Dempster, N. M. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.
  • [43] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [44] L. Li and A. Talwalkar, “Random search and reproducibility for neural architecture search,” tech. rep., 2019.
  • [45] Z. Sun, Z. Deng, J. Nie, and J. Tang, “RotatE: Knowledge graph embedding by relational rotation in complex space,” in ICLR, 2019.
  • [46] S. Zhang, Y. Tay, L. Yao, and Q. Liu, “Quaternion knowledge graph embeddings,” in NeurIPS, pp. 2731–2741, 2019.
  • [47] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “Pytorch: An imperative style, high-performance deep learning library,” in NeurIPS, pp. 8024–8035, 2019.
  • [48] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” JMLR, vol. 12, no. Jul, pp. 2121–2159, 2011.
  • [49] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2014.
  • [50] C. Meilicke, M. W. Chekol, D. Ruffinelli, and H. Stuckenschmidt, “Anytime bottom-up rule learning for knowledge graph completion,” in IJCAI, pp. 3137–3143, 2019.
  • [51] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” tech. rep., 2018.
  • [52] Y. Zhang, Q. Yao, and L. Chen, “Interstellar: Searching recurrent architecture for knowledge graph embedding,” in NeurIPS, 2019.

Details of Implementing $\text{ERAS}^{\text{dif}}$

We propose a supernet view of the scoring function search space as in (8). This supernet design allows us to employ differentiable OAS methods when the loss $\mathcal{L}$ is used as $\mathcal{M}_{\text{val}}$. Following NASP [37], $\text{ERAS}^{\text{dif}}$ updates the architecture weight $\bm{A}$ by gradient descent as:

\displaystyle\bm{A}\leftarrow\bm{A}-\epsilon\sum\nolimits_{n}\sum\nolimits_{(h,r,t)\in S_{\text{val}}}\nabla_{\bm{A}}\big[B_{rn}\cdot\ell\big(f_{n}(\bm{h},\bm{r},\bm{t})\big)\big].

Then (3) can be optimized by the above update. The other steps of $\text{ERAS}^{\text{dif}}$ are the same as in ERAS.
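The update above can be translated almost literally into code. The sketch below is a hedged illustration: the helper `supernet_scores` (returning one differentiable score per group for a triplet), the choice of a softplus loss for $\ell$, and the tensor shapes are assumptions made for the example, not the exact ERAS implementation.

import torch

def update_architecture_weights(A, B, valid_triplets, supernet_scores, eps=0.01):
    # A: architecture weights with requires_grad=True.
    # B: relation-to-group assignment matrix of shape (num_relations, N).
    # supernet_scores(h, r, t, A): hypothetical helper returning a tensor of N
    #   scores f_n(h, r, t), each differentiable with respect to A.
    total = torch.zeros(())
    for (h, r, t) in valid_triplets:
        scores = supernet_scores(h, r, t, A)
        losses = torch.nn.functional.softplus(-scores)   # l(f_n(h, r, t)) per group n
        total = total + (B[r] * losses).sum()            # weight group n by B_{rn}
    grad_A, = torch.autograd.grad(total, A)              # gradient of the summed validation loss
    with torch.no_grad():
        A -= eps * grad_A                                # one gradient-descent step on A
    return A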