Self-supervised Representation Learning for Evolutionary Neural Architecture Search

Chen Wei, Yiping Tang, Chuang Niu, Haihong Hu, Yue Wang and Jimin Liang Manuscript received xxxx xx, 2020; revised xxxx xx, 2020; accepted xxxx xx, 2020. This work was supported in part by the National Natural Science Foundation of China under Grants U19B2030 and 61976167, and in part by the Xi’an Science and Technology Program under Grant 201809170CX11JC12. (Corresponding author: Jimin Liang.)Chen Wei is with the School of Electronic Engineering, Xidian University, Xi’an, Shaanxi 710071, China, and also with the College of Economics and Management, Xi’an University of Posts&Telecommunications, Xi’an, Shaanxi 710061, China (e-mail: weichen_3@stu.xidian.edu.cn).Yiping Tang, Haihong Hu, Yue Wang and Jimin Liang are with School of Electronic Engineering, Xidian University, Xi’an, Shaanxi 710071, China (e-mail: tangyiping@aliyun.com; hhhu@mail.xidian.edu.cn; wangyue1991@stu.xidian.edu.cn; jimleung@mail.xidian.edu.cn).Chuang Niu is with the Department of Biomedical Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA (e-mail: niuc@rpi.edu)

Abstract

Recently proposed neural architecture search (NAS) algorithms adopt neural predictors to accelerate the architecture search. The capability of neural predictors to accurately predict the performance metrics of neural architecture is critical to NAS, and the acquisition of training datasets for neural predictors is time-consuming. How to obtain a neural predictor with high prediction accuracy using a small amount of training data is a central problem to neural predictor-based NAS. Here, we firstly design a new architecture encoding scheme that overcomes the drawbacks of existing vector-based architecture encoding schemes to calculate the graph edit distance of neural architectures. To enhance the predictive performance of neural predictors, we devise two self-supervised learning methods from different perspectives to pre-train the architecture embedding part of neural predictors to generate a meaningful representation of neural architectures. The first one is to train a carefully designed two branch graph neural network model to predict the graph edit distance of two input neural architectures. The second method is inspired by the prevalently contrastive learning, and we present a new contrastive learning algorithm that utilizes a central feature vector as a proxy to contrast positive pairs against negative pairs. Experimental results illustrate that the pre-trained neural predictors can achieve comparable or superior performance compared with their supervised counterparts with several times less training samples. We achieve state-of-the-art performance on the NASBench-101 and NASBench201 benchmarks when integrating the pre-trained neural predictors with an evolutionary NAS algorithm.

Index Terms:

Neural Architecture Search, Self-Supervised Learning, Neural Predictor, Evolutionary Algorithm, Graph Neural Network.

I Introduction

Neural architecture search (NAS) refers to the use of certain search strategies to find the best performing neural architecture in a pre-defined search space with minimum searching costs [1]. The search strategies sample the potentially promising neural architectures from the search space and the performance metrics of the sampled architectures, obtained from time-consuming training and validation procedures, are used to optimize the search strategies. To alleviate the time cost of training and validation procedures, some recently proposed NAS search strategies employ neural predictors to accelerate the performance estimation of sampled architectures [2, 3, 4, 5, 6]. The capability of neural predictors to accurately predict the performance of sampled architectures is critical to downstream search strategies [2, 5, 6, 7, 8].Because of the significant time cost of obtaining labeled training samples, how to acquire accurate neural predictors using a fewer number of training samples is one of the key issues in NAS methods employing neural predictors.

Self-supervised representation learning, a type of unsupervised representation learning, has been successfully applied in areas such as image classification [9, 10] and natural language processing [11]. If a model is pre-trained by an effective self-supervised representation learning and then fine-tuned by supervised learning using a few labeled training data, then it is highly likely to outperform its supervised counterparts [9, 10, 12]. In this paper, we novelty study and apply the self-supervised representation learning to the NAS domain to enhance the performance of neural predictors built from graph neural networks [13] and employ it to the downstream evolutionary search strategy.

Effective unsupervised representation learning falls into one of two categories: generative or discriminative [9]. Existing unsupervised representation learning methods for NAS [8, 14] belong to the generative category. Their learning objective is to make the neural predictor correctly reconstruct the input neural architecture, but it has limited relevance to NAS. This may result in the trained neural predictor producing a less effective representation of the input neural architecture. Discriminative unsupervised representation learning, also known as self-supervised learning, requires designing a pretext task [15, 16] from an unlabeled dataset and using it as supervision to learn meaningful feature representation. Inspired by previous findings that “close by” architectures tend to have similar performance metrics [17, 18], we adopt the graph edit distance (GED) as a supervision to carry out self-supervised learning because GED can reflect the distance of different neural architectures in the search space. Commonly used GED is computed based on the graph encoding of two different neural architectures (adjacency matrices and node operations) [17], but this scheme cannot identify isomorphic graphs. Path-based encoding [3] is another commonly used neural architecture encoding scheme, but it can not recognize the position of each operation in the neural architecture, e.g., two different operations in a neural architecture may have the same path-encoding vectors. To overcome the above drawbacks, we propose a new neural architecture encoding scheme denoted as position-aware path-based encoding, which can identify isomorphic graphs and recognize the position of different operations in neural architectures.

Since different pretext tasks may lead to different feature representations, utilizing the GED, we devise two self-supervised learning methods from two different perspectives to improve the feature representation of neural architectures, and to investigate the effect of different pretext tasks on the predictive performance of neural predictors. The first method utilizes a handcrafted pretext task, while the second one learns feature representation by contrasting positive pairs against negative pairs.

The pretext task of the first self-supervised learning method is to predict the normalized GED of two different neural architectures in the search space. We design a model with two independent identical branches and use the concatenation of their output features to predict the normalized GED. After the self-supervised pre-training, we adopt only one branch of the model to build neural predictor. This method is denoted as self-supervised regression learning.

The second self-supervised learning method is inspired by the prevalently contrastive learning for image classification [9, 10, 12], which maximizes the agreement between differently augmented views of the same image via a contrastive loss in the latent space [9]. Since there is no guarantee that a neural architecture and its transformed form will have the same performance metrics, it is not reasonable to directly apply the contrastive learning to NAS. We propose a new contrastive learning algorithm, termed central contrastive learning, that uses the feature vector of a neural architecture and its nearby neural architectures’ feature vectors (with small GEDs) to build a central feature vector. Then the contrastive loss is utilized to tightly aggregate the feature vectors of the architecture and its nearby architectures onto the central feature vector and push the feature vectors of other neural architectures away from the central feature vector. This method is indicated as self-supervised central contrastive learning.

After self-supervised pre-training, two neural predictors are built by connecting a fully connected layer to the architecture embedding modules of the pre-trained models. Finally, we integrate the pre-trained neural predictors into the neural predictor guided evolution neural architecture search (NPENAS) algorithm [6] to verify their performance.

Our main contributions can be summarized as follows.

•

We present a new vector encoding scheme, termed position-aware path-based encoding, to overcome the drawbacks of the adjacency matrix encoding and path-based encoding methods. The scheme is more efficient than path-based encoding, and the experimental results illustrate its superiority to filter out isomorphic graphs.
•

We propose a self-supervised regression learning method that defines a pretext task to predict the normalized GED of two different neural architectures and design a neural network with two independent identical branches to learn meaningful representation of neural architectures. After the self-supervised pre-training, a neural predictor is constructed and fine-tuned. The neural predictor pre-trained by this method achieves its performance upper bound with a small search budget and a few training epochs, and in the best case, achieves better performance using ten times less search budget than it supervised counterparts.
•

We present a central contrastive learning algorithm that forces neural architectures with small GED to lean closer together in the feature space, while neural architectures with large GED divide further apart. When trained for more epochs, the pre-trained neural predictor fine-tuned with half search budget can achieve comparable performance to its supervised counterparts, and with the same search budget, the fine-tuned neural predictor outperforms their supervised counterparts by about 1.5 times. The proposed central contrastive learning algorithm can also be extended to the domain of graph unsupervised representation learning without any modifications.
•

When integrating the pre-trained neural predictors to the NPENAS, we achieve state-of-the-art performance on the NASBench-101 [17] and NASBench-201 [19] benchmarks. The searched neural architectures have comparable or equal results to the ORACLE baseline (performance upper bound).

II Related Works

II-A Neural Architecture Search

Due to the huge size of the pre-defined search space, NAS usually search for the potential superiority neural network architectures by utilizing a search strategy. Reinforcement learning (RL) [20, 21, 22, 2], evolutionary algorithms [23, 24, 25, 26, 27, 6], gradient-based methods [28, 29, 30, 31, 32], Bayesian optimization (BO) [33, 3, 6], and predictor-based methods [7, 4, 5, 8] are the commonly used search strategies. A search strategy adjusts itself by exploiting the selected neural architectures’ performance metrics to explore the search space better.

As it is time-consuming to estimate the performance metrics of a given neural architecture through the training and validation procedures, many performance estimation strategies are proposed to speed up this task. Commonly used strategies include using a proxy dataset and proxy architecture, early stopping, inheriting weights from a trained architecture, and weight sharing [1]. A neural predictor that is employed to estimate the performance metrics of the neural network architectures can also be recognized as a kind of performance estimation strategy. Recently, many NAS algorithms have adopted neural predictors to explore the search space [2, 3, 7, 4, 5, 6]. The capability of neural predictors to accurately predict the performance of neural architectures is critical for NAS algorithms using neural predictors. The neural predictors are trained on a training dataset that is composed by some neural architectures together with their corresponding performance metrics, which are acquired through the time-consuming training and validation procedures.

In this paper, we novelty apply the self-supervised representation learning to the NAS domain. We propose two self-supervised representation learning methods to improve the feature representation of neural predictors that are built from graph neural networks, thus enhance the prediction performance of neural predictors.

II-B Neural Architecture Encoding Scheme

The neural architecture is usually defined as a direct acyclic graph (DAG). The adjacency matrix of the graph is used to represent the connections of operations, and the nodes are used to represent the operations. The commonly used neural architecture encoding schemes can be categorized into vector encoding scheme and graph encoding scheme.

Adjacency matrix encoding [4, 29, 34] and path-based encoding [3] are two frequently used vector encoding schemes. The adjacency matrix encoding is the concatenation of the flattened adjacency matrix and the one-hot encoding vector of each node, but it can not identify the isomorphic graphs [8]. The path-based encoding is the encoding of the input-to-output paths of the neural architecture, but as demonstrated in Appendix A, this scheme can not recognize the position of operations in the neural architecture. The graph encoding scheme represents the neural architecture by its adjacency matrix and the one-hot encoding of each node.

In this paper, we propose a new vector encoding scheme denoted as position-aware path-based encoding. This encoding scheme can identify the isomorphic graphs and recognize the position of operations in the neural architecture. We adopt the graph encoding scheme and employ a graph neural network [13] to embed the neural architecture into feature space. Since the graph encoding scheme can not identify isomorphic graphs [8], the position-aware path-based encoding is used firstly to filter out isomorphic graphs. We also utilize the position-aware path-based encoding to calculate the GED of different neural architectures.

II-C Unsupervised Representation Learning for NAS

Unsupervised representation learning methods fall into to two categories - generative and discriminative [9]. The learning objective of existing generative unsupervised learning methods for NAS, arc2vec [8] and NASGEM [14], is to reconstruct the input neural architectures using an encoder-decoder network, which has little relevance to NAS. Moreover, the arc2vec [8] adopts the variational autoencoder [35] to embed the input neural architectures into a high dimensional continuous feature space, and the feature space is assumed to follow the Gaussian distribution. Since there is no guarantee that the real underlying distribution of the feature space is Gaussian, this assumption may harm the representation of neural architectures. The NASGEM [14] adds a similarity loss to improve the feature representation. However, the similarity loss only considers the adjacency matrix of the input neural architecture and ignores the node operations, resulting in the failure to identify isomorphic graphs.

We present two self-supervised learning methods to the domain of NAS. The first one is inspired by unsupervised graph representation learning. GMNs [36] adopts graph neural network as building blocks and presents a cross-graph attention-based mechanism to predict the similarity of the two input graphs. SimGNN [37] takes two graphs as input, embeds the graph and each node of the graph into the feature space using a graph convolutional neural network, and then uses graph feature similarity and node feature similarity to predict the similarity of the input graphs. UGRAPHEMB [38] takes two graphs as input, adopts graph isomorphism network (GIN) [39] to embed the input graphs into feature space, and utilizes a multi-scale node attention mechanism to predict the similarity of the input graphs. Our work is quite like UGRAPHEMB, but we design a new neural network model without using the complex multi-scale node attention and apply the unsupervised learning to the field of neural architecture representation learning.

The second method is inspired by the contrastive learning for image classification. The contrastive learning used in image classification forces the image and its transformations to be similar in the feature space [9, 10, 40]. Since there is no guarantee that a neural architecture and its transformed form will have the same performance metrics, it is not reasonable to directly apply the contrastive learning for image classification to the NAS domain. We propose a new contrastive learning algorithm, central contrast learning, to learn meaningful representation of neural architectures. To our best knowledge, this is the first study applying contrastive learning to the NAS domain and the field of unsupervised graph similarity learning.

III Methodology

To enhance the prediction performance of neural predictors, we propose two self-supervised representation learning methods to improve the feature representation ability of the neural predictors. We design a new neural architecture encoding scheme to calculate the GED of graphs in Section III-B. The self-supervised regression learning that utilizes a carefully designed model with two independent identical graph neural network branches to predict the GED of neural architectures is discussed in Section III-C. The self-supervised central contrastive learning is introduced in Section III-D. The utilization of the pre-trained neural predictors for the downstream search strategies is elaborated in Section III-E.

Refer to caption — Figure 1: Overview of the position-aware path-based encoding. (a) A neural architecture in the NASBench-101 search space. The green and red lines indicate two input-to-output paths. (b) Operations and their corresponding unique indices. (c) Two different input-to-output paths and their operation indices. (d) Position-aware path-based encoding of the two input-to-output paths in (c).

III-A Problem Formulation

In a pre-defined search space $S$ , a neural architecture $s$ can be represented as a DAG

s=(V,E),\quad s\in S,

(1)

where $V=\{v_{i}\}_{i=1:H}$ is the set of nodes representing operations in $s$ , $E=\{v_{i},v_{j}\}_{i,j=1:H}$ is the set of edges describing the connection of operations, and $H$ is the number of nodes.

Predictor-based NAS adopts a neural predictor modeled as

\hat{y}=f(s),

(2)

where $f$ is the neural predictor, and it takes a neural architecture $s$ as input and outputs the performance metric prediction $\hat{y}$ of $s$ .

III-B Position-aware Path-based Encoding

Since the proposed self-supervised learning methods utilize the GED to measure the similarity of different neural architectures, it is critical to calculate the GED effectively. We present a new vector encoding scheme, position-aware path-based encoding, which improves on the path-based encoding [3] by recording the position of each operation in the path. The scheme consists of two steps: generating the position-aware path-based encoding vectors for the input-to-output paths of the neural architecture, and concatenating the vectors of all the paths.

As shown in Eq 1, a neural architecture can be defined by a DAG with its nodes representing the operations in the neural architecture. The DAG consists of an input node, some operation nodes and an output node, connected in sequence. The adjacency matrix of the DAG is used to represent the connections of the different nodes. Since each node in the DAG has a fixed position, we assign each node with a unique index, which implies that each operation associated with the node has a unique index.

NASBench-101 [17] is a widely used NAS search space. It contains three different operations: convolution 3 $\times$ 3, convolution 1 $\times$ 1, and max-pool 3 $\times$ 3. Fig. 1a illustrates a neural architecture in the NASBench-101 search space and uses green and red lines to indicate two different input-to-output paths. The operations and their corresponding indices of the neural architecture are shown in Fig. 1b. In Fig. 1c, we demonstrate two input-to-output paths of the neural architecture in Fig. 1a. Unlike the path-based encoding [3], when we extract all the input-to-output paths in the neural architecture, we also record the index of operations in the input-to-output paths. The position-aware path-based encoding of the two different paths in Fig. 1c is indicated in Fig. 1d. The vector length of each input-to-output path is fixed, which equals the multiplication of the number of operation nodes and the number of operation types. We traverse all the operation nodes in the neural architecture. If an operation node appears in the input-to-output path, then the operation is represented by its one-hot operation type vector; otherwise, it is represented by a zero vector. Since there are three different operations in the NASBench-101, the length of the one-hot operation vector and the zero vector is three.

The final encoding vector is the concatenation of the position-aware path-based encoding vectors for all the input-to-output paths in the neural architecture. To keep the concatenation consistent, we design the following steps:

1.

Firstly, sort all the input-to-output paths in ascending order by path length.
2.

Secondly, sort all the input-to-output paths of the same path length in ascending order by the operation index.
3.

Finally, concatenate the sorted input-to-output paths’ position-aware path-based encoding vectors.

The vector length of path-based encoding [3] increases exponentially with the number of operation nodes, whereas the vector length of the position-aware path-based encoding increases linearly with the number of input-to-output paths. Therefore, the position-aware path-based encoding is a more efficient vector encoding scheme than path-based encoding. As the number of input-to-output paths may be different in different neural architectures, we will pad the short vectors with zeros to keep all vectors have the same length.

III-C Self-supervised Regression Learning

The pretext task of the proposed self-supervised regression learning is to predict the normalized GED of two input neural architectures. The GED is defined as

GED(s_{i},s_{j})=\sum_{k=i}^{K}\lvert\textbf{p}_{i}^{k}-\textbf{p}_{j}^{k}\rvert,\quad s_{i},s_{j}\in S,

(3)

where $\textbf{p}_{i}$ and $\textbf{p}_{j}$ are the position-aware path-based encoding vectors of architecture $s_{i}$ and $s_{j}$ , $\textbf{p}_{i}^{k}$ and $\textbf{p}_{j}^{k}$ are the $k^{th}$ elements of the position-aware encoding vector $\textbf{p}_{i}$ and $\textbf{p}_{j}$ , and $K$ is the vector length.

Following [37], we define the normalized GED as

nGED(s_{i},s_{j})=exp^{-dist},\quad where\quad dist=\frac{GED(s_{i},s_{j})}{|V|},

(4)

where $|V|$ is the number of nodes in the neural architectures.

As the architecture in search space is represented as a DAG, it is straightforward to adopt graph neural networks to aggregate features for each node and generate the graph embedding by averaging the nodes’ features. We design self-supervised models and neural predictors utilizing the spatial-based graph neural network GIN layers.

Since the pretext task is to predict the normalized GED of two different neural architectures, we design a regression model $f_{rl}$ that consists of two independent identical graph neural network branches, as illustrated in Fig. 2. Each branch is composed of three sequentially connected GIN layers and a global mean pooling (GMP) layer. The GMP layer outputs the mean of node features of the last GIN layer. The outputs of the two branches are concatenated, and then sent to two sequentially connected fully connected layers to predict the two input architectures’ normalized GED. The regression loss function to optimize the parameters $w_{rl}$ of $f_{rl}$ is formulated as

w_{rl}^{*}=\arg\min_{w_{rl}}\sum_{(s_{i},s_{j})\in S}(f_{rl}(s_{i},s_{j})-nGED(s_{i},s_{j}))^{2}.

(5)

After the self-supervised pre-training, we can select any branch of $f_{rl}$ to embed the neural architectures into feature space. We design a neural predictor by connecting a fully connected layer to the architecture embedding module (as illustrated in the red rectangle of Fig. 2) of the pre-trained models. The regression loss is employed to fine-tune the neural predictor. The parameters of the neural predictor, denoted as $w$ , are optimized as

w^{*}=\arg\min_{w}\sum_{s_{i}\in S}(f(s_{i})-y_{i})^{2},

(6)

where $y_{i}$ is the performance metric of $s_{i}$ .

III-D Self-supervised Central Contrastive Learning

We present a central contrastive learning algorithm to force neural architectures with small GED to lean closer together in the feature space, while neural architectures with large GED divide further apart.

As illustrated in Fig. 3, we design a model $f_{ccl}$ to embed the neural architecture into feature space. Following the SimCLR’s [9], $f_{ccl}$ consists of a neural architecture embedding module, a non-linear fully connected layer, and a fully connected layer. For a fair comparison, the architecture embedding module is identical to that of $f_{rl}$ . After the self-supervised central contrastive pre-training, we connect a fully connect layer to the architecture embedding module to predict the input neural architecture’s performance.

Given a batch of neural architectures $S_{b}=\{s_{k}\}_{k=1}^{N}$ and a neural architecture $s_{i}\in S_{b}$ , we first calculate the minimum GED of $s_{i}$ to all other architectures $s_{j}\in S_{b}$ . Second, we collect the neural architectures $s_{k}\in S_{b}$ that have the same $GED(s_{i},s_{k})$ as the minimum GED, and denote the set of collections as $S_{pos}$ . We also put $s_{i}$ into $S_{pos}$ . The set of neural architectures in this batch but not in set $S_{pos}$ is denoted as $S_{neg}$ . We then use the model $f_{ccl}$ to embed all the neural architectures in $S_{pos}$ and $S_{neg}$ into the feature space, and denote $E_{pos}$ as the feature vector set of $S_{pos}$ and $E_{neg}$ as the feature vector set of $S_{neg}$ . A central vector $\textbf{e}_{c}$ is calculated by averaging all the feature vectors in $E_{pos}$ . At last, the contrastive loss is used to aggregate all the feature vectors in $E_{pos}$ to the central vector $\textbf{e}_{c}$ and push the feature vectors in $E_{neg}$ far away from $\textbf{e}_{c}$ . An example of the central contrastive learning is illustrated in Appendix B.

The detailed procedure of central contrastive learning is summarized in the Algorithm. 1, where the feature vectors $\textbf{e}_{j}$ and $\textbf{e}_{c}$ are normalized vectors.

Algorithm 1 Central Contrast Learning

1: Input: batch size

N

, number of training architectures

M

and

M\leq N

, temperature

\tau

, regularization weight

\lambda

, model

f_{ccl}

2: for sampled minibatch

S_{b}=\{s_{k}\}_{k=1}^{N}

E_{c}=\varnothing

4: for all

t\in\{1,\ldots,M\}

5: randomly draw one neural architecture

s_{i}\in S_{b}

g_{min}=\min GED(s_{i},s_{j})

where

j=\{1,\ldots,i-1,i+1,\ldots,N\}

GED

: Eq 3

E_{pos}=\varnothing

and

E_{neg}=\varnothing

8: for all

j\in\{1,\ldots,i-1,i+1,\ldots N\}

\textbf{e}_{j}=f_{ccl}(s_{j})

10: if

GED(s_{i},s_{j})==g_{min}

then

11:

E_{pos}\leftarrow E_{pos}\cup\textbf{e}_{j}

12: else

13:

E_{neg}\leftarrow E_{neg}\cup\textbf{e}_{j}

14: end

15: end for

16:

\textbf{e}_{c}=\frac{1}{|E_{pos}|}*\sum_{\textbf{e}\in E_{pos}}\textbf{e}

# vector average

17:

E_{c}\leftarrow E_{c}\cup\textbf{e}_{c}

18: for all idx,

\textbf{e}_{p}\in E_{pos}

do # idx is the index of

\textbf{e}_{p}

E_{pos}

19:

E_{pair}=\varnothing

20:

sim_{p,c}=\textbf{e}_{p}^{T}\textbf{e}_{c}/(\tau\lVert\textbf{e}_{p}\rVert\lVert\textbf{e}_{c}\rVert)

21:

E_{pair}\leftarrow E_{pair}\cup sim_{p,c}

22: for all

\textbf{e}_{n}\in E_{neg}

23:

sim_{n,c}=\textbf{e}_{n}^{T}\textbf{e}_{c}/(\tau\lVert\textbf{e}_{n}\rVert\lVert\textbf{e}_{c}\rVert)

24:

E_{pair}\leftarrow E_{pair}\cup sim_{n,c}

25: end for

26:

l_{t,idx}=-\log\frac{\exp(sim_{p,c})}{\sum_{sim_{vec}\in E_{pair}}exp(sim_{vec})}

27: end for

28:

l_{t}=\sum_{idx}l_{t,idx}

29: end for

30:

L=\frac{1}{M}\sum_{t=1}^{M}l_{t}+\lambda L_{reg}(E_{c})

L_{reg}

: Eq 7

31: update model

f_{ccl}

to minimize

L

32: end for

33: return model

f_{ccl}

To reduce the interaction between the center vectors, we add a center vector regularization term to the loss function that forces each pair of center vectors to be orthogonal. The center vector regularization term is defined as

L_{reg}=\frac{1}{2}\sum_{0<i,j<M}(\textbf{E}\textbf{E}^{T}-diag(\textbf{E}\textbf{E}^{T})),\quad\textbf{E}\in\mathbb{R}^{(M*d)},

(7)

where $M$ is the number of training architectures, $d$ is the dimension of the vector, E is the matrix of center vectors with each row representing a center vector, and $i,j$ is the row and column indices of the matrix generated by the matrix multiplication $\textbf{E}\textbf{E}^{T}$ , respectively.

In each batch of neural architectures, as one neural architecture can have the same minimum GED with several other neural architectures in this batch, we may only need to use $M$ neural architectures to optimize model $f_{ccl}$ . After the pre-training, we adopt the regression loss like Eq 6 to fine-tune the parameters of the neural predictor.

III-E Fixed Budget NPENAS

NPENAS [6] combines the evolutionary search strategy with a neural predictor and utilizes the neural predictor to guide the evolutionary search strategy to explore the search space. We integrate a pre-trained neural predictor with NPENAS to illustrate the performance gains that result from applying self-supervised learning to NAS.

Since our experiments demonstrate that the neural predictor built from a self-supervised pre-trained model can significantly outperform its supervised counterpart and achieve comparable performance with a smaller training dataset, we modify the NPENAS method to utilize only a fixed search budget to carry out the neural architecture search.

We summarize the fixed budget NPENAS in Algorithm. 2, which is modified from the NPENAS [6], and we only demonstrate the different parts.

Algorithm 2 Fixed Budget NPENAS

1: Input: initial population size

n_{0}

, initial population

D=\{(s_{i},y_{i}),i=1,2,\cdots,n_{0}\}

, neural predictor

f

, number of the total searched architectures

total\_num

, number of the evaluated architectures (budget) to fine-tune neural predictor

ft\_num

2: for

n

from

n_{0}

total\_num

3: if

n\leq ft\_num

then

4: Initialize the weights of neural predictor

f

with the weights from the pre-trained model

5: Fine-tune the neural predictor

f

with dataset

D=\{(s_{i},y_{i}),i=1,2,\cdots,n\}

6: end

7: Utilize the the neural predictor

f

to guide the evolutionary neural architecture search. # Detailed code can be found in the Algorithm 2 of NPENAS [6].

8: end for

9: Output:

s^{*}=\arg\min(y_{i}),{(s_{i},y_{i})\in D}

IV Experiments and Analysis

In this section, we conduct experiments to illustrate that the performance of our designed neural predictors can be significantly improved by utilizing self-supervised learning to pre-train the neural predictors’ architecture embedding modules. We also demonstrate that integrating the designed pre-trained neural predictors with NPENAS is beneficial for NAS.

All the experiments are implemented in Pytorch [41]. We use the implementation of GIN from the publicly available graph neural network library pytorch_geometric [42]. The code of this paper is provided at [43].

IV-A Benchmark Datasets

We perform all the experiments on the NASBench-101 [17] and NASBench-201 [19] benchmarks.

NASBench-101

The NASBench-101 [17] contains $423$ k neural architectures, and each architecture is trained three times on the CIFAR-10 [44] training dataset independently. The structure of the neural architectures, as well as their validation accuracies, test accuracies corresponding to the three independently training on the CIFAR-10, are reported. The architecture in this search space is defined by DAG, utilizing nodes to represent the operations of the neural architecture and using the adjacency matrix to represent the connection of different operations. Only convolution $1\times 1$ , convolution $3\times 3$ , and max-pool $3\times 3$ are allowed to be used to build the neural architectures. The best architecture achieves a mean test error of $5.68\%$ , and the mean test error of the architecture with the best validation error is $5.77\%$ .

NASBench-201

The NASBench-201 [19] is a recently proposed NAS benchmark, and it contains $15.6$ k trained architectures for image classification. Each architecture is trained once on the CIFAR-10, CIFAR-100 [44], and ImageNet-16-120, and the ImageNet-16-120 is a down-sampled variant of ImageNet [45]. The structure of each architecture and its evaluation details such as training error, validation error, and test error of each architecture are reported. Each architecture is defined by a DAG, utilizing nodes to represent the feature maps and using the edges to represent the operation. The convolution $1\times 1$ , convolution $3\times 3$ , average pooling $3\times 3$ , skip connection, and zeroize operation are allowed to be used to construct the neural architectures. On the CIFAR-10, CIFAR-100, and ImageNet-16-120, the best test percentage errors are $8.48\%$ , $26.49\%$ , and $52.69\%$ , respectively. On the CIFAR-10, CIFAR-100, and ImageNet-16-120, architectures with the best validation error achieve the test percentage errors of $8.91\%$ , $26.49\%$ , and $53.8\%$ , respectively.

IV-B Prediction Analysis

Model Details

We first utilize the self-supervised regression learning to train the model $f_{rl}$ in Fig. 2 and the self-supervised central contrastive learning to train the model $f_{ccl}$ in Fig. 3. The architecture embedding module consists of three sequentially connected GIN layers. The hidden layer size of the GIN layer is 32, and each GIN layer is followed with a batch normalization and a ReLU layer. The hidden dimension size of the fully connected layer of the model $f_{rl}$ and $f_{ccl}$ is 16 and 8, respectively. After self-supervised pre-training, we construct the neural predictors by connecting the pre-trained architecture embedding modules with a single fully connected layer with the hidden dimension size of 8. The neural predictors constructed by the architecture embedding modules of $f_{rl}$ and $f_{ccl}$ are denoted as SS-RL and SS-CCL, respectively.

We employ the same neural architecture encoding method as the NPENAS [6]. The architecture in the NASBench-101 is represented by a $7\times 7$ upper triangle adjacency matrix and a collection of 6-dimensional one-hot encoded node features, and that in the NASBench-201 is represented by an $8\times 8$ upper triangle matrix and several 8-dimensional one-hot encoded node features.

Training Detials

The self-supervised regression learning utilizes $90\%$ of the neural architectures in NASBench-101 to pre-train the model $f_{rl}$ , the training epoch is 300, and the batch size is 64. We employ Adam optimizer [46] to optimize the parameters of the model $f_{rl}$ , the initial learning rate is $5\mathrm{e}{-4}$ , and the weight decay is $1\mathrm{e}{-4}$ . A cosine learning rate schedule [47] without restart is adopted to anneal down the learning rate to zero. The training details of self-supervised regression learning on NASBench-201 are the same as for NASBench-101.

The self-supervised central contrastive learning utilizes all the architectures in NASBench-101 to pre-train the model $f_{ccl}$ . The training epoch is 300, the regularization weight $\lambda$ is 0.5, and the temperature $\tau$ is $0.07$ . The batch size is $140$ k, the training architectures are $140$ k, and we drop each epoch’s last batch. When pre-train on the NASBench-201 benchmark, the batch size is $10$ k, the training architectures are $1$ k, and we also drop the last batch of each epoch. Other training details like the optimizer, learning rate, weight decay, and the learning rate schedule are identical with that for the self-supervised regression learning.

After pre-training, the neural architectures and their corresponding validation accuracies are used to fine-tune the neural predictors. The neural predictors are fine-tuned with an initial learning rate of $5\mathrm{e}{-5}$ and weight decay of $1\mathrm{e}{-4}$ . The optimizer and the learning schedule are the same as for the self-supervised pre-training.

Setup

The search budget and training epochs of neural predictors directly affect the time cost of NAS. Given that the search budget defines the size of the training dataset of the neural predictors, the search budget affects the time cost of NAS more than the number of training epochs. We conduct experiments to illustrate the performance of neural predictors under different search budgets. The supervised neural predictor, SS-RL and SS-CCL are compared under the search budgets of 20, 50, 100, 150 and 200, respectively. To illustrate the effect of training epochs, we also compare the performance of the neural predictors with different search budgets trained under 50, 100, 150, 200, 250, and 300 training epochs. The weights of the supervised neural predictor are randomly initialized. After fine-tuning, we evaluate the correlation between the validation accuracy of the neural architecture and its performance predicted by the neural predictors. The Kendall tau rank correlation is used for comparison. All the experiments results are averaged over 40 independent running using different random seeds.

Results

The predictive performance measurements of the neural predictors on NASBench-101 and NASBench-201 are shown in in Fig. 4 and Fig. 5, respectively.

On the NASBench-101 search space, the predictive performance of the two pre-trained neural predictors significantly outperform the supervised neural predictor. The pre-trained neural predictor SS-RL can achieve its best performance with less search budget and training epochs, and the performance gap between SS-RL and the supervised neural predictor gradually decreases with the increasing of training epochs and search budget. When the training epochs are larger than 250, the supervised neural predictor begins to outperform SS-RL using a large search budget, e.g., more than 150. The performance of SS-CCL consistently outperforms the supervised neural predictor and increases as the training epochs increase. With the increase of training epochs, the performance of SS-CCL gradually increases. When the training epochs are larger than 150 and the search budget larger than 100, the performance of SS-CCL begins to outperform the SS-RL. When the training epochs is small, SS-RL using ten times less training neural architectures can achieve better (Fig. 4(a)) or comparable (Fig. 4(b)) performance with the supervised neural predictor. At larger training epochs, SS-CCL using twice less training neural architectures can achieve comparable (Fig. 4(e)) or even better performance (Fig. 4(f)) than the supervised neural predictor.

On the NASBench-201 search space, the performance of SS-RL consistently outperforms SS-CCL and the supervised neural predictor. The predictive performance of SS-CCL gradually approaches that of SS-RL as training epochs increase. The performance of SS-RL using ten times less training neural architectures and only trained for 50 epochs can outperform (Fig. 5(a), Fig. 5(f)) the supervised neural predictor trained for 300 epochs. When the training epochs are larger than 150, the performance of SS-CCL trained with four times less neural architectures can outperform (Fig. 5(d), Fig. 5(e), Fig. 5(f))the supervised neural predictor.

In summary, SS-RL can achieve its best performance with fewer training epochs, while SS-CCL requires more training epochs to achieve its best performance. As the number of training epochs increases, SS-CCL outperforms SS-RL on the NASBench-101 and tends to outperform SS-RL on the NASBench-201.

IV-C Effect of Batch Size

Since the number of negative pairs of the central contrastive learning is determined by the batch size, in this section, we perform experiments to investigate the effect of different batch sizes on the performance of neural predictors. We compare the prediction performance of neural predictors using the batch sizes $N$ of 10k, 40k, 70k, and 100k, and denoted the neural predictors corresponding to different $N$ s as SS-CCL_10k, SS-CCL_40k, SS-CCL_70k, and SS-CCL_100k, respectively. We set the number of training architectures $M$ to be half of the batch sizes. To compare the performance of neural predictors with larger $M$ , we also include SS-CCL that is pre-trained with the batch size of 140k and the number of training architectures of 140k, and denoted it as SS-CCL_140k. All the results are averaged over 40 independently running, and each running uses a different seed.

As shown in Fig. 6, when we only take the neural predictors with the same training architectures $M$ (excluding SS-CCL_140k) into consideration, SS-CCL_40k is consistently better than other neural predictors, and SS-CCL_70k tends to be the worst compared with other pre-trained neural predictors. The above result is different from the findings in the contrastive learning for image classification [9] that the performance is consistently increasing with large batch size and more training epochs. The predictive performance of SS-CCL-140k has a slightly better performance than SSL-CCL_40k. Because the large batches generate more negative pairs, making the contrastive learning more difficult, we conjecture that as the batch size $N$ in Algorithm 1 increases, the number of training architecture $M$ should also increase. The predictive performance of all the pre-trained neural predictors continuously increases with increasing training epochs, and the performance gap between the different pre-trained neural predictors decreases with the increasing of training epochs and search budget.

IV-D Fixed Budget NPENAS

Setup

We integrate our pre-trained neural predictors with NPENAS [6] and denote the integration of neural predictors SS-RL and SS-CCL with NPENAS as NPENAS-SSRL and NPENAS-SSCCL, respectively. The fixed budget version of NPENAS-SSRL and NPENAS-SSCCL are denoted as NPENAS-SSRL-FIXED and NPENAS-SSCCL-FIXED, respectively. We adopt the same experimental setting as NPENAS, and compare with the random search (RS) [48], regularized evolutionary (REA) [24], BANANAS [3] with path-based encoding (BANANAS-PE), BANANAS with adjacency matrix encoding (BANANAS-AE), BANANAS with position-aware path-based encoding (BANANAS-PAPE), NPENAS-NP [6], and NPENAS-NP with fixed search budget (NPENAS-NP-FIXED). Each algorithm is given a search budget of 150 and 100 on the NASBench-101 and NASBench-201 search space, respectively. All the experiment results are averaged over 600 independent trails, every update of the population, each algorithm returns the architecture with the lowest validation error so far and reports its test error, so there are 15 or 10 best architectures in total. We also compare with the arc2vec [8] that is a recently proposed unsupervised representation learning for NAS, and directly adopt its reported results. As the search strategies employ the neural architectures’ validation error to explore the search space, a reasonable best performance of NAS is the test error of the neural architecture that has the best validation error in the search space, which is denoted as the ORACLE baseline [7]. We use the ORACLE baseline as the upper bound of performance.

IV-D1 NAS Results on NASBench-101

The comparison of different algorithms is illustrated in Fig. 7, and we also demonstrate the quantitative comparison of algorithms in Table I. As illustrated in Fig. 7, the performance of NPENAS-SSCCL is slightly better than NPENAS-SSRL, and NPENAS-NP achieves the best performance. The proposed position-aware path-based encoding is an efficient and effective encoding scheme. The performance of BANANAS [3] with position-aware path-based encoding is better than BANANAS with the path-based encoding. We utilize the position-aware path-based encoding to filter out the isomorphic graphs, while NPENAS employs the path-based encoding to filter out isomorphic graphs. Due to this difference, the performance of NPENAS-NP shown in Table I improves from $5.86\%$ to $5.83\%$ . Table I also shows our proposed self-supervised pre-trained neural predictors using a small search budget performs better than the unsupervised arch2vec.

TABLE I: Performance comparison of NAS algorithms on NASBench-101

[b] Methods Search Budget Test Err (%) Avg Architecture Embedding Search Method RA [48] 150 6.42 $\pm$ 0.2 – Random Search REA [24] 150 6.32 $\pm$ 0.2 Discrete Evolution BANANAS-PE [3] 150 5.9 $\pm$ 0.15 Supervised Bayesian Optimization BANANAS-AE [3] 150 5.85 $\pm$ 0.14 Supervised Bayesian Optimization BANANAS-PEAE [3] 150 5.86 $\pm$ 0.14 Supervised Bayesian Optimization NPENAS-NP [6] 150 5.83 $\pm$ 0.11 Supervised Evolution NPENAS-NP-FIXED [6] 90^† 5.9 $\pm$ 0.16 Supervised Evolution arch2vec-RL [8] 400 5.9 Unsupervised REINFORCE arch2vec-BO [8] 400 5.95 Unsupervised Bayesian Optimization NPENAS-SSRL 150 5.85 $\pm$ 0.13 Self-supervised Evolution NPENAS-SSRL-FIXED 90^† 5.88 $\pm$ 0.16 Self-supervised Evolution NPENAS-SSCCL 150 5.84 $\pm$ 0.12 Self-supervised Evolution NPENAS-SSCCL-FIXED 90^† 5.85 $\pm$ 0.13 Self-supervised Evolution

$\dagger$

The neural predictor is trained with 90 neural architectures, while the NPENAS algorithm needs 150 neural architectures.

As shown in Table I, the performance of NPENAS-NP has a large drop after switching to the fixed version, while NPENAS-SSRL and NPENAS-SSCCL only have a slight drop. We compare the performance of NPENAS-SSRL and NPENAS-SSCCL under differ search budget, and the results are shown in Table II. The performance of NPENAS-SSRL continuously increases with the increase of the search budget, while the NPENAS-SSCCL only using 80 neural architectures to achieve its best performance. From the above findings, the neural predictor SS-CCL is better than SS-RL when applying to the NPENAS.

TABLE II: Comparison of NAS algorithms on NASBench-101

[b] Methods Search Budget^† Test Err (%) Avg NPENAS-SSRL 20 6.1 $\pm$ 0.27 NPENAS-SSRL 50 5.93 $\pm$ 0.19 NPENAS-SSRL 80 5.88 $\pm$ 0.16 NPENAS-SSRL 110 5.86 $\pm$ 0.15 NPENAS-SSRL 150 5.85 $\pm$ 0.13 NPENAS-SSCCL 20 6.0 $\pm$ 0.2 NPENAS-SSCCL 50 5.87 $\pm$ 0.15 NPENAS-SSCCL 80 5.84 $\pm$ 0.13 NPENAS-SSCCL 110 5.84 $\pm$ 0.12 NPENAS-SSCCL 150 5.84 $\pm$ 0.12

$\dagger$

The neural predictor is trained with the given number of evaluated neural architectures, while the NPENAS algorithm needs 150 evaluate architectures.

IV-D2 NAS Results on NASBench-201

We compare above algorithms on the CIFAR-10, CIFAR-100, and ImageNet-16-120 on NASBench-201, and the results are shown in Fig. 8, Fig. 9 and Fig. 10, respectively. The quantitative comparison is presented in Table III. As arc2vec [8] does not report queries on this benchmark, we do not compare with it.

As can be seen in Fig. 8, Fig. 9, and Fig. 10, all algorithms can find neural architectures with good performance using a small search budget. As shown in Table III, NPENAS-SSCCL achieves the best performance on both CIFAR-100 and ImageNet-16-120 on NASBench-201, with nearly the ORACLE baseline on CIFAR-100 ( $26.5\%$ vs. $26.49\%$ ). On ImageNet-16-120, the performance of NPENAS-SSCCL is equivalent to the ORACLE baseline. On CIFAR-10 on NASBench-201, NPENAS-SSRL with the fixed search budget achieves the best performance, which is comparable with the ORACLE baseline ( $8.92\%$ vs $8.91\%$ ).

TABLE III: Performance comparison of NAS algorithms on NASBench-201

[b] Methods Search Budget Test Err (%) Avg CIFAR-10 Test Err (%) Avg CIFAR-100 Test Err (%) Avg ImageNet-16-120 RA [48] 100 9.26 $\pm$ 0.32 28.54 $\pm$ 0.87 54.62 $\pm$ 0.83 REA [24] 100 8.97 $\pm$ 0.22 27.16 $\pm$ 0.85 53.9 $\pm$ 0.67 BANANAS-PE [3] 100 9.05 $\pm$ 0.3 27.37 $\pm$ 0.99 53.82 $\pm$ 0.64 BANANAS-AE [3] 100 8.96 $\pm$ 0.16 26.75 $\pm$ 0.66 53.69 $\pm$ 0.38 BANANAS-PEAE [3] 100 8.94 $\pm$ 0.16 26.9 $\pm$ 0.74 53.7 $\pm$ 0.5 NPENAS-NP [6] 100 8.95 $\pm$ 0.13 26.74 $\pm$ 0.67 53.9 $\pm$ 0.62 NPENAS-NP-FIXED [6] 50^† 8.94 $\pm$ 0.13 26.7 $\pm$ 0.51 53.87 $\pm$ 0.57 NPENAS-SSRL 150 8.94 $\pm$ 0.1 26.51 $\pm$ 0.21 53.78 $\pm$ 0.43 NPENAS-SSRL-FIXED 50^† 8.92 $\pm$ 0.1 26.57 $\pm$ 0.29 53.68 $\pm$ 0.4 NPENAS-SSCCL 150 8.94 $\pm$ 0.1 26.5 $\pm$ 0.15 53.8 $\pm$ 0.32 NPENAS-SSCCL-FIXED 50^† 8.94 $\pm$ 0.11 26.57 $\pm$ 0.3 53.65 $\pm$ 0.36

$\dagger$

The neural predictor is trained with 50 evaluated neural architectures, while the NPENAS algorithm needs 150 evaluate architectures.

V Conclusion

We present a new neural architecture encoding scheme, position-aware path-based encoding, to calculate the GED of neural architectures. To enhance the performance of neural predictors, we propose two self-supervised learning methods to pre-train the neural predictors’ architecture embedding modules to generate a meaningful representation of neural architectures. Extensive experiments illustrate the superiority of the self-supervised pre-training. When integrating the pre-trained neural predictors with NPENAS, we achieve the state-of-the-art performance on the NASBench-101 and NASBench-201 benchmarks.

The experimental results show that the two self-supervised learning pre-trained neural predictors illustrate totally different behaviors. An in-depth investigation and theoretical analysis are needed to uncover the mechanism that leads the difference, which will help the design of a better self-supervised learning method for NAS. Combining the pre-trained neural predictors to other neural predictor-based NAS to verify the generalize ability of the pre-trained neural predictors is worthy for further study. Extending the interaction of pre-trained neural predictors with NPENAS to other tasks like image segmentation, object detection, and natural language processing is also a meaningful future work.

Appendix A The Operation Position Analyze of Path-based Encoding Scheme

Fig. 11 shows two neural network architectures from the NASBench-101 search space, and the mean percentage test accuracy of the two neural architectures are $91.2\%$ (11(a)) and $90.4\%$ (11(b)), respectively.

The two neural architectures in Fig. 11 have the same path-based encoding, as shown in Fig. 12. The red line path in Fig. 11(a) and Fig. 11(b) indicates the same input-to-output path that only contains a max-pool 3 $\times$ 3 operation. Although the red line path in the two neural architectures is identical, the position of the max-pool 3x3 operation in the two neural architectures are different, wherein Fig. 11(a), the input of the max-pool 3x3 operation is the input node and the convolution 1x1 operation, and in Fig. 11(b) the input of the max-pool 3x3 operation is the input node. The ignorance of the position of operations in the neural architecture caused the path-based encoding method to map the two different neural architectures in Fig. 11 into the same encoding vector.

Appendix B An Illustration of the Central Contrastive Learning

As illustrated in Fig. 4, the objective of the central contrastive learning is to aggregate the positive green features to the center vector $\textbf{e}_{c}$ and push the negative orange features far away from the center.

The central vector $\textbf{e}_{c}$ in Fig. 13 is calculated by average all the green feature vectors and can be formulated as

\textbf{e}_{c}=\sum_{i=1}^{3}\textbf{e}_{pi}.

(8)

The contrastive loss corresponding to Fig. 13 is defined as

l=\sum_{i=0}^{3}-\log\frac{\exp(sim(\textbf{e}_{pi},\textbf{e}_{c}))}{\exp(sim(\textbf{e}_{pi},\textbf{e}_{c}))+\sum_{k=1}^{5}\exp(sim(\textbf{e}_{nk},\textbf{e}_{c}))}.

(9)

References

[1] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” Journal of Machine Learning research, vol. 20, pp. 55:1–55:21, 2018.
[2] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, vol. 11205, 2018, pp. 19–35.
[3] C. White, W. Neiswanger, and Y. Savani. (2019) Bananas: Bayesian optimization with neural architectures for neural architecture search. [Online]. Available: https://arxiv.org/abs/1910.11858
[4] L. Wang, Y. Zhao, Y. Jinnai, Y. Tian, and R. Fonseca, “Neural architecture search using deep neural networks and monte carlo tree search,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 9983–9991.
[5] X. Ning, Y. Zheng, T. Zhao, Y. Wang, and H. Yang, “A generic graph-based neural architecture encoding scheme for predictor-based NAS,” in Proceedings of the European Conference on Computer Vision, 2020.
[6] C. Wei, C. Niu, Y. Tang, Y. Wang, H. Hu, and J. min Liang. (2020) Npenas: Neural predictor guided evolution for neural architecture search. [Online]. Available: https://arxiv.org/abs/2003.12857
[7] W. Wen, H. Liu, Y. Chen, H. H. Li, G. Bender, and P. Kindermans, “Neural predictor for neural architecture search,” in Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIX, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm, Eds., vol. 12374. Springer, 2020, pp. 660–676.
[8] S. Yan, Y. Zheng, W. Ao, X. Zeng, and M. Zhang. (2020) Does unsupervised architecture representation learning help neural architecture search? [Online]. Available: https://arxiv.org/abs/2006.06936
[9] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton. (2020) A simple framework for contrastive learning of visual representations. [Online]. Available: https://arxiv.org/abs/2002.05709
[10] K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick, “Momentum contrast for unsupervised visual representation learning,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. IEEE, 2020, pp. 9726–9735.
[11] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds., 2019, pp. 4171–4186.
[12] Y. M. Asano, C. Rupprecht, and A. Vedaldi, “Self-labelling via simultaneous clustering and representation learning,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
[13] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” IEEE transactions on neural networks and learning systems, 2020.
[14] H. Cheng, T. Zhang, S. Li, F. Yan, M. Li, V. Chandra, H. H. Li, and Y. Chen. (2020) NASGEM: neural architecture search via graph embedding method. [Online]. Available: https://arxiv.org/abs/2007.04452
[15] M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI, ser. Lecture Notes in Computer Science, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9910. Springer, 2016, pp. 69–84.
[16] S. Gidaris, P. Singh, and N. Komodakis, “Unsupervised representation learning by predicting image rotations,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
[17] C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter, “NAS-bench-101: Towards reproducible neural architecture search,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, vol. 97, 2019, pp. 7105–7114.
[18] J. You, J. Leskovec, K. He, and S. Xie, “Graph structure of neural networks,” in Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 12-18 July 2020, 2020.
[19] X. Dong and Y. Yang, “Nas-bench-201: Extending the scope of reproducible neural architecture search,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
[20] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
[21] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, vol. 80, 2018, pp. 4092–4101.
[22] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2017, pp. 8697–8710.
[23] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, “Large-scale evolution of image classifiers,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, vol. 70, 2017, pp. 2902–2911.
[24] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, 2019, pp. 4780–4789.
[25] Y. Sun, B. Xue, M. Zhang, G. G. Yen, and J. Lv, “Automatically designing cnn architectures using the genetic algorithm for image classification,” IEEE Transactions on Cybernetics, pp. 1–15, 2020.
[26] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Completely automated cnn architecture design based on blocks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 4, pp. 1242–1254, 2020.
[27] ——, “Evolving deep convolutional neural networks for image classification,” IEEE Transactions on Evolutionary Computation, vol. 24, no. 2, pp. 394–407, 2020.
[28] H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
[29] H. Zhou, M. Yang, J. Wang, and W. Pan, “Bayesnas: A bayesian approach for neural architecture search,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, vol. 97, 2019, pp. 7603–7613.
[30] S. Xie, H. Zheng, C. Liu, and L. Lin, “SNAS: stochastic neural architecture search,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
[31] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” pp. 1294–1303, 2019.
[32] Y. Xu, L. Xie, X. Zhang, X. Chen, G.-J. Qi, Q. Tian, and H. Xiong, “PC-DARTS: Partial channel connections for memory-efficient architecture search,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
[33] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing, “Neural architecture search with bayesian optimisation and optimal transport,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, 2018, pp. 2016–2025.
[34] B. Baker, O. Gupta, R. Raskar, and N. Naik, “Accelerating neural architecture search using performance prediction,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, 2018.
[35] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2014.
[36] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli, “Graph matching networks for learning the similarity of graph structured objects,” in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 2019, pp. 3835–3845.
[37] Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang, “Simgnn: A neural network approach to fast graph similarity computation,” in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, J. S. Culpepper, A. Moffat, P. N. Bennett, and K. Lerman, Eds. ACM, 2019, pp. 384–392.
[38] Y. Bai, H. Ding, Y. Qiao, A. Marinovic, K. Gu, T. Chen, Y. Sun, and W. Wang, “Unsupervised inductive graph-level representation learning via graph-graph proximity,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, S. Kraus, Ed. ijcai.org, 2019, pp. 1988–1994.
[39] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
[40] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin. (2020) Unsupervised learning of visual features by contrasting cluster assignments. [Online]. Available: https://arxiv.org/abs/2006.09882
[41] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 2019, pp. 8024–8035.
[42] M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,” in International Conference on Learning Representations Workshop on Representation Learning on Graphs and Manifolds, 2019.
[43] C. Wei. (2020) Code for Self-supervised Represetnation Learning for Evolutionary Neural Architecture Search. [Online]. Available: https://github.com/auroua/SSNENAS
[44] A. Krizhevsky and G. E. Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
[45] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, 2012, pp. 1106–1114.
[46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
[47] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
[48] L. Li and A. Talwalkar, “Random search and reproducibility for neural architecture search,” in Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, 2019, p. 129.