
Few-shot Learning for
Unsupervised Feature Selection

Atsutoshi Kumagai
NTT Computer and Data Science Laboratories
atsutoshi.kumagai.ht@hco.ntt.co.jp
Tomoharu Iwata
NTT Communication Science Laboratories
tomoharu.iwata.gy@hco.ntt.co.jp
Yasuhiro Fujiwara
NTT Communication Science Laboratories
yasuhiro.fujiwara.kh@hco.ntt.co.jp
Abstract

We propose a few-shot learning method for unsupervised feature selection, which is the task of selecting a subset of relevant features in unlabeled data. Existing methods usually require many instances for feature selection; however, sufficient instances are often unavailable in practice. The proposed method can select a subset of relevant features in a target task given a few unlabeled target instances by training with unlabeled instances in multiple source tasks. Our model consists of a feature selector and a decoder. The feature selector outputs a subset of relevant features, taking a few unlabeled instances as input, such that the decoder can reconstruct the original features of unseen instances from the selected ones. The feature selector uses Concrete random variables to select features via gradient descent. To encode task-specific properties from a few unlabeled instances into the model, the Concrete random variables and the decoder are modeled using permutation-invariant neural networks that take a few unlabeled instances as input. Our model is trained by minimizing the expected test reconstruction error given a few unlabeled instances, which is calculated with datasets in source tasks. We experimentally demonstrate that the proposed method outperforms existing feature selection methods.

1 Introduction

Feature selection is an important problem in machine learning that aims to reduce the dimensionality of data by identifying a subset of relevant features [8]. By extracting a small subset of features, we can more easily analyze and interpret the characteristics of datasets and accelerate the learning processes of subsequent tasks such as clustering and classification. In addition, when the full set of features is expensive or difficult to collect, feature selection can eliminate the cost of collecting irrelevant or redundant features. Thanks to these beneficial properties, feature selection methods have been widely used in various applications such as biomarker discovery [52], document categorization [14], disease diagnosis [1], and drug development [33].

Many feature selection methods have been proposed. Supervised methods use labeled instances for feature selection, such as the Least Absolute Shrinkage and Selection Operator (Lasso) [48] and kernel approaches [35, 53, 54]. Although these methods are effective, labeled instances are expensive or impossible to collect since labels need to be manually annotated by domain experts. On the other hand, unsupervised methods require only unlabeled instances; therefore, they can be used in a wider range of situations than supervised methods [20, 61, 56, 3]. Existing unsupervised feature selection methods usually require a relatively large number of instances. However, sufficient instances might be unavailable in real-world applications. For example, when we want to analyze data generated from industrial assets, sufficient instances are difficult to collect from newly introduced assets. When we want to analyze data obtained from persons for personalization [36], sufficient instances are also difficult to collect from new persons. In healthcare, when medical test results are used as features, the full set of features is expensive to collect, and thus collecting sufficient instances in a hospital can be difficult [7]. When sufficient data are unavailable, existing methods perform poorly since overfitting easily occurs [32, 46, 53]. Thus, feature selection with a few instances is a critical problem [32, 23].

In this paper, we propose a few-shot learning method for unsupervised feature selection that enables us to select the subset of relevant features in a task of interest, called a target task, given a few unlabeled instances. To alleviate the lack of instances, the proposed method trains the model using unlabeled instances in multiple related tasks, called source tasks. In the above examples, other industrial assets/persons/hospitals can be regarded as related tasks. When target and source tasks are related, we can transfer useful knowledge from the source tasks to the target task [39]. Figure 1 shows our problem formulation.

Our model consists of two components: a feature selector and decoder, which are based on neural networks that enable accurate feature selection with their high expressive power. The feature selector outputs the subset of features taking a few unlabeled instances in a task, called a support set, as input. The decoder reconstructs the original features of testing instances in the same task, called a query set, from the selected features. The reconstruction from the selected features means that the unselected features can be approximated by the nonlinear transformation of selected features. Therefore, by selecting features that minimize the reconstruction errors, our model can automatically eliminate redundant features. Figure 2 shows our model.

However, selecting a subset of features in neural networks is difficult because the selection operation is not differentiable. To handle this problem, we use a continuous relaxation of discrete random variables, the Concrete random variables [34, 24], in the feature selector. With our model, the Concrete random variables and decoder are modeled using permutation-invariant neural networks that take the support set as input. By this modeling, our model can reflect task-specific properties from the support set in the parameters of the Concrete random variables, which enables it to select a suitable subset of features for each task given the support set.

Our model is trained by minimizing the expected test reconstruction error of a query set given a support set, which is calculated using unlabeled instances in source tasks. Since our model is explicitly trained to perform few-shot feature selection that works well on testing instances across multiple tasks, we can expect it to also perform well on the target task. Since all neural network parameters of our model are shared across tasks, the learned model can be applied to unseen target tasks without re-training. Although we focus on the unsupervised approach in this paper, our framework can be easily modified for the supervised setting with suitable loss functions such as the cross-entropy loss.

Figure 1: Our problem formulation. In the training phase, our model is learned with unlabeled instances in multiple source tasks. In the testing phase, the learned model selects the subset of relevant features in a target task given the target support set. We assume that all the source and target tasks have the same feature size $M$.
Figure 2: Overview of our model. The feature selector outputs the subset of relevant features from a support set. The decoder outputs the reconstructed original features of test instances from the selected features, considering the task property.

Our main contributions are summarized as follows: 1) To the best of our knowledge, our work is the first attempt at few-shot learning for unsupervised feature selection. 2) We propose a reconstruction-based feature selection method that enables us to perform feature selection from a few unlabeled instances. 3) We empirically show that the proposed method performs well in few-shot feature selection problems.

2 Related Work

Feature selection methods have been widely studied. Many existing methods use label information to select relevant features [9, 42, 48, 35, 53, 54, 46, 55]. These methods select the subset of features that can explain the labels well. Although they are effective, labels are often expensive or impossible to prepare in real-world applications. Unsupervised feature selection methods do not require labels [20, 61, 6, 56]. Some unsupervised methods require strong prior knowledge about a dataset (e.g., the number of clusters) [6, 56], whereas the proposed method does not. Recently, it has been reported that reconstruction-based feature selection methods perform better than traditional unsupervised approaches because the data reconstruction error is a good criterion for defining the relevance of features [30, 51, 3, 27]. In addition, they provide a natural way to perform model selection on the basis of validation reconstruction errors, which is important for practical use, whereas unsupervised methods typically have difficulty with model selection. Among them, the concrete autoencoder (CAE) [3] performs particularly well by using Concrete random variables in the autoencoder (AE) framework. However, it requires many instances for training and thus cannot perform well with a few instances [46]. To make the best use of its capability with a few instances, we extend the CAE model by incorporating support set information and use an episodic training framework [50], which enables us to perform few-shot feature selection by transferring knowledge from source tasks. Singh et al. [46] proposed a supervised extension of CAE, which uses diet networks [44] to deal with high-dimensional data. Unlike the proposed method, this method requires labels and cannot use any data in source tasks.

Some transfer learning methods for feature selection have been proposed [49, 16, 5, 21, 60]. These methods typically handle only two tasks (a source and a target task) and cannot handle more. In addition, they assume many unlabeled instances and/or a small number of labeled instances in the target task. In contrast, the proposed method handles multiple tasks and uses only a few unlabeled target instances. Although multi-task feature selection methods can handle more than two tasks [38, 62, 2], they usually assume labeled instances in each task. Besides, transfer and multi-task learning methods require training for each (target) task. The proposed method can perform feature selection for each target task without re-training, which enables fast adaptation to new tasks.

Few-shot learning, which aims to adapt to new tasks rapidly and effectively with a few instances, has recently attracted attention [47, 12, 13, 58, 45, 15, 50, 43, 57, 22]. Most existing studies have proposed methods for few-shot classification, which aims to obtain classifiers that recognize classes unseen during training, or for reinforcement learning tasks [47, 12, 15, 50]. Some methods aim to learn a subspace or subset of the latent feature space for few-shot classification tasks [11, 28, 31]. These methods require labels and cannot perform feature selection on the original features. In contrast, the proposed method requires no labels and performs feature selection directly on the original features.

3 Preliminary

We briefly review the Concrete, or Gumbel-softmax, distribution [34, 24], which is used in the proposed method. Let $\mathbf{z}$ be a categorical variable with class probabilities $(\frac{\alpha_{1}}{\sum_{m}\alpha_{m}},\dots,\frac{\alpha_{M}}{\sum_{m}\alpha_{m}})$, where $\alpha_{m}\in\mathbb{R}_{>0}$. We can assume that states with probability zero are excluded [34]. We assume that $\mathbf{z}$ is represented as an $M$-dimensional one-hot vector, where the $m$-th element of $\mathbf{z}$, $z_{m}$, is one if the $m$-th class is selected and $z_{m}=0$ otherwise. The Gumbel-Max trick provides an efficient way to draw an instance $\mathbf{z}$ from the categorical distribution as $\mathbf{z}=\mathrm{onehot}(\arg\max_{m}[\log\alpha_{m}+g_{m}])$, where $g_{m}=-\log(-\log(u_{m}))$, $u_{m}\sim\mathrm{Uniform}(0,1)$, and $\mathrm{onehot}(m)$ returns a one-hot vector whose $m$-th element is one. The instance $\mathbf{z}$ is not differentiable with respect to $\alpha_{m}$ since the $\mathrm{onehot}(\arg\max[\cdot])$ operator is not differentiable. To deal with this problem, Maddison et al. [34] proposed a differentiable relaxation of the discrete variable with softmax transformations: $z_{m}=\frac{\exp((\log\alpha_{m}+g_{m})/\tau)}{\sum_{j}\exp((\log\alpha_{j}+g_{j})/\tau)}$, $m=1,\dots,M$, where $\tau>0$ is a temperature parameter. This approximate variable is called a Concrete random variable. Concrete random variables are clearly differentiable with respect to $\alpha_{m}$. When $\tau$ is large, a Concrete random variable approaches a uniform vector. As $\tau$ approaches zero, it becomes a one-hot vector and follows the categorical distribution with class probabilities $(\frac{\alpha_{1}}{\sum_{m}\alpha_{m}},\dots,\frac{\alpha_{M}}{\sum_{m}\alpha_{m}})$ [34, 24]. The probability distribution of Concrete random variables is called the Concrete distribution or Gumbel-softmax distribution.
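For concreteness, the sampling step above can be written in a few lines. The following PyTorch sketch illustrates the standard Gumbel-softmax trick (an illustration, not code from this paper); it draws a Concrete random variable from unnormalized log class probabilities and a temperature:

import torch

def sample_concrete(log_alpha, tau):
    # Draw a Concrete (Gumbel-softmax) sample; one sample per row of log_alpha.
    # Small tau gives a nearly one-hot vector; large tau gives a nearly uniform one.
    u = torch.rand_like(log_alpha).clamp(1e-9, 1.0)  # u_m ~ Uniform(0, 1)
    g = -torch.log(-torch.log(u))                    # Gumbel(0, 1) noise g_m
    return torch.softmax((log_alpha + g) / tau, dim=-1)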

4 Proposed Method

4.1 Problem Formulation

Let $X_{d}$ be a set of unlabeled instances in the $d$-th task, and let $\mathbf{x}_{n}\in X_{d}$ be the $M$-dimensional feature vector of the $n$-th instance in the $d$-th task. We assume that every task has the same feature vector size $M$, but the distributions can differ across tasks, which is the standard assumption in transfer learning studies [39]. Suppose that unlabeled instances in $D$ source tasks, $X=\{X_{d}\}_{d=1}^{D}$, are given in the training phase. In the testing phase, we are given a few unlabeled instances in a target task, $X_{d^{\prime}}$. The target task is not contained in the source tasks, i.e., $d^{\prime}\notin\{1,\dots,D\}$. Our goal is to select at most $K$ features ($K\leq M$) that are appropriate for the target task from $X_{d^{\prime}}$.

4.2 Model

We explain our model, which selects the subset of relevant features in a task given a few unlabeled instances from that task. Our model consists of two main components: a feature selector and a decoder, as shown in Figure 2. The feature selector outputs the subset of relevant features from a few unlabeled instances in a task, called a support set. The decoder is used to reconstruct the original inputs of testing instances, called a query set, from the selected features in the same task. This reconstruction means that all features can be approximated by a nonlinear transformation of the selected features. Thus, by minimizing the reconstruction error, our model can select an informative feature subset.

We first explain the feature selector. Given a support set $\mathcal{S}$ from a task, the $k$-th selected feature $u^{(k)}\in\mathbb{R}$ is obtained by the feature selector as follows:

$$u^{(k)}=\mathbf{x}\cdot\mathbf{z}^{(k)}(\mathcal{S}),\quad k=1,\dots,K, \qquad (1)$$

where $\cdot$ is the inner product, $\mathbf{x}$ is a query instance in the same task, and $\mathbf{z}^{(k)}(\mathcal{S})=(z^{(k)}_{1}(\mathcal{S}),\dots,z^{(k)}_{M}(\mathcal{S}))$ is a Concrete random variable with parameters $\boldsymbol{\alpha}^{(k)}(\mathcal{S})=(\alpha^{(k)}_{1}(\mathcal{S}),\dots,\alpha^{(k)}_{M}(\mathcal{S}))\in\mathbb{R}^{M}_{>0}$:

$$z^{(k)}_{m}(\mathcal{S})=\frac{\exp((\log\alpha^{(k)}_{m}(\mathcal{S})+g_{m}^{(k)})/\tau)}{\sum_{j}\exp((\log\alpha^{(k)}_{j}(\mathcal{S})+g_{j}^{(k)})/\tau)},\quad k=1,\dots,K,\ m=1,\dots,M. \qquad (2)$$

Eq. (1) means that $u^{(k)}$ is a linear combination of the input features. As the temperature $\tau$ approaches zero, $\mathbf{z}^{(k)}(\mathcal{S})$ becomes a one-hot vector, and therefore $u^{(k)}$ outputs exactly one of the input features. As a result, the feature selector can select at most $K$ features from the input features when $\tau$ is small. The stochasticity of Concrete random variables enables informative feature combinations to be explored efficiently, which is analyzed in the supplemental material.

With our model, the parameters of the Concrete random variables, $\boldsymbol{\alpha}^{(k)}(\mathcal{S})$, depend on the support set $\mathcal{S}$. Therefore, our model encodes the characteristics of the support set (task) into these parameters so that it can perform task-specific feature selection suitable for each task. We model this function by permutation-invariant feed-forward neural networks [59]:

$$\log\boldsymbol{\alpha}^{(k)}(\mathcal{S})=g_{\phi_{2}}\left(\left[\sum_{\mathbf{x}\in\mathcal{S}}f_{\phi_{1}}(\mathbf{x}),\boldsymbol{\pi}^{(k)}\right]\right),\quad k=1,\dots,K, \qquad (3)$$

where $f_{\phi_{1}}$ and $g_{\phi_{2}}$ are any feed-forward neural networks with parameters $\phi_{1}$ and $\phi_{2}$, respectively, $[\cdot,\cdot]$ is concatenation, $\boldsymbol{\pi}^{(k)}\in\mathbb{R}^{T}$ is a learnable parameter vector for the $k$-th selected feature, and $\log\boldsymbol{\alpha}^{(k)}(\mathcal{S})$ means that $\log$ is applied to each component of $\boldsymbol{\alpha}^{(k)}(\mathcal{S})$. Due to the summation, this neural network is invariant to permutations of the instances in the support set and is thus well-defined as a function on sets [59]. The parameters $\boldsymbol{\pi}^{(k)}$ are introduced so that the feature selector can select different features. That is, each $\boldsymbol{\pi}^{(k)}$ would take different values to minimize the reconstruction error, which varies the class probabilities of each Concrete distribution and thus yields different selected features. Note that in the absence of $\boldsymbol{\pi}^{(k)}$, this neural network would output the same values for all $k$ from $\mathcal{S}$, meaning that the same feature would be selected for all $k$, which is not desirable.
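The following PyTorch sketch shows one way to implement Eqs. (1)–(3); it is a minimal reading of the formulation, not the authors' implementation, and the layer sizes are placeholders. It reuses sample_concrete from the sketch in Section 3:

import torch
import torch.nn as nn

class FeatureSelector(nn.Module):
    def __init__(self, M, K, T=300, hidden=64):
        super().__init__()
        self.f_phi1 = nn.Sequential(nn.Linear(M, hidden), nn.ReLU())  # f_{phi_1}
        self.g_phi2 = nn.Linear(hidden + T, M)                        # g_{phi_2}
        self.pi = nn.Parameter(torch.randn(K, T))                     # pi^{(k)}, k = 1..K

    def log_alpha(self, S):
        # S: (N_S, M) support set; summing over instances gives permutation invariance.
        s = self.f_phi1(S).sum(dim=0)                        # sum_x f_{phi_1}(x)
        s = s.expand(self.pi.size(0), -1)                    # repeated for each k
        return self.g_phi2(torch.cat([s, self.pi], dim=-1))  # (K, M): log alpha^{(k)}(S)

    def forward(self, x, S, tau):
        # x: (N_Q, M) query instances -> (N_Q, K) relaxed selected features.
        z = sample_concrete(self.log_alpha(S), tau)  # one Concrete sample per row k
        return x @ z.t()                             # u^{(k)} = x . z^{(k)}(S), Eq. (1)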

The decoder outputs the reconstructed original features $\hat{\mathbf{x}}\in\mathbb{R}^{M}$ from the selected features $\mathbf{u}=(u^{(1)},\dots,u^{(K)})\in\mathbb{R}^{K}$. Specifically, the decoder is defined as follows:

$$\hat{\mathbf{x}}=h_{\theta}\left(\left[\mathbf{u}(\mathbf{x};\mathcal{S}),\mathbf{r}(\mathcal{S})\right]\right),\quad \mathbf{r}(\mathcal{S})=g_{\psi_{2}}\left(\sum_{\mathbf{x}\in\mathcal{S}}f_{\psi_{1}}(\mathbf{x})\right), \qquad (4)$$

where $h_{\theta}$, $f_{\psi_{1}}$, and $g_{\psi_{2}}$ are any feed-forward neural networks with parameters $\theta$, $\psi_{1}$, and $\psi_{2}$, respectively. In Eq. (4), we write $\mathbf{u}(\mathbf{x};\mathcal{S})$ to make explicit that $\mathbf{u}$ depends on both $\mathbf{x}$ and $\mathcal{S}$. The representation $\mathbf{r}(\mathcal{S})$ is also invariant to permutations of the instances in $\mathcal{S}$. Since different feature subsets can be selected in each task, a different decoder is required for each task to reconstruct its instances. By using $\mathbf{r}(\mathcal{S})$, we can model such task-specific decoders.
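A matching sketch of the decoder in Eq. (4), continuing the sketch above under the same caveats (the layer sizes mirror the setup described in the supplemental material, and the sigmoid output assumes features normalized to $[0,1]$):

class Decoder(nn.Module):
    def __init__(self, M, K, hidden=64, r_dim=1):
        super().__init__()
        self.f_psi1 = nn.Sequential(nn.Linear(M, hidden), nn.ReLU())  # f_{psi_1}
        self.g_psi2 = nn.Linear(hidden, r_dim)                        # g_{psi_2}
        self.h_theta = nn.Sequential(                                 # h_theta
            nn.Linear(K + r_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, M), nn.Sigmoid())

    def forward(self, u, S):
        # u: (N_Q, K) selected features; S: (N_S, M) support set.
        r = self.g_psi2(self.f_psi1(S).sum(dim=0))      # task representation r(S)
        r = r.expand(u.size(0), -1)                     # broadcast r(S) to the batch
        return self.h_theta(torch.cat([u, r], dim=-1))  # reconstruction x_hat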

Algorithm 1 Training procedure of our model.
Input: datasets in source tasks $X$, support set size $N_{\rm S}$, query set size $N_{\rm Q}$, maximum number of iterations $I$, initial temperature $T_{0}$, final temperature $T_{1}$
Output: parameters of our model $\Theta$
1: Initialize the iteration number: $i=0$
2: repeat
3:   Set the temperature: $\tau=T_{0}(T_{1}/T_{0})^{i/I}$
4:   Sample a task $d$ from $\{1,\dots,D\}$
5:   Sample a support set $\mathcal{S}$ of size $N_{\rm S}$ from $X_{d}$
6:   Sample a query set $\mathcal{Q}$ of size $N_{\rm Q}$ from $X_{d}\setminus\mathcal{S}$
7:   Update the parameters with the gradients of the reconstruction error on the query set $\mathcal{Q}$
8:   $i=i+1$
9: until the end condition is satisfied

4.3 Training

We estimate the parameters of our model, $\Theta=(\theta,\phi_{1},\phi_{2},\psi_{1},\psi_{2},\boldsymbol{\pi}^{(1)},\dots,\boldsymbol{\pi}^{(K)})$, by minimizing the expected reconstruction error on a query set $\mathcal{Q}$ given a support set $\mathcal{S}$ using an episodic training framework [47, 50], where support and query sets are randomly generated from the source datasets $X$:

$$\min_{\Theta}\ \mathbb{E}_{d\sim\{1,\dots,D\}}\left[\mathbb{E}_{(\mathcal{S},\mathcal{Q})\sim X_{d}}\left[\frac{1}{N_{\rm Q}}\sum_{\mathbf{x}\in\mathcal{Q}}\left\|\mathbf{x}-h_{\theta}\left(\left[\mathbf{u}(\mathbf{x};\mathcal{S}),\mathbf{r}(\mathcal{S})\right]\right)\right\|^{2}\right]\right], \qquad (5)$$

where $\mathbb{E}$ is the expectation, $\|\cdot\|^{2}$ is the squared Euclidean norm, $\mathbf{x}$ is an instance in the query set, and $N_{\rm Q}$ is the number of instances in the query set. The pseudocode for our training procedure is shown in Algorithm 1. This training enables our model to learn how to select relevant features from a few instances, since it simulates few-shot feature selection across multiple source tasks so that the selection works well on unseen instances. Intuitively, since all model parameters $\Theta$ are shared across tasks, they can be learned from all instances in all source tasks, which enables knowledge to be shared between tasks, while how to select features in each task is learned from the instances in that task.

We use an annealing schedule for the temperature parameter $\tau$. Although a small $\tau$ selects individual features in each Concrete random variable, it cannot explore various combinations of features and converges to a poor local minimum [3]. Thus, we decay the value of $\tau$ at each iteration from a large value $T_{0}$ to a small value $T_{1}$, as in CAE [3], so that our model can effectively explore combinations of features in the initial phase of training and converge to informative individual features at the end. The effect of this annealing is analyzed in detail in the supplemental material. Although we used the reconstruction error as the loss function for simplicity, any other loss function can be used. For example, our framework can be straightforwardly modified for supervised settings by using the cross-entropy loss.
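Putting Eq. (5) and Algorithm 1 together, one training episode looks roughly as follows. This is a hedged sketch built on the FeatureSelector and Decoder sketches above, assuming datasets is a list of per-task tensors:

def train(datasets, selector, decoder, N_S=2, N_Q=62, I=50000, T0=10.0, T1=0.01):
    params = list(selector.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    for i in range(I):
        tau = T0 * (T1 / T0) ** (i / I)                            # annealed temperature
        X_d = datasets[torch.randint(len(datasets), (1,)).item()]  # sample a source task
        perm = torch.randperm(X_d.size(0))
        S = X_d[perm[:N_S]]                                        # support set
        Q = X_d[perm[N_S:N_S + N_Q]]                               # query set
        u = selector(Q, S, tau)                                    # (N_Q, K) selected features
        loss = ((Q - decoder(u, S)) ** 2).sum(dim=-1).mean()       # inner term of Eq. (5)
        opt.zero_grad(); loss.backward(); opt.step()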

4.4 Feature Selection

Given the target support set $X_{d^{\prime}}$ in the testing phase, we can estimate the parameters of the Concrete random variables as $\log\boldsymbol{\alpha}^{(k)}(X_{d^{\prime}})$, $k=1,\dots,K$. To select the subset of features from these parameters, we use the discrete $\arg\max$ operator instead of Concrete random variables because differentiability is unnecessary at test time. Specifically, the $k$-th feature index is selected by $\arg\max_{m}\log\alpha_{m}^{(k)}(X_{d^{\prime}})$.
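In code, this amounts to an argmax over the estimated parameters; continuing the sketches above:

def select_features(selector, X_target):
    # Replace sampling with a hard argmax at test time (no differentiability needed).
    with torch.no_grad():
        log_alpha = selector.log_alpha(X_target)  # (K, M) from the target support set
    return log_alpha.argmax(dim=-1).unique()      # deduplicated selected feature indices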

5 Experiments

In this section, we demonstrate the effectiveness of the proposed method for few-shot feature selection. We evaluated the selected features in terms of both reconstruction and clustering quality. For clustering, each feature selection method is first used to select features, and K-means clustering is then performed on the selected features, which is the standard procedure in unsupervised feature selection studies [30, 56, 61, 4]. Since K-means is a simple method, we can evaluate the selected features with little risk of overfitting. We used the scikit-learn implementation of K-means, which alleviates the effect of bad initial centroid seeds by repeating multiple runs. All experiments were conducted on a Linux server with an Intel Xeon CPU and an NVIDIA GeForce GTX 1080 GPU.
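As a sketch of this evaluation protocol (variable names are hypothetical; KMeans, adjusted_rand_score, and normalized_mutual_info_score are standard scikit-learn APIs), clustering quality on the selected features can be computed as:

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_clustering(X_test, y_test, selected_idx, n_classes):
    # Cluster using only the selected features, then score against the true labels.
    preds = KMeans(n_clusters=n_classes, n_init=10).fit_predict(X_test[:, selected_idx])
    return (adjusted_rand_score(y_test, preds),
            normalized_mutual_info_score(y_test, preds))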

5.1 Data

We used four real-world datasets: MNIST-r, Isolet, Amazon, and VLCS, which have been widely used in previous studies [17, 18, 37, 3]. MNIST-r is derived from MNIST by rotating the images [17]. This dataset has six tasks (six rotation angles), and its feature dimension is 256. Isolet consists of letters spoken by 150 speakers, who are grouped into five groups (tasks) by speaking similarity. Each instance is represented as a 617-dimensional vector. Amazon consists of product reviews in four tasks (four product categories), where the feature dimension is 400. VLCS consists of real images in four tasks (four data sources), where each image is represented by 4096-dimensional features [17, 37]. The numbers of classes in MNIST-r, Isolet, Amazon, and VLCS are 10, 26, 2, and 5, respectively. Details of the datasets are provided in the supplemental material.

5.2 Comparison Methods

We compared the proposed method with seven commonly used unsupervised feature selection methods: CAE-S, CAE-T, CAE-ST, LS-T, LS-ST, SPEC-T, and SPEC-ST. CAE [3] is a recently proposed reconstruction-based feature selection method that has been reported to outperform existing methods on various datasets [3]. CAE-T and CAE-S were trained with target instances and source instances, respectively. CAE-ST was trained with target instances by fine-tuning after pre-training with source instances. Unlike the proposed method, CAE uses task-invariant Concrete random variables and a task-invariant decoder. LS is the Laplacian Score [20], a widely used unsupervised feature selection method that selects features on the basis of Laplacian Eigenmaps [4]. LS-T and LS-ST were trained with target instances and with both target and source instances, respectively. SPEC [61] is a general framework for spectral unsupervised feature selection. SPEC-T and SPEC-ST were trained with target instances and with both target and source instances, respectively. The proposed method (Ours) and CAE are neural-network-based feature selection methods; we implemented both using PyTorch [41]. LS and SPEC are not neural-network-based methods, which makes them suitable for feature selection with few instances due to their simple models. We used the scikit-feature [29] implementation for both methods.

5.3 Settings

We used the mean squared reconstruction error (MSRE) to evaluate reconstruction ability. For clustering, we used two metrics to evaluate each method: Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). Both metrics are widely used to evaluate clustering results, and in both, values close to one indicate high-quality clustering. For K-means, we set the number of clusters to the number of classes in each dataset, which is the standard procedure in previous studies [20, 19, 30]. The proposed method and CAE used the same neural network architecture for the base AEs for a fair comparison. For each dataset, we randomly selected one task as the target task and used the rest as source (training/validation) tasks. We randomly created 20 different target, training, and validation sets and evaluated the average test MSREs, ARIs, and NMIs over the 20 sets for each pair of the number of target support instances and the number of selected features. We varied the number of target support instances $N_{\rm S}$ within $\{2,4,6\}$ and the number of selected features $K$ within $\{10,20,30,40,50\}$. Setup details such as network architectures and hyperparameters are described in the supplemental material.

5.4 Results

Reconstruction Ability

We evaluated the test MSRE for the reconstruction-based methods: the proposed method and CAE. Table 1 shows the averages and standard deviations of test MSREs over different numbers of target support instances $N_{\rm S}$ and selected features $K$, where $N_{\rm S}$ and $K$ were evaluated within $\{2,4,6\}$ and $\{10,20,30,40,50\}$, respectively. Table 1 does not include LS and SPEC because these methods have no reconstruction component. In Tables 1–3, boldface denotes the best and comparable methods according to a paired t-test at the 5% significance level. The proposed method showed the best test MSREs for all datasets. This is because our model is trained to select, from a few instances, features from which the original features of unseen instances can be reconstructed. CAE-T performed poorly on all datasets because there were too few target support instances to train the neural network. Interestingly, CAE-ST performed worse than CAE-S even though it used target instances for fine-tuning. This is because there were too few target support instances for fine-tuning, and thus fine-tuning destroyed the pre-trained model parameters. Figure 3 shows test MSREs with $N_{\rm S}=2$ when varying the number of selected features $K$ for each dataset. The proposed method consistently achieved the best test MSREs in each case. Note that although the proposed method and CAE-S appear to yield similar results for MNIST-r and Isolet in Figure 3, owing to the scale set by the poor results of CAE-T, the proposed method was statistically better than CAE-S, as described in Table 1. The proposed method also performed well when $N_{\rm S}=4$ and $6$, as described in the supplemental material.

Table 1: Averages and standard deviations of test MSREs over different numbers of target support instances and selected features.
Data Ours CAE-S CAE-T CAE-ST
MNIST-r 4.258(2.252) 4.459(2.242) 15.477(4.453) 8.029(3.826)
Isolet 39.441(6.179) 42.941(5.272) 161.411(17.395) 95.196(15.243)
Amazon 0.056(0.004) 0.065(0.004) 0.079(0.011) 0.143(0.137)
VLCS 0.851(0.057) 0.914(0.023) 0.963(0.057) 1.137(0.175)
Figure 3: Average and standard errors of test MSREs with 2 target support instances when changing $K$. Panels: (a) MNIST-r, (b) Isolet, (c) Amazon, (d) VLCS.
Figure 4: Test reconstructed images when 4 target support instances and 20 selected features (pixels) were used on MNIST-r. Panels: (a) Original, (b) Ours, (c) CAE-S, (d) CAE-T, (e) CAE-ST. The lower-right picture for each method shows the features (red pixels) selected by that method.
Table 2: Averages and standard deviations of test ARIs [%] over different numbers of target support instances and selected features.
Data Ours CAE-S CAE-T CAE-ST LS-T LS-ST SPEC-T SPEC-ST
MNIST-r 30.2(6.3) 24.7(5.1) 16.7(5.6) 24.7(5.1) 21.8(5.3) 20.0(5.9) 3.4(3.7) 3.7(5.1)
Isolet 33.9(5.9) 33.8(5.6) 20.1(7.4) 33.7(5.4) 21.0(7.4) 29.3(7.1) 14.7(5.7) 19.3(6.9)
Amazon 0.5(0.9) 0.2(0.4) 0.2(0.5) 0.2(0.4) 0.2(0.4) 0.2(0.1) 0.1(0.1) 0.2(0.2)
VLCS 11.2(9.4) 3.5(4.8) 7.8(10.0) 3.7(4.5) 11.0(14.8) 6.7(7.1) 10.3(14.7) 2.0(4.6)
Table 3: Averages and standard deviations of test NMIs [%] over different numbers of target support instances and selected features.
Data Ours CAE-S CAE-T CAE-ST LS-T LS-ST SPEC-T SPEC-ST
MNIST-r 42.6(6.1) 38.0(5.3) 28.6(6.2) 38.0(5.3) 34.5(5.9) 33.0(6.2) 15.5(4.6) 12.3(8.6)
Isolet 61.4(4.9) 61.5(4.9) 46.2(8.8) 61.4(4.8) 47.9(8.7) 58.7(6.7) 40.3(6.6) 43.3(7.7)
Amazon 1.4(2.6) 0.6(0.9) 0.5(1.0) 0.6(0.9) 0.3(0.9) 0.1(0.1) 0.6(0.8) 1.3(1.1)
VLCS 23.3(17.2) 15.4(14.3) 21.0(19.0) 15.5(14.2) 21.9(19.9) 23.2(19.7) 15.0(14.2) 10.0(9.2)
Figure 5: Average and standard errors of test ARIs with 2 target support instances when changing $K$. Panels: (a) MNIST-r, (b) Isolet, (c) Amazon, (d) VLCS.
Figure 6: Average and standard errors of test NMIs with 2 target support instances when changing $K$. Panels: (a) MNIST-r, (b) Isolet, (c) Amazon, (d) VLCS.

We qualitatively evaluated the proposed method by visualizing reconstructed images on MNIST-r, comparing it with the reconstruction-based methods (CAE). Figure 4 shows three instances of test reconstructed images on a target task and the selected features. The proposed method reconstructed images from the selected features more accurately than the others. The features selected by the proposed method were evenly distributed near the center; since all digits in this dataset are placed near the center, the proposed method captured the important information of the digits well. CAE-T was not able to reconstruct the images because it was trained with only four target images, and the features it selected appear to be overfitted to those few target images. CAE-ST destroyed the reconstructed images obtained by CAE-S since fine-tuning with few target instances is difficult for CAE. From these results, we found that the proposed method can select features from a few target instances that accurately reconstruct test images by using useful knowledge from source tasks.

Clustering Ability

We investigated the test ARIs and NMIs to evaluate the quality of the selected features on clustering problems, which are commonly used to evaluate unsupervised feature selection abilities [30, 56, 61, 4]. Tables 2 and 3 show the averages and standard deviations of test ARIs and test NMIs over different numbers of target support instances and selected features, respectively. The proposed method showed the best ARIs and NMIs on all datasets. In particular, it outperformed the methods trained with the target support set only (CAE-T, LS-T, and SPEC-T) by a large margin in almost all cases, which indicates both the difficulty of feature selection with a few instances and the effectiveness of using instances in source tasks. Although non-neural-network-based methods (LS and SPEC) are generally suitable for small-instance problems, they performed worse than the proposed method. Although CAE-ST performed worse than CAE-S in terms of reconstruction errors, its test ARIs changed little. This result suggests that the parameters of the Concrete random variables are difficult to change drastically with a few instances by fine-tuning. Note that similar results were obtained even when only the parameters of the feature selector were learned during fine-tuning. In contrast, the proposed method can use the information of the target support set effectively since our model is designed for few-shot feature selection; therefore, it performed well. Figures 5 and 6 show test ARIs and NMIs with $N_{\rm S}=2$ when varying the value of $K$ for each dataset. The proposed method performed well in all cases. It also performed well when $N_{\rm S}=4$ and $6$, as described in the supplemental material.

We conducted an ablation study to investigate the effects of our feature selector and decoder. We compared our model with three variants. w/o $\mathbf{r}(\mathcal{S})$ is our model without $\mathbf{r}(\mathcal{S})$ in Eq. (4), which uses a task-invariant decoder for all tasks. w/o $\boldsymbol{\alpha}(\mathcal{S})$ is our model without $\boldsymbol{\alpha}(\mathcal{S})$ in Eq. (3), which uses task-invariant (support set-independent) Concrete random variables $\boldsymbol{\alpha}$. CAE-S is equivalent to our model without both $\mathbf{r}(\mathcal{S})$ and $\boldsymbol{\alpha}(\mathcal{S})$. The upper half of Table 4 shows the average ARIs and NMIs over all datasets, target support instances, and selected features. w/o $\boldsymbol{\alpha}(\mathcal{S})$ and CAE-S did not perform well because neither has a mechanism for task-specific feature selection. In contrast, Ours and w/o $\mathbf{r}(\mathcal{S})$ performed well and showed similar results, which indicates that $\boldsymbol{\alpha}(\mathcal{S})$ is particularly important for our model to select relevant features. Note that w/o $\mathbf{r}(\mathcal{S})$ is also part of our proposal. For further investigation, we additionally evaluated performance with few selected features ($K=2$). In this case, Ours outperformed w/o $\mathbf{r}(\mathcal{S})$. This is because instances are difficult to reconstruct from few features with a single task-invariant decoder; by using task-specific decoders, Ours performed well. More detailed analyses of our model, such as the effects of temperature annealing and of the randomness of Concrete random variables, are described in the supplemental material. These results show that both temperature annealing and the randomness of Concrete random variables are useful for the proposed method and CAE.

Table 4: Ablation study. ARI (NMI) is the average of test ARIs (NMIs) [%] over all datasets, target support instances within $\{2,4,6\}$, and selected features within $\{10,20,30,40,50\}$. ($K=2$) denotes the results with two selected features.
Ours w/o $\mathbf{r}(\mathcal{S})$ w/o $\boldsymbol{\alpha}(\mathcal{S})$ CAE-S
ARI 19.0 19.2 15.7 15.6
NMI 32.2 32.3 28.5 28.9
ARI ($K=2$) 6.5 4.4 5.1 4.8
NMI ($K=2$) 16.6 15.8 14.4 14.4
Table 5: The average number of features selected by the proposed method after deduplication.
Data \ $K$ 10 20 30 40 50
MNIST-r 10.0 20.0 29.7 39.1 47.0
Isolet 10.0 20.0 29.8 40.0 49.9
Amazon 6.8 12.9 17.9 20.1 30.1
VLCS 10.0 19.9 30.0 40.0 50.0

Deduplicated Features Selected by the Proposed Method

We investigated the average number of distinct features selected by the proposed method. Although the proposed method uses $K$ Concrete random variables in the feature selector, it may select fewer than $K$ distinct features. Table 5 shows the average number of features selected by the proposed method after deduplication when $N_{\rm S}=2$. For MNIST-r, Isolet, and VLCS, the proposed method selected almost $K$ features. For Amazon, it selected fewer than $K$ features, presumably because Amazon has few important features. Indeed, the proposed method shows good test MSREs, ARIs, and NMIs even with $K=10$ in Figures 3, 5, and 6.

6 Conclusions

In this paper, we proposed a few-shot learning method for unsupervised feature selection. Our model can perform target task-specific feature selection given a few target instances by using useful knowledge from source tasks. Experimental results showed that the proposed method outperformed other feature selection methods. For future work, we plan to extend our framework to supervised feature selection problems. Finally, we describe a potential negative societal impact of this work. Since the proposed method can use various datasets for training, there is a risk that users might carelessly include biased datasets in the source tasks, which might result in biased feature selection. Therefore, we encourage research on automatically detecting biased datasets.

References

  • [1] M. F. Akay. Support vector machines combined with feature selection for breast cancer diagnosis. Expert systems with applications, 36(2):3240–3247, 2009.
  • [2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine learning, 73(3):243–272, 2008.
  • [3] M. F. Balın, A. Abid, and J. Zou. Concrete autoencoders: differentiable feature selection and reconstruction. In ICML, 2019.
  • [4] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NeurIPS, 2002.
  • [5] W. Bi, Y. Shi, and Z. Lan. Transferred feature selection. In ICDM Workshops, 2009.
  • [6] D. Cai, C. Zhang, and X. He. Unsupervised feature selection for multi-cluster data. In KDD, 2010.
  • [7] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. Journal of Machine Learning Research, 12(10), 2011.
  • [8] G. Chandrashekar and F. Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.
  • [9] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. Journal of bioinformatics and computational biology, 3(02):185–205, 2005.
  • [10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: a deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  • [11] N. Dvornik, C. Schmid, and J. Mairal. Selecting relevant features from a multi-domain representation for few-shot classification. In ECCV, 2020.
  • [12] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [13] C. Finn, K. Xu, and S. Levine. Probabilistic model-agnostic meta-learning. In NeurIPS, pages 9516–9527, 2018.
  • [14] G. Forman. Bns feature scaling: an improved representation over tf-idf for svm text classification. In CIKM, 2008.
  • [15] M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, and S. A. Eslami. Conditional neural processes. In ICML, 2018.
  • [16] L. Gautheron, I. Redko, and C. Lartizien. Feature selection for unsupervised domain adaptation using optimal transport. In ECML PKDD, 2018.
  • [17] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In ICCV, 2015.
  • [18] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, 2013.
  • [19] K. Han, Y. Wang, C. Zhang, C. Li, and C. Xu. Autoencoder inspired unsupervised feature selection. In ICASSP, 2018.
  • [20] X. He, D. Cai, and P. Niyogi. Laplacian score for feature selection. In NeurIPS, 2006.
  • [21] T. Helleputte and P. Dupont. Feature selection by transfer learning with linear regularized models. In ECML PKDD, 2009.
  • [22] T. Iwata and A. Kumagai. Meta-learning from tasks with heterogeneous attribute spaces. NeurIPS, 2020.
  • [23] A. Jain and D. Zongker. Feature selection: evaluation, application, and small sample performance. IEEE transactions on pattern analysis and machine intelligence, 19(2):153–158, 1997.
  • [24] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • [25] T. Jebara. Multi-task feature and kernel selection for svms. In Proceedings of the twenty-first international conference on Machine learning, page 55, 2004.
  • [26] D. P. Kingma and J. Ba. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [27] I. Lemhadri, F. Ruan, and R. Tibshirani. Lassonet: neural networks with feature sparsity. In AISTATS, 2021.
  • [28] H. Li, D. Eigen, S. Dodge, M. Zeiler, and X. Wang. Finding task-relevant features for few-shot learning by category traversal. In CVPR, 2019.
  • [29] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu. Feature selection: a data perspective. CSUR, 2018.
  • [30] J. Li, J. Tang, and H. Liu. Reconstruction-based unsupervised feature selection: an embedded approach. In IJCAI, 2017.
  • [31] M. Lichtenstein, P. Sattigeri, R. Feris, R. Giryes, and L. Karlinsky. Tafssl: task-adaptive feature sub-space learning for few-shot classification. arXiv preprint arXiv:2003.06670, 2020.
  • [32] B. Liu, Y. Wei, Y. Zhang, and Q. Yang. Deep neural networks for high dimension, low sample size data. In IJCAI, 2017.
  • [33] Y. Liu. A comparative study on feature selection methods for drug discovery. Journal of chemical information and computer sciences, 44(5):1823–1828, 2004.
  • [34] C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
  • [35] M. Masaeli, G. Fung, and J. G. Dy. From transformation-based dimensionality reduction to feature selection. In ICML, 2010.
  • [36] B. Mobasher. Data mining for web personalization. In The adaptive web, pages 90–135. Springer, 2007.
  • [37] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. Unified deep supervised domain adaptation and generalization. In ICCV, 2017.
  • [38] G. Obozinski, B. Taskar, and M. Jordan. Multi-task feature selection. Statistics Department, UC Berkeley, Tech. Rep, 2(2.2), 2006.
  • [39] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and Data Engineering, 22(10):1345–1359, 2009.
  • [40] S. Parameswaran and K. Q. Weinberger. Large margin multi-task metric learning. In NeurIPS, 2010.
  • [41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • [42] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence, 27(8):1226–1238, 2005.
  • [43] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-learning with implicit gradients. In NeurIPS, 2019.
  • [44] A. Romero, P. L. Carrier, A. Erraqabi, T. Sylvain, A. Auvolat, E. Dejoie, M.-A. Legault, M.-P. Dubé, J. G. Hussin, and Y. Bengio. Diet networks: thin parameters for fat genomics. arXiv preprint arXiv:1611.09340, 2016.
  • [45] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
  • [46] D. Singh, H. Climente-González, M. Petrovich, E. Kawakami, and M. Yamada. Fsnet: feature selection network on high-dimensional biological data. arXiv preprint arXiv:2001.08322, 2020.
  • [47] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NeurIPS, 2017.
  • [48] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [49] S. Uguroglu and J. Carbonell. Feature selection for transfer learning. In ECML PKDD, 2011.
  • [50] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NeurIPS, 2016.
  • [51] S. Wang, Z. Ding, and Y. Fu. Feature selection guided auto-encoder. In AAAI, 2017.
  • [52] E. P. Xing, M. I. Jordan, R. M. Karp, et al. Feature selection for high-dimensional genomic microarray data. In ICML, 2001.
  • [53] M. Yamada, W. Jitkrittum, L. Sigal, E. P. Xing, and M. Sugiyama. High-dimensional feature selection by feature-wise kernelized lasso. Neural computation, 26(1):185–207, 2014.
  • [54] M. Yamada, Y. Umezu, K. Fukumizu, and I. Takeuchi. Post selection inference with kernels. In AISTATS, 2018.
  • [55] Y. Yamada, O. Lindenbaum, S. Negahban, and Y. Kluger. Feature selection using stochastic gates. In ICML, 2020.
  • [56] Y. Yang, H. T. Shen, Z. Ma, Z. Huang, and X. Zhou. L2,1-norm regularized discriminative feature selection for unsupervised learning. In IJCAI, 2011.
  • [57] M. Yin, G. Tucker, M. Zhou, S. Levine, and C. Finn. Meta-learning without memorization. In ICLR, 2020.
  • [58] J. Yoon, T. Kim, O. Dia, S. Kim, Y. Bengio, and S. Ahn. Bayesian model-agnostic meta-learning. In NeurIPS, 2018.
  • [59] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In NeurIPS, 2017.
  • [60] B. Zhao, X. Sun, Y. Fu, Y. Yao, and Y. Wang. Msplit lbi: realizing feature selection and dense estimation simultaneously in few-shot and zero-shot learning. arXiv preprint arXiv:1806.04360, 2018.
  • [61] Z. Zhao and H. Liu. Spectral feature selection for supervised and unsupervised learning. In ICML, 2007.
  • [62] Y. Zhou, R. Jin, and S. C.-H. Hoi. Exclusive lasso for multi-task feature selection. In AISTATS, 2010.

Appendix A Data Details

We used four real-world datasets: MNIST-r (https://github.com/ghif/mtae), Isolet (http://archive.ics.uci.edu/ml/datasets/ISOLET), Amazon (http://multilevel.ioe.ac.uk/intro/datasets.html), and VLCS (http://www.cs.dartmouth.edu/chenfang/projpage/FXR iccv13/index.php). MNIST-r is commonly used in domain generalization studies [17]. This dataset, derived from the hand-written digit dataset MNIST, was introduced by Ghifary et al. [17]. Each task is created by rotating the images in multiples of 15 degrees: 0, 15, 30, 45, 60, and 75. Therefore, this dataset has six tasks. Each task has 1,000 images of 10 classes (digits), represented by 256-dimensional vectors.

Isolet is a widely used real-world dataset in multi-task learning studies [40, 25]. This dataset was collected from 150 speakers who say each letter of the Roman alphabet twice; thus, there are 52 instances from each individual. The individuals are grouped into five groups on the basis of speaking similarity, so this dataset has five tasks. Instances are represented as 617-dimensional vectors of 26 classes (letters).

Amazon is a widely used real-world dataset for cross-domain sentiment analysis [18]. This dataset consists of product reviews in four tasks: kitchen appliances, DVDs, books, and electronics. We used the processed data from Gong et al. [18], in which the dimensionality of the bag-of-words features was 400. Each task has 1000 positive and 1000 negative reviews (two classes).

VLCS is also commonly used in domain adaptation studies [37]. This dataset consists of real images in four tasks: VOC2007, LabelMe, Caltech-101, and SUN09. The tasks share five object classes: bird, car, chair, dog, and person. This dataset has 10,729 images in total. Instead of the raw features, we used the $\ell_{2}$-normalized DeCAF-6 features [10] as input features, following previous studies [17, 37]. These features have 4096 dimensions.

Appendix B Experimental Settings Details

We used the mean squared reconstruction error (MSRE) to evaluate reconstruction ability. For clustering, we used two metrics to evaluate each method: Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). Both metrics are widely used to evaluate clustering results, and in both, values close to one indicate high-quality clustering. For K-means, we set the number of clusters to the number of classes in each dataset.

For the proposed method and the concrete autoencoder (CAE), we used two-hidden-layer neural networks with 32 hidden units and rectified linear unit (ReLU) activation for the decoders $h_{\theta}$ with MNIST-r, Isolet, and Amazon. For VLCS, we used two-hidden-layer neural networks with 256 hidden units and ReLU activation for $h_{\theta}$ because this dataset has a higher feature dimensionality than the other datasets. For the activation of the output layers, we used the sigmoid function for MNIST-r, Amazon, and VLCS since features are normalized to $[0,1]$ in these datasets. For Isolet, the tanh function was used since this dataset takes feature values within $[-1,1]$. For the permutation-invariant neural networks for the Concrete random variables and decoders in the proposed method, we used different one-layer neural networks with 64 output units and ReLU activation for $f_{\phi_{1}}$ and $f_{\psi_{1}}$. For VLCS, we used different one-layer neural networks with 512 output units and ReLU activation for $f_{\phi_{1}}$ and $f_{\psi_{1}}$. We used different one-layer neural networks for $g_{\phi_{2}}$ and $g_{\psi_{2}}$ with all datasets. The dimension of the parameters $\boldsymbol{\pi}^{(k)}$ was set to 300. The output size of the permutation-invariant neural network for the decoder, $g_{\psi_{2}}$, was set to one. For the proposed method and CAE, we set the minibatch size to 64 (for the proposed method, we set $N_{\rm S}+N_{\rm Q}=64$). The initial temperature $T_{0}$ was set to 10 and the final temperature $T_{1}$ to 0.01. We used the Adam optimizer [26] with a learning rate of 0.001, and the maximum number of iterations was 50,000. For the proposed method, CAE-S, and CAE-T, we used early stopping on the basis of the validation reconstruction errors to avoid overfitting. For CAE-ST, the number of iterations for fine-tuning was set to a large value (1,000) because few iterations could not change the selected features obtained by CAE-S.

Table 6: Effects of the annealing schedule for temperature parameter $\tau$. Average of test ARIs [%] and NMIs [%] over all datasets, target support instances within $\{2,4,6\}$, and selected features within $\{10,20,30,40,50\}$, respectively.
Ours Ours ($\tau=0.01$) Ours ($\tau=1$) Ours ($\tau=10$) CAE-S CAE-S ($\tau=0.01$) CAE-S ($\tau=1$) CAE-S ($\tau=10$)
ARI 19.0 7.4 17.0 15.1 15.6 13.2 15.3 12.6
NMI 32.2 17.8 29.6 26.8 28.9 25.9 28.3 24.4

For LS-T and LS-ST, the candidates for the heat kernel parameter were $\{0.1,1,10,100\}$, and the best value on the testing data was selected. For LS-ST, the number of nearest neighbors was set to five following previous studies [20, 30, 56]. For LS-T, the best value was selected within $\{1,3,5\}$ because five nearest neighbors are not defined for a few target support instances. For SPEC-T and SPEC-ST, we used the second feature ranking function, which has been reported to robustly perform well [61].

For MNIST-r and Isolet, we randomly chose one task as the target task and the rest as source tasks. From the source tasks, we randomly chose one task for validation and used the rest for training. Then, we randomly chose 80% of the instances from each training/validation task. For Amazon and VLCS, we used 80% of the instances from each source task for training and the rest for validation since these datasets have only four tasks. For each target task, we randomly chose a few instances for the target support set and used the remainder for testing. Specifically, we evaluated the performance by changing the number of target support instances within $\{2,4,6\}$. For each case, we randomly created 20 different datasets and calculated the average test MSREs, ARIs, and NMIs.

Appendix C Additional Experimental Results

C.1 Effects of Annealing Schedule for Temperature Parameter τ\tau

We investigated the effect of the annealing schedule for the temperature parameter $\tau$, which is used in the proposed method and CAE in the main paper. Table 6 shows the average test ARIs and NMIs over all datasets, target support instances, and selected features. Ours and CAE-S annealed the temperature $\tau$ from 10 to 0.01 during training. The other methods did not use annealing and instead used a fixed value of $\tau$ during training. First, Ours, which uses annealing, performed the best. For the proposed method, Ours ($\tau=0.01$) and Ours ($\tau=10$) were not able to perform as well as Ours. This is because, although a small $\tau$ can select individual features in each Concrete random variable, it cannot explore various combinations of features and converges to a poor local minimum, while a large $\tau$ cannot select individual features in each Concrete random variable during training. Ours ($\tau=1$), with a suitable fixed value of $\tau$, outperformed both of these variants but was outperformed by Ours with annealing. By annealing $\tau$, the proposed method can explore various combinations of features in the initial phase of training and converge to a good local minimum in the final phase. The results for CAE were similar to those for the proposed method. Overall, these results indicate that the proposed method improves feature selection performance by using the annealing schedule for the temperature parameter.

Table 7: Effects of the randomness in the feature selector of our model. Average of test ARIs [%] and NMIs [%] over all datasets, target support instances within $\{2,4,6\}$, and selected features within $\{10,20,30,40,50\}$, respectively.
Ours Ours w/o rand. CAE-S CAE-S w/o rand.
ARI 19.0 16.9 15.6 14.5
NMI 32.2 29.1 28.9 26.6

C.2 Effects of Randomness in the Feature Selector

We investigated the effect of the randomness in the feature selector (Concrete random variables), which is used in the proposed method and CAE in the main paper. The feature selector has random variables $g_{m}$, so we investigated their role. Table 7 shows the average test ARIs and NMIs over all datasets, target support instances, and selected features. Ours and CAE-S used the random variables $g_{m}$, whereas Ours w/o rand. and CAE-S w/o rand. did not. Ours performed the best, and Ours (CAE-S) performed better than Ours w/o rand. (CAE-S w/o rand.). By using the random variables $g_{m}$, the proposed method can explore informative feature combinations during training while avoiding being trapped in a poor local minimum. These results indicate the effectiveness of the randomness in the feature selector in our framework.

Figure 7: Effect of the dimension of parameter $\boldsymbol{\pi}^{(k)}$ on test ARIs. Average and standard errors of test ARIs over different $N_{\rm S}$ and $K$ when $T$ was changed. Panels: (a) MNIST-r, (b) Isolet, (c) Amazon, (d) VLCS.
Figure 8: Effect of the dimension of parameter $\boldsymbol{\pi}^{(k)}$ on test NMIs. Average and standard errors of test NMIs over different $N_{\rm S}$ and $K$ when $T$ was changed. Panels: (a) MNIST-r, (b) Isolet, (c) Amazon, (d) VLCS.

C.3 Dependency of the Dimension of Parameters 𝝅(k){\boldsymbol{\pi}}^{(k)}

We investigated the dependency on the dimension of the parameters $\boldsymbol{\pi}^{(k)}\in\mathbb{R}^{T}$ in the proposed method. These parameters vary the class probabilities of each Concrete random variable, which is necessary for the feature selector to select different features. Figures 7 and 8 show the average and standard errors of test ARIs and NMIs over different numbers of target support instances and selected features when varying $T$ within $\{1,10,50,100,200,300,400\}$, respectively. For all evaluation measures and datasets, the proposed method consistently performed well when $T$ was larger than 100. When $T$ was small, the proposed method was not able to perform well in our experiments, presumably because a small dimension is not enough to change the class probabilities of each Concrete variable. These results show that the proposed method performs well when $T$ is set to a relatively large value.

C.4 Additional Comparison Method

We compared the proposed method with an additional variant of CAE-ST. Although CAE-ST was trained with target instances by fine-tuning after pre-training with source instances, which is a standard technique for transfer learning, CAE can also be trained with source and target instances simultaneously (CAE-ST-sim). Table 8 shows the average test ARIs and NMIs over all datasets, target support instances, and selected features. The proposed method clearly outperformed CAE-ST-sim. Since CAE-ST-sim has no explicit mechanism for few-shot learning, it did not work well.

Table 8: Comparison with CAE-ST-sim. Average of test ARIs [%] and NMIs [%] over all datasets, target support instances within $\{2,4,6\}$, and selected features within $\{10,20,30,40,50\}$, respectively.
Ours CAE-S CAE-ST CAE-ST-sim
ARI 19.0 15.6 15.6 16.9
NMI 32.2 28.9 28.9 27.7
Table 9: Computation time in seconds for training by each method. Ours-target represents the feature selection time for a target task with the proposed method, i.e., the total time for estimating the parameters of the Concrete random variables from the target support instances and applying the $\arg\max$ operator to the estimated parameters.
Ours Ours-target CAE-S CAE-T CAE-ST LS-T LS-ST SPEC-T SPEC-ST
303.725 0.003 125.065 119.699 4.673 0.003 1.493 0.011 14.627

C.5 Computation Cost

We investigated the computation time of the proposed method on MNIST-r. Table 9 shows the training time of each method and the target task-specific feature selection time of the proposed method on a computer with a 2.20 GHz CPU. The CAE-ST entry represents the training time for fine-tuning. Although our model took longer to train than the others, it can perform fast and accurate target task-specific feature selection without re-training.

C.6 Full Results of Test MSREs, ARIs, and NMIs

Figures 9–11 show the average and standard error of test MSREs, ARIs, and NMIs, respectively, when changing the number of selected features $K$ with different numbers of target support instances. The proposed method performed well in almost all cases.

C.7 Visualization of Test Reconstructed Images on MNIST-r

Figure 12 shows ten instances (digits 0 to 9) of test reconstructed images on a target task when 4 target support instances and 20 selected features were used on MNIST-r. The proposed method was able to reconstruct images from the selected features more accurately than the others.

Figure 9: Average and standard error of test MSREs when changing the number of selected features $K$. Rows: (a) MNIST-r, (b) Isolet, (c) Amazon, (d) VLCS; within each row, the columns show results with $N_{\rm S}=2$, $4$, and $6$ target support instances, respectively, from left to right.
Figure 10: Average and standard error of test ARIs when changing the number of selected features $K$. Rows: (a) MNIST-r, (b) Isolet, (c) Amazon, (d) VLCS; within each row, the columns show results with $N_{\rm S}=2$, $4$, and $6$ target support instances, respectively, from left to right.
Figure 11: Average and standard error of test NMIs when changing the number of selected features $K$. Rows: (a) MNIST-r, (b) Isolet, (c) Amazon, (d) VLCS; within each row, the columns show results with $N_{\rm S}=2$, $4$, and $6$ target support instances, respectively, from left to right.
Figure 12: Ten instances of test reconstructed images when 4 target support instances and 20 selected features (pixels) were used on MNIST-r. Panels: (a) Original, (b) Ours, (c) CAE-S, (d) CAE-T, (e) CAE-ST.