Active Feature Acquisition with Generative Surrogate Models

Yang Li Junier B. Oliva

Abstract

Many real-world situations allow for the acquisition of additional relevant information when making an assessment with limited or uncertain data. However, traditional ML approaches either require all features to be acquired beforehand or regard part of them as missing data that cannot be acquired. In this work, we consider models that perform active feature acquisition (AFA) and query the environment for unobserved features to improve the prediction assessments at evaluation time. Our work reformulates the Markov decision process (MDP) that underlies the AFA problem as a generative modeling task and optimizes a policy via a novel model-based approach. We propose learning a generative surrogate model (GSM) that captures the dependencies among input features to assess potential information gain from acquisitions. The GSM is leveraged to provide intermediate rewards and auxiliary information to aid the agent navigate a complicated high-dimensional action space and sparse rewards. Furthermore, we extend AFA in a task we coin active instance recognition (AIR) for the unsupervised case where the target variables are the unobserved features themselves and the goal is to collect information for a particular instance in a cost-efficient way. Empirical results demonstrate that our approach achieves considerably better performance than previous state of the art methods on both supervised and unsupervised tasks.

Machine Learning, ICML

1 Introduction

A typical machine learning paradigm for discriminative tasks is to learn the distribution of an output, $y$ given a complete set of features, $x\in\mathbb{R}^{d}$ : $p(y\mid x)$ . Although this paradigm is successful in a multitude of domains, it is incongruous with the expectations of many real-world intelligent systems in two key ways: first, it assumes that a complete set of features has been observed; second, as a consequence, it also assumes that no additional information (features) of an instance may be obtained at evaluation time. These assumptions often do not hold; human agents routinely reason over instances with incomplete data and decide when and what additional information to obtain. For example, consider a doctor diagnosing a patient. The doctor usually has not observed all possible measurements (such as blood samples, x-rays, etc.) for the patient. He/she is not forced to make a diagnosis based on the observed measurements; instead, he/she may dynamically decide to take more measurements to help determine the diagnosis. Of course, the next measurement to make (feature to observe), if any, will depend on the values of the already observed features; thus, the doctor may determine a different set of features to observe from patient to patient (instance to instance) depending on the values of the features that were observed. Hence, each patient will not have the same subset of features selected (as would be the case with typical feature selection). Furthermore, acquiring features typically involves some cost (in time, money and risk), and intelligent systems are expected to automatically balance the cost and improvement on performance. In order to more closely match the needs of many real-world applications, we propose an active feature acquisition (AFA) model that not only makes predictions with incomplete/missing features, but also determines the next feature that would be the most valuable to obtain for a particular instance.

Refer to caption — Figure 1: Active feature acquisition on MNIST. Example of our acquisition process (top) and the prediction probabilities (bottom). The green masks indicate the unobserved features.

As noted in (Shim et al., 2018), the active feature acquisition problem may be formulated as a Markov decision process (MDP), where the state is the set of currently observed features and the action is the next feature to acquire. A special action indicates whether to stop the acquisition process and make a final prediction. After acquiring its value and paying the acquisition cost, the newly acquired feature is added to the observed subset and the agent proceeds to the next acquisition step. Once the agent decides to terminate the acquisition, it makes a final prediction based on the features acquired thus far. For example, in an image classification task (Fig. 2), the agent would dynamically acquire pixels until it is certain of the image class. The goal of the agent is to maximize the prediction performance while minimizing the acquisition cost.

The key insight of this work is that the dynamics model for the AFA MDP is based on the conditionals of the features: $p(x_{j}\mid x_{o})$ , where $x_{j}$ is an unobserved feature selected for acquisition and $x_{o}$ are the previously acquired features. Thus, we develop a model-based approach through generative modeling of all conditional dependencies. Equipped with the surrogate model, our method, Generative Surrogate Models for RL (GSMRL), essentially combines model-free and model-based RL into a holistic framework.

GSMRL rectifies several short-comings of a model-free scheme such as JAFA (Shim et al., 2018). In the aforementioned MDP, the agent pays the acquisition cost at each acquisition step but only receives a reward about the prediction after completing the acquisition process. To reduce the sparsity of the rewards and simplify the credit assignment problem for potentially long episodes (Minsky, 1961; Sutton, 1988), we leverage a surrogate model to provide intermediate rewards by assessing the information gain of a newly acquired feature, which quantifies how much our confidence about the prediction improves by acquiring this feature. In addition to sparse rewards, an AFA agent must also navigate a complicated high-dimensional action space (Dulac-Arnold et al., 2015), and must manage multiple roles as it has to: implicitly model dependencies, perform a cost/benefit analysis, and act as a classifier. To lessen the burden, we also propose using the surrogate model to provide side information that assists the agent. The side information shall explicitly inform the agent of: 1) uncertainty and imputations for unobserved features; 2) an estimate of the expected information gain of future acquisitions; 3) uncertainty of the target output. This allows the agent to easily assess its current uncertainty and helps the agent ‘look ahead’ to expected outcomes from future acquisitions.

In this work, we also propose the first (to the best of our knowledge) unsupervised AFA task, which we coin active instance recognition (AIR). Here we consider the case where there is not a single target variable, but instead the target of interest may be the remaining unobserved features themselves. That is, rather than reducing the uncertainty with respect to some desired output response (that cannot be directly queried and must be predicted), the task is to query as few features as possible that allows the agent to correctly uncover the remaining unobserved features. For example, in image data AIR, an agent queries new pixels until it can reliably uncover the remaining pixels (see Fig. 2). AIR is especially relevant in surveying tasks, which are broadly applicable across various domains and applications. Most surveys aim to discover a broad set of underlying characteristics of instances (e.g., citizens in a census) using a limited number of queries (questions in the census form), which is at the core of AIR. Policies for AIR would build a personalized subset of survey questions (for individual instances) that quickly uncovered the likely answers to all remaining questions.

Our contributions are as follows: 1) We reformulate the AFA problem as a generative modeling task and build surrogate models that capture the state transitions with arbitrary conditional distributions. 2) We develop methodology to leverage the surrogate model to provide intermediate rewards as training signals and to provide auxiliary information that assists the agent. Our framework represents a novel combination of model-free and model-based RL. 3) We propose the first unsupervised active feature acquisition task where the target variables are the unobserved features themselves. 4) We achieve state-of-the-art performance on both supervised and unsupervised tasks in the largest scale AFA study to date. 5) We open-source a standardized environment inheriting the OpenAI gym interfaces (Brockman et al., 2016) to assist future research on active feature acquisition. Code will be released upon publication.

2 Methods

In this section, we first describe our GSMRL framework for both active feature acquisition (AFA) and active instance recognition (AIR) problems. We then develop our RL algorithm and the corresponding surrogate models for different settings. We also introduce a special application that acquires features for time series data.

2.1 AFA and AIR with GSMRL

Consider a discriminative task with features $x\in\mathbb{R}^{d}$ and target $y$ . Instead of predicting the target by first collecting all the features, we perform a sequential feature acquisition process in which we start from an empty set of features and actively acquire more features. There is typically a cost associated with features and the goal is to maximize the task performance while minimizing the acquisition cost, i.e.,

\text{minimize}~{}\mathcal{L}(\hat{y}(x_{o}),y)+\alpha\mathcal{C}(o),

(1)

where $\mathcal{L}(\hat{y}(x_{o}),y)$ represents the loss function between the prediction $\hat{y}(x_{o})$ and the target $y$ . Note that the prediction is made with the acquired feature subset $x_{o},o\subseteq\{1,\ldots,d\}$ . Therefore the agent should be able to predict with arbitrary subsets. $\mathcal{C}(o)$ represents the cost of the acquired features $o$ . The hyperparameter $\alpha$ controls the trade-off between prediction loss and acquisition cost. For unsupervised tasks, the target variable $y$ equals to $x$ ; that is, we acquire features actively to represent the instance with a selected subset.

In order to solve the optimization problem in (1), we formulate it as a Markov decision process as in (Zubek et al., 2004; Shim et al., 2018):

		$\displaystyle s=[o,x_{o}],\quad a\in u\cup\phi,$		(2)
		$\displaystyle r(s,a)=-\mathcal{L}(\hat{y},y)\mathbb{I}(a=\phi)-\alpha\mathcal{C}(a)\mathbb{I}(a\neq\phi).$		(2)

The state $s$ is the current acquired feature subset $o\subseteq\{1,\ldots,d\}$ and their values $x_{o}$ . The action space contains the remaining candidate features $u=\{1,\ldots,d\}\setminus o$ and a special action $\phi$ that indicates the termination of the acquisition process. To optimize the MDP, a reinforcement learning agent acts based on the observed state and receives rewards from the environment. When the agent acquires a new feature $i$ , the current state transits to a new state following $o\xrightarrow{i}o\cup i,x_{o}\xrightarrow{i}x_{o}\cup x_{i}$ , and the reward is the negative acquisition cost of this feature. $x_{i}$ is obtained from the environment (i.e. we observe the true $i^{\mathrm{th}}$ feature value for the instance).

Feature Dependencies as Dynamics Model A surprisingly unexplored property of the AFA MDP, and the driving observation to our work, is that the dynamics of the problem are dictated by conditional dependencies among the data’s features. That is, the state transitions are based on the conditionals: $p(x_{j}\mid x_{o})$ , where $x_{j}$ is an unobserved feature. Therefore we frame our approach according to the estimation of conditionals among features with generative models. We build a surrogate model to learn the distribution $p(y,x_{j}\mid x_{o})$ , where $x_{j}$ and $x_{o}$ contain arbitrary features from $x$ . We find that the most efficacious use of our generative surrogate model (see section 5) is a hybrid model-based approach that utilizes intermediate rewards and side information stemming from dependencies.

Intermediate Rewards When the agent terminates the acquisition and makes a prediction, the reward equals to the negative prediction loss using current acquired features. Since the prediction is made at the end of acquisitions, the reward of the prediction is received only when the agent decides to terminate the acquisition process. This is a typical temporal credit assignment problem, which may affect the learning of the agent (Minsky, 1961; Sutton, 1988). Given the surrogate model, we propose to remedy the credit assignment problem by providing intermediate rewards for each acquisition. Inspired by the information gain, the surrogate model assesses the intermediate reward for a newly acquired feature $i$ with

r_{m}(s,i)=H(y\mid x_{o})-\gamma H(y\mid x_{o},x_{i}),

(3)

where $\gamma$ is a discount factor for the MDP. In appendix A, we show that our intermediate rewards will not change the optimal policy.

Side Information In addition to intermediate rewards, we propose using the surrogate model to also provide side information to assist the agent, which includes the current prediction and output likelihood, the possible values and corresponding uncertainties of the unobserved features, and the estimated utilities of the candidate acquisitions. The current prediction $\hat{y}$ and likelihood $p(y\mid x_{o})$ inform the agent about its confidence, which can help the agent determine whether to stop the acquisition. The imputed values and uncertainties of the unobserved features give the agent the ability to look ahead into and future and guide its exploration. For example, if the surrogate model is very confident about the value of a currently unobserved feature, then acquiring it would be redundant. The utility of a feature $i$ is estimated by its expected information gain to the target variable:

	$\displaystyle\mathcal{U}_{i}$	$\displaystyle=H(y\mid x_{o})-\mathbb{E}_{p(x_{i}\mid x_{o})}H(y\mid x_{i},x_{o})$		(4)
		$\displaystyle=H(x_{i}\mid x_{o})-\mathbb{E}_{p(y\mid x_{o})}H(x_{i}\mid y,x_{o}),$		(4)

where the surrogate model is used to estimate the expected entropies. The utility essentially quantifies the conditional mutual information $I(x_{i};y\mid x_{o})$ between each candidate feature and the target variable. A greedy policy can be easily built based on the utilities where the next feature to acquire is the one with maximum utility (Ma et al., 2018; Gong et al., 2019). Here, our agent takes the utilities as side information to help balance exploration and exploitation, and eventually learns a non-greedy policy.

Algorithm 1 Active Feature Acquisition with GSMRL

0: pretrained surrogate model

M

; agent agent; prediction model

f_{\theta}(\cdot)

; test dataset

D

to be acquired

1. instantiate an environment:

\textit{env}=\text{Env}(D)

x_{o}

, o = env.reset()

3. done = False; reward = 0

while not done do

aux =

M

.query(

x_{o}

, o)

action = agent.act(

x_{o}

, o, aux)

r_{m}=

M

.reward(

x_{o}

o

, action)

x_{o}

, o, done, r = env.step(action)

reward = reward + r +

r_{m}

end while

prediction = agent.predict(

x_{o}

, o, aux)

Prediction Model When the agent deems that acquisition is complete, it makes a final prediction based on the acquired features thus far. The final prediction may be made using the surrogate model, i.e., $p(y\mid x_{o})$ , but it might be beneficial to train predictions specifically based on the agent’s own distribution of acquired features $o$ , since the surrogate model is agnostic to the feature acquisition policy of the agent. Therefore, we build a prediction model $f_{\theta}(\cdot)$ that takes both the current state $x_{o}$ and the side information as inputs (i.e. the same inputs as the policy). The prediction model can be trained simultaneously with the policy as an auxiliary task. Weight sharing between the policy and prediction function facilitates the learning of better representations.

Given the two predictions from the surrogate model and the prediction model respectively, the final reward $-\mathcal{L}(\hat{y},y)$ during training is the maximum one using either predictions. During test time, we choose one prediction based on validation performance. An illustration of our framework is presented in Fig. 3. Please refer to Algorithm 1 for pseudo-code of the acquisition process with our GSMRL framework. Please also see Algorithm 2 in the appendix for a detailed version. We will expound on the surrogate models for different settings below.

2.1.1 Surrogate Model for AFA

As we mentioned above, the surrogate model learns the conditional distributions $p(y,x_{j}\mid x_{o})$ . Note that $x_{o}$ is an arbitrary subset of the features and $x_{j}$ is an arbitrary unobserved feature since the surrogate model must be able to assist arbitrary policies, and acquired features will vary from instance to instance. Thus, there are an exponential number of different conditionals that the surrogate model must estimate for a $d$ -dimensional feature space. Therefore, learning a separate model for each different conditional is intractable. Fortunately, Ivanov et al. (2018) and Li et al. (2019) have proposed models to learn arbitrary conditional distributions $p(x_{u}\mid x_{o})$ . They regard different conditionals as different tasks and train VAE and normalizing flow based generative models, respectively, in a multi-task fashion to capture the arbitrary conditionals with a unified model. In this work, we leverage arbitrary conditionals and extend them to model the target variable $y$ as well. For continuous target variables, we concatenate them with the features, thus $p(y,x_{j}\mid x_{o})$ can be directly modeled. For discrete target variables, where we have a mix of continuous features and discrete labels, we use Bayes’ rule:

p(y,x_{j}\mid x_{o})=\frac{p(x_{j}\mid y,x_{o})p(x_{o}\mid y)P(y)}{\sum_{y^{\prime}}p(x_{o}\mid y^{\prime})P(y^{\prime})}.

(5)

We employ a variant arbitrary conditioning model that conditions on the target $y$ to obtain the arbitrary conditional likelihoods $p(x_{j}\mid y,x_{o})$ and $p(x_{o}\mid y)$ in (5).

Given a trained surrogate model, the prediction $p(y\mid x_{o})$ , the information gain in (3), and the utilities in (4) can all be estimated using the arbitrary conditionals. For continuous target variables, the prediction can be estimated by drawing samples from $p(y\mid x_{o})$ , and we express their uncertainties using sample variances. We calculate the entropy terms in (3) with Monte Carlo estimations. The utility in (4) can be further simplified as

	$\displaystyle\mathcal{U}_{i}$	$\displaystyle=\mathbb{E}_{p(y,x_{i}\mid x_{o})}\log\frac{p(x_{i},y\mid x_{o})}{p(y\mid x_{o})p(x_{i}\mid x_{o})}$		(6)
		$\displaystyle=\mathbb{E}_{p(y,x_{i}\mid x_{o})}\log\frac{p(y\mid x_{i},x_{o})}{p(y\mid x_{o})}.$		(6)

We then perform a Monte Carlo estimation by sampling from $p(y,x_{i}\mid x_{o})$ . Note that $p(y\mid x_{i},x_{o})$ is evaluated on sampled $x_{i}$ rather than the true value, since we have not acquired its value yet.

For discrete target variables, we employ Bayes’ rule to make a prediction

	$\displaystyle P(y\mid x_{o})$	$\displaystyle=\frac{p(x_{o}\mid y)P(y)}{\sum_{y^{\prime}}p(x_{o}\mid y^{\prime})P(y^{\prime})}$		(7)
		$\displaystyle=\text{softmax}_{y}(\log p(x_{o}\mid y^{\prime})+\log P(y^{\prime})),$		(7)

and the uncertainty is expressed as the prediction probability. The information gain in (3) can be estimated analytically, since the entropy for a categorical distribution is analytically available. To estimate the utility, we further simplify (6) to

	$\displaystyle\mathcal{U}_{i}$	$\displaystyle=\mathbb{E}_{p(x_{i}\mid x_{o})P(y\mid x_{i},x_{o})}\log\frac{P(y\mid x_{i},x_{o})}{P(y\mid x_{o})}$		(8)
		$\displaystyle=\mathbb{E}_{p(x_{i}\mid x_{o})}D_{\mathrm{KL}}[P(y\mid x_{i},x_{o})\\|P(y\mid x_{o})],$		(8)

where the KL divergence between two discrete distributions can be analytically computed. $x_{i}$ is sampled from $p(x_{i}\mid x_{o})$ as before. We again use Monte Carlo estimation for the expectation.

Although the utility can be estimated accurately by (6) and (8), it involves some overhead especially for long episodes, since we need to calculate them for each candidate feature at each acquisition step. Moreover, each Monte Carlo estimation may require multiple samples. To reduce the computation overhead, we utilize (4) and estimate the entropy terms with Gaussian approximations. That is, we approximate $p(x_{i}\mid x_{o})$ and $p(x_{i}\mid y,x_{o})$ as Gaussian distributions and entropies reduce to a function of the variance. We use sample variance as an approximation. We found that this Gaussian entropy approximation performs comparably while being much faster.

2.1.2 Surrogate Model for AIR

For unsupervised tasks, our goal is to represent the full set of features with an actively selected subset. Since the target is also $x$ , we modify our surrogate model to capture arbitrary conditional distributions $p(x_{u}\mid x_{o})$ .

$\mathcal{U}_{i}=H(x_{i}\mid x_{o})-\mathbb{E}_{p(x\mid x_{o})}H(x_{i}\mid x,x_{o})=H(x_{i}\mid x_{o}).$

(9)

The last equality is due to the fact that $H(x_{i}\mid x)=0$ . We again use a Gaussian approximation to estimate the entropy. Therefore, the side information for AIR only contains imputed values and their variances of the unobserved features. Similar to the supervised case, we leverage the surrogate model to provide the intermediate rewards. Instead of using the information gain in (3), we use the reduction of negative log likelihood per dimension, i.e.,

$r_{m}(s,i)=\frac{-\log p(x_{u}\mid x_{o})}{|u|}-\gamma\frac{-\log p(x_{u\setminus i}\mid x_{o},x_{i})}{|u|-1},$

(10)

since (3) involves estimating the entropy for potentially high dimensional distributions, which itself is an open problem (Kybic, 2007). We show in appendix A that the optimal policy is invariant under this form of intermediate rewards. The final reward $-\mathcal{L}(\hat{x},x)$ is calculated as the negative MSE of unobserved features $-\mathcal{L}(\hat{x},x)=-\|\hat{x}_{u}-x_{u}\|_{2}^{2}$ .

2.2 AFA for Time Series

We also apply our GSMRL framework on time series data. For example, consider a scenario where sensors are deployed in the field with limited resources. We would like the sensors to decide when to put themselves online to collect data. The goal is to make as few acquisitions as possible while still making an accurate prediction. In contrast to ordinary vector data, the acquired features must follow a chronological order, i.e., the newly acquired feature $i$ must occur after all elements of $o$ (since we may not go back in time to turn on sensors). In this case, it is detrimental to acquire a feature that occurs very late in an early acquisition step, since we will lose the opportunity to observe features ahead of it. The chronological constraint in action space removes all the features behind the acquired features from the candidate set. For example, after acquiring feature $t$ , features $\{1,\ldots,t\}$ are no longer considered as candidates for the next acquisition.

2.3 Implementation

We implement our GSMRL framework using the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017). The policy network takes in a set of observed features and a set of auxiliary information from the surrogate model, extracts a set embedding from them using the set transformer (Lee et al., 2019), and outputs the actions. The critic network that estimates the value function shares the same set embedding as the policy network. To help learn useful representations, we also use the same set embedding as inputs for the prediction model $f_{\theta}$ . Arbitrary conditionals are estimated based on (Li et al., 2019).

To reflect the fact that acquiring the same feature repeatedly is redundant, we manually remove those acquired features from the candidate set. For time-series data, the acquired features must follow the chronological order since we cannot go back in time to acquire another feature, therefore we need to remove all the features behind the acquired features from the candidate set. Similar spatial constraints can also be applied for spatial data. To satisfy those constraints, we manually set the probabilities of the invalid actions to zeros.

3 Related Works

Active Learning Active learning (Fu et al., 2013; Konyushkova et al., 2017; Yoo & Kweon, 2019) is a related approach in ML to gather more information when a learner can query an oracle for the true label, $y$ , of a complete feature vector $x\in\mathbb{R}^{d}$ to build a better estimator. However, our methods consider queries to the environment for the feature value corresponding to an unobserved feature dimension, $i$ , in order to provide a better prediction on the current instance. Thus, while the active learning paradigm queries an oracle during training to build a classifier with complete features, our paradigm queries the environment at evaluation to obtain missing features of a current instance to help its current assessment.

Feature Selection Feature selection (Miao & Niu, 2016; Li et al., 2017; Cai et al., 2018), ascertains a static subset of important features to eliminate redundancies, which can help reduce computation and improve generalization. Feature selection methods choose a fixed subset of features $s\subseteq\{1,\dots,d\}$ , and always predict $y$ using this same subset of feature values, $x_{s}$ . In contrast, our model considers a dynamic subset of features that is sequentially chosen and personalized on an instance-by-instance basis to increase useful information.

Active Feature Acquisition Instead of predicting the target passively using collected features, previous works have explored actively acquiring features in the cost-sensitive setting. Active perception is a relevant sub-field where a robot with a mounted camera is planning by selecting the best next view (Bajcsy, 1988; Aloimonos et al., 1988; Cheng et al., 2018; Jayaraman & Grauman, 2018). In this work we consider general features, and take images as one of many data sources. For general data, Ling et al. (2004), Chai et al. (2004) and Nan et al. (2014) propose decision tree, naive Bayes and maximum margin based classifiers, respectively, to jointly minimize the misclassification cost and feature acquisition cost. Ma et al. (2018) and Gong et al. (2019) acquire features greedily using mutual information as the estimated utility. Zubek et al. (2004) formulate the AFA problem as a MDP and fit a transition model using complete data, then they use the AO* heuristic search algorithm to find an optimal policy. Rückstieß et al. (2011) formulate the problem as a partially observable MDP and solve it using Fitted Q-Iteration. He et al. (2012) and He et al. (2016) instead employ a imitation learning approach guided by a greedy reference policy. Shim et al. (2018) utilize Deep Q-Learning and jointly learn a policy and a classifier. The classifier is treated as an environment that calculates the classification loss as the reward. ODIN (Zannone et al., 2019) presents an approach to learn a policy and a prediction model using augmented data with a Partial VAE (Ma et al., 2018). In contrast, GSMRL uses a surrogate model, which estimates both the state transitions and the prediction in a unified model, to directly provide intermediate rewards and auxiliary information to an agent.

Model-based and Model-free RL Reinforcement learning can be roughly grouped into model-based methods and model-free methods depending on whether they use a transition model (Li, 2017). Model-based methods are more data efficient but could suffer from significant bias if the dynamics are misspecified. On the contrary, model-free methods can handle arbitrary dynamic system but typically requires substantially more data samples. There have been works that combine model-free and model-based methods to compensate with each other. The usage of the model includes generating synthetic samples to learn a policy (Gu et al., 2016), back-propagating the reward to the policy along a trajectory (Heess et al., 2015), and planning (Chebotar et al., 2017; Pong et al., 2018). In this work, we rely on the model to provide intermediate rewards and side information. We compare this strategy to other model-based approaches in section 5.

4 Experiments

In this section, we evaluate our method on several benchmark environments built upon the UCI repository (Dua & Graff, 2017) and MNIST dataset (LeCun, 1998). We compare our method to another RL based approach, JAFA (Shim et al., 2018), which jointly trains an agent and a classifier. We also compare to a greedy policy EDDI (Ma et al., 2018) that estimates the utility for each candidate feature using a VAE based model and selects one feature with the highest utility at each acquisition step. As a baseline, we also acquire features greedily using our surrogate model that estimates the utility following (6), (8) and (9). We use a fixed cost for each feature and report multiple results with different $\alpha$ in (1) to control the trade-off between task performance and acquisition cost. We cross validate the best architecture and hyperparameters for baselines. Architectural details, hyperparameters and sensitivity analysis are provided in the Appendix. In this work, we conduct the largest scale AFA study to date. Previous works have typically considered smaller datasets (both in terms of the number of features and the number of instances). We instead consider a broad range of datasets with more instances and higher dimensionality. In terms of comparisons, previous works often compare to naively simple baselines, such as a random acquisition order. In this work, we compare our GSMRL to the state-of-the-art models with both greedy policy and non-greedy RL policy.

Classification We first perform classification on the MNIST dataset. We downsample the original images to $16\times 16$ to reduce the action space to accommodate baselines such as EDDI that have trouble scaling (see Sec. D in the appendix for details on full MNIST).

Fig. 5 illustrates several examples of the acquired features and their prediction probability for different images. We can see that our model acquires a different subset of features for different images. Notice the checkerboard patterns of the acquired features, which indicates our model is able to exploit the spatial correlation of the data. Fig. 2 shows the acquisition process and the prediction probability along the acquisition. We can see the prediction become certain after acquiring only a small subset of features. The test accuracy in Fig. 10 demonstrates the superiority of our method over other baselines. It typically achieves higher accuracy with a lower acquisition cost. It is worth noting that our surrogate model with a greedy acquisition policy outperforms EDDI. We believe the improvement is due to the better distribution modeling ability of our surrogate model so that the utility and the prediction are more accurately estimated. We also perform classification using several UCI datasets. The test accuracy is presented in Fig. 7. Again, our method outperforms baselines under the same acquisition budget.

Regression We also conduct experiments for regression tasks using several UCI datasets. We report the root mean squared error (RMSE) of the target variable in Fig. 7. Similar to the classification task, our model outperforms baselines with a lower acquisition cost.

Time Series To evaluate the performance with constraints in action space, we classify over time series data where the acquired features must follow chronological ordering. The datasets are from the UEA & UCR time series classification repository (Bagnall et al., 2017). For GSMRL and JAFA, we clip the probability of invalid actions to zero; for the greedy method, we use a prior to bias the selection towards earlier time points. Please refer to Appendix B.4 for details. Fig. 9 shows the accuracy with different numbers of acquired features. Our method achieves high accuracy by collecting a small subset of the features.

Medical Diagnosis We evaluate the AFA performance for medical diagnosis. We use the Physionet challenge

2012 dataset (Goldberger et al., 2000) to predict the in-hospital mortality. Since the classes are heavily imbalanced, we use weighted cross entropy as training loss and the final rewards. For evaluation, we report the F1 scores in Fig. 11. Compared to baselines, our model achieves higher F1 with lower acquisition cost.

Unsupervised Next, we evaluate our method on unsupervised tasks where features are actively acquired to impute the unobserved features.

We use negative MSE as the reward for GSMRL and JAFA. The greedy policy calculates the utility following (9). For low dimensional UCI datasets, our method is comparable to baselines as shown in Fig. 9; but for the high dimensional case, as shown in Fig. 12, our method is doing better. Note JAFA is worse than the greedy policy for MNIST. We found it hard to train the policy and the reconstruction model jointly without the help of the surrogate model in this case. See Fig. 2 for an example of the acquisition process.

5 Ablations

We now conduct a series of ablation studies to explore the capabilities of our GSMRL model.

Model-based Alternatives Our GSMRL model combines model-based and model-free approach into a holistic framework by providing the agent with auxiliary information and intermediate rewards. Here, we study different ways of utilizing the dynamics model. As in ODIN (Zannone et al., 2019), we utilize class conditioned generative models to generate synthetic trajectories.

The agent is then trained with both real and synthetic data (PPO+Syn). Another way of using the model is to extract a semantic embedding from the observations (Kumar et al., 2018). We use a pretrained EDDI to embed the current observed features into a 100-dimensional feature vector. An agent then takes the embedding as input and predicts the next acquisition (PPO+Embed). Figure 13 compares our method with these alternatives. We also present the results from a model-free approach as a baseline. We see our GSMRL outperforms other model-based approaches by a large margin.

Surrogate Models Our method relies on the surrogate model to provide intermediate rewards and auxiliary information. To better understand the contributions each component does to the overall framework, we conduct ablation studies using the MNIST dataset. We gradually drop one component from the full model and report the results in Fig. 15. The ‘Full Model’ uses both intermediate rewards and auxiliary information.

We then drop the intermediate rewards and denote it as ‘w/o rm’. The model without auxiliary information is denoted as ‘w/o aux’. We further drop both components and denote it as ‘w/o rm & aux’. From Fig. 15, we see these two components contribute significantly to the final results. We also compare models with and without the surrogate model. For models without a surrogate model, we train a classifier jointly with the agent as in JAFA. We plot the smoothed rewards using moving window average during training in Fig. 15. The agent with a surrogate model not only produces higher and smoother rewards but also converges faster.

Dynamic vs. Static Acquisition Our GSMRL acquires features following a dynamic order where it eventually acquires different features for different instances. A dynamic acquisition policy should perform better than a static one (i.e., the same set of features are acquired for each instance), since the dynamic policy allows the acquisition to be specifically adapted to the corresponding instance. To verify this is actually the case, we compare the dynamic and static acquisition under a greedy policy for MNIST classification.

Similar to the dynamic greedy policy, the static acquisition policy acquires the feature with maximum utility at each step, but the utility is averaged over the whole testing set, therefore the same acquisition order is adopted for the whole testing set. Figure 16 shows the classification accuracy for both EDDI and GSM under a greedy acquisition policy. We can see the dynamic policy is always better than the corresponding static one. Furthermore, our GSM with a static acquisition can already outperform dynamic EDDI.

Greedy vs. Non-greedy Acquisition Our GSMRL will terminate the acquisition process if the agent deems the current acquisition achieves the optimal trade-off between the prediction performance and the acquisition cost.

To evaluate how much the termination action affects the performance and to directly compare with the greedy policies under the same acquisition budget, we conduct an ablation study that removes the termination action and gives the agent a hard acquisition budget (i.e., forcing the agent to predict after some number of acquisitions). We can see (Fig. 17) GSMRL outperforms the greedy policy under all budgets. Moreover, we see that the agent is able to correctly assess whether or not more acquisitions are useful, since it obtains better performance when it dictates when to predict with the termination action.

6 Conclusion

In this work, we reformulate the dynamics of the AFA MDP as a generative modeling task among features. We leverage a generative surrogate model to capture the state transitions across arbitrary feature subsets. The surrogate model also provides auxiliary information and intermediate rewards to assist the agent. Our GSMRL model essentially combines model-based and model-free approaches. We conduct a large scale study to evaluate our model on both supervised and unsupervised AFA problems. Our model achieves state-of-the-art performance on both problems. In future work, we will explore AFA in spatial-temporal setting with continuously indexed features.

References

Aloimonos et al. (1988) Aloimonos, J., Weiss, I., and Bandyopadhyay, A. Active vision. International journal of computer vision, 1(4):333–356, 1988.
Bagnall et al. (2017) Bagnall, A., Lines, J., Bostrom, A., Large, J., and Keogh, E. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3):606–660, 2017.
Bajcsy (1988) Bajcsy, R. Active perception. Proceedings of the IEEE, 76(8):966–1005, 1988.
Brabec & Machlica (2018) Brabec, J. and Machlica, L. Bad practices in evaluation methodology relevant to class-imbalanced problems. arXiv preprint arXiv:1812.01388, 2018.
Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.
Cai et al. (2018) Cai, J., Luo, J., Wang, S., and Yang, S. Feature selection in machine learning: A new perspective. Neurocomputing, 300:70–79, 2018.
Chai et al. (2004) Chai, X., Deng, L., Yang, Q., and Ling, C. X. Test-cost sensitive naive bayes classification. In Fourth IEEE International Conference on Data Mining (ICDM’04), pp. 51–58. IEEE, 2004.
Chebotar et al. (2017) Chebotar, Y., Hausman, K., Zhang, M., Sukhatme, G., Schaal, S., and Levine, S. Combining model-based and model-free updates for trajectory-centric reinforcement learning. arXiv preprint arXiv:1703.03078, 2017.
Cheng et al. (2018) Cheng, R., Agarwal, A., and Fragkiadaki, K. Reinforcement learning of active vision for manipulating objects under occlusions. arXiv preprint arXiv:1811.08067, 2018.
Dua & Graff (2017) Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
Dulac-Arnold et al. (2015) Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., and Coppin, B. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.
Fu et al. (2013) Fu, Y., Zhu, X., and Li, B. A survey on instance selection for active learning. Knowledge and information systems, 35(2):249–283, 2013.
Goldberger et al. (2000) Goldberger, A. L., Amaral, L. A., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. circulation, 101(23):e215–e220, 2000.
Gong et al. (2019) Gong, W., Tschiatschek, S., Nowozin, S., Turner, R. E., Hernández-Lobato, J. M., and Zhang, C. Icebreaker: Element-wise efficient information acquisition with a bayesian deep latent gaussian model. In Advances in Neural Information Processing Systems, pp. 14820–14831, 2019.
Gu et al. (2016) Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838, 2016.
He et al. (2012) He, H., Eisner, J., and Daume, H. Imitation learning by coaching. In Advances in Neural Information Processing Systems, pp. 3149–3157, 2012.
He et al. (2016) He, H., Mineiro, P., and Karampatziakis, N. Active information acquisition. arXiv preprint arXiv:1602.02181, 2016.
Heess et al. (2015) Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952, 2015.
Ivanov et al. (2018) Ivanov, O., Figurnov, M., and Vetrov, D. Variational autoencoder with arbitrary conditioning. arXiv preprint arXiv:1806.02382, 2018.
Jayaraman & Grauman (2018) Jayaraman, D. and Grauman, K. Learning to look around: Intelligently exploring unseen environments for unknown tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1238–1247, 2018.
Konyushkova et al. (2017) Konyushkova, K., Sznitman, R., and Fua, P. Learning active learning from data. In Advances in Neural Information Processing Systems, pp. 4225–4235, 2017.
Kumar et al. (2018) Kumar, A., Eslami, S., Rezende, D. J., Garnelo, M., Viola, F., Lockhart, E., and Shanahan, M. Consistent generative query networks. arXiv preprint arXiv:1807.02033, 2018.
Kybic (2007) Kybic, J. High-dimensional entropy estimation for finite accuracy data: R-nn entropy estimator. In Biennial International Conference on Information Processing in Medical Imaging, pp. 569–580. Springer, 2007.
LeCun (1998) LeCun, Y. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
Lee et al. (2019) Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744–3753. PMLR, 2019.
Li et al. (2017) Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., and Liu, H. Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6):1–45, 2017.
Li (2017) Li, Y. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
Li et al. (2019) Li, Y., Akbar, S., and Oliva, J. B. Acflow: Flow models for arbitrary conditional likelihoods. arXiv preprint arXiv:1909.06319, 2019.
Ling et al. (2004) Ling, C. X., Yang, Q., Wang, J., and Zhang, S. Decision trees with minimal costs. In Proceedings of the twenty-first international conference on Machine learning, pp. 69, 2004.
Ma et al. (2018) Ma, C., Tschiatschek, S., Palla, K., Hernández-Lobato, J. M., Nowozin, S., and Zhang, C. Eddi: Efficient dynamic discovery of high-value information with partial vae. arXiv preprint arXiv:1809.11142, 2018.
Miao & Niu (2016) Miao, J. and Niu, L. A survey on feature selection. Procedia Computer Science, 91:919–926, 2016.
Minsky (1961) Minsky, M. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30, 1961.
Nan et al. (2014) Nan, F., Wang, J., Trapeznikov, K., and Saligrama, V. Fast margin-based cost-sensitive classification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2952–2956. IEEE, 2014.
Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. 1999.
Pong et al. (2018) Pong, V., Gu, S., Dalal, M., and Levine, S. Temporal difference models: Model-free deep rl for model-based control. arXiv preprint arXiv:1802.09081, 2018.
Rückstieß et al. (2011) Rückstieß, T., Osendorfer, C., and van der Smagt, P. Sequential feature selection for classification. In Australasian Joint Conference on Artificial Intelligence, pp. 132–141. Springer, 2011.
Russo et al. (2017) Russo, D., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on thompson sampling. arXiv preprint arXiv:1707.02038, 2017.
Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Shim et al. (2018) Shim, H., Hwang, S. J., and Yang, E. Joint active feature acquisition and classification with variable-size set encoding. In Advances in neural information processing systems, pp. 1368–1378, 2018.
Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
Thompson (1933) Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
Yoo & Kweon (2019) Yoo, D. and Kweon, I. S. Learning loss for active learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 93–102, 2019.
Zannone et al. (2019) Zannone, S., Hernandez Lobato, J. M., Zhang, C., and Palla, K. Odin: Optimal discovery of high-value information using model-based deep reinforcement learning. In Real-world Sequential Decision Making Workshop, ICML, June 2019.
Zubek et al. (2004) Zubek, V. B., Dietterich, T. G., et al. Pruning improves heuristic search for cost-sensitive learning. 2004.

Appendix A Policy Invariance under Intermediate Rewards

Assume the original Markov Decision Process (MDP) without the intermediate rewards is defined as $M=(S,A,T,\gamma,R)$ , where $S$ and $A$ are state and action spaces, $T$ is the state transition probabilities, $\gamma$ is the discount factor, and $R$ is the rewards. When we introduce the intermediate rewards $R_{m}$ , the MDP is modified to $M^{\prime}=(S,A,T,\gamma,R^{\prime})$ , where $R^{\prime}=R+R_{m}$ . The following theory provides a sufficient and necessary condition for the modified MDP $M^{\prime}$ to achieve the same optimal policy as the original MDP $M$ .

Theorem 1.

The modified MDP $M^{\prime}=(S,A,T,\gamma,R+F)$ with any shaping reward function $F$ is guaranteed to be consistent with the optimal policy of the original MDP $M=(S,A,T,\gamma,R)$ if the shaping function $F$ have the following form

F(s,a,s^{\prime})=\gamma\Phi(s^{\prime})-\Phi(s),

(A.1)

where $\Phi:S\rightarrow\mathbb{R}$ is a potential function evaluated on states. For infinite-state case (i.e., the state space is an infinite set) the potential function is additionally required to be bounded.

Proof.

Please refer to Ng et al. (1999) for detailed proof. ∎

From the above theorem, we can see our intermediate rewards in (3) is a potential based shaping function and the potential function is $\Phi(s)=-H(y\mid s)$ . For classification task where $y\in\mathcal{Y}=\{1,2,\dots,K\}$ is a discrete variable, the entropy is naturally bounded, i.e., $0\leq H(y\mid s)\leq log|\mathcal{Y}|$ , where $|\mathcal{Y}|$ is the cardinality of the label space. For regression task where $y\in\mathbb{R}$ , the entropy is bounded by $0\leq H(y\mid s)\leq H(y\mid\emptyset)$ . The upper bound $H(y\mid\emptyset)$ is determined by the given surrogate model. Similarly, for the intermediate rewards in (10), the potential function $\Phi(s)=\frac{\log p(x_{u}\mid s)}{|u|}$ is also bounded for a given surrogate model.

1. load pretrained surrogate model $M$ , agent agent and prediction model $f_{\theta}(\cdot)$

2. instantiate an environment with data $D$ : $\textit{env}=\text{Environment}(D)$

3. $x_{o}$ , o = env.reset(); // $o=\emptyset$

4. done = False; reward = 0

while not done do

aux =

M

.query(

x_{o}

, o); // query M for auxiliary information // aux contains the prediction

\hat{y}\sim p(y\mid x_{o})

and output likelihoods, // the imputed values

\hat{x}_{u}\sim p(x_{u}\mid x_{o})

and their uncertainties, // and estimated utilities

\mathcal{U}_{i}

for each

i\in u

((4)). action = agent.act(

x_{o}

, o, aux); // act based on the state and auxiliary info

r_{m}

M

.reward(

x_{o}

o

, action); // calculate intermediate rewards with the surrogate model

x_{o}

, o, done, r = env.step(action); // take a step based on the action // if action indicates termination: done=True, r=-

\mathcal{L}(\hat{y}(x_{o}),y)

// else: done=False, r=

-\alpha\mathcal{C}(action)

o=o\cup action

reward = reward + r +

r_{m}

; // accumulate rewards

end while

prediction = agent.predict(

x_{o}

o

, aux); // make a final prediction // using either

M

.predict(

x_{o}

, o, aux) or

f_{\theta}

(

x_{o}

, o, aux) based on validation

Algorithm 2 Active Feature Acquisition with GSMRL

Appendix B Experiments

B.1 Classification

For classification tasks, we conduct experiments on MNIST and two UCI datasets. We downsample the MNIST images to $16\times 16$ to reduce the total number of features in order to accommodate baselines such as (Ma et al., 2018), which had trouble scaling. Features are normalized into the range $[0,1]$ .

The surrogate model for classification task estimate arbitrary conditional distributions that are conditioned on the target variable $y$ . For MNIST, we stack conditional coupling transformations and a conditional Gaussian likelihood module. For UCI dataset, we use an autoregressive likelihood module. To train the surrogate model, we randomly select two non-overlapping subsets $u$ and $o$ and optimize the arbitrary conditional log likelihood

\begin{split}&\log p(y,x_{u}\mid x_{o})=\log p(x_{u}\mid x_{o})+\log P(y\mid x_{u},x_{o})\\ &=\log p(x_{u}\mid x_{o})+\log\frac{p(x_{u},x_{o}\mid y)P(y)}{\sum_{y^{\prime}}p(x_{u},x_{o}\mid y^{\prime})P(y^{\prime})}.\end{split}

(B.2)

The agent is implemented as a PPO policy. Given the current state $x_{o}$ and the auxiliary information from the surrogate model, we extract a set embedding using set transformer (Lee et al., 2019). The inputs are first transformed to sets by concatenating with the one-hot encoding of their indexes. The set embedding is beneficial to deal with arbitrary dimensionality of the inputs. The policy network then takes the set embedding as inputs and outputs the next action. The critic network takes the same set embedding as inputs and output an estimate of the state values. To help the agent extract meaningful representations from its inputs, we let the prediction model $f_{\theta}$ take the same set embedding as input. The policy network, the critic network and the prediction function are all implemented as fully connected layers.

We run the baseline model JAFA (Shim et al., 2018) using their public code. We cross-validate the optimal architecture by modifying the number of layers and the size of each layer for both the agent and the classifier.

We adapt EDDI (Ma et al., 2018) to perform classification task by modifying the decoder to output Categorical distribution for $y$ and Gaussian distribution for $x$ . EDDI learns the distribution $p(y,x_{o})$ by utilizing a VAE based model. The acquisition metric for EDDI is

	$\displaystyle\mathcal{U}_{i}=$	$\displaystyle\mathbb{E}_{x_{i}\sim p(x_{i}\mid x_{o})}D_{\mathrm{KL}}[p(z\mid x_{i},x_{o})\\|p(z\|x_{o})]$		(B.3)
		$\displaystyle-\mathbb{E}_{y,x_{i}\sim p(y,x_{i}\mid x_{o})}D_{\mathrm{KL}}[p(z\mid y,x_{i},x_{o})\\|p(z\mid y,x_{o})],$		(B.3)

which is estimated using the proposal distribution. Then, a greedy policy that acquires the feature with maximum utility is employed. We similarly cross-validate the architecture for each dataset.

We also compare to a greedy policy using the surrogate model where the utility is calculated by (8). At each acquisition step, the one with maximum utility is selected.

B.2 Regression

For regression task, the target variable $y$ is concatenated into the features $x$ and the surrogate model learns the distribution $p(y,x_{u}\mid x_{o})$ . The agent is similarly implemented as the PPO policy with a set transformer based feature extractor. Baseline models include JAFA and EDDI, where the architecture is selected by cross validation. We also build a greedy policy using our surrogate model by estimating the utility following (6). For GSMRL and JAFA, the reward for a prediction $\hat{y}$ is calculated as the negative MSE $-\|\hat{y}-y\|_{2}^{2}$ .

B.3 Medical Diagnosis

We evaluate our model on Physionet challenge 2012 dataset (Goldberger et al., 2000). We first preprocess the dataset by removing some non-relevant features (such as patient ID) and eliminating the instances with very high missing rate (larger than 80%). The features are then normalized to the range of [0,1]. The model and baselines are mostly the same as the classification experiments for UCI dataset. Since the classses are heavily imbalanced, we use weighted cross entropy as loss and reward. To evaluate the performance for data with missing entries, We first impute those missing features with our GSM model. For EDDI, the missing entries are similarly imputed by the VAE model. JAFA does not have a generative component, thus we simply replace the missing features with zeros. JAFA reports the AUC score for this dataset, but AUC is known inappropriate for imbalanced classification (Brabec & Machlica, 2018). We instead report the F1 scores for this experiment.

B.4 Time Series

Acquiring features for time series data requires the agent to integrate chronological constraints into the action space. For RL based approach, we manually set the probabilities of invalid action to zeros. For greedy approach, inspired by Thompson sampling (Thompson, 1933; Russo et al., 2017), we employ a prior distribution to encode our chronological constraint. Specifically, we set the prior as a Dirichlet distribution that is biased towards the selection of earlier time steps:

	$\displaystyle\pi(\rho)=\text{Dir}\left[\right.$	$\displaystyle\alpha(T-(\max(o)+1)),$		(B.4)
		$\displaystyle\ldots,\alpha(T-(T-1))\left.\right](\rho),$		(B.4)

where $\alpha$ is a hyperparameter, $T$ is the total time steps, $\max(o)$ represents the latest time step already acquired, and $\rho$ is a distribution for acquisition over the remaining future time steps. However, we still desire that the acquired features are informative for target $y$ . Hence, we update the prior to a posterior using time steps $V$ that are drawn according to how informative they are:

\begin{split}&p(V_{n}=t)\propto\exp(I(x_{t};y\mid x_{o})),\\ &t\in\{\max(o)+1,\ldots,T-1\},\ n\in\{1,\ldots,N\},\end{split}

(B.5)

where $N$ is the number of samples. Due to conjugacy, the posterior is also a Dirichlet distribution

	$\displaystyle p(\rho\mid V)=\text{Dir}\Bigg{[}\Bigg{.}\alpha(T-(\max(o)+1))$		(B.6)
	$\displaystyle+\sum_{n=1}^{N}\mathbb{I}\{V_{n}=\max(o)+1\},\ldots\Bigg{.}\Bigg{]}(\rho).$		(B.6)

Samples from posterior represent the probabilities of choosing each candidate, which now prefer both earlier time steps and informative features. We draw a sample from posterior and select the most likely time step at each acquisition step.

B.5 Unsupervised

To perform active feature acquisition on unsupervised tasks, a.k.a, active instance recognition, we modify the reward for prediction as the negative MSE of the unobserved features, i.e., $-\|\hat{x}_{u}-x_{u}\|_{2}^{2}$ , where $\hat{x}_{u}$ is the imputed values of the unobserved features.

The JAFA is adapted to this task by changing the classifier to an auto-encoder like model, where the observed features $x_{o}$ are encoded to predict the unobserved features $x_{u}$ .

For EDDI, by plugging $y=x$ into (B.3), we have the acquisition metric for this setting as

\mathcal{U}_{i}=\mathbb{E}_{x_{i}\sim p(x_{i}\mid x_{o})}D_{\mathrm{KL}}[p(z\mid x_{i},x_{o})\|p(z|x_{o})],

(B.7)

since the second KL term in (B.3) equals to zero.

To build a greedy policy using our surrogate model, we estimate the utility using (9). Monte Carlo estimation is utilized to estimate the entropy.

Appendix C Hyperparameters

We search the hyperparameters for both our GSMRL and baselines using cross-validation. The range of the hyperparameters is listed in Table C.1.

Table C.1: Hyperparameters for GSMRL and baselines.

GSMRL	set transformer	$\{32,64\}\times\{1,2\}$
	set embedding size	${\{32,64\}}$
	policy network	$\{32,64\}\times\{2,3\}$
	critic network	$\{32,64\}\times\{2,3\}$
	prediction network	$\{64,128\}\times\{2,3\}$
	advantage $\lambda$	0.95
	discount factor $\gamma$	0.99
	PPO clip range	$[0.8,1.2]$
	entropy coefficient	0.0
JAFA	set embedding size	$\{16,32,64,128\}$
	Q network	$\{16,32,64,128\}\times\{2,3,4,5\}$
	prediction network	$\{16,32,64,128\}\times\{2,3,4,5\}$
EDDI	set embedding size	$\{10,20,50,100\}$
	encoder	$\{32,64,128,256\}\times\{3,4,5,6\}$
	latent code	$\{10,20,50,100\}$
	decoder	$\{32,64,128,256\}\times\{3,4,5,6\}$

Appendix D Additional Results

Due to the space limit, we only show one example for the acquisition process in the main text. Figure D.2 and D.4 show some additional examples for AFA and AIR tasks respectively. In Fig. D.2 and D.4, we present several examples of the acquisition process from the greedy policy. Note that the predictions for both the greedy and the non-greedy policy are from the same pretrained arbitrary conditioning model, therefore the only difference is the acquired features. Comparing the greedy and the non-greedy policy suggests that the non-greedy policy eliminates the prediction uncertainty much faster than the greedy one.

In Fig. 5 and 5, we present the acquired features from our GSMRL for several testing examples. To better understand the overall distribution of the acquired features across all the testing instances, we plot the frequencies of each feature being acquired in Fig. D.6 and D.6 for both AFA and AIR on MNIST respectively. A higher value of the frequency means the corresponding feature is acquired for more testing instances. Specifically, the frequency for a feature equals to one means the corresponding feature is a common feature acquired for all testing instances. The frequency loosely represents the importance of each feature, which could help with model interpretation and reasoning about decision making. We will explore this direction in future works.

In Fig. D.7, we analyse the sensitivity of our model to random initialization by running our model three times independently with different random seeds. We report the mean and standard deviation for both the number of acquisitions and the task performance. Baseline performance are presented for reference. We can see that our model is robust to random initialization and performs consistently better than baselines.

Small vs. Large Action Space For the sake of comparison, we employ a downsampled version of MNIST in the experiment section. Here, we show that our GSMRL model can be easily scaled up to a large action space. We conduct experiments using the original MNIST of size $28\times 28$ . We observe that JAFA has difficulty in scaling to this large action space, the agent either acquires no feature or acquires all features.

The greedy approaches are also hard to scale, since at each acquisition step, the greedy policy will need to compute the utilities for every unobserved features, which incurs a total $O(d^{2})$ complexity. In contrast, our GSMRL only has $O(d)$ complexity. Furthermore, with the help of the surrogate model, our GSMRL is pretty stable during training and converges to the optimal policy quickly. Fig. D.8 shows the accuracy with a certain percent of features acquired. The task is definitely harder for large action space as can be seen from the drop in performance when the agent acquires the same percentage of features for both small and large action space, but our GSMRL still achieves high accuracy by only acquiring a small portion of features.

Reward Evaluation Since we are dealing with a dynamic acquisition scenario, different algorithms or the same algorithm with different hyperparameters, such as $\alpha$ , could lead to different acquisitions, which renders the direct comparison difficult. In the experiment section, we compare different algorithms by plotting the performance curve w.r.t. the number of acquisitions. Here, we utilize another evaluation metric that directly compares the returned reward.

Table D.2: Normalized rewards for MNIST AFA experiments.

Algorithm	Reward
GSMRL	0.7998
JAFA	0.7335
GSM+Greedy	0.7038
EDDI	0.6116

We use a normalized reward for evaluation where for a $d$ -dimensional instance, each feature costs $\frac{1}{d}$ and the final classification is rewarded 1 if the prediction is correct, otherwise the reward is zero. The normalized reward is within the rage of $[-1,1]$ where a correct classification with no feature acquired obtains the highest reward 1, a wrong classification with all feature acquired obtains the lowest reward -1, and a correct classification with all features acquired obtains the reward 0. We report the normalized reward for MNIST classification in Table D.2.