
Robust Offline Active Learning on Graphs

Yuanchen Wu (e-mail: yqw5734@psu.edu), Department of Statistics, The Pennsylvania State University; Yubai Yuan (corresponding author, e-mail: yvy5509@psu.edu), Department of Statistics, The Pennsylvania State University
Abstract

We consider the problem of active learning on graphs for node-level tasks, which has crucial applications in many real-world networks where labeling node responses is expensive. In this paper, we propose an offline active learning method that selects nodes to query by explicitly incorporating information from both the network structure and node covariates. Building on graph signal recovery theories and the random spectral sparsification technique, the proposed method adopts a two-stage biased sampling strategy that takes both informativeness and representativeness into consideration for node querying. Informativeness refers to the complexity of graph signals that are learnable from the responses of queried nodes, while representativeness refers to the capacity of queried nodes to control generalization errors given noisy node-level information. We establish a theoretical relationship between generalization error and the number of nodes selected by the proposed method. Our theoretical results demonstrate the trade-off between informativeness and representativeness in active learning. Extensive numerical experiments show that the proposed method is competitive with existing graph-based active learning methods, especially when node covariates and responses contain noise. Additionally, the proposed method is applicable to both regression and classification tasks on graphs.

Key words: Offline active learning, graph semi-supervised learning, graph signal recovery, network sampling

1 Introduction

In many graph-based semi-supervised learning tasks for node-level prediction, labeled nodes are scarce, and the labeling process often incurs high costs in real-world applications. Randomly sampling nodes for labeling can be inefficient, as it overlooks label dependencies across the network. Active learning [29] addresses this issue by selecting informative nodes for labeling by human annotators, thereby improving the performance of downstream prediction algorithms.

Active learning is closely related to the optimal experimental design principle [1] in statistics. Traditional optimal experimental design methods select samples to maximize a specific statistical criterion [25, 11]. However, these methods are often not designed to incorporate network structure and are therefore inefficient for graph-based learning tasks. On the other hand, selecting informative nodes on a network has been studied extensively in the graph signal sampling literature [14, 22, 26, 8]. These strategies are typically based on the principle of network homophily, which assumes that connected nodes tend to have similar labels. However, a node’s label often also depends on its individual covariates. Therefore, signal-sampling strategies that focus solely on network information may miss critical insights provided by covariates.

Recently, inspired by the great success of graph neural networks (GNNs) [19, 35] in graph-based machine learning tasks, many GNN-based active learning strategies have been proposed. Existing methods select nodes to query by maximizing information gain under different criteria, including information entropy [7], the number of influenced nodes [39], prediction uncertainty [23], expected error reduction [27], and expected model change [30]. Most of these information gain measurements are defined in the spatial domain, leveraging the message-passing framework of GNNs to incorporate both network structure and covariate information. However, their effectiveness in maximizing learning outcomes is not guaranteed and can be difficult to evaluate. This challenge arises from the difficulty of quantifying node labeling complexity in the spatial domain due to intractable network topologies. While complexity measures exist for binary classification over networks [10], their extension to more complex graph signals incorporating node covariates remains unclear. This lack of well-defined complexity measures complicates performance analysis and creates a misalignment between graph-based information measurements and the gradient used to search the labeling function space, potentially leading to sub-optimal node selection.

Moreover, from a practical perspective, most of the previously discussed methods operate in an online setting, requiring prompt labeling feedback from an external annotator. However, this online framework is not always feasible when computational resources are limited [28] or when recurrent interaction between the algorithm and the annotator is impractical, such as in remote sensing or online marketing tasks [33, 37]. Additionally, both network data and annotator-provided labels may contain measurement errors. These methods often fail to account for noise in the training data [18], which can significantly degrade the prediction performance of models on unlabeled nodes [12, 6].

To address these challenges, we propose an offline active learning on graphs framework for node-level prediction tasks. Inspired by the theory of graph signal recovery [14, 22, 8] and GNNs, we first introduce a graph function space that integrates both node covariate information and network topology. The complexity of the node labeling function within this space is well-defined in the graph spectral domain. Accordingly, we propose a query information gain measurement aligned with the spectral-based complexity, allowing our strategy to achieve theoretically optimal sample complexity.

Building on this, we develop a greedy node query strategy. The labels of the queried nodes help identify orthogonal components of the target labeling function, each with varying levels of smoothness across the network. To address data noise, the query procedure considers both informativeness—the contribution of queried nodes in recovering non-smooth components of a signal—and representativeness—the robustness of predictions against noise in the training data. Compared to existing methods, the proposed approach provides a provably effective strategy under general network structures and achieves higher query efficiency by incorporating both network and node covariate information.

The proposed method identifies the labeling function via a bottom-up strategy: it first identifies the smoother components of the labeling function and then proceeds to more oscillatory components. Consequently, the proposed method is naturally robust to high-frequency noise in node covariates. We provide a theoretical guarantee for the effectiveness of the proposed method in semi-supervised learning tasks; the generalization error bound holds even when the node labels are noisy. Our theoretical results also highlight an interesting trade-off between informativeness and representativeness in graph-based active learning.

2 Preliminaries

We consider an undirected, weighted, connected graph \mathbf{G}=\{\mathbf{V},\mathbf{A}\}, where \mathbf{V}=\{1,2,\cdots,n\} is the set of n nodes, and \mathbf{A}\in\mathbb{R}^{n\times n} is the symmetric adjacency matrix, with element a_{ij}\geq 0 denoting the edge weight between nodes i and j. The degree matrix is defined as \mathbf{D}=\text{diag}\{d_{1},d_{2},\cdots,d_{n}\}, where d_{i}=\sum_{1\leq j\leq n}a_{ij} denotes the degree of node i. Additionally, we observe the node response vector \mathbf{Y}\in\mathbb{R}^{n\times 1} and the node covariate matrix \mathbf{X}=(X_{1},\cdots,X_{p})\in\mathbb{R}^{n\times p}, where the i^{th} row, \mathbf{X}_{i\cdot}, is the p-dimensional covariate vector for node i. The linear space of all linear combinations of \{X_{1},\cdots,X_{p}\} is denoted as \text{Span}\{X_{1},\cdots,X_{p}\}. The normalized graph Laplacian matrix is defined as \mathcal{L}=\mathbf{I}-\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}, where \mathbf{I} is the n\times n identity matrix. The matrix \mathcal{L} is symmetric and positive semi-definite, with n real eigenvalues satisfying 0=\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n}\leq 2, and a corresponding set of eigenvectors denoted by \mathbf{U}=\{U_{1},U_{2},\cdots,U_{n}\}. We use b=\mathcal{O}(a) to indicate |b|\leq M|a| for some M>0. For a set of nodes \mathcal{S}\subset\mathbf{V}, |\mathcal{S}| denotes its cardinality, and \mathcal{S}^{c}=\mathbf{V}\backslash\mathcal{S} denotes the complement of \mathcal{S}.

2.1 Graph signal representation

Consider a graph signal \mathbf{f}\in\mathbb{R}^{n}, where \mathbf{f}(i) denotes the signal value at node i. For a set of nodes \mathcal{S}, we define the subspace \mathbf{L}_{\mathcal{S}}:=\{\mathbf{f}\in\mathbb{R}^{n}\mid\mathbf{f}(\mathcal{S}^{c})=0\}, where \mathbf{f}(\mathcal{S})\in\mathbb{R}^{|\mathcal{S}|} represents the values of \mathbf{f} on nodes in \mathcal{S}. In this paper, we consider both regression tasks, where \mathbf{f}(i) is a continuous response, and classification tasks, where \mathbf{f}(i) is a multi-class label.

Since 𝐔\mathbf{U} serves as a set of bases for n\mathbb{R}^{n}, we can decompose 𝐟\mathbf{f} in the graph spectral domain as 𝐟=j=1nα𝐟(λj)Uj\mathbf{f}=\sum_{j=1}^{n}\alpha_{\mathbf{f}}(\lambda_{j})U_{j}, where α𝐟(λj)=𝐟,Uj\alpha_{\mathbf{f}}(\lambda_{j})=\langle\mathbf{f},U_{j}\rangle is defined as the graph Fourier transform (GFT) coefficient corresponding to frequency λj\lambda_{j}. From a graph signal processing perspective, a smaller eigenvalue λk\lambda_{k} indicates lower variation in the associated eigenvector UkU_{k}, reflecting smoother transitions between neighboring nodes. Therefore, the smoothness of 𝐟\mathbf{f} over the network can be characterized by the magnitude of α𝐟(λj)\alpha_{\mathbf{f}}(\lambda_{j}) at each frequency λj\lambda_{j}. More formally, we measure the signal complexity of 𝐟\mathbf{f} using the bandwidth frequency ω𝐟=sup{λj|α𝐟(λj)>0}\omega_{\mathbf{f}}=\sup\{\lambda_{j}|\alpha_{\mathbf{f}}(\lambda_{j})>0\}. Accordingly, we define the subspace of graph signals with a bandwidth frequency less than or equal to ω\omega as 𝐋ω:={𝐟nω𝐟ω}\mathbf{L}_{\omega}:=\{\mathbf{f}\in\mathbb{R}^{n}\mid\omega_{\mathbf{f}}\leq\omega\}. It follows directly that ω1<ω2,𝐋ω1𝐋ω2\forall\omega_{1}<\omega_{2},\mathbf{L}_{\omega_{1}}\subset\mathbf{L}_{\omega_{2}}.
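For concreteness, the following minimal numpy sketch (an illustration we add here; the function name and the numerical tolerance are ours) computes the normalized Laplacian, its eigendecomposition, the GFT coefficients of a signal, and the bandwidth frequency ω_f, using |α_f(λ_j)| > tol as the numerical analogue of α_f(λ_j) > 0.

```python
import numpy as np

def gft_and_bandwidth(A, f, tol=1e-10):
    """GFT coefficients and bandwidth frequency of a signal f on a graph with
    dense adjacency matrix A (assumed connected, so all degrees are positive)."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    lam, U = np.linalg.eigh(L)                             # 0 = lam_1 <= ... <= lam_n <= 2
    alpha = U.T @ f                                        # alpha_f(lambda_j) = <f, U_j>
    active = np.abs(alpha) > tol                           # numerically non-zero coefficients
    omega_f = lam[active].max() if active.any() else 0.0   # bandwidth frequency
    return lam, U, alpha, omega_f
```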

2.2 Active semi-supervised learning on graphs

The key idea in graph-based semi-supervised learning is to reconstruct the graph signal \mathbf{f} within a function space \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) that depends on both the network structure and node-wise covariates, where the frequency parameter \omega controls the size of the space to mitigate overfitting. Assuming that Y_{i} is the observed noisy realization of the true signal \mathbf{f}(i) at node i, active learning operates in a scenario where we have access to Y_{i} on only a subset of nodes \mathcal{S}, with |\mathcal{S}|\ll n. The objective is to estimate \mathbf{f} within \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) using \{Y_{i}\}_{i\in\mathcal{S}} by considering the empirical estimator of \mathbf{f} as

𝐟𝒮=argmin𝐠𝐇ω(𝐗,𝐀)i𝒮l(Yi,𝐠(i)),\displaystyle\mathbf{f}_{\mathcal{S}}=\operatorname*{argmin}_{\mathbf{g}\in\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A})}\sum_{i\in\mathcal{S}}l\big{(}Y_{i},\mathbf{g}(i)\big{)}, (1)

where l()l(\cdot) is a task-specific loss function. We denote 𝐟\mathbf{f}^{*} as the minimizer of (1) when responses on all nodes are available, i.e., 𝐟=𝐟𝐕\mathbf{f}^{*}=\mathbf{f}_{\mathbf{V}}. The goal of active semi-supervised learning is to design an appropriate function space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) and select an informative subset of nodes 𝒮\mathcal{S} for querying responses, under the query budget |𝒮||\mathcal{S}|\leq\mathcal{B}, such that the estimation error is bounded as follows:

𝐟𝒮𝐟22ρ𝐟𝐟22\displaystyle\|\mathbf{{f}}_{\mathcal{S}}-\mathbf{f}^{*}\|_{2}^{2}\leq\rho\|\mathbf{f}^{*}-\mathbf{f}\|_{2}^{2}

For a fixed \mathcal{B}, we wish to minimize the parameter ρ>0\rho>0, which converges to 0 as the query budget \mathcal{B} approaches nn.

3 Biased Sequential Sampling

In this section, we introduce a function space for recovering the graph signal. Leveraging this function space, we propose an offline node query strategy that integrates criteria of both node informativeness and representativeness to infer the labels of unannotated nodes in the network.

3.1 Graph signal function space

In semi-supervised learning tasks on networks, both the network topology and node-wise covariates are crucial for inferring the graph signal. To effectively incorporate this information, we propose a function class for reconstructing the graph signal that lies at the intersection of the graph spectral domain and the space of node covariates. Motivated by the graph Fourier transform, we define the following function class:

𝐇ω(𝐗,𝐀)=\displaystyle\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A})= Proj𝐋ωSpan(𝐗):=Span{Proj𝐋ωX1,,Proj𝐋ωXp},\displaystyle\text{Proj}_{\mathbf{L}_{\omega}}\text{Span}(\mathbf{X}):=\text{Span}\{\text{Proj}_{\mathbf{L}_{\omega}}X_{1},\cdots,\text{Proj}_{\mathbf{L}_{\omega}}X_{p}\},
whereProj𝐋ωXi=j:λjωXi,UjUj.\displaystyle\text{where}\;\text{Proj}_{\mathbf{L}_{\omega}}X_{i}=\sum_{j:\lambda_{j}\leq\omega}\langle X_{i},U_{j}\rangle U_{j}.

Here, the choice of \omega balances the information from node covariates and network structure. When \omega=2, \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) spans the full column space of the covariates, i.e., \text{Span}\{X_{1},\cdots,X_{p}\}, allowing full utilization of the original covariate space to estimate the graph signal, but without incorporating any network information. On the other hand, when \omega is close to zero (consider, for example, the extreme case where |\{U_{j}\mid\lambda_{j}\leq\omega\}|=2 and p\gg 2), each \text{Proj}_{\mathbf{L}_{\omega}}X_{i} lies in \text{Span}\{U_{1},U_{2}\}, so \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) collapses to a subspace of \text{Span}\{U_{1},U_{2}\} and critical information provided by the original \mathbf{X} is lost.

By carefully choosing ω\omega, however, this function space can offer two key advantages for estimating the graph signal. From a signal recovery perspective, 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) imposes graph-based regularization over node covariates, enhancing generalizability when the dimension of covariates pp exceeds the query budget or even the network size—conditions commonly encountered in real applications. Additionally, covariate smoothing filters out signals in the covariates that are irrelevant to network-based prediction, thereby increasing robustness against potential noise in the covariates. From an active learning perspective, using 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) enables a bottom-up query strategy that begins with a small ω\omega to capture the smoothest global trends in the graph signal. As the labeling budget increases, ω\omega is adaptively increased to capture more complex graph signals within the larger space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}).

The graph signal \mathbf{f} can be approximated by its projection onto the space \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}). Specifically, let \mathbf{U}_{d}=\{U_{1},U_{2},\cdots,U_{d}\}\in\mathbb{R}^{n\times d} stack the d leading eigenvectors of \mathcal{L}, where d=\max\{1\leq j\leq n:\lambda_{j}\leq\omega\}. The graph signal estimate is then given by \mathbf{U}_{d}\mathbf{U}^{T}_{d}\mathbf{X}\beta, where \beta\in\mathbb{R}^{p} is a trainable weight vector. However, the parameters \beta may become unidentifiable when the covariate dimension p exceeds d. To address this issue, we reparameterize the linear regression as follows:

𝐔d𝐔dT𝐗β=𝑿~β~,\displaystyle\mathbf{U}_{d}\mathbf{U}^{T}_{d}\mathbf{X}{\beta}=\tilde{\bm{X}}\tilde{\beta}, (2)

where \tilde{\beta}=\Sigma V_{2}^{T}\beta and \tilde{\mathbf{X}}=\mathbf{U}_{d}V_{1}. Here, V_{1}\in\mathbb{R}^{d\times r}, V_{2}\in\mathbb{R}^{p\times r}, and \Sigma\in\mathbb{R}^{r\times r} denote the left singular vectors, the right singular vectors, and the diagonal matrix of the r non-zero singular values of \mathbf{U}^{T}_{d}\mathbf{X}, respectively, obtained from its singular value decomposition \mathbf{U}^{T}_{d}\mathbf{X}=V_{1}\Sigma V_{2}^{T}.

In the reparameterized form (2)(\ref{reg_2}), the columns of 𝐗~\tilde{\mathbf{X}} serve as bases for 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}), thus
dim(𝐇ω(𝐗,𝐀))=rank(𝐗~)=rmin{d,p}\text{dim}(\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}))=\text{rank}(\tilde{\mathbf{X}})=r\leq\min\{d,p\}. The transformed predictors 𝐗~\tilde{\mathbf{X}} capture the components of the node covariates constrained within the low-frequency graph spectrum. A graph signal 𝐟𝐇ω(𝐗,𝐀)\mathbf{f}\in\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) can be parameterized as a linear combination of the columns of 𝐗~\tilde{\mathbf{X}}, with the corresponding weights β~\tilde{\beta} identified via

β^=argminβ~i𝒮(𝐟(i)(𝐗~𝐒)iβ~)2\displaystyle\hat{\beta}=\operatorname*{argmin}_{\tilde{\beta}}\sum_{i\in\mathcal{S}}\big{(}\mathbf{f}(i)-(\tilde{\mathbf{X}}_{\mathbf{S}})_{i\cdot}\;\tilde{\beta}\big{)}^{2} (3)

where 𝐗~𝐒R|𝒮|×r\tilde{\mathbf{X}}_{\mathbf{S}}\in\mathrm{R}^{|\mathcal{S}|\times r} is the submatrix of 𝑿~\tilde{\bm{X}} containing rows indexed by the query set 𝒮\mathcal{S}, and {𝐟(i)}i𝒮\{\mathbf{f}(i)\}_{i\in\mathcal{S}} represents the true labels for nodes in 𝒮\mathcal{S}. To achieve the identification of 𝐟\mathbf{f}, it is necessary that |𝒮|r|\mathcal{S}|\geq r; otherwise, there will be more parameters than equations in (3). More importantly, since rank(𝐗~𝐒)rank(𝐗~)=r\text{rank}(\tilde{\mathbf{X}}_{\mathbf{S}})\leq\text{rank}(\tilde{\mathbf{X}})=r, 𝐟\mathbf{f} is only identifiable if 𝐗~𝐒\tilde{\mathbf{X}}_{\mathbf{S}} has full column rank. Notice that rr increases monotonically with ω\omega. If 𝒮\mathcal{S} is not carefully selected, the graph signal can only be identified in 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega^{\prime}}(\mathbf{X},\mathbf{A}) for some ω<ω\omega^{\prime}<\omega, which is a subspace of 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}).
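As a concrete illustration of this construction, the short sketch below (helper names are ours, and the eigendecomposition (λ, U) of \mathcal{L} is assumed to be available, e.g., from the sketch in Section 2.1) builds the basis \tilde{\mathbf{X}}=\mathbf{U}_{d}V_{1} from the SVD of \mathbf{U}_{d}^{T}\mathbf{X} and solves the least-squares problem (3) on a query set \mathcal{S}.

```python
import numpy as np

def build_constrained_basis(U, lam, X, omega, tol=1e-10):
    """Orthonormal basis X_tilde of H_omega(X, A): project the covariates onto the
    eigenvectors with frequency <= omega and orthonormalize via the SVD of U_d^T X."""
    d = int(np.sum(lam <= omega))                 # number of frequencies below the cutoff
    U_d = U[:, :d]
    V1, sing, _ = np.linalg.svd(U_d.T @ X, full_matrices=False)
    r = int(np.sum(sing > tol))                   # r = rank(U_d^T X)
    return U_d @ V1[:, :r]                        # columns span H_omega(X, A)

def identify_signal(X_tilde, f_obs, S):
    """Least-squares identification (3) of the weights from labels on the query set S,
    followed by extrapolation to all nodes."""
    beta_hat, *_ = np.linalg.lstsq(X_tilde[S, :], f_obs[S], rcond=None)
    return X_tilde @ beta_hat
```

As noted above, the fit is only unique when \tilde{\mathbf{X}}_{\mathcal{S}} has full column rank, which motivates the query strategy of Section 3.2.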

3.2 Informative node selection

We first define the identification of 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) by the node query set 𝒮\mathcal{S} as follows:

Definition 1.

A subset of nodes 𝒮𝐕\mathcal{S}\subset\mathbf{V} can identify the graph signal space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) up to frequency ω\omega if, for any two functions 𝐟1,𝐟2𝐇ω(𝐗,𝐀)\mathbf{f}_{1},\mathbf{f}_{2}\in\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) such that 𝐟1(i)=𝐟2(i)\mathbf{f}_{1}(i)=\mathbf{f}_{2}(i) for all i𝒮i\in\mathcal{S}, it follows that 𝐟1(j)=𝐟2(j)\mathbf{f}_{1}(j)=\mathbf{f}_{2}(j) for all j𝐕j\in\mathbf{V}.

Intuitively, the informativeness of a set 𝒮\mathcal{S} can be quantified by the frequency ω\omega corresponding to the space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) that can be identified. To select informative nodes, we need to bridge the query set 𝒮\mathcal{S} in the spatial domain with ω\omega in the spectral domain. To achieve this, we consider the counterpart of the function space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) in the spatial domain. Specifically, we introduce the projection space with respect to a subset of nodes 𝒮\mathcal{S} as follows: 𝐇𝒮(𝐗,𝐀):=Span{X1(𝒮),,Xp(𝒮)}\mathbf{H}_{\mathcal{S}}(\mathbf{X},\mathbf{A}):=\text{Span}\{{X}^{(\mathcal{S})}_{1},\cdots,{X}^{(\mathcal{S})}_{p}\}, where

Xp(𝒮)(i)={Xp(i)if i𝒮𝟎if i𝒮c\displaystyle{X}^{(\mathcal{S})}_{p}(i)=\begin{cases}{X}_{p}(i)&\text{if }i\in\mathcal{S}\\ \mathbf{0}&\text{if }i\in\mathcal{S}^{c}\end{cases}

Here, Xp(i){X}_{p}(i) denotes the pp-th covariate for node ii in 𝐗\mathbf{X}. Theorem 3.1 establishes a connection between the two graph signal spaces 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) and 𝐇𝒮(𝐗,𝐀)\mathbf{H}_{\mathcal{S}}(\mathbf{X},\mathbf{A}), providing a metric for evaluating the informativeness of querying a subset of nodes on the graph.

Theorem 3.1.

Any graph signal 𝐟𝐇ω(𝐗,𝐀)\mathbf{f}\in\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) can be identified using labels on a subset of nodes 𝒮\mathcal{S} if and only if:

ω<ω(𝒮):=inf𝐠𝐇𝒮c(𝐗,𝐀)ω𝐠,\displaystyle\omega<\omega(\mathcal{S}):=\operatorname*{inf}_{\mathbf{g}\in\mathbf{H}_{\mathcal{S}^{c}}(\mathbf{X},\mathbf{A})}\omega_{\mathbf{g}}, (4)

where 𝒮c\mathcal{S}^{c} denotes the complement of 𝒮\mathcal{S} in 𝐕\mathbf{V}.

We denote the quantity ω(𝒮)\omega(\mathcal{S}) in (4) as the bandwidth frequency with respect to the node set 𝒮\mathcal{S}. This quantity can be explicitly calculated and measures the size of the space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) that can be recovered from the subset of nodes 𝒮\mathcal{S}. The goal of the active learning strategy is to select 𝒮\mathcal{S} within a given budget to maximize the bandwidth frequency ω(𝒮)\omega(\mathcal{S}), thus enabling the identification of graph signals with the highest possible complexity.

To calculate the bandwidth frequency ω(𝒮)\omega(\mathcal{S}), consider any graph signal 𝐠\mathbf{g} and its components with non-zero frequency Λ𝐠:={λiα𝐠(λi)>0}\Lambda_{\mathbf{g}}:=\{\lambda_{i}\mid\alpha_{\mathbf{g}}(\lambda_{i})>0\}. We use the fact that

limk(j:λjΛ𝐠cjλjk)1/k=maxλjΛ𝐠(λj),\lim_{k\rightarrow\infty}\left(\sum_{j:\lambda_{j}\in\Lambda_{\mathbf{g}}}c_{j}\lambda_{j}^{k}\right)^{1/k}=\max_{\lambda_{j}\in\Lambda_{\mathbf{g}}}(\lambda_{j}),

where j:λjΛ𝐠cj=1\sum_{j:\lambda_{j}\in\Lambda_{\mathbf{g}}}c_{j}=1 and 0cj10\leq c_{j}\leq 1. Combined with the Rayleigh quotient representation of eigenvalues, the bandwidth frequency ω𝐠\omega_{\mathbf{g}} can be calculated as

ω𝐠=limkω𝐠(k),whereω𝐠(k)=(𝐠Tk𝐠𝐠T𝐠)1/k.\omega_{\mathbf{g}}=\lim_{k\rightarrow\infty}\omega_{\mathbf{g}}(k),\quad\text{where}\quad\omega_{\mathbf{g}}(k)=\left(\frac{\mathbf{g}^{T}\mathcal{L}^{k}\mathbf{g}}{\mathbf{g}^{T}\mathbf{g}}\right)^{1/k}.

As a result, we can approximate the bandwidth ω𝐠\omega_{\mathbf{g}} using ω𝐠(k)\omega_{\mathbf{g}}(k) for a large kk. Maximizing ω(𝒮)\omega(\mathcal{S}) over 𝒮\mathcal{S} then transforms into the following optimization problem:

𝒮=argmax𝒮:|𝒮|ω^(𝒮),whereω^(𝒮):=inf𝐠𝐇𝒮c(𝐗,𝐀)ω𝐠k(k),\displaystyle\mathcal{S}=\operatorname*{argmax}_{\mathcal{S}:|\mathcal{S}|\leq\mathcal{B}}\hat{\omega}(\mathcal{S}),\quad\text{where}\quad\hat{\omega}(\mathcal{S}):=\operatorname*{inf}_{\mathbf{g}\in\mathbf{H}_{\mathcal{S}^{c}}(\mathbf{X},\mathbf{A})}\omega^{k}_{\mathbf{g}}(k), (5)

where \mathcal{B} represents the budget for querying labels. Due to the combinatorial complexity of solving the optimization problem (5) directly, which requires selecting all of \mathcal{S} simultaneously, we propose a greedy selection strategy based on a continuous relaxation of (5).
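Before turning to the greedy procedure, the Rayleigh-quotient approximation ω_g(k) introduced above can be sketched in a few lines (the function name and the default k are ours):

```python
import numpy as np

def bandwidth_estimate(L, g, k=20):
    """Approximate bandwidth omega_g(k) = (g^T L^k g / g^T g)^(1/k); the estimate
    tightens toward the true bandwidth omega_g as k grows."""
    v = g.astype(float).copy()
    for _ in range(k):                 # L^k g via repeated matrix-vector products
        v = L @ v
    return (float(g @ v) / float(g @ g)) ** (1.0 / k)
```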

The selection procedure starts with 𝒮=\mathcal{S}=\emptyset and sequentially adds one node to 𝒮\mathcal{S} that maximizes the increase in ω(𝒮)\omega(\mathcal{S}) until the budget is reached. We introduce an nn-dimensional vector 𝐭=(t1,t2,,tn)T\mathbf{t}=(t_{1},t_{2},\cdots,t_{n})^{T} with 0ti10\leq t_{i}\leq 1, and define the corresponding diagonal matrix D(𝐭)D({\mathbf{t}}) with diagonal entries given by 𝐭\mathbf{t}. This allows us to encode the set of query nodes using 𝐭=𝟏𝒮\mathbf{t}=\mathbf{1}_{\mathcal{S}}, where 𝟏𝒮(i)=1\mathbf{1}_{\mathcal{S}}(i)=1 if i𝒮i\in\mathcal{S} and 𝟏𝒮(i)=0\mathbf{1}_{\mathcal{S}}(i)=0 if i𝒮ci\in\mathcal{S}^{c}. We then consider the space spanned by the columns of D(𝐭)𝐗D(\mathbf{t})\mathbf{X} as Span{D(𝐭)𝐗}\text{Span}\{D(\mathbf{t})\mathbf{X}\}, and the following relation holds:

𝐇𝒮c(𝐗,𝐀)=Span{D(𝟏𝒮c)𝐗}.\displaystyle\mathbf{H}_{\mathcal{S}^{c}}(\mathbf{X},\mathbf{A})=\text{Span}\{D(\mathbf{1}_{\mathcal{S}^{c}})\mathbf{X}\}.

Intuitively, Span{D(𝐭)𝐗}\text{Span}\{D(\mathbf{t})\mathbf{X}\} acts as a differentiable relaxation of the subspace 𝐇𝒮(𝐗,𝐀)\mathbf{H}_{\mathcal{S}}(\mathbf{X},\mathbf{A}), enabling perturbation analysis of the bandwidth frequency when a new node is added to 𝒮\mathcal{S}. The projection operator associated with Span{D(𝐭)𝐗}\text{Span}\{D(\mathbf{t})\mathbf{X}\} can be explicitly expressed as

𝐏(𝐭)=D(𝐭)𝐗(𝐗TD(𝐭2)𝐗)1𝐗TD(𝐭).\mathbf{P}(\mathbf{t})=D(\mathbf{t})\mathbf{X}\left(\mathbf{X}^{T}D(\mathbf{t}^{2})\mathbf{X}\right)^{-1}\mathbf{X}^{T}D(\mathbf{t}).

To quantify the increase in ω^(𝒮)\hat{\omega}(\mathcal{S}) when adding a new node to 𝒮\mathcal{S}, we consider the following regularized optimization problem:

λα(𝐭)=minϕϕTkϕϕTϕ+αϕT(𝐈𝐏(𝐭))ϕϕTϕ.\displaystyle\lambda_{\alpha}(\mathbf{t})=\min_{\phi}\frac{\phi^{T}\mathcal{L}^{k}\phi}{\phi^{T}\phi}+\alpha\frac{\phi^{T}\left(\mathbf{I}-\mathbf{P}(\mathbf{t})\right)\phi}{\phi^{T}\phi}. (6)

The penalty term on the right-hand side of (6) encourages the graph signal ϕ\phi to remain in 𝐇𝒮c(𝐗,𝐀)\mathbf{H}_{\mathcal{S}^{c}}(\mathbf{X},\mathbf{A}). As the parameter α\alpha approaches infinity and 𝐭=𝟏𝒮c\mathbf{t}=\mathbf{1}_{\mathcal{S}^{c}}, the minimization λα(𝟏𝒮c)\lambda_{\alpha}(\mathbf{1}_{\mathcal{S}^{c}}) in (6) converges to ω^(𝒮)\hat{\omega}(\mathcal{S}) in (5). The information gain from labeling a node i𝒮ci\in\mathcal{S}^{c} can then be quantified by the gradient of the bandwidth frequency as tit_{i} decreases from 1 to 0:

Δi:=λα(𝐭)ti|𝐭=𝟏𝒮c=2α×ϕT𝐏(𝐭)tiϕ|𝐭=𝟏𝒮c,\displaystyle\Delta_{i}:=-\frac{\partial\lambda_{\alpha}(\mathbf{t})}{\partial t_{i}}\Big{|}_{\mathbf{t}=\mathbf{1}_{\mathcal{S}^{c}}}=2\alpha\times\phi^{T}\frac{\partial\mathbf{P}(\mathbf{t})}{\partial t_{i}}\phi\Big{|}_{\mathbf{t}=\mathbf{1}_{\mathcal{S}^{c}}}, (7)

where ϕ\phi is the minimizer of (6) at 𝐭=𝟏𝒮c\mathbf{t}=\mathbf{1}_{\mathcal{S}^{c}}, which corresponds to the eigenvector associated with the smallest non-zero eigenvalue of the matrix 𝐏(𝟏𝒮c)k𝐏(𝟏𝒮c)\mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}})\mathcal{L}^{k}\mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}}). We then select the node i=argmaxj𝒮cΔji=\operatorname*{argmax}_{j\in\mathcal{S}^{c}}\Delta_{j} and update the query set as 𝒮=𝒮{i}\mathcal{S}=\mathcal{S}\cup\{i\}.

When calculating node-wise informativeness in (7), we can enhance computational efficiency by avoiding the inversion of D(𝐭)D(\mathbf{t}). When 𝐭\mathbf{t} is in the neighborhood of 𝟏𝒮c\mathbf{1}_{\mathcal{S}^{c}}, we can approximate:

𝐏(𝐭)D(𝐭)𝐗𝒮c(𝐗𝒮cT𝐗𝒮c)1𝐗𝒮cTD(𝐭)=D(𝐭)𝐙𝒮c𝐙𝒮cTD(𝐭),\displaystyle\mathbf{P}(\mathbf{t})\approx D(\mathbf{t})\mathbf{X}_{\mathcal{S}^{c}}(\mathbf{X}_{\mathcal{S}^{c}}^{T}\mathbf{X}_{\mathcal{S}^{c}})^{-1}\mathbf{X}_{\mathcal{S}^{c}}^{T}D(\mathbf{t})=D(\mathbf{t})\mathbf{Z}_{\mathcal{S}^{c}}\mathbf{Z}_{\mathcal{S}^{c}}^{T}D(\mathbf{t}),

where 𝐗𝒮c=((X1)𝒮c,,(Xp)𝒮c)\mathbf{X}_{\mathcal{S}^{c}}=((X_{1})_{\mathcal{S}^{c}},\cdots,(X_{p})_{\mathcal{S}^{c}}) and 𝐙𝒮c=𝐗𝒮c(𝐗𝒮cT𝐗𝒮c)1/2\mathbf{Z}_{\mathcal{S}^{c}}=\mathbf{X}_{\mathcal{S}^{c}}(\mathbf{X}_{\mathcal{S}^{c}}^{T}\mathbf{X}_{\mathcal{S}^{c}})^{-1/2}. Then, the node-wise informativeness can be explicitly expressed as:

Δitiϕi2(𝐙𝒮c)i(𝐙𝒮cT)i+ji,1jntitjϕiϕj(𝐙𝒮c)i(𝐙𝒮cT)j.\displaystyle\Delta_{i}\propto t_{i}\phi_{i}^{2}(\mathbf{Z}_{\mathcal{S}^{c}})_{i\cdot}(\mathbf{Z}_{\mathcal{S}^{c}}^{T})_{i\cdot}+\sum_{j\neq i,1\leq j\leq n}t_{i}t_{j}\phi_{i}\phi_{j}(\mathbf{Z}_{\mathcal{S}^{c}})_{i\cdot}(\mathbf{Z}_{\mathcal{S}^{c}}^{T})_{j\cdot}. (8)

We find that this approximation yields very similar empirical performance compared to the exact formulation in (7). Therefore, we adopt the formulation in (8) for the subsequent numerical experiments.
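A minimal numpy sketch of this informativeness computation is given below (the function name and tolerances are ours). It follows the approximation (8): build the projection onto \mathbf{H}_{\mathcal{S}^{c}}(\mathbf{X},\mathbf{A}), take the eigenvector \phi of the smallest non-zero eigenvalue of \mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}})\mathcal{L}^{k}\mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}}), and score every unqueried node.

```python
import numpy as np

def informativeness_scores(L, X, S, k=10, tol=1e-8):
    """Node-wise information gain Delta_i (up to a positive constant), following (8).
    S is the current query set; scores are returned for candidate nodes in S^c."""
    n = X.shape[0]
    in_Sc = np.ones(n, dtype=bool)
    in_Sc[list(S)] = False
    Xc = X.copy()
    Xc[~in_Sc, :] = 0.0                              # covariates restricted to S^c
    Uc, sing, _ = np.linalg.svd(Xc, full_matrices=False)
    Z = Uc[:, sing > tol]                            # orthonormal basis of col(X_{S^c})
    P = Z @ Z.T                                      # projection onto H_{S^c}(X, A)
    M = P @ np.linalg.matrix_power(L, k) @ P
    lam, V = np.linalg.eigh(M)
    phi = V[:, int(np.argmax(lam > tol))]            # smallest non-zero eigenvalue of M
    # (8) at t = 1_{S^c}; because phi lies in the range of P, this equals phi_i^2
    scores = phi * (P @ phi)
    scores[~in_Sc] = -np.inf                         # exclude already-queried nodes
    return scores
```

The greedy step then queries i = argmax_{j∈\mathcal{S}^{c}} Δ_j, or, in the biased variant of Section 3.3, the argmax over the candidate set B_m.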

3.3 Representative node selection

In real-world applications, we often have access only to a perturbed version of the true graph signals, denoted as Y=𝐟+ξY=\mathbf{f}+\mathbf{\xi}, where ξ\mathbf{\xi} represents node labeling noise that is independent of the network data. When replacing the true label 𝐟(i)\mathbf{f}(i) with Y(i)Y(i) in (3), this noise term introduces both finite-sample bias and variance in the estimation of the graph signal 𝐟\mathbf{f}. As a result, we aim to query nodes that are sufficiently representative of the entire covariate space to bound the generalization error. To achieve this, we introduce principled randomness into the deterministic selection procedure described in Section 3.2 to ensure that 𝒮\mathcal{S} includes nodes that are both informative and representative. The modified graph signal estimation procedure is given by:

β^=argminβ~i𝒮si(Y(i)(𝐗~𝐒)iβ~)2,\displaystyle\hat{\beta}=\operatorname*{argmin}_{\tilde{\beta}}\sum_{i\in\mathcal{S}}s_{i}\big{(}Y(i)-(\tilde{\mathbf{X}}_{\mathbf{S}})_{i\cdot}\;\tilde{\beta}\big{)}^{2}, (9)

where sis_{i} is the weight associated with the probability of selecting node ii into 𝒮\mathcal{S}.

Specifically, the generalization error of the estimator in (9) is determined by the smallest eigenvalue of 𝐗~𝒮T𝐗~𝒮\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}, denoted as λmin(𝐗~𝒮T𝐗~𝒮)\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}). Given that λmin(𝐗~T𝐗~)=1\lambda_{\min}(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}})=1, our goal is to increase the representativeness of 𝒮\mathcal{S} such that λmin(𝐗~𝒮T𝐗~𝒮)\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}) is lower-bounded by:

λmin(𝐗~𝒮T𝐗~𝒮)(1o|𝒮|(1))λmin(𝐗~T𝐗~).\displaystyle\lambda_{\min}\left(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}\right)\geq(1-o_{|\mathcal{S}|}(1))\lambda_{\min}\left(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}\right). (10)

However, the informative selection method in Section 3.2 does not guarantee (10). To address this, we propose a sequential biased sampling approach that balances informative node selection with generalization error control.

The key idea to achieve a lower bound for λmin(𝐗~𝒮T𝐗~𝒮)\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}) is to use spectral sparsification techniques for positive semi-definite matrices [4]. Let vi1×rv_{i}\in\mathbb{R}^{1\times r} denote the ii-th row of the constrained basis 𝐗~\tilde{\mathbf{X}}. By definition of 𝐗~\tilde{\mathbf{X}}, it follows that 𝐈r×r=i=1nviTvi\mathbf{I}_{r\times r}=\sum_{i=1}^{n}v_{i}^{T}v_{i}. Inspired by the randomized sampling approach in [21], we propose a biased sampling strategy to construct 𝒮\mathcal{S} with |𝒮|n|\mathcal{S}|\ll n and weights {si>0,i𝒮}\{s_{i}>0,i\in\mathcal{S}\} such that i𝒮siviTvi𝐈\sum_{i\in\mathcal{S}}s_{i}v_{i}^{T}v_{i}\approx\mathbf{I}. In other words, the weighted covariance matrix of the query set 𝒮\mathcal{S} satisfies λmin(𝐗~𝒮TWS𝐗~𝒮)1\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}W_{S}\tilde{\mathbf{X}}_{\mathcal{S}})\approx 1, where WSW_{S} is a diagonal matrix with sis_{i} on its diagonal.

We outline the detailed sampling procedure as follows. After the (t1)th(t-1)^{\text{th}} selection, let the set of query nodes be 𝒮t1\mathcal{S}_{t-1} with corresponding node-wise weights 𝒲t1={sj>0j𝒮t1}\mathcal{W}_{t-1}=\{s_{j}>0\mid j\in\mathcal{S}_{t-1}\}. The covariance matrix of 𝒮t1\mathcal{S}_{t-1} is given by Ct1r×rC_{t-1}\in\mathbb{R}^{r\times r}, defined as Ct1=𝐗~𝒮t1T𝐗~𝒮t1=j𝒮t1sjvjTvjC_{t-1}=\tilde{\mathbf{X}}_{\mathcal{S}_{t-1}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}_{t-1}}=\sum_{j\in\mathcal{S}_{t-1}}s_{j}v_{j}^{T}v_{j}. To analyze the behavior of eigenvalues as the query set is updated, we follow [21] and introduce the potential function:

Φt1=Tr[(ut1ICt1)1]+Tr[(Ct1lt1I)1],\displaystyle\Phi_{t-1}=\text{Tr}[(u_{t-1}I-C_{t-1})^{-1}]+\text{Tr}[(C_{t-1}-l_{t-1}I)^{-1}], (11)

where ut1u_{t-1} and lt1l_{t-1} are constants such that lt1<λmin(Ct1)λmax(Ct1)<ut1l_{t-1}<\lambda_{\min}(C_{t-1})\leq\lambda_{\max}(C_{t-1})<u_{t-1}, and Tr()\text{Tr}(\cdot) denotes the trace of a matrix. The potential function Φt1\Phi_{t-1} quantifies the coherence among all eigenvalues of Ct1C_{t-1}. To construct the candidate set BmB_{m}, we sample node ii and update CtC_{t}, utu_{t}, and ltl_{t} such that all eigenvalues of CtC_{t} remain within the interval (lt,ut)(l_{t},u_{t}). To achieve this, we first calculate the node-wise probabilities {pi}i=1n\{p_{i}\}_{i=1}^{n} as:

pi=[vi(ut1ICt1)1viT+vi(Ct1lt1I)1viT]/Φt1,\displaystyle p_{i}=\big{[}v_{i}(u_{t-1}I-C_{t-1})^{-1}v_{i}^{T}+v_{i}(C_{t-1}-l_{t-1}I)^{-1}v_{i}^{T}\big{]}/\Phi_{t-1}, (12)

where i=1npi=1\sum_{i=1}^{n}p_{i}=1. We then sample mm nodes into BmB_{m} according to {pi}i=1n\{p_{i}\}_{i=1}^{n}. For each node iBmi\in B_{m}, the corresponding weight is given by si=ϵpiΦt1, 0<ϵ<1s_{i}=\frac{\epsilon}{p_{i}\Phi_{t-1}},\;0<\epsilon<1. After obtaining the candidate set BmB_{m}, we apply the informative node selection criterion Δi\Delta_{i} introduced in Section 3.2, i.e., selecting the node i=argmaxiBmΔii=\operatorname*{argmax}_{i\in B_{m}}\Delta_{i}, and update the query set and weights as follows:

ifi𝒮t1c:𝒮t=𝒮t1{i},𝒲t=𝒲t1{si},\displaystyle\text{if}\;i\in\mathcal{S}_{t-1}^{c}:\;\mathcal{S}_{t}=\mathcal{S}_{t-1}\cup\{i\},\;\mathcal{W}_{t}=\mathcal{W}_{t-1}\cup\{s_{i}\},
ifi𝒮t1:si=si+ϵpiΦt1.\displaystyle\text{if}\;i\in\mathcal{S}_{t-1}:s_{i}=s_{i}+\frac{\epsilon}{p_{i}\Phi_{t-1}}.

We then update the lower and upper eigenvalue bounds as follows:

ut=ut1+ϵΦt1(1ϵ),lt=lt1+ϵΦt1(1+ϵ).\displaystyle u_{t}=u_{t-1}+\frac{\epsilon}{\Phi_{t-1}(1-\epsilon)},\;l_{t}=l_{t-1}+\frac{\epsilon}{\Phi_{t-1}(1+\epsilon)}. (13)

The update rule ensures that utltu_{t}-l_{t} increases at a slower rate than utu_{t}, leading to the convergence of the gap between the largest and smallest eigenvalues of 𝐗~𝒮TWS𝐗~𝒮\tilde{\mathbf{X}}_{\mathcal{S}}^{T}W_{S}\tilde{\mathbf{X}}_{\mathcal{S}}, thereby controlling the condition number. Accordingly, the covariance matrix is updated with the selected node ii as:

Ct=Ct1+ϵpiΦt1viTvi.\displaystyle C_{t}=C_{t-1}+\frac{\epsilon}{p_{i}\Phi_{t-1}}v_{i}^{T}v_{i}. (14)

With the covariance matrix update rule in (14), the average increment is 𝐄(Ct)Ct1=i=1npisiviTvi=ϵΦt1𝐈\mathbf{E}(C_{t})-C_{t-1}=\sum_{i=1}^{n}p_{i}s_{i}v_{i}^{T}v_{i}=\frac{\epsilon}{\Phi_{t-1}}\mathbf{I}. Intuitively, the selected node allows all eigenvalues of Ct1C_{t-1} to increase at the same rate on average. This ensures that λmin(𝐗~𝒮T𝐗~𝒮)\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}) continues to approach λmin(𝐗~T𝐗~)=1\lambda_{\min}(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}})=1 during the selection process, thus driving the smallest eigenvalue away from zero. Additionally, the selected node remains locally informative within the candidate set BmB_{m}. Compared with the entire set of nodes, selecting from a subset serves as a regularization on informativeness maximization, achieving a balance between informativeness and representativeness for node queries.
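One candidate-drawing step of this representative sampling stage can be sketched as follows (the helper name is ours; X_tilde is the basis from Section 3.1, and u, l are assumed to bracket the eigenvalues of the current covariance matrix C as required by (11)):

```python
import numpy as np

def representative_step(X_tilde, C, u, l, m, eps, rng):
    """Compute the potential (11) and sampling probabilities (12), draw the candidate
    set B_m with replacement, and return the associated weights s_i = eps/(p_i * Phi)."""
    n, r = X_tilde.shape
    Ru = np.linalg.inv(u * np.eye(r) - C)            # (u I - C)^{-1}
    Rl = np.linalg.inv(C - l * np.eye(r))            # (C - l I)^{-1}
    Phi = np.trace(Ru) + np.trace(Rl)                # potential function (11)
    p = (np.einsum('ij,jk,ik->i', X_tilde, Ru, X_tilde) +
         np.einsum('ij,jk,ik->i', X_tilde, Rl, X_tilde)) / Phi
    p = np.clip(p, 0.0, None)
    p = p / p.sum()                                  # guard against round-off
    B_m = rng.choice(n, size=m, replace=True, p=p)   # candidate set
    weights = eps / (p[B_m] * Phi)
    return B_m, weights, Phi
```

The informativeness criterion of Section 3.2 then picks i = argmax_{i∈B_m} Δ_i, after which C, u, and l are updated as in (13) and (14).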

3.4 Node query algorithm and graph signal recovery

We summarize the biased sampling selection strategy in Algorithm 1. At a high level, each step in the biased sampling query strategy consists of two stages. First, we use randomized spectral sparsification to sample mnm\ll n nodes and collect them into a candidate set BmB_{m}. Intuitively, the covariance matrix on the updated 𝒮\mathcal{S} maintains lower-bounded eigenvalues if a node from BmB_{m} is added to 𝒮\mathcal{S}. In the second stage, we select one node from BmB_{m} based on the informativeness criterion in Section 3.2 to achieve a significant frequency increase in (7).

For initialization, the dimension of the network spectrum dd, the size of the candidate set mm, and the constant 0<ϵ<10<\epsilon<1 for spectral sparsification need to be specified. Based on the discussion at the end of Section 3.1, the dimension of the function space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) is at most \mathcal{B}, where \mathcal{B} is the budget for label queries. Therefore, we can set d=min{p,}d=\min\{p,\mathcal{B}\}. The parameters mm and ϵ\epsilon jointly control the condition number λmax(𝐗~𝒮TWS𝐗~𝒮)λmin(𝐗~𝒮TWS𝐗~𝒮)\frac{\lambda_{\max}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}W_{S}\tilde{\mathbf{X}}_{\mathcal{S}})}{\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}W_{S}\tilde{\mathbf{X}}_{\mathcal{S}})}.

In practice, we can tune m to ensure that the covariance matrix is well-conditioned. Specifically, we can run the biased sampling procedure multiple times with different values of m and select the largest m such that the condition number of the covariance matrix on the query set \mathcal{S} is less than 10, a commonly accepted rule of thumb for a well-conditioned covariance matrix [20]. Additionally, \epsilon is typically fixed at a small value, following the protocol outlined in [21].
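A simple way to implement this tuning check (the helper below is ours; the threshold of 10 is the rule of thumb cited above) is to verify the condition number of the weighted covariance matrix on the query set:

```python
import numpy as np

def well_conditioned(X_tilde_S, s_weights, threshold=10.0):
    """Check whether the weighted covariance matrix of the query set is
    well-conditioned, i.e., its condition number is below the threshold."""
    C = X_tilde_S.T @ np.diag(np.asarray(s_weights)) @ X_tilde_S
    return np.linalg.cond(C) < threshold
```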

Algorithm 1 Biased Sampling Query Algorithm
t=0t=0, C0=0C_{0}=0, the set of query nodes 𝒮0=\mathcal{S}_{0}=\emptyset, the set of node weights 𝒲0=\mathcal{W}_{0}=\emptyset, spectral dimension dd, size of candidate set mm, constant 0<ϵ<1/m0<\epsilon<1/m, query budget \mathcal{B}.
Initialization: Compute SVD decomposition 𝐔dT𝐗=V1ΣV2T\mathbf{U}^{T}_{d}\mathbf{X}=V_{1}\Sigma V_{2}^{T}, and set 𝐗~=𝐔dV1\tilde{\mathbf{X}}=\mathbf{U}_{d}V_{1}, r=rank(𝐗~)r=\text{rank}(\tilde{\mathbf{X}}), u0=2r/ϵu_{0}=2r/\epsilon, l0=2r/ϵl_{0}=-2r/\epsilon, κ=2r(1m2ϵ2)/(mϵ2)\kappa=2r(1-m^{2}\epsilon^{2})/(m\epsilon^{2}).
while >0\mathcal{B}>0 do
     Step 1: Calculate Φt\Phi_{t} as in (11) and the node-wise probabilities {pi}i=1n\{p_{i}\}_{i=1}^{n} using (12).
     Step 2: Sample mm nodes with replacement according to probabilities {pi}i=1n\{p_{i}\}_{i=1}^{n} to form the candidate set BmB_{m}.
     Step 3: Select node ii as i=argmaxiBmΔii=\operatorname*{argmax}_{i\in B_{m}}\Delta_{i} and calculate its weight wi=ϵpiΦtw_{i}=\frac{\epsilon}{p_{i}\Phi_{t}}.      If i𝒮ti\notin\mathcal{S}_{t}, then update 𝒮t+1=𝒮t{i}\mathcal{S}_{t+1}=\mathcal{S}_{t}\cup\{i\} and 𝒲t+1=𝒲t{si}\mathcal{W}_{t+1}=\mathcal{W}_{t}\cup\{s_{i}\} with si=wiκs_{i}=\frac{w_{i}}{\kappa}.      Else if i𝒮ti\in\mathcal{S}_{t}, then update si=si+wiκs_{i}=s_{i}+\frac{w_{i}}{\kappa}.
     Step 4: Update CtC_{t}, utu_{t}, ltl_{t}, \mathcal{B} and tt as:
Ct+1=Ct+wiviTvi,ut+1\displaystyle C_{t+1}=C_{t}+w_{i}v_{i}^{T}v_{i},\quad u_{t+1} =ut+ϵΦt(1mϵ),lt+1=lt+ϵΦt(1+mϵ),\displaystyle=u_{t}+\frac{\epsilon}{\Phi_{t}(1-m\epsilon)},\quad l_{t+1}=l_{t}+\frac{\epsilon}{\Phi_{t}(1+m\epsilon)},
\displaystyle\mathcal{B} =1,t=t+1.\displaystyle=\mathcal{B}-1,\quad t=t+1.
end while
Query: Label all nodes in 𝒮\mathcal{S} through an external annotator.
Output: Set of queried nodes 𝒮\mathcal{S}, annotated responses {Yii𝒮}\{Y_{i}\mid i\in\mathcal{S}\}, smoothed covariates 𝐗~𝒮\tilde{\mathbf{X}}_{\mathcal{S}}, and weights of queried nodes 𝒲\mathcal{W}.

Based on the output from Algorithm 1, we solve the weighted least squares problem in (9):

β^=argminβ~i𝒮si(Y(i)(𝐗~𝒮)iβ~)2,\displaystyle\hat{\beta}=\operatorname*{argmin}_{\tilde{\beta}}\sum_{i\in\mathcal{S}}s_{i}\big{(}Y(i)-(\tilde{\mathbf{X}}_{\mathcal{S}})_{i\cdot}\tilde{\beta}\big{)}^{2}, (15)

and recover the graph signal on the entire network as 𝐟^=𝐗~β^\hat{\mathbf{f}}=\tilde{\mathbf{X}}\hat{\beta}.
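The recovery step amounts to a weighted least-squares fit; a minimal sketch (the helper name is ours, with s_weights aligned with the queried nodes in S) is:

```python
import numpy as np

def recover_signal(X_tilde, Y, S, s_weights):
    """Weighted least squares (15) on the queried nodes, followed by
    extrapolation f_hat = X_tilde @ beta_hat to the whole network."""
    w_sqrt = np.sqrt(np.asarray(s_weights))
    A = w_sqrt[:, None] * X_tilde[S, :]      # reweighted design matrix
    b = w_sqrt * np.asarray(Y)[S]            # reweighted responses
    beta_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return X_tilde @ beta_hat
```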

Although our theoretical analysis is developed for node regression tasks, the proposed query strategy and graph signal recovery procedure are also applicable to classification tasks. Consider a K-class classification problem, where the response on each node i is given by \mathbf{f}(i)\in\{1,2,\ldots,K\}. We introduce a dummy membership vector (Y_{1}(i),\ldots,Y_{K}(i)), where Y_{c}(i)=1 if \mathbf{f}(i)=c and Y_{c}(i)=0 otherwise. For each class c\in\{1,2,\ldots,K\}, we first estimate \hat{\beta}_{c} based on (15) with the training data \{\tilde{\mathbf{X}}_{i\cdot},Y_{c}(i),s_{i}\}_{i\in\mathcal{S}}, and then compute the score for class c as \hat{\mathbf{f}}_{c}=\tilde{\mathbf{X}}\hat{\beta}_{c}. The label of an unqueried node j is assigned as \hat{\mathbf{f}}(j)=\operatorname*{argmax}_{1\leq c\leq K}\{\hat{\mathbf{f}}_{1}(j),\hat{\mathbf{f}}_{2}(j),\cdots,\hat{\mathbf{f}}_{K}(j)\}. Notice that the above score-based classifier is equivalent to the softmax classifier:

𝐟^=argmax1cK{exp(𝐟^1)cexp(𝐟^c),,exp(𝐟^K)cexp(𝐟^c)}\displaystyle\mathbf{\hat{f}}=\operatorname*{argmax}_{1\leq c\leq K}\{\frac{\exp(\hat{\mathbf{f}}_{1})}{\sum_{c}\exp(\hat{\mathbf{f}}_{c})},\cdots,\frac{\exp(\hat{\mathbf{f}}_{K})}{\sum_{c}\exp(\hat{\mathbf{f}}_{c})}\}

since the softmax function is monotonically increasing with respect to each score function {𝐟^c}c=1K\{\mathbf{\hat{f}}_{c}\}_{c=1}^{K}.
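The classification rule described above can be sketched by reusing the recover_signal helper from (15) (the function below is our illustration; labels_S holds the annotated class labels, aligned with the queried nodes in S, and classes are encoded as 0, ..., K-1):

```python
import numpy as np

def classify_nodes(X_tilde, labels_S, S, s_weights, K):
    """One weighted regression (15) per class on dummy responses,
    then an argmax over the class scores at every node."""
    n = X_tilde.shape[0]
    scores = np.zeros((n, K))
    for c in range(K):
        Yc = np.zeros(n)
        Yc[S] = (np.asarray(labels_S) == c).astype(float)   # dummy membership for class c
        scores[:, c] = recover_signal(X_tilde, Yc, S, s_weights)
    return scores.argmax(axis=1)             # equivalent to the softmax rule above
```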

3.5 Computational Complexity

In the representative sampling stage, the computational complexity of calculating the sampling probability is 𝒪(n)\mathcal{O}(n). We then sample mm nodes to formulate a candidate set BmB_{m}, where the complexity of sampling mm variables from a discrete probability distribution is 𝒪(m)\mathcal{O}(m) [32]. Consequently, the complexity of the representative learning stage is 𝒪(n+m)\mathcal{O}(n+m).

In the informative selection stage, we calculate the information gain Δi\Delta_{i} for each node. This involves obtaining the eigenvector corresponding to the smallest non-zero eigenvalue of the projected graph Laplacian matrix, with a complexity of 𝒪(n3)\mathcal{O}(n^{3}) due to the singular value decomposition (SVD) operation. Subsequently, we compute Δi\Delta_{i} for each node in the candidate set BmB_{m} based on their loadings on the eigenvector, which incurs an additional computational cost of 𝒪(mn)\mathcal{O}(mn). Therefore, the total complexity of our biased sampling method is 𝒪(n+m+nm+n3)\mathcal{O}(n+m+nm+n^{3}). Given a node label query budget \mathcal{B}, the overall computational cost becomes 𝒪((n+m+nm+n3))\mathcal{O}(\mathcal{B}\left(n+m+nm+n^{3})\right).

When the dimension of node covariates pnp\ll n, we can replace the SVD operation with the Lanczos algorithm to accelerate the informative selection stage. The Lanczos algorithm is designed to efficiently obtain the kkth largest or smallest eigenvalues and their corresponding eigenvectors using a generalized power iteration method, which has a time complexity of 𝒪(kn2)\mathcal{O}(kn^{2}) [17]. As a result, the complexity of the proposed biased sampling method reduces to 𝒪(pn2)\mathcal{O}(pn^{2}). This is comparable to GNN-based active learning methods, as GNNs and their variations generally have a complexity of 𝒪(pn2)\mathcal{O}(pn^{2}) per training update [5, 36].
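For instance, SciPy's sparse symmetric eigensolver (based on an implicitly restarted Lanczos method via ARPACK) can return only the d smallest eigenpairs of a sparse Laplacian; the snippet below is a sketch of this replacement, with the function name ours:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

def low_frequency_basis(L_sparse, d):
    """d smallest eigenpairs of the sparse normalized Laplacian via a Lanczos-type
    solver, avoiding the O(n^3) cost of a full eigendecomposition."""
    lam, U_d = eigsh(csr_matrix(L_sparse), k=d, which='SM')
    order = np.argsort(lam)
    return lam[order], U_d[:, order]
```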

4 Theoretical Analysis

In this section, we present a theoretical analysis of the proposed node query strategy. The results are divided into two parts: the first focuses on the local information gain of the selection process, while the second examines the global performance of graph signal recovery. Given a set of query nodes 𝒮\mathcal{S}, the information gain from querying the label of a new node ii is measured as the increase in bandwidth frequency, defined as Δi:=ω(𝒮{i})ω(𝒮)\Delta_{i}:=\omega(\mathcal{S}\cup\{i\})-\omega(\mathcal{S}). We provide a step-by-step analysis of the proposed method by comparing the increase in bandwidth frequency with that of random selection.

Theorem 4.1.

Define dmin=min𝑖{di}d_{\text{min}}=\underset{i}{\min}\{d_{i}\}, where did_{i} denotes the degree of node ii. Let 𝒮\mathcal{S} represent the set of queried nodes prior to the sths^{th} selection. Denote the adjacency matrix of the subgraph excluding 𝒮\mathcal{S} as 𝐀(n|𝒮|)×(n|𝒮|)\mathbf{A}_{(n-|\mathcal{S}|)\times(n-|\mathcal{S}|)}. Let ΔsR\Delta^{\text{R}}_{s} and ΔsB\Delta^{\text{B}}_{s} denote the increase in bandwidth frequency resulting from the sths^{th} label query on a node selected by random sampling and the proposed sampling method, respectively. Let jj^{*} denote the node with the largest magnitude in the eigenvector corresponding to the smallest non-zero eigenvalue of 𝒮c\mathcal{L}_{\mathcal{S}^{c}}. Then we have:

𝐄(ΔsR)=Ω(1n)(1),and𝐄(ΔsB)𝐄(ΔsR)>Ω(1η0η13dmin2)Ω(1n)(2),\displaystyle\mathbf{E}(\Delta^{\text{R}}_{s})=\Omega(\frac{1}{n})\;(1),\;\;\text{and}\;\;\mathbf{E}(\Delta^{\text{B}}_{s})-\mathbf{E}(\Delta^{\text{R}}_{s})>\Omega(\frac{1}{\eta_{0}\eta_{1}^{3}d^{2}_{\text{min}}})-\Omega(\frac{1}{n})\;(2),

where f=\Omega(g) if c_{1}\leq|\frac{f}{g}|\leq c_{2} for constants c_{1},c_{2} when n is sufficiently large. Inequality (2) holds given m satisfying

(nmdminnm)m(nmdminndmin)dmindmin=𝒪(1).(\frac{n-m-d_{min}}{n-m})^{m}(\frac{n-m-d_{min}}{n-d_{min}})^{d_{min}}\sqrt{d_{min}}=\mathcal{O}(1).

The expectation 𝐄()\mathbf{E}(\cdot) is taken over the randomness of node selection. Both η0,η1\eta_{0},\eta_{1} are network-related quantities, where η0#|{i:|didjdmin|1}|\eta_{0}\triangleq\#|\{i:|\frac{d_{i}-d_{j^{*}}}{d_{\text{min}}}|\leq 1\}| and η1maxi(didmin)\eta_{1}\triangleq\max_{i}(\frac{d_{i}}{d_{\text{min}}}).

Theorem 4.1 provides key insights into the information gain achieved through different node label querying strategies. While random selection yields a constant average information gain, the proposed biased sampling method guarantees a higher information gain under mild assumptions.

Theorem 4.1 is derived using first-order matrix perturbation theory [2] on the Laplacian matrix \mathcal{L}. In Theorem 4.1, we assume that the column space of the node covariate matrix 𝐗\mathbf{X} is identical to the space spanned by the first dd eigenvectors of 𝒮c\mathcal{L}_{\mathcal{S}^{c}}. This assumption simplifies the analysis and the results by focusing on the perturbation of 𝒮c\mathcal{L}_{\mathcal{S}^{c}}, where 𝒮c\mathcal{L}_{\mathcal{S}^{c}} is the reduced Laplacian matrix with zero entries in the rows and columns indexed by 𝒮\mathcal{S}.

The analysis can be naturally extended to the general setting by replacing 𝒮c\mathcal{L}_{\mathcal{S}^{c}} with 𝐏(𝟏𝒮c)𝐏(𝟏𝒮c)\mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}})\mathcal{L}\mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}}), where 𝐏(𝐭)\mathbf{P}(\mathbf{t}) is the projection operator defined in Section 3.2. Moreover, under the assumption on the node covariates, the information gain Δi\Delta_{i} exhibits an explicit dependence on the network statistics, providing a clearer interpretation of how the network structure influences the benefits of selecting informative nodes.

Theorem 4.1 indicates that the improvement of biased sampling is more significant when dmind_{\text{min}} is larger and η0,η1\eta_{0},\eta_{1} are smaller. Specifically, dmind_{\text{min}} reflects the connectedness of the network, where a better-connected network facilitates the propagation of label information and enhances the informativeness of a node’s label for other nodes. A smaller η1\eta_{1} prevents the existence of dominating nodes, ensuring that the connectedness does not significantly decrease when some nodes are removed from the network.

Notice that the node jj^{*} is the most informative node for the next selection, and η0\eta_{0} measures the number of nodes similar to jj^{*} in the network. Recall that the proposed biased sampling method considers both the informativeness and representativeness of the selected nodes. Therefore, the information gain is less penalized by the representativeness requirement if η0\eta_{0} is small. Additionally, the size of the candidate set mm should be sufficiently large to ensure that informative nodes are included in BmB_{m}.

In Theorem 4.2, we provide the generalization error bound for the proposed sampling method under the weighted OLS estimator. To formally state Theorem 4.2, we first introduce the following two assumptions:

Assumption 1 For the underlying graph signal 𝐟\mathbf{f}, there exists a bandwidth frequency ω0\omega_{0} such that 𝐟𝐇ω0(𝐗,𝐀)\mathbf{f}\in\mathbf{H}_{\omega_{0}}(\mathbf{X},\mathbf{A}).
Assumption 2 The observed node-wise response YiY_{i} can be decomposed as Yi=𝐟(i)+ξiY_{i}=\mathbf{f}(i)+\xi_{i}, where {ξi}i=1n\{\xi_{i}\}_{i=1}^{n} are independent random variables with 𝐄(ξi)=0\mathbf{E}(\xi_{i})=0 and 𝐕𝐚𝐫(ξi)σ2\mathbf{Var}(\xi_{i})\leq\sigma^{2}.

Theorem 4.2.

Under Assumptions 1 and 2, for the graph signal estimation 𝐟^\hat{\mathbf{f}} obtained by training (15) on \mathcal{B} labeled nodes selected by Algorithm 1, with probability greater than 12mt1-\frac{2m}{t}, where t>2mt>2m, we have

𝐄Y𝐟^𝐟22𝒪(rdt+2(rdt)3/2+(rdt)2)×(nσ2+i>d,isupp(𝐟)αi2)+i>d,isupp(𝐟)αi2,\displaystyle\mathbf{E}_{Y}\|\hat{\mathbf{f}}-\mathbf{f}\|_{2}^{2}\leq\mathcal{O}\Big{(}\frac{r_{d}t}{\mathcal{B}}+2(\frac{r_{d}t}{\mathcal{B}})^{3/2}+(\frac{r_{d}t}{\mathcal{B}})^{2}\Big{)}\times(n\sigma^{2}+\sum_{i>d,i\in\text{supp}(\mathbf{f})}\alpha_{i}^{2})+\sum_{i>d,i\in\text{supp}(\mathbf{f})}\alpha_{i}^{2}, (16)

where αi:=𝐟,Ui\alpha_{i}:=\langle\mathbf{f},U_{i}\rangle, supp(𝐟):={i:1in,|αi|>0}\text{supp}(\mathbf{f}):=\{i:1\leq i\leq n,|\alpha_{i}|>0\} and rd=rank(𝐔dT𝐗)r_{d}=\text{rank}(\mathbf{U}_{d}^{T}\mathbf{X}). 𝐄Y()\mathbf{E}_{Y}(\cdot) denotes the expected value with respect to the randomness in observed responses.

Theorem 4.2 reveals the trade-off between informativeness and representativeness in graph-based active learning, which is controlled by the spectral dimension dd. Since rdr_{d} is a monotonic function of dd, a larger dd reduces representativeness among queried nodes, thereby increasing variance in controlling the condition number (i.e., the first three terms). On the other hand, a larger dd reduces approximation bias to the true graph signal (i.e., the fifth and last terms) by including more informative nodes for capturing less smoothed signals.

The RHS of (16) captures both the variance and bias involved in estimating \mathbf{f} using noisy labels on sampled nodes. Specifically, the first three terms represent the estimation variance arising from controlling the condition number of the design matrix on the queried nodes. The fourth and fifth terms reflect the noise and unidentifiable components in the responses of the queried nodes, while the last term denotes the bias resulting from approximating the true signal within the space \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}).

The bias term in Theorem 4.2 can be further controlled if the true signal 𝐟\mathbf{f} exhibits decaying or zero weights on high-frequency network components. In addition to rdr_{d}, the size of the candidate set mm also influences the probability of controlling the generalization error. A small mm places greater emphasis on the representativeness criterion in sampling, increasing the likelihood of controlling the condition number but potentially overlooking informative nodes, thereby increasing approximation bias.

For a fixed prediction MSE, the query complexity of our method is 𝒪(d)\mathcal{O}(d), whereas random sampling incurs a complexity of 𝒪(d~logd~)\mathcal{O}(\tilde{d}\log\tilde{d}), where d~>d\tilde{d}>d. Our method outperforms random sampling in two key aspects: (1) The information-based selection identifies 𝐟\mathbf{f} with fewer queries than random sampling, as shown in Theorem 4.1, and (2) our method achieves an additional improvement by actively controlling the condition number of the covariate matrix, resulting in a logarithmic factor reduction compared to random sampling.

5 Numerical Studies

In this section, we conduct extensive numerical studies to evaluate the proposed active learning strategy for node-level prediction tasks on both synthetic and real-world networks. For the synthetic networks, we focus on regression tasks with continuous responses, while for the real-world networks, we consider classification tasks with discrete node labels.

5.1 Synthetic networks

We consider three different network topologies generated by widely studied statistical network models: the Watts–Strogatz model [34] for small-world properties, the Stochastic block model [16] for community structure, and the Barabási–Albert model [3] for scale-free properties.

Node responses are generated as Y=𝐟+ξY=\mathbf{f}+\xi, where 𝐟\mathbf{f} is the true graph signal and ξN(0,σ2In)\xi\sim N(0,\sigma^{2}I_{n}) is Gaussian noise. The true signal is constructed as 𝐟=𝐔dβ\mathbf{f}=\mathbf{U}_{d}\mathbb{\beta}, where β\mathbb{\beta} is the linear coefficient and 𝐔d\mathbf{U}_{d} denotes the leading dd eigenvectors of the normalized graph Laplacian of the synthetic network. Since our theoretical analysis assumes that the observed node covariates 𝐗\mathbf{X} contain noise, we generate 𝐗\mathbf{X} as a perturbed version of 𝐔d\mathbf{U}_{d} by adding non-leading eigenvectors of the normalized graph Laplacian. The detailed simulation settings can be found in Appendix B.1.
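For reference, a generating sketch in the spirit of this setup is given below (parameter values here are illustrative and not the exact settings of Appendix B.1):

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n, d, sigma2 = 100, 5, 0.5                                 # illustrative sizes and noise level

G = nx.watts_strogatz_graph(n, k=6, p=0.1, seed=0)         # small-world topology [34]
L = nx.normalized_laplacian_matrix(G).toarray()            # normalized graph Laplacian
lam, U = np.linalg.eigh(L)

beta = rng.normal(size=d)
f_true = U[:, :d] @ beta                                   # smooth true signal f = U_d beta
Y = f_true + rng.normal(scale=np.sqrt(sigma2), size=n)     # noisy node responses

# perturbed covariates: leading eigenvectors plus a small non-leading component
X = U[:, :d] + 0.1 * U[:, d:2 * d]
```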

We compare our algorithm with several offline active learning methods: 1. D-optimal [25] selects a subset of nodes \mathcal{S} to maximize the determinant of the observed covariate information matrix \mathbf{X}_{\mathcal{S}}^{T}\mathbf{X}_{\mathcal{S}}. 2. RIM [38] selects nodes to maximize the number of influenced nodes. 3. GPT [23] and SPA [13] split the graph into disjoint partitions and select informative nodes from each partition.

After the node query step, we fit the weighted linear regression from (15) on the labeled nodes, using the smoothed covariates 𝐗~\mathbf{\tilde{X}}, to estimate the linear coefficient β^\hat{\beta} and predict the response 𝐘^\hat{\mathbf{Y}} for the unlabeled nodes. In Figure 1, we plot the prediction MSE of the proposed method against baselines on unlabeled nodes for various levels of labeling noise σ2(0.5,0.6,0.7,0.8,0.9,1)\sigma^{2}\in(0.5,0.6,0.7,0.8,0.9,1). The results show that the proposed method significantly outperforms all baselines across all simulation settings and exhibits strong robustness to noise. The inferior performance of the baselines can be attributed to several factors. D-optimal and RIM fail to account for noise in the node covariates. Meanwhile, partition-based methods like GPT and SPA are highly sensitive to hyperparameters, such as the optimal number of partitions, which limits their generalization to networks lacking a clear community structure.

Figure 1: Prediction performance on unlabeled nodes at different levels of labeling noise (\sigma^{2}) for the three simulated networks: (a) Small World, (b) Community, and (c) Scale-free. All three simulated networks have n=100 nodes, with the number of labeled nodes fixed at 25.
Table 1: Test accuracy (Micro-F1%) on five real-world networks with varying levels of homophily. The edge homophily ratio hh of a network is defined as the fraction of edges that connect nodes with the same class label. A higher hh indicates a network with stronger homophily.
Cora (h=0.81h=0.81) Pubmed (h=0.80h=0.80) Citeseer (h=0.74h=0.74) Chameleon (h=0.23h=0.23) Texas (h=0.11h=0.11)
#labeled nodes 35 70 140 15 30 60 30 60 120 50 75 100 15 30 45
Random 68.2 ±\pm 1.3 74.5 ±\pm 1.0 78.9 ±\pm 0.9 71.2 ±\pm 1.8 74.9 ±\pm 1.6 78.4 ±\pm 0.5 57.7 ±\pm 0.8 65.3 ±\pm 1.4 70.7 ±\pm 0.7 22.4 ±\pm 2.6 22.1 ±\pm 2.5 21.8 ±\pm 2.1 67.0 ±\pm 3.3 69.9 ±\pm 3.3 73.8 ±\pm 3.2
AGE 72.1 ±\pm 1.1 78.0 ±\pm 0.9 82.5 ±\pm 0.5 74.9 ±\pm 1.1 77.5 ±\pm 1.2 79.4 ±\pm 0.7 65.3 ±\pm 1.1 67.7 ±\pm 0.5 71.4 ±\pm 0.5 30.0 ±\pm 4.5 28.2 ±\pm 4.9 28.6 ±\pm 5.0 67.9 ±\pm 2.6 68.8 ±\pm 3.3 72.1 ±\pm 3.6
GPT 77.4 ±\pm 1.6 81.6 ±\pm 1.2 86.5 ±\pm 1.2 77.0 ±\pm 3.1 79.9 ±\pm 2.8 81.5 ±\pm 1.6 67.9 ±\pm 1.8 71.0 ±\pm 2.4 74.0 ±\pm 2.0 14.1 ±\pm 2.5 15.8 ±\pm 2.2 16.4 ±\pm 2.4 72.6 ±\pm 2.0 72.5 ±\pm 3.6 74.6 ±\pm 1.8
RIM 77.5 ±\pm 0.8 81.6 ±\pm 1.1 84.1 ±\pm 0.8 75.0 ±\pm 1.5 77.2 ±\pm 0.6 80.2 ±\pm 0.4 67.5 ±\pm 0.7 70.0 ±\pm 0.6 73.2 ±\pm 0.7 35.5 ±\pm 3.7 42.8 ±\pm 3.0 34.4 ±\pm 3.5 68.5 ±\pm 3.7 78.4 ±\pm 3.0 74.6 ±\pm 3.7
IGP 77.4 ±\pm 1.7 81.7 ±\pm 1.6 86.3 ±\pm 0.7 78.5 ±\pm 1.2 82.3 ±\pm 1.4 83.5 ±\pm 0.5 68.2 ±\pm 1.1 72.1 ±\pm 0.9 75.8 ±\pm 0.4 32.5 ±\pm 3.6 33.7 ±\pm 3.1 33.4 ±\pm 3.5 70.8 ±\pm 3.7 69.9 ±\pm 3.3 76.1 ±\pm 3.6
SPA 76.5 ±\pm 1.9 80.3 ±\pm 1.6 85.2 ±\pm 0.6 75.4 ±\pm 1.6 78.3 ±\pm 2.0 73.5 ±\pm 1.2 66.4 ±\pm 2.2 69.3 ±\pm 1.7 73.5 ±\pm 2.0 30.2 ±\pm 3.2 28.5 ±\pm 2.9 31.0 ±\pm 4.4 72.0 ±\pm 3.2 72.5 ±\pm 3.1 74.6 ±\pm 2.1
Proposed 78.4 ±\pm 1.7 81.8 ±\pm 1.8 86.5 ±\pm 1.1 78.9 ±\pm 1.1 79.1 ±\pm 0.6 82.3 ±\pm 0.6 69.1 ±\pm 1.0 72.2 ±\pm 1.3 75.5 ±\pm 0.8 35.1 ±\pm 2.8 35.7 ±\pm 3.0 37.2 ±\pm 3.0 75.0 ±\pm 1.9 79.5 ±\pm 0.8 80.4 ±\pm 2.7
Table 2: Test accuracy (Macro-F1% and Micro-F1%) on two real-world large-scale networks: Ogbn-Arxiv (n=169,343n=169,343) and Co-Physics (n=34,493n=34,493).
Ogbn-Arxiv (Macro-F1) Ogbn-Arxiv (Micro-F1) Co-Physics (Macro-F1)
#labeled nodes 160 320 640 1280 160 320 640 1280 10 20 40
Random 21.9 ±\pm 1.4 27.6 ±\pm 1.5 33.0 ±\pm 1.4 37.2 ±\pm 1.1 52.3 ±\pm 0.8 56.4 ±\pm 0.8 60.0 ±\pm 0.7 63.5 ±\pm 0.4 58.3 ±\pm 13.8 66.9 ±\pm 10.1 78.3 ±\pm 7.1
AGE 20.4 ±\pm 0.9 25.9 ±\pm 1.1 31.7 ±\pm 0.8 36.4 ±\pm 0.8 48.3 ±\pm 2.3 54.9 ±\pm 1.6 60.0 ±\pm 0.7 63.5 ±\pm 0.3 63.7 ±\pm 7.8 71.0 ±\pm 8.8 82.4 ±\pm 3.9
GPT 24.2 ±\pm 0.7 29.5 ±\pm 0.8 36.4 ±\pm 0.5 41.0 ±\pm 0.5 52.3 ±\pm 0.9 56.8 ±\pm 0.8 60.7 ±\pm 0.6 63.6 ±\pm 0.5 75.8 ±\pm 2.7 85.8 ±\pm 0.3 88.9 ±\pm 0.3
Proposed 25.8 ±\pm 1.3 34.3 ±\pm 1.4 38.3 ±\pm 1.2 41.3 ±\pm 1.3 53.1 ±\pm 1.3 58.0 ±\pm 1.0 62.3 ±\pm 1.6 64.8 ±\pm 1.0 83.5 ±\pm 0.8 86.8 ±\pm 1.3 89.2 ±\pm 1.2
Table 3: Average query time (in seconds) per node.
Dataset Size Time
Texas 183 0.19 ±\pm 0.03
Chameleon 2,277 0.34 ±\pm 0.18
Cora 2,708 0.30 ±\pm 0.19
Citeseer 3,327 0.26 ±\pm 0.07
Pubmed 19,717 0.48 ±\pm 0.25
Co-Physics 34,493 1.08 ±\pm 0.43
Ogbn-Arxiv 169,343 2.11 ±\pm 0.33

5.2 Real-world networks

We evaluate the proposed method on node classification tasks on real-world datasets, which include five networks with varying homophily levels (from high to low: Cora, Pubmed, Citeseer, Chameleon, and Texas) and two large-scale networks (Ogbn-Arxiv and Co-Physics). In addition to the offline methods described in Section 5.1, we also compare our approach with two GNN-based online active learning methods, AGE [7] and IGP [39]. In each GNN training iteration, AGE selects nodes to maximize a linear combination of heuristic metrics, while IGP selects nodes that maximize information gain propagation.
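For reference, the edge homophily ratio h reported in Table 1 can be computed directly from the edge list and node labels; a minimal sketch (the function name and toy inputs are illustrative):

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges that connect nodes with the same class label."""
    edges = np.asarray(edges)                  # shape (num_edges, 2)
    labels = np.asarray(labels)
    return float(np.mean(labels[edges[:, 0]] == labels[edges[:, 1]]))

# Toy example: a path graph 0-1-2-3 with labels [0, 0, 1, 1] gives h = 2/3.
print(edge_homophily([(0, 1), (1, 2), (2, 3)], [0, 0, 1, 1]))
```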

Unlike regression, node classification with GNNs is a widely studied area of research. Previous works [23, 30, 38, 39] have demonstrated that the prediction performance of various active learning strategies on unlabeled nodes remains relatively consistent across different types of GNNs. Therefore, we employ Simplified Graph Convolution (SGC) [35] as the GNN classifier due to its straightforward theoretical intuition. Since SGC is essentially multi-class logistic regression on low-pass-filtered covariates, it can be approximately viewed as a special case of the regression model defined in (15). Thus, we conjecture that our theoretical analysis can also be extended to classification tasks and leave its formal verification for future work.
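As a rough illustration of this point, SGC reduces to multi-class logistic regression on K-step low-pass-filtered covariates. The sketch below conveys only this structure (using scikit-learn's LogisticRegression as the classifier head); it is not the exact training configuration used in our experiments, which is described in Appendix B.1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sgc_filter(A, X, K=2):
    """K-step propagation with the symmetrically normalized adjacency
    (self-loops added), i.e. a low-pass graph filter applied to X."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    S = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    for _ in range(K):
        X = S @ X
    return X

def fit_sgc(A, X, y, labeled_idx, K=2):
    """Multi-class logistic regression on the filtered covariates."""
    X_prop = sgc_filter(A, X, K)
    clf = LogisticRegression(max_iter=1000).fit(X_prop[labeled_idx], y[labeled_idx])
    return clf, X_prop
```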

The results in Table 1 demonstrate that the proposed algorithm is highly competitive with the baselines across real-world networks with varying degrees of homophily. Our method achieves the best performance on Cora (highest homophily) and Texas (lowest homophily, i.e., highest heterophily) and is particularly effective when the labeling budget is most limited. To handle heterophily in networks such as Chameleon and Texas, we expand the graph signal subspace U_d in Algorithm 1 to U_d = {U_1, ⋯, U_d, U_{n−d+1}, ⋯, U_n}, combining the eigenvectors corresponding to the d smallest and d largest eigenvalues, as sketched below. Admittedly, relying on a priori knowledge of the label construction may be unrealistic, so developing adaptive methods for designing the signal subspace that handle both homophily and heterophily remains a promising direction for future research.
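A minimal sketch of this subspace construction (dense eigendecomposition for clarity; the homophilic case simply takes the d smallest-eigenvalue eigenvectors, and large graphs would instead call for a sparse eigensolver):

```python
import numpy as np

def signal_subspace(A, d, heterophilic=False):
    """Return the eigenvector subspace U_d of the normalized Laplacian.

    Homophilic graphs  : eigenvectors of the d smallest eigenvalues.
    Heterophilic graphs: eigenvectors of the d smallest and d largest
    eigenvalues, as described above.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)            # columns sorted by ascending eigenvalue
    if heterophilic:
        idx = np.r_[np.arange(d), np.arange(L.shape[0] - d, L.shape[0])]
    else:
        idx = np.arange(d)
    return eigvecs[:, idx]
```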

Table 2 summarizes the performance on two large-scale networks. The greatest improvement is observed in the Macro-F1 score on Ogbn-Arxiv, with an increase of up to 4.8% at 320 labeled nodes. Moreover, Table 3 demonstrates that our algorithm scales efficiently to large networks, with the time cost of querying a single node being approximately 2 seconds when n=169,343.

6 Conclusion

We propose a graph-based offline active learning framework for node-level tasks. Our node query strategy effectively leverages both the network structure and node covariate information, demonstrating robustness to diverse network topologies and node-level noise. We provide theoretical guarantees for controlling generalization error, uncovering a novel trade-off between informativeness and representativeness in active learning on graphs. Empirical results demonstrate that our method performs strongly on both synthetic and real-world networks, achieving competitiveness with state-of-the-art methods on benchmark datasets. Future work could explore extensions to an online active learning setting that iteratively incorporates node response information to further enhance query efficiency. Additionally, scalability on large graphs could be improved by utilizing the Lanczos method [31] or Chebyshev polynomial approximation [19] during node selection.

References

  • [1] Z. Allen-Zhu, Y. Li, A. Singh, and Y. Wang. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming, 2020.
  • [2] B. Bamieh. A tutorial on matrix perturbation theory (using compact matrix notation). arXiv preprint arXiv:2002.05001, 2020.
  • [3] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 1999.
  • [4] J. Batson, D. A. Spielman, N. Srivastava, and S.-H. Teng. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 2013.
  • [5] D. Blakely, J. Lanchantin, and Y. Qi. Time and space complexity of graph convolutional networks. Accessed on: Dec, 2021.
  • [6] M.-R. Bouguelia, S. Nowaczyk, K. Santosh, and A. Verikas. Agreeing to disagree: active learning with noisy labels without crowdsourcing. International Journal of Machine Learning and Cybernetics, 2018.
  • [7] H. Cai, V. W. Zheng, and K. Chen-Chuan Chang. Active learning for graph embedding. arXiv preprint arXiv:1705.05085, 2017.
  • [8] S. Chen, A. Sandryhaila, J. M. F. Moura, and J. Kovacevic. Signal recovery on graphs: Variation minimization. IEEE Transactions on Signal Processing, 2016.
  • [9] X. Chen and E. Price. Active regression via linear-sample sparsification. In Conference on Learning Theory, pages 663–695. PMLR, 2019.
  • [10] G. Dasarathy, R. Nowak, and X. Zhu. S2: An efficient graph based active learning algorithm with application to nonparametric classification. In Proceedings of The 28th Conference on Learning Theory (COLT), 2015.
  • [11] M. Derezinski, M. K. Warmuth, and D. J. Hsu. Leveraged volume sampling for linear regression. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [12] J. Du and C. X. Ling. Active learning with human-like noisy oracle. In Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM), 2010.
  • [13] R. Fajri, Y. Pei, L. Yin, and M. Pechenizkiy. A structural-clustering based active learning for graph neural networks. In Symposium on Intelligent Data Analysis (IDA), 2024.
  • [14] A. Gadde, A. Anis, and A. Ortega. Active semi-supervised learning using sampling theory for graph signals. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2014.
  • [15] A. Gadde, A. Anis, and A. Ortega. Active semi-supervised learning using sampling theory for graph signals. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 492–501, 2014.
  • [16] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 2002.
  • [17] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 2013.
  • [18] N. T. Hoang, T. Maehara, and T. Murata. Revisiting graph neural networks: Graph filtering perspective. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), 2021.
  • [19] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
  • [20] M. Kutner, C. Nachtsheim, J. Neter, and W. Li. Applied Linear Statistical Models. McGraw-Hill/Irwin, 2004.
  • [21] Y. T. Lee and H. Sun. Constructing linear-sized spectral sparsification in almost-linear time. SIAM Journal on Computing, 47(6):2315–2336, 2018.
  • [22] P. Di Lorenzo, S. Barbarossa, and P. Banelli. Cooperative and Graph Signal Processing. Academic Press, 2018.
  • [23] J. Ma, Z. Ma, J. Chai, and Q. Mei. Partition-based active learning for graph neural networks. Transactions on Machine Learning Research, 2023.
  • [24] E. L. Paluck, H. Shepherd, and P. M. Aronow. Changing climates of conflict: a social network experiment in 56 schools. Proceedings of the National Academy of Sciences, page 566–571, 2016.
  • [25] F. Pukelsheim. Optimal Design of Experiments (Classics in Applied Mathematics, 50). Society for Industrial and Applied Mathematics, 2018.
  • [26] G. Puy, N. Tremblay, R. Gribonval, and P. Vandergheynst. Random sampling of bandlimited signals on graphs. Applied and Computational Harmonic Analysis, 2016.
  • [27] F. F. Regol, S. Pal, Y. Zhang, and M. Coates. Active learning on attributed graphs via graph cognizant logistic regression and preemptive query generation. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
  • [28] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, X. Chen, and X. Wang. A survey of deep active learning. ACM Computing Surveys, 2020.
  • [29] B. Settles. Active learning literature survey. Technical report, University of Wisconsin Madison, 2010.
  • [30] Z. Song, Y. Zhang, and I. King. No change, no gain: Empowering graph neural networks with expected model change maximization for active learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  • [31] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst. Accelerated filtering on graphs using lanczos method. arXiv preprint arXiv:1509.04537, 2015.
  • [32] M. Vose. Linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering, 1991.
  • [33] W. Wang and W. N. Street. Modeling and maximizing influence diffusion in social networks for viral marketing. Applied Network Science, 2018.
  • [34] D. J. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 1998.
  • [35] F. Wu, A. H. S. Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
  • [36] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
  • [37] Y. Xu, X. Chen, A. Liu, and C. Hu. A latency and coverage optimized data collection scheme for smart cities. In New Advances in Identification, Information and Knowledge in the Internet of Things, 2017.
  • [38] W. Zhang, Y. Wang, Z. You, M. Cao, P. Huang, J. Shan, Z. Yang, and B. Cui. Rim: Reliable influence-based active learning on graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • [39] W. Zhang, Y. Wang, Z. You, M. Cao, P. Huang, J. Shan, Z. Yang, and B. Cui. Information gain propagation: a new way to graph active learning with soft labels. In Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022.

Appendix A Proofs

A.1 Proof of Theorem 3.1

Proof: Consider a frequency ω below the threshold frequency ω(𝒮), i.e.,

ω<ω(𝒮):=inf𝐠Proj𝐋𝒮cspan(𝐗)ω𝐠,\displaystyle\omega<\omega(\mathcal{S}):=\operatorname*{inf}_{\mathbf{g}\in\text{Proj}_{\mathbf{L}_{\mathcal{S}^{c}}}\text{span}(\mathbf{X})}\omega_{\mathbf{g}}, (17)

and notice that (17) holds if and only if

Proj𝐋ωspan(𝐗)Proj𝐋𝒮cspan(𝐗)={0}.\displaystyle\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X})\cap\text{Proj}_{\mathbf{L}_{\mathcal{S}^{c}}}\text{span}(\mathbf{X})=\{0\}. (18)

(⇒) Suppose Proj_{𝐋_ω}span(𝐗) can be identified by 𝒮. For any ϕ ∈ Proj_{𝐋_ω}span(𝐗) ∩ Proj_{𝐋_{𝒮^c}}span(𝐗) and any 𝐟 ∈ Proj_{𝐋_ω}span(𝐗), we have 𝐠 = ϕ + 𝐟 ∈ Proj_{𝐋_ω}span(𝐗) by closure under addition, and 𝐟(𝒮) = 𝐠(𝒮) since ϕ(𝒮) = 0. Because Proj_{𝐋_ω}span(𝐗) can be identified by 𝒮, one must have 𝐟 = 𝐠, which implies ϕ = 0. Hence, (18) holds.

(⇐) Suppose (18) holds, and assume for contradiction that Proj_{𝐋_ω}span(𝐗) cannot be identified by 𝒮. Then, by definition, there exist 𝐟_1, 𝐟_2 ∈ Proj_{𝐋_ω}span(𝐗) such that 𝐟_1(𝒮) = 𝐟_2(𝒮) and 𝐟_1 ≠ 𝐟_2. Since (𝐟_1 − 𝐟_2)(𝒮) = 0, we have 𝐟_1 − 𝐟_2 ∈ Proj_{𝐋_{𝒮^c}}span(𝐗). By closure under addition, 𝐟_1 − 𝐟_2 ∈ Proj_{𝐋_ω}span(𝐗). Hence 𝐟_1 − 𝐟_2 ∈ Proj_{𝐋_ω}span(𝐗) ∩ Proj_{𝐋_{𝒮^c}}span(𝐗), which contradicts (18). Therefore, Proj_{𝐋_ω}span(𝐗) can be identified by 𝒮.

A.2 Proof of Theorem 4.1

Proof: Let 𝐔 = {U_1, U_2, ⋯, U_n} be the eigenvectors of ℒ = 𝐈 − 𝐃^{-1/2}𝐀𝐃^{-1/2} and Λ = diag(λ_1, λ_2, ⋯, λ_n) the corresponding eigenvalues. Without loss of generality, we analyze the case k = 1 with no repeated eigenvalues. The case of repeated eigenvalues can be handled similarly using matrix perturbation analysis for the degenerate case [2].

After selecting node ii, we have the reduced Laplacian matrix \mathcal{L}^{\ast}

=+(i),where(i)=(01i0i1iiin0ni0)\mathcal{L}^{\ast}=\mathcal{L}+\mathcal{L}^{(-i)},\;\;\text{where}\;\;\mathcal{L}^{(-i)}=-\begin{pmatrix}0&\cdots&\mathcal{L}_{1i}&\cdots&0\\ \vdots&\ddots&\vdots&\vdots&\vdots\\ \mathcal{L}_{i1}&\ldots&\mathcal{L}_{ii}&\ldots&\mathcal{L}_{in}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&\cdots&\mathcal{L}_{ni}&\cdots&0\\ \end{pmatrix}

Define Λ\Lambda^{\ast} as the diagonal matrix containing eigenvalues of \mathcal{L}^{\ast}. Using the first order perturbation analysis [2], we have ΛΛ+diag(𝐔T(i)𝐔)\Lambda^{\ast}\approx\Lambda+\text{diag}(\mathbf{U}^{T}\mathcal{L}^{(-i)}\mathbf{U}). Let i\mathcal{L}_{\bm{\cdot}i} and i\mathcal{L}_{i\bm{\cdot}} be the ithi^{th} column and row of \mathcal{L}, respectively. Since

(i)Uj=\displaystyle\mathcal{L}^{(-i)}U_{j}= (1i𝐔iji(i)Ujni𝐔ij)=i𝐔ij+(0ii𝐔iji(i)Uj0)=i𝐔ij+(0(iiλj)𝐔ij0)\displaystyle\begin{pmatrix}-\mathcal{L}_{1i}\mathbf{U}_{ij}\\ \vdots\\ -\mathcal{L}^{(-i)}_{i\bm{\cdot}}U_{j}\\ \vdots\\ -\mathcal{L}_{ni}\mathbf{U}_{ij}\\ \end{pmatrix}=-\mathcal{L}_{\bm{\cdot}i}\mathbf{U}_{ij}+\begin{pmatrix}0\\ \vdots\\ \mathcal{L}_{ii}\mathbf{U}_{ij}-\mathcal{L}^{(-i)}_{i\bm{\cdot}}U_{j}\\ \vdots\\ 0\\ \end{pmatrix}=-\mathcal{L}_{\bm{\cdot}i}\mathbf{U}_{ij}+\begin{pmatrix}0\\ \vdots\\ (\mathcal{L}_{ii}-\lambda_{j})\mathbf{U}_{ij}\\ \vdots\\ 0\\ \end{pmatrix}

we have

UjT(i)Uj=\displaystyle U^{T}_{j}\mathcal{L}^{(-i)}U_{j}= 𝐔ij(iUj)+(iiλj)𝐔ij2\displaystyle-\mathbf{U}_{ij}(\mathcal{L}_{i\bm{\cdot}}U_{j})+(\mathcal{L}_{ii}-\lambda_{j})\mathbf{U}^{2}_{ij}
=\displaystyle= 𝐔ij(λj𝐔ij)+(iiλj)𝐔ij2\displaystyle-\mathbf{U}_{ij}(\lambda_{j}\mathbf{U}_{ij})+(\mathcal{L}_{ii}-\lambda_{j})\mathbf{U}^{2}_{ij}
=\displaystyle= (12λj)𝐔ij2since the diagonal entry of is 1\displaystyle(1-2\lambda_{j})\mathbf{U}^{2}_{ij}\;\;\text{since the diagonal entry of }\mathcal{L}\;\text{is}\;1

Therefore,

ΛΛ+((12λ1)𝐔i12(12λn)𝐔in2)\displaystyle\Lambda^{\ast}\approx\Lambda+\begin{pmatrix}(1-2\lambda_{1})\mathbf{U}^{2}_{i1}&&\\ &\ddots&\\ &&(1-2\lambda_{n})\mathbf{U}^{2}_{in}\end{pmatrix}

Assume U_1 corresponds to the smallest non-zero eigenvalue of Λ^*; then the increase in bandwidth frequency from querying node i is 𝐔^2_{i1}.
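This first-order approximation can be checked numerically on a small graph; the following is a hedged sketch (not part of the proof) comparing the exact eigenvalues of the reduced Laplacian with Λ + diag(U^⊤ℒ^{(−i)}U). The graph model, size, and queried node index are arbitrary illustrative choices.

```python
import numpy as np
import networkx as nx

# Small connected graph and its symmetrically normalized Laplacian
G = nx.barabasi_albert_graph(n=20, m=3, seed=0)
A = nx.to_numpy_array(G)
deg = A.sum(axis=1)
L = np.eye(len(A)) - np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)
lam, U = np.linalg.eigh(L)

i = 5                                  # node whose row/column is zeroed out after querying
L_minus_i = np.zeros_like(L)           # the perturbation L^{(-i)}
L_minus_i[i, :] = -L[i, :]
L_minus_i[:, i] = -L[:, i]
L_star = L + L_minus_i                 # reduced Laplacian after selecting node i

lam_exact = np.linalg.eigvalsh(L_star)                    # ascending order
lam_approx = np.sort(lam + np.diag(U.T @ L_minus_i @ U))  # first-order approximation
print(np.round(lam_exact[:5], 3))
print(np.round(lam_approx[:5], 3))
```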

For random selection, where node ii is queried with uniform probability,

𝐄(ΔR)=1ni=1n𝐔i121n\displaystyle\mathbf{E}(\Delta_{\text{R}})=\frac{1}{n}\displaystyle\sum_{i=1}^{n}\mathbf{U}^{2}_{i1}\propto\frac{1}{n}

For the proposed selection method, define 𝐑=diag(r1,r2,,rn)\mathbf{R}=\text{diag}(r_{1},r_{2},\cdots,r_{n}), where ri=didminr_{i}=\frac{d_{i}}{d_{min}}, then

=\displaystyle\mathcal{L}= 𝐈𝐃1/2𝐀𝐃1/2\displaystyle\mathbf{I}-\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}
=\displaystyle=\;\underbrace{\mathbf{I}+\epsilon^{\prime}\mathbf{D}}_{\mathbf{A}_{0}}+\epsilon^{\prime}\Big(\underbrace{-(\epsilon^{\prime}d_{min})^{-1}\mathbf{R}^{-\frac{1}{2}}\mathbf{A}\mathbf{R}^{-\frac{1}{2}}-\mathbf{D}}_{\mathbf{A}_{1}}\Big)
=\displaystyle= 𝐀𝟎+ϵ𝐀𝟏\displaystyle\mathbf{A_{0}}+\epsilon^{\prime}\mathbf{A_{1}}

Next, we perform a matrix perturbation analysis and define

U0=I,Λ0=A0,Λ=Λ0+ϵΛ1,U=U0+ϵU1\displaystyle U_{0}=I,\;\Lambda_{0}=A_{0},\;\Lambda=\Lambda_{0}+\epsilon^{\prime}\Lambda_{1},\;U=U_{0}+\epsilon^{\prime}U_{1}

where Λ and U approximate the eigenvalues and eigenvectors of ℒ. Denote (U_1)_k as the k-th column of U_1 and assume d_i ≠ d_j for i ≠ j; then

(U1)k=ik(U0)iA1(U0)k(Λ0)k(Λ0)i(U0)i=ikAik(ε)2(didk)didkei\left(U_{1}\right)_{k}=\sum_{i\neq k}\frac{\left(U_{0}\right)_{i}^{\top}A_{1}\left(U_{0}\right)_{k}}{\left(\Lambda_{0}\right)_{k}-\left(\Lambda_{0}\right)_{i}}\left(U_{0}\right)_{i}=\sum_{i\neq k}\frac{A_{ik}}{\left(\varepsilon^{\prime}\right)^{2}\left(d_{i}-d_{k}\right)\sqrt{d_{i}}\sqrt{d_{k}}}e_{i}
(U)k=(U0)k+ε(U1)k(U)_{k}=\left(U_{0}\right)_{k}+\varepsilon^{\prime}\left(U_{1}\right)_{k}

To satisfy ‖U_k‖_2 = 1, we rescale by a factor τ as

(τU)k=(τU0)k+ε(τU1)k\displaystyle(\tau U)_{k}=\left(\tau U_{0}\right)_{k}+\varepsilon^{\prime}\left(\tau U_{1}\right)_{k}
 recalculate U1 by τU0 as (U1)k=ikAikτ3(ε)2(didk)didkei\displaystyle\Rightarrow\text{ recalculate }U_{1}\text{ by }\tau U_{0}\text{ as }\left(U_{1}\right)_{k}=\sum_{i\neq k}\frac{A_{ik}\tau^{3}}{(\varepsilon^{\prime})^{2}\left(d_{i}-d_{k}\right)\sqrt{d_{i}}\sqrt{d_{k}}}e_{i}
(τU)k=τek+ikAikτ4ε(didk)didkei, choose ε=τ3×1dkdmin 32\displaystyle\Rightarrow\left(\tau U\right)_{k}=\tau e_{k}+\sum_{i\neq k}\frac{A_{ik}\tau^{4}}{\varepsilon^{\prime}\left(d_{i}-d_{k}\right)\sqrt{d_{i}}\sqrt{d_{k}}}e_{i},\text{ choose }\varepsilon^{\prime}=\tau^{3}\times\frac{1}{\sqrt{d_{k}}d_{\text{min }}^{\frac{3}{2}}}
 then τUk2=1 if τ=11+ikAik(didkdmin )2didmin\displaystyle\Rightarrow\text{ then }\left\|\tau U_{k}\right\|_{2}=1\text{ if }\tau=\frac{1}{\sqrt{1+\sum_{i\neq k}\frac{A_{ik}}{\left(\frac{d_{i}-d_{k}}{d_{\text{min }}}\right)^{2}\frac{d_{i}}{d_{\text{min }}}}}}

Then we consider the normalized τU as U in the following analysis. Assume (U)_k is the eigenvector corresponding to the smallest non-zero eigenvalue; then at the (t−1)-th step,

𝐄(Δ)=𝐄(𝐔ik2)\displaystyle\mathbf{E}(\Delta)=\mathbf{E}(\mathbf{U}^{2}_{ik})

where 𝐄\mathbf{E} is in terms of the randomness in the proposed sampling procedure.

Based on the approximation

\displaystyle(U)_{k}=\tau e_{k}+\sum_{i\neq k}\frac{A_{ik}\tau}{\frac{d_{i}-d_{k}}{d_{\min}}\times\sqrt{\frac{d_{i}}{d_{\min}}}}\,e_{i}

then

#|{𝐔ik0}i=1n|=1+dk>dmin\displaystyle\#\big{|}\{\mathbf{U}_{ik}\neq 0\}^{n}_{i=1}\big{|}=1+d_{k}>d_{\min}

Define S = {i ∈ {1, 2, ⋯, n} : 𝐔_{ik} ≠ 0} and 𝐏_1 = P(the node selected to query ∉ S). Take p = min_{i∈S}𝐏(i) and q = max_{i∉S}𝐏(i), where 𝐏(·) denotes the probability of being selected into the candidate set of size m. Denote k = d_min; we first upper bound 𝐏_1 as

\displaystyle\mathbf{P}_{1}\leq\frac{\binom{n-k}{m}q^{m}(1-q)^{n-k-m}(1-p)^{k}}{\sum_{i=0}^{k}\binom{n-k}{m-i}q^{m-i}(1-q)^{n-k-(m-i)}\binom{k}{i}p^{i}(1-p)^{k-i}}
=\displaystyle= (nkm)i=0k(nkmi)(ki)ηi, where η=p(1q)(1p)q\displaystyle\frac{\binom{n-k}{m}}{\sum_{i=0}^{k}\binom{n-k}{m-i}\binom{k}{i}\eta^{i}}\text{, where }\eta=\frac{p(1-q)}{(1-p)q}

We calculate the denominator as

i=0k(nkmi)(ki)ηi=c0i=0k(nkmi)2(ki)2i=0kη2i\displaystyle\sum_{i=0}^{k}\binom{n-k}{m-i}\binom{k}{i}\eta^{i}=c_{0}\sqrt{\sum_{i=0}^{k}\binom{n-k}{m-i}^{2}\binom{k}{i}^{2}}\sqrt{\sum_{i=0}^{k}\eta^{2i}}
c0k+1(nm)1η2k+21η2>c0k+1(nm).\displaystyle\geq\frac{c_{0}}{\sqrt{k+1}}\binom{n}{m}\sqrt{\frac{1-\eta^{2k+2}}{1-\eta^{2}}}>\frac{c_{0}}{\sqrt{k+1}}\binom{n}{m}.

In addition, by Stirling’s approximation, when nn is large

(nm)\displaystyle\binom{n}{m}\sim n2πm(nm)nnmm(nm)nm\displaystyle\sqrt{\frac{n}{2\pi m(n-m)}}\cdot\frac{n^{n}}{m^{m}(n-m)^{n-m}}
(nkm)\displaystyle\;\binom{n-k}{m}\sim nk2πm(nkm)(nk)nkmm(nkm)nkm\displaystyle\sqrt{\frac{n-k}{2\pi m(n-k-m)}}\cdot\frac{(n-k)^{n-k}}{m^{m}(n-k-m)^{n-k-m}}

then combining the above simplification, we have

𝐏1<nknnmnkm(nk)nknn(nm)nm(nkm)nkm×k+1c0\displaystyle\mathbf{P}_{1}<\sqrt{\frac{n-k}{n}\cdot\frac{n-m}{n-k-m}}\cdot\frac{(n-k)^{n-k}}{n^{n}}\cdot\frac{(n-m)^{n-m}}{(n-k-m)^{n-k-m}}\times\frac{\sqrt{k+1}}{c_{0}}
(nkmnm)m(nkmnk)k×k+1c0\displaystyle\leq\left(\frac{n-k-m}{n-m}\right)^{m}\left(\frac{n-k-m}{n-k}\right)^{k}\times\frac{\sqrt{k+1}}{c_{0}}

then we can lower bound the expected value of information gain as

\displaystyle\mathbf{E}(\mathbf{U}^{2}_{ik})=\sum_{i\in S}\mathbf{P}(i)\mathbf{U}^{2}_{ik}\geq(1-\mathbf{P}_{1})\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right)

Notice that 𝐏_1 is a monotone decreasing function of m for fixed n and k; hence we can select an m_0 such that

(nkm0nm0)m0(nkm0nk)kkδc0,\left(\frac{n-k-m_{0}}{n-m_{0}}\right)^{m_{0}}\left(\frac{n-k-m_{0}}{n-k}\right)^{k}\sqrt{k}\leq\frac{\delta}{c_{0}},

where δ<1\delta<1 is a constant, therefore 𝐄(𝐔ik2)>(1δ)miniS(𝐔ik2)\mathbf{E}(\mathbf{U}^{2}_{ik})>(1-\delta)\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right).

Next we lower bound the quantity miniS(𝐔ik2)\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right). Denote η1:=maxi(didmin)\eta_{1}:=\max_{i}(\frac{d_{i}}{d_{min}}) and η0:=#|{i:|didkdmin|1}|\eta_{0}:=\#|\{i:|\frac{d_{i}-d_{k}}{d_{min}}|\leq 1\}|, we calculate the lower bound for miniS(𝐔ik2)\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right) as

miniS(𝐔ik2)\displaystyle\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right)\geq min(τ,τ2(didk)2dmin2×didmin)iS\displaystyle\min\left(\tau,\frac{\tau^{2}}{\frac{(d_{i}-d_{k})^{2}}{{d}^{2}_{min}}\times\frac{d_{i}}{d_{min}}}\right)\;\forall i\in S
\displaystyle\geq 1(didk)2dmin2didmin+1+(didk)2dmin2didminj1,ji1(djdi)2dmin2djdmin\displaystyle\frac{1}{\frac{(d_{i}-d_{k})^{2}}{{d}^{2}_{min}}\frac{d_{i}}{d_{min}}+1+\frac{(d_{i}-d_{k})^{2}}{{d}^{2}_{min}}\frac{d_{i}}{d_{min}}\displaystyle\sum_{j\neq 1,j\neq i}\frac{1}{\frac{(d_{j}-d_{i})^{2}}{{d}^{2}_{min}}\frac{d_{j}}{d_{min}}}}
\displaystyle\geq 1(didk)2dmin2didmin(1+on(1)+j1,ji1(djdi)2dmin2djdmin)\displaystyle\frac{1}{\frac{(d_{i}-d_{k})^{2}}{{d}^{2}_{min}}\frac{d_{i}}{d_{min}}\left(1+o_{n}(1)+\displaystyle\sum_{j\neq 1,j\neq i}\frac{1}{\frac{(d_{j}-d_{i})^{2}}{{d}^{2}_{min}}\frac{d_{j}}{d_{min}}}\right)}
\displaystyle\approx 1η13(1+j{i:didkdmin1}dmin2+j{i:didkdmin<1}1)\displaystyle\frac{1}{\eta_{1}^{3}\left(1+\displaystyle\sum_{j\in\{i:\mid\frac{d_{i}-d_{k}}{d_{min}}\mid\leq 1\}}d^{2}_{min}+\displaystyle\sum_{j\notin\{i:\mid\frac{d_{i}-d_{k}}{d_{min}}\mid<1\}}1\right)}
\displaystyle\geq 1η13(η0dmin2+dminη0)\displaystyle\frac{1}{\eta_{1}^{3}(\eta_{0}d^{2}_{min}+d_{min}-\eta_{0})}

which implies

miniS(𝐔ik2)1η13(η0dmin2+dminη0)\displaystyle\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right)\geq\frac{1}{\eta_{1}^{3}(\eta_{0}d^{2}_{min}+d_{min}-\eta_{0})}

As a result, as long as mm0m\geq m_{0} we have

𝐄(𝐔ik2)1δη13(η0dmin2+dminη0).\displaystyle\mathbf{E}(\mathbf{U}^{2}_{ik})\geq\frac{1-\delta}{\eta_{1}^{3}(\eta_{0}d^{2}_{min}+d_{min}-\eta_{0})}.

A.3 Proof of Theorem 4.2

Proof: Based on the assumption that 𝐟 ∈ Proj_{𝐋_{ω_0}}span(𝐗), we denote d_0 = |{1 ≤ j ≤ n ∣ λ_j ≤ ω_0}| and 𝐔_{d_0} = (U_1, U_2, ⋯, U_{d_0}). Therefore, we can represent 𝐟 = 𝐔_{d_0}𝐔_{d_0}^T𝐗β for some parameter β ∈ ℝ^{p×1}, and ⟨𝐟, U_i⟩ = U_i^T𝐗β. For the query set 𝒮 and the corresponding bandwidth frequency ω ≤ ω_0, we similarly denote d = |{1 ≤ j ≤ n ∣ λ_j ≤ ω}| ≤ d_0 and 𝐔_d = (U_1, U_2, ⋯, U_d). We denote 𝐕_{n×r_d} = 𝐔_d V_1 as a basis of Proj_{𝐋_ω}span(𝐗), where V_1 is obtained from the SVD 𝐔_d^T𝐗 = V_1 Σ V_2^T, with (V_1)_{d×r_d} and (V_2)_{p×r_d} the left and right singular vectors, respectively, and Σ_{r_d×r_d} the diagonal matrix containing the r_d positive singular values, r_d ≤ min{d, p}. The estimation (11) at the end of Section 3 is equivalent to the weighted regression problem on {(𝐕_{i·}, Y_i, s_i)}_{i∈𝒮},

f~=\displaystyle\tilde{f}= argminf~Proj𝐋ωspan(𝐗)i𝒮si|Yif~(i)|2\displaystyle\underset{\tilde{f}\in\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X})}{\operatorname*{argmin}}\sum_{i\in\mathcal{S}}s_{i}|Y_{i}-\tilde{f}(i)|^{2}
\displaystyle\Rightarrow argminαrd×1i=1|siYi(si𝐕1(i),,si𝐕rd(i))α|2,\displaystyle\underset{\alpha\in\mathbb{R}^{r_{d}\times 1}}{\operatorname*{argmin}}\sum_{i=1}^{\mathcal{B}}|\sqrt{s_{i}}Y_{i}-(\sqrt{s_{i}}\mathbf{V}_{1}(i),\ldots,\sqrt{s_{i}}\mathbf{V}_{r_{d}}(i))\alpha|^{2},

where |𝒮|=|\mathcal{S}|=\mathcal{B}. We have the least squares solution

\displaystyle\alpha(\tilde{f})=\left(A^{\top}A\right)^{-1}A^{\top}WY_{\mathcal{B}}\quad\text{(E1)}

where

\displaystyle A=\begin{pmatrix}\sqrt{s_{1}}\mathbf{V}_{1}(1)&\cdots&\sqrt{s_{1}}\mathbf{V}_{r_{d}}(1)\\ \vdots&\ddots&\vdots\\ \sqrt{s_{\mathcal{B}}}\mathbf{V}_{1}\left(\mathcal{B}\right)&\cdots&\sqrt{s_{\mathcal{B}}}\mathbf{V}_{r_{d}}\left(\mathcal{B}\right)\\ \end{pmatrix}\;\;\;\text{and}\;\;\;W=\text{diag}(\sqrt{s_{1}},\ldots,\sqrt{s_{\mathcal{B}}}) (19)
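For concreteness, a minimal sketch of how the weighted least-squares estimate in (E1) is assembled from the basis 𝐕, the queried nodes, and the weights in (19); the inputs and variable names are illustrative assumptions.

```python
import numpy as np

def weighted_ls_in_basis(V, Y, queried, s):
    """Solve alpha = (A^T A)^{-1} A^T W Y_B as in (E1), with A and W as in (19).

    V       : (n, r_d) basis of Proj_{L_omega} span(X)
    Y       : (n,) responses; only the queried entries are used
    queried : indices of the queried nodes (the set S, with |S| = B)
    s       : (B,) sampling weights s_i for the queried nodes
    """
    sqrt_s = np.sqrt(np.asarray(s, dtype=float))
    A = sqrt_s[:, None] * np.asarray(V)[queried]   # rows sqrt(s_i) * V(i, .)
    W = np.diag(sqrt_s)
    Y_B = np.asarray(Y)[queried]
    alpha = np.linalg.solve(A.T @ A, A.T @ W @ Y_B)
    f_tilde = np.asarray(V) @ alpha                # recovered signal on all nodes
    return alpha, f_tilde
```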

We assume Y=𝐟+εY=\mathbf{f}+\varepsilon, where E(ε)=0E\left(\varepsilon\right)=0 and Var(ε)=σ2Var{(\varepsilon)}=\sigma^{2}. Notice the oracle 𝐟\mathbf{f} satisfies

\displaystyle\mathbf{f}=\underset{f\in\text{Proj}_{\mathbf{L}_{\omega_{0}}}\text{span}(\mathbf{X})}{\operatorname*{argmin}}\sum_{i=1}^{n}\mathbf{E}_{Y}\left(Y_{i}-f(i)\right)^{2}

We decompose the space Proj𝐋ω0span(𝐗)\text{Proj}_{\mathbf{L}_{\omega_{0}}}\text{span}(\mathbf{X}) as

Proj𝐋ω0span(𝐗)=Proj𝐋ωspan(𝐗)(Proj𝐋ωspan(𝐗))c\displaystyle\text{Proj}_{\mathbf{L}_{\omega_{0}}}\text{span}(\mathbf{X})=\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X})\bigoplus\big{(}\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X})\big{)}^{c}

Then we decompose 𝐟=𝐟1+𝐟2,where𝐟1Proj𝐋ωspan(𝐗),𝐟2(Proj𝐋ωspan(𝐗))c\mathbf{f}=\mathbf{f}_{1}+\mathbf{f}_{2},\;\text{where}\;\mathbf{f}_{1}\in\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X}),\;\mathbf{f}_{2}\in\big{(}\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X})\big{)}^{c}, then

𝐟1=argminfProj𝐋ω{span(𝐗)}i=1n𝐄Y(Yif(i))2\displaystyle\mathbf{f}_{1}=\underset{f\in\text{Proj}_{\mathbf{L}_{\omega}}\{\text{span}(\mathbf{X})\}}{\operatorname*{argmin}}\sum_{i=1}^{n}\mathbf{E}_{Y}\left(Y_{i}-f\left(i\right)\right)^{2}

Then we can represent 𝐟_1(i) = (𝐕_1(i), …, 𝐕_{r_d}(i))α(𝐟_1); by solving the weighted least squares problem Aα(𝐟_1) = W(𝐟_1)_{ℬ}, we have

\displaystyle\alpha(\mathbf{f}_{1})=\left(A^{\top}A\right)^{-1}A^{\top}W(\mathbf{f}_{1})_{\mathcal{B}}\quad\text{(E2)}

From (E1) and (E2), we have

i=1n|f~(i)𝐟(i)|2=\displaystyle\sum_{i=1}^{n}\left|\tilde{f}\left(i\right)-\mathbf{f}\left(i\right)\right|^{2}= f~𝐟22f~𝐟122+𝐟1𝐟22\displaystyle\|\tilde{f}-\mathbf{f}\|_{2}^{2}\leq\|\tilde{f}-\mathbf{f}_{1}\|_{2}^{2}+\|\mathbf{f}_{1}-\mathbf{f}\|_{2}^{2}
\displaystyle\leq α(f~)α(𝐟1)22+𝐟1𝐟22\displaystyle\left\|\alpha\left(\tilde{f}\right)-\alpha(\mathbf{f}_{1})\right\|_{2}^{2}+\|\mathbf{f}_{1}-\mathbf{f}\|_{2}^{2}
\displaystyle\leq (AA)1AW(Y(𝐟1))22+𝐟1𝐟22\displaystyle\left\|\left(A^{\top}A\right)^{-1}A^{\top}W\left(Y_{\mathcal{B}}-\left(\mathbf{f}_{1}\right)_{\mathcal{B}}\right)\right\|_{2}^{2}+\|\mathbf{f}_{1}-\mathbf{f}\|_{2}^{2}
\displaystyle\leq\lambda_{max}\left((A^{\top}A)^{-1}\right)\left\|A^{\top}W\left(Y_{\mathcal{B}}-\left(\mathbf{f}_{1}\right)_{\mathcal{B}}\right)\right\|_{2}^{2}+\|\mathbf{f}_{1}-\mathbf{f}\|_{2}^{2}

Denote gi=Yi𝐟1(i)g_{i}=Y_{i}-\mathbf{f}_{1}(i), we have

𝐄𝒮AWg2=\displaystyle\mathbf{E}_{\mathcal{S}}\|A^{\top}Wg\|^{2}= i=1rd𝐄𝒮[j=1sj2𝐕i2(j)|gj|2]\displaystyle\sum_{i=1}^{r_{d}}\mathbf{E}_{\mathcal{S}}\left[\sum_{j=1}^{\mathcal{B}}s_{j}^{2}\mathbf{V}_{i}^{2}(j)|g_{j}|^{2}\right]

Denote α_j = s_j p_j for j ∈ 𝒮, where p_j is the probability of node j being selected to query. Since

𝐄𝒮[sj𝐕i(j)(Yj𝐟1(j))]=\displaystyle\mathbf{E}_{\mathcal{S}}\left[s_{j}\mathbf{V}_{i}(j)(Y_{j}-\mathbf{f}_{1}(j))\right]= αj𝐄𝒮[1pj𝐕i(j)(Yj𝐟1(j))]\displaystyle\alpha_{j}\mathbf{E}_{\mathcal{S}}\left[\frac{1}{p_{j}}\mathbf{V}_{i}(j)(Y_{j}-\mathbf{f}_{1}(j))\right]
=\displaystyle= αjl=1n𝐄Y[𝐕i(l)(Yl𝐟1(l))]\displaystyle\alpha_{j}\sum_{l=1}^{n}\mathbf{E}_{Y}\left[\mathbf{V}_{i}(l)(Y_{l}-\mathbf{f}_{1}(l))\right]
=\displaystyle= 0since 𝐕i𝐟2\displaystyle 0\;\;\text{since }\mathbf{V}_{i}\perp\mathbf{f}_{2}

we have

\displaystyle\mathbf{E}_{\mathcal{S}}\|A^{\top}Wg\|^{2}=\sum_{j=1}^{\mathcal{B}}\mathbf{E}_{\mathcal{S}}\left[\sum_{i=1}^{r_{d}}s_{j}^{2}\mathbf{V}_{i}^{2}(j)|g_{j}|^{2}\right]\leq\underset{j}{\sup}\left(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\right)\times\sum_{j=1}^{\mathcal{B}}\mathbf{E}_{\mathcal{S}}\left(s_{j}g^{2}_{j}\right)
=\displaystyle= sup𝑗(sji=1rd|𝐕i(j)|2)×j=1αj𝐄𝒮(1pjgj2)\displaystyle\underset{j}{\sup}\left(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\right)\times\sum_{j=1}^{\mathcal{B}}\alpha_{j}\mathbf{E}_{\mathcal{S}}\left(\frac{1}{p_{j}}g^{2}_{j}\right)
=\displaystyle= sup𝑗(sji=1rd|𝐕i(j)|2)×j=1αj×l=1n𝐄Y(Yl𝐟1(l))2\displaystyle\underset{j}{\sup}\left(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\right)\times\sum_{j=1}^{\mathcal{B}}\alpha_{j}\times\sum_{l=1}^{n}\mathbf{E}_{Y}(Y_{l}-\mathbf{f}_{1}(l))^{2}

Notice that

l=1n𝐄Y(Yl𝐟1(l))2=l=1n𝐄Y(Yl𝐟(l))2+l=1n(𝐟(l)𝐟1(l))2=nσ2+𝐟𝐟122\displaystyle\sum_{l=1}^{n}\mathbf{E}_{Y}(Y_{l}-\mathbf{f}_{1}(l))^{2}=\sum_{l=1}^{n}\mathbf{E}_{Y}(Y_{l}-\mathbf{f}(l))^{2}+\sum_{l=1}^{n}(\mathbf{f}(l)-\mathbf{f}_{1}(l))^{2}=n\sigma^{2}+\|\mathbf{f}-\mathbf{f}_{1}\|_{2}^{2}

Notice that 𝐟 = 𝐔_{d_0}𝐔_{d_0}^T𝐗β and 𝐟_1 = 𝐔_d𝐔_d^T𝐟; then 𝐟 − 𝐟_1 = 𝐔_{d'}𝐔_{d'}^T𝐗β = Σ_{i>d}⟨𝐟, U_i⟩U_i, where 𝐔_{d'} = (U_{d+1}, ⋯, U_{d_0}). Therefore, ‖𝐟 − 𝐟_1‖²_2 = Σ_{i>d, i∈supp(𝐟)}⟨𝐟, U_i⟩². We first state and then prove the following Lemma A.1.

Lemma A.1.

For the output of Algorithm 1 with query budget \mathcal{B}, we have

i=1αi43,supj𝒮(sji=1rd|𝐕i(j)|2)10δ,and\displaystyle\sum_{i=1}^{\mathcal{B}}\alpha_{i}\leq\frac{4}{3},\;\underset{j\in\mathcal{S}}{\sup}\left(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\right)\leq 10\delta,\;\text{and}
λ(AA)[12×1(1+mδC0)2,83×11(mδC0)2]with probability 12C,\displaystyle\lambda(A^{\top}A)\in\Big{[}\frac{1}{2}\times\frac{1}{\left(1+\frac{m\sqrt{\delta}}{C_{0}}\right)^{2}},\frac{8}{3}\times\frac{1}{1-\left(\frac{m\sqrt{\delta}}{C_{0}}\right)^{2}}\Big{]}\;\text{with probability}\;1-\frac{2}{C},

where \delta=\frac{r_{d}CC_{0}^{2}}{m\mathcal{B}}, and C_{0} is a constant such that C^{2}_{0}=\left(m\right)^{2}+\max(2,\tfrac{16}{r_{d}})\,m.

Using Lemma A.1, we have

\displaystyle\mathbf{E}\|\tilde{f}-\mathbf{f}\|_{2}^{2}\leq\lambda_{\min}\left(A^{\top}A\right)^{-1}\times\left(\sum_{j=1}^{\mathcal{B}}\alpha_{j}\right)\times\underset{j\in\mathcal{S}}{\sup}\left(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\right)\times\sum_{j=1}^{n}\mathbf{E}_{Y}(Y_{j}-\mathbf{f}_{1}(j))^{2}+\|\mathbf{f}-\mathbf{f}_{1}\|_{2}^{2}
\displaystyle\leq 2(1+mδC0)2×43×10×CC02rdm×1×j=1n𝐄Y(Yj𝐟1(j))2+𝐟𝐟122\displaystyle 2(1+\frac{m\sqrt{\delta}}{C_{0}})^{2}\times\frac{4}{3}\times 10\times\frac{CC_{0}^{2}r_{d}}{m}\times\frac{1}{\mathcal{B}}\times\sum_{j=1}^{n}\mathbf{E}_{Y}(Y_{j}-\mathbf{f}_{1}(j))^{2}+\|\mathbf{f}-\mathbf{f}_{1}\|_{2}^{2}
\displaystyle\leq O\Big(2\big(\tfrac{r_{d}t}{\mathcal{B}}\big)+\big(\tfrac{r_{d}t}{\mathcal{B}}\big)^{3/2}+\big(\tfrac{r_{d}t}{\mathcal{B}}\big)^{2}\Big)\times\Big(n\sigma^{2}+\sum_{i>d,i\in\text{supp}(\mathbf{f})}\langle U_{i},\mathbf{f}\rangle^{2}\Big)+\sum_{i>d,i\in\text{supp}(\mathbf{f})}\langle U_{i},\mathbf{f}\rangle^{2},

with probability larger than 12mt1-\frac{2m}{t} where t>2mt>2m.

In the following, we prove Lemma A.1, which builds on Theorem 5.2 in [9] and Lemmas 3.5 and 3.6 in [21].

We denote the accumulated covariance matrix at the j-th selection as A_j, the potential function as Φ_{u_j,l_j}(A_j) = Tr[(u_jI − A_j)^{-1}] + Tr[(A_j − l_jI)^{-1}], and R_i(u, l, A) = v_i(uI − A)^{-1}v_i^T + v_i(A − lI)^{-1}v_i^T, where v_i is the i-th row of 𝐕. Notice that Σ_{i=1}^n R_i = Φ_{u,l}(A). At each iteration of Algorithm 1, the i-th node is selected as one of m candidates with p*_i = R_i/Φ. For the m candidates, we define the following probability

qi={1ηif  i has maximum Δi among m candidatesηm1otherwiseq_{i}=\begin{dcases*}1-\eta&\text{if } $i\text{ has maximum }\Delta_{i}\text{ among m candidates}$\\ \frac{\eta}{m-1}&\text{otherwise}\end{dcases*}

where 0 < η < 1. Notice that as η goes to 0, the q_i approximate Step 3 in Algorithm 1. Therefore, the probability of node k being queried is

pk=P(select k)=\displaystyle p_{k}=P(\text{select k})= P( select kQk|k in Bm)P(k in Bm)=pk×qk\displaystyle P(\underbrace{\text{ select k}}_{Q_{k}}|k\text{ in }B_{m})\cdot P({k\text{ in }B_{m}})=p^{*}_{k}\times q_{k}
𝐄(1pkvkvk)=\displaystyle\mathbf{E}\left(\frac{1}{p_{k}}v_{k}v^{\top}_{k}\right)= 𝐄Bm𝐄Qk|Bm(1P(select k)vkvk)\displaystyle\mathbf{E}_{B_{m}}\mathbf{E}_{Q_{k}|B_{m}}\left(\frac{1}{P(\text{select k})}v_{k}v^{\top}_{k}\right)
=\displaystyle= 𝐄Bm(kBmP(select k|kBm)1P(select k)vkvk)\displaystyle\mathbf{E}_{B_{m}}\left(\sum_{k\in B_{m}}P(\text{select }k|k\in B_{m})\cdot\frac{1}{P(\text{select k})}v_{k}v^{\top}_{k}\right)
=\displaystyle= 𝐄Bm(kBm1P(kBm)vkvk)\displaystyle\mathbf{E}_{B_{m}}\left(\sum_{k\in B_{m}}\frac{1}{P(k\in B_{m})}v_{k}v^{\top}_{k}\right)
=\displaystyle= ΩP(Bm)×kBm1P(kBm)vkvk\displaystyle\sum_{\Omega}P(B_{m})\times\sum_{k\in B_{m}}\frac{1}{P(k\in B_{m})}v_{k}v^{\top}_{k}
=\displaystyle= BmΩkBmP(BmkBm)vkvk\displaystyle\sum_{B_{m}\in\Omega}\sum_{k\in B_{m}}P(B_{m}\mid k\in B_{m})\cdot v_{k}v^{\top}_{k}

where Ω denotes all C_n^m possible candidate sets formed by choosing m nodes from n nodes, and P(B_m ∣ k ∈ B_m) denotes the probability of selecting the remaining m−1 nodes into B_m conditional on k ∈ B_m. Denote Ω_k as the collection of all size-m candidate sets that contain node k. Then

𝐄(1pkvkvk)=k=1n(Bm1kΩkP(BmkBm))vkvk=k=1nvkvk=I\displaystyle\mathbf{E}\left(\frac{1}{p_{k}}v_{k}v^{\top}_{k}\right)=\sum_{k=1}^{n}\left(\sum_{B^{k}_{m-1}\subset\Omega_{k}}P\left(B_{m}\mid k\in B_{m}\right)\right)\cdot v_{k}v^{\top}_{k}=\sum_{k=1}^{n}v_{k}v^{\top}_{k}=I
ϵ(Ri)pkvkvk=ϵ(Ri)pkqkvkvk=\displaystyle\frac{\epsilon}{(\sum R_{i})p_{k}}v_{k}v^{\top}_{k}=\frac{\epsilon}{(\sum R_{i})p^{*}_{k}q_{k}}v_{k}v^{\top}_{k}= ϵRkqkvkvk\displaystyle\frac{\epsilon}{R_{k}q_{k}}v_{k}v^{\top}_{k}
\displaystyle\preceq ϵ(uIA)1qk\displaystyle\epsilon(uI-A)\frac{1}{q_{k}}
\displaystyle\leq mϵη(uIA)\displaystyle\frac{m\epsilon}{\eta}(uI-A)

where we use the fact vvT(vTB1v)Bvv^{T}\preceq(v^{T}B^{-1}v)B for any semi-positive definite matrix BB. In addition,

𝐄(ϵ(Ri)pkvkvk)=ϵRiI=ϵΦu,l(A)I\displaystyle\mathbf{E}\left(\frac{\epsilon}{(\sum R_{i})p_{k}}v_{k}v^{\top}_{k}\right)=\frac{\epsilon}{\sum R_{i}}I=\frac{\epsilon}{\Phi_{u,l}(A)}I

for any k=1,,nk=1,\cdots,n. Denote wk=ϵRipkvkw_{k}=\sqrt{\frac{\epsilon}{\sum R_{i}p_{k}}}v_{k}, then wkwkmϵη(uIA)w_{k}w^{\top}_{k}\preceq\frac{m\epsilon}{\eta}(uI-A), which implies for any k[1,n]k\in[1,n]

wk(uIA)1wkwk(uIA)1wk\displaystyle w^{\top}_{k}(uI-A)^{-1}w_{k}w^{\top}_{k}(uI-A)^{-1}w_{k}\leq mϵηwk(uIA)1wk\displaystyle\frac{m\epsilon}{\eta}w^{\top}_{k}(uI-A)^{-1}w_{k}
wk(uIA)1wk\displaystyle\Rightarrow\;\;\;w^{\top}_{k}(uI-A)^{-1}w_{k}\leq mϵη\displaystyle\frac{m\epsilon}{\eta}

Similarly, we have

wk(AlI)1wk\displaystyle w^{\top}_{k}(A-lI)^{-1}w_{k}\leq mϵη\displaystyle\frac{m\epsilon}{\eta}

Then from Lemmas 3.3 and 3.4 in [4], we have

\text{Tr}\big{[}(uI-A-w_{k}w_{k}^{\top})^{-1}\big{]}\leq\;\text{Tr}\big{[}(uI-A)^{-1}\big{]}+\frac{w_{k}^{\top}(uI-A)^{-2}w_{k}}{1-\frac{m\epsilon}{\eta}} (20)
\text{Tr}\big{[}(A+w_{k}w_{k}^{\top}-lI)^{-1}\big{]}\leq\;\text{Tr}\big{[}(A-lI)^{-1}\big{]}-\frac{w_{k}^{\top}(A-lI)^{-2}w_{k}}{1+\frac{m\epsilon}{\eta}} (21)

Define ϵ=mϵη\epsilon^{\prime}=\frac{m\epsilon}{\eta}, we show in the following that 𝐄(Φuj,lj(Aj))Φuj1,lj1(Aj1)\mathbf{E}\left(\Phi_{u_{j},l_{j}}(A_{j})\right)\leq\Phi_{u_{j-1},l_{j-1}}(A_{j-1}).

From (20) we have

\displaystyle\Phi_{u_{j},l_{j}}(A_{j})\leq\Phi_{u_{j},l_{j}}(A_{j-1})+\frac{w_{j-1}^{\top}(u_{j}I-A_{j-1})^{-2}w_{j-1}}{1-\epsilon^{\prime}}-\frac{w_{j-1}^{\top}(A_{j-1}-l_{j}I)^{-2}w_{j-1}}{1+\epsilon^{\prime}}\quad\text{(E3)}

Define Δu=ujuj1=ϵ(1ϵ)Rj\Delta_{u}=u_{j}-u_{j-1}=\frac{\epsilon}{(1-\epsilon^{\prime})\sum R_{j}} and Δl=ljlj1=ϵ(1+ϵ)Rj\Delta_{l}=l_{j}-l_{j-1}=\frac{\epsilon}{(1+\epsilon^{\prime})\sum R_{j}}. Notice that

uTr(uIA)1=Tr(uIA)2<0\displaystyle\frac{\partial}{\partial u}\text{Tr}(uI-A)^{-1}=-\text{Tr}(uI-A)^{-2}<0
\displaystyle\frac{\partial}{\partial l}\text{Tr}(A-lI)^{-1}=\text{Tr}(A-lI)^{-2}>0

at each step, based on the design of u_j and l_j, and Φ_{u,l}(A) is convex in u and l. From (E3), we have

Φuj,lj(Aj)Φuj,lj(Aj1)+\displaystyle\Phi_{u_{j},l_{j}}(A_{j})\leq\Phi_{u_{j},l_{j}}(A_{j-1})+ 11ϵTr[(ujIAj1)2wj1wj1]\displaystyle\frac{1}{1-\epsilon^{\prime}}\text{Tr}\big{[}(u_{j}I-A_{j-1})^{-2}w_{j-1}w^{\top}_{j-1}\big{]}
\displaystyle- 11+ϵTr[(Aj1ljI)2wj1wj1]\displaystyle\frac{1}{1+\epsilon^{\prime}}\text{Tr}\big{[}(A_{j-1}-l_{j}I)^{-2}w_{j-1}w^{\top}_{j-1}\big{]}

then with 𝐄(wkwkT)=ϵRiI\mathbf{E}(w_{k}w_{k}^{T})=\frac{\epsilon}{\sum R_{i}}I

E(Φuj,lj(Aj))\displaystyle\textbf{E}\left(\Phi_{u_{j},l_{j}}(A_{j})\right) Φuj,lj(Aj1)+ϵ(1ϵ)RiTr[(ujIAj1)2]\displaystyle\leq\Phi_{u_{j},l_{j}}(A_{j-1})+\frac{\epsilon}{(1-\epsilon^{\prime})\sum R_{i}}\text{Tr}\big{[}(u_{j}I-A_{j-1})^{-2}\big{]}
ϵ(1+ϵ)RiTr[(Aj1ljI)2]\displaystyle\quad-\frac{\epsilon}{(1+\epsilon^{\prime})\sum R_{i}}\text{Tr}\big{[}(A_{j-1}-l_{j}I)^{-2}\big{]}
Φuj,lj(Aj1)+ΔuTr[(ujIAj1)2]\displaystyle\leq\Phi_{u_{j},l_{j}}(A_{j-1})+\Delta_{u}\text{Tr}\big{[}(u_{j}I-A_{j-1})^{-2}\big{]}
ΔlTr[(Aj1lj)2]\displaystyle\quad-\Delta_{l}\text{Tr}\big{[}(A_{j-1}-l_{j})^{-2}\big{]}

Define

f(t)=Tr[(uj1+tΔu)IAj1]1+Tr[Aj1(lj1+Δlt)I]1\displaystyle f(t)=\text{Tr}\big{[}(u_{j-1}+t\cdot\Delta_{u})I-A_{j-1}\big{]}^{-1}+\text{Tr}\big{[}A_{j-1}-(l_{j-1}+\Delta_{l}\cdot t)I\big{]}^{-1}

then

f(t)t=ΔuTr[(uj1+tΔu)IAj1]2+ΔlTr[Aj1(lj1+Δlt)I]2\displaystyle\frac{\partial f(t)}{\partial t}=-\Delta_{u}\text{Tr}\big{[}(u_{j-1}+t\cdot\Delta_{u})I-A_{j-1}\big{]}^{-2}+\Delta_{l}\text{Tr}\big{[}A_{j-1}-(l_{j-1}+\Delta_{l}\cdot t)I\big{]}^{-2}

Since f(t)f(t) is convex, we have

f(t)t|t=1f(1)f(0)=Φuj,lj(Aj1)Φuj1,lj1(Aj1)\displaystyle\frac{\partial f(t)}{\partial t}\bigg{|}_{t=1}\geq f(1)-f(0)=\Phi_{u_{j},l_{j}}(A_{j-1})-\Phi_{u_{j-1},l_{j-1}}(A_{j-1}) (23)

Then, plugging in (23), we have

E(Φuj,lj(Aj))Φuj1,lj1(Aj1)\displaystyle\textbf{E}\left(\Phi_{u_{j},l_{j}}(A_{j})\right)\leq\Phi_{u_{j-1},l_{j-1}}(A_{j-1})

Notice that for selection

ΔujΔljΔuj=εt(1ε)εt(1+ε)εt(1ε)=1(1ε)1(1+ε)1(1ε)2ε\frac{\Delta_{u_{j}}-\Delta_{l_{j}}}{\Delta_{u_{j}}}=\frac{\frac{\varepsilon}{t\left(1-\varepsilon^{\prime}\right)}-\frac{\varepsilon}{t\left(1+\varepsilon^{\prime}\right)}}{\frac{\varepsilon}{t\left(1-\varepsilon^{\prime}\right)}}=\frac{\frac{1}{\left(1-\varepsilon^{\prime}\right)}-\frac{1}{(1+\varepsilon^{\prime})}}{\frac{1}{\left(1-\varepsilon^{\prime}\right)}}\leq 2\varepsilon^{\prime}

where t = ∑R_i. We consider that the selection process stops at the first iteration k at which u_k − l_k ≥ 8r_d/ϵ. Notice that u_0 = 2r_d/ε and l_0 = −2r_d/ε; when the process stops with u_k − l_k ≥ 8r_d/ε, we have

uklkuk=\displaystyle\frac{u_{k}-l_{k}}{u_{k}}= (u0l0)+j=0k1(ΔujΔlj)u0+j=0k1Δuj\displaystyle\frac{\left(u_{0}-l_{0}\right)+\sum_{j=0}^{k-1}\left(\Delta_{u_{j}}-\Delta_{l_{j}}\right)}{u_{0}+\sum_{j=0}^{k-1}\Delta_{u_{j}}}
\displaystyle\leq 4rd/ε+j=0k1(ΔujΔlj)2rd/ε+(2ε)1j=0k1(ΔujΔlj)\displaystyle\frac{4r_{d}/\varepsilon+\sum_{j=0}^{k-1}\left(\Delta_{u_{j}}-\Delta_{l_{j}}\right)}{2r_{d}/\varepsilon+\left(2\varepsilon^{\prime}\right)^{-1}\sum_{j=0}^{k-1}\left(\Delta_{u_{j}}-\Delta_{l_{j}}\right)}
\displaystyle\leq 4rd/ε+4rd/ε2rd/ε+(2ε)14rd/ε\displaystyle\frac{4r_{d}/\varepsilon+4r_{d}/\varepsilon}{2r_{d}/\varepsilon+\left(2\varepsilon^{\prime}\right)^{-1}4r_{d}/\varepsilon}
=\displaystyle= 8rd/ε2rd(1+1ε)/ε\displaystyle\frac{8r_{d}/\varepsilon}{2r_{d}\left(1+\frac{1}{\varepsilon^{\prime}}\right)/\varepsilon}
=\displaystyle= 41+1ε4ε\displaystyle\frac{4}{1+\frac{1}{\varepsilon^{\prime}}}\leq 4\varepsilon^{\prime}

Then we have uklk=(1uklkuk)11+4(ε)\frac{u_{k}}{l_{k}}=\left(1-\frac{u_{k}-l_{k}}{u_{k}}\right)^{-1}\leq 1+4\left(\varepsilon^{\prime}\right). Notice that uklk8rdεj=0k1(ΔujΔlj)4rdεu_{k}-l_{k}\geqslant\frac{8r_{d}}{\varepsilon}\implies\sum_{j=0}^{k-1}\left(\Delta_{u_{j}}-\Delta_{l_{j}}\right)\geq\frac{4r_{d}}{\varepsilon}.

Consider at the jjth selection

ΔujΔlj=\displaystyle\Delta_{u_{j}}-\Delta_{l_{j}}= (ϵ1ϵϵ1+ϵ)1Ri\displaystyle\left(\frac{\epsilon}{1-\epsilon^{\prime}}-\frac{\epsilon}{1+\epsilon^{\prime}}\right)\frac{1}{\sum R_{i}}
=\displaystyle=\;\frac{\tilde{\epsilon}}{\Phi_{u_{j},l_{j}}(A_{j})},\;\;\text{where}\;\;\tilde{\epsilon}=\frac{2\epsilon\epsilon^{\prime}}{(1-\epsilon^{\prime})(1+\epsilon^{\prime})}

Then

P\left(\text{the selection process finishes within }\mathcal{B}\text{ selections}\right)\geq P\left(\sum_{j=0}^{\mathcal{B}-1}\frac{\tilde{\epsilon}}{\Phi_{u_{j},l_{j}}(A_{j})}\geq\frac{4r_{d}}{\epsilon}\right)
=\displaystyle= P(j=01Φuj,lj1(Aj)4rdϵ~ϵ)\displaystyle P\left(\sum_{j=0}^{\mathcal{B}-1}\Phi^{-1}_{u_{j},l_{j}}(A_{j})\geq\frac{4r_{d}}{\tilde{\epsilon}\epsilon}\right)
\displaystyle\geq P(2j=01Φuj,lj(Aj)4rdϵ~ϵ)\displaystyle P\left(\frac{\mathcal{B}^{2}}{\sum_{j=0}^{\mathcal{B}-1}\Phi_{u_{j},l_{j}}(A_{j})}\geq\frac{4r_{d}}{\tilde{\epsilon}\epsilon}\right)
=\displaystyle= P(j=0Φuj,lj(Aj)2ϵ~ϵ4rd),\displaystyle P\left(\sum_{j=0}^{\mathcal{B}}\Phi_{u_{j},l_{j}}(A_{j})\leq\frac{\mathcal{B}^{2}\tilde{\epsilon}\epsilon}{4r_{d}}\right),
\displaystyle\geq 1-\frac{4r_{d}}{\mathcal{B}\tilde{\epsilon}},
\displaystyle\geq 12rdmηϵ2\displaystyle 1-\frac{2r_{d}}{\mathcal{B}\cdot\frac{m}{\eta}\cdot\epsilon^{2}}

where we use the result that E(Φuj,lj(Aj))Φu0,l0(A0)\textbf{E}\left(\Phi_{u_{j},l_{j}}(A_{j})\right)\leq\Phi_{u_{0},l_{0}}(A_{0}) by recursively using E(Φuj,lj(Aj))Φuj1,lj1(Aj1)\textbf{E}\left(\Phi_{u_{j},l_{j}}(A_{j})\right)\leq\Phi_{u_{j-1},l_{j-1}}(A_{j-1}) and the fact that Φu0,l0(A0)=ϵ\Phi_{u_{0},l_{0}}(A_{0})=\epsilon.

We consider the following reparametrization:

ϵ=δC0, 0<δ,mid=2rd(1m2ϵ2)/(mϵ2).\displaystyle\epsilon=\frac{\sqrt{\delta}}{C_{0}},\;0<\delta,\;\text{mid}=2r_{d}(1-m^{2}\epsilon^{2})/(m\epsilon^{2}).

For the j-th selection, α_j = (ϵ/Φ_j)·(1/mid). From the previous result, u_k/l_k = (1 − (u_k − l_k)/u_k)^{-1} ≤ 1 + 4ε′ holds with probability 1 − 2/C when δ = CC_0²ηr_d/(mℬ). Notice that u_k = u_0 + Σ_{j=1}^k ϵ/((1−ϵ′)Φ_j) and l_k = l_0 + Σ_{j=1}^k ϵ/((1+ϵ′)Φ_j), and

uk+lk=j=1kϵΦj(11ϵ+11+ϵ)\displaystyle u_{k}+l_{k}=\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}\left(\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}\right)

Then, if the selection process stops at the k-th selection,

Φk2rduklk=\displaystyle\Phi_{k}\geq\frac{2r_{d}}{u_{k}-l_{k}}= 2rd(uk1lk1)+ϵΦk(11ϵ11+ϵ)\displaystyle\frac{2r_{d}}{(u_{k-1}-l_{k-1})+\frac{\epsilon}{\Phi_{k}}\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)}
\displaystyle\geq 2rd8rdϵ+ϵΦk(11ϵ11+ϵ)\displaystyle\frac{2r_{d}}{\frac{8r_{d}}{\epsilon}+\frac{\epsilon}{\Phi_{k}}\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)}

Denote c = 1/(1−ϵ′) − 1/(1+ϵ′); then Φ_k ≥ (1/4)ϵ − (c/8)ϵ². We choose C_0 such that cϵ = 2ϵ′ϵ/(1−(ϵ′)²) < 1, so that Φ_k ≥ ϵ/8.

Therefore,

uklk=uk1lk1+ϵΦk(11ϵ11+ϵ)\displaystyle u_{k}-l_{k}=u_{k-1}-l_{k-1}+\frac{\epsilon}{\Phi_{k}}\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)\leq 8rdϵ+ϵΦk(11ϵ11+ϵ)\displaystyle\frac{8r_{d}}{\epsilon}+\frac{\epsilon}{\Phi_{k}}\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)
\displaystyle\leq 8rdϵ+8(11ϵ11+ϵ)\displaystyle\frac{8r_{d}}{\epsilon}+8\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)
=\displaystyle= 8rdϵ+16ϵ1(ϵ)2\displaystyle\frac{8r_{d}}{\epsilon}+\frac{16\epsilon^{\prime}}{1-(\epsilon^{\prime})^{2}}

Then we choose C_0 large enough that 16ϵ′/(1−(ϵ′)²) < r_d/ϵ, which implies u_k − l_k ≤ 9r_d/ϵ. Given that ϵ′ = (m/η)ϵ and ϵ = √δ/C_0, we choose an appropriate C_0 satisfying the previous requirements on ϵ and ϵ′:

{2ϵϵ<1(ϵ)216ϵ1(ϵ)2<rdϵϵ<1\begin{dcases*}2\epsilon\epsilon^{\prime}<1-(\epsilon^{\prime})^{2}\\ \frac{16\epsilon^{\prime}}{1-(\epsilon^{\prime})^{2}}<\frac{r_{d}}{\epsilon}\\ \epsilon^{\prime}<1\end{dcases*}

Therefore, we choose C0C_{0} such that C02>(mη)2+max(2,16rd)×mηC^{2}_{0}>\left(\frac{m}{\eta}\right)^{2}+\max(2,\frac{16}{r_{d}})\times\frac{m}{\eta}. Notice that

uklk=j=1kϵΦj(11ϵ11+ϵ)4rdϵ\displaystyle u_{k}-l_{k}=\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)\geq\frac{4r_{d}}{\epsilon}

and mid can be written as mid = (4r_d/ϵ)/(1/(1−ϵ′) − 1/(1+ϵ′)); then Σ_{j=1}^k ϵ/Φ_j ≥ mid. Also,

j=1k1ϵΦjmidmid>\displaystyle\sum_{j=1}^{k-1}\frac{\epsilon}{\Phi_{j}}\leq\text{mid}\Rightarrow\text{mid}> j=1kϵΦj8\displaystyle\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}-8
>\displaystyle>\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}
=\displaystyle= (14ϵϵrd(1(ϵ)2))j=1kϵΦj\displaystyle\left(1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}\right)\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}

which implies

\displaystyle\text{mid}\in\Big{[}1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})},1\Big{]}\cdot\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}=\Big{[}1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})},1\Big{]}\cdot\frac{u_{k}+l_{k}}{\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}}

Notice that for the design matrix A in (19), we have \frac{1}{\sqrt{\text{mid}}}A = A_k, where A_k is the accumulated covariance matrix when the query process stops at the k-th selection. Therefore, the eigenvalues of A satisfy λ(A^TA) = \frac{1}{\text{mid}}λ(A_k^TA_k) ∈ [\frac{l_k}{\text{mid}}, \frac{u_k}{\text{mid}}]. Then

[lkmid,ukmid][lkuk+lk11ϵ+11+ϵ,uk(14ϵϵrd(1(ϵ)2))uk+lk11ϵ+11+ϵ]\displaystyle\Bigg{[}\frac{l_{k}}{\text{mid}},\frac{u_{k}}{\text{mid}}\Bigg{]}\subset\Bigg{[}\frac{l_{k}}{\frac{u_{k}+l_{k}}{\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}}},\frac{u_{k}}{\left(1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}\right)\cdot\frac{u_{k}+l_{k}}{\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}}}\Bigg{]}

Given that with high probability, 14ϵuklk1+4ϵ1-4\epsilon^{\prime}\leq\frac{u_{k}}{l_{k}}\leq 1+4\epsilon^{\prime}. Then for the lower bound,

lkuk+lk11ϵ+11+ϵ1(1(ϵ)2)(1+2ϵ)>1(1+ϵ)(1+2ϵ)>12×1(1+ϵ)2=121(1+mδ/(ηC0))2\displaystyle\frac{l_{k}}{\frac{u_{k}+l_{k}}{\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}}}\geq\frac{1}{(1-(\epsilon^{\prime})^{2})(1+2\epsilon^{\prime})}>\frac{1}{(1+\epsilon^{\prime})(1+2\epsilon^{\prime})}>\frac{1}{2}\times\frac{1}{(1+\epsilon^{\prime})^{2}}=\frac{1}{2}\frac{1}{(1+m\sqrt{\delta}/(\eta C_{0}))^{2}}

and upper bound

uk(14ϵϵrd(1(ϵ)2))uk+lk11ϵ+11+ϵ\displaystyle\frac{u_{k}}{\left(1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}\right)\cdot\frac{u_{k}+l_{k}}{\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}}}\leq 1+4ϵ(1+2ϵ)(1(ϵ)2)(1+2ϵ)4ϵϵrd\displaystyle\frac{1+4\epsilon^{\prime}}{(1+2\epsilon^{\prime})(1-(\epsilon^{\prime})^{2})-(1+2\epsilon^{\prime})\frac{4\epsilon\epsilon^{\prime}}{r_{d}}}
\displaystyle\leq\frac{4}{3}\times\frac{1+4\epsilon^{\prime}}{(1+2\epsilon^{\prime})(1-(\epsilon^{\prime})^{2})},\;\text{given }\frac{4\epsilon\epsilon^{\prime}}{r_{d}}\leq\frac{1}{4}(1-(\epsilon^{\prime})^{2})
<\displaystyle< 83×11(ϵ)2\displaystyle\frac{8}{3}\times\frac{1}{1-(\epsilon^{\prime})^{2}}
<\displaystyle< 8311(mδ/(ηC0))2.\displaystyle\frac{8}{3}\frac{1}{1-(m\sqrt{\delta}/(\eta C_{0}))^{2}}.

Then with probability larger than 12C1-\frac{2}{C}, we have

λ(AA)[12×1(1+mδηC0)2,83×11(mδηC0)2]\displaystyle\lambda(A^{\top}A)\in\bigg{[}\frac{1}{2}\times\frac{1}{(1+\frac{m\sqrt{\delta}}{\eta C_{0}})^{2}},\frac{8}{3}\times\frac{1}{1-(\frac{m\sqrt{\delta}}{\eta C_{0}})^{2}}\bigg{]}

Consider αj=ϵΦj1mid\alpha_{j}=\frac{\epsilon}{\Phi_{j}}\cdot\frac{1}{\text{mid}},

\displaystyle\Rightarrow\sum_{j=1}^{k}\alpha_{j}=\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}\cdot\frac{1}{\text{mid}}\in\bigg{[}1,\frac{1}{1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}}\bigg{]},\;\text{and}\;\;\frac{1}{1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}}\leq\frac{1}{1-\frac{1}{4}}=\frac{4}{3}

Finally, we check sup_{j∈𝒮}(s_j Σ_{i=1}^{r_d}|𝐕_i(j)|²) at the k-th selection:

\begin{align*}
\sup_{j\in\mathcal{S}}\Big(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\Big)
&=\sup_{j\in\mathcal{S}}\Big\{\frac{\epsilon}{\Phi_{k}}\cdot\frac{1}{\text{mid}}\times\frac{\Phi_{k}}{R_{j}}\times\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\Big\}\\
&=\frac{\epsilon}{\text{mid}}\cdot\sup_{j}\Big\{\frac{\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}}{R_{j}}\Big\}\\
&\leq\frac{\epsilon}{\text{mid}}\cdot\frac{1}{\frac{1}{u_{k}-l_{k}}+\frac{1}{u_{k}-l_{k}}}\\
&=\frac{\epsilon}{\text{mid}}\times\frac{u_{k}-l_{k}}{2}\leq\frac{\epsilon}{\text{mid}}\times\frac{9\,r_{d}/\epsilon}{2}=\frac{4.5\,r_{d}}{\text{mid}}\\
&\leq\frac{4.5\,r_{d}}{\frac{2r_{d}(1-(\epsilon^{\prime})^{2})}{\epsilon\epsilon^{\prime}}}=2.25\times\frac{\epsilon\epsilon^{\prime}}{1-(\epsilon^{\prime})^{2}}\leq 2.25\times 4\delta=9\delta\leq 10\delta.
\end{align*}

This completes the proof of Lemma 1.
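As a purely numerical illustration of the scale of the eigenvalue bound above, suppose (hypothetically; this is not a setting used in the paper) that $m\sqrt{\delta}/(\eta C_{0})=0.1$. Then
\[
\lambda(A^{\top}A)\in\Big[\frac{1}{2}\cdot\frac{1}{(1.1)^{2}},\;\frac{8}{3}\cdot\frac{1}{1-0.01}\Big]\approx[0.41,\,2.69],
\]
so the condition number of $A^{\top}A$ is bounded by roughly $2.69/0.41\approx 6.5$ in this case.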

Appendix B More on Numerical Studies

B.1 Experimental setups

Synthetic networks   The parameters for the three network topologies are: the Watts–Strogatz (WS) model ($K=4$, $\beta_{WS}=0.1$) for small-world properties, the stochastic block model (SBM) ($N_{\text{community}}=4$, $P_{\text{in}}=0.35$, $P_{\text{out}}=0.01$) for community structure, and the Barabási–Albert (BA) model ($\alpha=3$) for scale-free properties. We set $n=100$ for all three networks. After generating the networks, we consider them fixed and then simulate $\mathbf{Y}$ and $\mathbf{X}$ repeatedly using 10 different random seeds. By a slight abuse of notation, we set the node responses and covariates for SBM and WS as $\mathbf{Y}=U_{1:10}\beta+\xi$ and $\mathbf{X}=U_{1:10}+MU_{45:54}$, where $M_{ij}\overset{\text{iid}}{\sim}N(0.3,0.1)$ and $\beta=(5,5,\ldots,5)^{T}$ of length 10. For the BA model, we set $\mathbf{Y}=U_{1:15}\beta+\xi$ and $\mathbf{X}=U_{1:15}+MU_{45:59}$, where $M_{ij}\overset{\text{iid}}{\sim}N(0.5,0.2)$ and $\beta=(1,\ldots,1,5,\ldots,5)^{T}$ with five entries equal to 1 followed by ten entries equal to 5.
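Below is a minimal sketch of this data-generating process, assuming that $U$ denotes the eigenvectors of the normalized graph Laplacian ordered by increasing eigenvalue (an interpretation consistent with the graph signal model, not a specification taken from the paper), that $N(a,b)$ is parameterized by mean and variance, and that $\xi$ is i.i.d. Gaussian noise with an illustrative scale; the networkx/numpy calls are hypothetical choices for reproducing the setup.

import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
n = 100

# One of the three topologies (WS shown; SBM and BA are analogous).
G = nx.watts_strogatz_graph(n, k=4, p=0.1)                  # WS: K=4, beta_WS=0.1
# sizes = [25] * 4
# probs = (0.01 + 0.34 * np.eye(4)).tolist()                # SBM: P_in=0.35, P_out=0.01
# G = nx.stochastic_block_model(sizes, probs, seed=0)
# G = nx.barabasi_albert_graph(n, 3)                        # BA (attachment parameter 3)

# Assumed basis: eigenvectors of the normalized Laplacian, ascending eigenvalues.
L = nx.normalized_laplacian_matrix(G).toarray()
_, U = np.linalg.eigh(L)

# SBM/WS setting: smooth response on the first 10 eigenvectors, noisy covariates.
beta = np.full(10, 5.0)
M = rng.normal(0.3, np.sqrt(0.1), size=(n, n))              # assuming N(mean, variance)
xi = rng.normal(0.0, 1.0, size=n)                           # illustrative noise scale
Y = U[:, :10] @ beta + xi
X = U[:, :10] + M @ U[:, 44:54]                             # columns 45-54 (1-indexed)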

Real-world networks   For the proposed method and all baselines, we train a 2-layer SGC model for a fixed 300 epochs. In SGC, the propagation matrix performs low-pass filtering on homophilic networks and high-pass filtering on heterophilic networks. During training, the initial learning rate is set to $10^{-2}$ and the weight decay to $10^{-4}$.
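For concreteness, the following sketch shows one way to set up this training configuration with the SGConv layer from PyTorch Geometric, interpreting the 2-layer SGC as $K=2$ propagation steps. Only the standard low-pass propagation is shown (the high-pass variant for heterophilic networks would modify the propagation matrix), and the data object and mask names are placeholders rather than parts of the paper's pipeline.

import torch
import torch.nn.functional as F
from torch_geometric.nn import SGConv

class SGC(torch.nn.Module):
    # SGC: K-step feature propagation followed by a single linear map.
    def __init__(self, in_dim, num_classes, K=2):
        super().__init__()
        self.conv = SGConv(in_dim, num_classes, K=K, cached=True)

    def forward(self, x, edge_index):
        return self.conv(x, edge_index)

def train_sgc(data, num_classes, epochs=300):
    # data: torch_geometric.data.Data with x, y, edge_index, and a train_mask
    # marking the queried (labeled) nodes.
    model = SGC(data.num_features, num_classes)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-4)
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()
    return model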

Dataset      #Nodes    Type           $m$    $d$
SBM          100       homophilic     50     10
WS           100       homophilic     50     10
BA           100       homophilic     50     15
Cora         2,708     homophilic     2000   200
Pubmed       19,717    homophilic     3000   60
Citeseer     3,327     homophilic     1000   100
Chameleon    2,277     heterophilic   800    30, 30
Texas        183       heterophilic   60     15, 15
Ogbn-Arxiv   169,343   homophilic     1000   120
Co-Physics   34,493    homophilic     3000   150
Table 4: A description of all datasets used in Section 5 and the hyperparameter settings for each dataset. We set $\epsilon=0.001$ for all networks. For heterophilic networks, we combine the eigenvectors corresponding to the $d$ smallest and $d$ largest eigenvalues.
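As a small illustration of the eigenvector selection described in the caption, the helper below (a hypothetical sketch, not code from the paper's repository) returns the $d$ eigenvectors from each end of the Laplacian spectrum for heterophilic networks and only the $d$ smallest-eigenvalue eigenvectors otherwise.

import numpy as np

def spectral_basis(L, d, heterophilic=False):
    # L: dense (normalized) graph Laplacian of shape (n, n).
    # d: number of eigenvectors taken from each end of the spectrum.
    _, eigvecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    low = eigvecs[:, :d]                        # smooth, low-frequency components
    if not heterophilic:
        return low
    high = eigvecs[:, -d:]                      # high-frequency components
    return np.concatenate([low, high], axis=1)  # combine both ends of the spectrum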
Figure 2: (a) Stochastic block model (SBM); (b) Barabási–Albert model (BA). For (a) SBM, nodes are grouped by the assigned community; for (b) BA, nodes are grouped by degree. The integer $i$ on each node represents the $i$th node queried by the proposed algorithm in one replication.

B.2 Visualization

In Figure 2, we visualize the node query process on synthetic networks generated using SBM and BA, as described in Section 5.1. The figure clearly demonstrates that nodes queried by the proposed algorithm adapt to the informativeness criterion specific to each network topology, effectively aligning with the community structure in SBM and the scale-free structure in BA.

B.3 Ablation study

Figure 3: Ablation study: (a) the condition number (log scale) of the design matrix of query nodes selected by the proposed method and by random sampling; the effectiveness of (b) representative sampling and (c) incorporating covariate information in Algorithm 1.

To gain deeper insights into the respective roles of representative sampling and informative selection in the proposed algorithm, we conduct additional experiments on a New Jersey public school social network dataset, School [24], which was originally collected to study the impact of educational workshops on reducing conflicts in schools. Because School is not a benchmark dataset in the active learning literature, we did not compare our method against the other baselines on it in Section 5.2, to ensure fairness. In this dataset with $n=615$ nodes, each node represents an individual student, and edges denote friendships among students. We treat the students' grade point averages (GPA) as the node responses and select $p=5$ student features—grade level, race, and three binary survey responses—as node covariates using a standard forward selection approach.

As shown in Section 3.4, the representative sampling in steps 1 and 2 of Algorithm 1 is essential to control the condition number of the design matrix and, consequently, the prediction error given noisy network data. Figure 3(a) reports the condition number $\frac{\lambda_{\max}(\tilde{X}^{T}_{\mathcal{S}}W_{S}\tilde{X}_{\mathcal{S}})}{\lambda_{\min}(\tilde{X}^{T}_{\mathcal{S}}W_{S}\tilde{X}_{\mathcal{S}})}$ obtained by the proposed method and compares it with that of random selection. With $m=200$, the proposed algorithm achieves a significantly lower condition number than random selection, especially when the number of queries is small. On the Citeseer dataset, we investigate the prediction performance of Algorithm 1 when removing steps 1 and 2, i.e., setting the candidate set $B_{m}=\mathcal{S}^{c}_{t-1}$ for the $t$th selection. Figure 3(b) shows that, with representative sampling, the Macro-F1 score is consistently higher, with a performance gap of up to 15%. Given that node classification on Citeseer is known to be sensitive to labeling noise [38], this result validates the effectiveness of representative sampling in improving the robustness of our query strategy to data noise.
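The condition-number diagnostic in Figure 3(a) can be computed directly from the queried rows of the transformed design matrix. The sketch below is a minimal illustration; X_tilde_S (the selected rows of $\tilde{X}$) and w (the sampling weights on the diagonal of $W_{S}$) are placeholder inputs, not objects from the paper's implementation.

import numpy as np

def condition_number(X_tilde_S, w):
    # X_tilde_S: (k, p) rows of the transformed design matrix for queried nodes.
    # w: (k,) nonnegative sampling weights, i.e., the diagonal of W_S.
    A = X_tilde_S.T @ (w[:, None] * X_tilde_S)   # X_S^T W_S X_S without building W_S
    eigvals = np.linalg.eigvalsh(A)              # ascending eigenvalues
    return eigvals[-1] / eigvals[0]

# Example with random placeholders: k = 20 queried nodes, p = 5 covariates.
rng = np.random.default_rng(0)
kappa = condition_number(rng.normal(size=(20, 5)), rng.uniform(0.5, 1.5, size=20))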

In addition, we examine the ability of the proposed method to integrate node covariates to improve prediction performance. On the School dataset, we compare our method to a variant that removes node covariates during the query stage by setting $\mathbf{X}$ to the identity matrix $\mathbf{I}$. Figure 3(c) shows that the prediction MSE for GPA is significantly lower when node covariates are incorporated, which distinguishes our node query strategy from existing graph signal recovery methods [15] that do not account for node covariate information.

B.4 Code

The implementation code for the proposed algorithm is available at
github.com/Yuanchen-Wu/RobustActiveLearning/.