
Robust Offline Active Learning on Graphs

Yuanchen Wu (e-mail: yqw5734@psu.edu), Department of Statistics, The Pennsylvania State University; Yubai Yuan (corresponding author, e-mail: yvy5509@psu.edu), Department of Statistics, The Pennsylvania State University
Abstract

We consider the problem of active learning on graphs for node-level tasks, which has crucial applications in many real-world networks where labeling node responses is expensive. In this paper, we propose an offline active learning method that selects nodes to query by explicitly incorporating information from both the network structure and node covariates. Building on graph signal recovery theories and the random spectral sparsification technique, the proposed method adopts a two-stage biased sampling strategy that takes both informativeness and representativeness into consideration for node querying. Informativeness refers to the complexity of graph signals that are learnable from the responses of queried nodes, while representativeness refers to the capacity of queried nodes to control generalization errors given noisy node-level information. We establish a theoretical relationship between generalization error and the number of nodes selected by the proposed method. Our theoretical results demonstrate the trade-off between informativeness and representativeness in active learning. Extensive numerical experiments show that the proposed method is competitive with existing graph-based active learning methods, especially when node covariates and responses contain noise. Additionally, the proposed method is applicable to both regression and classification tasks on graphs.

Key words: Offline active learning, graph semi-supervised learning, graph signal recovery, network sampling

1 Introduction

In many graph-based semi-supervised learning tasks for node-level prediction, labeled nodes are scarce, and the labeling process often incurs high costs in real-world applications. Randomly sampling nodes for labeling can be inefficient, as it overlooks label dependencies across the network. Active learning [29] addresses this issue by selecting informative nodes for labeling by human annotators, thereby improving the performance of downstream prediction algorithms.

Active learning is closely related to the optimal experimental design principle [1] in statistics. Traditional optimal experimental design methods select samples to maximize a specific statistical criterion [25, 11]. However, these methods are often not designed to incorporate network structure and are therefore inefficient for graph-based learning tasks. On the other hand, selecting informative nodes on a network has been studied extensively in the graph signal sampling literature [14, 22, 26, 8]. These strategies are typically based on the principle of network homophily, which assumes that connected nodes tend to have similar labels. However, a node’s label often also depends on its individual covariates. Therefore, signal-sampling strategies that focus solely on network information may miss critical insights provided by covariates.

Recently, inspired by the great success of graph neural networks (GNNs) [19, 35] in graph-based machine learning tasks, many GNN-based active learning strategies have been proposed. Existing methods select nodes to query by maximizing information gain under different criteria, including information entropy [7], the number of influenced nodes [39], prediction uncertainty [23], expected error reduction [27], and expected model change [30]. Most of these information gain measurements are defined in the spatial domain, leveraging the message-passing framework of GNNs to incorporate both network structure and covariate information. However, their effectiveness in maximizing learning outcomes is not guaranteed and can be difficult to evaluate. This challenge arises from the difficulty of quantifying node labeling complexity in the spatial domain due to intractable network topologies. While complexity measures exist for binary classification over networks [10], their extension to more complex graph signals incorporating node covariates remains unclear. This lack of well-defined complexity measures complicates performance analysis and creates a misalignment between graph-based information measurements and the gradient used to search the labeling function space, potentially leading to sub-optimal node selection.

Moreover, from a practical perspective, most of the previously discussed methods operate in an online setting, requiring prompt labeling feedback from an external annotator. However, this online framework is not always feasible when computational resources are limited [28] or when recurrent interaction between the algorithm and the annotator is impractical, such as in remote sensing or online marketing tasks [33, 37]. Additionally, both network data and annotator-provided labels may contain measurement errors. These methods often fail to account for noise in the training data [18], which can significantly degrade the prediction performance of models on unlabeled nodes [12, 6].

To address these challenges, we propose an offline active learning on graphs framework for node-level prediction tasks. Inspired by the theory of graph signal recovery [14, 22, 8] and GNNs, we first introduce a graph function space that integrates both node covariate information and network topology. The complexity of the node labeling function within this space is well-defined in the graph spectral domain. Accordingly, we propose a query information gain measurement aligned with the spectral-based complexity, allowing our strategy to achieve theoretically optimal sample complexity.

Building on this, we develop a greedy node query strategy. The labels of the queried nodes help identify orthogonal components of the target labeling function, each with varying levels of smoothness across the network. To address data noise, the query procedure considers both informativeness—the contribution of queried nodes in recovering non-smooth components of a signal—and representativeness—the robustness of predictions against noise in the training data. Compared to existing methods, the proposed approach provides a provably effective strategy under general network structures and achieves higher query efficiency by incorporating both network and node covariate information.

The proposed method identifies the labeling function via a bottom-up strategy: it first identifies the smoother components of the labeling function and then proceeds to more oscillatory components. Consequently, the proposed method is naturally robust to high-frequency noise in node covariates. We provide a theoretical guarantee for the effectiveness of the proposed method in semi-supervised learning tasks; the generalization error bound holds even when the node labels are noisy. Our theoretical results also highlight an interesting trade-off between informativeness and representativeness in graph-based active learning.

2 Preliminaries

We consider an undirected, weighted, connected graph \mathbf{G}=\{\mathbf{V},\mathbf{A}\}, where \mathbf{V}=\{1,2,\cdots,n\} is the set of n nodes, and \mathbf{A}\in\mathbb{R}^{n\times n} is the symmetric adjacency matrix, with element a_{ij}\geq 0 denoting the edge weight between nodes i and j. The degree matrix is defined as \mathbf{D}=\text{diag}\{d_{1},d_{2},\cdots,d_{n}\}, where d_{i}=\sum_{1\leq j\leq n}a_{ij} denotes the degree of node i. Additionally, we observe the node response vector \mathbf{Y}\in\mathbb{R}^{n\times 1} and the node covariate matrix \mathbf{X}=(X_{1},\cdots,X_{p})\in\mathbb{R}^{n\times p}, where the i^{th} row, \mathbf{X}_{i\cdot}, is the p-dimensional covariate vector for node i. The linear space of all linear combinations of \{X_{1},\cdots,X_{p}\} is denoted as \text{Span}\{X_{1},\cdots,X_{p}\}. The normalized graph Laplacian matrix is defined as \mathcal{L}=\mathbf{I}-\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}, where \mathbf{I} is the n\times n identity matrix. The matrix \mathcal{L} is symmetric and positive semi-definite, with n real eigenvalues satisfying 0=\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n}\leq 2, and a corresponding set of eigenvectors denoted by \mathbf{U}=\{U_{1},U_{2},\cdots,U_{n}\}. We use b=\mathcal{O}(a) to indicate |b|\leq M|a| for some M>0. For a set of nodes \mathcal{S}\subset\mathbf{V}, |\mathcal{S}| denotes its cardinality, and \mathcal{S}^{c}=\mathbf{V}\backslash\mathcal{S} denotes the complement of \mathcal{S}.

2.1 Graph signal representation

Consider a graph signal \mathbf{f}\in\mathbb{R}^{n}, where \mathbf{f}(i) denotes the signal value at node i. For a set of nodes \mathcal{S}, we define the subspace \mathbf{L}_{\mathcal{S}}:=\{\mathbf{f}\in\mathbb{R}^{n}\mid\mathbf{f}(\mathcal{S}^{c})=0\}, where \mathbf{f}(\mathcal{S})\in\mathbb{R}^{|\mathcal{S}|} represents the values of \mathbf{f} on nodes in \mathcal{S}. In this paper, we consider both regression tasks, where \mathbf{f}(i) is a continuous response, and classification tasks, where \mathbf{f}(i) is a multi-class label.

Since 𝐔\mathbf{U} serves as a set of bases for n\mathbb{R}^{n}, we can decompose 𝐟\mathbf{f} in the graph spectral domain as 𝐟=j=1nα𝐟(λj)Uj\mathbf{f}=\sum_{j=1}^{n}\alpha_{\mathbf{f}}(\lambda_{j})U_{j}, where α𝐟(λj)=𝐟,Uj\alpha_{\mathbf{f}}(\lambda_{j})=\langle\mathbf{f},U_{j}\rangle is defined as the graph Fourier transform (GFT) coefficient corresponding to frequency λj\lambda_{j}. From a graph signal processing perspective, a smaller eigenvalue λk\lambda_{k} indicates lower variation in the associated eigenvector UkU_{k}, reflecting smoother transitions between neighboring nodes. Therefore, the smoothness of 𝐟\mathbf{f} over the network can be characterized by the magnitude of α𝐟(λj)\alpha_{\mathbf{f}}(\lambda_{j}) at each frequency λj\lambda_{j}. More formally, we measure the signal complexity of 𝐟\mathbf{f} using the bandwidth frequency ω𝐟=sup{λj|α𝐟(λj)>0}\omega_{\mathbf{f}}=\sup\{\lambda_{j}|\alpha_{\mathbf{f}}(\lambda_{j})>0\}. Accordingly, we define the subspace of graph signals with a bandwidth frequency less than or equal to ω\omega as 𝐋ω:={𝐟nω𝐟ω}\mathbf{L}_{\omega}:=\{\mathbf{f}\in\mathbb{R}^{n}\mid\omega_{\mathbf{f}}\leq\omega\}. It follows directly that ω1<ω2,𝐋ω1𝐋ω2\forall\omega_{1}<\omega_{2},\mathbf{L}_{\omega_{1}}\subset\mathbf{L}_{\omega_{2}}.
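For concreteness, the following minimal numpy sketch (an illustration we add here; the function name and the numerical tolerance are ours) computes the normalized Laplacian, its eigendecomposition, the GFT coefficients of a signal, and the bandwidth frequency ω_f, using |α_f(λ_j)| > tol as the numerical analogue of α_f(λ_j) > 0.

```python
import numpy as np

def gft_and_bandwidth(A, f, tol=1e-10):
    """GFT coefficients and bandwidth frequency of a signal f on a graph with
    dense adjacency matrix A (assumed connected, so all degrees are positive)."""
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    lam, U = np.linalg.eigh(L)                             # 0 = lam_1 <= ... <= lam_n <= 2
    alpha = U.T @ f                                        # alpha_f(lambda_j) = <f, U_j>
    active = np.abs(alpha) > tol                           # numerically non-zero coefficients
    omega_f = lam[active].max() if active.any() else 0.0   # bandwidth frequency
    return lam, U, alpha, omega_f
```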

2.2 Active semi-supervised learning on graphs

The key idea in graph-based semi-supervised learning is to reconstruct the graph signal \mathbf{f} within a function space \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) that depends on both the network structure and node-wise covariates, where the frequency parameter \omega controls the size of the space to mitigate overfitting. Assuming that Y_{i} is the observed noisy realization of the true signal \mathbf{f}(i) at node i, active learning operates in a scenario where we have access to Y_{i} on only a subset of nodes \mathcal{S}, with |\mathcal{S}|\ll n. The objective is to estimate \mathbf{f} within \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) using \{Y_{i}\}_{i\in\mathcal{S}} by considering the empirical estimator of \mathbf{f} as

𝐟𝒮=argmin𝐠𝐇ω(𝐗,𝐀)i𝒮l(Yi,𝐠(i)),\displaystyle\mathbf{f}_{\mathcal{S}}=\operatorname*{argmin}_{\mathbf{g}\in\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A})}\sum_{i\in\mathcal{S}}l\big{(}Y_{i},\mathbf{g}(i)\big{)}, (1)

where l()l(\cdot) is a task-specific loss function. We denote 𝐟\mathbf{f}^{*} as the minimizer of (1) when responses on all nodes are available, i.e., 𝐟=𝐟𝐕\mathbf{f}^{*}=\mathbf{f}_{\mathbf{V}}. The goal of active semi-supervised learning is to design an appropriate function space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) and select an informative subset of nodes 𝒮\mathcal{S} for querying responses, under the query budget |𝒮||\mathcal{S}|\leq\mathcal{B}, such that the estimation error is bounded as follows:

𝐟𝒮𝐟22ρ𝐟𝐟22\displaystyle\|\mathbf{{f}}_{\mathcal{S}}-\mathbf{f}^{*}\|_{2}^{2}\leq\rho\|\mathbf{f}^{*}-\mathbf{f}\|_{2}^{2}

For a fixed \mathcal{B}, we wish to minimize the parameter ρ>0\rho>0, which converges to 0 as the query budget \mathcal{B} approaches nn.

3 Biased Sequential Sampling

In this section, we introduce a function space for recovering the graph signal. Leveraging this function space, we propose an offline node query strategy that integrates criteria of both node informativeness and representativeness to infer the labels of unannotated nodes in the network.

3.1 Graph signal function space

In semi-supervised learning tasks on networks, both the network topology and node-wise covariates are crucial for inferring the graph signal. To effectively incorporate this information, we propose a function class for reconstructing the graph signal that lies at the intersection of the graph spectral domain and the space of node covariates. Motivated by the graph Fourier transform, we define the following function class:

𝐇ω(𝐗,𝐀)=\displaystyle\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A})= Proj𝐋ωSpan(𝐗):=Span{Proj𝐋ωX1,,Proj𝐋ωXp},\displaystyle\text{Proj}_{\mathbf{L}_{\omega}}\text{Span}(\mathbf{X}):=\text{Span}\{\text{Proj}_{\mathbf{L}_{\omega}}X_{1},\cdots,\text{Proj}_{\mathbf{L}_{\omega}}X_{p}\},
whereProj𝐋ωXi=j:λjωXi,UjUj.\displaystyle\text{where}\;\text{Proj}_{\mathbf{L}_{\omega}}X_{i}=\sum_{j:\lambda_{j}\leq\omega}\langle X_{i},U_{j}\rangle U_{j}.

Here, the choice of \omega balances the information from node covariates and network structure. When \omega=2, \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) spans the full column space of the covariates, i.e., \text{Span}\{X_{1},\cdots,X_{p}\}, allowing full utilization of the original covariate space to estimate the graph signal, but without incorporating any network information. On the other hand, when \omega is close to zero (consider, for example, the extreme case where |\{U_{j}\mid\lambda_{j}\leq\omega\}|=2 and p\gg 2), each \text{Proj}_{\mathbf{L}_{\omega}}X_{i} lies in \text{Span}\{U_{1},U_{2}\}, so \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) collapses to a subspace of \text{Span}\{U_{1},U_{2}\} and critical information provided by the original \mathbf{X} is lost.

By carefully choosing ω\omega, however, this function space can offer two key advantages for estimating the graph signal. From a signal recovery perspective, 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) imposes graph-based regularization over node covariates, enhancing generalizability when the dimension of covariates pp exceeds the query budget or even the network size—conditions commonly encountered in real applications. Additionally, covariate smoothing filters out signals in the covariates that are irrelevant to network-based prediction, thereby increasing robustness against potential noise in the covariates. From an active learning perspective, using 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) enables a bottom-up query strategy that begins with a small ω\omega to capture the smoothest global trends in the graph signal. As the labeling budget increases, ω\omega is adaptively increased to capture more complex graph signals within the larger space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}).

The graph signal \mathbf{f} can be approximated by its projection onto the space \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}). Specifically, let \mathbf{U}_{d}=\{U_{1},U_{2},\cdots,U_{d}\}\in\mathbb{R}^{n\times d} stack the d leading eigenvectors of \mathcal{L}, where d=\max\{1\leq j\leq n:\lambda_{j}\leq\omega\}. The graph signal estimate is then given by \mathbf{U}_{d}\mathbf{U}^{T}_{d}\mathbf{X}\beta, where \beta\in\mathbb{R}^{p} is a trainable weight vector. However, the parameters \beta may become unidentifiable when the covariate dimension p exceeds d. To address this issue, we reparameterize the linear regression as follows:

𝐔d𝐔dT𝐗β=𝑿~β~,\displaystyle\mathbf{U}_{d}\mathbf{U}^{T}_{d}\mathbf{X}{\beta}=\tilde{\bm{X}}\tilde{\beta}, (2)

where \tilde{\beta}=\Sigma V_{2}^{T}\beta and \tilde{\mathbf{X}}=\mathbf{U}_{d}V_{1}. Here, V_{1}\in\mathbb{R}^{d\times r}, V_{2}\in\mathbb{R}^{p\times r}, and \Sigma\in\mathbb{R}^{r\times r} denote the left singular vectors, the right singular vectors, and the diagonal matrix of the r non-zero singular values of \mathbf{U}^{T}_{d}\mathbf{X}, respectively, obtained from its singular value decomposition \mathbf{U}^{T}_{d}\mathbf{X}=V_{1}\Sigma V_{2}^{T}.

In the reparameterized form (2)(\ref{reg_2}), the columns of 𝐗~\tilde{\mathbf{X}} serve as bases for 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}), thus
dim(𝐇ω(𝐗,𝐀))=rank(𝐗~)=rmin{d,p}\text{dim}(\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}))=\text{rank}(\tilde{\mathbf{X}})=r\leq\min\{d,p\}. The transformed predictors 𝐗~\tilde{\mathbf{X}} capture the components of the node covariates constrained within the low-frequency graph spectrum. A graph signal 𝐟𝐇ω(𝐗,𝐀)\mathbf{f}\in\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) can be parameterized as a linear combination of the columns of 𝐗~\tilde{\mathbf{X}}, with the corresponding weights β~\tilde{\beta} identified via

β^=argminβ~i𝒮(𝐟(i)(𝐗~𝐒)iβ~)2\displaystyle\hat{\beta}=\operatorname*{argmin}_{\tilde{\beta}}\sum_{i\in\mathcal{S}}\big{(}\mathbf{f}(i)-(\tilde{\mathbf{X}}_{\mathbf{S}})_{i\cdot}\;\tilde{\beta}\big{)}^{2} (3)

where 𝐗~𝐒R|𝒮|×r\tilde{\mathbf{X}}_{\mathbf{S}}\in\mathrm{R}^{|\mathcal{S}|\times r} is the submatrix of 𝑿~\tilde{\bm{X}} containing rows indexed by the query set 𝒮\mathcal{S}, and {𝐟(i)}i𝒮\{\mathbf{f}(i)\}_{i\in\mathcal{S}} represents the true labels for nodes in 𝒮\mathcal{S}. To achieve the identification of 𝐟\mathbf{f}, it is necessary that |𝒮|r|\mathcal{S}|\geq r; otherwise, there will be more parameters than equations in (3). More importantly, since rank(𝐗~𝐒)rank(𝐗~)=r\text{rank}(\tilde{\mathbf{X}}_{\mathbf{S}})\leq\text{rank}(\tilde{\mathbf{X}})=r, 𝐟\mathbf{f} is only identifiable if 𝐗~𝐒\tilde{\mathbf{X}}_{\mathbf{S}} has full column rank. Notice that rr increases monotonically with ω\omega. If 𝒮\mathcal{S} is not carefully selected, the graph signal can only be identified in 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega^{\prime}}(\mathbf{X},\mathbf{A}) for some ω<ω\omega^{\prime}<\omega, which is a subspace of 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}).
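As a concrete illustration of this construction, the short sketch below (helper names are ours, and the eigendecomposition (λ, U) of \mathcal{L} is assumed to be available, e.g., from the sketch in Section 2.1) builds the basis \tilde{\mathbf{X}}=\mathbf{U}_{d}V_{1} from the SVD of \mathbf{U}_{d}^{T}\mathbf{X} and solves the least-squares problem (3) on a query set \mathcal{S}.

```python
import numpy as np

def build_constrained_basis(U, lam, X, omega, tol=1e-10):
    """Orthonormal basis X_tilde of H_omega(X, A): project the covariates onto the
    eigenvectors with frequency <= omega and orthonormalize via the SVD of U_d^T X."""
    d = int(np.sum(lam <= omega))                 # number of frequencies below the cutoff
    U_d = U[:, :d]
    V1, sing, _ = np.linalg.svd(U_d.T @ X, full_matrices=False)
    r = int(np.sum(sing > tol))                   # r = rank(U_d^T X)
    return U_d @ V1[:, :r]                        # columns span H_omega(X, A)

def identify_signal(X_tilde, f_obs, S):
    """Least-squares identification (3) of the weights from labels on the query set S,
    followed by extrapolation to all nodes."""
    beta_hat, *_ = np.linalg.lstsq(X_tilde[S, :], f_obs[S], rcond=None)
    return X_tilde @ beta_hat
```

As noted above, the fit is only unique when \tilde{\mathbf{X}}_{\mathcal{S}} has full column rank, which motivates the query strategy of Section 3.2.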

3.2 Informative node selection

We first define the identification of 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) by the node query set 𝒮\mathcal{S} as follows:

Definition 1.

A subset of nodes 𝒮𝐕\mathcal{S}\subset\mathbf{V} can identify the graph signal space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) up to frequency ω\omega if, for any two functions 𝐟1,𝐟2𝐇ω(𝐗,𝐀)\mathbf{f}_{1},\mathbf{f}_{2}\in\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) such that 𝐟1(i)=𝐟2(i)\mathbf{f}_{1}(i)=\mathbf{f}_{2}(i) for all i𝒮i\in\mathcal{S}, it follows that 𝐟1(j)=𝐟2(j)\mathbf{f}_{1}(j)=\mathbf{f}_{2}(j) for all j𝐕j\in\mathbf{V}.

Intuitively, the informativeness of a set 𝒮\mathcal{S} can be quantified by the frequency ω\omega corresponding to the space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) that can be identified. To select informative nodes, we need to bridge the query set 𝒮\mathcal{S} in the spatial domain with ω\omega in the spectral domain. To achieve this, we consider the counterpart of the function space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) in the spatial domain. Specifically, we introduce the projection space with respect to a subset of nodes 𝒮\mathcal{S} as follows: 𝐇𝒮(𝐗,𝐀):=Span{X1(𝒮),,Xp(𝒮)}\mathbf{H}_{\mathcal{S}}(\mathbf{X},\mathbf{A}):=\text{Span}\{{X}^{(\mathcal{S})}_{1},\cdots,{X}^{(\mathcal{S})}_{p}\}, where

Xp(𝒮)(i)={Xp(i)if i𝒮𝟎if i𝒮c\displaystyle{X}^{(\mathcal{S})}_{p}(i)=\begin{cases}{X}_{p}(i)&\text{if }i\in\mathcal{S}\\ \mathbf{0}&\text{if }i\in\mathcal{S}^{c}\end{cases}

Here, Xp(i){X}_{p}(i) denotes the pp-th covariate for node ii in 𝐗\mathbf{X}. Theorem 3.1 establishes a connection between the two graph signal spaces 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) and 𝐇𝒮(𝐗,𝐀)\mathbf{H}_{\mathcal{S}}(\mathbf{X},\mathbf{A}), providing a metric for evaluating the informativeness of querying a subset of nodes on the graph.

Theorem 3.1.

Any graph signal 𝐟𝐇ω(𝐗,𝐀)\mathbf{f}\in\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) can be identified using labels on a subset of nodes 𝒮\mathcal{S} if and only if:

ω<ω(𝒮):=inf𝐠𝐇𝒮c(𝐗,𝐀)ω𝐠,\displaystyle\omega<\omega(\mathcal{S}):=\operatorname*{inf}_{\mathbf{g}\in\mathbf{H}_{\mathcal{S}^{c}}(\mathbf{X},\mathbf{A})}\omega_{\mathbf{g}}, (4)

where 𝒮c\mathcal{S}^{c} denotes the complement of 𝒮\mathcal{S} in 𝐕\mathbf{V}.

We denote the quantity ω(𝒮)\omega(\mathcal{S}) in (4) as the bandwidth frequency with respect to the node set 𝒮\mathcal{S}. This quantity can be explicitly calculated and measures the size of the space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) that can be recovered from the subset of nodes 𝒮\mathcal{S}. The goal of the active learning strategy is to select 𝒮\mathcal{S} within a given budget to maximize the bandwidth frequency ω(𝒮)\omega(\mathcal{S}), thus enabling the identification of graph signals with the highest possible complexity.

To calculate the bandwidth frequency ω(𝒮)\omega(\mathcal{S}), consider any graph signal 𝐠\mathbf{g} and its components with non-zero frequency Λ𝐠:={λiα𝐠(λi)>0}\Lambda_{\mathbf{g}}:=\{\lambda_{i}\mid\alpha_{\mathbf{g}}(\lambda_{i})>0\}. We use the fact that

limk(j:λjΛ𝐠cjλjk)1/k=maxλjΛ𝐠(λj),\lim_{k\rightarrow\infty}\left(\sum_{j:\lambda_{j}\in\Lambda_{\mathbf{g}}}c_{j}\lambda_{j}^{k}\right)^{1/k}=\max_{\lambda_{j}\in\Lambda_{\mathbf{g}}}(\lambda_{j}),

where j:λjΛ𝐠cj=1\sum_{j:\lambda_{j}\in\Lambda_{\mathbf{g}}}c_{j}=1 and 0cj10\leq c_{j}\leq 1. Combined with the Rayleigh quotient representation of eigenvalues, the bandwidth frequency ω𝐠\omega_{\mathbf{g}} can be calculated as

ω𝐠=limkω𝐠(k),whereω𝐠(k)=(𝐠Tk𝐠𝐠T𝐠)1/k.\omega_{\mathbf{g}}=\lim_{k\rightarrow\infty}\omega_{\mathbf{g}}(k),\quad\text{where}\quad\omega_{\mathbf{g}}(k)=\left(\frac{\mathbf{g}^{T}\mathcal{L}^{k}\mathbf{g}}{\mathbf{g}^{T}\mathbf{g}}\right)^{1/k}.

As a result, we can approximate the bandwidth ω𝐠\omega_{\mathbf{g}} using ω𝐠(k)\omega_{\mathbf{g}}(k) for a large kk. Maximizing ω(𝒮)\omega(\mathcal{S}) over 𝒮\mathcal{S} then transforms into the following optimization problem:

𝒮=argmax𝒮:|𝒮|ω^(𝒮),whereω^(𝒮):=inf𝐠𝐇𝒮c(𝐗,𝐀)ω𝐠k(k),\displaystyle\mathcal{S}=\operatorname*{argmax}_{\mathcal{S}:|\mathcal{S}|\leq\mathcal{B}}\hat{\omega}(\mathcal{S}),\quad\text{where}\quad\hat{\omega}(\mathcal{S}):=\operatorname*{inf}_{\mathbf{g}\in\mathbf{H}_{\mathcal{S}^{c}}(\mathbf{X},\mathbf{A})}\omega^{k}_{\mathbf{g}}(k), (5)

where \mathcal{B} represents the budget for querying labels. Due to the combinatorial complexity of solving the optimization problem (5) directly, which requires selecting all of \mathcal{S} simultaneously, we propose a greedy selection strategy based on a continuous relaxation of (5).
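Before turning to the greedy procedure, the Rayleigh-quotient approximation ω_g(k) introduced above can be sketched in a few lines (the function name and the default k are ours):

```python
import numpy as np

def bandwidth_estimate(L, g, k=20):
    """Approximate bandwidth omega_g(k) = (g^T L^k g / g^T g)^(1/k); the estimate
    tightens toward the true bandwidth omega_g as k grows."""
    v = g.astype(float).copy()
    for _ in range(k):                 # L^k g via repeated matrix-vector products
        v = L @ v
    return (float(g @ v) / float(g @ g)) ** (1.0 / k)
```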

The selection procedure starts with 𝒮=\mathcal{S}=\emptyset and sequentially adds one node to 𝒮\mathcal{S} that maximizes the increase in ω(𝒮)\omega(\mathcal{S}) until the budget is reached. We introduce an nn-dimensional vector 𝐭=(t1,t2,,tn)T\mathbf{t}=(t_{1},t_{2},\cdots,t_{n})^{T} with 0ti10\leq t_{i}\leq 1, and define the corresponding diagonal matrix D(𝐭)D({\mathbf{t}}) with diagonal entries given by 𝐭\mathbf{t}. This allows us to encode the set of query nodes using 𝐭=𝟏𝒮\mathbf{t}=\mathbf{1}_{\mathcal{S}}, where 𝟏𝒮(i)=1\mathbf{1}_{\mathcal{S}}(i)=1 if i𝒮i\in\mathcal{S} and 𝟏𝒮(i)=0\mathbf{1}_{\mathcal{S}}(i)=0 if i𝒮ci\in\mathcal{S}^{c}. We then consider the space spanned by the columns of D(𝐭)𝐗D(\mathbf{t})\mathbf{X} as Span{D(𝐭)𝐗}\text{Span}\{D(\mathbf{t})\mathbf{X}\}, and the following relation holds:

𝐇𝒮c(𝐗,𝐀)=Span{D(𝟏𝒮c)𝐗}.\displaystyle\mathbf{H}_{\mathcal{S}^{c}}(\mathbf{X},\mathbf{A})=\text{Span}\{D(\mathbf{1}_{\mathcal{S}^{c}})\mathbf{X}\}.

Intuitively, Span{D(𝐭)𝐗}\text{Span}\{D(\mathbf{t})\mathbf{X}\} acts as a differentiable relaxation of the subspace 𝐇𝒮(𝐗,𝐀)\mathbf{H}_{\mathcal{S}}(\mathbf{X},\mathbf{A}), enabling perturbation analysis of the bandwidth frequency when a new node is added to 𝒮\mathcal{S}. The projection operator associated with Span{D(𝐭)𝐗}\text{Span}\{D(\mathbf{t})\mathbf{X}\} can be explicitly expressed as

𝐏(𝐭)=D(𝐭)𝐗(𝐗TD(𝐭2)𝐗)1𝐗TD(𝐭).\mathbf{P}(\mathbf{t})=D(\mathbf{t})\mathbf{X}\left(\mathbf{X}^{T}D(\mathbf{t}^{2})\mathbf{X}\right)^{-1}\mathbf{X}^{T}D(\mathbf{t}).

To quantify the increase in ω^(𝒮)\hat{\omega}(\mathcal{S}) when adding a new node to 𝒮\mathcal{S}, we consider the following regularized optimization problem:

λα(𝐭)=minϕϕTkϕϕTϕ+αϕT(𝐈𝐏(𝐭))ϕϕTϕ.\displaystyle\lambda_{\alpha}(\mathbf{t})=\min_{\phi}\frac{\phi^{T}\mathcal{L}^{k}\phi}{\phi^{T}\phi}+\alpha\frac{\phi^{T}\left(\mathbf{I}-\mathbf{P}(\mathbf{t})\right)\phi}{\phi^{T}\phi}. (6)

The penalty term on the right-hand side of (6) encourages the graph signal ϕ\phi to remain in 𝐇𝒮c(𝐗,𝐀)\mathbf{H}_{\mathcal{S}^{c}}(\mathbf{X},\mathbf{A}). As the parameter α\alpha approaches infinity and 𝐭=𝟏𝒮c\mathbf{t}=\mathbf{1}_{\mathcal{S}^{c}}, the minimization λα(𝟏𝒮c)\lambda_{\alpha}(\mathbf{1}_{\mathcal{S}^{c}}) in (6) converges to ω^(𝒮)\hat{\omega}(\mathcal{S}) in (5). The information gain from labeling a node i𝒮ci\in\mathcal{S}^{c} can then be quantified by the gradient of the bandwidth frequency as tit_{i} decreases from 1 to 0:

Δi:=λα(𝐭)ti|𝐭=𝟏𝒮c=2α×ϕT𝐏(𝐭)tiϕ|𝐭=𝟏𝒮c,\displaystyle\Delta_{i}:=-\frac{\partial\lambda_{\alpha}(\mathbf{t})}{\partial t_{i}}\Big{|}_{\mathbf{t}=\mathbf{1}_{\mathcal{S}^{c}}}=2\alpha\times\phi^{T}\frac{\partial\mathbf{P}(\mathbf{t})}{\partial t_{i}}\phi\Big{|}_{\mathbf{t}=\mathbf{1}_{\mathcal{S}^{c}}}, (7)

where ϕ\phi is the minimizer of (6) at 𝐭=𝟏𝒮c\mathbf{t}=\mathbf{1}_{\mathcal{S}^{c}}, which corresponds to the eigenvector associated with the smallest non-zero eigenvalue of the matrix 𝐏(𝟏𝒮c)k𝐏(𝟏𝒮c)\mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}})\mathcal{L}^{k}\mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}}). We then select the node i=argmaxj𝒮cΔji=\operatorname*{argmax}_{j\in\mathcal{S}^{c}}\Delta_{j} and update the query set as 𝒮=𝒮{i}\mathcal{S}=\mathcal{S}\cup\{i\}.

When calculating node-wise informativeness in (7), we can enhance computational efficiency by avoiding the inversion of D(𝐭)D(\mathbf{t}). When 𝐭\mathbf{t} is in the neighborhood of 𝟏𝒮c\mathbf{1}_{\mathcal{S}^{c}}, we can approximate:

𝐏(𝐭)D(𝐭)𝐗𝒮c(𝐗𝒮cT𝐗𝒮c)1𝐗𝒮cTD(𝐭)=D(𝐭)𝐙𝒮c𝐙𝒮cTD(𝐭),\displaystyle\mathbf{P}(\mathbf{t})\approx D(\mathbf{t})\mathbf{X}_{\mathcal{S}^{c}}(\mathbf{X}_{\mathcal{S}^{c}}^{T}\mathbf{X}_{\mathcal{S}^{c}})^{-1}\mathbf{X}_{\mathcal{S}^{c}}^{T}D(\mathbf{t})=D(\mathbf{t})\mathbf{Z}_{\mathcal{S}^{c}}\mathbf{Z}_{\mathcal{S}^{c}}^{T}D(\mathbf{t}),

where 𝐗𝒮c=((X1)𝒮c,,(Xp)𝒮c)\mathbf{X}_{\mathcal{S}^{c}}=((X_{1})_{\mathcal{S}^{c}},\cdots,(X_{p})_{\mathcal{S}^{c}}) and 𝐙𝒮c=𝐗𝒮c(𝐗𝒮cT𝐗𝒮c)1/2\mathbf{Z}_{\mathcal{S}^{c}}=\mathbf{X}_{\mathcal{S}^{c}}(\mathbf{X}_{\mathcal{S}^{c}}^{T}\mathbf{X}_{\mathcal{S}^{c}})^{-1/2}. Then, the node-wise informativeness can be explicitly expressed as:

Δitiϕi2(𝐙𝒮c)i(𝐙𝒮cT)i+ji,1jntitjϕiϕj(𝐙𝒮c)i(𝐙𝒮cT)j.\displaystyle\Delta_{i}\propto t_{i}\phi_{i}^{2}(\mathbf{Z}_{\mathcal{S}^{c}})_{i\cdot}(\mathbf{Z}_{\mathcal{S}^{c}}^{T})_{i\cdot}+\sum_{j\neq i,1\leq j\leq n}t_{i}t_{j}\phi_{i}\phi_{j}(\mathbf{Z}_{\mathcal{S}^{c}})_{i\cdot}(\mathbf{Z}_{\mathcal{S}^{c}}^{T})_{j\cdot}. (8)

We find that this approximation yields very similar empirical performance compared to the exact formulation in (7). Therefore, we adopt the formulation in (8) for the subsequent numerical experiments.
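A minimal numpy sketch of this informativeness computation is given below (the function name and tolerances are ours). It follows the approximation (8): build the projection onto \mathbf{H}_{\mathcal{S}^{c}}(\mathbf{X},\mathbf{A}), take the eigenvector \phi of the smallest non-zero eigenvalue of \mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}})\mathcal{L}^{k}\mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}}), and score every unqueried node.

```python
import numpy as np

def informativeness_scores(L, X, S, k=10, tol=1e-8):
    """Node-wise information gain Delta_i (up to a positive constant), following (8).
    S is the current query set; scores are returned for candidate nodes in S^c."""
    n = X.shape[0]
    in_Sc = np.ones(n, dtype=bool)
    in_Sc[list(S)] = False
    Xc = X.copy()
    Xc[~in_Sc, :] = 0.0                              # covariates restricted to S^c
    Uc, sing, _ = np.linalg.svd(Xc, full_matrices=False)
    Z = Uc[:, sing > tol]                            # orthonormal basis of col(X_{S^c})
    P = Z @ Z.T                                      # projection onto H_{S^c}(X, A)
    M = P @ np.linalg.matrix_power(L, k) @ P
    lam, V = np.linalg.eigh(M)
    phi = V[:, int(np.argmax(lam > tol))]            # smallest non-zero eigenvalue of M
    # (8) at t = 1_{S^c}; because phi lies in the range of P, this equals phi_i^2
    scores = phi * (P @ phi)
    scores[~in_Sc] = -np.inf                         # exclude already-queried nodes
    return scores
```

The greedy step then queries i = argmax_{j∈\mathcal{S}^{c}} Δ_j, or, in the biased variant of Section 3.3, the argmax over the candidate set B_m.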

3.3 Representative node selection

In real-world applications, we often have access only to a perturbed version of the true graph signals, denoted as Y=𝐟+ξY=\mathbf{f}+\mathbf{\xi}, where ξ\mathbf{\xi} represents node labeling noise that is independent of the network data. When replacing the true label 𝐟(i)\mathbf{f}(i) with Y(i)Y(i) in (3), this noise term introduces both finite-sample bias and variance in the estimation of the graph signal 𝐟\mathbf{f}. As a result, we aim to query nodes that are sufficiently representative of the entire covariate space to bound the generalization error. To achieve this, we introduce principled randomness into the deterministic selection procedure described in Section 3.2 to ensure that 𝒮\mathcal{S} includes nodes that are both informative and representative. The modified graph signal estimation procedure is given by:

β^=argminβ~i𝒮si(Y(i)(𝐗~𝐒)iβ~)2,\displaystyle\hat{\beta}=\operatorname*{argmin}_{\tilde{\beta}}\sum_{i\in\mathcal{S}}s_{i}\big{(}Y(i)-(\tilde{\mathbf{X}}_{\mathbf{S}})_{i\cdot}\;\tilde{\beta}\big{)}^{2}, (9)

where sis_{i} is the weight associated with the probability of selecting node ii into 𝒮\mathcal{S}.

Specifically, the generalization error of the estimator in (9) is determined by the smallest eigenvalue of 𝐗~𝒮T𝐗~𝒮\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}, denoted as λmin(𝐗~𝒮T𝐗~𝒮)\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}). Given that λmin(𝐗~T𝐗~)=1\lambda_{\min}(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}})=1, our goal is to increase the representativeness of 𝒮\mathcal{S} such that λmin(𝐗~𝒮T𝐗~𝒮)\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}) is lower-bounded by:

λmin(𝐗~𝒮T𝐗~𝒮)(1o|𝒮|(1))λmin(𝐗~T𝐗~).\displaystyle\lambda_{\min}\left(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}\right)\geq(1-o_{|\mathcal{S}|}(1))\lambda_{\min}\left(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}\right). (10)

However, the informative selection method in Section 3.2 does not guarantee (10). To address this, we propose a sequential biased sampling approach that balances informative node selection with generalization error control.

The key idea to achieve a lower bound for λmin(𝐗~𝒮T𝐗~𝒮)\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}) is to use spectral sparsification techniques for positive semi-definite matrices [4]. Let vi1×rv_{i}\in\mathbb{R}^{1\times r} denote the ii-th row of the constrained basis 𝐗~\tilde{\mathbf{X}}. By definition of 𝐗~\tilde{\mathbf{X}}, it follows that 𝐈r×r=i=1nviTvi\mathbf{I}_{r\times r}=\sum_{i=1}^{n}v_{i}^{T}v_{i}. Inspired by the randomized sampling approach in [21], we propose a biased sampling strategy to construct 𝒮\mathcal{S} with |𝒮|n|\mathcal{S}|\ll n and weights {si>0,i𝒮}\{s_{i}>0,i\in\mathcal{S}\} such that i𝒮siviTvi𝐈\sum_{i\in\mathcal{S}}s_{i}v_{i}^{T}v_{i}\approx\mathbf{I}. In other words, the weighted covariance matrix of the query set 𝒮\mathcal{S} satisfies λmin(𝐗~𝒮TWS𝐗~𝒮)1\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}W_{S}\tilde{\mathbf{X}}_{\mathcal{S}})\approx 1, where WSW_{S} is a diagonal matrix with sis_{i} on its diagonal.

We outline the detailed sampling procedure as follows. After the (t1)th(t-1)^{\text{th}} selection, let the set of query nodes be 𝒮t1\mathcal{S}_{t-1} with corresponding node-wise weights 𝒲t1={sj>0j𝒮t1}\mathcal{W}_{t-1}=\{s_{j}>0\mid j\in\mathcal{S}_{t-1}\}. The covariance matrix of 𝒮t1\mathcal{S}_{t-1} is given by Ct1r×rC_{t-1}\in\mathbb{R}^{r\times r}, defined as Ct1=𝐗~𝒮t1T𝐗~𝒮t1=j𝒮t1sjvjTvjC_{t-1}=\tilde{\mathbf{X}}_{\mathcal{S}_{t-1}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}_{t-1}}=\sum_{j\in\mathcal{S}_{t-1}}s_{j}v_{j}^{T}v_{j}. To analyze the behavior of eigenvalues as the query set is updated, we follow [21] and introduce the potential function:

Φt1=Tr[(ut1ICt1)1]+Tr[(Ct1lt1I)1],\displaystyle\Phi_{t-1}=\text{Tr}[(u_{t-1}I-C_{t-1})^{-1}]+\text{Tr}[(C_{t-1}-l_{t-1}I)^{-1}], (11)

where ut1u_{t-1} and lt1l_{t-1} are constants such that lt1<λmin(Ct1)λmax(Ct1)<ut1l_{t-1}<\lambda_{\min}(C_{t-1})\leq\lambda_{\max}(C_{t-1})<u_{t-1}, and Tr()\text{Tr}(\cdot) denotes the trace of a matrix. The potential function Φt1\Phi_{t-1} quantifies the coherence among all eigenvalues of Ct1C_{t-1}. To construct the candidate set BmB_{m}, we sample node ii and update CtC_{t}, utu_{t}, and ltl_{t} such that all eigenvalues of CtC_{t} remain within the interval (lt,ut)(l_{t},u_{t}). To achieve this, we first calculate the node-wise probabilities {pi}i=1n\{p_{i}\}_{i=1}^{n} as:

pi=[vi(ut1ICt1)1viT+vi(Ct1lt1I)1viT]/Φt1,\displaystyle p_{i}=\big{[}v_{i}(u_{t-1}I-C_{t-1})^{-1}v_{i}^{T}+v_{i}(C_{t-1}-l_{t-1}I)^{-1}v_{i}^{T}\big{]}/\Phi_{t-1}, (12)

where i=1npi=1\sum_{i=1}^{n}p_{i}=1. We then sample mm nodes into BmB_{m} according to {pi}i=1n\{p_{i}\}_{i=1}^{n}. For each node iBmi\in B_{m}, the corresponding weight is given by si=ϵpiΦt1, 0<ϵ<1s_{i}=\frac{\epsilon}{p_{i}\Phi_{t-1}},\;0<\epsilon<1. After obtaining the candidate set BmB_{m}, we apply the informative node selection criterion Δi\Delta_{i} introduced in Section 3.2, i.e., selecting the node i=argmaxiBmΔii=\operatorname*{argmax}_{i\in B_{m}}\Delta_{i}, and update the query set and weights as follows:

ifi𝒮t1c:𝒮t=𝒮t1{i},𝒲t=𝒲t1{si},\displaystyle\text{if}\;i\in\mathcal{S}_{t-1}^{c}:\;\mathcal{S}_{t}=\mathcal{S}_{t-1}\cup\{i\},\;\mathcal{W}_{t}=\mathcal{W}_{t-1}\cup\{s_{i}\},
ifi𝒮t1:si=si+ϵpiΦt1.\displaystyle\text{if}\;i\in\mathcal{S}_{t-1}:s_{i}=s_{i}+\frac{\epsilon}{p_{i}\Phi_{t-1}}.

We then update the lower and upper eigenvalue bounds as follows:

ut=ut1+ϵΦt1(1ϵ),lt=lt1+ϵΦt1(1+ϵ).\displaystyle u_{t}=u_{t-1}+\frac{\epsilon}{\Phi_{t-1}(1-\epsilon)},\;l_{t}=l_{t-1}+\frac{\epsilon}{\Phi_{t-1}(1+\epsilon)}. (13)

The update rule ensures that utltu_{t}-l_{t} increases at a slower rate than utu_{t}, leading to the convergence of the gap between the largest and smallest eigenvalues of 𝐗~𝒮TWS𝐗~𝒮\tilde{\mathbf{X}}_{\mathcal{S}}^{T}W_{S}\tilde{\mathbf{X}}_{\mathcal{S}}, thereby controlling the condition number. Accordingly, the covariance matrix is updated with the selected node ii as:

Ct=Ct1+ϵpiΦt1viTvi.\displaystyle C_{t}=C_{t-1}+\frac{\epsilon}{p_{i}\Phi_{t-1}}v_{i}^{T}v_{i}. (14)

With the covariance matrix update rule in (14), the average increment is 𝐄(Ct)Ct1=i=1npisiviTvi=ϵΦt1𝐈\mathbf{E}(C_{t})-C_{t-1}=\sum_{i=1}^{n}p_{i}s_{i}v_{i}^{T}v_{i}=\frac{\epsilon}{\Phi_{t-1}}\mathbf{I}. Intuitively, the selected node allows all eigenvalues of Ct1C_{t-1} to increase at the same rate on average. This ensures that λmin(𝐗~𝒮T𝐗~𝒮)\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}\tilde{\mathbf{X}}_{\mathcal{S}}) continues to approach λmin(𝐗~T𝐗~)=1\lambda_{\min}(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}})=1 during the selection process, thus driving the smallest eigenvalue away from zero. Additionally, the selected node remains locally informative within the candidate set BmB_{m}. Compared with the entire set of nodes, selecting from a subset serves as a regularization on informativeness maximization, achieving a balance between informativeness and representativeness for node queries.
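One candidate-drawing step of this representative sampling stage can be sketched as follows (the helper name is ours; X_tilde is the basis from Section 3.1, and u, l are assumed to bracket the eigenvalues of the current covariance matrix C as required by (11)):

```python
import numpy as np

def representative_step(X_tilde, C, u, l, m, eps, rng):
    """Compute the potential (11) and sampling probabilities (12), draw the candidate
    set B_m with replacement, and return the associated weights s_i = eps/(p_i * Phi)."""
    n, r = X_tilde.shape
    Ru = np.linalg.inv(u * np.eye(r) - C)            # (u I - C)^{-1}
    Rl = np.linalg.inv(C - l * np.eye(r))            # (C - l I)^{-1}
    Phi = np.trace(Ru) + np.trace(Rl)                # potential function (11)
    p = (np.einsum('ij,jk,ik->i', X_tilde, Ru, X_tilde) +
         np.einsum('ij,jk,ik->i', X_tilde, Rl, X_tilde)) / Phi
    p = np.clip(p, 0.0, None)
    p = p / p.sum()                                  # guard against round-off
    B_m = rng.choice(n, size=m, replace=True, p=p)   # candidate set
    weights = eps / (p[B_m] * Phi)
    return B_m, weights, Phi
```

The informativeness criterion of Section 3.2 then picks i = argmax_{i∈B_m} Δ_i, after which C, u, and l are updated as in (13) and (14).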

3.4 Node query algorithm and graph signal recovery

We summarize the biased sampling selection strategy in Algorithm 1. At a high level, each step in the biased sampling query strategy consists of two stages. First, we use randomized spectral sparsification to sample mnm\ll n nodes and collect them into a candidate set BmB_{m}. Intuitively, the covariance matrix on the updated 𝒮\mathcal{S} maintains lower-bounded eigenvalues if a node from BmB_{m} is added to 𝒮\mathcal{S}. In the second stage, we select one node from BmB_{m} based on the informativeness criterion in Section 3.2 to achieve a significant frequency increase in (7).

For initialization, the dimension of the network spectrum dd, the size of the candidate set mm, and the constant 0<ϵ<10<\epsilon<1 for spectral sparsification need to be specified. Based on the discussion at the end of Section 3.1, the dimension of the function space 𝐇ω(𝐗,𝐀)\mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}) is at most \mathcal{B}, where \mathcal{B} is the budget for label queries. Therefore, we can set d=min{p,}d=\min\{p,\mathcal{B}\}. The parameters mm and ϵ\epsilon jointly control the condition number λmax(𝐗~𝒮TWS𝐗~𝒮)λmin(𝐗~𝒮TWS𝐗~𝒮)\frac{\lambda_{\max}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}W_{S}\tilde{\mathbf{X}}_{\mathcal{S}})}{\lambda_{\min}(\tilde{\mathbf{X}}_{\mathcal{S}}^{T}W_{S}\tilde{\mathbf{X}}_{\mathcal{S}})}.

In practice, we can tune m to ensure that the covariance matrix is well-conditioned. Specifically, we can run the biased sampling procedure multiple times with different values of m and select the largest m such that the condition number of the covariance matrix on the query set \mathcal{S} is less than 10, a commonly accepted rule of thumb for a well-conditioned covariance matrix [20]. Additionally, \epsilon is typically fixed at a small value, following the protocol outlined in [21].
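A simple way to implement this tuning check (the helper below is ours; the threshold of 10 is the rule of thumb cited above) is to verify the condition number of the weighted covariance matrix on the query set:

```python
import numpy as np

def well_conditioned(X_tilde_S, s_weights, threshold=10.0):
    """Check whether the weighted covariance matrix of the query set is
    well-conditioned, i.e., its condition number is below the threshold."""
    C = X_tilde_S.T @ np.diag(np.asarray(s_weights)) @ X_tilde_S
    return np.linalg.cond(C) < threshold
```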

Algorithm 1 Biased Sampling Query Algorithm
t=0t=0, C0=0C_{0}=0, the set of query nodes 𝒮0=\mathcal{S}_{0}=\emptyset, the set of node weights 𝒲0=\mathcal{W}_{0}=\emptyset, spectral dimension dd, size of candidate set mm, constant 0<ϵ<1/m0<\epsilon<1/m, query budget \mathcal{B}.
Initialization: Compute SVD decomposition 𝐔dT𝐗=V1ΣV2T\mathbf{U}^{T}_{d}\mathbf{X}=V_{1}\Sigma V_{2}^{T}, and set 𝐗~=𝐔dV1\tilde{\mathbf{X}}=\mathbf{U}_{d}V_{1}, r=rank(𝐗~)r=\text{rank}(\tilde{\mathbf{X}}), u0=2r/ϵu_{0}=2r/\epsilon, l0=2r/ϵl_{0}=-2r/\epsilon, κ=2r(1m2ϵ2)/(mϵ2)\kappa=2r(1-m^{2}\epsilon^{2})/(m\epsilon^{2}).
while >0\mathcal{B}>0 do
     Step 1: Calculate Φt\Phi_{t} as in (11) and the node-wise probabilities {pi}i=1n\{p_{i}\}_{i=1}^{n} using (12).
     Step 2: Sample mm nodes with replacement according to probabilities {pi}i=1n\{p_{i}\}_{i=1}^{n} to form the candidate set BmB_{m}.
     Step 3: Select node ii as i=argmaxiBmΔii=\operatorname*{argmax}_{i\in B_{m}}\Delta_{i} and calculate its weight wi=ϵpiΦtw_{i}=\frac{\epsilon}{p_{i}\Phi_{t}}.      If i𝒮ti\notin\mathcal{S}_{t}, then update 𝒮t+1=𝒮t{i}\mathcal{S}_{t+1}=\mathcal{S}_{t}\cup\{i\} and 𝒲t+1=𝒲t{si}\mathcal{W}_{t+1}=\mathcal{W}_{t}\cup\{s_{i}\} with si=wiκs_{i}=\frac{w_{i}}{\kappa}.      Else if i𝒮ti\in\mathcal{S}_{t}, then update si=si+wiκs_{i}=s_{i}+\frac{w_{i}}{\kappa}.
     Step 4: Update CtC_{t}, utu_{t}, ltl_{t}, \mathcal{B} and tt as:
Ct+1=Ct+wiviTvi,ut+1\displaystyle C_{t+1}=C_{t}+w_{i}v_{i}^{T}v_{i},\quad u_{t+1} =ut+ϵΦt(1mϵ),lt+1=lt+ϵΦt(1+mϵ),\displaystyle=u_{t}+\frac{\epsilon}{\Phi_{t}(1-m\epsilon)},\quad l_{t+1}=l_{t}+\frac{\epsilon}{\Phi_{t}(1+m\epsilon)},
\displaystyle\mathcal{B} =1,t=t+1.\displaystyle=\mathcal{B}-1,\quad t=t+1.
end while
Query: Label all nodes in 𝒮\mathcal{S} through an external annotator.
Output: Set of queried nodes 𝒮\mathcal{S}, annotated responses {Yii𝒮}\{Y_{i}\mid i\in\mathcal{S}\}, smoothed covariates 𝐗~𝒮\tilde{\mathbf{X}}_{\mathcal{S}}, and weights of queried nodes 𝒲\mathcal{W}.

Based on the output from Algorithm 1, we solve the weighted least squares problem in (9):

β^=argminβ~i𝒮si(Y(i)(𝐗~𝒮)iβ~)2,\displaystyle\hat{\beta}=\operatorname*{argmin}_{\tilde{\beta}}\sum_{i\in\mathcal{S}}s_{i}\big{(}Y(i)-(\tilde{\mathbf{X}}_{\mathcal{S}})_{i\cdot}\tilde{\beta}\big{)}^{2}, (15)

and recover the graph signal on the entire network as 𝐟^=𝐗~β^\hat{\mathbf{f}}=\tilde{\mathbf{X}}\hat{\beta}.
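The recovery step amounts to a weighted least-squares fit; a minimal sketch (the helper name is ours, with s_weights aligned with the queried nodes in S) is:

```python
import numpy as np

def recover_signal(X_tilde, Y, S, s_weights):
    """Weighted least squares (15) on the queried nodes, followed by
    extrapolation f_hat = X_tilde @ beta_hat to the whole network."""
    w_sqrt = np.sqrt(np.asarray(s_weights))
    A = w_sqrt[:, None] * X_tilde[S, :]      # reweighted design matrix
    b = w_sqrt * np.asarray(Y)[S]            # reweighted responses
    beta_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return X_tilde @ beta_hat
```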

Although our theoretical analysis is developed for node regression tasks, the proposed query strategy and graph signal recovery procedure are also applicable to classification tasks. Consider a K-class classification problem, where the response on each node i is given by \mathbf{f}(i)\in\{1,2,\ldots,K\}. We introduce a dummy membership vector (Y_{1}(i),\ldots,Y_{K}(i)), where Y_{c}(i)=1 if \mathbf{f}(i)=c and Y_{c}(i)=0 otherwise. For each class c\in\{1,2,\ldots,K\}, we first estimate \hat{\beta}_{c} based on (15) with the training data \{\tilde{\mathbf{X}}_{i\cdot},Y_{c}(i),s_{i}\}_{i\in\mathcal{S}}, and then compute the score for class c as \hat{\mathbf{f}}_{c}=\tilde{\mathbf{X}}\hat{\beta}_{c}. The label of an unqueried node j is assigned as \hat{\mathbf{f}}(j)=\operatorname*{argmax}_{1\leq c\leq K}\{\hat{\mathbf{f}}_{1}(j),\hat{\mathbf{f}}_{2}(j),\cdots,\hat{\mathbf{f}}_{K}(j)\}. Notice that the above score-based classifier is equivalent to the softmax classifier:

𝐟^=argmax1cK{exp(𝐟^1)cexp(𝐟^c),,exp(𝐟^K)cexp(𝐟^c)}\displaystyle\mathbf{\hat{f}}=\operatorname*{argmax}_{1\leq c\leq K}\{\frac{\exp(\hat{\mathbf{f}}_{1})}{\sum_{c}\exp(\hat{\mathbf{f}}_{c})},\cdots,\frac{\exp(\hat{\mathbf{f}}_{K})}{\sum_{c}\exp(\hat{\mathbf{f}}_{c})}\}

since the softmax function is monotonically increasing with respect to each score function {𝐟^c}c=1K\{\mathbf{\hat{f}}_{c}\}_{c=1}^{K}.
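The classification rule described above can be sketched by reusing the recover_signal helper from (15) (the function below is our illustration; labels_S holds the annotated class labels, aligned with the queried nodes in S, and classes are encoded as 0, ..., K-1):

```python
import numpy as np

def classify_nodes(X_tilde, labels_S, S, s_weights, K):
    """One weighted regression (15) per class on dummy responses,
    then an argmax over the class scores at every node."""
    n = X_tilde.shape[0]
    scores = np.zeros((n, K))
    for c in range(K):
        Yc = np.zeros(n)
        Yc[S] = (np.asarray(labels_S) == c).astype(float)   # dummy membership for class c
        scores[:, c] = recover_signal(X_tilde, Yc, S, s_weights)
    return scores.argmax(axis=1)             # equivalent to the softmax rule above
```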

3.5 Computational Complexity

In the representative sampling stage, the computational complexity of calculating the sampling probability is 𝒪(n)\mathcal{O}(n). We then sample mm nodes to formulate a candidate set BmB_{m}, where the complexity of sampling mm variables from a discrete probability distribution is 𝒪(m)\mathcal{O}(m) [32]. Consequently, the complexity of the representative learning stage is 𝒪(n+m)\mathcal{O}(n+m).

In the informative selection stage, we calculate the information gain Δi\Delta_{i} for each node. This involves obtaining the eigenvector corresponding to the smallest non-zero eigenvalue of the projected graph Laplacian matrix, with a complexity of 𝒪(n3)\mathcal{O}(n^{3}) due to the singular value decomposition (SVD) operation. Subsequently, we compute Δi\Delta_{i} for each node in the candidate set BmB_{m} based on their loadings on the eigenvector, which incurs an additional computational cost of 𝒪(mn)\mathcal{O}(mn). Therefore, the total complexity of our biased sampling method is 𝒪(n+m+nm+n3)\mathcal{O}(n+m+nm+n^{3}). Given a node label query budget \mathcal{B}, the overall computational cost becomes 𝒪((n+m+nm+n3))\mathcal{O}(\mathcal{B}\left(n+m+nm+n^{3})\right).

When the dimension of node covariates pnp\ll n, we can replace the SVD operation with the Lanczos algorithm to accelerate the informative selection stage. The Lanczos algorithm is designed to efficiently obtain the kkth largest or smallest eigenvalues and their corresponding eigenvectors using a generalized power iteration method, which has a time complexity of 𝒪(kn2)\mathcal{O}(kn^{2}) [17]. As a result, the complexity of the proposed biased sampling method reduces to 𝒪(pn2)\mathcal{O}(pn^{2}). This is comparable to GNN-based active learning methods, as GNNs and their variations generally have a complexity of 𝒪(pn2)\mathcal{O}(pn^{2}) per training update [5, 36].
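For instance, SciPy's sparse symmetric eigensolver (based on an implicitly restarted Lanczos method via ARPACK) can return only the d smallest eigenpairs of a sparse Laplacian; the snippet below is a sketch of this replacement, with the function name ours:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

def low_frequency_basis(L_sparse, d):
    """d smallest eigenpairs of the sparse normalized Laplacian via a Lanczos-type
    solver, avoiding the O(n^3) cost of a full eigendecomposition."""
    lam, U_d = eigsh(csr_matrix(L_sparse), k=d, which='SM')
    order = np.argsort(lam)
    return lam[order], U_d[:, order]
```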

4 Theoretical Analysis

In this section, we present a theoretical analysis of the proposed node query strategy. The results are divided into two parts: the first focuses on the local information gain of the selection process, while the second examines the global performance of graph signal recovery. Given a set of query nodes 𝒮\mathcal{S}, the information gain from querying the label of a new node ii is measured as the increase in bandwidth frequency, defined as Δi:=ω(𝒮{i})ω(𝒮)\Delta_{i}:=\omega(\mathcal{S}\cup\{i\})-\omega(\mathcal{S}). We provide a step-by-step analysis of the proposed method by comparing the increase in bandwidth frequency with that of random selection.

Theorem 4.1.

Define dmin=min𝑖{di}d_{\text{min}}=\underset{i}{\min}\{d_{i}\}, where did_{i} denotes the degree of node ii. Let 𝒮\mathcal{S} represent the set of queried nodes prior to the sths^{th} selection. Denote the adjacency matrix of the subgraph excluding 𝒮\mathcal{S} as 𝐀(n|𝒮|)×(n|𝒮|)\mathbf{A}_{(n-|\mathcal{S}|)\times(n-|\mathcal{S}|)}. Let ΔsR\Delta^{\text{R}}_{s} and ΔsB\Delta^{\text{B}}_{s} denote the increase in bandwidth frequency resulting from the sths^{th} label query on a node selected by random sampling and the proposed sampling method, respectively. Let jj^{*} denote the node with the largest magnitude in the eigenvector corresponding to the smallest non-zero eigenvalue of 𝒮c\mathcal{L}_{\mathcal{S}^{c}}. Then we have:

𝐄(ΔsR)=Ω(1n)(1),and𝐄(ΔsB)𝐄(ΔsR)>Ω(1η0η13dmin2)Ω(1n)(2),\displaystyle\mathbf{E}(\Delta^{\text{R}}_{s})=\Omega(\frac{1}{n})\;(1),\;\;\text{and}\;\;\mathbf{E}(\Delta^{\text{B}}_{s})-\mathbf{E}(\Delta^{\text{R}}_{s})>\Omega(\frac{1}{\eta_{0}\eta_{1}^{3}d^{2}_{\text{min}}})-\Omega(\frac{1}{n})\;(2),

where f=\Omega(g) if c_{1}\leq|\frac{f}{g}|\leq c_{2} for constants c_{1},c_{2} when n is sufficiently large. Inequality (2) holds given m satisfying

(nmdminnm)m(nmdminndmin)dmindmin=𝒪(1).(\frac{n-m-d_{min}}{n-m})^{m}(\frac{n-m-d_{min}}{n-d_{min}})^{d_{min}}\sqrt{d_{min}}=\mathcal{O}(1).

The expectation 𝐄()\mathbf{E}(\cdot) is taken over the randomness of node selection. Both η0,η1\eta_{0},\eta_{1} are network-related quantities, where η0#|{i:|didjdmin|1}|\eta_{0}\triangleq\#|\{i:|\frac{d_{i}-d_{j^{*}}}{d_{\text{min}}}|\leq 1\}| and η1maxi(didmin)\eta_{1}\triangleq\max_{i}(\frac{d_{i}}{d_{\text{min}}}).

Theorem 4.1 provides key insights into the information gain achieved through different node label querying strategies. While random selection yields a constant average information gain, the proposed biased sampling method guarantees a higher information gain under mild assumptions.

Theorem 4.1 is derived using first-order matrix perturbation theory [2] on the Laplacian matrix \mathcal{L}. In Theorem 4.1, we assume that the column space of the node covariate matrix 𝐗\mathbf{X} is identical to the space spanned by the first dd eigenvectors of 𝒮c\mathcal{L}_{\mathcal{S}^{c}}. This assumption simplifies the analysis and the results by focusing on the perturbation of 𝒮c\mathcal{L}_{\mathcal{S}^{c}}, where 𝒮c\mathcal{L}_{\mathcal{S}^{c}} is the reduced Laplacian matrix with zero entries in the rows and columns indexed by 𝒮\mathcal{S}.

The analysis can be naturally extended to the general setting by replacing 𝒮c\mathcal{L}_{\mathcal{S}^{c}} with 𝐏(𝟏𝒮c)𝐏(𝟏𝒮c)\mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}})\mathcal{L}\mathbf{P}(\mathbf{1}_{\mathcal{S}^{c}}), where 𝐏(𝐭)\mathbf{P}(\mathbf{t}) is the projection operator defined in Section 3.2. Moreover, under the assumption on the node covariates, the information gain Δi\Delta_{i} exhibits an explicit dependence on the network statistics, providing a clearer interpretation of how the network structure influences the benefits of selecting informative nodes.

Theorem 4.1 indicates that the improvement of biased sampling is more significant when dmind_{\text{min}} is larger and η0,η1\eta_{0},\eta_{1} are smaller. Specifically, dmind_{\text{min}} reflects the connectedness of the network, where a better-connected network facilitates the propagation of label information and enhances the informativeness of a node’s label for other nodes. A smaller η1\eta_{1} prevents the existence of dominating nodes, ensuring that the connectedness does not significantly decrease when some nodes are removed from the network.

Notice that the node jj^{*} is the most informative node for the next selection, and η0\eta_{0} measures the number of nodes similar to jj^{*} in the network. Recall that the proposed biased sampling method considers both the informativeness and representativeness of the selected nodes. Therefore, the information gain is less penalized by the representativeness requirement if η0\eta_{0} is small. Additionally, the size of the candidate set mm should be sufficiently large to ensure that informative nodes are included in BmB_{m}.

In Theorem 4.2, we provide the generalization error bound for the proposed sampling method under the weighted OLS estimator. To formally state Theorem 4.2, we first introduce the following two assumptions:

Assumption 1 For the underlying graph signal 𝐟\mathbf{f}, there exists a bandwidth frequency ω0\omega_{0} such that 𝐟𝐇ω0(𝐗,𝐀)\mathbf{f}\in\mathbf{H}_{\omega_{0}}(\mathbf{X},\mathbf{A}).
Assumption 2 The observed node-wise response YiY_{i} can be decomposed as Yi=𝐟(i)+ξiY_{i}=\mathbf{f}(i)+\xi_{i}, where {ξi}i=1n\{\xi_{i}\}_{i=1}^{n} are independent random variables with 𝐄(ξi)=0\mathbf{E}(\xi_{i})=0 and 𝐕𝐚𝐫(ξi)σ2\mathbf{Var}(\xi_{i})\leq\sigma^{2}.

Theorem 4.2.

Under Assumptions 1 and 2, for the graph signal estimation 𝐟^\hat{\mathbf{f}} obtained by training (15) on \mathcal{B} labeled nodes selected by Algorithm 1, with probability greater than 12mt1-\frac{2m}{t}, where t>2mt>2m, we have

𝐄Y𝐟^𝐟22𝒪(rdt+2(rdt)3/2+(rdt)2)×(nσ2+i>d,isupp(𝐟)αi2)+i>d,isupp(𝐟)αi2,\displaystyle\mathbf{E}_{Y}\|\hat{\mathbf{f}}-\mathbf{f}\|_{2}^{2}\leq\mathcal{O}\Big{(}\frac{r_{d}t}{\mathcal{B}}+2(\frac{r_{d}t}{\mathcal{B}})^{3/2}+(\frac{r_{d}t}{\mathcal{B}})^{2}\Big{)}\times(n\sigma^{2}+\sum_{i>d,i\in\text{supp}(\mathbf{f})}\alpha_{i}^{2})+\sum_{i>d,i\in\text{supp}(\mathbf{f})}\alpha_{i}^{2}, (16)

where αi:=𝐟,Ui\alpha_{i}:=\langle\mathbf{f},U_{i}\rangle, supp(𝐟):={i:1in,|αi|>0}\text{supp}(\mathbf{f}):=\{i:1\leq i\leq n,|\alpha_{i}|>0\} and rd=rank(𝐔dT𝐗)r_{d}=\text{rank}(\mathbf{U}_{d}^{T}\mathbf{X}). 𝐄Y()\mathbf{E}_{Y}(\cdot) denotes the expected value with respect to the randomness in observed responses.

Theorem 4.2 reveals the trade-off between informativeness and representativeness in graph-based active learning, which is controlled by the spectral dimension dd. Since rdr_{d} is a monotonic function of dd, a larger dd reduces representativeness among queried nodes, thereby increasing variance in controlling the condition number (i.e., the first three terms). On the other hand, a larger dd reduces approximation bias to the true graph signal (i.e., the fifth and last terms) by including more informative nodes for capturing less smoothed signals.

The RHS of (16) captures both the variance and bias involved in estimating \mathbf{f} using noisy labels on sampled nodes. Specifically, the first three terms represent the estimation variance arising from controlling the condition number of the design matrix on the queried nodes. The fourth and fifth terms reflect the noise and unidentifiable components in the responses of the queried nodes, while the last term denotes the bias resulting from approximating the true signal within the space \mathbf{H}_{\omega}(\mathbf{X},\mathbf{A}).

The bias term in Theorem 4.2 can be further controlled if the true signal 𝐟\mathbf{f} exhibits decaying or zero weights on high-frequency network components. In addition to rdr_{d}, the size of the candidate set mm also influences the probability of controlling the generalization error. A small mm places greater emphasis on the representativeness criterion in sampling, increasing the likelihood of controlling the condition number but potentially overlooking informative nodes, thereby increasing approximation bias.

For a fixed prediction MSE, the query complexity of our method is 𝒪(d)\mathcal{O}(d), whereas random sampling incurs a complexity of 𝒪(d~logd~)\mathcal{O}(\tilde{d}\log\tilde{d}), where d~>d\tilde{d}>d. Our method outperforms random sampling in two key aspects: (1) The information-based selection identifies 𝐟\mathbf{f} with fewer queries than random sampling, as shown in Theorem 4.1, and (2) our method achieves an additional improvement by actively controlling the condition number of the covariate matrix, resulting in a logarithmic factor reduction compared to random sampling.

5 Numerical Studies

In this section, we conduct extensive numerical studies to evaluate the proposed active learning strategy for node-level prediction tasks on both synthetic and real-world networks. For the synthetic networks, we focus on regression tasks with continuous responses, while for the real-world networks, we consider classification tasks with discrete node labels.

5.1 Synthetic networks

We consider three different network topologies generated by widely studied statistical network models: the Watts–Strogatz model [34] for small-world properties, the Stochastic block model [16] for community structure, and the Barabási–Albert model [3] for scale-free properties.

Node responses are generated as Y=𝐟+ξY=\mathbf{f}+\xi, where 𝐟\mathbf{f} is the true graph signal and ξN(0,σ2In)\xi\sim N(0,\sigma^{2}I_{n}) is Gaussian noise. The true signal is constructed as 𝐟=𝐔dβ\mathbf{f}=\mathbf{U}_{d}\mathbb{\beta}, where β\mathbb{\beta} is the linear coefficient and 𝐔d\mathbf{U}_{d} denotes the leading dd eigenvectors of the normalized graph Laplacian of the synthetic network. Since our theoretical analysis assumes that the observed node covariates 𝐗\mathbf{X} contain noise, we generate 𝐗\mathbf{X} as a perturbed version of 𝐔d\mathbf{U}_{d} by adding non-leading eigenvectors of the normalized graph Laplacian. The detailed simulation settings can be found in Appendix B.1.
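For reference, a generating sketch in the spirit of this setup is given below (parameter values here are illustrative and not the exact settings of Appendix B.1):

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n, d, sigma2 = 100, 5, 0.5                                 # illustrative sizes and noise level

G = nx.watts_strogatz_graph(n, k=6, p=0.1, seed=0)         # small-world topology [34]
L = nx.normalized_laplacian_matrix(G).toarray()            # normalized graph Laplacian
lam, U = np.linalg.eigh(L)

beta = rng.normal(size=d)
f_true = U[:, :d] @ beta                                   # smooth true signal f = U_d beta
Y = f_true + rng.normal(scale=np.sqrt(sigma2), size=n)     # noisy node responses

# perturbed covariates: leading eigenvectors plus a small non-leading component
X = U[:, :d] + 0.1 * U[:, d:2 * d]
```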

We compare our algorithm with several offline active learning methods: 1. D-optimal [25] selects a subset of nodes \mathcal{S} to maximize the determinant of the observed covariate information matrix \mathbf{X}_{\mathcal{S}}^{T}\mathbf{X}_{\mathcal{S}}. 2. RIM [38] selects nodes to maximize the number of influenced nodes. 3. GPT [23] and SPA [13] split the graph into disjoint partitions and select informative nodes from each partition.

After the node query step, we fit the weighted linear regression from (15) on the labeled nodes, using the smoothed covariates 𝐗~\mathbf{\tilde{X}}, to estimate the linear coefficient β^\hat{\beta} and predict the response 𝐘^\hat{\mathbf{Y}} for the unlabeled nodes. In Figure 1, we plot the prediction MSE of the proposed method against baselines on unlabeled nodes for various levels of labeling noise σ2(0.5,0.6,0.7,0.8,0.9,1)\sigma^{2}\in(0.5,0.6,0.7,0.8,0.9,1). The results show that the proposed method significantly outperforms all baselines across all simulation settings and exhibits strong robustness to noise. The inferior performance of the baselines can be attributed to several factors. D-optimal and RIM fail to account for noise in the node covariates. Meanwhile, partition-based methods like GPT and SPA are highly sensitive to hyperparameters, such as the optimal number of partitions, which limits their generalization to networks lacking a clear community structure.

Figure 1: Prediction performance on unlabeled nodes at different levels of labeling noise (\sigma^{2}) for the three simulated networks: (a) Small World, (b) Community, and (c) Scale-free. All three simulated networks have n=100 nodes, with the number of labeled nodes fixed at 25.
Table 1: Test accuracy (Micro-F1%) on five real-world networks with varying levels of homophily. The edge homophily ratio hh of a network is defined as the fraction of edges that connect nodes with the same class label. A higher hh indicates a network with stronger homophily.
Cora (h=0.81h=0.81) Pubmed (h=0.80h=0.80) Citeseer (h=0.74h=0.74) Chameleon (h=0.23h=0.23) Texas (h=0.11h=0.11)
#labeled nodes 35 70 140 15 30 60 30 60 120 50 75 100 15 30 45
Random 68.2 ±\pm 1.3 74.5 ±\pm 1.0 78.9 ±\pm 0.9 71.2 ±\pm 1.8 74.9 ±\pm 1.6 78.4 ±\pm 0.5 57.7 ±\pm 0.8 65.3 ±\pm 1.4 70.7 ±\pm 0.7 22.4 ±\pm 2.6 22.1 ±\pm 2.5 21.8 ±\pm 2.1 67.0 ±\pm 3.3 69.9 ±\pm 3.3 73.8 ±\pm 3.2
AGE 72.1 ±\pm 1.1 78.0 ±\pm 0.9 82.5 ±\pm 0.5 74.9 ±\pm 1.1 77.5 ±\pm 1.2 79.4 ±\pm 0.7 65.3 ±\pm 1.1 67.7 ±\pm 0.5 71.4 ±\pm 0.5 30.0 ±\pm 4.5 28.2 ±\pm 4.9 28.6 ±\pm 5.0 67.9 ±\pm 2.6 68.8 ±\pm 3.3 72.1 ±\pm 3.6
GPT 77.4 ±\pm 1.6 81.6 ±\pm 1.2 86.5 ±\pm 1.2 77.0 ±\pm 3.1 79.9 ±\pm 2.8 81.5 ±\pm 1.6 67.9 ±\pm 1.8 71.0 ±\pm 2.4 74.0 ±\pm 2.0 14.1 ±\pm 2.5 15.8 ±\pm 2.2 16.4 ±\pm 2.4 72.6 ±\pm 2.0 72.5 ±\pm 3.6 74.6 ±\pm 1.8
RIM 77.5 ±\pm 0.8 81.6 ±\pm 1.1 84.1 ±\pm 0.8 75.0 ±\pm 1.5 77.2 ±\pm 0.6 80.2 ±\pm 0.4 67.5 ±\pm 0.7 70.0 ±\pm 0.6 73.2 ±\pm 0.7 35.5 ±\pm 3.7 42.8 ±\pm 3.0 34.4 ±\pm 3.5 68.5 ±\pm 3.7 78.4 ±\pm 3.0 74.6 ±\pm 3.7
IGP 77.4 ±\pm 1.7 81.7 ±\pm 1.6 86.3 ±\pm 0.7 78.5 ±\pm 1.2 82.3 ±\pm 1.4 83.5 ±\pm 0.5 68.2 ±\pm 1.1 72.1 ±\pm 0.9 75.8 ±\pm 0.4 32.5 ±\pm 3.6 33.7 ±\pm 3.1 33.4 ±\pm 3.5 70.8 ±\pm 3.7 69.9 ±\pm 3.3 76.1 ±\pm 3.6
SPA 76.5 ±\pm 1.9 80.3 ±\pm 1.6 85.2 ±\pm 0.6 75.4 ±\pm 1.6 78.3 ±\pm 2.0 73.5 ±\pm 1.2 66.4 ±\pm 2.2 69.3 ±\pm 1.7 73.5 ±\pm 2.0 30.2 ±\pm 3.2 28.5 ±\pm 2.9 31.0 ±\pm 4.4 72.0 ±\pm 3.2 72.5 ±\pm 3.1 74.6 ±\pm 2.1
Proposed 78.4 ±\pm 1.7 81.8 ±\pm 1.8 86.5 ±\pm 1.1 78.9 ±\pm 1.1 79.1 ±\pm 0.6 82.3 ±\pm 0.6 69.1 ±\pm 1.0 72.2 ±\pm 1.3 75.5 ±\pm 0.8 35.1 ±\pm 2.8 35.7 ±\pm 3.0 37.2 ±\pm 3.0 75.0 ±\pm 1.9 79.5 ±\pm 0.8 80.4 ±\pm 2.7
Table 2: Test accuracy (Macro-F1% and Micro-F1%) on two real-world large-scale networks: Ogbn-Arxiv (n=169,343n=169,343) and Co-Physics (n=34,493n=34,493).
Ogbn-Arxiv (Macro-F1) Ogbn-Arxiv (Micro-F1) Co-Physics (Macro-F1)
#labeled nodes 160 320 640 1280 160 320 640 1280 10 20 40
Random 21.9 ±\pm 1.4 27.6 ±\pm 1.5 33.0 ±\pm 1.4 37.2 ±\pm 1.1 52.3 ±\pm 0.8 56.4 ±\pm 0.8 60.0 ±\pm 0.7 63.5 ±\pm 0.4 58.3 ±\pm 13.8 66.9 ±\pm 10.1 78.3 ±\pm 7.1
AGE 20.4 ±\pm 0.9 25.9 ±\pm 1.1 31.7 ±\pm 0.8 36.4 ±\pm 0.8 48.3 ±\pm 2.3 54.9 ±\pm 1.6 60.0 ±\pm 0.7 63.5 ±\pm 0.3 63.7 ±\pm 7.8 71.0 ±\pm 8.8 82.4 ±\pm 3.9
GPT 24.2 ±\pm 0.7 29.5 ±\pm 0.8 36.4 ±\pm 0.5 41.0 ±\pm 0.5 52.3 ±\pm 0.9 56.8 ±\pm 0.8 60.7 ±\pm 0.6 63.6 ±\pm 0.5 75.8 ±\pm 2.7 85.8 ±\pm 0.3 88.9 ±\pm 0.3
Proposed 25.8 ±\pm 1.3 34.3 ±\pm 1.4 38.3 ±\pm 1.2 41.3 ±\pm 1.3 53.1 ±\pm 1.3 58.0 ±\pm 1.0 62.3 ±\pm 1.6 64.8 ±\pm 1.0 83.5 ±\pm 0.8 86.8 ±\pm 1.3 89.2 ±\pm 1.2
Table 3: Average query time (in seconds) per node.
Dataset Size Time
Texas 183 0.19 ±\pm 0.03
Chameleon 2,277 0.34 ±\pm 0.18
Cora 2,708 0.30 ±\pm 0.19
Citeseer 3,327 0.26 ±\pm 0.07
Pubmed 19,717 0.48 ±\pm 0.25
Co-Physics 34,493 1.08 ±\pm 0.43
Ogbn-Arxiv 169,343 2.11 ±\pm 0.33

5.2 Real-world networks

We evaluate the proposed method on node classification tasks on real-world datasets, which include five networks with varying homophily levels (from high to low: Cora, Pubmed, Citeseer, Chameleon, and Texas) and two large-scale networks (Ogbn-Arxiv and Co-Physics). In addition to the offline methods described in Section 5.1, we also compare our approach with two GNN-based online active learning methods, AGE [7] and IGP [39]. In each GNN training iteration, AGE selects nodes to maximize a linear combination of heuristic metrics, while IGP selects nodes that maximize information gain propagation.
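For reference, the edge homophily ratio h reported in Table 1 can be computed directly from the edge list and node labels; a minimal sketch (the function name and toy inputs are illustrative):

```python
import numpy as np

def edge_homophily(edges, labels):
    """Fraction of edges that connect nodes with the same class label."""
    edges = np.asarray(edges)                  # shape (num_edges, 2)
    labels = np.asarray(labels)
    return float(np.mean(labels[edges[:, 0]] == labels[edges[:, 1]]))

# Toy example: a path graph 0-1-2-3 with labels [0, 0, 1, 1] gives h = 2/3.
print(edge_homophily([(0, 1), (1, 2), (2, 3)], [0, 0, 1, 1]))
```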

Unlike regression, node classification with GNNs is a widely studied area of research. Previous works [23, 30, 38, 39] have demonstrated that the prediction performance of various active learning strategies on unlabeled nodes remains relatively consistent across different types of GNNs. Therefore, we employ Simplified Graph Convolution (SGC) [35] as the GNN classifier due to its straightforward theoretical intuition. Since SGC is essentially multi-class logistic regression on low-pass-filtered covariates, it can be approximately viewed as a special case of the regression model defined in (15). Thus, we conjecture that our theoretical analysis can also be extended to classification tasks and leave its formal verification for future work.
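As a rough illustration of this point, SGC reduces to multi-class logistic regression on K-step low-pass-filtered covariates. The sketch below conveys only this structure (using scikit-learn's LogisticRegression as the classifier head); it is not the exact training configuration used in our experiments, which is described in Appendix B.1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sgc_filter(A, X, K=2):
    """K-step propagation with the symmetrically normalized adjacency
    (self-loops added), i.e. a low-pass graph filter applied to X."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    S = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    for _ in range(K):
        X = S @ X
    return X

def fit_sgc(A, X, y, labeled_idx, K=2):
    """Multi-class logistic regression on the filtered covariates."""
    X_prop = sgc_filter(A, X, K)
    clf = LogisticRegression(max_iter=1000).fit(X_prop[labeled_idx], y[labeled_idx])
    return clf, X_prop
```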

The results in Table 1 demonstrate that the proposed algorithm is highly competitive with the baselines across real-world networks with varying degrees of homophily. Our method achieves the best performance on Cora (highest homophily) and Texas (lowest homophily, i.e., highest heterophily) and is particularly effective when the labeling budget is most limited. To handle heterophily in networks such as Chameleon and Texas, we expand the graph signal subspace U_d in Algorithm 1 to U_d = {U_1, ⋯, U_d, U_{n−d+1}, ⋯, U_n}, combining the eigenvectors corresponding to the d smallest and d largest eigenvalues, as sketched below. Admittedly, relying on a priori knowledge of the label construction may be unrealistic, so developing adaptive methods for designing the signal subspace that handle both homophily and heterophily remains a promising direction for future research.
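A minimal sketch of this subspace construction (dense eigendecomposition for clarity; the homophilic case simply takes the d smallest-eigenvalue eigenvectors, and large graphs would instead call for a sparse eigensolver):

```python
import numpy as np

def signal_subspace(A, d, heterophilic=False):
    """Return the eigenvector subspace U_d of the normalized Laplacian.

    Homophilic graphs  : eigenvectors of the d smallest eigenvalues.
    Heterophilic graphs: eigenvectors of the d smallest and d largest
    eigenvalues, as described above.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)            # columns sorted by ascending eigenvalue
    if heterophilic:
        idx = np.r_[np.arange(d), np.arange(L.shape[0] - d, L.shape[0])]
    else:
        idx = np.arange(d)
    return eigvecs[:, idx]
```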

Table 2 summarizes the performance on two large-scale networks. The greatest improvement is observed in the Macro-F1 score on Ogbn-Arxiv, with an increase of up to 4.8% at 320 labeled nodes. Moreover, Table 3 demonstrates that our algorithm scales efficiently to large networks, with the time cost of querying a single node being approximately 2 seconds when n=169,343.

6 Conclusion

We propose a graph-based offline active learning framework for node-level tasks. Our node query strategy effectively leverages both the network structure and node covariate information, demonstrating robustness to diverse network topologies and node-level noise. We provide theoretical guarantees for controlling generalization error, uncovering a novel trade-off between informativeness and representativeness in active learning on graphs. Empirical results demonstrate that our method performs strongly on both synthetic and real-world networks, achieving competitiveness with state-of-the-art methods on benchmark datasets. Future work could explore extensions to an online active learning setting that iteratively incorporates node response information to further enhance query efficiency. Additionally, scalability on large graphs could be improved by utilizing the Lanczos method [31] or Chebyshev polynomial approximation [19] during node selection.

References

  • [1] Z. Allen-Zhu, Y. Li, A. Singh, and Y. Wang. Near-optimal discrete optimization for experimental design: A regret minimization approach. Mathematical Programming, 2020.
  • [2] B. Bamieh. A tutorial on matrix perturbation theory (using compact matrix notation). arXiv preprint arXiv:2002.05001, 2020.
  • [3] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 1999.
  • [4] J. Batson, D. A. Spielman, N. Srivastava, and S.-H. Teng. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 2013.
  • [5] D. Blakely, J. Lanchantin, and Y. Qi. Time and space complexity of graph convolutional networks. Accessed on: Dec, 2021.
  • [6] M.-R. Bouguelia, S. Nowaczyk, K. Santosh, and A. Verikas. Agreeing to disagree: active learning with noisy labels without crowdsourcing. International Journal of Machine Learning and Cybernetics, 2018.
  • [7] H. Cai, V. W. Zheng, and K. Chen-Chuan Chang. Active learning for graph embedding. arXiv preprint arXiv:1705.05085, 2017.
  • [8] S. Chen, A. Sandryhaila, J. M. F. Moura, and J. Kovacevic. Signal recovery on graphs: Variation minimization. IEEE Transactions on Signal Processing, 2016.
  • [9] X. Chen and E. Price. Active regression via linear-sample sparsification. In Conference on Learning Theory, pages 663–695. PMLR, 2019.
  • [10] G. Dasarathy, R. Nowak, and X. Zhu. S2: An efficient graph based active learning algorithm with application to nonparametric classification. In Proceedings of The 28th Conference on Learning Theory (COLT), 2015.
  • [11] M. Derezinski, M. K. Warmuth, and D. J. Hsu. Leveraged volume sampling for linear regression. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • [12] J. Du and C. X. Ling. Active learning with human-like noisy oracle. In Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM), 2010.
  • [13] R. Fajri, Y. Pei, L. Yin, and M. Pechenizkiy. A structural-clustering based active learning for graph neural networks. In Symposium on Intelligent Data Analysis (IDA), 2024.
  • [14] A. Gadde, A. Anis, and A. Ortega. Active semi-supervised learning using sampling theory for graph signals. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2014.
  • [15] A. Gadde, A. Anis, and A. Ortega. Active semi-supervised learning using sampling theory for graph signals. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 492–501, 2014.
  • [16] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 2002.
  • [17] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 2013.
  • [18] N. T. Hoang, T. Maehara, and T. Murata. Revisiting graph neural networks: Graph filtering perspective. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR), 2021.
  • [19] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
  • [20] M. Kutner, C. Nachtsheim, J. Neter, and W. Li. Applied Linear Statistical Models. McGraw-Hill/Irwin, 2004.
  • [21] Y. T. Lee and H. Sun. Constructing linear-sized spectral sparsification in almost-linear time. SIAM Journal on Computing, 47(6):2315–2336, 2018.
  • [22] P. Di Lorenzo, S. Barbarossa, and P. Banelli. Cooperative and Graph Signal Processing. Academic Press, 2018.
  • [23] J. Ma, Z. Ma, J. Chai, and Q. Mei. Partition-based active learning for graph neural networks. Transactions on Machine Learning Research, 2023.
  • [24] E. L. Paluck, H. Shepherd, and P. M. Aronow. Changing climates of conflict: a social network experiment in 56 schools. Proceedings of the National Academy of Sciences, page 566–571, 2016.
  • [25] F. Pukelsheim. Optimal Design of Experiments (Classics in Applied Mathematics, 50). Society for Industrial and Applied Mathematics, 2018.
  • [26] G. Puy, N. Tremblay, R. Gribonval, and P. Vandergheynst. Random sampling of bandlimited signals on graphs. Applied and Computational Harmonic Analysis, 2016.
  • [27] F. F. Regol, S. Pal, Y. Zhang, and M. Coates. Active learning on attributed graphs via graph cognizant logistic regression and preemptive query generation. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.
  • [28] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, X. Chen, and X. Wang. A survey of deep active learning. ACM Computing Surveys, 2020.
  • [29] B. Settles. Active learning literature survey. Technical report, University of Wisconsin Madison, 2010.
  • [30] Z. Song, Y. Zhang, and I. King. No change, no gain: Empowering graph neural networks with expected model change maximization for active learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  • [31] A. Susnjara, N. Perraudin, D. Kressner, and P. Vandergheynst. Accelerated filtering on graphs using lanczos method. arXiv preprint arXiv:1509.04537, 2015.
  • [32] M. Vose. Linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering, 1991.
  • [33] W. Wang and W. N. Street. Modeling and maximizing influence diffusion in social networks for viral marketing. Applied Network Science, 2018.
  • [34] D. J. Watts and S. H. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 1998.
  • [35] F. Wu, A. H. S. Jr., T. Zhang, C. Fifty, T. Yu, and K. Q. Weinberger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
  • [36] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
  • [37] Y. Xu, X. Chen, A. Liu, and C. Hu. A latency and coverage optimized data collection scheme for smart cities. In New Advances in Identification, Information and Knowledge in the Internet of Things, 2017.
  • [38] W. Zhang, Y. Wang, Z. You, M. Cao, P. Huang, J. Shan, Z. Yang, and B. Cui. Rim: Reliable influence-based active learning on graphs. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • [39] W. Zhang, Y. Wang, Z. You, M. Cao, P. Huang, J. Shan, Z. Yang, and B. Cui. Information gain propagation: a new way to graph active learning with soft labels. In Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022.

Appendix A Proofs

A.1 Proof of Theorem 3.1

Proof: Consider a frequency ω below the threshold frequency ω(𝒮), i.e.,

ω<ω(𝒮):=inf𝐠Proj𝐋𝒮cspan(𝐗)ω𝐠,\displaystyle\omega<\omega(\mathcal{S}):=\operatorname*{inf}_{\mathbf{g}\in\text{Proj}_{\mathbf{L}_{\mathcal{S}^{c}}}\text{span}(\mathbf{X})}\omega_{\mathbf{g}}, (17)

and notice that (17) holds if and only if

Proj𝐋ωspan(𝐗)Proj𝐋𝒮cspan(𝐗)={0}.\displaystyle\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X})\cap\text{Proj}_{\mathbf{L}_{\mathcal{S}^{c}}}\text{span}(\mathbf{X})=\{0\}. (18)

(⇒) Suppose Proj_{𝐋_ω}span(𝐗) can be identified by 𝒮. For any ϕ ∈ Proj_{𝐋_ω}span(𝐗) ∩ Proj_{𝐋_{𝒮^c}}span(𝐗) and any 𝐟 ∈ Proj_{𝐋_ω}span(𝐗), we have 𝐠 = ϕ + 𝐟 ∈ Proj_{𝐋_ω}span(𝐗) by closure under addition, and 𝐟(𝒮) = 𝐠(𝒮) since ϕ(𝒮) = 0. Because Proj_{𝐋_ω}span(𝐗) can be identified by 𝒮, one must have 𝐟 = 𝐠, which implies ϕ = 0. Hence, (18) holds.

(⇐) Suppose (18) holds, and assume for contradiction that Proj_{𝐋_ω}span(𝐗) cannot be identified by 𝒮. Then, by definition, there exist 𝐟_1, 𝐟_2 ∈ Proj_{𝐋_ω}span(𝐗) such that 𝐟_1(𝒮) = 𝐟_2(𝒮) and 𝐟_1 ≠ 𝐟_2. Since (𝐟_1 − 𝐟_2)(𝒮) = 0, we have 𝐟_1 − 𝐟_2 ∈ Proj_{𝐋_{𝒮^c}}span(𝐗). By closure under addition, 𝐟_1 − 𝐟_2 ∈ Proj_{𝐋_ω}span(𝐗). Hence 𝐟_1 − 𝐟_2 ∈ Proj_{𝐋_ω}span(𝐗) ∩ Proj_{𝐋_{𝒮^c}}span(𝐗), which contradicts (18). Therefore, Proj_{𝐋_ω}span(𝐗) can be identified by 𝒮.

A.2 Proof of Theorem 4.1

Proof: Let 𝐔 = {U_1, U_2, ⋯, U_n} be the eigenvectors of ℒ = 𝐈 − 𝐃^{-1/2}𝐀𝐃^{-1/2} and Λ = diag(λ_1, λ_2, ⋯, λ_n) the corresponding eigenvalues. Without loss of generality, we analyze the case k = 1 with no repeated eigenvalues. The case of repeated eigenvalues can be handled similarly using matrix perturbation analysis for the degenerate case [2].

After selecting node ii, we have the reduced Laplacian matrix \mathcal{L}^{\ast}

=+(i),where(i)=(01i0i1iiin0ni0)\mathcal{L}^{\ast}=\mathcal{L}+\mathcal{L}^{(-i)},\;\;\text{where}\;\;\mathcal{L}^{(-i)}=-\begin{pmatrix}0&\cdots&\mathcal{L}_{1i}&\cdots&0\\ \vdots&\ddots&\vdots&\vdots&\vdots\\ \mathcal{L}_{i1}&\ldots&\mathcal{L}_{ii}&\ldots&\mathcal{L}_{in}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&\cdots&\mathcal{L}_{ni}&\cdots&0\\ \end{pmatrix}

Define Λ\Lambda^{\ast} as the diagonal matrix containing eigenvalues of \mathcal{L}^{\ast}. Using the first order perturbation analysis [2], we have ΛΛ+diag(𝐔T(i)𝐔)\Lambda^{\ast}\approx\Lambda+\text{diag}(\mathbf{U}^{T}\mathcal{L}^{(-i)}\mathbf{U}). Let i\mathcal{L}_{\bm{\cdot}i} and i\mathcal{L}_{i\bm{\cdot}} be the ithi^{th} column and row of \mathcal{L}, respectively. Since

(i)Uj=\displaystyle\mathcal{L}^{(-i)}U_{j}= (1i𝐔iji(i)Ujni𝐔ij)=i𝐔ij+(0ii𝐔iji(i)Uj0)=i𝐔ij+(0(iiλj)𝐔ij0)\displaystyle\begin{pmatrix}-\mathcal{L}_{1i}\mathbf{U}_{ij}\\ \vdots\\ -\mathcal{L}^{(-i)}_{i\bm{\cdot}}U_{j}\\ \vdots\\ -\mathcal{L}_{ni}\mathbf{U}_{ij}\\ \end{pmatrix}=-\mathcal{L}_{\bm{\cdot}i}\mathbf{U}_{ij}+\begin{pmatrix}0\\ \vdots\\ \mathcal{L}_{ii}\mathbf{U}_{ij}-\mathcal{L}^{(-i)}_{i\bm{\cdot}}U_{j}\\ \vdots\\ 0\\ \end{pmatrix}=-\mathcal{L}_{\bm{\cdot}i}\mathbf{U}_{ij}+\begin{pmatrix}0\\ \vdots\\ (\mathcal{L}_{ii}-\lambda_{j})\mathbf{U}_{ij}\\ \vdots\\ 0\\ \end{pmatrix}

we have

UjT(i)Uj=\displaystyle U^{T}_{j}\mathcal{L}^{(-i)}U_{j}= 𝐔ij(iUj)+(iiλj)𝐔ij2\displaystyle-\mathbf{U}_{ij}(\mathcal{L}_{i\bm{\cdot}}U_{j})+(\mathcal{L}_{ii}-\lambda_{j})\mathbf{U}^{2}_{ij}
=\displaystyle= 𝐔ij(λj𝐔ij)+(iiλj)𝐔ij2\displaystyle-\mathbf{U}_{ij}(\lambda_{j}\mathbf{U}_{ij})+(\mathcal{L}_{ii}-\lambda_{j})\mathbf{U}^{2}_{ij}
=\displaystyle= (12λj)𝐔ij2since the diagonal entry of is 1\displaystyle(1-2\lambda_{j})\mathbf{U}^{2}_{ij}\;\;\text{since the diagonal entry of }\mathcal{L}\;\text{is}\;1

Therefore,

ΛΛ+((12λ1)𝐔i12(12λn)𝐔in2)\displaystyle\Lambda^{\ast}\approx\Lambda+\begin{pmatrix}(1-2\lambda_{1})\mathbf{U}^{2}_{i1}&&\\ &\ddots&\\ &&(1-2\lambda_{n})\mathbf{U}^{2}_{in}\end{pmatrix}

Assume U_1 corresponds to the smallest non-zero eigenvalue of Λ^*; then the increase in bandwidth frequency from querying node i is 𝐔^2_{i1}.
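This first-order approximation can be checked numerically on a small graph; the following is a hedged sketch (not part of the proof) comparing the exact eigenvalues of the reduced Laplacian with Λ + diag(U^⊤ℒ^{(−i)}U). The graph model, size, and queried node index are arbitrary illustrative choices.

```python
import numpy as np
import networkx as nx

# Small connected graph and its symmetrically normalized Laplacian
G = nx.barabasi_albert_graph(n=20, m=3, seed=0)
A = nx.to_numpy_array(G)
deg = A.sum(axis=1)
L = np.eye(len(A)) - np.diag(deg ** -0.5) @ A @ np.diag(deg ** -0.5)
lam, U = np.linalg.eigh(L)

i = 5                                  # node whose row/column is zeroed out after querying
L_minus_i = np.zeros_like(L)           # the perturbation L^{(-i)}
L_minus_i[i, :] = -L[i, :]
L_minus_i[:, i] = -L[:, i]
L_star = L + L_minus_i                 # reduced Laplacian after selecting node i

lam_exact = np.linalg.eigvalsh(L_star)                    # ascending order
lam_approx = np.sort(lam + np.diag(U.T @ L_minus_i @ U))  # first-order approximation
print(np.round(lam_exact[:5], 3))
print(np.round(lam_approx[:5], 3))
```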

For random selection, where node ii is queried with uniform probability,

𝐄(ΔR)=1ni=1n𝐔i121n\displaystyle\mathbf{E}(\Delta_{\text{R}})=\frac{1}{n}\displaystyle\sum_{i=1}^{n}\mathbf{U}^{2}_{i1}\propto\frac{1}{n}

For the proposed selection method, define 𝐑=diag(r1,r2,,rn)\mathbf{R}=\text{diag}(r_{1},r_{2},\cdots,r_{n}), where ri=didminr_{i}=\frac{d_{i}}{d_{min}}, then

=\displaystyle\mathcal{L}= 𝐈𝐃1/2𝐀𝐃1/2\displaystyle\mathbf{I}-\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}
=\displaystyle=\;\underbrace{\mathbf{I}+\epsilon^{\prime}\mathbf{D}}_{\mathbf{A}_{0}}+\epsilon^{\prime}\Big(\underbrace{-(\epsilon^{\prime}d_{min})^{-1}\mathbf{R}^{-\frac{1}{2}}\mathbf{A}\mathbf{R}^{-\frac{1}{2}}-\mathbf{D}}_{\mathbf{A}_{1}}\Big)
=\displaystyle= 𝐀𝟎+ϵ𝐀𝟏\displaystyle\mathbf{A_{0}}+\epsilon^{\prime}\mathbf{A_{1}}

Next, we perform a matrix perturbation analysis and define

U0=I,Λ0=A0,Λ=Λ0+ϵΛ1,U=U0+ϵU1\displaystyle U_{0}=I,\;\Lambda_{0}=A_{0},\;\Lambda=\Lambda_{0}+\epsilon^{\prime}\Lambda_{1},\;U=U_{0}+\epsilon^{\prime}U_{1}

where Λ and U approximate the eigenvalues and eigenvectors of ℒ. Denote (U_1)_k as the k-th column of U_1 and assume d_i ≠ d_j for i ≠ j; then

(U1)k=ik(U0)iA1(U0)k(Λ0)k(Λ0)i(U0)i=ikAik(ε)2(didk)didkei\left(U_{1}\right)_{k}=\sum_{i\neq k}\frac{\left(U_{0}\right)_{i}^{\top}A_{1}\left(U_{0}\right)_{k}}{\left(\Lambda_{0}\right)_{k}-\left(\Lambda_{0}\right)_{i}}\left(U_{0}\right)_{i}=\sum_{i\neq k}\frac{A_{ik}}{\left(\varepsilon^{\prime}\right)^{2}\left(d_{i}-d_{k}\right)\sqrt{d_{i}}\sqrt{d_{k}}}e_{i}
(U)k=(U0)k+ε(U1)k(U)_{k}=\left(U_{0}\right)_{k}+\varepsilon^{\prime}\left(U_{1}\right)_{k}

To satisfy ‖U_k‖_2 = 1, we rescale by a factor τ as

(τU)k=(τU0)k+ε(τU1)k\displaystyle(\tau U)_{k}=\left(\tau U_{0}\right)_{k}+\varepsilon^{\prime}\left(\tau U_{1}\right)_{k}
 recalculate U1 by τU0 as (U1)k=ikAikτ3(ε)2(didk)didkei\displaystyle\Rightarrow\text{ recalculate }U_{1}\text{ by }\tau U_{0}\text{ as }\left(U_{1}\right)_{k}=\sum_{i\neq k}\frac{A_{ik}\tau^{3}}{(\varepsilon^{\prime})^{2}\left(d_{i}-d_{k}\right)\sqrt{d_{i}}\sqrt{d_{k}}}e_{i}
(τU)k=τek+ikAikτ4ε(didk)didkei, choose ε=τ3×1dkdmin 32\displaystyle\Rightarrow\left(\tau U\right)_{k}=\tau e_{k}+\sum_{i\neq k}\frac{A_{ik}\tau^{4}}{\varepsilon^{\prime}\left(d_{i}-d_{k}\right)\sqrt{d_{i}}\sqrt{d_{k}}}e_{i},\text{ choose }\varepsilon^{\prime}=\tau^{3}\times\frac{1}{\sqrt{d_{k}}d_{\text{min }}^{\frac{3}{2}}}
 then τUk2=1 if τ=11+ikAik(didkdmin )2didmin\displaystyle\Rightarrow\text{ then }\left\|\tau U_{k}\right\|_{2}=1\text{ if }\tau=\frac{1}{\sqrt{1+\sum_{i\neq k}\frac{A_{ik}}{\left(\frac{d_{i}-d_{k}}{d_{\text{min }}}\right)^{2}\frac{d_{i}}{d_{\text{min }}}}}}

Then we consider the normalized τU as U in the following analysis. Assume (U)_k is the eigenvector corresponding to the smallest non-zero eigenvalue; then at the (t−1)-th step,

𝐄(Δ)=𝐄(𝐔ik2)\displaystyle\mathbf{E}(\Delta)=\mathbf{E}(\mathbf{U}^{2}_{ik})

where 𝐄\mathbf{E} is in terms of the randomness in the proposed sampling procedure.

Based on the approximation

\displaystyle(U)_{k}=\tau e_{k}+\sum_{i\neq k}\frac{A_{ik}\tau}{\frac{d_{i}-d_{k}}{d_{\min}}\times\sqrt{\frac{d_{i}}{d_{\min}}}}\,e_{i}

then

#|{𝐔ik0}i=1n|=1+dk>dmin\displaystyle\#\big{|}\{\mathbf{U}_{ik}\neq 0\}^{n}_{i=1}\big{|}=1+d_{k}>d_{\min}

Define S = {i ∈ {1, 2, ⋯, n} : 𝐔_{ik} ≠ 0} and 𝐏_1 = P(the node selected to query ∉ S). Take p = min_{i∈S}𝐏(i) and q = max_{i∉S}𝐏(i), where 𝐏(·) denotes the probability of being selected into the candidate set of size m. Denote k = d_min; we first upper bound 𝐏_1 as

\displaystyle\mathbf{P}_{1}\leq\frac{\binom{n-k}{m}q^{m}(1-q)^{n-k-m}(1-p)^{k}}{\sum_{i=0}^{k}\binom{n-k}{m-i}q^{m-i}(1-q)^{n-k-(m-i)}\binom{k}{i}p^{i}(1-p)^{k-i}}
=\displaystyle= (nkm)i=0k(nkmi)(ki)ηi, where η=p(1q)(1p)q\displaystyle\frac{\binom{n-k}{m}}{\sum_{i=0}^{k}\binom{n-k}{m-i}\binom{k}{i}\eta^{i}}\text{, where }\eta=\frac{p(1-q)}{(1-p)q}

We calculate the denominator as

i=0k(nkmi)(ki)ηi=c0i=0k(nkmi)2(ki)2i=0kη2i\displaystyle\sum_{i=0}^{k}\binom{n-k}{m-i}\binom{k}{i}\eta^{i}=c_{0}\sqrt{\sum_{i=0}^{k}\binom{n-k}{m-i}^{2}\binom{k}{i}^{2}}\sqrt{\sum_{i=0}^{k}\eta^{2i}}
c0k+1(nm)1η2k+21η2>c0k+1(nm).\displaystyle\geq\frac{c_{0}}{\sqrt{k+1}}\binom{n}{m}\sqrt{\frac{1-\eta^{2k+2}}{1-\eta^{2}}}>\frac{c_{0}}{\sqrt{k+1}}\binom{n}{m}.

In addition, by Stirling’s approximation, when nn is large

(nm)\displaystyle\binom{n}{m}\sim n2πm(nm)nnmm(nm)nm\displaystyle\sqrt{\frac{n}{2\pi m(n-m)}}\cdot\frac{n^{n}}{m^{m}(n-m)^{n-m}}
(nkm)\displaystyle\;\binom{n-k}{m}\sim nk2πm(nkm)(nk)nkmm(nkm)nkm\displaystyle\sqrt{\frac{n-k}{2\pi m(n-k-m)}}\cdot\frac{(n-k)^{n-k}}{m^{m}(n-k-m)^{n-k-m}}

then combining the above simplification, we have

𝐏1<nknnmnkm(nk)nknn(nm)nm(nkm)nkm×k+1c0\displaystyle\mathbf{P}_{1}<\sqrt{\frac{n-k}{n}\cdot\frac{n-m}{n-k-m}}\cdot\frac{(n-k)^{n-k}}{n^{n}}\cdot\frac{(n-m)^{n-m}}{(n-k-m)^{n-k-m}}\times\frac{\sqrt{k+1}}{c_{0}}
(nkmnm)m(nkmnk)k×k+1c0\displaystyle\leq\left(\frac{n-k-m}{n-m}\right)^{m}\left(\frac{n-k-m}{n-k}\right)^{k}\times\frac{\sqrt{k+1}}{c_{0}}

then we can lower bound the expected value of information gain as

\displaystyle\mathbf{E}(\mathbf{U}^{2}_{ik})=\sum_{i\in S}\mathbf{P}(i)\mathbf{U}^{2}_{ik}\geq(1-\mathbf{P}_{1})\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right)

Notice that 𝐏_1 is a monotone decreasing function of m for fixed n and k; hence we can select an m_0 such that

(nkm0nm0)m0(nkm0nk)kkδc0,\left(\frac{n-k-m_{0}}{n-m_{0}}\right)^{m_{0}}\left(\frac{n-k-m_{0}}{n-k}\right)^{k}\sqrt{k}\leq\frac{\delta}{c_{0}},

where δ<1\delta<1 is a constant, therefore 𝐄(𝐔ik2)>(1δ)miniS(𝐔ik2)\mathbf{E}(\mathbf{U}^{2}_{ik})>(1-\delta)\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right).

Next we lower bound the quantity miniS(𝐔ik2)\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right). Denote η1:=maxi(didmin)\eta_{1}:=\max_{i}(\frac{d_{i}}{d_{min}}) and η0:=#|{i:|didkdmin|1}|\eta_{0}:=\#|\{i:|\frac{d_{i}-d_{k}}{d_{min}}|\leq 1\}|, we calculate the lower bound for miniS(𝐔ik2)\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right) as

miniS(𝐔ik2)\displaystyle\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right)\geq min(τ,τ2(didk)2dmin2×didmin)iS\displaystyle\min\left(\tau,\frac{\tau^{2}}{\frac{(d_{i}-d_{k})^{2}}{{d}^{2}_{min}}\times\frac{d_{i}}{d_{min}}}\right)\;\forall i\in S
\displaystyle\geq 1(didk)2dmin2didmin+1+(didk)2dmin2didminj1,ji1(djdi)2dmin2djdmin\displaystyle\frac{1}{\frac{(d_{i}-d_{k})^{2}}{{d}^{2}_{min}}\frac{d_{i}}{d_{min}}+1+\frac{(d_{i}-d_{k})^{2}}{{d}^{2}_{min}}\frac{d_{i}}{d_{min}}\displaystyle\sum_{j\neq 1,j\neq i}\frac{1}{\frac{(d_{j}-d_{i})^{2}}{{d}^{2}_{min}}\frac{d_{j}}{d_{min}}}}
\displaystyle\geq 1(didk)2dmin2didmin(1+on(1)+j1,ji1(djdi)2dmin2djdmin)\displaystyle\frac{1}{\frac{(d_{i}-d_{k})^{2}}{{d}^{2}_{min}}\frac{d_{i}}{d_{min}}\left(1+o_{n}(1)+\displaystyle\sum_{j\neq 1,j\neq i}\frac{1}{\frac{(d_{j}-d_{i})^{2}}{{d}^{2}_{min}}\frac{d_{j}}{d_{min}}}\right)}
\displaystyle\approx 1η13(1+j{i:didkdmin1}dmin2+j{i:didkdmin<1}1)\displaystyle\frac{1}{\eta_{1}^{3}\left(1+\displaystyle\sum_{j\in\{i:\mid\frac{d_{i}-d_{k}}{d_{min}}\mid\leq 1\}}d^{2}_{min}+\displaystyle\sum_{j\notin\{i:\mid\frac{d_{i}-d_{k}}{d_{min}}\mid<1\}}1\right)}
\displaystyle\geq 1η13(η0dmin2+dminη0)\displaystyle\frac{1}{\eta_{1}^{3}(\eta_{0}d^{2}_{min}+d_{min}-\eta_{0})}

which implies

miniS(𝐔ik2)1η13(η0dmin2+dminη0)\displaystyle\min_{i\in S}\left(\mathbf{U}_{ik}^{2}\right)\geq\frac{1}{\eta_{1}^{3}(\eta_{0}d^{2}_{min}+d_{min}-\eta_{0})}

As a result, as long as mm0m\geq m_{0} we have

𝐄(𝐔ik2)1δη13(η0dmin2+dminη0).\displaystyle\mathbf{E}(\mathbf{U}^{2}_{ik})\geq\frac{1-\delta}{\eta_{1}^{3}(\eta_{0}d^{2}_{min}+d_{min}-\eta_{0})}.

A.3 Proof of Theorem 4.2

Proof: Based on the assumption that 𝐟 ∈ Proj_{𝐋_{ω_0}}span(𝐗), we denote d_0 = |{1 ≤ j ≤ n ∣ λ_j ≤ ω_0}| and 𝐔_{d_0} = (U_1, U_2, ⋯, U_{d_0}). Therefore, we can represent 𝐟 = 𝐔_{d_0}𝐔_{d_0}^T𝐗β for some parameter β ∈ ℝ^{p×1}, and ⟨𝐟, U_i⟩ = U_i^T𝐗β. For the query set 𝒮 and the corresponding bandwidth frequency ω ≤ ω_0, we similarly denote d = |{1 ≤ j ≤ n ∣ λ_j ≤ ω}| ≤ d_0 and 𝐔_d = (U_1, U_2, ⋯, U_d). We denote 𝐕_{n×r_d} = 𝐔_d V_1 as a basis of Proj_{𝐋_ω}span(𝐗), where V_1 is obtained from the SVD 𝐔_d^T𝐗 = V_1 Σ V_2^T, with (V_1)_{d×r_d} and (V_2)_{p×r_d} the left and right singular vectors, respectively, and Σ_{r_d×r_d} the diagonal matrix containing the r_d positive singular values, r_d ≤ min{d, p}. The estimation (11) at the end of Section 3 is equivalent to the weighted regression problem on {(𝐕_{i·}, Y_i, s_i)}_{i∈𝒮},

f~=\displaystyle\tilde{f}= argminf~Proj𝐋ωspan(𝐗)i𝒮si|Yif~(i)|2\displaystyle\underset{\tilde{f}\in\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X})}{\operatorname*{argmin}}\sum_{i\in\mathcal{S}}s_{i}|Y_{i}-\tilde{f}(i)|^{2}
\displaystyle\Rightarrow argminαrd×1i=1|siYi(si𝐕1(i),,si𝐕rd(i))α|2,\displaystyle\underset{\alpha\in\mathbb{R}^{r_{d}\times 1}}{\operatorname*{argmin}}\sum_{i=1}^{\mathcal{B}}|\sqrt{s_{i}}Y_{i}-(\sqrt{s_{i}}\mathbf{V}_{1}(i),\ldots,\sqrt{s_{i}}\mathbf{V}_{r_{d}}(i))\alpha|^{2},

where |𝒮|=|\mathcal{S}|=\mathcal{B}. We have the least squares solution

\displaystyle\alpha(\tilde{f})=\left(A^{\top}A\right)^{-1}A^{\top}WY_{\mathcal{B}}\quad\text{(E1)}

where

\displaystyle A=\begin{pmatrix}\sqrt{s_{1}}\mathbf{V}_{1}(1)&\cdots&\sqrt{s_{1}}\mathbf{V}_{r_{d}}(1)\\ \vdots&\ddots&\vdots\\ \sqrt{s_{\mathcal{B}}}\mathbf{V}_{1}\left(\mathcal{B}\right)&\cdots&\sqrt{s_{\mathcal{B}}}\mathbf{V}_{r_{d}}\left(\mathcal{B}\right)\\ \end{pmatrix}\;\;\;\text{and}\;\;\;W=\text{diag}(\sqrt{s_{1}},\ldots,\sqrt{s_{\mathcal{B}}}) (19)
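For concreteness, a minimal sketch of how the weighted least-squares estimate in (E1) is assembled from the basis 𝐕, the queried nodes, and the weights in (19); the inputs and variable names are illustrative assumptions.

```python
import numpy as np

def weighted_ls_in_basis(V, Y, queried, s):
    """Solve alpha = (A^T A)^{-1} A^T W Y_B as in (E1), with A and W as in (19).

    V       : (n, r_d) basis of Proj_{L_omega} span(X)
    Y       : (n,) responses; only the queried entries are used
    queried : indices of the queried nodes (the set S, with |S| = B)
    s       : (B,) sampling weights s_i for the queried nodes
    """
    sqrt_s = np.sqrt(np.asarray(s, dtype=float))
    A = sqrt_s[:, None] * np.asarray(V)[queried]   # rows sqrt(s_i) * V(i, .)
    W = np.diag(sqrt_s)
    Y_B = np.asarray(Y)[queried]
    alpha = np.linalg.solve(A.T @ A, A.T @ W @ Y_B)
    f_tilde = np.asarray(V) @ alpha                # recovered signal on all nodes
    return alpha, f_tilde
```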

We assume Y=𝐟+εY=\mathbf{f}+\varepsilon, where E(ε)=0E\left(\varepsilon\right)=0 and Var(ε)=σ2Var{(\varepsilon)}=\sigma^{2}. Notice the oracle 𝐟\mathbf{f} satisfies

\displaystyle\mathbf{f}=\underset{f\in\text{Proj}_{\mathbf{L}_{\omega_{0}}}\text{span}(\mathbf{X})}{\operatorname*{argmin}}\sum_{i=1}^{n}\mathbf{E}_{Y}\left(Y_{i}-f(i)\right)^{2}

We decompose the space Proj𝐋ω0span(𝐗)\text{Proj}_{\mathbf{L}_{\omega_{0}}}\text{span}(\mathbf{X}) as

Proj𝐋ω0span(𝐗)=Proj𝐋ωspan(𝐗)(Proj𝐋ωspan(𝐗))c\displaystyle\text{Proj}_{\mathbf{L}_{\omega_{0}}}\text{span}(\mathbf{X})=\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X})\bigoplus\big{(}\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X})\big{)}^{c}

Then we decompose 𝐟=𝐟1+𝐟2,where𝐟1Proj𝐋ωspan(𝐗),𝐟2(Proj𝐋ωspan(𝐗))c\mathbf{f}=\mathbf{f}_{1}+\mathbf{f}_{2},\;\text{where}\;\mathbf{f}_{1}\in\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X}),\;\mathbf{f}_{2}\in\big{(}\text{Proj}_{\mathbf{L}_{\omega}}\text{span}(\mathbf{X})\big{)}^{c}, then

𝐟1=argminfProj𝐋ω{span(𝐗)}i=1n𝐄Y(Yif(i))2\displaystyle\mathbf{f}_{1}=\underset{f\in\text{Proj}_{\mathbf{L}_{\omega}}\{\text{span}(\mathbf{X})\}}{\operatorname*{argmin}}\sum_{i=1}^{n}\mathbf{E}_{Y}\left(Y_{i}-f\left(i\right)\right)^{2}

Then we can represent 𝐟_1(i) = (𝐕_1(i), …, 𝐕_{r_d}(i))α(𝐟_1); by solving the weighted least squares problem Aα(𝐟_1) = W(𝐟_1)_{ℬ}, we have

\displaystyle\alpha(\mathbf{f}_{1})=\left(A^{\top}A\right)^{-1}A^{\top}W(\mathbf{f}_{1})_{\mathcal{B}}\quad\text{(E2)}

From (E1) and (E2), we have

i=1n|f~(i)𝐟(i)|2=\displaystyle\sum_{i=1}^{n}\left|\tilde{f}\left(i\right)-\mathbf{f}\left(i\right)\right|^{2}= f~𝐟22f~𝐟122+𝐟1𝐟22\displaystyle\|\tilde{f}-\mathbf{f}\|_{2}^{2}\leq\|\tilde{f}-\mathbf{f}_{1}\|_{2}^{2}+\|\mathbf{f}_{1}-\mathbf{f}\|_{2}^{2}
\displaystyle\leq α(f~)α(𝐟1)22+𝐟1𝐟22\displaystyle\left\|\alpha\left(\tilde{f}\right)-\alpha(\mathbf{f}_{1})\right\|_{2}^{2}+\|\mathbf{f}_{1}-\mathbf{f}\|_{2}^{2}
\displaystyle\leq (AA)1AW(Y(𝐟1))22+𝐟1𝐟22\displaystyle\left\|\left(A^{\top}A\right)^{-1}A^{\top}W\left(Y_{\mathcal{B}}-\left(\mathbf{f}_{1}\right)_{\mathcal{B}}\right)\right\|_{2}^{2}+\|\mathbf{f}_{1}-\mathbf{f}\|_{2}^{2}
\displaystyle\leq\lambda_{max}\left((A^{\top}A)^{-1}\right)\left\|A^{\top}W\left(Y_{\mathcal{B}}-\left(\mathbf{f}_{1}\right)_{\mathcal{B}}\right)\right\|_{2}^{2}+\|\mathbf{f}_{1}-\mathbf{f}\|_{2}^{2}

Denote gi=Yi𝐟1(i)g_{i}=Y_{i}-\mathbf{f}_{1}(i), we have

𝐄𝒮AWg2=\displaystyle\mathbf{E}_{\mathcal{S}}\|A^{\top}Wg\|^{2}= i=1rd𝐄𝒮[j=1sj2𝐕i2(j)|gj|2]\displaystyle\sum_{i=1}^{r_{d}}\mathbf{E}_{\mathcal{S}}\left[\sum_{j=1}^{\mathcal{B}}s_{j}^{2}\mathbf{V}_{i}^{2}(j)|g_{j}|^{2}\right]

Denote α_j = s_j p_j for j ∈ 𝒮, where p_j is the probability of node j being selected to query. Since

𝐄𝒮[sj𝐕i(j)(Yj𝐟1(j))]=\displaystyle\mathbf{E}_{\mathcal{S}}\left[s_{j}\mathbf{V}_{i}(j)(Y_{j}-\mathbf{f}_{1}(j))\right]= αj𝐄𝒮[1pj𝐕i(j)(Yj𝐟1(j))]\displaystyle\alpha_{j}\mathbf{E}_{\mathcal{S}}\left[\frac{1}{p_{j}}\mathbf{V}_{i}(j)(Y_{j}-\mathbf{f}_{1}(j))\right]
=\displaystyle= αjl=1n𝐄Y[𝐕i(l)(Yl𝐟1(l))]\displaystyle\alpha_{j}\sum_{l=1}^{n}\mathbf{E}_{Y}\left[\mathbf{V}_{i}(l)(Y_{l}-\mathbf{f}_{1}(l))\right]
=\displaystyle= 0since 𝐕i𝐟2\displaystyle 0\;\;\text{since }\mathbf{V}_{i}\perp\mathbf{f}_{2}

we have

\displaystyle\mathbf{E}_{\mathcal{S}}\|A^{\top}Wg\|^{2}=\sum_{j=1}^{\mathcal{B}}\mathbf{E}_{\mathcal{S}}\left[\sum_{i=1}^{r_{d}}s_{j}^{2}\mathbf{V}_{i}^{2}(j)|g_{j}|^{2}\right]\leq\underset{j}{\sup}\left(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\right)\times\sum_{j=1}^{\mathcal{B}}\mathbf{E}_{\mathcal{S}}\left(s_{j}g^{2}_{j}\right)
=\displaystyle= sup𝑗(sji=1rd|𝐕i(j)|2)×j=1αj𝐄𝒮(1pjgj2)\displaystyle\underset{j}{\sup}\left(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\right)\times\sum_{j=1}^{\mathcal{B}}\alpha_{j}\mathbf{E}_{\mathcal{S}}\left(\frac{1}{p_{j}}g^{2}_{j}\right)
=\displaystyle= sup𝑗(sji=1rd|𝐕i(j)|2)×j=1αj×l=1n𝐄Y(Yl𝐟1(l))2\displaystyle\underset{j}{\sup}\left(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\right)\times\sum_{j=1}^{\mathcal{B}}\alpha_{j}\times\sum_{l=1}^{n}\mathbf{E}_{Y}(Y_{l}-\mathbf{f}_{1}(l))^{2}

Notice that

l=1n𝐄Y(Yl𝐟1(l))2=l=1n𝐄Y(Yl𝐟(l))2+l=1n(𝐟(l)𝐟1(l))2=nσ2+𝐟𝐟122\displaystyle\sum_{l=1}^{n}\mathbf{E}_{Y}(Y_{l}-\mathbf{f}_{1}(l))^{2}=\sum_{l=1}^{n}\mathbf{E}_{Y}(Y_{l}-\mathbf{f}(l))^{2}+\sum_{l=1}^{n}(\mathbf{f}(l)-\mathbf{f}_{1}(l))^{2}=n\sigma^{2}+\|\mathbf{f}-\mathbf{f}_{1}\|_{2}^{2}

Notice that 𝐟 = 𝐔_{d_0}𝐔_{d_0}^T𝐗β and 𝐟_1 = 𝐔_d𝐔_d^T𝐟; then 𝐟 − 𝐟_1 = 𝐔_{d'}𝐔_{d'}^T𝐗β = Σ_{i>d}⟨𝐟, U_i⟩U_i, where 𝐔_{d'} = (U_{d+1}, ⋯, U_{d_0}). Therefore, ‖𝐟 − 𝐟_1‖²_2 = Σ_{i>d, i∈supp(𝐟)}⟨𝐟, U_i⟩². We first state and then prove the following Lemma A.1.

Lemma A.1.

For the output of Algorithm 1 with query budget \mathcal{B}, we have

i=1αi43,supj𝒮(sji=1rd|𝐕i(j)|2)10δ,and\displaystyle\sum_{i=1}^{\mathcal{B}}\alpha_{i}\leq\frac{4}{3},\;\underset{j\in\mathcal{S}}{\sup}\left(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\right)\leq 10\delta,\;\text{and}
λ(AA)[12×1(1+mδC0)2,83×11(mδC0)2]with probability 12C,\displaystyle\lambda(A^{\top}A)\in\Big{[}\frac{1}{2}\times\frac{1}{\left(1+\frac{m\sqrt{\delta}}{C_{0}}\right)^{2}},\frac{8}{3}\times\frac{1}{1-\left(\frac{m\sqrt{\delta}}{C_{0}}\right)^{2}}\Big{]}\;\text{with probability}\;1-\frac{2}{C},

where \delta=\frac{r_{d}CC_{0}^{2}}{m\mathcal{B}}, and C_{0} is a constant such that C^{2}_{0}=\left(m\right)^{2}+\max(2,\tfrac{16}{r_{d}})\,m.

Using Lemma A.1, we have

\displaystyle\mathbf{E}\|\tilde{f}-\mathbf{f}\|_{2}^{2}\leq\lambda_{\min}\left(A^{\top}A\right)^{-1}\times\left(\sum_{j=1}^{\mathcal{B}}\alpha_{j}\right)\times\underset{j\in\mathcal{S}}{\sup}\left(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\right)\times\sum_{j=1}^{n}\mathbf{E}_{Y}(Y_{j}-\mathbf{f}_{1}(j))^{2}+\|\mathbf{f}-\mathbf{f}_{1}\|_{2}^{2}
\displaystyle\leq 2(1+mδC0)2×43×10×CC02rdm×1×j=1n𝐄Y(Yj𝐟1(j))2+𝐟𝐟122\displaystyle 2(1+\frac{m\sqrt{\delta}}{C_{0}})^{2}\times\frac{4}{3}\times 10\times\frac{CC_{0}^{2}r_{d}}{m}\times\frac{1}{\mathcal{B}}\times\sum_{j=1}^{n}\mathbf{E}_{Y}(Y_{j}-\mathbf{f}_{1}(j))^{2}+\|\mathbf{f}-\mathbf{f}_{1}\|_{2}^{2}
\displaystyle\leq O\Big(2\big(\tfrac{r_{d}t}{\mathcal{B}}\big)+\big(\tfrac{r_{d}t}{\mathcal{B}}\big)^{3/2}+\big(\tfrac{r_{d}t}{\mathcal{B}}\big)^{2}\Big)\times\Big(n\sigma^{2}+\sum_{i>d,i\in\text{supp}(\mathbf{f})}\langle U_{i},\mathbf{f}\rangle^{2}\Big)+\sum_{i>d,i\in\text{supp}(\mathbf{f})}\langle U_{i},\mathbf{f}\rangle^{2},

with probability larger than 12mt1-\frac{2m}{t} where t>2mt>2m.

In the following, we prove Lemma A.1, which builds on Theorem 5.2 in [9] and Lemmas 3.5 and 3.6 in [21].

We denote the accumulated covariance matrix at the j-th selection as A_j, the potential function as Φ_{u_j,l_j}(A_j) = Tr[(u_jI − A_j)^{-1}] + Tr[(A_j − l_jI)^{-1}], and R_i(u, l, A) = v_i(uI − A)^{-1}v_i^T + v_i(A − lI)^{-1}v_i^T, where v_i is the i-th row of 𝐕. Notice that Σ_{i=1}^n R_i = Φ_{u,l}(A). At each iteration of Algorithm 1, the i-th node is selected as one of m candidates with p*_i = R_i/Φ. For the m candidates, we define the following probability

qi={1ηif  i has maximum Δi among m candidatesηm1otherwiseq_{i}=\begin{dcases*}1-\eta&\text{if } $i\text{ has maximum }\Delta_{i}\text{ among m candidates}$\\ \frac{\eta}{m-1}&\text{otherwise}\end{dcases*}

where 0 < η < 1. Notice that as η goes to 0, the q_i approximate Step 3 in Algorithm 1. Therefore, the probability of node k being queried is

pk=P(select k)=\displaystyle p_{k}=P(\text{select k})= P( select kQk|k in Bm)P(k in Bm)=pk×qk\displaystyle P(\underbrace{\text{ select k}}_{Q_{k}}|k\text{ in }B_{m})\cdot P({k\text{ in }B_{m}})=p^{*}_{k}\times q_{k}
𝐄(1pkvkvk)=\displaystyle\mathbf{E}\left(\frac{1}{p_{k}}v_{k}v^{\top}_{k}\right)= 𝐄Bm𝐄Qk|Bm(1P(select k)vkvk)\displaystyle\mathbf{E}_{B_{m}}\mathbf{E}_{Q_{k}|B_{m}}\left(\frac{1}{P(\text{select k})}v_{k}v^{\top}_{k}\right)
=\displaystyle= 𝐄Bm(kBmP(select k|kBm)1P(select k)vkvk)\displaystyle\mathbf{E}_{B_{m}}\left(\sum_{k\in B_{m}}P(\text{select }k|k\in B_{m})\cdot\frac{1}{P(\text{select k})}v_{k}v^{\top}_{k}\right)
=\displaystyle= 𝐄Bm(kBm1P(kBm)vkvk)\displaystyle\mathbf{E}_{B_{m}}\left(\sum_{k\in B_{m}}\frac{1}{P(k\in B_{m})}v_{k}v^{\top}_{k}\right)
=\displaystyle= ΩP(Bm)×kBm1P(kBm)vkvk\displaystyle\sum_{\Omega}P(B_{m})\times\sum_{k\in B_{m}}\frac{1}{P(k\in B_{m})}v_{k}v^{\top}_{k}
=\displaystyle= BmΩkBmP(BmkBm)vkvk\displaystyle\sum_{B_{m}\in\Omega}\sum_{k\in B_{m}}P(B_{m}\mid k\in B_{m})\cdot v_{k}v^{\top}_{k}

where Ω denotes all C_n^m possible candidate sets formed by choosing m nodes from n nodes, and P(B_m ∣ k ∈ B_m) denotes the probability of selecting the remaining m−1 nodes into B_m conditional on k ∈ B_m. Denote Ω_k as the collection of all size-m candidate sets that contain node k. Then

𝐄(1pkvkvk)=k=1n(Bm1kΩkP(BmkBm))vkvk=k=1nvkvk=I\displaystyle\mathbf{E}\left(\frac{1}{p_{k}}v_{k}v^{\top}_{k}\right)=\sum_{k=1}^{n}\left(\sum_{B^{k}_{m-1}\subset\Omega_{k}}P\left(B_{m}\mid k\in B_{m}\right)\right)\cdot v_{k}v^{\top}_{k}=\sum_{k=1}^{n}v_{k}v^{\top}_{k}=I
ϵ(Ri)pkvkvk=ϵ(Ri)pkqkvkvk=\displaystyle\frac{\epsilon}{(\sum R_{i})p_{k}}v_{k}v^{\top}_{k}=\frac{\epsilon}{(\sum R_{i})p^{*}_{k}q_{k}}v_{k}v^{\top}_{k}= ϵRkqkvkvk\displaystyle\frac{\epsilon}{R_{k}q_{k}}v_{k}v^{\top}_{k}
\displaystyle\preceq ϵ(uIA)1qk\displaystyle\epsilon(uI-A)\frac{1}{q_{k}}
\displaystyle\leq mϵη(uIA)\displaystyle\frac{m\epsilon}{\eta}(uI-A)

where we use the fact vvT(vTB1v)Bvv^{T}\preceq(v^{T}B^{-1}v)B for any semi-positive definite matrix BB. In addition,

𝐄(ϵ(Ri)pkvkvk)=ϵRiI=ϵΦu,l(A)I\displaystyle\mathbf{E}\left(\frac{\epsilon}{(\sum R_{i})p_{k}}v_{k}v^{\top}_{k}\right)=\frac{\epsilon}{\sum R_{i}}I=\frac{\epsilon}{\Phi_{u,l}(A)}I

for any k=1,,nk=1,\cdots,n. Denote wk=ϵRipkvkw_{k}=\sqrt{\frac{\epsilon}{\sum R_{i}p_{k}}}v_{k}, then wkwkmϵη(uIA)w_{k}w^{\top}_{k}\preceq\frac{m\epsilon}{\eta}(uI-A), which implies for any k[1,n]k\in[1,n]

wk(uIA)1wkwk(uIA)1wk\displaystyle w^{\top}_{k}(uI-A)^{-1}w_{k}w^{\top}_{k}(uI-A)^{-1}w_{k}\leq mϵηwk(uIA)1wk\displaystyle\frac{m\epsilon}{\eta}w^{\top}_{k}(uI-A)^{-1}w_{k}
wk(uIA)1wk\displaystyle\Rightarrow\;\;\;w^{\top}_{k}(uI-A)^{-1}w_{k}\leq mϵη\displaystyle\frac{m\epsilon}{\eta}

Similarly, we have

wk(AlI)1wk\displaystyle w^{\top}_{k}(A-lI)^{-1}w_{k}\leq mϵη\displaystyle\frac{m\epsilon}{\eta}

Then from Lemmas 3.3 and 3.4 in [4], we have

\text{Tr}\big{[}(uI-A-w_{k}w_{k}^{\top})^{-1}\big{]}\leq\;\text{Tr}\big{[}(uI-A)^{-1}\big{]}+\frac{w_{k}^{\top}(uI-A)^{-2}w_{k}}{1-\frac{m\epsilon}{\eta}} (20)
\text{Tr}\big{[}(A+w_{k}w_{k}^{\top}-lI)^{-1}\big{]}\leq\;\text{Tr}\big{[}(A-lI)^{-1}\big{]}-\frac{w_{k}^{\top}(A-lI)^{-2}w_{k}}{1+\frac{m\epsilon}{\eta}} (21)

Define ϵ=mϵη\epsilon^{\prime}=\frac{m\epsilon}{\eta}, we show in the following that 𝐄(Φuj,lj(Aj))Φuj1,lj1(Aj1)\mathbf{E}\left(\Phi_{u_{j},l_{j}}(A_{j})\right)\leq\Phi_{u_{j-1},l_{j-1}}(A_{j-1}).

From (20) we have

\displaystyle\Phi_{u_{j},l_{j}}(A_{j})\leq\Phi_{u_{j},l_{j}}(A_{j-1})+\frac{w_{j-1}^{\top}(u_{j}I-A_{j-1})^{-2}w_{j-1}}{1-\epsilon^{\prime}}-\frac{w_{j-1}^{\top}(A_{j-1}-l_{j}I)^{-2}w_{j-1}}{1+\epsilon^{\prime}}\quad\text{(E3)}

Define Δu=ujuj1=ϵ(1ϵ)Rj\Delta_{u}=u_{j}-u_{j-1}=\frac{\epsilon}{(1-\epsilon^{\prime})\sum R_{j}} and Δl=ljlj1=ϵ(1+ϵ)Rj\Delta_{l}=l_{j}-l_{j-1}=\frac{\epsilon}{(1+\epsilon^{\prime})\sum R_{j}}. Notice that

uTr(uIA)1=Tr(uIA)2<0\displaystyle\frac{\partial}{\partial u}\text{Tr}(uI-A)^{-1}=-\text{Tr}(uI-A)^{-2}<0
\displaystyle\frac{\partial}{\partial l}\text{Tr}(A-lI)^{-1}=\text{Tr}(A-lI)^{-2}>0

at each step, based on the design of u_j and l_j, and Φ_{u,l}(A) is convex in u and l. From (E3), we have

Φuj,lj(Aj)Φuj,lj(Aj1)+\displaystyle\Phi_{u_{j},l_{j}}(A_{j})\leq\Phi_{u_{j},l_{j}}(A_{j-1})+ 11ϵTr[(ujIAj1)2wj1wj1]\displaystyle\frac{1}{1-\epsilon^{\prime}}\text{Tr}\big{[}(u_{j}I-A_{j-1})^{-2}w_{j-1}w^{\top}_{j-1}\big{]}
\displaystyle- 11+ϵTr[(Aj1ljI)2wj1wj1]\displaystyle\frac{1}{1+\epsilon^{\prime}}\text{Tr}\big{[}(A_{j-1}-l_{j}I)^{-2}w_{j-1}w^{\top}_{j-1}\big{]}

then with 𝐄(wkwkT)=ϵRiI\mathbf{E}(w_{k}w_{k}^{T})=\frac{\epsilon}{\sum R_{i}}I

E(Φuj,lj(Aj))\displaystyle\textbf{E}\left(\Phi_{u_{j},l_{j}}(A_{j})\right) Φuj,lj(Aj1)+ϵ(1ϵ)RiTr[(ujIAj1)2]\displaystyle\leq\Phi_{u_{j},l_{j}}(A_{j-1})+\frac{\epsilon}{(1-\epsilon^{\prime})\sum R_{i}}\text{Tr}\big{[}(u_{j}I-A_{j-1})^{-2}\big{]}
ϵ(1+ϵ)RiTr[(Aj1ljI)2]\displaystyle\quad-\frac{\epsilon}{(1+\epsilon^{\prime})\sum R_{i}}\text{Tr}\big{[}(A_{j-1}-l_{j}I)^{-2}\big{]}
Φuj,lj(Aj1)+ΔuTr[(ujIAj1)2]\displaystyle\leq\Phi_{u_{j},l_{j}}(A_{j-1})+\Delta_{u}\text{Tr}\big{[}(u_{j}I-A_{j-1})^{-2}\big{]}
ΔlTr[(Aj1lj)2]\displaystyle\quad-\Delta_{l}\text{Tr}\big{[}(A_{j-1}-l_{j})^{-2}\big{]}

Define

f(t)=Tr[(uj1+tΔu)IAj1]1+Tr[Aj1(lj1+Δlt)I]1\displaystyle f(t)=\text{Tr}\big{[}(u_{j-1}+t\cdot\Delta_{u})I-A_{j-1}\big{]}^{-1}+\text{Tr}\big{[}A_{j-1}-(l_{j-1}+\Delta_{l}\cdot t)I\big{]}^{-1}

then

f(t)t=ΔuTr[(uj1+tΔu)IAj1]2+ΔlTr[Aj1(lj1+Δlt)I]2\displaystyle\frac{\partial f(t)}{\partial t}=-\Delta_{u}\text{Tr}\big{[}(u_{j-1}+t\cdot\Delta_{u})I-A_{j-1}\big{]}^{-2}+\Delta_{l}\text{Tr}\big{[}A_{j-1}-(l_{j-1}+\Delta_{l}\cdot t)I\big{]}^{-2}

Since f(t)f(t) is convex, we have

f(t)t|t=1f(1)f(0)=Φuj,lj(Aj1)Φuj1,lj1(Aj1)\displaystyle\frac{\partial f(t)}{\partial t}\bigg{|}_{t=1}\geq f(1)-f(0)=\Phi_{u_{j},l_{j}}(A_{j-1})-\Phi_{u_{j-1},l_{j-1}}(A_{j-1}) (23)

Then, plugging in (23), we have

E(Φuj,lj(Aj))Φuj1,lj1(Aj1)\displaystyle\textbf{E}\left(\Phi_{u_{j},l_{j}}(A_{j})\right)\leq\Phi_{u_{j-1},l_{j-1}}(A_{j-1})

Notice that for selection

ΔujΔljΔuj=εt(1ε)εt(1+ε)εt(1ε)=1(1ε)1(1+ε)1(1ε)2ε\frac{\Delta_{u_{j}}-\Delta_{l_{j}}}{\Delta_{u_{j}}}=\frac{\frac{\varepsilon}{t\left(1-\varepsilon^{\prime}\right)}-\frac{\varepsilon}{t\left(1+\varepsilon^{\prime}\right)}}{\frac{\varepsilon}{t\left(1-\varepsilon^{\prime}\right)}}=\frac{\frac{1}{\left(1-\varepsilon^{\prime}\right)}-\frac{1}{(1+\varepsilon^{\prime})}}{\frac{1}{\left(1-\varepsilon^{\prime}\right)}}\leq 2\varepsilon^{\prime}

where t = ∑R_i. We consider that the selection process stops at the first iteration k at which u_k − l_k ≥ 8r_d/ϵ. Notice that u_0 = 2r_d/ε and l_0 = −2r_d/ε; when the process stops with u_k − l_k ≥ 8r_d/ε, we have

uklkuk=\displaystyle\frac{u_{k}-l_{k}}{u_{k}}= (u0l0)+j=0k1(ΔujΔlj)u0+j=0k1Δuj\displaystyle\frac{\left(u_{0}-l_{0}\right)+\sum_{j=0}^{k-1}\left(\Delta_{u_{j}}-\Delta_{l_{j}}\right)}{u_{0}+\sum_{j=0}^{k-1}\Delta_{u_{j}}}
\displaystyle\leq 4rd/ε+j=0k1(ΔujΔlj)2rd/ε+(2ε)1j=0k1(ΔujΔlj)\displaystyle\frac{4r_{d}/\varepsilon+\sum_{j=0}^{k-1}\left(\Delta_{u_{j}}-\Delta_{l_{j}}\right)}{2r_{d}/\varepsilon+\left(2\varepsilon^{\prime}\right)^{-1}\sum_{j=0}^{k-1}\left(\Delta_{u_{j}}-\Delta_{l_{j}}\right)}
\displaystyle\leq 4rd/ε+4rd/ε2rd/ε+(2ε)14rd/ε\displaystyle\frac{4r_{d}/\varepsilon+4r_{d}/\varepsilon}{2r_{d}/\varepsilon+\left(2\varepsilon^{\prime}\right)^{-1}4r_{d}/\varepsilon}
=\displaystyle= 8rd/ε2rd(1+1ε)/ε\displaystyle\frac{8r_{d}/\varepsilon}{2r_{d}\left(1+\frac{1}{\varepsilon^{\prime}}\right)/\varepsilon}
=\displaystyle= 41+1ε4ε\displaystyle\frac{4}{1+\frac{1}{\varepsilon^{\prime}}}\leq 4\varepsilon^{\prime}

Then we have uklk=(1uklkuk)11+4(ε)\frac{u_{k}}{l_{k}}=\left(1-\frac{u_{k}-l_{k}}{u_{k}}\right)^{-1}\leq 1+4\left(\varepsilon^{\prime}\right). Notice that uklk8rdεj=0k1(ΔujΔlj)4rdεu_{k}-l_{k}\geqslant\frac{8r_{d}}{\varepsilon}\implies\sum_{j=0}^{k-1}\left(\Delta_{u_{j}}-\Delta_{l_{j}}\right)\geq\frac{4r_{d}}{\varepsilon}.

Consider at the jjth selection

ΔujΔlj=\displaystyle\Delta_{u_{j}}-\Delta_{l_{j}}= (ϵ1ϵϵ1+ϵ)1Ri\displaystyle\left(\frac{\epsilon}{1-\epsilon^{\prime}}-\frac{\epsilon}{1+\epsilon^{\prime}}\right)\frac{1}{\sum R_{i}}
=\displaystyle=\;\frac{\tilde{\epsilon}}{\Phi_{u_{j},l_{j}}(A_{j})},\;\;\text{where}\;\;\tilde{\epsilon}=\frac{2\epsilon\epsilon^{\prime}}{(1-\epsilon^{\prime})(1+\epsilon^{\prime})}

Then

P\left(\text{the selection process finishes within }\mathcal{B}\text{ selections}\right)\geq P\left(\sum_{j=0}^{\mathcal{B}-1}\frac{\tilde{\epsilon}}{\Phi_{u_{j},l_{j}}(A_{j})}\geq\frac{4r_{d}}{\epsilon}\right)
=\displaystyle= P(j=01Φuj,lj1(Aj)4rdϵ~ϵ)\displaystyle P\left(\sum_{j=0}^{\mathcal{B}-1}\Phi^{-1}_{u_{j},l_{j}}(A_{j})\geq\frac{4r_{d}}{\tilde{\epsilon}\epsilon}\right)
\displaystyle\geq P(2j=01Φuj,lj(Aj)4rdϵ~ϵ)\displaystyle P\left(\frac{\mathcal{B}^{2}}{\sum_{j=0}^{\mathcal{B}-1}\Phi_{u_{j},l_{j}}(A_{j})}\geq\frac{4r_{d}}{\tilde{\epsilon}\epsilon}\right)
=\displaystyle= P(j=0Φuj,lj(Aj)2ϵ~ϵ4rd),\displaystyle P\left(\sum_{j=0}^{\mathcal{B}}\Phi_{u_{j},l_{j}}(A_{j})\leq\frac{\mathcal{B}^{2}\tilde{\epsilon}\epsilon}{4r_{d}}\right),
\displaystyle\geq 1-\frac{4r_{d}}{\mathcal{B}\tilde{\epsilon}},
\displaystyle\geq 12rdmηϵ2\displaystyle 1-\frac{2r_{d}}{\mathcal{B}\cdot\frac{m}{\eta}\cdot\epsilon^{2}}

where we use the result that E(Φuj,lj(Aj))Φu0,l0(A0)\textbf{E}\left(\Phi_{u_{j},l_{j}}(A_{j})\right)\leq\Phi_{u_{0},l_{0}}(A_{0}) by recursively using E(Φuj,lj(Aj))Φuj1,lj1(Aj1)\textbf{E}\left(\Phi_{u_{j},l_{j}}(A_{j})\right)\leq\Phi_{u_{j-1},l_{j-1}}(A_{j-1}) and the fact that Φu0,l0(A0)=ϵ\Phi_{u_{0},l_{0}}(A_{0})=\epsilon.

We consider the following reparametrization:

ϵ=δC0, 0<δ,mid=2rd(1m2ϵ2)/(mϵ2).\displaystyle\epsilon=\frac{\sqrt{\delta}}{C_{0}},\;0<\delta,\;\text{mid}=2r_{d}(1-m^{2}\epsilon^{2})/(m\epsilon^{2}).

For the j-th selection, α_j = (ϵ/Φ_j)·(1/mid). From the previous result, u_k/l_k = (1 − (u_k − l_k)/u_k)^{-1} ≤ 1 + 4ε′ holds with probability 1 − 2/C when δ = CC_0²ηr_d/(mℬ). Notice that u_k = u_0 + Σ_{j=1}^k ϵ/((1−ϵ′)Φ_j) and l_k = l_0 + Σ_{j=1}^k ϵ/((1+ϵ′)Φ_j), and

uk+lk=j=1kϵΦj(11ϵ+11+ϵ)\displaystyle u_{k}+l_{k}=\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}\left(\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}\right)

Then, if the selection process stops at the k-th selection,

Φk2rduklk=\displaystyle\Phi_{k}\geq\frac{2r_{d}}{u_{k}-l_{k}}= 2rd(uk1lk1)+ϵΦk(11ϵ11+ϵ)\displaystyle\frac{2r_{d}}{(u_{k-1}-l_{k-1})+\frac{\epsilon}{\Phi_{k}}\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)}
\displaystyle\geq 2rd8rdϵ+ϵΦk(11ϵ11+ϵ)\displaystyle\frac{2r_{d}}{\frac{8r_{d}}{\epsilon}+\frac{\epsilon}{\Phi_{k}}\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)}

Denote c = 1/(1−ϵ′) − 1/(1+ϵ′); then Φ_k ≥ (1/4)ϵ − (c/8)ϵ². We choose C_0 such that cϵ = 2ϵ′ϵ/(1−(ϵ′)²) < 1, so that Φ_k ≥ ϵ/8.

Therefore,

uklk=uk1lk1+ϵΦk(11ϵ11+ϵ)\displaystyle u_{k}-l_{k}=u_{k-1}-l_{k-1}+\frac{\epsilon}{\Phi_{k}}\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)\leq 8rdϵ+ϵΦk(11ϵ11+ϵ)\displaystyle\frac{8r_{d}}{\epsilon}+\frac{\epsilon}{\Phi_{k}}\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)
\displaystyle\leq 8rdϵ+8(11ϵ11+ϵ)\displaystyle\frac{8r_{d}}{\epsilon}+8\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)
=\displaystyle= 8rdϵ+16ϵ1(ϵ)2\displaystyle\frac{8r_{d}}{\epsilon}+\frac{16\epsilon^{\prime}}{1-(\epsilon^{\prime})^{2}}

Then we choose C_0 large enough that 16ϵ′/(1−(ϵ′)²) < r_d/ϵ, which implies u_k − l_k ≤ 9r_d/ϵ. Given that ϵ′ = (m/η)ϵ and ϵ = √δ/C_0, we choose an appropriate C_0 satisfying the previous requirements on ϵ and ϵ′:

{2ϵϵ<1(ϵ)216ϵ1(ϵ)2<rdϵϵ<1\begin{dcases*}2\epsilon\epsilon^{\prime}<1-(\epsilon^{\prime})^{2}\\ \frac{16\epsilon^{\prime}}{1-(\epsilon^{\prime})^{2}}<\frac{r_{d}}{\epsilon}\\ \epsilon^{\prime}<1\end{dcases*}

Therefore, we choose C0C_{0} such that C02>(mη)2+max(2,16rd)×mηC^{2}_{0}>\left(\frac{m}{\eta}\right)^{2}+\max(2,\frac{16}{r_{d}})\times\frac{m}{\eta}. Notice that

uklk=j=1kϵΦj(11ϵ11+ϵ)4rdϵ\displaystyle u_{k}-l_{k}=\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}\left(\frac{1}{1-\epsilon^{\prime}}-\frac{1}{1+\epsilon^{\prime}}\right)\geq\frac{4r_{d}}{\epsilon}

and mid can be written as mid = (4r_d/ϵ)/(1/(1−ϵ′) − 1/(1+ϵ′)); then Σ_{j=1}^k ϵ/Φ_j ≥ mid. Also,

j=1k1ϵΦjmidmid>\displaystyle\sum_{j=1}^{k-1}\frac{\epsilon}{\Phi_{j}}\leq\text{mid}\Rightarrow\text{mid}> j=1kϵΦj8\displaystyle\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}-8
>\displaystyle>\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}
=\displaystyle= (14ϵϵrd(1(ϵ)2))j=1kϵΦj\displaystyle\left(1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}\right)\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}

which implies

\displaystyle\text{mid}\in\Big{[}1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})},1\Big{]}\cdot\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}=\Big{[}1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})},1\Big{]}\cdot\frac{u_{k}+l_{k}}{\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}}

Notice that for the design matrix A in (19), we have \frac{1}{\sqrt{\text{mid}}}A = A_k, where A_k is the accumulated covariance matrix when the query process stops at the k-th selection. Therefore, the eigenvalues of A satisfy λ(A^TA) = \frac{1}{\text{mid}}λ(A_k^TA_k) ∈ [\frac{l_k}{\text{mid}}, \frac{u_k}{\text{mid}}]. Then

[lkmid,ukmid][lkuk+lk11ϵ+11+ϵ,uk(14ϵϵrd(1(ϵ)2))uk+lk11ϵ+11+ϵ]\displaystyle\Bigg{[}\frac{l_{k}}{\text{mid}},\frac{u_{k}}{\text{mid}}\Bigg{]}\subset\Bigg{[}\frac{l_{k}}{\frac{u_{k}+l_{k}}{\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}}},\frac{u_{k}}{\left(1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}\right)\cdot\frac{u_{k}+l_{k}}{\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}}}\Bigg{]}

Given that with high probability, 14ϵuklk1+4ϵ1-4\epsilon^{\prime}\leq\frac{u_{k}}{l_{k}}\leq 1+4\epsilon^{\prime}. Then for the lower bound,

lkuk+lk11ϵ+11+ϵ1(1(ϵ)2)(1+2ϵ)>1(1+ϵ)(1+2ϵ)>12×1(1+ϵ)2=121(1+mδ/(ηC0))2\displaystyle\frac{l_{k}}{\frac{u_{k}+l_{k}}{\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}}}\geq\frac{1}{(1-(\epsilon^{\prime})^{2})(1+2\epsilon^{\prime})}>\frac{1}{(1+\epsilon^{\prime})(1+2\epsilon^{\prime})}>\frac{1}{2}\times\frac{1}{(1+\epsilon^{\prime})^{2}}=\frac{1}{2}\frac{1}{(1+m\sqrt{\delta}/(\eta C_{0}))^{2}}

and upper bound

uk(14ϵϵrd(1(ϵ)2))uk+lk11ϵ+11+ϵ\displaystyle\frac{u_{k}}{\left(1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}\right)\cdot\frac{u_{k}+l_{k}}{\frac{1}{1-\epsilon^{\prime}}+\frac{1}{1+\epsilon^{\prime}}}}\leq 1+4ϵ(1+2ϵ)(1(ϵ)2)(1+2ϵ)4ϵϵrd\displaystyle\frac{1+4\epsilon^{\prime}}{(1+2\epsilon^{\prime})(1-(\epsilon^{\prime})^{2})-(1+2\epsilon^{\prime})\frac{4\epsilon\epsilon^{\prime}}{r_{d}}}
\displaystyle\leq\frac{4}{3}\times\frac{1+4\epsilon^{\prime}}{(1+2\epsilon^{\prime})(1-(\epsilon^{\prime})^{2})},\;\text{given }\frac{4\epsilon\epsilon^{\prime}}{r_{d}}\leq\frac{1}{4}(1-(\epsilon^{\prime})^{2})
<\displaystyle< 83×11(ϵ)2\displaystyle\frac{8}{3}\times\frac{1}{1-(\epsilon^{\prime})^{2}}
<\displaystyle< 8311(mδ/(ηC0))2.\displaystyle\frac{8}{3}\frac{1}{1-(m\sqrt{\delta}/(\eta C_{0}))^{2}}.

Then with probability larger than 12C1-\frac{2}{C}, we have

λ(AA)[12×1(1+mδηC0)2,83×11(mδηC0)2]\displaystyle\lambda(A^{\top}A)\in\bigg{[}\frac{1}{2}\times\frac{1}{(1+\frac{m\sqrt{\delta}}{\eta C_{0}})^{2}},\frac{8}{3}\times\frac{1}{1-(\frac{m\sqrt{\delta}}{\eta C_{0}})^{2}}\bigg{]}

Consider αj=ϵΦj1mid\alpha_{j}=\frac{\epsilon}{\Phi_{j}}\cdot\frac{1}{\text{mid}},

\displaystyle\Rightarrow\sum_{j=1}^{k}\alpha_{j}=\sum_{j=1}^{k}\frac{\epsilon}{\Phi_{j}}\cdot\frac{1}{\text{mid}}\in\bigg{[}1,\frac{1}{1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}}\bigg{]},\;\text{and}\;\;\frac{1}{1-\frac{4\epsilon\epsilon^{\prime}}{r_{d}(1-(\epsilon^{\prime})^{2})}}\leq\frac{1}{1-\frac{1}{4}}=\frac{4}{3}

Finally, we check sup_{j∈𝒮}(s_j Σ_{i=1}^{r_d}|𝐕_i(j)|²) at the k-th selection:

\begin{align*}
\sup_{j\in\mathcal{S}}\Big(s_{j}\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\Big)
&=\sup_{j\in\mathcal{S}}\Big\{\frac{\epsilon}{\Phi_{k}}\cdot\frac{1}{\text{mid}}\times\frac{\Phi_{k}}{R_{j}}\times\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}\Big\}\\
&=\frac{\epsilon}{\text{mid}}\cdot\sup_{j}\Big\{\frac{\sum_{i=1}^{r_{d}}|\mathbf{V}_{i}(j)|^{2}}{R_{j}}\Big\}\\
&\leq\frac{\epsilon}{\text{mid}}\cdot\frac{1}{\frac{1}{u_{k}-l_{k}}+\frac{1}{u_{k}-l_{k}}}\\
&=\frac{\epsilon}{\text{mid}}\times\frac{u_{k}-l_{k}}{2}\leq\frac{\epsilon}{\text{mid}}\times\frac{9\,r_{d}/\epsilon}{2}=\frac{4.5\,r_{d}}{\text{mid}}\\
&\leq\frac{4.5\,r_{d}}{\frac{2r_{d}(1-(\epsilon^{\prime})^{2})}{\epsilon\epsilon^{\prime}}}=2.25\times\frac{\epsilon\epsilon^{\prime}}{1-(\epsilon^{\prime})^{2}}\leq 2.25\times 4\delta=9\delta\leq 10\delta.
\end{align*}

This completes the proof of Lemma 1.
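As a purely numerical illustration of the scale of the eigenvalue bound above, suppose (hypothetically; this is not a setting used in the paper) that $m\sqrt{\delta}/(\eta C_{0})=0.1$. Then
\[
\lambda(A^{\top}A)\in\Big[\frac{1}{2}\cdot\frac{1}{(1.1)^{2}},\;\frac{8}{3}\cdot\frac{1}{1-0.01}\Big]\approx[0.41,\,2.69],
\]
so the condition number of $A^{\top}A$ is bounded by roughly $2.69/0.41\approx 6.5$ in this case.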

Appendix B More on Numerical Studies

B.1 Experimental setups

Synthetic networks   The parameters for the three network topologies are: the Watts–Strogatz (WS) model ($K=4$, $\beta_{WS}=0.1$) for small-world properties, the stochastic block model (SBM) ($N_{\text{community}}=4$, $P_{\text{in}}=0.35$, $P_{\text{out}}=0.01$) for community structure, and the Barabási–Albert (BA) model ($\alpha=3$) for scale-free properties. We set $n=100$ for all three networks. After generating the networks, we consider them fixed and then simulate $\mathbf{Y}$ and $\mathbf{X}$ repeatedly using 10 different random seeds. By a slight abuse of notation, we set the node responses and covariates for SBM and WS as $\mathbf{Y}=U_{1:10}\beta+\xi$ and $\mathbf{X}=U_{1:10}+MU_{45:54}$, where $M_{ij}\overset{\text{iid}}{\sim}N(0.3,0.1)$ and $\beta=(5,5,\ldots,5)^{T}$ of length 10. For the BA model, we set $\mathbf{Y}=U_{1:15}\beta+\xi$ and $\mathbf{X}=U_{1:15}+MU_{45:59}$, where $M_{ij}\overset{\text{iid}}{\sim}N(0.5,0.2)$ and $\beta=(1,\ldots,1,5,\ldots,5)^{T}$ with five entries equal to 1 followed by ten entries equal to 5.
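Below is a minimal sketch of this data-generating process, assuming that $U$ denotes the eigenvectors of the normalized graph Laplacian ordered by increasing eigenvalue (an interpretation consistent with the graph signal model, not a specification taken from the paper), that $N(a,b)$ is parameterized by mean and variance, and that $\xi$ is i.i.d. Gaussian noise with an illustrative scale; the networkx/numpy calls are hypothetical choices for reproducing the setup.

import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
n = 100

# One of the three topologies (WS shown; SBM and BA are analogous).
G = nx.watts_strogatz_graph(n, k=4, p=0.1)                  # WS: K=4, beta_WS=0.1
# sizes = [25] * 4
# probs = (0.01 + 0.34 * np.eye(4)).tolist()                # SBM: P_in=0.35, P_out=0.01
# G = nx.stochastic_block_model(sizes, probs, seed=0)
# G = nx.barabasi_albert_graph(n, 3)                        # BA (attachment parameter 3)

# Assumed basis: eigenvectors of the normalized Laplacian, ascending eigenvalues.
L = nx.normalized_laplacian_matrix(G).toarray()
_, U = np.linalg.eigh(L)

# SBM/WS setting: smooth response on the first 10 eigenvectors, noisy covariates.
beta = np.full(10, 5.0)
M = rng.normal(0.3, np.sqrt(0.1), size=(n, n))              # assuming N(mean, variance)
xi = rng.normal(0.0, 1.0, size=n)                           # illustrative noise scale
Y = U[:, :10] @ beta + xi
X = U[:, :10] + M @ U[:, 44:54]                             # columns 45-54 (1-indexed)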

Real-world networks   For the proposed method and all baselines, we train a 2-layer SGC model for a fixed 300 epochs. In SGC, the propagation matrix performs low-pass filtering on homophilic networks and high-pass filtering on heterophilic networks. During training, the initial learning rate is set to $10^{-2}$ and the weight decay to $10^{-4}$.
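For concreteness, the following sketch shows one way to set up this training configuration with the SGConv layer from PyTorch Geometric, interpreting the 2-layer SGC as $K=2$ propagation steps. Only the standard low-pass propagation is shown (the high-pass variant for heterophilic networks would modify the propagation matrix), and the data object and mask names are placeholders rather than parts of the paper's pipeline.

import torch
import torch.nn.functional as F
from torch_geometric.nn import SGConv

class SGC(torch.nn.Module):
    # SGC: K-step feature propagation followed by a single linear map.
    def __init__(self, in_dim, num_classes, K=2):
        super().__init__()
        self.conv = SGConv(in_dim, num_classes, K=K, cached=True)

    def forward(self, x, edge_index):
        return self.conv(x, edge_index)

def train_sgc(data, num_classes, epochs=300):
    # data: torch_geometric.data.Data with x, y, edge_index, and a train_mask
    # marking the queried (labeled) nodes.
    model = SGC(data.num_features, num_classes)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-4)
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()
    return model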

Dataset      #Nodes    Type           $m$    $d$
SBM          100       homophilic     50     10
WS           100       homophilic     50     10
BA           100       homophilic     50     15
Cora         2,708     homophilic     2000   200
Pubmed       19,717    homophilic     3000   60
Citeseer     3,327     homophilic     1000   100
Chameleon    2,277     heterophilic   800    30, 30
Texas        183       heterophilic   60     15, 15
Ogbn-Arxiv   169,343   homophilic     1000   120
Co-Physics   34,493    homophilic     3000   150
Table 4: A description of all datasets used in Section 5 and the hyperparameter settings for each dataset. We set $\epsilon=0.001$ for all networks. For heterophilic networks, we combine the eigenvectors corresponding to the $d$ smallest and $d$ largest eigenvalues.
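As a small illustration of the eigenvector selection described in the caption, the helper below (a hypothetical sketch, not code from the paper's repository) returns the $d$ eigenvectors from each end of the Laplacian spectrum for heterophilic networks and only the $d$ smallest-eigenvalue eigenvectors otherwise.

import numpy as np

def spectral_basis(L, d, heterophilic=False):
    # L: dense (normalized) graph Laplacian of shape (n, n).
    # d: number of eigenvectors taken from each end of the spectrum.
    _, eigvecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    low = eigvecs[:, :d]                        # smooth, low-frequency components
    if not heterophilic:
        return low
    high = eigvecs[:, -d:]                      # high-frequency components
    return np.concatenate([low, high], axis=1)  # combine both ends of the spectrum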
Figure 2: (a) Stochastic block model (SBM); (b) Barabási–Albert model (BA). For (a) SBM, nodes are grouped by the assigned community; for (b) BA, nodes are grouped by degree. The integer $i$ on each node represents the $i$th node queried by the proposed algorithm in one replication.

B.2 Visualization

In Figure 2, we visualize the node query process on synthetic networks generated using SBM and BA, as described in Section 5.1. The figure clearly demonstrates that nodes queried by the proposed algorithm adapt to the informativeness criterion specific to each network topology, effectively aligning with the community structure in SBM and the scale-free structure in BA.

B.3 Ablation study

Figure 3: Ablation study: (a) the condition number (log scale) of the design matrix of query nodes selected by the proposed method and by random sampling; the effectiveness of (b) representative sampling and (c) incorporating covariate information in Algorithm 1.

To gain deeper insights into the respective roles of representative sampling and informative selection in the proposed algorithm, we conduct additional experiments on a New Jersey public school social network dataset, School [24], which was originally collected to study the impact of educational workshops on reducing conflicts in schools. Because School is not a benchmark dataset in the active learning literature, we did not compare our method against the other baselines on it in Section 5.2, to ensure fairness. In this dataset with $n=615$ nodes, each node represents an individual student, and edges denote friendships among students. We treat the students' grade point averages (GPA) as the node responses and select $p=5$ student features—grade level, race, and three binary survey responses—as node covariates using a standard forward selection approach.

As shown in Section 3.4, the representative sampling in steps 1 and 2 of Algorithm 1 is essential to control the condition number of the design matrix and, consequently, the prediction error given noisy network data. Figure 3(a) reports the condition number $\frac{\lambda_{\max}(\tilde{X}^{T}_{\mathcal{S}}W_{S}\tilde{X}_{\mathcal{S}})}{\lambda_{\min}(\tilde{X}^{T}_{\mathcal{S}}W_{S}\tilde{X}_{\mathcal{S}})}$ obtained by the proposed method and compares it with that of random selection. With $m=200$, the proposed algorithm achieves a significantly lower condition number than random selection, especially when the number of queries is small. On the Citeseer dataset, we investigate the prediction performance of Algorithm 1 when removing steps 1 and 2, i.e., setting the candidate set $B_{m}=\mathcal{S}^{c}_{t-1}$ for the $t$th selection. Figure 3(b) shows that, with representative sampling, the Macro-F1 score is consistently higher, with a performance gap of up to 15%. Given that node classification on Citeseer is known to be sensitive to labeling noise [38], this result validates the effectiveness of representative sampling in improving the robustness of our query strategy to data noise.
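The condition-number diagnostic in Figure 3(a) can be computed directly from the queried rows of the transformed design matrix. The sketch below is a minimal illustration; X_tilde_S (the selected rows of $\tilde{X}$) and w (the sampling weights on the diagonal of $W_{S}$) are placeholder inputs, not objects from the paper's implementation.

import numpy as np

def condition_number(X_tilde_S, w):
    # X_tilde_S: (k, p) rows of the transformed design matrix for queried nodes.
    # w: (k,) nonnegative sampling weights, i.e., the diagonal of W_S.
    A = X_tilde_S.T @ (w[:, None] * X_tilde_S)   # X_S^T W_S X_S without building W_S
    eigvals = np.linalg.eigvalsh(A)              # ascending eigenvalues
    return eigvals[-1] / eigvals[0]

# Example with random placeholders: k = 20 queried nodes, p = 5 covariates.
rng = np.random.default_rng(0)
kappa = condition_number(rng.normal(size=(20, 5)), rng.uniform(0.5, 1.5, size=20))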

In addition, we examine the ability of the proposed method to integrate node covariates to improve prediction performance. On the School dataset, we compare our method to a variant that removes node covariates during the query stage by setting $\mathbf{X}$ to the identity matrix $\mathbf{I}$. Figure 3(c) shows that the prediction MSE for GPA is significantly lower when node covariates are incorporated, which distinguishes our node query strategy from existing graph signal recovery methods [15] that do not account for node covariate information.

B.4 Code

The implementation code for the proposed algorithm is available at
github.com/Yuanchen-Wu/RobustActiveLearning/.