On the Probabilistic Approximation in
Reproducing Kernel Hilbert Spaces
Abstract.
This paper generalizes the least squares method to probabilistic approximation in reproducing kernel Hilbert spaces. We show the existence and uniqueness of the optimizer. Furthermore, we generalize the celebrated representer theorem to this setting; in particular, when the probability measure is finitely supported, or the Hilbert space is finite-dimensional, we show that the approximation problem turns out to be a measure quantization problem. Some discussions and examples are also given when the space is infinite-dimensional and the measure is infinitely supported.
1. Introduction and Main Results
Let $X$ be a set, $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$, and $\mathcal{F}(X,\mathbb{F})$ the set of functions from $X$ to $\mathbb{F}$. The set $\mathcal{F}(X,\mathbb{F})$ is naturally equipped with a vector space structure over $\mathbb{F}$ by pointwise addition and scalar multiplication:
$$(f+g)(x) = f(x) + g(x), \qquad (\lambda f)(x) = \lambda f(x), \qquad f, g \in \mathcal{F}(X,\mathbb{F}),\ \lambda \in \mathbb{F},\ x \in X.$$
A vector subspace $\mathcal{H} \subseteq \mathcal{F}(X,\mathbb{F})$ is said to be a reproducing kernel Hilbert space (RKHS) on $X$ if
• $\mathcal{H}$ is endowed with a Hilbert space structure $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. Our convention is that the inner product is $\mathbb{F}$-linear in the first argument;
• for every $x \in X$, the linear evaluation functional $E_x : \mathcal{H} \to \mathbb{F}$, defined by $E_x(f) = f(x)$, is bounded.
If $\mathcal{H}$ is an RKHS on $X$, then the Riesz representation theorem shows that for each $y \in X$, there exists a unique vector $k_y \in \mathcal{H}$ such that for any $f \in \mathcal{H}$,
$$f(y) = \langle f, k_y \rangle_{\mathcal{H}}.$$
The function $k_y$ is called the reproducing kernel for the point $y$, and the function $K : X \times X \to \mathbb{F}$ defined by $K(x,y) = k_y(x)$ is called the reproducing kernel for $\mathcal{H}$. One can check that $K$ is indeed a kernel function, meaning that for any $n \in \mathbb{N}$ and any distinct points $x_1, \ldots, x_n \in X$, the matrix $(K(x_i,x_j))_{i,j=1}^n$ is symmetric (Hermitian) and positive semidefinite. It is well known that there is a one-to-one correspondence between RKHSs on $X$ and kernel functions on $X$: by Moore's theorem [5], if $K$ is a kernel function, then there exists a unique RKHS on $X$ such that $K$ is its reproducing kernel. We let $\mathcal{H}_K$ denote the unique RKHS with the reproducing kernel $K$, and define the feature map $\Phi : X \to \mathcal{H}_K$ by $\Phi(x) = k_x$. We refer to [1, 2, 4, 6, 7, 8] for more details on RKHSs and their applications.
One of the interesting topics on RKHSs is interpolation. Let $\mathcal{H}$ be an RKHS on $X$, $F = \{x_1, \ldots, x_n\}$ a finite set of distinct points in $X$, and $c_1, \ldots, c_n \in \mathbb{F}$. If the matrix $\mathbb{K} = (K(x_i,x_j))_{i,j=1}^n$ is invertible, then there exists $f \in \mathcal{H}$ such that $f(x_i) = c_i$ for all $1 \le i \le n$. However, if $\mathbb{K}$ is not invertible, such an $f$ may not exist. In this case, one is often interested in finding the best approximation in $\mathcal{H}$ to minimize the least squares error:
$$\inf_{f \in \mathcal{H}} \ \sum_{i=1}^n |f(x_i) - c_i|^2.$$
The theorem below shows the existence of the optimizer and describes its structure:
Theorem 1.1 (Theorem 3.8 in [6]).
Let $\mathcal{H}$ be an RKHS on $X$, $F = \{x_1, \ldots, x_n\}$ a finite set of distinct points in $X$, and $c_1, \ldots, c_n \in \mathbb{F}$. Let $\mathbb{K} = (K(x_i,x_j))_{i,j=1}^n$, $c = (c_1, \ldots, c_n)^T$, and $N$ the null space of $\mathbb{K}$. Then there exists $\beta = (\beta_1, \ldots, \beta_n)^T$ with $\mathbb{K}\beta = Pc$, where $P$ denotes the orthogonal projection of $\mathbb{F}^n$ onto $N^\perp = \operatorname{ran}\mathbb{K}$. If we let
$$f^* = \sum_{i=1}^n \beta_i k_{x_i},$$
then $f^*$ minimizes the least squares error. Besides, among all such minimizers in $\mathcal{H}$, $f^*$ is the unique function with the minimum norm.
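To make Theorem 1.1 concrete, here is a minimal numerical sketch (our own illustration, not taken from [6]). It uses the linear kernel $K(x,y) = xy$ on $\mathbb{R}$, whose Gram matrix on several distinct points is singular, and the Moore–Penrose pseudoinverse to solve $\mathbb{K}\beta = Pc$; the kernel and the data are illustrative assumptions.

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = x * y is a simple kernel whose Gram matrix has rank one,
    # so exact interpolation is generally impossible and Theorem 1.1 applies.
    return x * y

xs = np.array([1.0, 2.0, 3.0])      # distinct points x_1, ..., x_n
cs = np.array([1.0, 1.0, 2.0])      # target values, not of the form (a * x_i)

K = linear_kernel(xs[:, None], xs[None, :])   # Gram matrix (K(x_i, x_j)), singular here
beta = np.linalg.pinv(K) @ cs                 # solves K beta = P c (P = projection onto ran K)

def f_star(t):
    # f*(t) = sum_i beta_i K(t, x_i) is the minimum-norm minimizer of sum_i |f(x_i) - c_i|^2
    return linear_kernel(t, xs) @ beta

print("fitted values :", K @ beta)            # the projection of c onto ran(K)
print("squared error :", np.sum((K @ beta - cs) ** 2))
print("RKHS norm^2   :", beta @ K @ beta)
print("f*(2.5)       :", f_star(2.5))
```

Any solution $\beta$ of $\mathbb{K}\beta = Pc$ yields the same function $f^*$, so the particular solution chosen by the pseudoinverse is immaterial.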
Now, let $\mathcal{P}(X)$ be the set of probability measures on $X$ and $\mu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i} \in \mathcal{P}(X)$. Then the above least squares problem is equivalent to
$$\inf_{f \in \mathcal{H}} \int_X |f(x) - g(x)|^2 \, d\mu(x),$$
where $g : X \to \mathbb{F}$ is any given function with $g(x_i) = c_i$ for all $1 \le i \le n$. This inspires us to replace $\mu$ with an arbitrary probability measure and consider the probabilistic approximation problem in the RKHS.
The general formulation is as follows. Throughout this paper, we assume that $X$ is a Polish space, and all functions and measures considered are Borel measurable. Let $g : X \to \mathbb{F}$ be a given function, $\mu \in \mathcal{P}(X)$, and $c : \mathbb{F} \times \mathbb{F} \to [0,\infty)$ a nonnegative cost function. We then consider the following minimization problem:
$$\inf_{f \in \mathcal{H}_K} \int_X c(f(x), g(x)) \, d\mu(x). \tag{1.1}$$
Our first result is about the case when the cost function comes from the $L^p$ norm, and the feature map $\{\Phi(x)\}_{x \in X}$ is a continuous $p$-frame for $\mathcal{H}_K$ with respect to $\mu$. (A set of vectors $\{h_x\}_{x \in X}$ in $\mathcal{H}$ is a continuous $p$-frame for $\mathcal{H}$ with respect to $\mu$ if there exist constants $0 < A \le B < \infty$ such that for any $f \in \mathcal{H}$, $A\|f\|_{\mathcal{H}} \le \big(\int_X |\langle f, h_x\rangle_{\mathcal{H}}|^p \, d\mu(x)\big)^{1/p} \le B\|f\|_{\mathcal{H}}$.)
Theorem 1.2.
Let $\mathcal{H}_K$ be an RKHS on $X$ with the feature map $\Phi$. Let $\mu \in \mathcal{P}(X)$ and $g \in L^p(X,\mu)$, where $1 \le p < \infty$. Assume that $\{\Phi(x)\}_{x \in X}$ is a continuous $p$-frame for $\mathcal{H}_K$ with respect to $\mu$. Then the following problem
$$\inf_{f \in \mathcal{H}_K} \int_X |f(x) - g(x)|^p \, d\mu(x)$$
admits an optimizer $f^* \in \mathcal{H}_K$. Furthermore, if $p > 1$, the optimizer is unique.
Note that the continuous $p$-frame condition is the same as saying that the $L^p(\mu)$ norm is equivalent to the Hilbert space norm on $\mathcal{H}_K$, as in the following inequality:
$$A\|f\|_{\mathcal{H}_K} \le \Big(\int_X |f(x)|^p \, d\mu(x)\Big)^{1/p} \le B\|f\|_{\mathcal{H}_K}, \qquad f \in \mathcal{H}_K,$$
for some $0 < A \le B < \infty$; indeed, $\langle f, \Phi(x)\rangle_{\mathcal{H}_K} = f(x)$ by the reproducing property. Thus we can rewrite Theorem 1.2 as the following:
Corollary 1.3.
Let $\mathcal{H}_K$ be an RKHS on $X$, $\mu \in \mathcal{P}(X)$, and $g \in L^p(X,\mu)$, where $1 \le p < \infty$. If $\mathcal{H}_K$ is a (closed) subspace of $L^p(X,\mu)$, and the norm induced by the inner product is equivalent to the $L^p(\mu)$ norm, then the following problem
$$\inf_{f \in \mathcal{H}_K} \|f - g\|_{L^p(\mu)}^p$$
admits an optimizer $f^* \in \mathcal{H}_K$. Furthermore, if $p > 1$, the optimizer is unique.
In the special case where $p = 2$ and $\mathcal{H}_K$ is a Hilbert subspace of $L^2(X,\mu)$, such a unique closest vector is classically given by the orthogonal projection of $g$ onto $\mathcal{H}_K$. Although $L^p(X,\mu)$ is not a Hilbert space for general $p$, our corollary (under the stated assumptions) still provides a unique optimizer in the probabilistic approximation sense, which can be viewed as a "projection" onto the given subspace.
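For $p = 2$ and a finite set $X$, the probabilistic approximation is exactly a weighted least squares projection. The following sketch is our own illustration under assumed data: the subspace $\mathcal{H}$ is spanned by the columns of a matrix `B`, and the optimizer is computed as the $L^2(\mu)$-orthogonal projection of $g$ onto that subspace.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50
xs = np.linspace(0, 1, m)
mu = rng.uniform(0.5, 1.5, m)
mu /= mu.sum()                                  # a probability measure on a finite set X

# a three-dimensional subspace H of L^2(X, mu), spanned by the columns of B
B = np.c_[np.ones(m), xs, np.sin(2 * np.pi * xs)]
g = np.exp(xs)                                  # the function to be approximated

# p = 2: the optimizer is the orthogonal projection of g onto H in L^2(mu),
# obtained from the weighted normal equations
W = np.diag(mu)
coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ g)
f_star = B @ coef

print("weighted squared error:", mu @ (f_star - g) ** 2)
# the residual is L^2(mu)-orthogonal to the subspace, confirming the projection picture
print("orthogonality check   :", B.T @ W @ (f_star - g))
```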
Our next result involves adding an extra regularization term to the minimization problem. In statistical regression and machine learning, a regularization term is often added to perform variable selection, enhance the prediction accuracy, and prevent overfitting; classical examples of this practice are ridge regression [3] and Lasso regression [10]. In this paper, we consider the following minimization problem with regularization:
$$\inf_{f \in \mathcal{H}_K} \int_X c(f(x), g(x)) \, d\mu(x) + \lambda \|f\|_{\mathcal{H}_K}^2, \tag{1.2}$$
where $\lambda > 0$ is the regularization parameter.
We show the existence and uniqueness of the optimizer in the following theorem:
Theorem 1.4.
Let $\mathcal{H}_K$ be an RKHS on $X$. Let $\mu \in \mathcal{P}(X)$ and $\lambda > 0$. Let a function $g : X \to \mathbb{F}$ and a cost function $c : \mathbb{F} \times \mathbb{F} \to [0,\infty)$ be given such that
• $\int_X c(0, g(x)) \, d\mu(x) < \infty$;
• for any given $w \in \mathbb{F}$, the map $z \mapsto c(z, w)$ is lower semicontinuous.
Then Problem (1.2) admits an optimizer $f^* \in \mathcal{H}_K$. Furthermore, when $z \mapsto c(z, w)$ is convex for any given $w \in \mathbb{F}$, the optimizer is unique.
The theorems in this section will be proved in the last part of this paper. In the following sections, we will show some representer-type theorems describing the optimizer, mainly that from Theorem 1.4.
2. Probabilistic Representer Theorem
Beginning with the work of Wahba [11] and later generalized by Schölkopf, Herbrich, and Smola [9], the celebrated representer theorem (Theorem 2.1 below) states that the optimizer in an RKHS minimizing a regularized loss function lies in the linear span of the kernel functions at the given points.
Theorem 2.1 (Theorems 8.7 and 8.8 in [6]).
Let $\mathcal{H}$ be an RKHS on $X$, $F = \{x_1, \ldots, x_n\}$ a finite set of distinct points in $X$, and $c_1, \ldots, c_n \in \mathbb{F}$. Let $L : \mathbb{F} \times \mathbb{F} \to [0,\infty)$ be a convex loss function and consider the minimization problem
$$\inf_{f \in \mathcal{H}} \ \sum_{i=1}^n L(f(x_i), c_i) + \|f\|_{\mathcal{H}}^2.$$
Then the optimizer of this problem exists and is unique. Furthermore, the optimizer is in the linear span of the functions $k_{x_1}, \ldots, k_{x_n}$.
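As a concrete finite-sample instance of the representer theorem, the sketch below performs kernel ridge regression with the squared loss, for which the representer coefficients have the closed form $\alpha = (\mathbb{K} + \lambda I)^{-1} c$. The Gaussian kernel, the data, and the value of $\lambda$ are illustrative assumptions, not part of the theorem.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    # K(x, y) = exp(-(x - y)^2 / (2 sigma^2)), a positive definite kernel on R
    return np.exp(-((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0, 1, 20))                       # sample points x_1, ..., x_n
cs = np.sin(2 * np.pi * xs) + 0.1 * rng.normal(size=20)   # noisy target values

lam = 1e-2
K = gaussian_kernel(xs[:, None], xs[None, :])             # Gram matrix (K(x_i, x_j))
alpha = np.linalg.solve(K + lam * np.eye(len(xs)), cs)    # representer coefficients

def f_hat(t):
    # By the representer theorem the minimizer is f = sum_i alpha_i k_{x_i},
    # so evaluating f reduces to a finite sum over the sample points.
    return gaussian_kernel(np.atleast_1d(t)[:, None], xs[None, :]) @ alpha

grid = np.linspace(0, 1, 5)
print(np.c_[grid, f_hat(grid)])
```

Evaluating the fitted function anywhere requires only the $n$ coefficients $\alpha_i$ and the kernel, which is the finite-dimensional reduction referred to in the next paragraph.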
The representer theorem is useful in practice since it turns the minimization over $\mathcal{H}$ into a finite-dimensional optimization problem. It is natural to ask whether there is a corresponding version of the representer theorem in the probabilistic approximation setting introduced in the previous section. We confirm this speculation with the following representer theorem, stated in measure representation form.
Theorem 2.2 (Probabilistic Representer).
Let $f^*$ be the unique optimizer in Theorem 1.4. Assume the probability measure $\mu$ satisfies that for any $\mathbb{F}$-measure $\nu$ on $X$ (i.e., a finite signed or complex measure, according to whether $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$) with $\operatorname{supp}\nu \subseteq \operatorname{supp}\mu$, the following holds:
$$\int_X \sqrt{K(x,x)} \, d|\nu|(x) < \infty. \tag{2.1}$$
Then there exists a sequence of $\mathbb{F}$-measures $\nu_n$ on $X$ with $\operatorname{supp}\nu_n \subseteq \operatorname{supp}\mu$ such that
$$f^* = \lim_{n \to \infty} \int_X \Phi(x) \, d\nu_n(x) \quad \text{in } \mathcal{H}_K.$$
Furthermore, if $\mu$ is finitely supported, or $\mathcal{H}_K$ is finite-dimensional, then there exists an $\mathbb{F}$-measure $\nu$ on $X$ with $\operatorname{supp}\nu \subseteq \operatorname{supp}\mu$ such that
$$f^* = \int_X \Phi(x) \, d\nu(x).$$
The finiteness condition (2.1) in Theorem 2.2 holds when $\mu$ is finitely supported, or when $X$ is compact and $K$ is continuous. Also, condition (2.1) holds for the Hardy space $H^2(\mathbb{D})$ when the support of $\mu$ is a compact subset of the open unit disk $\mathbb{D}$, as we will see in Example 3.2 later.
Proof.
Define
$$S = \Big\{ \int_X \Phi(x) \, d\nu(x) : \nu \text{ is an } \mathbb{F}\text{-measure on } X \text{ with } \operatorname{supp}\nu \subseteq \operatorname{supp}\mu \Big\},$$
and let $\overline{S}$ be the closure of $S$ in $\mathcal{H}_K$. The integral above is defined via duality, and we check that for any $f \in \mathcal{H}_K$,
$$\Big| \int_X \langle f, \Phi(x)\rangle_{\mathcal{H}_K} \, d\nu(x) \Big| = \Big| \int_X f(x) \, d\nu(x) \Big| \le \int_X |f(x)| \, d|\nu|(x) \le \|f\|_{\mathcal{H}_K} \int_X \sqrt{K(x,x)} \, d|\nu|(x),$$
where $|\nu|$ is the variation measure of $\nu$. By Assumption (2.1), we see that $f \mapsto \int_X f(x) \, d\nu(x)$ defines a bounded linear functional on $\mathcal{H}_K$, and hence $\int_X \Phi(x) \, d\nu(x)$ is a well-defined element of $\mathcal{H}_K$.
It is easy to see that $\overline{S}$ is a closed subspace of $\mathcal{H}_K$. Let $f^*_S$ be the orthogonal projection of $f^*$ onto $\overline{S}$. Note that $\Phi(x) \in S$ for any $x \in \operatorname{supp}\mu$, by taking $\nu$ as the delta measure at $x$. Therefore, for any $x \in \operatorname{supp}\mu$,
$$f^*_S(x) = \langle f^*_S, \Phi(x)\rangle_{\mathcal{H}_K} = \langle f^*, \Phi(x)\rangle_{\mathcal{H}_K} = f^*(x).$$
Therefore, since the orthogonal projection does not increase the norm,
$$\int_X c(f^*_S(x), g(x)) \, d\mu(x) + \lambda\|f^*_S\|^2_{\mathcal{H}_K} \le \int_X c(f^*(x), g(x)) \, d\mu(x) + \lambda\|f^*\|^2_{\mathcal{H}_K}.$$
Hence $f^*_S$ is also an optimizer. Since the optimizer is unique, we conclude $f^* = f^*_S$ and thus $f^* \in \overline{S}$. Therefore, there exists a sequence of $\mathbb{F}$-measures $\nu_n$ on $X$ with $\operatorname{supp}\nu_n \subseteq \operatorname{supp}\mu$ for each $n$, such that
$$f^* = \lim_{n \to \infty} \int_X \Phi(x) \, d\nu_n(x) \quad \text{in } \mathcal{H}_K.$$
If $\mu$ is finitely supported or $\mathcal{H}_K$ is finite-dimensional, then the set $S$ is automatically closed and $S = \overline{S}$. Thus there exists an $\mathbb{F}$-measure $\nu$ on $X$ with $\operatorname{supp}\nu \subseteq \operatorname{supp}\mu$ such that
$$f^* = \int_X \Phi(x) \, d\nu(x).$$
∎
We can furthermore restrict to $\mathbb{F}$-measures on $X$ that are finitely supported (and supported in $\operatorname{supp}\mu$); the set $S$ then becomes the linear span of $\{\Phi(x) : x \in \operatorname{supp}\mu\}$. This leads to the following corollary:
Corollary 2.3 (Discrete Probabilistic Representer).
Let $f^*$ be the unique optimizer in Theorem 1.4. Then $f^*$ is in the closure of the linear span of $\{\Phi(x) : x \in \operatorname{supp}\mu\}$. Furthermore, if $\mu$ is finitely supported, or $\mathcal{H}_K$ is finite-dimensional, then $f^*$ is in the linear span of $\{\Phi(x) : x \in \operatorname{supp}\mu\}$.
Proof.
Here, we use the following definition of $S$:
$$S = \operatorname{span}\{\Phi(x) : x \in \operatorname{supp}\mu\},$$
and let $\overline{S}$ be the closure of $S$ in $\mathcal{H}_K$. Then $\overline{S}$ is a closed linear subspace of $\mathcal{H}_K$. Using the same arguments as in the proof of Theorem 2.2, we conclude $f^* \in \overline{S}$. If $\mu$ is finitely supported or $\mathcal{H}_K$ is finite-dimensional, the set $S$ is automatically closed and thus $f^* \in S$. ∎
Note that when $\mu$ is finitely supported, Corollary 2.3 recovers the celebrated representer theorem (Theorem 2.1). On the other hand, when $\mathcal{H}_K$ is finite-dimensional, the unique optimizer is again in the linear span of $\{\Phi(x) : x \in \operatorname{supp}\mu\}$, regardless of the cardinality of the support of the measure $\mu$. Both cases indicate that the probabilistic approximation problem turns out to be a measure quantization problem for the measure $\mu$ with respect to the loss function in (1.2).
Remark 2.4.
When the cost function $c$ and the given function $g$ satisfy the assumptions in Theorem 1.4, the existence and uniqueness still hold for the following more general problem:
$$\inf_{f \in \mathcal{H}_K} \int_X c(f(x), g(x)) \, d\mu(x) + R\big(\|f\|_{\mathcal{H}_K}\big), \tag{2.2}$$
where $R : [0,\infty) \to [0,\infty)$ can be any strictly convex function. Furthermore, when $R$ is increasing and strictly convex, the probabilistic representer theorems (Theorem 2.2 and Corollary 2.3) also hold.
3. Discussions on the Representer Theorem
The preferable conclusion in the probabilistic representer theorem (Theorem 2.2) is that the minimizer can be represented directly by an $\mathbb{F}$-measure $\nu$, instead of only by an approximating sequence $(\nu_n)_{n \ge 1}$. We conjecture that this nicer form holds merely under Assumption (2.1):
Conjecture 3.1.
Let $f^*$ be the unique optimizer in Theorem 1.4. Assume the probability measure $\mu$ satisfies that for any $\mathbb{F}$-measure $\nu$ on $X$ with $\operatorname{supp}\nu \subseteq \operatorname{supp}\mu$, the following holds:
$$\int_X \sqrt{K(x,x)} \, d|\nu|(x) < \infty.$$
Then there exists an $\mathbb{F}$-measure $\nu$ on $X$ with $\operatorname{supp}\nu \subseteq \operatorname{supp}\mu$ such that
$$f^* = \int_X \Phi(x) \, d\nu(x).$$
To support this conjecture, as well as to illustrate Theorem 2.2, we provide the following example:
Example 3.2 (Measure Representation).
Consider the Hardy space $H^2(\mathbb{D})$ on the unit disk $\mathbb{D}$, whose reproducing kernel is the Szegő kernel $K(z,w) = (1 - \overline{w}z)^{-1}$. Fix $0 < r < 1$ and let $D_r$ be the open disk in $\mathbb{C}$ centered at the origin of radius $r$. Let $\mu$ be the uniform probability measure on $D_r$. First we check Assumption (2.1) in this setting: for any $\mathbb{F}$-measure $\nu$ with $\operatorname{supp}\nu \subseteq \overline{D_r}$, we have
$$\int \sqrt{K(w,w)} \, d|\nu|(w) = \int_{\overline{D_r}} \frac{d|\nu|(w)}{\sqrt{1 - |w|^2}} \le \frac{\|\nu\|_{TV}}{\sqrt{1 - r^2}} < \infty,$$
where $\|\nu\|_{TV}$ is the total variation of $\nu$. Now let $g \in A^2(D_r)$, the Bergman space consisting of holomorphic functions on $D_r$ that are square-integrable with respect to Lebesgue measure, and fix $\lambda > 0$. Consider the minimization problem:
$$\inf_{f \in H^2(\mathbb{D})} \int_{D_r} |f(w) - g(w)|^2 \, d\mu(w) + \lambda \|f\|^2_{H^2(\mathbb{D})}.$$
If we use the power series expressions $f(z) = \sum_{n \ge 0} a_n z^n$ and $g(z) = \sum_{n \ge 0} b_n z^n$, then, noting that $\int_{D_r} z^n \overline{z}^{\,m} \, d\mu(z) = \delta_{nm}\, r^{2n}/(n+1)$, we can apply variations on the coefficients to get an Euler–Lagrange equation for the minimizer $f^* = \sum_{n \ge 0} a_n z^n$:
$$a_n = \frac{r^{2n}}{r^{2n} + \lambda(n+1)} \, b_n, \qquad n \ge 0.$$
Thus $f^*$ is determined by $g$ via this formula, and our goal is to find a measure representation of $f^*$, as in Conjecture 3.1.
Our strategy is to first find the measure representation for the basis vectors $z^n$ of $H^2(\mathbb{D})$. Computation gives
$$\int_{D_r} k_w(z)\, w^n \, d\mu(w) = \sum_{m \ge 0} z^m \int_{D_r} \overline{w}^{\,m} w^n \, d\mu(w) = \frac{r^{2n}}{n+1}\, z^n.$$
That is, $z^n$ can be represented by the $\mathbb{F}$-measure
$$d\nu_n(w) = \frac{n+1}{r^{2n}}\, w^n \, d\mu(w).$$
From $f^* = \sum_{n \ge 0} a_n z^n$, we would imagine that $f^*$ is represented by the measure $\nu = \sum_{n \ge 0} a_n \nu_n$, i.e., $d\nu(w) = \sum_{n \ge 0} a_n \frac{n+1}{r^{2n}} w^n \, d\mu(w)$. Indeed this is a well-defined $\mathbb{F}$-measure since, by the Cauchy–Schwarz inequality, we have
$$\|\nu\|_{TV} = \int_{D_r} \Big| \sum_{n \ge 0} a_n \frac{n+1}{r^{2n}} w^n \Big| \, d\mu(w) \le \Big( \sum_{n \ge 0} |a_n|^2 \frac{n+1}{r^{2n}} \Big)^{1/2} \le \frac{1}{\lambda} \Big( \sum_{n \ge 0} |b_n|^2 \frac{r^{2n}}{n+1} \Big)^{1/2},$$
where the first square root is finite since the last one is, and the latter is finite because $g \in A^2(D_r)$.
Now we show precisely that $f^*$ is represented by $\nu$. Let $f^*_N = \sum_{n=0}^N a_n z^n$ be the partial sums of $f^*$ and $\sigma_N = \sum_{n=0}^N a_n \nu_n$ the corresponding partial sums of $\nu$, so that $f^*_N = \int_{D_r} \Phi(w) \, d\sigma_N(w)$. Then $f^*_N \to f^*$ strongly in $H^2(\mathbb{D})$ since $\{z^n\}_{n \ge 0}$ is an orthonormal basis. On the other hand, $\int_{D_r} \Phi(w) \, d\sigma_N(w)$ converges weakly to $\int_{D_r} \Phi(w) \, d\nu(w)$ in $H^2(\mathbb{D})$, since all functions in $H^2(\mathbb{D})$ are bounded on $\overline{D_r}$ and the densities of $\sigma_N$ converge to that of $\nu$ in $L^1(\mu)$. Thus by the uniqueness of the weak limit, we have $f^* = \int_{D_r} \Phi(w) \, d\nu(w)$.
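As a quick numerical sanity check of the monomial representation used above (as reconstructed here), one can discretize the integral $\int_{D_r} k_w(z) \, d\nu_n(w)$ on a polar grid; the grid sizes, radius, and test point below are arbitrary choices.

```python
import numpy as np

r, n, z = 0.6, 3, 0.35 + 0.2j              # radius of D_r, monomial degree, test point in D

# midpoint-rule grid in polar coordinates for the uniform probability measure on D_r:
# d mu(w) = rho d rho d theta / (pi r^2)
n_rho, n_theta = 800, 512
drho, dtheta = r / n_rho, 2.0 * np.pi / n_theta
rho = (np.arange(n_rho) + 0.5) * drho
theta = (np.arange(n_theta) + 0.5) * dtheta
R, T = np.meshgrid(rho, theta, indexing="ij")
w = R * np.exp(1j * T)

szego = 1.0 / (1.0 - np.conj(w) * z)         # Szego kernel k_w(z) of the Hardy space H^2
density = (n + 1) * w ** n / r ** (2 * n)    # density of the candidate measure nu_n w.r.t. mu

estimate = np.sum(szego * density * R) * drho * dtheta / (np.pi * r ** 2)

print("estimate:", estimate)   # should match z**n up to discretization error
print("target  :", z ** n)
```

If the reconstructed identity is correct, the printed estimate agrees with $z^n$ to several digits.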
4. Proof of Theorem 1.2 and Theorem 1.4
Proof of Theorem 1.2.
Since $0 \in \mathcal{H}_K$ and $g \in L^p(X,\mu)$, we have
$$m := \inf_{f \in \mathcal{H}_K} \int_X |f(x) - g(x)|^p \, d\mu(x) \le \int_X |g(x)|^p \, d\mu(x) < \infty,$$
i.e., the problem is bounded. Let $(f_n)_{n \ge 1} \subseteq \mathcal{H}_K$ be a minimizing sequence. Then
$$\int_X |f_n(x) - g(x)|^p \, d\mu(x) \longrightarrow m.$$
Thus there exists $M > 0$ such that for each $n$, $\int_X |f_n(x) - g(x)|^p \, d\mu(x) \le M$. Then, by Minkowski's inequality,
$$\Big( \int_X |f_n(x)|^p \, d\mu(x) \Big)^{1/p} \le \Big( \int_X |f_n(x) - g(x)|^p \, d\mu(x) \Big)^{1/p} + \Big( \int_X |g(x)|^p \, d\mu(x) \Big)^{1/p} \le M^{1/p} + \|g\|_{L^p(\mu)}.$$
On the other hand, since $\{\Phi(x)\}_{x \in X}$ is a continuous $p$-frame for $\mathcal{H}_K$ with respect to $\mu$, for some lower frame bound $A > 0$ we have
$$A \|f_n\|_{\mathcal{H}_K} \le \Big( \int_X |f_n(x)|^p \, d\mu(x) \Big)^{1/p}.$$
Combining these results we get
$$\|f_n\|_{\mathcal{H}_K} \le \frac{1}{A}\big( M^{1/p} + \|g\|_{L^p(\mu)} \big).$$
Thus $(f_n)$ is a bounded sequence in $\mathcal{H}_K$. Then $(f_n)$ has a weakly convergent subsequence $(f_{n_k})$, i.e., there exists $f^* \in \mathcal{H}_K$ such that for any $h \in \mathcal{H}_K$, $\langle f_{n_k}, h\rangle_{\mathcal{H}_K} \to \langle f^*, h\rangle_{\mathcal{H}_K}$. Taking $h = \Phi(x)$, we get $f_{n_k}(x) \to f^*(x)$ for every $x \in X$. Now by Fatou's lemma, we get
$$\int_X |f^*(x) - g(x)|^p \, d\mu(x) \le \liminf_{k \to \infty} \int_X |f_{n_k}(x) - g(x)|^p \, d\mu(x).$$
Using the pointwise convergence and the fact that $(f_{n_k})$ is a minimizing sequence, we obtain
$$\int_X |f^*(x) - g(x)|^p \, d\mu(x) \le m.$$
Therefore $f^*$ is an optimizer.
Next, we show that the optimizer is unique when $p > 1$. Let $f_1$ and $f_2$ be optimizers. Then by Minkowski's inequality, we have
$$\Big( \int_X \Big| \frac{f_1 + f_2}{2} - g \Big|^p \, d\mu \Big)^{1/p} \le \frac12 \Big( \int_X |f_1 - g|^p \, d\mu \Big)^{1/p} + \frac12 \Big( \int_X |f_2 - g|^p \, d\mu \Big)^{1/p} = m^{1/p}.$$
Since $m$ is the infimum and $p > 1$, equality must hold in Minkowski's inequality, and we infer one of the following two cases:
(1) $f_1 - g = 0$ $\mu$-a.e. In this case, we have $m = 0$ and hence also $f_2 - g = 0$ $\mu$-a.e.
(2) $f_2 - g = t(f_1 - g)$ $\mu$-a.e. for some number $t > 0$. It easily follows that $t = 1$ and hence $f_1 = f_2$ $\mu$-a.e.
In either case, $f_1 = f_2$ $\mu$-a.e., and we conclude $f_1 = f_2$ in $\mathcal{H}_K$ by the continuous frame condition. ∎
Proof of Theorem 1.4.
Since $0 \in \mathcal{H}_K$ and $\int_X c(0, g(x)) \, d\mu(x) < \infty$, we have
$$m := \inf_{f \in \mathcal{H}_K} \int_X c(f(x), g(x)) \, d\mu(x) + \lambda \|f\|^2_{\mathcal{H}_K} \le \int_X c(0, g(x)) \, d\mu(x) < \infty.$$
Hence the problem is bounded. Let $(f_n)_{n \ge 1} \subseteq \mathcal{H}_K$ be a minimizing sequence. Then
$$\int_X c(f_n(x), g(x)) \, d\mu(x) + \lambda \|f_n\|^2_{\mathcal{H}_K} \longrightarrow m.$$
Then there exists $M > 0$ such that for each $n$,
$$\lambda \|f_n\|^2_{\mathcal{H}_K} \le \int_X c(f_n(x), g(x)) \, d\mu(x) + \lambda \|f_n\|^2_{\mathcal{H}_K} \le M.$$
Thus $(f_n)$ is a bounded sequence in $\mathcal{H}_K$. Then $(f_n)$ has a weakly convergent subsequence $(f_{n_k})$, i.e., there exists $f^* \in \mathcal{H}_K$ such that for any $h \in \mathcal{H}_K$, $\langle f_{n_k}, h\rangle_{\mathcal{H}_K} \to \langle f^*, h\rangle_{\mathcal{H}_K}$. Taking $h = \Phi(x)$, we get $f_{n_k}(x) \to f^*(x)$ for every $x \in X$. By the lower semicontinuity of $c(\cdot, g(x))$ and Fatou's lemma, we get
$$\int_X c(f^*(x), g(x)) \, d\mu(x) \le \liminf_{k \to \infty} \int_X c(f_{n_k}(x), g(x)) \, d\mu(x).$$
On the other hand, since $(f_{n_k})$ converges to $f^*$ weakly in $\mathcal{H}_K$, we have $\|f^*\|_{\mathcal{H}_K} \le \liminf_{k \to \infty} \|f_{n_k}\|_{\mathcal{H}_K}$. Furthermore, by the superadditivity of the limit inferior, we get
$$\int_X c(f^*(x), g(x)) \, d\mu(x) + \lambda \|f^*\|^2_{\mathcal{H}_K} \le \liminf_{k \to \infty} \Big( \int_X c(f_{n_k}(x), g(x)) \, d\mu(x) + \lambda \|f_{n_k}\|^2_{\mathcal{H}_K} \Big).$$
Combining these results with the fact that $(f_{n_k})$ is a minimizing sequence, we obtain
$$\int_X c(f^*(x), g(x)) \, d\mu(x) + \lambda \|f^*\|^2_{\mathcal{H}_K} \le m.$$
Therefore $f^*$ is an optimizer.
Next, we show the uniqueness when $\lambda > 0$ and $z \mapsto c(z, w)$ is convex for any fixed $w \in \mathbb{F}$. Let $f_1$ and $f_2$ be optimizers attaining the infimum $m$. Since $\frac{f_1 + f_2}{2} \in \mathcal{H}_K$, we have
$$m \le \int_X c\Big( \frac{f_1(x) + f_2(x)}{2}, g(x) \Big) \, d\mu(x) + \lambda \Big\| \frac{f_1 + f_2}{2} \Big\|^2_{\mathcal{H}_K}.$$
Since $c(\cdot, w)$ is convex for any given $w$, we then have
$$\int_X c\Big( \frac{f_1(x) + f_2(x)}{2}, g(x) \Big) \, d\mu(x) \le \frac12 \int_X c(f_1(x), g(x)) \, d\mu(x) + \frac12 \int_X c(f_2(x), g(x)) \, d\mu(x).$$
On the other hand, by the triangle inequality and the fact that $t \mapsto t^2$ is (strictly) convex for $t \ge 0$, we get
$$\Big\| \frac{f_1 + f_2}{2} \Big\|^2_{\mathcal{H}_K} \le \Big( \frac{\|f_1\|_{\mathcal{H}_K} + \|f_2\|_{\mathcal{H}_K}}{2} \Big)^2 \le \frac{\|f_1\|^2_{\mathcal{H}_K} + \|f_2\|^2_{\mathcal{H}_K}}{2}.$$
Adding the last two estimates and comparing with the first display, we see that the equalities above must hold, and we infer $f_1 = t f_2$ for some $t \ge 0$ as well as $\|f_1\|_{\mathcal{H}_K} = \|f_2\|_{\mathcal{H}_K}$ (the case $f_1 = 0$ implies $f_2 = 0$, and vice versa). Therefore $t = 1$ and $f_1 = f_2$. ∎
Acknowledgement
The authors would like to express their gratitude to Qiyu Sun for valuable discussions.
References
- [1] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
- [2] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
- [3] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
- [4] Jonathan H Manton, Pierre-Olivier Amblard, et al. A primer on reproducing kernel Hilbert spaces. Foundations and Trends in Signal Processing, 8(1–2):1–126, 2015.
- [5] E. H. Moore. General Analysis, Part 2: The Fundamental Notions of General Analysis. Edited by R. W. Barnard. Memoirs of the American Philosophical Society, 1, 2013.
- [6] Vern I Paulsen and Mrinal Raghupathi. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge University Press, 2016.
- [7] Sergei Pereverzyev. An introduction to artificial intelligence based on reproducing kernel Hilbert spaces. Springer Nature, 2022.
- [8] Saburou Saitoh, Yoshihiro Sawano, et al. Theory of reproducing kernels and applications. Springer, 2016.
- [9] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International conference on computational learning theory, pages 416–426. Springer, 2001.
- [10] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
- [11] Grace Wahba. Spline models for observational data. SIAM, 1990.