On the Probabilistic Approximation in
Reproducing Kernel Hilbert Spaces
Abstract.
This paper generalizes the least squares method to probabilistic approximation in reproducing kernel Hilbert spaces. We show the existence and uniqueness of the optimizer. Furthermore, we generalize the celebrated representer theorem to this setting; in particular, when the probability measure is finitely supported, or the Hilbert space is finite-dimensional, we show that the approximation problem turns out to be a measure quantization problem. Some discussions and examples are also given when the space is infinite-dimensional and the measure is infinitely supported.
1. Introduction and Main Results
Let $X$ be a set, $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$, and $\mathcal{F}(X,\mathbb{F})$ the set of functions from $X$ to $\mathbb{F}$. The set $\mathcal{F}(X,\mathbb{F})$ is naturally equipped with a vector space structure over $\mathbb{F}$ by pointwise addition and scalar multiplication:
$$(f+g)(x) = f(x) + g(x), \qquad (\lambda f)(x) = \lambda f(x), \qquad f, g \in \mathcal{F}(X,\mathbb{F}),\ \lambda \in \mathbb{F},\ x \in X.$$
A vector subspace $\mathcal{H} \subseteq \mathcal{F}(X,\mathbb{F})$ is said to be a reproducing kernel Hilbert space (RKHS) on $X$ if
• $\mathcal{H}$ is endowed with a Hilbert space structure $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. Our convention is that the inner product is $\mathbb{F}$-linear in the first argument;
• for every $x \in X$, the linear evaluation functional $E_x : \mathcal{H} \to \mathbb{F}$, defined by $E_x(f) = f(x)$, is bounded.
If $\mathcal{H}$ is an RKHS on $X$, then the Riesz representation theorem shows that for each $y \in X$, there exists a unique vector $k_y \in \mathcal{H}$ such that for any $f \in \mathcal{H}$,
$$f(y) = \langle f, k_y \rangle_{\mathcal{H}}.$$
The function $k_y$ is called the reproducing kernel for the point $y$, and the function $K : X \times X \to \mathbb{F}$ defined by $K(x,y) = k_y(x)$ is called the reproducing kernel for $\mathcal{H}$. One can check that $K$ is indeed a kernel function, meaning that for any $n \in \mathbb{N}$ and any distinct points $x_1, \ldots, x_n \in X$, the matrix $(K(x_i,x_j))_{i,j=1}^n$ is symmetric (Hermitian) and positive semidefinite. It is well known that there is a one-to-one correspondence between RKHSs on $X$ and kernel functions on $X$: by Moore's theorem [5], if $K$ is a kernel function, then there exists a unique RKHS on $X$ such that $K$ is its reproducing kernel. We let $\mathcal{H}_K$ denote the unique RKHS with the reproducing kernel $K$, and define the feature map $\Phi : X \to \mathcal{H}_K$ by $\Phi(x) = k_x$. We refer to [1, 2, 4, 6, 7, 8] for more details on RKHSs and their applications.
One of the interesting topics on RKHSs is interpolation. Let $\mathcal{H}$ be an RKHS on $X$, $F = \{x_1, \ldots, x_n\}$ a finite set of distinct points in $X$, and $c_1, \ldots, c_n \in \mathbb{F}$. If the matrix $\mathbb{K} = (K(x_i,x_j))_{i,j=1}^n$ is invertible, then there exists $f \in \mathcal{H}$ such that $f(x_i) = c_i$ for all $1 \le i \le n$. However, if $\mathbb{K}$ is not invertible, such an $f$ may not exist. In this case, one is often interested in finding the best approximation in $\mathcal{H}$ to minimize the least squares error:
$$\inf_{f \in \mathcal{H}} \ \sum_{i=1}^n |f(x_i) - c_i|^2.$$
The theorem below shows the existence of the optimizer and describes its structure:
Theorem 1.1 (Theorem 3.8 in [6]).
Let $\mathcal{H}$ be an RKHS on $X$, $F = \{x_1, \ldots, x_n\}$ a finite set of distinct points in $X$, and $c_1, \ldots, c_n \in \mathbb{F}$. Let $\mathbb{K} = (K(x_i,x_j))_{i,j=1}^n$, $c = (c_1, \ldots, c_n)^T$, and $N$ the null space of $\mathbb{K}$. Then there exists $\beta = (\beta_1, \ldots, \beta_n)^T$ with $\mathbb{K}\beta = Pc$, where $P$ denotes the orthogonal projection of $\mathbb{F}^n$ onto $N^\perp = \operatorname{ran}\mathbb{K}$. If we let
$$f^* = \sum_{i=1}^n \beta_i k_{x_i},$$
then $f^*$ minimizes the least squares error. Besides, among all such minimizers in $\mathcal{H}$, $f^*$ is the unique function with the minimum norm.
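To make Theorem 1.1 concrete, here is a minimal numerical sketch (our own illustration, not taken from [6]). It uses the linear kernel $K(x,y) = xy$ on $\mathbb{R}$, whose Gram matrix on several distinct points is singular, and the Moore–Penrose pseudoinverse to solve $\mathbb{K}\beta = Pc$; the kernel and the data are illustrative assumptions.

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = x * y is a simple kernel whose Gram matrix has rank one,
    # so exact interpolation is generally impossible and Theorem 1.1 applies.
    return x * y

xs = np.array([1.0, 2.0, 3.0])      # distinct points x_1, ..., x_n
cs = np.array([1.0, 1.0, 2.0])      # target values, not of the form (a * x_i)

K = linear_kernel(xs[:, None], xs[None, :])   # Gram matrix (K(x_i, x_j)), singular here
beta = np.linalg.pinv(K) @ cs                 # solves K beta = P c (P = projection onto ran K)

def f_star(t):
    # f*(t) = sum_i beta_i K(t, x_i) is the minimum-norm minimizer of sum_i |f(x_i) - c_i|^2
    return linear_kernel(t, xs) @ beta

print("fitted values :", K @ beta)            # the projection of c onto ran(K)
print("squared error :", np.sum((K @ beta - cs) ** 2))
print("RKHS norm^2   :", beta @ K @ beta)
print("f*(2.5)       :", f_star(2.5))
```

Any solution $\beta$ of $\mathbb{K}\beta = Pc$ yields the same function $f^*$, so the particular solution chosen by the pseudoinverse is immaterial.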
Now, let $\mathcal{P}(X)$ be the set of probability measures on $X$ and $\mu = \frac{1}{n}\sum_{i=1}^n \delta_{x_i} \in \mathcal{P}(X)$. Then the above least squares problem is equivalent to
$$\inf_{f \in \mathcal{H}} \int_X |f(x) - g(x)|^2 \, d\mu(x),$$
where $g : X \to \mathbb{F}$ is any given function with $g(x_i) = c_i$ for all $1 \le i \le n$. This inspires us to replace $\mu$ with an arbitrary probability measure and consider the probabilistic approximation problem in the RKHS.
The general formulation is as follows. Throughout this paper, we assume that $X$ is a Polish space, and all functions and measures considered are Borel measurable. Let $g : X \to \mathbb{F}$ be a given function, $\mu \in \mathcal{P}(X)$, and $c : \mathbb{F} \times \mathbb{F} \to [0,\infty)$ a nonnegative cost function. We then consider the following minimization problem:
$$\inf_{f \in \mathcal{H}_K} \int_X c(f(x), g(x)) \, d\mu(x). \tag{1.1}$$
Our first result is about the case when the cost function comes from the $L^p$ norm, and the feature map $\{\Phi(x)\}_{x \in X}$ is a continuous $p$-frame for $\mathcal{H}_K$ with respect to $\mu$. (A set of vectors $\{h_x\}_{x \in X}$ in $\mathcal{H}$ is a continuous $p$-frame for $\mathcal{H}$ with respect to $\mu$ if there exist constants $0 < A \le B < \infty$ such that for any $f \in \mathcal{H}$, $A\|f\|_{\mathcal{H}} \le \big(\int_X |\langle f, h_x\rangle_{\mathcal{H}}|^p \, d\mu(x)\big)^{1/p} \le B\|f\|_{\mathcal{H}}$.)
Theorem 1.2.
Let $\mathcal{H}_K$ be an RKHS on $X$ with the feature map $\Phi$. Let $\mu \in \mathcal{P}(X)$ and $g \in L^p(X,\mu)$, where $1 \le p < \infty$. Assume that $\{\Phi(x)\}_{x \in X}$ is a continuous $p$-frame for $\mathcal{H}_K$ with respect to $\mu$. Then the following problem
$$\inf_{f \in \mathcal{H}_K} \int_X |f(x) - g(x)|^p \, d\mu(x)$$
admits an optimizer $f^* \in \mathcal{H}_K$. Furthermore, if $p > 1$, the optimizer is unique.
Note that the continuous $p$-frame condition is the same as saying that the $L^p(\mu)$ norm is equivalent to the Hilbert space norm on $\mathcal{H}_K$, as in the following inequality:
$$A\|f\|_{\mathcal{H}_K} \le \Big(\int_X |f(x)|^p \, d\mu(x)\Big)^{1/p} \le B\|f\|_{\mathcal{H}_K}, \qquad f \in \mathcal{H}_K,$$
for some $0 < A \le B < \infty$; indeed, $\langle f, \Phi(x)\rangle_{\mathcal{H}_K} = f(x)$ by the reproducing property. Thus we can rewrite Theorem 1.2 as the following:
Corollary 1.3.
Let $\mathcal{H}_K$ be an RKHS on $X$, $\mu \in \mathcal{P}(X)$, and $g \in L^p(X,\mu)$, where $1 \le p < \infty$. If $\mathcal{H}_K$ is a (closed) subspace of $L^p(X,\mu)$, and the norm induced by the inner product is equivalent to the $L^p(\mu)$ norm, then the following problem
$$\inf_{f \in \mathcal{H}_K} \|f - g\|_{L^p(\mu)}^p$$
admits an optimizer $f^* \in \mathcal{H}_K$. Furthermore, if $p > 1$, the optimizer is unique.
In the special case where $p = 2$ and $\mathcal{H}_K$ is a Hilbert subspace of $L^2(X,\mu)$, such a unique closest vector is classically given by the orthogonal projection of $g$ onto $\mathcal{H}_K$. Although $L^p(X,\mu)$ is not a Hilbert space for general $p$, our corollary (under the stated assumptions) still provides a unique optimizer in the probabilistic approximation sense, which can be viewed as a "projection" onto the given subspace.
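For $p = 2$ and a finite set $X$, the probabilistic approximation is exactly a weighted least squares projection. The following sketch is our own illustration under assumed data: the subspace $\mathcal{H}$ is spanned by the columns of a matrix `B`, and the optimizer is computed as the $L^2(\mu)$-orthogonal projection of $g$ onto that subspace.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 50
xs = np.linspace(0, 1, m)
mu = rng.uniform(0.5, 1.5, m)
mu /= mu.sum()                                  # a probability measure on a finite set X

# a three-dimensional subspace H of L^2(X, mu), spanned by the columns of B
B = np.c_[np.ones(m), xs, np.sin(2 * np.pi * xs)]
g = np.exp(xs)                                  # the function to be approximated

# p = 2: the optimizer is the orthogonal projection of g onto H in L^2(mu),
# obtained from the weighted normal equations
W = np.diag(mu)
coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ g)
f_star = B @ coef

print("weighted squared error:", mu @ (f_star - g) ** 2)
# the residual is L^2(mu)-orthogonal to the subspace, confirming the projection picture
print("orthogonality check   :", B.T @ W @ (f_star - g))
```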
Our next result involves adding an extra regularization term to the minimization problem. In statistical regression and machine learning, a regularization term is often added to perform variable selection, enhance the prediction accuracy, and prevent overfitting; classical examples of this practice are ridge regression [3] and Lasso regression [10]. In this paper, we consider the following minimization problem with regularization:
$$\inf_{f \in \mathcal{H}_K} \int_X c(f(x), g(x)) \, d\mu(x) + \lambda \|f\|_{\mathcal{H}_K}^2, \tag{1.2}$$
where $\lambda > 0$ is the regularization parameter.
We show the existence and uniqueness of the optimizer in the following theorem:
Theorem 1.4.
Let $\mathcal{H}_K$ be an RKHS on $X$. Let $\mu \in \mathcal{P}(X)$ and $\lambda > 0$. Let a function $g : X \to \mathbb{F}$ and a cost function $c : \mathbb{F} \times \mathbb{F} \to [0,\infty)$ be given such that
• $\int_X c(0, g(x)) \, d\mu(x) < \infty$;
• for any given $w \in \mathbb{F}$, the map $z \mapsto c(z, w)$ is lower semicontinuous.
Then Problem (1.2) admits an optimizer $f^* \in \mathcal{H}_K$. Furthermore, when $z \mapsto c(z, w)$ is convex for any given $w \in \mathbb{F}$, the optimizer is unique.
The theorems in this section will be proved in the last part of this paper. In the following sections, we will show some representer-type theorems describing the optimizer, mainly that from Theorem 1.4.
2. Probabilistic Representer Theorem
Beginning with the work of Wahba [11] and later generalized by Schölkopf, Herbrich, and Smola [9], the celebrated representer theorem (Theorem 2.1 below) states that the optimizer in an RKHS minimizing a regularized loss function lies in the linear span of the kernel functions at the given points.
Theorem 2.1 (Theorems 8.7 and 8.8 in [6]).
Let $\mathcal{H}$ be an RKHS on $X$, $F = \{x_1, \ldots, x_n\}$ a finite set of distinct points in $X$, and $c_1, \ldots, c_n \in \mathbb{F}$. Let $L : \mathbb{F} \times \mathbb{F} \to [0,\infty)$ be a convex loss function and consider the minimization problem
$$\inf_{f \in \mathcal{H}} \ \sum_{i=1}^n L(f(x_i), c_i) + \|f\|_{\mathcal{H}}^2.$$
Then the optimizer of this problem exists and is unique. Furthermore, the optimizer is in the linear span of the functions $k_{x_1}, \ldots, k_{x_n}$.
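As a concrete finite-sample instance of the representer theorem, the sketch below performs kernel ridge regression with the squared loss, for which the representer coefficients have the closed form $\alpha = (\mathbb{K} + \lambda I)^{-1} c$. The Gaussian kernel, the data, and the value of $\lambda$ are illustrative assumptions, not part of the theorem.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    # K(x, y) = exp(-(x - y)^2 / (2 sigma^2)), a positive definite kernel on R
    return np.exp(-((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0, 1, 20))                       # sample points x_1, ..., x_n
cs = np.sin(2 * np.pi * xs) + 0.1 * rng.normal(size=20)   # noisy target values

lam = 1e-2
K = gaussian_kernel(xs[:, None], xs[None, :])             # Gram matrix (K(x_i, x_j))
alpha = np.linalg.solve(K + lam * np.eye(len(xs)), cs)    # representer coefficients

def f_hat(t):
    # By the representer theorem the minimizer is f = sum_i alpha_i k_{x_i},
    # so evaluating f reduces to a finite sum over the sample points.
    return gaussian_kernel(np.atleast_1d(t)[:, None], xs[None, :]) @ alpha

grid = np.linspace(0, 1, 5)
print(np.c_[grid, f_hat(grid)])
```

Evaluating the fitted function anywhere requires only the $n$ coefficients $\alpha_i$ and the kernel, which is the finite-dimensional reduction referred to in the next paragraph.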
The representer theorem is useful in practice since it turns the minimization over $\mathcal{H}$ into a finite-dimensional optimization problem. It is natural to ask whether there is a corresponding version of the representer theorem in the probabilistic approximation setting introduced in the previous section. We confirm this speculation with the following representer theorem, stated in measure representation form.
Theorem 2.2 (Probabilistic Representer).
Let $f^*$ be the unique optimizer in Theorem 1.4. Assume the probability measure $\mu$ satisfies that for any $\mathbb{F}$-measure $\nu$ on $X$ (i.e., a finite signed or complex measure, according to whether $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$) with $\operatorname{supp}\nu \subseteq \operatorname{supp}\mu$, the following holds:
$$\int_X \sqrt{K(x,x)} \, d|\nu|(x) < \infty. \tag{2.1}$$
Then there exists a sequence of $\mathbb{F}$-measures $\nu_n$ on $X$ with $\operatorname{supp}\nu_n \subseteq \operatorname{supp}\mu$ such that
$$f^* = \lim_{n \to \infty} \int_X \Phi(x) \, d\nu_n(x) \quad \text{in } \mathcal{H}_K.$$
Furthermore, if $\mu$ is finitely supported, or $\mathcal{H}_K$ is finite-dimensional, then there exists an $\mathbb{F}$-measure $\nu$ on $X$ with $\operatorname{supp}\nu \subseteq \operatorname{supp}\mu$ such that
$$f^* = \int_X \Phi(x) \, d\nu(x).$$
The finiteness condition (2.1) in Theorem 2.2 holds when $\mu$ is finitely supported, or when $X$ is compact and $K$ is continuous. Also, condition (2.1) holds for the Hardy space $H^2(\mathbb{D})$ when the support of $\mu$ is a compact subset of the open unit disk $\mathbb{D}$, as we will see in Example 3.2 later.
Proof.
Define
$$S = \Big\{ \int_X \Phi(x) \, d\nu(x) : \nu \text{ is an } \mathbb{F}\text{-measure on } X \text{ with } \operatorname{supp}\nu \subseteq \operatorname{supp}\mu \Big\},$$
and let $\overline{S}$ be the closure of $S$ in $\mathcal{H}_K$. The integral above is defined via duality, and we check that for any $f \in \mathcal{H}_K$,
$$\Big| \int_X \langle f, \Phi(x)\rangle_{\mathcal{H}_K} \, d\nu(x) \Big| = \Big| \int_X f(x) \, d\nu(x) \Big| \le \int_X |f(x)| \, d|\nu|(x) \le \|f\|_{\mathcal{H}_K} \int_X \sqrt{K(x,x)} \, d|\nu|(x),$$
where $|\nu|$ is the variation measure of $\nu$. By Assumption (2.1), we see that $f \mapsto \int_X f(x) \, d\nu(x)$ defines a bounded linear functional on $\mathcal{H}_K$, and hence $\int_X \Phi(x) \, d\nu(x)$ is a well-defined element of $\mathcal{H}_K$.
It is easy to see that $\overline{S}$ is a closed subspace of $\mathcal{H}_K$. Let $f^*_S$ be the orthogonal projection of $f^*$ onto $\overline{S}$. Note that $\Phi(x) \in S$ for any $x \in \operatorname{supp}\mu$, by taking $\nu$ as the delta measure at $x$. Therefore, for any $x \in \operatorname{supp}\mu$,
$$f^*_S(x) = \langle f^*_S, \Phi(x)\rangle_{\mathcal{H}_K} = \langle f^*, \Phi(x)\rangle_{\mathcal{H}_K} = f^*(x).$$
Therefore, since the orthogonal projection does not increase the norm,
$$\int_X c(f^*_S(x), g(x)) \, d\mu(x) + \lambda\|f^*_S\|^2_{\mathcal{H}_K} \le \int_X c(f^*(x), g(x)) \, d\mu(x) + \lambda\|f^*\|^2_{\mathcal{H}_K}.$$
Hence $f^*_S$ is also an optimizer. Since the optimizer is unique, we conclude $f^* = f^*_S$ and thus $f^* \in \overline{S}$. Therefore, there exists a sequence of $\mathbb{F}$-measures $\nu_n$ on $X$ with $\operatorname{supp}\nu_n \subseteq \operatorname{supp}\mu$ for each $n$, such that
$$f^* = \lim_{n \to \infty} \int_X \Phi(x) \, d\nu_n(x) \quad \text{in } \mathcal{H}_K.$$
If $\mu$ is finitely supported or $\mathcal{H}_K$ is finite-dimensional, then the set $S$ is automatically closed and $S = \overline{S}$. Thus there exists an $\mathbb{F}$-measure $\nu$ on $X$ with $\operatorname{supp}\nu \subseteq \operatorname{supp}\mu$ such that
$$f^* = \int_X \Phi(x) \, d\nu(x).$$
∎
We can furthermore restrict to $\mathbb{F}$-measures on $X$ that are finitely supported (and supported in $\operatorname{supp}\mu$); the set $S$ then becomes the linear span of $\{\Phi(x) : x \in \operatorname{supp}\mu\}$. This leads to the following corollary:
Corollary 2.3 (Discrete Probabilistic Representer).
Let $f^*$ be the unique optimizer in Theorem 1.4. Then $f^*$ is in the closure of the linear span of $\{\Phi(x) : x \in \operatorname{supp}\mu\}$. Furthermore, if $\mu$ is finitely supported, or $\mathcal{H}_K$ is finite-dimensional, then $f^*$ is in the linear span of $\{\Phi(x) : x \in \operatorname{supp}\mu\}$.
Proof.
Here, we use the following definition of $S$:
$$S = \operatorname{span}\{\Phi(x) : x \in \operatorname{supp}\mu\},$$
and let $\overline{S}$ be the closure of $S$ in $\mathcal{H}_K$. Then $\overline{S}$ is a closed linear subspace of $\mathcal{H}_K$. Using the same arguments as in the proof of Theorem 2.2, we conclude $f^* \in \overline{S}$. If $\mu$ is finitely supported or $\mathcal{H}_K$ is finite-dimensional, the set $S$ is automatically closed and thus $f^* \in S$. ∎
Note that when $\mu$ is finitely supported, Corollary 2.3 recovers the celebrated representer theorem (Theorem 2.1). On the other hand, when $\mathcal{H}_K$ is finite-dimensional, the unique optimizer is again in the linear span of $\{\Phi(x) : x \in \operatorname{supp}\mu\}$, regardless of the cardinality of the support of the measure $\mu$. Both cases indicate that the probabilistic approximation problem turns out to be a measure quantization problem for the measure $\mu$ with respect to the loss function in (1.2).
Remark 2.4.
When the cost function $c$ and the given function $g$ satisfy the assumptions in Theorem 1.4, the existence and uniqueness still hold for the following more general problem:
$$\inf_{f \in \mathcal{H}_K} \int_X c(f(x), g(x)) \, d\mu(x) + R\big(\|f\|_{\mathcal{H}_K}\big), \tag{2.2}$$
where $R : [0,\infty) \to [0,\infty)$ can be any strictly convex function. Furthermore, when $R$ is increasing and strictly convex, the probabilistic representer theorems (Theorem 2.2 and Corollary 2.3) also hold.
3. Discussions on the Representer Theorem
The preferable conclusion in the probabilistic representer theorem (Theorem 2.2) is that the minimizer can be represented directly by an $\mathbb{F}$-measure $\nu$, instead of only by an approximating sequence $(\nu_n)_{n \ge 1}$. We conjecture that this nicer form holds merely under Assumption (2.1):
Conjecture 3.1.
Let $f^*$ be the unique optimizer in Theorem 1.4. Assume the probability measure $\mu$ satisfies that for any $\mathbb{F}$-measure $\nu$ on $X$ with $\operatorname{supp}\nu \subseteq \operatorname{supp}\mu$, the following holds:
$$\int_X \sqrt{K(x,x)} \, d|\nu|(x) < \infty.$$
Then there exists an $\mathbb{F}$-measure $\nu$ on $X$ with $\operatorname{supp}\nu \subseteq \operatorname{supp}\mu$ such that
$$f^* = \int_X \Phi(x) \, d\nu(x).$$
To support this conjecture, as well as to illustrate Theorem 2.2, we provide the following example:
Example 3.2 (Measure Representation).
Consider the Hardy space $H^2(\mathbb{D})$ on the unit disk $\mathbb{D}$, whose reproducing kernel is the Szegő kernel $K(z,w) = (1 - \overline{w}z)^{-1}$. Fix $0 < r < 1$ and let $D_r$ be the open disk in $\mathbb{C}$ centered at the origin of radius $r$. Let $\mu$ be the uniform probability measure on $D_r$. First we check Assumption (2.1) in this setting: for any $\mathbb{F}$-measure $\nu$ with $\operatorname{supp}\nu \subseteq \overline{D_r}$, we have
$$\int \sqrt{K(w,w)} \, d|\nu|(w) = \int_{\overline{D_r}} \frac{d|\nu|(w)}{\sqrt{1 - |w|^2}} \le \frac{\|\nu\|_{TV}}{\sqrt{1 - r^2}} < \infty,$$
where $\|\nu\|_{TV}$ is the total variation of $\nu$. Now let $g \in A^2(D_r)$, the Bergman space consisting of holomorphic functions on $D_r$ that are square-integrable with respect to Lebesgue measure, and fix $\lambda > 0$. Consider the minimization problem:
$$\inf_{f \in H^2(\mathbb{D})} \int_{D_r} |f(w) - g(w)|^2 \, d\mu(w) + \lambda \|f\|^2_{H^2(\mathbb{D})}.$$
If we use the power series expressions $f(z) = \sum_{n \ge 0} a_n z^n$ and $g(z) = \sum_{n \ge 0} b_n z^n$, then, noting that $\int_{D_r} z^n \overline{z}^{\,m} \, d\mu(z) = \delta_{nm}\, r^{2n}/(n+1)$, we can apply variations on the coefficients to get an Euler–Lagrange equation for the minimizer $f^* = \sum_{n \ge 0} a_n z^n$:
$$a_n = \frac{r^{2n}}{r^{2n} + \lambda(n+1)} \, b_n, \qquad n \ge 0.$$
Thus $f^*$ is determined by $g$ via this formula, and our goal is to find a measure representation of $f^*$, as in Conjecture 3.1.
Our strategy is to first find the measure representation for the basis vectors $z^n$ of $H^2(\mathbb{D})$. Computation gives
$$\int_{D_r} k_w(z)\, w^n \, d\mu(w) = \sum_{m \ge 0} z^m \int_{D_r} \overline{w}^{\,m} w^n \, d\mu(w) = \frac{r^{2n}}{n+1}\, z^n.$$
That is, $z^n$ can be represented by the $\mathbb{F}$-measure
$$d\nu_n(w) = \frac{n+1}{r^{2n}}\, w^n \, d\mu(w).$$
From $f^* = \sum_{n \ge 0} a_n z^n$, we would imagine that $f^*$ is represented by the measure $\nu = \sum_{n \ge 0} a_n \nu_n$, i.e., $d\nu(w) = \sum_{n \ge 0} a_n \frac{n+1}{r^{2n}} w^n \, d\mu(w)$. Indeed this is a well-defined $\mathbb{F}$-measure since, by the Cauchy–Schwarz inequality, we have
$$\|\nu\|_{TV} = \int_{D_r} \Big| \sum_{n \ge 0} a_n \frac{n+1}{r^{2n}} w^n \Big| \, d\mu(w) \le \Big( \sum_{n \ge 0} |a_n|^2 \frac{n+1}{r^{2n}} \Big)^{1/2} \le \frac{1}{\lambda} \Big( \sum_{n \ge 0} |b_n|^2 \frac{r^{2n}}{n+1} \Big)^{1/2},$$
where the first square root is finite since the last one is, and the latter is finite because $g \in A^2(D_r)$.
Now we show precisely that $f^*$ is represented by $\nu$. Let $f^*_N = \sum_{n=0}^N a_n z^n$ be the partial sums of $f^*$ and $\sigma_N = \sum_{n=0}^N a_n \nu_n$ the corresponding partial sums of $\nu$, so that $f^*_N = \int_{D_r} \Phi(w) \, d\sigma_N(w)$. Then $f^*_N \to f^*$ strongly in $H^2(\mathbb{D})$ since $\{z^n\}_{n \ge 0}$ is an orthonormal basis. On the other hand, $\int_{D_r} \Phi(w) \, d\sigma_N(w)$ converges weakly to $\int_{D_r} \Phi(w) \, d\nu(w)$ in $H^2(\mathbb{D})$, since all functions in $H^2(\mathbb{D})$ are bounded on $\overline{D_r}$ and the densities of $\sigma_N$ converge to that of $\nu$ in $L^1(\mu)$. Thus by the uniqueness of the weak limit, we have $f^* = \int_{D_r} \Phi(w) \, d\nu(w)$.
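As a quick numerical sanity check of the monomial representation used above (as reconstructed here), one can discretize the integral $\int_{D_r} k_w(z) \, d\nu_n(w)$ on a polar grid; the grid sizes, radius, and test point below are arbitrary choices.

```python
import numpy as np

r, n, z = 0.6, 3, 0.35 + 0.2j              # radius of D_r, monomial degree, test point in D

# midpoint-rule grid in polar coordinates for the uniform probability measure on D_r:
# d mu(w) = rho d rho d theta / (pi r^2)
n_rho, n_theta = 800, 512
drho, dtheta = r / n_rho, 2.0 * np.pi / n_theta
rho = (np.arange(n_rho) + 0.5) * drho
theta = (np.arange(n_theta) + 0.5) * dtheta
R, T = np.meshgrid(rho, theta, indexing="ij")
w = R * np.exp(1j * T)

szego = 1.0 / (1.0 - np.conj(w) * z)         # Szego kernel k_w(z) of the Hardy space H^2
density = (n + 1) * w ** n / r ** (2 * n)    # density of the candidate measure nu_n w.r.t. mu

estimate = np.sum(szego * density * R) * drho * dtheta / (np.pi * r ** 2)

print("estimate:", estimate)   # should match z**n up to discretization error
print("target  :", z ** n)
```

If the reconstructed identity is correct, the printed estimate agrees with $z^n$ to several digits.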
4. Proof of Theorem 1.2 and Theorem 1.4
Proof of Theorem 1.2.
Since $0 \in \mathcal{H}_K$ and $g \in L^p(X,\mu)$, we have
$$m := \inf_{f \in \mathcal{H}_K} \int_X |f(x) - g(x)|^p \, d\mu(x) \le \int_X |g(x)|^p \, d\mu(x) < \infty,$$
i.e., the problem is bounded. Let $(f_n)_{n \ge 1} \subseteq \mathcal{H}_K$ be a minimizing sequence. Then
$$\int_X |f_n(x) - g(x)|^p \, d\mu(x) \longrightarrow m.$$
Thus there exists $M > 0$ such that for each $n$, $\int_X |f_n(x) - g(x)|^p \, d\mu(x) \le M$. Then, by Minkowski's inequality,
$$\Big( \int_X |f_n(x)|^p \, d\mu(x) \Big)^{1/p} \le \Big( \int_X |f_n(x) - g(x)|^p \, d\mu(x) \Big)^{1/p} + \Big( \int_X |g(x)|^p \, d\mu(x) \Big)^{1/p} \le M^{1/p} + \|g\|_{L^p(\mu)}.$$
On the other hand, since $\{\Phi(x)\}_{x \in X}$ is a continuous $p$-frame for $\mathcal{H}_K$ with respect to $\mu$, for some lower frame bound $A > 0$ we have
$$A \|f_n\|_{\mathcal{H}_K} \le \Big( \int_X |f_n(x)|^p \, d\mu(x) \Big)^{1/p}.$$
Combining these results we get
$$\|f_n\|_{\mathcal{H}_K} \le \frac{1}{A}\big( M^{1/p} + \|g\|_{L^p(\mu)} \big).$$
Thus $(f_n)$ is a bounded sequence in $\mathcal{H}_K$. Then $(f_n)$ has a weakly convergent subsequence $(f_{n_k})$, i.e., there exists $f^* \in \mathcal{H}_K$ such that for any $h \in \mathcal{H}_K$, $\langle f_{n_k}, h\rangle_{\mathcal{H}_K} \to \langle f^*, h\rangle_{\mathcal{H}_K}$. Taking $h = \Phi(x)$, we get $f_{n_k}(x) \to f^*(x)$ for every $x \in X$. Now by Fatou's lemma, we get
$$\int_X |f^*(x) - g(x)|^p \, d\mu(x) \le \liminf_{k \to \infty} \int_X |f_{n_k}(x) - g(x)|^p \, d\mu(x).$$
Using the pointwise convergence and the fact that $(f_{n_k})$ is a minimizing sequence, we obtain
$$\int_X |f^*(x) - g(x)|^p \, d\mu(x) \le m.$$
Therefore $f^*$ is an optimizer.
Next, we show that the optimizer is unique when $p > 1$. Let $f_1$ and $f_2$ be optimizers. Then by Minkowski's inequality, we have
$$\Big( \int_X \Big| \frac{f_1 + f_2}{2} - g \Big|^p \, d\mu \Big)^{1/p} \le \frac12 \Big( \int_X |f_1 - g|^p \, d\mu \Big)^{1/p} + \frac12 \Big( \int_X |f_2 - g|^p \, d\mu \Big)^{1/p} = m^{1/p}.$$
Since $m$ is the infimum and $p > 1$, equality must hold in Minkowski's inequality, and we infer one of the following two cases:
(1) $f_1 - g = 0$ $\mu$-a.e. In this case, we have $m = 0$ and hence also $f_2 - g = 0$ $\mu$-a.e.
(2) $f_2 - g = t(f_1 - g)$ $\mu$-a.e. for some number $t > 0$. It easily follows that $t = 1$ and hence $f_1 = f_2$ $\mu$-a.e.
In either case, $f_1 = f_2$ $\mu$-a.e., and we conclude $f_1 = f_2$ in $\mathcal{H}_K$ by the continuous frame condition. ∎
Proof of Theorem 1.4.
Since $0 \in \mathcal{H}_K$ and $\int_X c(0, g(x)) \, d\mu(x) < \infty$, we have
$$m := \inf_{f \in \mathcal{H}_K} \int_X c(f(x), g(x)) \, d\mu(x) + \lambda \|f\|^2_{\mathcal{H}_K} \le \int_X c(0, g(x)) \, d\mu(x) < \infty.$$
Hence the problem is bounded. Let $(f_n)_{n \ge 1} \subseteq \mathcal{H}_K$ be a minimizing sequence. Then
$$\int_X c(f_n(x), g(x)) \, d\mu(x) + \lambda \|f_n\|^2_{\mathcal{H}_K} \longrightarrow m.$$
Then there exists $M > 0$ such that for each $n$,
$$\lambda \|f_n\|^2_{\mathcal{H}_K} \le \int_X c(f_n(x), g(x)) \, d\mu(x) + \lambda \|f_n\|^2_{\mathcal{H}_K} \le M.$$
Thus $(f_n)$ is a bounded sequence in $\mathcal{H}_K$. Then $(f_n)$ has a weakly convergent subsequence $(f_{n_k})$, i.e., there exists $f^* \in \mathcal{H}_K$ such that for any $h \in \mathcal{H}_K$, $\langle f_{n_k}, h\rangle_{\mathcal{H}_K} \to \langle f^*, h\rangle_{\mathcal{H}_K}$. Taking $h = \Phi(x)$, we get $f_{n_k}(x) \to f^*(x)$ for every $x \in X$. By the lower semicontinuity of $c(\cdot, g(x))$ and Fatou's lemma, we get
$$\int_X c(f^*(x), g(x)) \, d\mu(x) \le \liminf_{k \to \infty} \int_X c(f_{n_k}(x), g(x)) \, d\mu(x).$$
On the other hand, since $(f_{n_k})$ converges to $f^*$ weakly in $\mathcal{H}_K$, we have $\|f^*\|_{\mathcal{H}_K} \le \liminf_{k \to \infty} \|f_{n_k}\|_{\mathcal{H}_K}$. Furthermore, by the superadditivity of the limit inferior, we get
$$\int_X c(f^*(x), g(x)) \, d\mu(x) + \lambda \|f^*\|^2_{\mathcal{H}_K} \le \liminf_{k \to \infty} \Big( \int_X c(f_{n_k}(x), g(x)) \, d\mu(x) + \lambda \|f_{n_k}\|^2_{\mathcal{H}_K} \Big).$$
Combining these results with the fact that $(f_{n_k})$ is a minimizing sequence, we obtain
$$\int_X c(f^*(x), g(x)) \, d\mu(x) + \lambda \|f^*\|^2_{\mathcal{H}_K} \le m.$$
Therefore $f^*$ is an optimizer.
Next, we show the uniqueness when $\lambda > 0$ and $z \mapsto c(z, w)$ is convex for any fixed $w \in \mathbb{F}$. Let $f_1$ and $f_2$ be optimizers attaining the infimum $m$. Since $\frac{f_1 + f_2}{2} \in \mathcal{H}_K$, we have
$$m \le \int_X c\Big( \frac{f_1(x) + f_2(x)}{2}, g(x) \Big) \, d\mu(x) + \lambda \Big\| \frac{f_1 + f_2}{2} \Big\|^2_{\mathcal{H}_K}.$$
Since $c(\cdot, w)$ is convex for any given $w$, we then have
$$\int_X c\Big( \frac{f_1(x) + f_2(x)}{2}, g(x) \Big) \, d\mu(x) \le \frac12 \int_X c(f_1(x), g(x)) \, d\mu(x) + \frac12 \int_X c(f_2(x), g(x)) \, d\mu(x).$$
On the other hand, by the triangle inequality and the fact that $t \mapsto t^2$ is (strictly) convex for $t \ge 0$, we get
$$\Big\| \frac{f_1 + f_2}{2} \Big\|^2_{\mathcal{H}_K} \le \Big( \frac{\|f_1\|_{\mathcal{H}_K} + \|f_2\|_{\mathcal{H}_K}}{2} \Big)^2 \le \frac{\|f_1\|^2_{\mathcal{H}_K} + \|f_2\|^2_{\mathcal{H}_K}}{2}.$$
Adding the last two estimates and comparing with the first display, we see that the equalities above must hold, and we infer $f_1 = t f_2$ for some $t \ge 0$ as well as $\|f_1\|_{\mathcal{H}_K} = \|f_2\|_{\mathcal{H}_K}$ (the case $f_1 = 0$ implies $f_2 = 0$, and vice versa). Therefore $t = 1$ and $f_1 = f_2$. ∎
Acknowledgement
The authors would like to express their gratitude to Qiyu Sun for valuable discussions.
References
- [1] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
- [2] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
- [3] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
- [4] Jonathan H Manton, Pierre-Olivier Amblard, et al. A primer on reproducing kernel Hilbert spaces. Foundations and Trends in Signal Processing, 8(1–2):1–126, 2015.
- [5] E. H. Moore. General Analysis, Part 2: The Fundamental Notions of General Analysis. Edited by R. W. Barnard. Memoirs of the American Philosophical Society, 1, 2013.
- [6] Vern I Paulsen and Mrinal Raghupathi. An introduction to the theory of reproducing kernel Hilbert spaces, volume 152. Cambridge University Press, 2016.
- [7] Sergei Pereverzyev. An introduction to artificial intelligence based on reproducing kernel Hilbert spaces. Springer Nature, 2022.
- [8] Saburou Saitoh, Yoshihiro Sawano, et al. Theory of reproducing kernels and applications. Springer, 2016.
- [9] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International conference on computational learning theory, pages 416–426. Springer, 2001.
- [10] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.
- [11] Grace Wahba. Spline models for observational data. SIAM, 1990.