
Simple initialization and parametrization of sinusoidal networks via their kernel bandwidth

Filipe de Avila Belbute-Peres
Carnegie Mellon University
Pittsburgh, PA
filiped@cs.cmu.edu
&J. Zico Kolter
Carnegie Mellon University & Bosch Center for AI
Pittsburgh, PA
zkolter@cs.cmu.edu
Abstract

Neural networks with sinusoidal activations have been proposed as an alternative to networks with traditional activation functions. Despite their promise, particularly for learning implicit models, their training behavior is not yet fully understood, leading to a number of empirical design choices that are not well justified. In this work, we first propose a simplified version of such sinusoidal neural networks, which allows both for easier practical implementation and simpler theoretical analysis. We then analyze the behavior of these networks from the neural tangent kernel perspective and demonstrate that their kernel approximates a low-pass filter with an adjustable bandwidth. Finally, we utilize these insights to inform the sinusoidal network initialization, optimizing their performance for each of a series of tasks, including learning implicit models and solving differential equations.

1 Introduction

Sinusoidal networks are neural networks with sine nonlinearities, instead of the traditional ReLU or hyperbolic tangent. They have been recently popularized, particularly for applications in implicit representation models, in the form of SIRENs (Sitzmann et al., 2020). However, despite their popularity, many aspects of their behavior and comparative advantages are not yet fully understood. Particularly, some initialization and parametrization choices for sinusoidal networks are often defined arbitrarily, without a clear understanding of how to optimize these settings in order to maximize performance.

In this paper, we first propose a simplified version of such sinusoidal networks that allows for easier implementation and theoretical analysis. We show that these simple sinusoidal networks can match and outperform SIRENs in implicit representation learning tasks, such as fitting videos, images and audio signals. We then analyze sinusoidal networks from a neural tangent kernel (NTK) perspective (Jacot et al., 2018), demonstrating that their NTK approximates a low-pass filter with adjustable bandwidth. We confirm, through an empirical analysis, that this theoretically predicted behavior also holds approximately in practice. We then use the insights from this analysis to inform the choices of initialization and parameters for sinusoidal networks. We demonstrate that we can optimize the performance of a sinusoidal network by tuning the bandwidth of its kernel to the maximum frequency present in the input signal being learned. Finally, we apply these insights in practice, demonstrating that “well tuned” sinusoidal networks outperform other networks in learning implicit representation models with good interpolation outside the training points, and in learning the solution to differential equations.

2 Background and Related Work

Sinusoidal networks.   Sinusoidal networks have been recently popularized for implicit modelling tasks by sinusoidal representation networks (SIRENs) (Sitzmann et al., 2020). They have also been evaluated in physics-informed learning settings, demonstrating promising results in a series of domains (Raissi et al., 2019b; Song et al., 2021; Huang et al., 2021b; a; Wong et al., 2022). Among the benefits of such networks is the fact that the mapping of the inputs through an (initially) random linear layer followed by a sine function is mathematically equivalent to a transformation to a random Fourier basis, rendering them close to networks with Fourier feature transformations (Tancik et al., 2020; Rahimi & Recht, 2007), and possibly able to address spectral bias (Basri et al., 2019; Rahaman et al., 2019; Wang et al., 2021). Sinusoidal networks also have the property that the derivative of their outputs is given simply by another sinusoidal network, due to the fact that the derivative of the sine function is simply a phase-shifted sine.

Neural tangent kernel.   An important prior result to the neural tangent kernel (NTK) is the neural network Gaussian process (NNGP). At random initialization of its parameters $\theta$, the output function of a neural network of depth $L$ with nonlinearity $\sigma$ converges to a Gaussian process, called the NNGP, as the width of its layers $n_{1},\dots,n_{L}\rightarrow\infty$ (Neal, 1994; Lee et al., 2018). This result, though interesting, does not say much on its own about the behavior of trained neural networks. This role is left to the NTK, which is defined as the kernel given by $\Theta(x,\tilde{x})=\left\langle\nabla_{\theta}f_{\theta}(x),\nabla_{\theta}f_{\theta}(\tilde{x})\right\rangle$. It can be shown that this kernel can be written out as a recursive expression involving the NNGP. Importantly, Jacot et al. (2018) demonstrated that, again as the network layer widths $n_{1},\dots,n_{L}\rightarrow\infty$, the NTK is (1) deterministic at initialization and (2) constant throughout training. Finally, it has also been demonstrated that under some assumptions on its parametrization, the output function of the trained neural network $f_{\theta}$ converges to the kernel regression solution using the NTK (Lee et al., 2020; Arora et al., 2019). In other words, under certain assumptions the behavior of a trained deep neural network can be modeled as kernel regression using the NTK.

Physics-informed neural networks.   Physics-informed neural networks (Raissi et al., 2019a) are a method for approximating the solution to differential equations using neural networks (NNs). In this method, a neural network $\hat{u}(t,x;\theta)$, with learned parameters $\theta$, is trained to approximate the actual solution function $u(t,x)$ to a given partial differential equation (PDE). Importantly, PINNs employ not only a standard “supervised” data loss, but also a physics-informed loss, which consists of the differential equation residual $\mathcal{N}$. Thus, the training loss consists of a linear combination of two loss terms, one directly supervised from data and one informed by the underlying differential equations.
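To make the structure of this objective concrete, the sketch below shows a minimal version of the composite PINN loss in PyTorch. The function and argument names (e.g., `residual_fn`, `weight`) are illustrative assumptions on our part, not part of the original PINN formulation; the residual function is assumed to compute the PDE residual at collocation points via automatic differentiation.

```python
import torch

def pinn_loss(model, residual_fn, x_data, u_data, x_collocation, weight=1.0):
    """Composite PINN objective (sketch): a supervised data term plus a
    physics-informed term penalizing the PDE residual at collocation points."""
    # Supervised loss on observed solution samples.
    data_loss = ((model(x_data) - u_data) ** 2).mean()
    # Physics-informed loss: mean squared PDE residual at collocation points.
    physics_loss = (residual_fn(model, x_collocation) ** 2).mean()
    # Linear combination of the two terms.
    return data_loss + weight * physics_loss
```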

3 Simple Sinusoidal Networks

There are many details that complicate the practical implementation of current sinusoidal networks. We aim to propose a simplified version of such networks in order to facilitate theoretical analysis and practical implementation, by removing such complications.

As an example we can look at SIRENs, which have their layer activations defined as $f_{l}(x)=\sin(\omega(W_{l}x+b_{l}))$. Then, in order to cancel the $\omega$ factor, layers after the first one have their weight initialization follow a uniform distribution with range $[-\frac{\sqrt{6/n}}{\omega},\frac{\sqrt{6/n}}{\omega}]$, where $n$ is the size of the layer. Unlike the other layers, the first layer is sampled from a uniform distribution with range $[-1/n,1/n]$.

We instead propose a simple sinusoidal network, with the goal of formulating an architecture that mainly amounts to substituting its activation functions by the sine function. We will, however, keep the $\omega$ parameter, since (as we will see in future analyses) it is in fact a useful tool for allowing the network to fit inputs of diverse frequencies. The layer activation equations of our simple sinusoidal network, with parameter $\omega$, are defined as

$$f_{1}(x) = \sin\left(\omega\left(W_{1}x+b_{1}\right)\right), \qquad f_{l}(x) = \sin\left(W_{l}x+b_{l}\right), \quad l>1. \tag{1}$$

Finally, instead of utilizing a uniform initialization as in SIRENs (with different bounds for the first and subsequent layers), we propose initializing all parameters in our simple sinusoidal network using a default Kaiming (He) normal initialization scheme. This choice not only greatly simplifies the initialization scheme of the network, but it also facilitates theoretical analysis of the behavior of the network under the NTK framework, as we will see in Section 4.
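As a reference for how little is involved, below is a minimal sketch of this simple sinusoidal network in PyTorch. It assumes a final linear output layer and zero-initialized biases, which are implementation choices in this sketch rather than requirements of the formulation; `nn.init.kaiming_normal_` in its default fan-in mode gives weights with variance $2/n$, matching the scheme above.

```python
import torch
import torch.nn as nn

class SimpleSinusoidalNetwork(nn.Module):
    """Simple sinusoidal MLP: sin(omega * (W1 x + b1)) in the first layer,
    sin(W_l x + b_l) in subsequent hidden layers, and a linear output layer."""

    def __init__(self, in_dim, hidden_dim, out_dim, n_layers=5, omega=32.0):
        super().__init__()
        self.omega = omega
        dims = [in_dim] + [hidden_dim] * (n_layers - 1) + [out_dim]
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]
        )
        for layer in self.layers:
            # Kaiming (He) normal initialization: weights ~ N(0, 2 / fan_in).
            nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
            nn.init.zeros_(layer.bias)  # simplification for this sketch

    def forward(self, x):
        x = torch.sin(self.omega * self.layers[0](x))  # only the first layer is scaled by omega
        for layer in self.layers[1:-1]:
            x = torch.sin(layer(x))
        return self.layers[-1](x)
```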

Analysis of the initialization scheme.   The initialization scheme proposed above differs from the one implemented in SIRENs. We will now show that this particular choice of initialization distribution preserves the variance of the original proposed SIREN initialization distribution. As a consequence, the original theoretical justifications for the SIREN initialization scheme still hold under our initialization, namely that the distribution of activations across layers is stable, well-behaved and shift-invariant. Due to space constraints, proofs are presented in Appendix A. Moreover, we also demonstrate empirically that these properties are maintained in practice.

Lemma 1.

Given any $c$, for $X\sim\mathcal{N}\left(0,\frac{1}{3}c^{2}\right)$ and $Y\sim\mathcal{U}\left(-c,c\right)$, we have $\mathrm{Var}[X]=\mathrm{Var}[Y]=\frac{1}{3}c^{2}$.

This simple lemma relates to Lemma 1.7 in Sitzmann et al. (2020), showing that the initialization we propose here has the same variance as the one proposed for SIRENs. Using this result we can translate the result from the main Theorem 1.8 from Sitzmann et al. (2020), which claims that the SIREN initialization indeed has the desired properties, to our proposed initialization:¹

For a uniform input in $[-1,1]$, the activations throughout a sinusoidal network are approximately standard normal distributed before each sine non-linearity and arcsine-distributed after each sine non-linearity, irrespective of the depth of the network, if the weights are distributed normally, with mean 0 and variance $\frac{2}{n}$, where $n$ is a layer’s fan-in.

¹We note that despite being named Theorem 1.8 in Sitzmann et al. (2020), this result is not fully formal, due to the Gaussian distribution being approximated without a formal analysis of this approximation. Additionally, a CLT result is employed which assumes infinite width, which is not applicable in this context. We thus refrain from calling our equivalent result a theorem. Nevertheless, to the extent that the argument is applicable, it would still hold for our proposed initialization, due to its dependence solely on the variance demonstrated in Lemma 1 above.

Empirical evaluation of initialization scheme.   To empirically demonstrate that the proposed simple initialization scheme preserves the properties of the SIREN initialization scheme, we perform the same analysis performed by Sitzmann et al. (2020). We observe that the distribution of activations matches the predicted normal (before the non-linearity) and arcsine (after the non-linearity) distributions, and that this behavior is stable across many layers. These results are reported in detail in Appendix B.
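The short numerical sketch below illustrates the kind of check involved (a simplified stand-in for the full analysis in Appendix B): with $\mathcal{N}(0, 2/\text{fan-in})$ weights, the standard deviation of the pre-activations in layers after the first should stay close to 1 regardless of depth. The layer sizes and seed are arbitrary choices for this sketch.

```python
import torch

torch.manual_seed(0)
width, depth, omega = 2048, 6, 1.0
h = torch.linspace(-1, 1, 256).unsqueeze(-1)  # uniform 1D input in [-1, 1]
for l in range(depth):
    fan_in = h.shape[-1]
    W = torch.randn(fan_in, width) * (2.0 / fan_in) ** 0.5  # N(0, 2 / fan_in) weights
    pre = (omega if l == 0 else 1.0) * (h @ W)              # pre-activation
    print(f"layer {l}: pre-activation std = {pre.std().item():.3f}")  # ~1 after the first layer
    h = torch.sin(pre)  # post-sine activations, approximately arcsine-distributed
```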

3.1 Comparison to SIREN

In order to demonstrate that our simplified sinusoidal network has comparable performance to a standard SIREN, in this section we reproduce the main results from Sitzmann et al. (2020). Table 1 compiles the results for all experiments. For a fair comparison, we compare the simple sinusoidal network proposed in this work with both the results directly reported in Sitzmann et al. (2020) and our own reproduction of the SIREN results (using the same parameters and settings as the original). We can see from the numbers reported in the table that the performance of the simple sinusoidal network matches the performance of the SIREN in all cases, in fact surpassing it in most of the experiments. Qualitative results are presented in Appendix C.

It is important to note that this is not a favorable setting for simple sinusoidal networks, given that the training durations were very short. The SIREN parametrization favors quick convergence to a solution, though it does not have as strong asymptotic behavior. This effect is likely due to the multiplicative $\omega$ factor applied to later layers, described in Section 3. We observe that indeed in almost all cases we can compensate for this effect by simply increasing the learning rate in the Adam optimizer (Kingma & Ba, 2014).

Finally, we observe that besides being able to surpass the performance of SIREN in most cases in a short training regime, the simple sinusoidal network performs even more strongly with longer training. To demonstrate this, we repeated some experiments from above, but with longer training durations. These results are shown in Table 4 in Appendix C.

Table 1: Comparison of the simple sinusoidal network and SIREN results, both directly from Sitzmann et al. (2020) and from our own reproduced experiments. Values above the horizontal center line are peak signal to noise ratio (PSNR), values below are mean squared error (MSE), except for SDF, which uses a composite loss. †Audio experiments utilized a separate learning rate for the first layer.

Experiment | Simple Sinusoidal Network | SIREN [paper] | SIREN [ours]
Image | 50.04 | 49 (approx.) | 49.0
Poisson (Gradient) | 39.66 | 32.91 | 38.92
Poisson (Laplacian) | 20.97 | 14.95 | 20.85
Video (cat) | 34.03 | 29.90 | 32.09
Video (bikes) | 37.4 | 32.88 | 33.75
Audio (Bach)† | $1.57\cdot 10^{-5}$ | $1.10\cdot 10^{-5}$ | $3.28\cdot 10^{-5}$
Audio (counting)† | $3.17\cdot 10^{-4}$ | $3.82\cdot 10^{-4}$ | $4.38\cdot 10^{-4}$
Helmholtz equation | $5.94\cdot 10^{-2}$ | – | $5.97\cdot 10^{-2}$
SDF (room) | 12.99 | – | 14.32
SDF (statue) | 6.22 | – | 5.98

4 Neural tangent kernel analysis

In the following we derive the NTK for sinusoidal networks. This analysis will show us that the sinusoidal network’s NTK is approximately a low-pass filter, with its bandwidth directly defined by $\omega$. We support these findings with an empirical analysis in the following section as well. Finally, we demonstrate how the insights from the NTK can be leveraged to properly “tune” sinusoidal networks to the spectrum of the desired signal. Full derivations and extensive, detailed analysis are left to Appendix D.

The NTK for a simple sinusoidal network with a single hidden layer is presented in the theorem below. The NTKs for sinusoidal networks with 1 and 6 hidden layers are shown in Figure 1.

Theorem 2.

Shallow SSN NTK. For a simple sinusoidal network with one hidden layer $f^{(1)}:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{2}}$ following Definition 1, its neural tangent kernel (NTK), as defined in Theorem 6, is given by

$$\Theta^{(1)}(x,\tilde{x}) = \frac{1}{2}\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)+1\right)e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}} - \frac{1}{2}\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)-1\right)e^{-\frac{\omega^{2}}{2}\|x+\tilde{x}\|_{2}^{2}}e^{-2\omega^{2}} + \omega^{2}\left(x^{T}\tilde{x}+1\right)+1.$$

We can see that for values of $\omega>2$, the second term quickly vanishes due to the $e^{-2\omega^{2}}$ factor. This leaves us with only the first term, which has a Gaussian form. Due to the linear scaling term $x^{T}\tilde{x}$, this is only approximately Gaussian, but the approximation improves as $\omega$ increases. We can thus observe that this kernel approximates a Gaussian kernel, which is a low-pass filter, with its bandwidth defined by $\omega$. Figure 1 presents visualizations of the NTKs for the simple sinusoidal network, compared to a (scaled) pure Gaussian with variance $\omega^{-2}$, showing there is a close match between the two.
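The closed form above is easy to check numerically. The sketch below evaluates the shallow SSN NTK of Theorem 2 for scalar inputs and compares its slice at $\tilde{x}=0$ to a Gaussian with variance $\omega^{-2}$, scaled to match the kernel’s maximum and minimum as in Figure 1; at this slice the difference is negligible, on the order of the vanishing second term.

```python
import numpy as np

def ssn_ntk_shallow(x, x_tilde, omega):
    """Shallow simple sinusoidal network NTK from Theorem 2, for scalar inputs."""
    dot = x * x_tilde + 1.0
    term1 = 0.5 * (omega**2 * dot + 1) * np.exp(-0.5 * omega**2 * (x - x_tilde) ** 2)
    term2 = (0.5 * (omega**2 * dot - 1)
             * np.exp(-0.5 * omega**2 * (x + x_tilde) ** 2) * np.exp(-2 * omega**2))
    return term1 - term2 + omega**2 * dot + 1

omega = 10.0
xs = np.linspace(-1, 1, 401)
ntk = ssn_ntk_shallow(xs, 0.0, omega)                 # slice at x_tilde = 0
gauss = np.exp(-0.5 * omega**2 * xs**2)               # Gaussian kernel with variance 1 / omega^2
gauss = (ntk.max() - ntk.min()) * gauss + ntk.min()   # scale to match max/min of the NTK
print(np.abs(ntk - gauss).max())                      # tiny compared to ntk.max() ~ 1.5 * (omega^2 + 1)
```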

If we write out the NTK for networks with more than one hidden layer, it quickly becomes uninterpretable due to the recursive nature of the NTK definition (see Appendix D). However, as shown empirically in Figure 1, these kernels are still approximated by Gaussians with variance $\omega^{-2}$.

We also observe that the NTK for a SIREN with a single hidden layer is analogous, but with a $\mathrm{sinc}$ form, which is also a low-pass filter.

Theorem 3.

Shallow SIREN NTK. For a single hidden layer SIREN $f^{(1)}:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{2}}$ following Definition 1, its neural tangent kernel (NTK), as defined in Theorem 6, is given by

$$\Theta^{(1)}(x,\tilde{x}) = \frac{c^{2}}{6}\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)+1\right)\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}-\tilde{x}_{j}\right)\right) - \frac{c^{2}}{6}\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)-1\right)e^{-2\omega^{2}}\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}+\tilde{x}_{j}\right)\right) + \omega^{2}\left(x^{T}\tilde{x}+1\right)+1.$$

For deeper SIREN networks, the kernels defined by the later layers are in fact Gaussian too, as discussed in Appendix D. This leads to an NTK that is approximated by a product of a sinc function and a Gaussian. These SIREN kernels are also presented in Figure 1.

[Figure 1 panels: NTKs for SSN (left block) and SIREN (right block), at depths L=1 and L=6, for ω=10 and ω=30.]

Figure 1: The NTK for SIREN and SSN at different $\omega$. Top: Kernel values for pairs $(x,\tilde{x})\in[-1,1]^{2}$. Bottom: Slice at fixed $\tilde{x}=0$. SSN plots show a superimposed Gaussian kernel with variance $\omega^{-2}$, scaled to match the max and min values of the NTK. Similarly, SIREN plots show a $\mathrm{sinc}$ function.

5 Empirical analysis

As shown above, neural tangent kernel theory suggests that sinusoidal networks work as low-pass filters, with their bandwidth controlled by the parameter $\omega$. In this section, we demonstrate empirically that we can observe this predicted behavior even in real sinusoidal networks. For this experiment, we generate a $512\times 512$ monochromatic image by super-imposing two orthogonal sinusoidal signals, each consisting of a single frequency, $f(x,y)=\cos(128\pi x)+\cos(32\pi y)$. This function is sampled in the domain $[-1,1]^{2}$ to generate the image on the left of Figure 2.

[Figure 2 panels: ground truth test signal and reconstructions for ω=1, ω=8, and ω=64.]

Figure 2: Left: The test signal used to analyze the behavior of sinusoidal networks. It is created from two orthogonal single frequencies, $f(x,y)=\cos(128\pi x)+\cos(32\pi y)$. Right: Examples of the reconstructed signal from networks with different $\omega$, demonstrating each of the loss levels in Figure 3.

To demonstrate what we can expect from applying low-pass filters of different bandwidths to this signal, we perform a discrete Fourier transform (DFT), cut off frequencies above a certain value, and perform an inverse transform to recover the (filtered) image. The MSE of the reconstruction, as a function of the cutoff frequency, is shown in Figure 3. We can see that due to the simple nature of the signal, containing only two frequencies, there are only three loss levels. If indeed the NTK analysis is correct and sinusoidal networks act as low-pass filters, with bandwidth controlled by $\omega$, we should be able to observe similar behavior with sinusoidal networks with different $\omega$ values. We plot the final training loss and training curves for sinusoidal networks with different $\omega$ in Figure 3. We can observe, again, that there are three consistent loss levels following the magnitude of the $\omega$ parameter, in line with the intuition that the sinusoidal network is working as a low-pass filter. This is also observable in Figure 2, where we see example reconstructions for networks of various $\omega$ values after training.
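The low-pass filtering procedure can be sketched as below using NumPy’s FFT; the cutoff values are illustrative, and frequencies are counted in cycles over the $[-1,1]$ domain, matching the convention used for the signal above.

```python
import numpy as np

n = 512
coords = np.linspace(-1, 1, n, endpoint=False)
X, Y = np.meshgrid(coords, coords, indexing="ij")
img = np.cos(128 * np.pi * X) + np.cos(32 * np.pi * Y)   # the two-frequency test signal

F = np.fft.fftshift(np.fft.fft2(img))
freqs = np.fft.fftshift(np.fft.fftfreq(n)) * n           # integer frequencies over the domain
FX, FY = np.meshgrid(freqs, freqs, indexing="ij")

for cutoff in [16, 64, 256]:                             # below 32, between 32 and 128, above 128
    mask = (np.abs(FX) <= cutoff) & (np.abs(FY) <= cutoff)
    recon = np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))
    print(cutoff, ((recon - img) ** 2).mean())           # the three loss levels of Figure 3
```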

However, unlike with the DFT low-pass filter (which does not involve any learning), we see in Figure 3 that during training some sinusoidal networks shift from one loss level to a lower one. This demonstrates that sinusoidal networks differ from true low-pass filters in that their weights can change, which implies that the bandwidth defined by $\omega$ also changes with learning. We know the weights $W_{1}$ in the first layer of a sinusoidal network, given by $f_{1}(x)=\sin\left(\omega\cdot W_{1}^{T}x+b_{1}\right)$, will change with training. Empirically, we observed that the spectral norm of $W_{1}$ increases throughout training for small $\omega$ values. We can interpret that as the overall magnitude of the term $\omega\cdot W_{1}^{T}x$ increasing, which is functionally equivalent to an increase in $\omega$ itself. In Figure 3, we observe that sinusoidal networks with smaller values of $\omega$ take a longer time to achieve a lower loss (if at all). Intuitively, this happens because, due to the effect described above, lower $\omega$ values require a larger increase in magnitude by the weights $W_{1}$. Given that all networks were trained with the same learning rate, the ones with a smaller $\omega$ require their weights to move a longer distance, and thus take more training steps to achieve a lower loss.
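A simple proxy for this effect, assuming the network sketch from Section 3, is to track the product of $\omega$ and the spectral norm of the first-layer weights during training, which we treat here as a rough “effective $\omega$”; this quantity and the attribute names are our own construction, not part of the original analysis.

```python
import torch

# Rough effective-bandwidth proxy (sketch): omega times the spectral norm of W1.
# Assumes `model` is the SimpleSinusoidalNetwork sketch from Section 3.
W1 = model.layers[0].weight.detach()
effective_omega = model.omega * torch.linalg.matrix_norm(W1, ord=2).item()
print(effective_omega)
```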

Figure 3: Left: Reconstruction loss for different cutoff frequencies for a low-pass filter applied to Figure 2. The three loss levels reflect the 2 frequencies present in the simple signal: 128 and 32. Superimposed is the final loss, after fitting the same input image with sinusoidal networks, for different values of $\omega$ (scaled to align with the frequencies). Right: Training curves for the sinusoidal networks. The three loss levels reflect the 2 frequencies present in the simple signal, and demonstrate that the sinusoidal network is indeed acting as a low-pass filter with bandwidth defined by $\omega$.

6 Tuning $\omega$

As shown in the previous section, though the bandwidth of a network can change throughout training, the choice of $\omega$ still influences how easily and quickly (if at all) it can learn a given signal. The value of the $\omega$ parameter is thus crucial for the learning of the network. Despite this fact, in SIRENs, for example, this value is not adjusted for each task (except for the audio fitting experiments), and is simply set empirically to an arbitrary value. In this section, we seek to justify a proper initialization for this parameter, such that it can be chosen appropriately for each given task.

Moreover, it is often not the case that we simply want to fit only the exact training samples; instead, we want to find a good interpolation (i.e., generalize well). Setting $\omega$ too high, and thus allowing the network to model frequencies much larger than the ones present in the actual signal, is likely to cause overfitting. This is demonstrated empirically in Figure 4.

Consequently, we want instead to tune the network to the highest frequency present in the signal. However, we do not always know the value of the highest frequency in the true underlying signal of interest. Moreover, we have also observed that, since the network learns and its weights change in magnitude, that value in fact changes with training. Therefore, the most we can hope for is a good heuristic to guide the choice of $\omega$. Nevertheless, having a reasonable guess for $\omega$ is likely sufficient for good performance, precisely due to the ability of the network to adapt during training and compensate for a possibly slightly suboptimal choice.

Choosing $\omega$ from the Nyquist frequency.    One source of empirical information on the relationship between $\omega$ and the sinusoidal network’s “learnable frequencies” is the previous section’s empirical analysis. Taking into account the scaling, we can see from Fig. 3 that around $\omega=16$ the network starts to be able to learn the full signal (frequency 128). We can similarly note that at about $\omega=4$ the sinusoidal network starts to be able to efficiently learn a signal with frequency 32, but not the one with frequency 128. This scaling suggests a heuristic of setting $\omega$ to about $1/8$ of the signal’s maximum frequency.

For natural signals, such as pictures, it is common for frequencies up to the Nyquist frequency of the discrete sampling to be present. We provide an example for the “camera” image we have utilized so far in Figure 21 in Appendix E, where we can see that the reconstruction loss through a low-pass filter continues to decrease significantly up to the Nyquist frequency for the image resolution.

In light of this information, analyzing the choices of $\omega$ for the experiments in Section 3.1 again suggests that $\omega$ should be set around $1/8$ of the Nyquist frequency of the signal. These values of $\omega$ are summarized in Table 2 in the “Fitting $\omega$” column. For example, the image fitting experiment shows that, for an image of shape $512\times 512$ (and thus a Nyquist frequency of 256 for each dimension), this heuristic suggests an $\omega$ value of $256/8=32$, which is the value found to work best empirically through search. We find similar results for the audio fitting experiments. The audio signals used in the audio fitting experiments contained approximately 300,000 and 500,000 points, and thus maximum frequencies of approximately 150,000 and 250,000. This suggests reasonable values for $\omega$ of 18,750 and 31,250, which are close to the ones found empirically to work well. In examples such as the video fitting experiments, in which each dimension has a different frequency, it is not completely clear how to pick a single $\omega$ to fit all dimensions. This suggests that having independent values of $\omega$ for each dimension might be useful for such cases, as discussed in the next section.
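In code, this heuristic amounts to a one-liner; the sketch below (with illustrative names) reproduces the two examples just discussed.

```python
def suggest_omega(num_samples_per_dim, fraction=1 / 8):
    """Heuristic from this section: omega ~ 1/8 of the Nyquist frequency,
    i.e. 1/8 of half the number of samples along a dimension."""
    return (num_samples_per_dim / 2) * fraction

print(suggest_omega(512))      # 512 x 512 image     -> 32.0
print(suggest_omega(300_000))  # ~300k-sample audio  -> 18750.0
```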

Finally, when performing the generalization experiments in Section 7, we show that the best performing $\omega$ ended up being half the value of the best $\omega$ used in the fitting tasks from Section 3.1. This follows intuitively, since for the generalization task we set apart half the points for training and the other half for testing, thus halving the maximum possible frequency in the training sample, providing further evidence of the relationship between $\omega$ and the maximum frequency in the input signal.

Multi-dimensional $\omega$.

In many problems, such as the video fitting and PDE problems, not only is the input space multi-dimensional, it also contains time and space dimensions (which additionally may have different shapes). This suggests that employing a multi-dimensional $\omega$, specifying different frequencies for each dimension, might be beneficial. In practice, if we employ a scaling factor $\lambda=\begin{bmatrix}\lambda_{1}&\lambda_{2}&\dots&\lambda_{d}\end{bmatrix}^{T}$, we have the first layer of the sinusoidal network given by

$$f_{1}(x) = \sin\left(\omega\left(W_{1}\left(\lambda\odot x\right)+b_{1}\right)\right) = \sin\left(W_{1}\left(\Omega\odot x\right)+\omega b_{1}\right), \tag{2}$$

where $\Omega=\begin{bmatrix}\lambda_{1}\omega&\lambda_{2}\omega&\dots&\lambda_{d}\omega\end{bmatrix}^{T}$ works as a multi-dimensional $\omega$. In the following experiments, we employ this approach for three-dimensional problems, in which we have time and differently shaped space domains, namely the video fitting and physics-informed neural network PDE experiments. For these experiments, we report $\omega$ in the form of the (already scaled) $\Omega$ vector for simplicity.
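A sketch of a first layer implementing Eq. (2) is shown below, again in PyTorch; the class and argument names are ours, and weight initialization is left to the scheme of Section 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiOmegaFirstLayer(nn.Module):
    """First sinusoidal layer with per-dimension frequency scaling, as in Eq. (2):
    sin(W1 (Omega * x) + omega * b1), with Omega = lambda * omega elementwise."""

    def __init__(self, in_dim, out_dim, omega, scale):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.omega = omega
        # Omega is the per-dimension frequency vector (lambda_i * omega).
        self.register_buffer("Omega", omega * torch.as_tensor(scale, dtype=torch.float32))

    def forward(self, x):
        # sin(W1 (Omega ⊙ x) + omega * b1)
        pre = F.linear(self.Omega * x, self.linear.weight, self.omega * self.linear.bias)
        return torch.sin(pre)
```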

Choosing $\omega$ from available information.

Finally, in many problems we do have some knowledge of the underlying signal we can leverage, such as in the case of inverse problems. For example, say we have velocity fields for a fluid and we are trying to solve for the coupled pressure field and the Reynolds number using a physics-informed neural network (as done in Section 7). In this case, we have access to two components of the solution field. Performing a Fourier transform on the training data we have can reveal the relevant spectrum and inform our choice of $\omega$. If the maximum frequency in the signal is lower than the Nyquist frequency implied by the sampling, this can lead to a more appropriate choice of $\omega$ than one suggested purely by the sampling.
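A sketch of this procedure for a one-dimensional signal is given below. The energy threshold and the $1/8$ factor follow the heuristic above, but both are assumptions of this sketch rather than a prescribed recipe; frequencies are counted in cycles over the sampled domain.

```python
import numpy as np

def omega_from_data(signal, energy_threshold=0.99, fraction=1 / 8):
    """Estimate omega from available training data (1D sketch): find the frequency
    below which most of the spectral energy lies, then apply the 1/8 heuristic."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    cumulative = np.cumsum(power) / power.sum()
    max_freq = int(np.searchsorted(cumulative, energy_threshold))  # cycles over the domain
    return max_freq * fraction
```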

7 Experiments

In this section, we first perform experiments to demonstrate how the optimal value of $\omega$ influences the generalization error of a sinusoidal network, following the discussion in Section 6. After that, we demonstrate that sinusoidal networks with properly tuned $\omega$ values outperform traditional physics-informed neural networks in classic PDE tasks.

7.1 Evaluating generalization

We now evaluate the simple sinusoidal network’s generalization capabilities. To do this, in all experiments in this section we segment the input signal into training and test sets using a checkerboard pattern – along all axis-aligned directions, points alternate between belonging to the train and test sets. We perform audio, image and video fitting experiments. When performing these experiments, we search for the best performing $\omega$ value for generalization (defined as performance on the held-out points). We report the best values in Table 2. We observe that, as expected from the discussion in Section 6, the best performing $\omega$ values follow the heuristic discussed above, and are in fact half the best-performing values found in the previous fitting experiments from Section 3.1, confirming our expectation. This is also demonstrated in the plot in Figure 4. Using a higher $\omega$ leads to overfitting and poor generalization outside the training points. This is demonstrated in Figure 4, in which we can see that choosing an appropriate $\omega$ value from the heuristics described previously leads to a good fit and interpolation. Setting $\omega$ too high leads to interpolation artifacts, due to overfitting of spurious high-frequency components.
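The checkerboard segmentation is a simple parity split over grid indices; a minimal sketch for arbitrary input dimensionality is shown below (the function name and argument layout are illustrative).

```python
import numpy as np

def checkerboard_split(grid_indices):
    """Split grid points into train/test sets in a checkerboard pattern:
    along every axis-aligned direction, points alternate between the two sets.
    `grid_indices` has shape (num_points, num_dims) and holds integer coordinates."""
    parity = grid_indices.sum(axis=1) % 2
    return grid_indices[parity == 0], grid_indices[parity == 1]  # train, test
```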

For the video signals, which have different sizes along each axis, we employ a multi-dimensional $\omega$. We scale each dimension of $\omega$ proportionally to the size of the input signal along the corresponding axis.

Table 2: Generalization results and the respective tuned $\omega$ values. Generalization values are mean squared error (MSE). We can observe that the best performing $\omega$ for generalization is half the $\omega$ used previously for fitting the full signal, since this task uses half the sample points used previously.

Experiment | SIREN | SSN | $\omega$ | Fitting $\omega$
Image | $2.76\cdot 10^{-4}$ | $1.25\cdot 10^{-4}$ | 16 | 32
Audio (Bach) | $4.55\cdot 10^{-6}$ | $3.87\cdot 10^{-6}$ | 8,000 | 15,000
Audio (counting) | $1.37\cdot 10^{-4}$ | $5.97\cdot 10^{-5}$ | 16,000 | 32,000
Video (cat) | $3.40\cdot 10^{-3}$ | $1.76\cdot 10^{-3}$ | [4 8 8] | 8
Video (bikes) | $2.74\cdot 10^{-3}$ | $8.79\cdot 10^{-4}$ | [4 4 8] | 8

[Figure 4 panels: final training loss vs. $\omega$; ground truth; reconstructions for $\omega=16$ and $\omega=128$.]

Figure 4: Left: Final training loss for different values of $\omega$ in the image fitting generalization experiment. Right: Examples of generalization from half the points using sinusoidal networks with different values of $\omega$. Even though both networks achieve equivalent training loss, the rightmost one, with $\omega$ higher than what would be suggested by the Nyquist frequency of the input signal, overfits the data, causing high-frequency noise artifacts in the reconstruction (e.g., notice the sky).

7.2 Solving differential equations

Table 3: Comparison of the sinusoidal network and an MLP with tanh non-linearity on PINN experiments from Raissi et al. (2019a) and Sitzmann et al. (2020). Values are percent error relative to the ground truth value of each parameter for identification problems, and mean squared error (MSE) for inference problems. The Helmholtz experiment is the same as in Section 3.1.

Experiment | Baseline | SSN | $\omega$
Burgers (Identification) | [0.0521%, 0.4522%] | [0.0071%, 0.0507%] | 10
Navier-Stokes (Identification) | [0.0046%, 2.093%] | [0.0038%, 1.782%] | [0.6 0.3 1.2]
Schrödinger (Inference) | $1.04\cdot 10^{-3}$ | $4.30\cdot 10^{-4}$ | 4
Helmholtz (Inference) | $5.97\cdot 10^{-2}$ | $5.94\cdot 10^{-2}$ | 16

Finally, we apply our analysis to physics-informed learning. We compare the performance of simple sinusoidal networks to the $\tanh$ networks that are commonly used for these tasks. Results are summarized in Table 3. Details for the Schrödinger and Helmholtz experiments are presented in Appendix E.

7.2.1 Burgers equation (Identification)

This experiment reproduces the Burgers equation identification experiment from Raissi et al. (2019a). Here we are identifying the parameters $\lambda_{1}$ and $\lambda_{2}$ of a 1D Burgers equation, $u_{t}+\lambda_{1}uu_{x}-\lambda_{2}u_{xx}=0$, given a known solution field. The ground truth values of the parameters are $\lambda_{1}=1.0$ and $\lambda_{2}=0.01/\pi$.

In order to find a good value for $\omega$, we perform a low-pass reconstruction of the solution as before. We can observe in Figure 5 that the solution does not have high bandwidth, with most of the loss being minimized with only the lower half of the spectrum. Note that the sampling performed for the training data ($N=2{,}000$) is sufficient to support such frequencies. This suggests an $\omega$ value in the range $8$–$10$. Indeed, we observe that $\omega=10$ gives the best identification of the desired parameters, with errors of $0.0071\%$ and $0.0507\%$ for $\lambda_{1}$ and $\lambda_{2}$ respectively, against errors of $0.0521\%$ and $0.4522\%$ for the baseline. This value of $\omega$ also achieves the lowest reconstruction loss against the known solution, with an MSE of $8.034\cdot 10^{-4}$. Figure 5 shows the reconstructed solution using the identified parameters.
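For reference, a minimal sketch of the physics-informed residual used for this identification task is shown below: the sinusoidal network predicts $u(t,x)$ and the PDE parameters are trained jointly with it via the residual $u_t + \lambda_1 u u_x - \lambda_2 u_{xx}$. Variable names and initial values are illustrative, not the exact setup of the original experiment.

```python
import torch

lambda1 = torch.nn.Parameter(torch.tensor(0.0))  # trained jointly with the network
lambda2 = torch.nn.Parameter(torch.tensor(0.0))

def burgers_residual(model, t, x):
    """PDE residual u_t + lambda1 * u * u_x - lambda2 * u_xx at the given points."""
    t = t.clone().requires_grad_(True)
    x = x.clone().requires_grad_(True)
    u = model(torch.stack([t, x], dim=-1)).squeeze(-1)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return u_t + lambda1 * u * u_x - lambda2 * u_xx
```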

Figure 5: Left: Reconstruction loss for different cutoff frequencies for a low-pass filter applied to the solution of the Burgers equation. Right: Reconstructed solution of the Burgers equation using the identified parameters with the sinusoidal network, together with the positions of the sampled training points.

7.2.2 Navier-Stokes (Identification)

This experiment reproduces the Navier-Stokes identification experiment from Raissi et al. (2019a). In this experiment, we are trying to identify the parameters $\lambda_{1},\lambda_{2}$ and the pressure field $p$ of the 2D Navier-Stokes equations given by $\frac{\partial u}{\partial t}+\lambda_{1}u\cdot\nabla u=-\nabla p+\lambda_{2}\nabla^{2}u$, given known velocity fields $u$ and $v$. The ground truth values of the parameters are $\lambda_{1}=1.0$ and $\lambda_{2}=0.01$.

Unlike the 1D Burgers case, here the number of points sampled for the training set ($N=5{,}000$) is not high compared to the size of the full solution volume, and is thus the limiting factor for the bandwidth of the input signal. Given the random sampling of points from the full solution, the generalized sampling theorem applies. The original solution has dimensions of $100\times 50\times 200$. With the $5{,}000$ randomly sampled points, the average sampling rate per dimension is approximately $17$, corresponding to a Nyquist frequency of approximately $8.5$. Furthermore, given the multi-dimensional nature of this problem, with both spatial and temporal axes, we employ an independent scaling of $\omega$ for each dimension. The analysis above suggests an average $\omega\approx 1$, with the dimensions of the problem suggesting scaling factors of $\begin{bmatrix}0.5&1&2\end{bmatrix}^{T}$. Indeed, we observe that $\Omega=\begin{bmatrix}0.3&0.6&1.2\end{bmatrix}^{T}$ gives the best results, with errors of $0.0038\%$ and $1.782\%$ for $\lambda_{1}$ and $\lambda_{2}$ respectively, against errors of $0.0046\%$ and $2.093\%$ for the baseline. Figure 6 shows the identified pressure field. Note that given the nature of the problem, this field can only be identified up to a constant.

Figure 6: Left: One timestep of the ground truth Navier-Stokes solution. The black rectangle indicates the domain region used for the task. Right: Identified pressure field for the Navier-Stokes equations using the sinusoidal network. Notice that the identification is only possible up to a constant.

8 Conclusion

In this work, we have presented a simplified formulation for sinusoidal networks. Analysis of this architecture from the neural tangent kernel perspective, combined with empirical results, reveals that the kernel for sinusoidal networks corresponds to a low-pass filter with adjustable bandwidth. We leverage this information to initialize these networks appropriately, choosing their bandwidth such that it is tuned to the signal being learned. Employing this strategy, we demonstrated improved results in both implicit modelling and physics-informed learning tasks.

References

  • Arora et al. (2019) Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. arXiv:1901.08584 [cs, stat], May 2019. URL http://arxiv.org/abs/1901.08584. arXiv: 1901.08584.
  • Basri et al. (2019) Ronen Basri, David Jacobs, Yoni Kasten, and Shira Kritchman. The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies. June 2019. URL https://arxiv.org/abs/1906.00425v3.
  • Huang et al. (2021a) Xiang Huang, Hongsheng Liu, Beiji Shi, Zidong Wang, Kang Yang, Yang Li, Bingya Weng, Min Wang, Haotian Chu, Jing Zhou, Fan Yu, Bei Hua, Lei Chen, and Bin Dong. Solving Partial Differential Equations with Point Source Based on Physics-Informed Neural Networks. arXiv:2111.01394 [physics], November 2021a. URL http://arxiv.org/abs/2111.01394. arXiv: 2111.01394.
  • Huang et al. (2021b) Xinquan Huang, Tariq Alkhalifah, and Chao Song. A modified physics-informed neural network with positional encoding. pp.  2484, September 2021b. doi: 10.1190/segam2021-3584127.1.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/hash/5a4be1fa34e62bb8a6ec6b91d2462f5a-Abstract.html.
  • Jacot et al. (2020) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. 2020.
  • Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. 2014. URL http://arxiv.org/abs/1412.6980.
  • Lee et al. (2018) Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep Neural Networks as Gaussian Processes, March 2018. URL http://arxiv.org/abs/1711.00165. arXiv:1711.00165 [cs, stat].
  • Lee et al. (2020) Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124002, December 2020. ISSN 1742-5468. doi: 10.1088/1742-5468/abc62b. URL http://arxiv.org/abs/1902.06720. arXiv: 1902.06720.
  • Neal (1994) Radford M. Neal. Priors for infinite networks. Technical Report CRG-TR-94-1, University of Toronto, 1994. URL https://www.cs.toronto.edu/~radford/ftp/pin.pdf.
  • Pearce et al. (2019) Tim Pearce, Russell Tsuchida, Mohamed Zaki, Alexandra Brintrup, and Andy Neely. Expressive Priors in Bayesian Neural Networks: Kernel Combinations and Periodic Functions, June 2019. URL http://arxiv.org/abs/1905.06076. arXiv:1905.06076 [cs, stat].
  • Rahaman et al. (2019) Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the Spectral Bias of Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, pp.  5301–5310. PMLR, May 2019. URL https://proceedings.mlr.press/v97/rahaman19a.html. ISSN: 2640-3498.
  • Rahimi & Recht (2007) Ali Rahimi and Benjamin Recht. Random Features for Large-Scale Kernel Machines. In Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. URL https://papers.nips.cc/paper/2007/hash/013a006f03dbc5392effeb8f18fda755-Abstract.html.
  • Raissi et al. (2019a) Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 2019a. doi: 10.1016/j.jcp.2018.10.045. URL https://linkinghub.elsevier.com/retrieve/pii/S0021999118307125.
  • Raissi et al. (2019b) Maziar Raissi, Zhicheng Wang, Michael S. Triantafyllou, and George Em Karniadakis. Deep Learning of Vortex Induced Vibrations. Journal of Fluid Mechanics, 861:119–137, February 2019b. ISSN 0022-1120, 1469-7645. doi: 10.1017/jfm.2018.872. URL http://arxiv.org/abs/1808.08952. arXiv: 1808.08952.
  • Sitzmann et al. (2020) Vincent Sitzmann, Julien N. P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit Neural Representations with Periodic Activation Functions. 2020. URL http://arxiv.org/abs/2006.09661.
  • Song et al. (2021) Chao Song, Tariq Alkhalifah, and Umair Bin Waheed. A versatile framework to solve the Helmholtz equation using physics-informed neural networks. Geophysical Journal International, 228(3):1750–1762, November 2021. ISSN 0956-540X, 1365-246X. doi: 10.1093/gji/ggab434. URL https://academic.oup.com/gji/article/228/3/1750/6409132.
  • Tancik et al. (2020) Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. 2020.
  • Tsuchida (2020) Russell Tsuchida. Results on infinitely wide multi-layer perceptrons. PhD Thesis, The University of Queensland, December 2020. URL https://espace.library.uq.edu.au/view/UQ:cc3d959.
  • Wang et al. (2021) Sifan Wang, Hanwen Wang, and Paris Perdikaris. On the eigenvector bias of fourier feature networks: From regression to solving multi-scale pdes with physics-informed neural networks. Computer Methods in Applied Mechanics and Engineering, 384:113938, 2021.
  • Wong et al. (2022) Jian Cheng Wong, Chinchun Ooi, Abhishek Gupta, and Yew-Soon Ong. Learning in Sinusoidal Spaces with Physics-Informed Neural Networks. arXiv:2109.09338 [physics], March 2022. URL http://arxiv.org/abs/2109.09338. arXiv: 2109.09338 version: 2.

Appendix A Simple Sinusoidal Network Initialization

We present here the proofs for the initialization scheme of the simple sinusoidal network from Section 3.

Lemma 4.

Given any $c$, for $X\sim\mathcal{N}\left(0,\frac{1}{3}c^{2}\right)$ and $Y\sim\mathcal{U}\left(-c,c\right)$, we have $\mathrm{Var}[X]=\mathrm{Var}[Y]=\frac{1}{3}c^{2}$.

Proof.

By definition, $\mathrm{Var}[X]=\sigma^{2}=\frac{1}{3}c^{2}$. For $Y$, we know that the variance of a uniformly distributed random variable with bounds $[a,b]$ is given by $\frac{1}{12}(b-a)^{2}$. Thus, $\mathrm{Var}[Y]=\frac{1}{12}(2c)^{2}=\frac{1}{3}c^{2}$. ∎

Theorem 5.

For a uniform input in $[-1,1]$, the activations throughout a sinusoidal network are approximately standard normal distributed before each sine non-linearity and arcsine-distributed after each sine non-linearity, irrespective of the depth of the network, if the weights are distributed normally, with mean 0 and variance $\frac{2}{n}$, where $n$ is the layer’s fan-in.

Proof.

The proof follows exactly the proof for Theorem 1.8 in Sitzmann et al. (2020), only using Lemma 4 when necessary to show that the initialization proposed here has the same variance necessary for the proof to follow. ∎

Appendix B Empirical evaluation of SSN initialization

Here we report an empirical analysis of the initialization scheme of simple sinusoidal networks, referenced in Section 3. For this analysis we use a sinusoidal MLP with 6 hidden layers of 2048 units, and single-dimensional input and output. This MLP is initialized using the simplified scheme described above. For testing, $2^{8}$ equally spaced inputs from the range $[-1,1]$ are passed through the network. We then plot the histogram of activations after each linear operation (before the sine non-linearity) and after each sine non-linearity. To match the original plot, we also plot the 1D Fast Fourier Transform of all activations in a layer, and the gradient of this output with respect to each activation. These results are presented in Figure LABEL:fig:simple_acts. The main conclusion from this figure is that the distribution of activations matches the predicted normal (before the non-linearity) and arcsine (after the non-linearity) distributions, and that this behavior is stable across many layers. We also reproduced the same result up to 50 layers.

We then perform an additional experiment in which the exact same setup as above is employed, yet the 1D inputs are shifted by a large value (i.e., $x\rightarrow x+1000$). We then show the same plot as before in Figure LABEL:fig:simple_shift. We can see that there is essentially no change from the previous plot, which demonstrates the sinusoidal network’s shift-invariance in the input space, one of its important desirable properties, as discussed previously.

Appendix C Experimental Details for Comparison to SIREN

Table 4: Comparison of the simple sinusoidal network and SIREN on some experiments, with a longer training duration. The specific durations are described below in the details for each experiment. We can see that the simple sinusoidal network has stronger asymptotic performance. Values above the horizontal center line are peak signal to noise ratio (PSNR), values below are mean squared error (MSE). †Audio experiments utilized a different learning rate for the first layer; see the full description below for details.

Experiment | Simple Sinusoidal Network | SIREN [ours]
Image | 54.70 | 52.43
Poisson (Gradient) | 39.51 | 38.70
Poisson (Laplacian) | 22.09 | 20.82
Video (cat) | 34.64 | 32.26
Video (bikes) | 37.71 | 34.07
Audio (Bach)† | $5.66\cdot 10^{-7}$ | $3.02\cdot 10^{-6}$
Audio (counting)† | $4.02\cdot 10^{-5}$ | $6.33\cdot 10^{-5}$

Below, we present qualitative results and describe experimental details for each experiment. As these are a reproduction of the experiments in Sitzmann et al. (2020), we refer to their details as well for further information.

C.1 Image

[Figure 7 panels: image, gradient, and Laplacian.]

Figure 7: Top row: Ground truth image. Bottom: Reconstructed with sinusoidal network.

In the image fitting experiment, we treat an image as a function from the spatial domain to color values, $(x,y)\rightarrow(r,g,b)$. In the case of a monochromatic image, used here, this function maps instead to one-dimensional intensity values. We try to learn a function $f:\mathbb{R}^{2}\rightarrow\mathbb{R}$, parametrized as a sinusoidal network, in order to fit such an image.

Figure 7 shows the image used in this experiment, and the reconstruction from the fitted sinusoidal network. The gradient and Laplacian for the learned function are also presented, demonstrating that higher order derivatives are also learned appropriately.

Training parameters.

The input image used is $512\times 512$, mapped to an input domain $[-1,1]^{2}$. The sinusoidal network used is a 5-layer MLP with hidden size 256, following the proposed initialization scheme above. The parameter $\omega$ is set to 32. The Adam optimizer is used with a learning rate of $3\cdot 10^{-3}$, trained for 10,000 steps in the short duration training results and for 20,000 steps in the long duration training results.

C.2 Poisson

[Figure 8 panels: image, gradient, and Laplacian for each row.]

Figure 8: Top row: Ground truth image. Mid: Image reconstructed from sinusoidal network supervised on gradient. Bottom: Image reconstructed from sinusoidal network supervised on Laplacian.

These tasks are similar to the image fitting experiment, but instead of supervising directly on the ground truth image, the fitted sinusoidal network is supervised on its derivatives, constituting a Poisson problem. We perform the experiment by supervising both on the input image’s gradient and Laplacian, and report the reconstruction of the image and its gradients in each case.

Figure 8 shows the image used in this experiment, and the reconstruction from the fitted sinusoidal networks. Since reconstruction from derivatives can only be correct up to a scaling factor, we scale the reconstructions for visualization. As in the original SIREN results, we can observe that the reconstruction from the gradient is of higher quality than the one from the Laplacian.

Training parameters.

The input image used is of size $256\times 256$, mapped from an input domain $[-1,1]^{2}$. The sinusoidal network used is a 5-layer MLP with hidden size 256, following the proposed initialization scheme above. For both experiments, the parameter $\omega$ is set to 32 and the Adam optimizer is used. For the gradient experiments, in short and long training results, a learning rate of $1\cdot 10^{-4}$ is used, trained for 10,000 and 20,000 steps respectively. For the Laplace experiments, in short and long training results, a learning rate of $1\cdot 10^{-3}$ is used, trained for 10,000 and 20,000 steps respectively.

C.3 Video

Figure 9: Top row: Frames from ground truth “cat” video. Bottom: Video reconstructed from sinusoidal network.

Figure 10: Top row: Frames from ground truth “bikes” video. Bottom: Video reconstructed from sinusoidal network.

These tasks are similar to the image fitting experiment, but we instead fit a video, which also has a temporal input dimension, $(t,x,y)\rightarrow(r,g,b)$. We learn a function $f:\mathbb{R}^{3}\rightarrow\mathbb{R}^{3}$, parametrized as a sinusoidal network, in order to fit such a video.

Figures 9 and 10 show sampled frames from the videos used in this experiment, and their respective reconstructions from the fitted sinusoidal networks.

Training parameters.

The cat video contains 300 frames of size $512\times 512$. The bikes video contains 250 frames of size $272\times 640$. These signals are fitted from the input domain $[-1,1]^{3}$. The sinusoidal network used is a 5-layer MLP with hidden size 1024, following the proposed initialization scheme above. The parameter $\omega$ is set to 8. The Adam optimizer is used, with a learning rate of $3\cdot 10^{-4}$, trained for 100,000 steps in the short duration training results and for 200,000 steps in the long duration training results.

C.4 Audio

Figure 11: Ground truth and reconstructed waveforms for “Bach” and “counting” audios.

In the audio experiments, we fit an audio signal in the temporal domain as a waveform $t\rightarrow w$. We try to learn a function $f:\mathbb{R}\rightarrow\mathbb{R}$, parametrized as a sinusoidal network, in order to fit the audio.

Figure 11 shows the waveforms for the input audios and the reconstructed audios from the fitted sinusoidal network.

In this experiment, we utilized a lower learning rate for the first layer compared to the rest of the network. This was used to compensate for the very large $\omega$ used (in the 15,000–30,000 range, compared to the 10–30 range for all other experiments). One might argue that this is re-introducing complexity, counteracting the purpose of the proposed simplification. However, we would claim (1) that this is only limited to cases with extremely high $\omega$, which was not present in any case except for fitting audio waves, and (2) that adjusting the learning rate for an individual layer is still an approach that is simpler and more in line with standard machine learning practice compared to multiplying all layers by a scaling factor and then adjusting their initialization variance by the same amount.

Training parameters.

Both audios use a sampling rate of 44,100 Hz. The Bach audio is 7 s long and the counting audio is approximately 12 s long. These signals are fitted from the input domain $[-1,1]$. The sinusoidal network used is a 5-layer MLP with hidden size 256, following the proposed initialization scheme above. For short and long training results, training is performed for 5,000 and 50,000 steps respectively. For the Bach experiment, the parameter $\omega$ is set to 15,000. The Adam optimizer is used, with a general learning rate of $3\cdot 10^{-3}$. A separate learning rate of $1\cdot 10^{-6}$ is used for the first layer to stabilize training due to the large $\omega$ value. For the counting experiment, the parameter $\omega$ is set to 32,000. The Adam optimizer is used, with a general learning rate of $1\cdot 10^{-3}$ and a first layer learning rate of $1\cdot 10^{-6}$.

C.5 Helmholtz equation

[Figure 12 panels: real and imaginary components.]

Figure 12: Top row: Ground truth real and imaginary fields. Bottom: Reconstructed with sinusoidal network.

In this experiment we solve for the unknown wavefield $\Phi:\mathbb{R}^{2}\rightarrow\mathbb{R}^{2}$ in the Helmholtz equation

$$(\Delta+k^{2})\Phi(x)=-f(x), \tag{3}$$

with known wavenumber $k$ and source function $f$ (a Gaussian with $\mu=0$ and $\sigma^{2}=10^{-4}$). We solve this differential equation using a sinusoidal network supervised with the physics-informed loss $\int_{\Omega}\|(\Delta+k^{2})\Phi(x)+f(x)\|_{1}\,dx$, evaluated at random points sampled uniformly in the domain $\Omega=[-1,1]^{2}$.

Figure 12 shows the real and imaginary components of the ground truth solution to the differential equation and the solution recovered by the fitted sinusoidal network.

Training parameters.

The sinusoidal network used is a 5-layer MLP with hidden size 256, following the proposed initialization scheme above. The parameter $\omega$ is set to 16. The Adam optimizer is used, with a learning rate of $3\cdot 10^{-4}$, trained for 50,000 steps.

C.6 Signed distance function (SDF)

In these tasks we learn a 3D signed distance function. We learn a function $f:\mathbb{R}^{3}\rightarrow\mathbb{R}$, parametrized as a sinusoidal network, to model a signed distance function representing a 3D scene. This function is supervised indirectly from point cloud data of the scene. Figures 14 and 13 show 3D renderings of the volumes inferred from the learned SDFs.

Training parameters.

The statue point cloud contains 4,999,996 points. The room point cloud contains 10,250,688 points. These signals are fitted from the input domain $[-1,1]^{3}$. The sinusoidal network used is a 5-layer MLP with hidden size 256 for the statue and 1024 for the room. The parameter $\omega$ is set to 4. The Adam optimizer is used, with a learning rate of $8\cdot 10^{-4}$ and a batch size of 1400. All models are trained for 190,000 steps for the statue experiment and for 410,000 steps for the room experiment.

Figure 13: Rendering of the “room” 3D scene SDF learned by the sinusoidal network from a point cloud.

Figure 14: Rendering of the “statue” 3D scene SDF learned by the sinusoidal network from a point cloud.

Appendix D Neural Tangent Kernel Analysis and Proofs

D.1 Preliminaries

In order to perform the subsequent NTK analysis, we first need to formalize definitions for simple sinusoidal networks and SIRENs. The definitions used here adhere to the common NTK analysis practices, and thus differ slightly from practical implementation.

Definition 1.

For the purposes of the following proofs, a (sinusoidal) fully-connected neural network with L hidden layers that takes as input x\in\mathbb{R}^{n_{0}} is defined as the function f^{(L)}:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{L+1}}, recursively given by

f^{(0)}(x) = \omega\left(W^{(0)}x+b^{(0)}\right),
f^{(L)}(x) = W^{(L)}\frac{1}{\sqrt{n_{L}}}\sin\left(f^{(L-1)}\right)+b^{(L)},

where \omega\in\mathbb{R}. The parameters \left\{W^{(j)}\right\}_{j=0}^{L} have shape n_{j+1}\times n_{j}, with each element sampled independently either from \mathcal{N}(0,1) (for simple sinusoidal networks) or from \mathcal{U}(-c,c) with some bound c\in\mathbb{R} (for SIRENs). The \left\{b^{(j)}\right\}_{j=0}^{L} are n_{j+1}-dimensional vectors sampled independently from \mathcal{N}(0,I_{n_{j+1}}).
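For concreteness, a minimal PyTorch sketch of this parametrization (in the simple sinusoidal network case, with Gaussian weights) is given below; the layer widths and \omega are arbitrary. Note that this is the analysis-oriented formulation of Definition 1, which, as noted above, differs slightly from the practical implementation.

import math
import torch
import torch.nn as nn

class SimpleSinusoidalNet(nn.Module):
    # Sketch of Definition 1 with N(0, 1) weights and N(0, I) biases.
    def __init__(self, sizes, omega=10.0):
        super().__init__()
        self.omega = omega
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
        )
        self.biases = nn.ParameterList(
            [nn.Parameter(torch.randn(sizes[i + 1])) for i in range(len(sizes) - 1)]
        )

    def forward(self, x):
        # f^(0)(x) = omega (W^(0) x + b^(0))
        h = self.omega * (x @ self.weights[0].t() + self.biases[0])
        # f^(l) = W^(l) sin(f^(l-1)) / sqrt(n_l) + b^(l)
        for W, b in zip(self.weights[1:], self.biases[1:]):
            h = torch.sin(h) @ W.t() / math.sqrt(W.shape[1]) + b
        return h

net = SimpleSinusoidalNet([1, 256, 256, 1], omega=10.0)   # widths are illustrative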


With this definition, we now state the general formulation of the NTK, which applies in general to fully-connected networks with Lipschitz non-linearities, and consequently in particular to the sinusoidal networks studied here as well. Let us first define the NNGP, which has covariance recursively defined by

\Sigma^{(L+1)}(x,\tilde{x}) = \mathbb{E}_{f\sim\mathcal{N}(0,\Sigma^{(L)})}\left[\sigma(f(x))\,\sigma(f(\tilde{x}))\right]+\beta^{2},

with base case \Sigma^{(1)}(x,\tilde{x})=\frac{1}{n_{0}}x^{T}\tilde{x}+\beta^{2}, and where \beta gives the variance of the bias terms in the neural network layers (Neal, 1994; Lee et al., 2018). Now the NTK is given by the following theorem.

Theorem 6.

For a neural network with L hidden layers f^{(L)}:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{L+1}} following Definition 1, as the size of the hidden layers n_{1},\dots,n_{L}\rightarrow\infty sequentially, the neural tangent kernel (NTK) of f^{(L)} converges in probability to the deterministic kernel \Theta^{(L)} defined recursively as

\Theta^{(0)}(x,\tilde{x}) = \Sigma^{(0)}(x,\tilde{x}) = \omega^{2}\left(x^{T}\tilde{x}+1\right),
\Theta^{(L)}(x,\tilde{x}) = \Theta^{(L-1)}(x,\tilde{x})\,\dot{\Sigma}^{(L)}(x,\tilde{x})+\Sigma^{(L)}(x,\tilde{x}),

where \left\{\Sigma^{(l)}\right\}_{l=0}^{L} are the neural network Gaussian processes (NNGPs) corresponding to each f^{(l)} and

\dot{\Sigma}^{(l)}(x,\tilde{x})=\mathbb{E}_{(u,v)\sim\Sigma^{(l-1)}(x,\tilde{x})}\left[\cos\left(u\right)\cos\left(v\right)\right].
Proof.

This is a standard general NTK theorem, showing that the limiting kernel is defined recursively in terms of the network's NNGPs and the previous layer's NTK. For brevity we omit the proof here and refer the reader to, for example, Jacot et al. (2020).

The only difference is in the base case \Sigma^{(0)}, due to the fact that we have an additional \omega parameter in the first layer. It is simple to see that the neural network with 0 hidden layers, i.e. the linear model \omega\left(W^{(0)}x+b^{(0)}\right), will lead to the same Gaussian process covariance kernel as in the original proof, x^{T}\tilde{x}+1, only adjusted by the additional variance factor \omega^{2}. ∎

Theorem 6 demonstrates that the NTK can be constructed as a recursive function of the NTK of previous layers and the network’s NNGPs. In the following sections we will derive the NNGPs for the SIREN and the simple sinusoidal network directly. We will then use these NNGPs with Theorem 6 to derive their NTKs as well.

To finalize this preliminary section, we also provide two propositions that will be useful in the proofs that follow.

Proposition 7.

For any \omega\in\mathbb{R}, x\in\mathbb{R}^{d},

\mathbb{E}_{w\sim\mathcal{N}(0,I_{d})}\left[e^{i\omega\left(w^{T}x\right)}\right]=e^{-\frac{\omega^{2}}{2}\|x\|^{2}_{2}}.
Proof.

Omitting w\sim\mathcal{N}(0,I_{d}) from the expectation for brevity, we have

\mathbb{E}\left[e^{i\omega\left(w^{T}x\right)}\right] = \mathbb{E}\left[e^{i\omega\sum_{j=1}^{d}w_{j}x_{j}}\right].

By independence of the components of w and the definition of expectation,

\mathbb{E}\left[e^{i\omega\sum_{j=1}^{d}w_{j}x_{j}}\right] = \prod_{j=1}^{d}\mathbb{E}\left[e^{i\omega\,w_{j}x_{j}}\right]=\prod_{j=1}^{d}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{i\omega\,w_{j}x_{j}}e^{-\frac{w_{j}^{2}}{2}}dw_{j}.

Completing the square, we get

\prod_{j=1}^{d}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{i\omega\,w_{j}x_{j}}e^{-\frac{1}{2}w_{j}^{2}}dw_{j} = \prod_{j=1}^{d}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{\frac{1}{2}\left(i^{2}\omega^{2}x_{j}^{2}-i^{2}\omega^{2}x_{j}^{2}+2i\omega x_{j}w_{j}-w_{j}^{2}\right)}dw_{j}
= \prod_{j=1}^{d}e^{\frac{1}{2}i^{2}\omega^{2}x_{j}^{2}}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-\frac{1}{2}\left(i^{2}\omega^{2}x_{j}^{2}-2i\omega x_{j}w_{j}+w_{j}^{2}\right)}dw_{j}
= \prod_{j=1}^{d}e^{-\frac{1}{2}\omega^{2}x_{j}^{2}}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-\frac{1}{2}\left(w_{j}-i\omega x_{j}\right)^{2}}dw_{j}.

Since the integral and its preceding factor constitute a Gaussian pdf, they integrate to 1, leaving the final result

\prod_{j=1}^{d}e^{-\frac{\omega^{2}}{2}x_{j}^{2}}=e^{-\frac{\omega^{2}}{2}\sum_{j=1}^{d}x_{j}^{2}}=e^{-\frac{\omega^{2}}{2}\|x\|_{2}^{2}}. ∎

Proposition 8.

For any c,\omega\in\mathbb{R}, x\in\mathbb{R}^{d},

\mathbb{E}_{w\sim\mathcal{U}_{d}(-c,c)}\left[e^{i\omega\left(w^{T}x\right)}\right]=\prod_{j=1}^{d}\mathrm{sinc}\left(c\,\omega x_{j}\right).
Proof.

Omitting w\sim\mathcal{U}_{d}(-c,c) from the expectation for brevity, we have

\mathbb{E}\left[e^{i\omega\left(w^{T}x\right)}\right] = \mathbb{E}\left[e^{i\omega\sum_{j=1}^{d}w_{j}x_{j}}\right].

By independence of the components of w and the definition of expectation,

\mathbb{E}\left[e^{i\omega\sum_{j=1}^{d}w_{j}x_{j}}\right] = \prod_{j=1}^{d}\mathbb{E}\left[e^{i\omega\,w_{j}x_{j}}\right]=\prod_{j=1}^{d}\int_{-c}^{c}e^{i\omega\,w_{j}x_{j}}\frac{1}{2c}dw_{j}=\prod_{j=1}^{d}\frac{1}{2c}\int_{-c}^{c}e^{i\omega\,w_{j}x_{j}}dw_{j}.

Now, focusing on the integral above, we have

\int_{-c}^{c}e^{i\omega\,w_{j}x_{j}}dw_{j} = \int_{-c}^{c}\cos\left(\omega\,w_{j}x_{j}\right)dw_{j}+i\int_{-c}^{c}\sin\left(\omega\,w_{j}x_{j}\right)dw_{j}
= \frac{\sin\left(\omega\,w_{j}x_{j}\right)}{\omega x_{j}}\Bigg|_{-c}^{c}-i\,\frac{\cos\left(\omega\,w_{j}x_{j}\right)}{\omega x_{j}}\Bigg|_{-c}^{c}
= \frac{2\sin\left(c\,\omega x_{j}\right)}{\omega x_{j}}.

Finally, plugging this back into the product above, we get

\prod_{j=1}^{d}\frac{1}{2c}\int_{-c}^{c}e^{i\omega\,w_{j}x_{j}}dw_{j} = \prod_{j=1}^{d}\frac{1}{2c}\frac{2\sin\left(c\,\omega x_{j}\right)}{\omega x_{j}} = \prod_{j=1}^{d}\mathrm{sinc}\left(c\,\omega x_{j}\right). ∎
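As a quick numerical sanity check (not part of the original analysis), both propositions can be verified by Monte Carlo; the sketch below, assuming NumPy, compares the empirical expectations against the closed forms, using the unnormalized sinc as in the text.

import numpy as np

rng = np.random.default_rng(0)
d, omega, c, n = 3, 2.0, np.sqrt(6), 500_000
x = rng.uniform(-0.5, 0.5, size=d)                        # an arbitrary nonzero test point

# Proposition 7: E_{w ~ N(0, I_d)}[exp(i omega w^T x)] = exp(-omega^2 ||x||^2 / 2)
w = rng.standard_normal((n, d))
lhs7 = np.exp(1j * omega * (w @ x)).mean()
rhs7 = np.exp(-0.5 * omega**2 * np.sum(x**2))

# Proposition 8: E_{w ~ U_d(-c, c)}[exp(i omega w^T x)] = prod_j sinc(c omega x_j)
w = rng.uniform(-c, c, size=(n, d))
lhs8 = np.exp(1j * omega * (w @ x)).mean()
rhs8 = np.prod(np.sin(c * omega * x) / (c * omega * x))   # unnormalized sinc

print(abs(lhs7 - rhs7), abs(lhs8 - rhs8))                 # both should be at the Monte Carlo noise level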

D.2 Shallow sinusoidal networks

For the next few proofs, we will be focusing on neural networks with a single hidden layer, i.e. L=1. Expanding the definition above, such a network is given by

f^{(1)}(x)=W^{(1)}\frac{1}{\sqrt{n_{1}}}\sin\left(\omega\left(W^{(0)}x+b^{(0)}\right)\right)+b^{(1)}. \qquad (4)

The advantage of analysing such shallow networks is that their NNGPs and NTKs have formulations that are intuitively interpretable, providing insight into their characteristics. We later extend these derivations to networks of arbitrary depth.

D.2.1 SIREN

First, let us derive the NNGP for a SIREN with a single hidden layer.

Theorem 9.

Shallow SIREN NNGP. For a single hidden layer SIREN f^{(1)}:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{2}} following Definition 1, as the size of the hidden layer n_{1}\rightarrow\infty, f^{(1)} tends (by law of large numbers) to the neural network Gaussian Process (NNGP) with covariance

\Sigma^{(1)}(x,\tilde{x})=\frac{c^{2}}{6}\left[\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}-\tilde{x}_{j}\right)\right)-e^{-2\omega^{2}}\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}+\tilde{x}_{j}\right)\right)\right]+1.
Proof.

We first show that despite the usage of a uniform distribution for the weights, this initialization scheme still leads to an NNGP. In this initial part, we follow an approach similar to Lee et al. (2018), with the modifications necessary for this conclusion to hold.

From our neural network definition, each element f^{(1)}(x)_{j} in the output vector is a weighted combination of elements in W^{(1)} and b^{(1)}. Conditioning on the outputs from the first layer (L=0), since the sine function is bounded and each of the parameters is uniformly distributed with finite variance and zero mean, the f^{(1)}(x)_{j} become normally distributed with mean zero as n_{1}\rightarrow\infty by the (Lyapunov) central limit theorem (CLT). Since any subset of elements in f^{(1)}(x) is jointly Gaussian, we have that this outer layer is described by a Gaussian process.

Now that we have concluded that this initialization scheme still entails an NNGP, we have that its covariance is determined by \sigma^{2}_{W}\Sigma^{(1)}+\sigma^{2}_{b}=\frac{c^{2}}{3}\Sigma^{(1)}+1, where

\Sigma^{(1)}(x,\tilde{x}) = \lim_{n_{1}\rightarrow\infty}\left[\frac{1}{n_{1}}\left\langle\sin\left(f^{(0)}(x)\right),\sin\left(f^{(0)}(\tilde{x})\right)\right\rangle\right]
= \lim_{n_{1}\rightarrow\infty}\left[\frac{1}{n_{1}}\sum_{j=1}^{n_{1}}\sin\left(f^{(0)}(x)\right)_{j}\sin\left(f^{(0)}(\tilde{x})\right)_{j}\right]
= \lim_{n_{1}\rightarrow\infty}\left[\frac{1}{n_{1}}\sum_{j=1}^{n_{1}}\sin\left(\omega\left(W^{(0)}_{j}x+b^{(0)}_{j}\right)\right)\sin\left(\omega\left(W^{(0)}_{j}\tilde{x}+b^{(0)}_{j}\right)\right)\right].

Now by the law of large numbers (LLN) the limit above converges to

\mathbb{E}_{w\sim\mathcal{U}_{n_{0}}(-c,c),\,b\sim\mathcal{N}(0,1)}\left[\sin\left(\omega\left(w^{T}x+b\right)\right)\sin\left(\omega\left(w^{T}\tilde{x}+b\right)\right)\right],

where w\in\mathbb{R}^{n_{0}} and b\in\mathbb{R}. Omitting the distributions from the expectation for brevity and expanding the exponential definition of sine, we have

\mathbb{E}\left[\frac{1}{2i}\left(e^{i\omega\left(w^{T}x+b\right)}-e^{-i\omega\left(w^{T}x+b\right)}\right)\frac{1}{2i}\left(e^{i\omega\left(w^{T}\tilde{x}+b\right)}-e^{-i\omega\left(w^{T}\tilde{x}+b\right)}\right)\right]
=-\frac{1}{4}\mathbb{E}\left[e^{i\omega\left(w^{T}x+b\right)+i\omega\left(w^{T}\tilde{x}+b\right)}-e^{i\omega\left(w^{T}x+b\right)-i\omega\left(w^{T}\tilde{x}+b\right)}-e^{-i\omega\left(w^{T}x+b\right)+i\omega\left(w^{T}\tilde{x}+b\right)}+e^{-i\omega\left(w^{T}x+b\right)-i\omega\left(w^{T}\tilde{x}+b\right)}\right]
=-\frac{1}{4}\left[\mathbb{E}\left[e^{i\omega\left(w^{T}\left(x+\tilde{x}\right)\right)}\right]\mathbb{E}\left[e^{2i\omega b}\right]-\mathbb{E}\left[e^{i\omega\left(w^{T}\left(x-\tilde{x}\right)\right)}\right]-\mathbb{E}\left[e^{i\omega\left(w^{T}\left(\tilde{x}-x\right)\right)}\right]+\mathbb{E}\left[e^{i\omega\left(w^{T}\left(-x-\tilde{x}\right)\right)}\right]\mathbb{E}\left[e^{-2i\omega b}\right]\right].

Applying Propositions 7 and 8 to each expectation above and noting that the \mathrm{sinc} function is even, we are left with

-\frac{1}{4}\left[2e^{-2\omega^{2}}\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}+\tilde{x}_{j}\right)\right)-2\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}-\tilde{x}_{j}\right)\right)\right]
=\frac{1}{2}\left[\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}-\tilde{x}_{j}\right)\right)-e^{-2\omega^{2}}\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}+\tilde{x}_{j}\right)\right)\right]. ∎

For simplicity, if we take the case of a one-dimensional input (e.g., an audio signal), with the standard SIREN setting of c=\sqrt{6}, the NNGP reduces to

\Sigma^{(1)}(x,\tilde{x})=\mathrm{sinc}\left(\sqrt{6}\,\omega\left(x-\tilde{x}\right)\right)-e^{-2\omega^{2}}\mathrm{sinc}\left(\sqrt{6}\,\omega\left(x+\tilde{x}\right)\right)+1.

We can already notice that this kernel is composed of \mathrm{sinc} functions. The \mathrm{sinc} function is the ideal low-pass filter. For any value of \omega>1, we can see that the first term in the expression above will completely dominate the expression, due to the exponential e^{-2\omega^{2}} factor. In practice, \omega is commonly set to values at least one order of magnitude above 1, if not multiple orders of magnitude above that in certain cases (e.g., high frequency audio signals). This leaves us with simply

\Sigma^{(1)}(x,\tilde{x})=\mathrm{sinc}\left(\sqrt{6}\,\omega\left(x-\tilde{x}\right)\right)+1.

Notice that not only does our kernel reduce to the \mathrm{sinc} function, but it also reduces to a function solely of \Delta x=x-\tilde{x}. This agrees with the shift-invariant property we observe in SIRENs, since the NNGP is dependent only on \Delta x, but not on the particular values of x and \tilde{x}. Notice also that \omega defines the bandwidth of the \mathrm{sinc} function, thus determining the maximum frequencies it allows to pass.

The general \mathrm{sinc} form and the shift-invariance of this kernel can be visualized in Figure 15, along with the effect of varying \omega on the bandwidth of the NNGP kernel.

Figure 15: The NNGP for SIREN at different \omega values. The top row shows the kernel values for pairs (x,\tilde{x})\in[-1,1]^{2}. Bottom row shows a slice at fixed \tilde{x}=0.
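The slices in Figure 15 can be reproduced directly from the closed form above. The sketch below (assuming NumPy, with the unnormalized sinc used in the text and illustrative \omega values) evaluates the shallow SIREN NNGP on the slice \tilde{x}=0 for a one-dimensional input.

import numpy as np

def sinc(t):
    # unnormalized sinc, sin(t)/t with sinc(0) = 1 (np.sinc is the normalized variant)
    return np.sinc(t / np.pi)

def shallow_siren_nngp(x, x_tilde, omega, c=np.sqrt(6)):
    # Theorem 9 for a one-dimensional input
    diff = sinc(c * omega * (x - x_tilde))
    summ = sinc(c * omega * (x + x_tilde))
    return (c**2 / 6) * (diff - np.exp(-2 * omega**2) * summ) + 1

xs = np.linspace(-1, 1, 513)
slices = {omega: shallow_siren_nngp(xs, 0.0, omega) for omega in (4, 10, 30)}
# plotting xs against slices[omega] yields curves like the bottom row of Figure 15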

We can see that the NTK of the shallow SIREN, derived below, maintains the same relevant characteristics as the NNGP. We first derive \dot{\Sigma} in the Lemma below.

Lemma 10.

For \omega\in\mathbb{R}, \dot{\Sigma}^{(1)}(x,\tilde{x}):\mathbb{R}^{n_{0}}\times\mathbb{R}^{n_{0}}\rightarrow\mathbb{R} is given by

\dot{\Sigma}^{(1)}(x,\tilde{x})=\frac{c^{2}}{6}\left[\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}-\tilde{x}_{j}\right)\right)+e^{-2\omega^{2}}\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}+\tilde{x}_{j}\right)\right)\right]+1.
Proof.

The proof follows the same pattern as Theorem 9, with the only difference being a few sign changes after the exponential expansion of the trigonometric functions, due to the different identities for sine and cosine. ∎

Now we can derive the NTK for the shallow SIREN.

Corollary 11.

Shallow SIREN NTK. For a single hidden layer SIREN f^{(1)}:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{2}} following Definition 1, its neural tangent kernel (NTK), as defined in Theorem 6, is given by

\Theta^{(1)}(x,\tilde{x}) = \left(\omega^{2}\left(x^{T}\tilde{x}+1\right)\right)\left(\frac{c^{2}}{6}\left[\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}-\tilde{x}_{j}\right)\right)+e^{-2\omega^{2}}\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}+\tilde{x}_{j}\right)\right)\right]+1\right)
\quad+\frac{c^{2}}{6}\left[\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}-\tilde{x}_{j}\right)\right)-e^{-2\omega^{2}}\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}+\tilde{x}_{j}\right)\right)\right]+1
=\frac{c^{2}}{6}\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)+1\right)\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}-\tilde{x}_{j}\right)\right)
\quad+\frac{c^{2}}{6}\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)-1\right)e^{-2\omega^{2}}\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}+\tilde{x}_{j}\right)\right)+\omega^{2}\left(x^{T}\tilde{x}+1\right)+1.
Proof.

Follows trivially by applying Theorem 9 and Lemma 10 to Theorem 6. ∎

Though the expressions become more complex due to the formulation of the NTK, we can see that many of the same properties from the NNGP still apply. Again, for reasonable values of \omega, the term with the exponential factor e^{-2\omega^{2}} will be of negligible relative magnitude. With c=\sqrt{6}, this leaves us with

\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)+1\right)\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(\sqrt{6}\,\omega\left(x_{j}-\tilde{x}_{j}\right)\right)+\omega^{2}\left(x^{T}\tilde{x}+1\right)+1,

which is of the same form as the NNGP, with some additional linear terms x^{T}\tilde{x}. Though these linear terms break the pure shift-invariance, we still have a strong diagonal and the \mathrm{sinc} form with bandwidth determined by \omega, as can be seen in Figure 16.

Similarly to the NNGP, the SIREN NTK suggests that training a shallow SIREN is approximately equivalent to performing kernel regression with a \mathrm{sinc} kernel, a low-pass filter, with its bandwidth defined by \omega. This agrees intuitively with the experimental observations from the paper that in order to fit higher frequency signals, a larger \omega is required.

Figure 16: The NTK for SIREN at different \omega values. The top row shows the kernel values for pairs (x,\tilde{x})\in[-1,1]^{2}. Bottom row shows a slice at fixed \tilde{x}=0.
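To make the kernel-regression view above concrete, the sketch below (assuming NumPy; a toy setup rather than an experiment from the paper) performs kernel regression with the simplified shallow SIREN NTK on a one-dimensional signal; a small ridge term is added only for numerical stability.

import numpy as np

def shallow_siren_ntk(x, x_tilde, omega, c=np.sqrt(6)):
    # Simplified shallow SIREN NTK (one-dimensional inputs, e^{-2 omega^2} term dropped);
    # np.sinc is normalized, so dividing the argument by pi recovers sin(t)/t.
    lin = omega**2 * (x * x_tilde + 1)
    return (lin + 1) * np.sinc(c * omega * (x - x_tilde) / np.pi) + lin + 1

def ntk_regression(x_train, y_train, x_test, omega, reg=1e-6):
    # Kernel regression: f(x*) = Theta(x*, X) (Theta(X, X) + reg I)^{-1} y
    K = shallow_siren_ntk(x_train[:, None], x_train[None, :], omega)
    k_star = shallow_siren_ntk(x_test[:, None], x_train[None, :], omega)
    alpha = np.linalg.solve(K + reg * np.eye(len(x_train)), y_train)
    return k_star @ alpha

x_tr = np.linspace(-1, 1, 128)
y_tr = np.sin(8 * np.pi * x_tr)                  # a band-limited target signal
x_te = np.linspace(-1, 1, 1024)
y_hat = ntk_regression(x_tr, y_tr, x_te, omega=30.0)

Lowering \omega in this sketch narrows the kernel bandwidth, and the interpolant can no longer capture the higher frequencies of the signal, mirroring the discussion above.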

D.2.2 Simple sinusoidal network

Just as we did in the last section, we will now first derive the NNGP for a simple sinusoidal network, and then use that in order to obtain its NTK as well. As we will see, the Gaussian initialization employed in the SSN has the benefit of rendering the derivations cleaner, while retaining the relevant properties from the SIREN initialization. We observe that a similar derivation of this NNGP (using cosine functions instead of sine) can be found in Pearce et al. (2019), with a focus on a Bayesian perspective for the result.

Theorem 12.

Shallow SSN NNGP. For a single hidden layer simple sinusoidal network f^{(1)}:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{2}} following Definition 1, as the size of the hidden layer n_{1}\rightarrow\infty, f^{(1)} tends (by law of large numbers) to the neural network Gaussian Process (NNGP) with covariance

\Sigma^{(1)}(x,\tilde{x})=\frac{1}{2}\left(e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}-e^{-\frac{\omega^{2}}{2}\|x+\tilde{x}\|_{2}^{2}}e^{-2\omega^{2}}\right)+1.
Proof.

We again initially follow an approach similar to the one described in Lee et al. (2018).

From our sinusoidal network definition, each element f^{(1)}(x)_{j} in the output vector is a weighted combination of elements in W^{(1)} and b^{(1)}. Conditioning on the outputs from the first layer (L=0), since the sine function is bounded and each of the parameters is Gaussian with finite variance and zero mean, the f^{(1)}(x)_{j} are also normally distributed with mean zero by the CLT. Since any subset of elements in f^{(1)}(x) is jointly Gaussian, we have that this outer layer is described by a Gaussian process.

Therefore, its covariance is determined by \sigma^{2}_{W}\Sigma^{(1)}+\sigma^{2}_{b}=\Sigma^{(1)}+1, where

\Sigma^{(1)}(x,\tilde{x}) = \lim_{n_{1}\rightarrow\infty}\left[\frac{1}{n_{1}}\left\langle\sin\left(f^{(0)}(x)\right),\sin\left(f^{(0)}(\tilde{x})\right)\right\rangle\right]
= \lim_{n_{1}\rightarrow\infty}\left[\frac{1}{n_{1}}\sum_{j=1}^{n_{1}}\sin\left(f^{(0)}(x)\right)_{j}\sin\left(f^{(0)}(\tilde{x})\right)_{j}\right]
= \lim_{n_{1}\rightarrow\infty}\left[\frac{1}{n_{1}}\sum_{j=1}^{n_{1}}\sin\left(\omega\left(W^{(0)}_{j}x+b^{(0)}_{j}\right)\right)\sin\left(\omega\left(W^{(0)}_{j}\tilde{x}+b^{(0)}_{j}\right)\right)\right].

Now by the LLN the limit above converges to

\mathbb{E}_{w\sim\mathcal{N}(0,I_{n_{0}}),\,b\sim\mathcal{N}(0,1)}\left[\sin\left(\omega\left(w^{T}x+b\right)\right)\sin\left(\omega\left(w^{T}\tilde{x}+b\right)\right)\right],

where w\in\mathbb{R}^{n_{0}} and b\in\mathbb{R}. Omitting the distributions from the expectation for brevity and expanding the exponential definition of sine, we have

\mathbb{E}\left[\frac{1}{2i}\left(e^{i\omega\left(w^{T}x+b\right)}-e^{-i\omega\left(w^{T}x+b\right)}\right)\frac{1}{2i}\left(e^{i\omega\left(w^{T}\tilde{x}+b\right)}-e^{-i\omega\left(w^{T}\tilde{x}+b\right)}\right)\right]
=-\frac{1}{4}\mathbb{E}\left[e^{i\omega\left(w^{T}x+b\right)+i\omega\left(w^{T}\tilde{x}+b\right)}-e^{i\omega\left(w^{T}x+b\right)-i\omega\left(w^{T}\tilde{x}+b\right)}-e^{-i\omega\left(w^{T}x+b\right)+i\omega\left(w^{T}\tilde{x}+b\right)}+e^{-i\omega\left(w^{T}x+b\right)-i\omega\left(w^{T}\tilde{x}+b\right)}\right]
=-\frac{1}{4}\left[\mathbb{E}\left[e^{i\omega\left(w^{T}\left(x+\tilde{x}\right)\right)}\right]\mathbb{E}\left[e^{2i\omega b}\right]-\mathbb{E}\left[e^{i\omega\left(w^{T}\left(x-\tilde{x}\right)\right)}\right]-\mathbb{E}\left[e^{i\omega\left(w^{T}\left(\tilde{x}-x\right)\right)}\right]+\mathbb{E}\left[e^{i\omega\left(w^{T}\left(-x-\tilde{x}\right)\right)}\right]\mathbb{E}\left[e^{-2i\omega b}\right]\right].

Applying Proposition 7 to each expectation above, it becomes

-\frac{1}{4}\left(e^{-\frac{\omega^{2}}{2}\|x+\tilde{x}\|_{2}^{2}}e^{-2\omega^{2}}-e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}-e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}+e^{-\frac{\omega^{2}}{2}\|x+\tilde{x}\|_{2}^{2}}e^{-2\omega^{2}}\right)
=\frac{1}{2}\left(e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}-e^{-\frac{\omega^{2}}{2}\|x+\tilde{x}\|_{2}^{2}}e^{-2\omega^{2}}\right). ∎

We can once again observe that, for practical values of \omega, the NNGP simplifies to

\frac{1}{2}e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}+1.

This takes the form of a Gaussian kernel, which is also a low-pass filter, with its bandwidth determined by \omega. We note that, similar to the c=\sqrt{6} setting from SIRENs, in practice a scaling factor of \sqrt{2} is applied to the normal activations, as described in Section 3, which cancels out the 1/2 factors from the kernels, preserving the variance magnitude.

Moreover, we can also observe again that the kernel is a function solely of \Delta x, in agreement with the shift invariance that is also observed in simple sinusoidal networks. Visualizations of this NNGP are provided in Figure 17.

Figure 17: The NNGP for SSN at different \omega values. The top row shows the kernel values for pairs (x,\tilde{x})\in[-1,1]^{2}. Bottom row shows a slice at fixed \tilde{x}=0.

We will now proceed to derive the NTK, which requires first obtaining \dot{\Sigma}.

Lemma 13.

For \omega\in\mathbb{R}, \dot{\Sigma}^{(1)}(x,\tilde{x}):\mathbb{R}^{n_{0}}\times\mathbb{R}^{n_{0}}\rightarrow\mathbb{R} is given by

\dot{\Sigma}^{(1)}(x,\tilde{x})=\frac{1}{2}\left(e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}+e^{-\frac{\omega^{2}}{2}\|x+\tilde{x}\|_{2}^{2}}e^{-2\omega^{2}}\right)+1.
Proof.

The proof follows the same pattern as Theorem 12, with the only difference being a few sign changes after the exponential expansion of the trigonometric functions, due to the different identities for sine and cosine.

Corollary 14.

Shallow SSN NTK. For a simple sinusoidal network with a single hidden layer f^{(1)}:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{2}} following Definition 1, its neural tangent kernel (NTK), as defined in Theorem 6, is given by

\Theta^{(1)}(x,\tilde{x}) = \left(\omega^{2}\left(x^{T}\tilde{x}+1\right)\right)\left[\frac{1}{2}\left(e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}+e^{-\frac{\omega^{2}}{2}\|x+\tilde{x}\|_{2}^{2}}e^{-2\omega^{2}}\right)+1\right]
\quad+\frac{1}{2}\left(e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}-e^{-\frac{\omega^{2}}{2}\|x+\tilde{x}\|_{2}^{2}}e^{-2\omega^{2}}\right)+1
=\frac{1}{2}\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)+1\right)e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}
\quad+\frac{1}{2}\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)-1\right)e^{-\frac{\omega^{2}}{2}\|x+\tilde{x}\|_{2}^{2}}e^{-2\omega^{2}}+\omega^{2}\left(x^{T}\tilde{x}+1\right)+1.
Proof.

Follows trivially by applying Theorem 12 and Lemma 13 to Theorem 6. ∎

We again note the vanishing factor e^{-2\omega^{2}}, which leaves us with

\frac{1}{2}\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)+1\right)e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}+\omega^{2}\left(x^{T}\tilde{x}+1\right)+1. \qquad (5)

As with the SIREN before, this NTK is still of the same form as its corresponding NNGP. While again we have additional linear terms x^{T}\tilde{x} in the NTK compared to the NNGP, in this case as well the kernel preserves its strong diagonal. It is still close to a Gaussian kernel, with its bandwidth determined directly by \omega. We demonstrate this in Figure 18, where the NTK for different values of \omega is shown. Additionally, we also plot a pure Gaussian kernel with variance \omega^{2}, scaled to match the maximum and minimum values of the NTK. We can observe the NTK kernel closely matches the Gaussian. Moreover, we can also observe that, at \tilde{x}=0, the maximum value is predicted by k\approx\omega^{2}/2, as expected from the scaling factors in the kernel in Equation 5.

This NTK suggests that training a simple sinusoidal network is approximately equivalent to performing kernel regression with a Gaussian kernel, a low-pass filter, with its bandwidth defined by \omega.
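As a small numerical illustration of this comparison (a sketch under the simplifications above, assuming NumPy, not the plotting code used for the figures), Equation 5 can be evaluated on a slice and overlaid with a Gaussian rescaled to the same range, as in Figure 18.

import numpy as np

def shallow_ssn_ntk(x, x_tilde, omega):
    # Equation 5: simplified shallow SSN NTK for one-dimensional inputs
    lin = omega**2 * (x * x_tilde + 1)
    return 0.5 * (lin + 1) * np.exp(-0.5 * omega**2 * (x - x_tilde)**2) + lin + 1

omega = 10.0
xs = np.linspace(-1, 1, 1001)
ntk = shallow_ssn_ntk(xs, 0.0, omega)                       # slice at x_tilde = 0
gauss = np.exp(-0.5 * omega**2 * xs**2)                     # Gaussian with the same bandwidth
gauss_scaled = ntk.min() + (ntk.max() - ntk.min()) * gauss  # rescaled to the NTK's range
print(np.max(np.abs(ntk - gauss_scaled)))                   # the two curves agree closely on this slice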

We note that even though this sinusoidal network kernel approximates a Gaussian kernel, an actual Gaussian kernel can be recovered if a combination of sine and cosine activations is employed, as demonstrated in Tsuchida (2020) (Proposition 18).

Figure 18: The NTK for SSN at different \omega values. The top row shows the kernel values for pairs (x,\tilde{x})\in[-1,1]^{2}. Bottom row shows a slice at fixed \tilde{x}=0, together with a Gaussian kernel scaled to match the maximum and minimum values of the NTK.

D.3 Deep sinusoidal networks

We will now look at the full NNGP and NTK for sinusoidal networks of arbitrary depth. As we will see, due to the recursive nature of these kernels, for networks deeper than the ones analyzed in the previous section, their fully unrolled expressions quickly become intractable, especially for the NTK. Nevertheless, these kernels can still provide some insight into the behavior of their corresponding networks. Moreover, despite their symbolic complexity, we will also demonstrate empirically that the resulting kernels can be approximated by simple Gaussian kernels, even for deep networks.

D.3.1 Simple sinusoidal network

As demonstrated in the previous section, simple sinusoidal networks produce simpler NNGP and NTK kernels due to their Gaussian initialization. We thus begin this section with SSNs, starting with their general NNGP.

Theorem 15.

SSN NNGP. For a simple sinusoidal network with L hidden layers f^{(L)}:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{L+1}} following Definition 1, as the size of the hidden layers n_{1},\dots,n_{L}\rightarrow\infty sequentially, f^{(L)} tends (by law of large numbers) to the neural network Gaussian Process (NNGP) with covariance \Sigma^{(L)}(x,\tilde{x}), recursively defined as

\Sigma^{(0)}(x,\tilde{x}) = \omega^{2}\left(x^{T}\tilde{x}+1\right)
\Sigma^{(L)}(x,\tilde{x}) = \frac{1}{2}e^{-\frac{1}{2}\left(\Sigma^{(L-1)}(x,x)+\Sigma^{(L-1)}(\tilde{x},\tilde{x})\right)}\left(e^{\Sigma^{(L-1)}(x,\tilde{x})}-e^{-\Sigma^{(L-1)}(x,\tilde{x})}\right)+1.
Proof.

We will proceed by induction on the depth L, demonstrating the NNGP for successive layers as n_{1},\dots,n_{L}\rightarrow\infty sequentially. To demonstrate the base case L=1, let us rearrange \Sigma^{(1)} from Theorem 12 in order to express it in terms of inner products,

\Sigma^{(1)}(x,\tilde{x}) = \frac{1}{2}\left(e^{-\frac{\omega^{2}}{2}\|x-\tilde{x}\|_{2}^{2}}-e^{-\frac{\omega^{2}}{2}\|x+\tilde{x}\|_{2}^{2}}e^{-2\omega^{2}}\right)+1
= \frac{1}{2}\left[e^{-\frac{\omega^{2}}{2}\left(x^{T}x-2x^{T}\tilde{x}+\tilde{x}^{T}\tilde{x}\right)}-e^{-\frac{\omega^{2}}{2}\left(x^{T}x+2x^{T}\tilde{x}+\tilde{x}^{T}\tilde{x}\right)}e^{-2\omega^{2}}\right]+1
= \frac{1}{2}\left[e^{-\frac{1}{2}\left[\omega^{2}\left(x^{T}x+1\right)+\omega^{2}\left(\tilde{x}^{T}\tilde{x}+1\right)\right]+\omega^{2}\left(x^{T}\tilde{x}+1\right)}-e^{-\frac{1}{2}\left[\omega^{2}\left(x^{T}x+1\right)+\omega^{2}\left(\tilde{x}^{T}\tilde{x}+1\right)\right]-\omega^{2}\left(x^{T}\tilde{x}+1\right)}\right]+1.

Given the definition of \Sigma^{(0)}, this is equivalent to

\frac{1}{2}e^{-\frac{1}{2}\left(\Sigma^{(0)}(x,x)+\Sigma^{(0)}(\tilde{x},\tilde{x})\right)}\left(e^{\Sigma^{(0)}(x,\tilde{x})}-e^{-\Sigma^{(0)}(x,\tilde{x})}\right)+1,

which concludes this case.

Now, given the inductive hypothesis, as n_{1},\dots,n_{L-1}\rightarrow\infty we have that the first L-1 layers define a network f^{(L-1)} with NNGP given by \Sigma^{(L-1)}(x,\tilde{x}). It is now left to show that as n_{L}\rightarrow\infty, we get the NNGP given by \Sigma^{(L)}. Following the same argument as in Theorem 12, the network

f^{(L)}(x)=W^{(L)}\frac{1}{\sqrt{n_{L}}}\sin\left(f^{(L-1)}\right)+b^{(L)}

constitutes a Gaussian process given the outputs of the previous layer, due to the distributions of W^{(L)} and b^{(L)}. Its covariance is given by \sigma^{2}_{W}\Sigma^{(L)}+\sigma^{2}_{b}=\Sigma^{(L)}+1, where

\Sigma^{(L)}(x,\tilde{x}) = \lim_{n_{L}\rightarrow\infty}\left[\frac{1}{n_{L}}\left\langle\sin\left(f^{(L-1)}(x)\right),\sin\left(f^{(L-1)}(\tilde{x})\right)\right\rangle\right]
= \lim_{n_{L}\rightarrow\infty}\left[\frac{1}{n_{L}}\sum_{j=1}^{n_{L}}\sin\left(f^{(L-1)}(x)\right)_{j}\sin\left(f^{(L-1)}(\tilde{x})\right)_{j}\right].

By the inductive hypothesis, f^{(L-1)} is a Gaussian process with covariance \Sigma^{(L-1)}(x,\tilde{x}). Thus by the LLN the limit above equals

\mathbb{E}_{(u,v)\sim\mathcal{N}\left(0,\Sigma^{(L-1)}(x,\tilde{x})\right)}\left[\sin\left(u\right)\sin\left(v\right)\right].

Omitting the distribution from the expectation for brevity and expanding the exponential definition of sine, we have

\mathbb{E}\left[\frac{1}{2i}\left(e^{iu}-e^{-iu}\right)\frac{1}{2i}\left(e^{iv}-e^{-iv}\right)\right] = -\frac{1}{4}\left[\mathbb{E}\left[e^{i\left(u+v\right)}\right]-\mathbb{E}\left[e^{i\left(u-v\right)}\right]-\mathbb{E}\left[e^{-i\left(u-v\right)}\right]+\mathbb{E}\left[e^{-i\left(u+v\right)}\right]\right].

Since u and v are jointly Gaussian, p=u+v and m=u-v are also Gaussian, with mean 0 and variances

\sigma_{p}^{2} = \sigma_{u}^{2}+\sigma_{v}^{2}+2\,\mathrm{Cov}[u,v] = \Sigma^{(L-1)}(x,x)+\Sigma^{(L-1)}(\tilde{x},\tilde{x})+2\,\Sigma^{(L-1)}(x,\tilde{x}),
\sigma_{m}^{2} = \sigma_{u}^{2}+\sigma_{v}^{2}-2\,\mathrm{Cov}[u,v] = \Sigma^{(L-1)}(x,x)+\Sigma^{(L-1)}(\tilde{x},\tilde{x})-2\,\Sigma^{(L-1)}(x,\tilde{x}).

We can now rewrite the expectations in terms of normalized variables,

-\frac{1}{4}\left[\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left[e^{i\sigma_{p}z}\right]-\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left[e^{i\sigma_{m}z}\right]-\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left[e^{-i\sigma_{m}z}\right]+\mathbb{E}_{z\sim\mathcal{N}(0,1)}\left[e^{-i\sigma_{p}z}\right]\right].

Applying Proposition 7 to each expectation, we get

\frac{1}{2}\left[e^{-\frac{1}{2}\sigma_{m}^{2}}-e^{-\frac{1}{2}\sigma_{p}^{2}}\right]
= \frac{1}{2}\left[e^{-\frac{1}{2}\left(\Sigma^{(L-1)}(x,x)+\Sigma^{(L-1)}(\tilde{x},\tilde{x})-2\,\Sigma^{(L-1)}(x,\tilde{x})\right)}-e^{-\frac{1}{2}\left(\Sigma^{(L-1)}(x,x)+\Sigma^{(L-1)}(\tilde{x},\tilde{x})+2\,\Sigma^{(L-1)}(x,\tilde{x})\right)}\right]
= \frac{1}{2}e^{-\frac{1}{2}\left(\Sigma^{(L-1)}(x,x)+\Sigma^{(L-1)}(\tilde{x},\tilde{x})\right)}\left(e^{\Sigma^{(L-1)}(x,\tilde{x})}-e^{-\Sigma^{(L-1)}(x,\tilde{x})}\right). ∎

Unrolling the definition beyond L=1 leads to expressions that are difficult to parse. However, without unrolling, we can rearrange the terms in the NNGP above as

\Sigma^{(L)}(x,\tilde{x}) = \frac{1}{2}e^{-\frac{1}{2}\left(\Sigma^{(L-1)}(x,x)+\Sigma^{(L-1)}(\tilde{x},\tilde{x})\right)}\left(e^{\Sigma^{(L-1)}(x,\tilde{x})}-e^{-\Sigma^{(L-1)}(x,\tilde{x})}\right)+1
= \frac{1}{2}\left[e^{-\frac{1}{2}\left(\Sigma^{(L-1)}(x,x)-2\Sigma^{(L-1)}(x,\tilde{x})+\Sigma^{(L-1)}(\tilde{x},\tilde{x})\right)}-e^{-\frac{1}{2}\left(\Sigma^{(L-1)}(x,x)+2\Sigma^{(L-1)}(x,\tilde{x})+\Sigma^{(L-1)}(\tilde{x},\tilde{x})\right)}\right]+1.

Since the covariance matrix \Sigma^{(L-1)} is positive semi-definite, we can observe that the exponent expressions can be reformulated into quadratic forms analogous to the ones in Theorem 12. We can thus observe that the same structure is essentially preserved through the composition of layers, except for the \omega factor present in the first layer. Moreover, given this recursive definition, since the NNGP at any given depth L is a function only of the preceding kernels, the resulting kernel will also be shift-invariant.

Let us now derive the \dot{\Sigma} kernel, required for the NTK.

Lemma 16.

For \omega\in\mathbb{R}, \dot{\Sigma}^{(L)}(x,\tilde{x}):\mathbb{R}^{n_{0}}\times\mathbb{R}^{n_{0}}\rightarrow\mathbb{R} is given by

\dot{\Sigma}^{(L)}(x,\tilde{x}) = \frac{1}{2}e^{-\frac{1}{2}\left(\Sigma^{(L-1)}(x,x)+\Sigma^{(L-1)}(\tilde{x},\tilde{x})\right)}\left(e^{\Sigma^{(L-1)}(x,\tilde{x})}+e^{-\Sigma^{(L-1)}(x,\tilde{x})}\right)+1.
Proof.

The proof follows the same pattern as Theorem 15, with the only difference being a few sign changes after the exponential expansion of the trigonometric functions, due to the different identities for sine and cosine. ∎

As done in the previous section, it would be simple to now derive the full NTK for a simple sinusoidal network of arbitrary depth by applying Theorem 6 with the NNGP kernels from above. However, there is not much to be gained by writing the convoluted NTK expression explicitly, beyond what we have already gleaned from the NNGP above.

Nevertheless, some insight can be gained from the recursive expression of the NTK itself, as defined in Theorem 6. First, note that, as before, for practical values of \omega, \dot{\Sigma}\approx\Sigma, both converging to simply a single Gaussian kernel. Thus, our NTK recursion becomes

\Theta^{(L)}(x,\tilde{x}) \approx \left(\Theta^{(L-1)}(x,\tilde{x})+1\right)\Sigma^{(L)}(x,\tilde{x}).

Now, note that when expanded, this NTK recursion is essentially a product of the Gaussian \Sigma kernels,

\Theta^{(L)}(x,\tilde{x}) \approx \left(\left(\dots\left(\left(\Sigma^{(0)}(x,\tilde{x})+1\right)\Sigma^{(1)}(x,\tilde{x})+1\right)\dots\right)\Sigma^{(L-1)}(x,\tilde{x})+1\right)\Sigma^{(L)}(x,\tilde{x}) \qquad (6)
= \left(\left(\dots\left(\left(\omega^{2}\left(x^{T}\tilde{x}+1\right)+1\right)\Sigma^{(1)}(x,\tilde{x})+1\right)\dots\right)\Sigma^{(L-1)}(x,\tilde{x})+1\right)\Sigma^{(L)}(x,\tilde{x}).

We know that the product of two Gaussian kernels is Gaussian, and thus the general form of the kernel should be approximately a sum of Gaussian kernels. As long as the magnitude of one of the terms dominates the sum, the overall resulting kernel will be approximately Gaussian. Empirically, we observe this to be the case, with the inner term containing \omega^{2} dominating the sum, for reasonable values (e.g., \omega>1 and L<10). In Figure 19, we show the NTK for networks of varying depth and \omega, together with a pure Gaussian kernel of variance \omega^{2}, scaled to match the maximum and minimum values of the NTK. We can observe that the NTKs are still approximately Gaussian, with their maximum value approximated by k\approx\frac{1}{2^{L}}\omega^{2}, as expected from the product of \omega^{2} and the L kernels above. We also observe that the width of the kernels is mainly defined by \omega.
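The recursion itself is straightforward to evaluate numerically. The sketch below (assuming NumPy; an illustrative check rather than the original code) iterates the SSN NNGP recursion of Theorem 15 together with \dot{\Sigma} from Lemma 16 and the NTK recursion of Theorem 6, which can then be compared against a rescaled Gaussian as in Figure 19.

import numpy as np

def deep_ssn_ntk(x, x_tilde, omega, depth):
    # Theorems 6 and 15 and Lemma 16, evaluated pointwise for scalar inputs.
    # (For very large omega the first step should be rearranged in log-space to avoid overflow.)
    sigma_xx = omega**2 * (x * x + 1)                  # Sigma^(0)(x, x)
    sigma_xt = omega**2 * (x * x_tilde + 1)            # Sigma^(0)(x, x~)
    sigma_tt = omega**2 * (x_tilde * x_tilde + 1)      # Sigma^(0)(x~, x~)
    theta = sigma_xt                                   # Theta^(0) = Sigma^(0)
    for _ in range(depth):
        pref = 0.5 * np.exp(-0.5 * (sigma_xx + sigma_tt))
        sigma_dot = pref * (np.exp(sigma_xt) + np.exp(-sigma_xt)) + 1  # Lemma 16
        new_xt = pref * (np.exp(sigma_xt) - np.exp(-sigma_xt)) + 1     # Theorem 15
        new_xx = 0.5 * (1 - np.exp(-2 * sigma_xx)) + 1                 # same recursion at (x, x)
        new_tt = 0.5 * (1 - np.exp(-2 * sigma_tt)) + 1                 # and at (x~, x~)
        theta = theta * sigma_dot + new_xt                             # Theorem 6
        sigma_xx, sigma_xt, sigma_tt = new_xx, new_xt, new_tt
    return theta

xs = np.linspace(-1, 1, 513)
ntk = deep_ssn_ntk(xs, 0.0, omega=10.0, depth=6)            # slice at x~ = 0, as in Figure 19
gauss = np.exp(-0.5 * 10.0**2 * xs**2)
gauss_scaled = ntk.min() + (ntk.max() - ntk.min()) * gauss  # rescaled Gaussian for comparison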

Figure 19: The NTK for SSN at different \omega values (4, 10, 30) and network depths (L=4, 6, 8). Kernel values at a slice for fixed \tilde{x}=0 are shown, together with a Gaussian kernel scaled to match the maximum and minimum values of the NTK.

D.3.2 SIREN

For completeness, in this section we will derive the full SIREN NNGP and NTK. As discussed previously, both the SIREN and the simple sinusoidal network have kernels that approximate low-pass filters. Due to the SIREN initialization, its NNGP and NTK were previously shown to have more complex expressions. However, we will show in this section that the \mathrm{sinc} kernel that arises from the shallow SIREN is progressively “dampened” as the depth of the network increases, gradually approximating a Gaussian kernel.

Theorem 17.

SIREN NNGP. For a SIREN with L hidden layers f^{(L)}:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{L+1}} following Definition 1, as the size of the hidden layers n_{1},\dots,n_{L}\rightarrow\infty sequentially, f^{(L)} tends (by law of large numbers) to the neural network Gaussian Process (NNGP) with covariance \Sigma^{(L)}(x,\tilde{x}), recursively defined as

\Sigma^{(1)}(x,\tilde{x}) = \frac{c^{2}}{6}\left[\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}-\tilde{x}_{j}\right)\right)-e^{-2\omega^{2}}\prod_{j=1}^{n_{0}}\mathrm{sinc}\left(c\,\omega\left(x_{j}+\tilde{x}_{j}\right)\right)\right]+1
\Sigma^{(L)}(x,\tilde{x}) = \frac{1}{2}e^{-\frac{1}{2}\left(\Sigma^{(L-1)}(x,x)+\Sigma^{(L-1)}(\tilde{x},\tilde{x})\right)}\left(e^{\Sigma^{(L-1)}(x,\tilde{x})}-e^{-\Sigma^{(L-1)}(x,\tilde{x})}\right)+1.
Proof.

Intuitively, after the first hidden layer, the inputs to every subsequent hidden layer are of infinite width, due to the NNGP assumptions. Therefore, due to the CLT, the pre-activation values at every layer are Gaussian, and the NNGP is unaffected by the uniform weight initialization (compared to the Gaussian weight initialization case). The only layer for which this is not the case is the first layer, since the input size is fixed and finite. This gives rise to the different \Sigma^{(1)}.

Formally, the proof proceeds by induction on the depth L, demonstrating the NNGP for successive layers as n_{1},\dots,n_{L}\rightarrow\infty sequentially. The base case comes straight from Theorem 9. After the base case, the proof follows exactly as in Theorem 15. ∎

For the same reasons as in the proof above, the \dot{\Sigma} kernels after the first layer are also equal to the ones for the simple sinusoidal network, given in Lemma 16.

Given the similarity of the kernels beyond the first layer, the interpretation of this NNGP is the same as discussed in the previous section for the simple sinusoidal network.

Analogously to the SSN case before, the SIREN NTK expansion can also be approximated as a product of \Sigma kernels, as in Equation 6. The product of a \mathrm{sinc} function with L-1 subsequent Gaussians “dampens” the \mathrm{sinc}, such that as the network depth increases the NTK approaches a Gaussian, as can be seen in Figure 20.

Figure 20: The NTK for SIREN at different \omega values (4, 10, 30) and network depths (L=2, 4, 6). Kernel values at a slice for fixed \tilde{x}=0 are shown.

Appendix E Experimental details

E.1 Generalization

Where not explicitly stated otherwise, details for the generalization experiments are the same as for the comparisons to SIREN, described in Appendix C.

Figure 21: Reconstruction loss through a low-pass filter for the “camera” image. Notice the loss continues to go down up to the Nyquist frequency.

E.2 Burgers equation (Identification)

We follow the same training procedures as in Raissi et al. (2019a). The training set is created by randomly sampling 2,000 points from the available exact solution grid (shown in Figure 5). The neural networks used are 9-layer MLPs with 20 neurons per hidden layer. The network structure is the same for both the \tanh and sinusoidal networks. As in the original work, the network is trained by using L-BFGS to minimize a mean square error loss composed of the sum of an MSE loss over the data points and a physics-informed MSE loss derived from the equation

u_{t}+\lambda_{1}uu_{x}-\lambda_{2}u_{xx}=0. \qquad (7)
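A minimal sketch of this identification setup (assuming PyTorch; an illustrative reconstruction rather than the code of Raissi et al. (2019a)) treats \lambda_{1} and \lambda_{2} as trainable parameters optimized jointly with the network on the combined data and physics losses.

import torch

lambda1 = torch.nn.Parameter(torch.tensor(0.0))   # unknown PDE coefficients, learned jointly
lambda2 = torch.nn.Parameter(torch.tensor(0.0))

def burgers_loss(model, xt, u_data):
    # xt: (N, 2) sampled (x, t) coordinates; u_data: (N,) exact solution values at those points
    xt = xt.clone().requires_grad_(True)
    u = model(xt).squeeze(-1)
    grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = grads[:, 0], grads[:, 1]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0]
    residual = u_t + lambda1 * u * u_x - lambda2 * u_xx           # Equation 7
    return ((u - u_data) ** 2).mean() + (residual ** 2).mean()    # data MSE + physics-informed MSE

# The optimizer (L-BFGS, as in the original setup) covers both the network parameters
# and the PDE coefficients; `model` is any (x, t) -> u network:
# optimizer = torch.optim.LBFGS(list(model.parameters()) + [lambda1, lambda2])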

E.3 Navier-Stokes (Identification)

Figure 22: Reconstruction loss for different cutoff frequencies for a low-pass filter applied to the solution of the Navier-Stokes equations.

We follow the same training procedures as in Raissi et al. (2019a). The training set is created by randomly sampling 5,000 points from the available exact solution grid (one timestep is shown in Figure 6). The neural networks used are 9-layer MLPs with 20 neurons per hidden layer. The network structure is the same for both the \tanh and sinusoidal networks. As in the original work, the network is trained by using the Adam optimizer to minimize a mean square error loss composed of the sum of an MSE loss over the data points and a physics-informed MSE loss derived from the equations

u_{t}+\lambda_{1}(uu_{x}+vu_{y})=-p_{x}+\lambda_{2}(u_{xx}+u_{yy}) \qquad (8)
v_{t}+\lambda_{1}(uv_{x}+vv_{y})=-p_{y}+\lambda_{2}(v_{xx}+v_{yy}). \qquad (9)

E.4 Schrödinger (Inference)

This experiment reproduces the Schrödinger equation experiment from Raissi et al. (2019a). In this experiment, we are trying to find the solution to the Schrödinger equation, given by

ih_{t}+0.5h_{xx}+|h|^{2}h=0. \qquad (10)

Since in this case we have a forward problem, we do not have any prior information on which to base our choice of \omega, besides an upper limit given by the Nyquist frequency implied by the sampling of our training data. We thus follow usual machine learning procedures and experiment with a number of small \omega values, based on the previous experiments.

We find that \omega=4 gives the best results, with a solution MSE of 4.30\cdot 10^{-4}, against an MSE of 1.04\cdot 10^{-3} for the baseline. Figure 23 shows the solution from the sinusoidal network, together with the position of the sampled data points used for training.

Figure 23: Solution to the Schrödinger equation with the sinusoidal network, together with the position of the sampled data points used for training.
Training details.

We follow the same training procedures as in Raissi et al. (2019a). The training set is created by randomly sampling 20,000 points from the domain (x\in[-5,5], t\in[0,\pi/2]) for evaluation of the physics-informed loss. Additionally, 50 points are sampled from each of the boundary and initial conditions for direct data supervision. The neural networks used are 5-layer MLPs with 100 neurons per hidden layer. The network structure is the same for both the \tanh and sinusoidal networks. As in the original work, the network is trained first using the Adam optimizer for 50,000 steps and then using L-BFGS until convergence. The loss is composed of the sum of an MSE loss over the data points and a physics-informed MSE loss derived from Equation 10.

E.5 Helmholtz equation

Details for the Helmholtz equation experiment are the same as in the previous Helmholtz experiment in Appendix C.