Optimal Prediction using Learning and Shape Optimization

M. Sajjad Edalatzadeh Roland Herzog

Abstract

This paper investigates the problem of optimal predictor design for distributed parameter systems using neural networks and shape optimization. Sensors with various shapes are placed on the domain of the distributed parameter system. Data provided by these sensors are fed into a reconstructor to generate a full state of the system. After that, a trained neural-network predictor produces a prediction of the state at future time steps. The cost of prediction is defined as the weighted sensor area plus the squared norm of the prediction error. The location and shape of a sensor influence the prediction cost as well as the predictor performance. With the aid of the gradient of the network with respect to its inputs, an outer optimization layer is added to find optimized locations and shapes of the sensors. Simulation results show good agreement between the predicted and the true states as well as a significant reduction in cost by sensors with optimized locations and shapes.

Keywords: learning, prediction, shape optimization, distributed parameter systems, sensors

1 Introduction

Estimation is of fundamental importance in many disciplines. For example, it is important in climate science to predict the temperature changes in the Earth's atmosphere (Park et al., 2019). Estimation is also used to find sea surface temperature in order to predict weather (Nielsen-Englyst et al., 2018). In economics, estimation is used to predict investment risks. In control theory, estimation is used to provide feedback from the state of the system at locations not accessible by sensors. Most commonly, estimation is used to reconstruct the state of a system modeled by partial differential equations such as the heat and wave equations.

In recent years, machine learning methods have been used to extract predictive models from large sets of data provided by sensors. These methods can further find a symbolic representation of the model (symbolic regression) and also discover the hidden physics. Among the machine learning methods, deep neural networks have gained considerable attention (LeCun et al., 2015). Evolutionary algorithms (Vaddireddy et al., 2020), compressive sensing (Candès & Wakin, 2008), and sparse optimization (Candès et al., 2008) are also machine learning algorithms used for symbolic regression. In (Long et al., 2018), a PDE-net approach is proposed to predict the dynamics of systems modeled by partial differential equations (PDEs) from sensor data. The network is also able to uncover the underlying PDE model. In (Lu et al., 2018), it is shown that deep neural networks such as ResNet, PolyNet, FractalNet, and RevNet can be interpreted as different numerical discretizations of differential equations. This interpretation is further used to design new deep neural networks. In (Liu et al., 2010, 2013), training data for image restoration is used to create a learning-based PDE model for computer vision tasks. The problem is stated as an optimal control problem where the inputs and outputs are training images. The authors show the effectiveness of their model by numerical experiments on image denoising and inpainting problems in image restoration.

The shape optimization of sensors and actuators in the context of PDEs has been discussed in relatively few works. In (Privat et al., 2013), the optimal shape and position of an actuator for the wave equation in one spatial dimension are discussed. An actuator is placed on a subset $\omega\subset[0,\pi]$ with a constant Lebesgue measure $L\pi$ for some $L\in(0,1)$. The optimal actuator minimizes the norm of a Hilbert Uniqueness Method (HUM)-based control; such a control steers the system from a given initial state to the zero state in finite time. In (Privat et al., 2017), the optimal actuator shape and position for linear parabolic systems are discussed. That paper adopts the same approach as in (Privat et al., 2013) but with initial conditions that have randomized Fourier coefficients. The cost is defined as the average of the norm of HUM-based controls. In (Kalise et al., 2018), optimal actuator design for linear diffusion equations has been discussed. A quadratic cost function is considered, and shape and topological derivatives of this function are derived. Numerical results show significant improvement in the cost and performance of the control. In (Edalatzadeh et al., 2019), the optimal shape of actuators for vibration control of flexible beams is investigated using a linear-quadratic performance index and shape optimization.

Optimal sensor design problems are in many ways similar to the optimal actuator design problem. In (Privat et al., 2014), optimal sensor shape design has been studied where the observability is maximized over all admissible sensor shapes. Optimal actuator design problems for nonlinear distributed parameter systems have also been studied. In (Edalatzadeh & Morris, 2019), it is shown that under certain conditions on the nonlinearity and the cost function, an optimal input and actuator design exist, and optimality conditions are derived. Results are applied to the nonlinear railway track model as well as to the semi-linear wave model in two spatial dimensions. Numerical techniques to calculate the optimal actuator shape design are mostly limited to linear quadratic regulator problems, see, e.g., (Allaire et al., 2010). For controllability-based approaches, numerical schemes have been studied in (Münch & Periago, 2011; Münch, 2007; Münch & Periago, 2013).

In this paper, we combine shape optimization with learning-based PDE models to find the optimal shape of sensors, as well as an estimator of the full state of the PDE. In Section 2, the neural-network predictor is formulated, which recovers the full PDE state from limited sensor data and predicts the state at the subsequent time step. Section 3 discusses the shape optimization of sensors. Simulation results are presented in Section 4. Concluding remarks are given in Section 5.

2 Estimator design

Consider a first-order dynamical system with state $z(t)$ , initial condition $z_{0}$ , state space $X$ , and linear or nonlinear operator $F$ on $X$ :

\left\{\begin{aligned}\dot{z}(t)&=F(z(t)),\\ z(0)&=z_{0}.\end{aligned}\right.

(1)

In our applications, $F$ is a second-order partial differential operator acting on functions in one or two spatial dimensions. The evolution of the state $z(t)$ in discrete time with sufficiently small time increment $\Delta t$ can be described, using for simplicity an explicit Euler method, as

z_{k+1}=z_{k}+F(z_{k})\Delta t

(2)

with initial value $z_{0}$. For numerical computation, we also need to discretize in space. For simplicity of notation, we currently assume $X$ to be a space of functions on a one-dimensional interval, which we discretize with a mesh of $c$ equidistant points. The results provided in Section 4 include two-dimensional examples as well. Discretizing $F$ accordingly, here using a finite difference method, we obtain from (2) the fully discrete evolution

{\mathbf{z}}_{k+1}={\mathbf{z}}_{k}+{\mathbf{F}}({\mathbf{z}}_{k})\Delta t.

(3)

Each element in the vector ${\mathbf{z}}_{k}$ represents an approximation of the true state $z$ at one of the spatial grid points and time $t=k\,\Delta t$ .
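The fully discrete evolution (3) can be sketched in a few lines. The following is a minimal illustration, assuming $F$ is the one-dimensional diffusion operator $\kappa\,\partial_{xx}$ with a zero value at the left end and zero slope at the right end; the function names, the initial condition, and the ghost-point boundary treatment are choices made for this sketch only.

```python
import numpy as np

def heat_rhs(z, kappa, h):
    """Finite-difference approximation of F(z) = kappa * z_xx on a uniform grid.

    Boundary treatment (an assumption of this sketch): z = 0 at the left end,
    zero slope at the right end via a mirrored ghost point.
    """
    dz = np.zeros_like(z)
    dz[1:-1] = kappa * (z[2:] - 2.0 * z[1:-1] + z[:-2]) / h**2
    dz[-1] = kappa * 2.0 * (z[-2] - z[-1]) / h**2  # ghost point mirrors z[-2]
    return dz  # dz[0] stays 0, so the left boundary value is preserved

def euler_step(z, dt, kappa, h):
    """One step of equation (3): z_{k+1} = z_k + F(z_k) * dt."""
    return z + heat_rhs(z, kappa, h) * dt

c = 101                                # number of grid points
x = np.linspace(0.0, 1.0, c)
h = x[1] - x[0]
z = np.sin(0.5 * np.pi * x)            # hypothetical initial condition
trajectory = [z]
for _ in range(50):
    trajectory.append(euler_step(trajectory[-1], dt=0.1, kappa=1e-4, h=h))
trajectory = np.array(trajectory)      # rows are z_0, ..., z_50
```

Each row of `trajectory` plays the role of one vector ${\mathbf{z}}_{k}$; consecutive rows form the input and target pairs on which a predictor could be trained.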

A neural-network predictor takes the current state vector ${\mathbf{z}}_{k}$ (or a reconstruction thereof) and predicts the solution at the next time step ${\mathbf{z}}_{k+1}$ . The estimator network is trained over simulation data or acquired real-life data. In most cases, the state ${\mathbf{z}}_{k}$ cannot be fully measured by an array of sensors. Therefore, the data provided by the sensors available represents an incomplete state vector and the full state vector is reconstructed first before being fed into the predictor network.

The reconstruction of an incomplete state reading works as follows. A sensor arrangement is encoded in a $c$-dimensional vector $\omega$. Each element of $\omega$ indicates the weight of the presence of a sensor at the respective vertex. Weights less than 0.5 are treated as no sensor present. The function $f_{r}\colon\mathbb{R}^{c}\times\mathbb{R}^{c}\to\mathbb{R}^{c}$ provides an approximate reconstruction based on an interpolation of neighboring sensor measurements; see Figure 1. That is, we have ${\mathbf{z}}\approx f_{r}({\mathbf{z}},\omega)$. A cubic spline interpolation is used in the one-dimensional simulations. This interpolation is a suitable choice for PDEs with second-order spatial derivatives.

Figure 1: Illustration of the reconstruction function $f_{r}$.
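A minimal sketch of the reconstruction step in one dimension, assuming SciPy's `CubicSpline` and the 0.5 threshold described above. The function name `reconstruct`, the error raised for fewer than two sensors, and the example sensor layout are assumptions of this sketch.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def reconstruct(z, omega, x, threshold=0.5):
    """Sketch of f_r(z, omega): cubic spline through readings at active sensors."""
    active = omega >= threshold        # weights below 0.5 count as "no sensor"
    if active.sum() < 2:
        raise ValueError("need at least two active sensors to interpolate")
    spline = CubicSpline(x[active], z[active])
    return spline(x)                   # extrapolates beyond the outermost sensors

c = 101
x = np.linspace(0.0, 1.0, c)
z = np.sin(0.5 * np.pi * x)            # hypothetical full state
omega = np.zeros(c)
omega[::10] = 1.0                      # one sensor on every tenth grid point
z_rec = reconstruct(z, omega, x)
max_err = np.max(np.abs(z_rec - z))    # reconstruction error on the full grid
```

Only the entries of `z` at active sensors enter the spline, so the function mimics an incomplete state reading even though the full vector is passed in.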

The function $f_{n}\colon\mathbb{R}^{c}\to\mathbb{R}^{c}$ provides a neural-network prediction ${\mathbf{Z}}_{k+1}$ of the state ${\mathbf{z}}_{k+1}$ using reconstructed data; that is,

{\mathbf{z}}_{k+1}\approx{\mathbf{Z}}_{k+1}\coloneqq f_{n}(f_{r}({\mathbf{z}}_{k},\omega)).

(4)

Different types of neural networks lead to different formulations for the predictor function $f_{n}$ . Examples are given in \crefsection:simulation_results.
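To make the role of $f_{n}$ concrete without a deep learning dependency, the sketch below fits an affine map by least squares. This is a stand-in, not the Keras networks used in the simulations; it relies only on the fact that dense layers with linear activations compose to an affine map. The names `fit_affine_predictor` and `predict`, the toy step matrix `M`, and the random training states are assumptions of this sketch.

```python
import numpy as np

def fit_affine_predictor(Z_in, Z_out):
    """Least-squares fit of Z_out ~ Z_in @ W.T + b (rows are snapshots)."""
    A = np.hstack([Z_in, np.ones((Z_in.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, Z_out, rcond=None)
    W, b = coef[:-1].T, coef[-1]
    return W, b

def predict(W, b, z):
    """Stand-in for f_n: maps a (reconstructed) state to a next-step prediction."""
    return W @ z + b

# Toy training data: states advanced by a known linear diffusion step M.
c, r = 21, 0.1
M = np.eye(c) + r * (np.eye(c, k=1) - 2.0 * np.eye(c) + np.eye(c, k=-1))
rng = np.random.default_rng(0)
Z_in = rng.standard_normal((200, c))   # hypothetical current states z_k
Z_out = Z_in @ M.T                     # corresponding next states z_{k+1}

W, b = fit_affine_predictor(Z_in, Z_out)
z_next = predict(W, b, Z_in[0])
```

Because the training pairs are generated by a linear map, the fitted `W` recovers `M` up to rounding; with real data or a nonlinear network this would of course only hold approximately.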

3 Sensor shape optimization

In this section, we consider an optimization of the sensor configuration encoded in the weight vector $\omega$. Clearly, from the prediction point of view, it is beneficial to have sensors everywhere. In this case, we would have ${\mathbf{z}}_{k}=f_{r}({\mathbf{z}}_{k},\omega)$ since no interpolation is necessary. On the other hand, we associate a cost with each sensor placed. Our goal is to obtain a compromise between the number of sensors and the accuracy of the predictions ${\mathbf{Z}}_{k+1}$ of the state ${\mathbf{z}}_{k+1}$, based on the partially observed previous state ${\mathbf{z}}_{k}$. We express this goal in terms of the cost function

J(\omega)=\alpha\sum_{i=1}^{c}[\omega]_{i}+\sum_{k=0}^{K-1}\sum_{i=1}^{c}\big[f_{n}(f_{r}({\mathbf{z}}_{k},\omega))-{\mathbf{z}}_{k+1}\big]_{i}^{2}\,h_{i}\,\Delta t.

Here $K$ denotes the number of grid points in time. Notice that the second term in the cost function represents a discretized version of the $L^{2}$ mean squared error. The sequence $h_{i}$ depends on the spatial mesh size and the quadrature scheme used; for instance, $h_{i}=h/2,h,\ldots,h,h/2$ implements the trapezoidal rule with mesh size $h$. A similar formula holds in two spatial dimensions.
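The cost can be evaluated directly from its definition: a sensor penalty $\alpha\sum_{i}[\omega]_{i}$ plus the quadrature-weighted squared prediction error summed over time. The sketch below assumes the snapshots are stored row-wise and that `f_n` and `f_r` are supplied as callables; all names are illustrative.

```python
import numpy as np

def trapezoid_weights(c, h):
    """Quadrature weights h/2, h, ..., h, h/2 for the trapezoidal rule."""
    w = np.full(c, h)
    w[0] = w[-1] = h / 2.0
    return w

def prediction_cost(omega, Z_true, f_n, f_r, alpha, h_weights, dt):
    """alpha * sum(omega) plus the weighted squared one-step prediction error."""
    cost = alpha * np.sum(omega)
    # Z_true holds z_0, ..., z_K as rows, so the pairs cover k = 0, ..., K-1.
    for z_k, z_next in zip(Z_true[:-1], Z_true[1:]):
        err = f_n(f_r(z_k, omega)) - z_next
        cost += np.sum(err**2 * h_weights) * dt
    return cost

# All-sensor case with a perfect predictor: only the sensor penalty remains.
c, alpha = 5, 2.0
Z_true = np.array([0.5**k * np.ones(c) for k in range(4)])
J_full = prediction_cost(
    omega=np.ones(c),
    Z_true=Z_true,
    f_n=lambda v: 0.5 * v,             # hypothetical exact predictor
    f_r=lambda z, omega: z,            # no interpolation needed
    alpha=alpha,
    h_weights=trapezoid_weights(c, 1.0 / (c - 1)),
    dt=0.1,
)
```

In this toy setting the prediction error vanishes and `J_full` equals $\alpha\,c$, the value that the cost cannot fall below when every grid point carries a sensor.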

The derivative of $J(\omega)$ with respect to the sensor weight vector $\omega$ has the components

[J^{\prime}(\omega)]_{j}=\alpha+2\sum_{k=0}^{K-1}\sum_{i=1}^{c}\big[f_{n}(f_{r}({\mathbf{z}}_{k},\omega))-{\mathbf{z}}_{k+1}\big]_{i}\,[{\mathbf{Z}}_{k+1}^{\prime}]_{ij}\,h_{i}\,\Delta t,\quad j=1,\ldots,c.

Here ${\mathbf{Z}}_{k+1}^{\prime}$ is the derivative of ${\mathbf{Z}}_{k+1}=f_{n}(f_{r}({\mathbf{z}}_{k},\omega))$ with respect to $\omega$. From the chain rule, we obtain

{\mathbf{Z}}_{k+1}^{\prime}=f_{n}^{\prime}(f_{r}({\mathbf{z}}_{k},\omega))\,f_{r}^{\prime}({\mathbf{z}}_{k},\omega),

(5)

where the right-hand side is a product of two $c\times c$ matrices. The first factor represents the derivative of the estimator network's output with respect to its input vector, while the second factor is the derivative of the outcome of the spline interpolation with respect to $\omega$. The latter derivative is zero except when there are components of $\omega$ right at the threshold $[\omega]_{i}=0.5$. In this case, the derivative becomes a Dirac function. Simulation results, however, indicate that the first factor in (5) is sufficient to provide search directions suitable to reduce the cost function. Therefore, we use the approximation

{\mathbf{Z}}_{k+1}^{\prime}\approx f_{n}^{\prime}(f_{r}({\mathbf{z}}_{k},\omega))

(6)

to evaluate the derivative $J^{\prime}(\omega)$. Its transpose, the approximate gradient $\nabla J(\omega)$, is then used in a trust-region optimization algorithm to find an improved sensor configuration vector $\omega$.
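With approximation (6), component $j$ of the gradient is $\alpha$ plus $2\,\Delta t\sum_{k}\sum_{i}e_{i}\,h_{i}\,[f_{n}^{\prime}]_{ij}$, where $e$ denotes the prediction error at step $k$. A minimal sketch, assuming the predictor and its Jacobian are available as callables (`f_n_jac` is a hypothetical name; in practice the Jacobian would come from the network's automatic differentiation):

```python
import numpy as np

def approx_gradient(omega, Z_true, f_n, f_n_jac, f_r, alpha, h_weights, dt):
    """Approximate gradient of J using Z' ~ f_n'(f_r(z_k, omega)), cf. (6)."""
    grad = np.full(omega.shape, float(alpha))
    for z_k, z_next in zip(Z_true[:-1], Z_true[1:]):
        v = f_r(z_k, omega)                  # reconstructed state
        err = f_n(v) - z_next                # prediction error at step k
        # Transposed Jacobian times the quadrature-weighted error.
        grad += 2.0 * dt * (f_n_jac(v).T @ (h_weights * err))
    return grad

# Tiny hand-checkable case: linear predictor, one time step, identity f_r.
W = np.array([[1.0, 2.0], [3.0, 4.0]])
grad = approx_gradient(
    omega=np.ones(2),
    Z_true=np.array([[1.0, 0.0], [0.0, 0.0]]),
    f_n=lambda v: W @ v,
    f_n_jac=lambda v: W,
    f_r=lambda z, omega: z,
    alpha=0.0,
    h_weights=np.ones(2),
    dt=1.0,
)
```

Here the error is `[1, 3]`, so the result is twice the product of the transpose of `W` with `[1, 3]`, namely `[20, 28]`.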

1 Generate training data from the simulated solution ${\mathbf{z}}$
2 Build and train an estimator neural network model $f_{n}$
3 Randomly select an initial weight vector $\omega_{0}$
4 for $j=0,\ldots,J$ do
5     for $k=0,\ldots,K-1$ do
6         Evaluate ${\mathbf{Z}}_{k+1}=f_{n}(f_{r}({\mathbf{z}}_{k},\omega_{j}))$
7         Evaluate ${\mathbf{Z}}^{\prime}_{k+1}\approx f^{\prime}_{n}(f_{r}({\mathbf{z}}_{k},\omega_{j}))$
8     end for
9     Find $\omega_{j+1}\in[0,1]^{c}$ using a constrained optimization method, which utilizes evaluations of $J(\omega_{j})$ and $J^{\prime}(\omega_{j})$ based on the cost function and its derivative
10     $j\leftarrow j+1$
11 end for

Algorithm 1 Optimal learning-based predictor design for distributed parameter systems
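The steps of Algorithm 1 can be sketched end to end on a toy problem. The sketch below is an illustration under several assumptions: the predictor is replaced by the exact linear step matrix instead of a trained network, the constrained optimizer is SciPy's `trust-constr` method (the specific trust-region routine is not stated in the text), and all parameter values are chosen for speed rather than taken from the simulations.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.optimize import Bounds, minimize

c, K, dt, alpha = 21, 20, 0.1, 0.05    # illustrative values only
x = np.linspace(0.0, 1.0, c)
h = x[1] - x[0]
h_w = np.full(c, h)
h_w[[0, -1]] = h / 2.0                 # trapezoidal weights

# Step 1: simulated data from a linear diffusion step z_{k+1} = M z_k.
r = 0.1
M = np.eye(c) + r * (np.eye(c, k=1) - 2.0 * np.eye(c) + np.eye(c, k=-1))
M[0, :] = 0.0
M[0, 0] = 1.0                          # keep the left boundary value fixed
M[-1, :] = 0.0
M[-1, -1] = 1.0                        # keep the right boundary value fixed
Z = [np.sin(np.pi * x)]
for _ in range(K):
    Z.append(M @ Z[-1])
Z = np.array(Z)

# Step 2: stand-in predictor (assumption: the exact map, not a trained network).
f_n = lambda v: M @ v
f_n_jac = lambda v: M

def f_r(z, omega):
    """Spline reconstruction from sensors with weight >= 0.5."""
    active = omega >= 0.5
    if active.sum() < 2:               # too few sensors: return a zero state
        return np.zeros_like(z)
    return CubicSpline(x[active], z[active])(x)

def cost_and_grad(omega):
    """Cost J and its approximate gradient based on (6)."""
    J = alpha * omega.sum()
    g = np.full(c, alpha)
    for z_k, z_next in zip(Z[:-1], Z[1:]):
        v = f_r(z_k, omega)
        err = f_n(v) - z_next
        J += np.sum(err**2 * h_w) * dt
        g += 2.0 * dt * (f_n_jac(v).T @ (h_w * err))
    return J, g

# Steps 3 to 11: random start, then bound-constrained trust-region iterations.
rng = np.random.default_rng(0)
omega0 = rng.uniform(0.0, 1.0, c)
result = minimize(cost_and_grad, omega0, jac=True, method="trust-constr",
                  bounds=Bounds(np.zeros(c), np.ones(c)),
                  options={"maxiter": 50})
omega_opt = np.clip(result.x, 0.0, 1.0)
```

Because the thresholded reconstruction makes the cost piecewise constant in the error term, the supplied gradient is only a search direction in the sense discussed above, and the optimizer's convergence diagnostics should be read with that in mind.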

4 Simulation results

In this section, we provide numerical results for a range of different PDE models. Our goal is to demonstrate that, in each case, optimized sensor configurations and properly trained predictor networks $f_{n}$ can obtain sufficiently accurate predictions of the state vector. To this end, we show that the cost $J(\omega)$ for an optimized weight $\omega$ is significantly lower than the cost for the all-sensor case $[\omega]_{i}=1$, $i=1,2,\ldots,c$, for which the lower bound $J(\omega)\geq\alpha\,c$ is valid. The overall procedure is described in Algorithm 1.

For each example, we build an estimator neural network $f_{n}$ using the deep learning package Keras (version 2.4.0). We employ a linear activation function and the mean squared error (MSE) loss function in the networks and use the Adam optimizer for training. The Adam optimizer and the MSE loss function yielded better performance than the other available optimization algorithms and loss functions. For PDE models in one space dimension, the network layout consists of two dense layers, where each layer has as many neurons as grid nodes. For PDE models in two space dimensions, a convolutional neural network with one layer, one filter, kernel size $3\times 3$, and zero padding is considered. In each case, the gradient of the network with respect to its input is calculated using the backend module in Keras.

In the implementation of the reconstructor function $f_{r}$, the sensor data is reconstructed using the Python module scipy.interpolate. For PDE models in one space dimension, CubicSpline is used, whereas interp2d is chosen in two space dimensions.
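A sketch of how a two-dimensional reconstruction from scattered sensor locations could look. It uses `scipy.interpolate.griddata` instead of `interp2d`, since `interp2d` has been deprecated and removed in recent SciPy releases; this substitution, the piecewise-linear interpolation, the nearest-neighbor fallback outside the sensors' convex hull, and the name `reconstruct_2d` are all assumptions of this sketch.

```python
import numpy as np
from scipy.interpolate import griddata

def reconstruct_2d(U, omega, X, Y, threshold=0.5):
    """Interpolate readings at active sensors back onto the full grid."""
    active = omega >= threshold
    pts = np.column_stack([X[active], Y[active]])
    vals = U[active]
    rec = griddata(pts, vals, (X, Y), method="linear")
    # Linear interpolation is undefined outside the convex hull of the sensors;
    # fall back to the nearest sensor reading there.
    nearest = griddata(pts, vals, (X, Y), method="nearest")
    return np.where(np.isnan(rec), nearest, rec)

n = 11
xs = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")
U = 2.0 * X + 3.0 * Y                  # hypothetical state, linear in x and y
omega = np.zeros((n, n))
omega[::2, ::2] = 1.0                  # sensors on every other node, corners included
U_rec = reconstruct_2d(U, omega, X, Y)
```

A state that is linear in both coordinates is reproduced by piecewise-linear interpolation, which makes this a convenient sanity check before moving to real data.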

Simulations are presented for a variety of PDE models in the following subsections. In figures and formulas, $u_{r}$ denotes the true (simulated) solution, which corresponds to ${\mathbf{z}}$ in the notation of Sections 2 and 3. Moreover, $u_{p}$ denotes the predicted solution corresponding to ${\mathbf{Z}}$ according to (4).

4.1 One-dimensional heat equation

Consider the following heat equation in one space dimension over $x\in[0,1]$

\begin{cases}u_{t}(x,t)=\kappa\,u_{xx}(x,t),\\ u(x,0)=u_{0},\\ u(0,t)=0,\\ u_{x}(1,t)=0.\end{cases}

(7)

A forward-in-time and space-centered finite difference method is used to discretize the model and extract the solution data ${\mathbf{z}}_{k}$. The predictor neural network is then trained on the solution data. The estimation performance is shown in \cref{fig-snapshots-heat-IC1,fig-snapshots-heat-IC2,fig-snapshots-heat-IC3,fig-snapshots-heat-step} for various initial conditions $u_{0}$. The sensor weight $\alpha$ is set to 5; mesh size $h$ is set to $10^{-2}$ (resulting in $c=101$); time increment $\Delta t$ is set to $0.1$; and conductivity $\kappa$ is set to $10^{-4}$.
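As a quick plausibility check on these parameters, the forward-in-time, space-centered scheme for the heat equation is stable when the mesh ratio $r=\kappa\,\Delta t/h^{2}$ does not exceed $1/2$ (a standard result, not stated in the text). With the values above:

```latex
r = \frac{\kappa\,\Delta t}{h^{2}}
  = \frac{10^{-4}\cdot 0.1}{(10^{-2})^{2}}
  = \frac{10^{-5}}{10^{-4}}
  = 0.1 \le \frac{1}{2}.
```

The chosen time increment therefore sits well inside the stability region of the explicit scheme.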

\Cref{fig-l1norm-linearheat} shows the $L_{1}$-norm of the error over time, i.e.,

\|u_{r}-u_{p}\|_{1}=\int_{0}^{1}|u_{r}(x,t)-u_{p}(x,t)|\,\mathrm{d}x,

(8)

which is small compared to the solution although only a fraction of the sensors is being used in each case, see \cref{fig-snapshots-heat-IC1,fig-snapshots-heat-IC2,fig-snapshots-heat-IC3,fig-snapshots-heat-step}. In these figures, 17% of the PDE region is covered with sensors.
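On a uniform grid, the integral in (8) can be approximated with the trapezoidal rule. A minimal sketch (the function name `l1_error` and the test profiles are illustrative):

```python
import numpy as np

def l1_error(u_r, u_p, h):
    """Trapezoidal approximation of the integral of |u_r - u_p| over [0, 1]."""
    diff = np.abs(u_r - u_p)
    return h * (0.5 * diff[0] + diff[1:-1].sum() + 0.5 * diff[-1])

x = np.linspace(0.0, 1.0, 101)
h = x[1] - x[0]
u_true = x                             # hypothetical true state at one time
u_pred = np.zeros_like(x)              # hypothetical prediction
err = l1_error(u_true, u_pred, h)      # integral of x over [0, 1]
```

Evaluating `l1_error` at each time step yields an error curve over time of the kind described above.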

The reduction in cost over the iterations is shown in \cref{fig-cost-heat-linear}.

4.2 One-dimensional wave equation

Consider the following linear one-dimensional wave equation over $x\in[0,1]$

\begin{cases}u_{tt}(x,t)=\lambda\,u_{xx}(x,t),\\ u(x,0)=u_{0},\\ u_{t}(x,0)=0,\\ u(0,t)=u(1,t)=0.\end{cases}

(9)

A time-centered and space-centered finite difference method is used to discretize the wave model and extract the solution data. The estimation performance is shown in \cref{fig-snapshots-wave-IC1,fig-snapshots-wave-IC2,fig-snapshots-wave-IC3} for various initial conditions $u_{0}$. The sensor weight $\alpha$ is set to 2; mesh size $h$ is set to $10^{-2}$ (resulting in $c=101$); time increment $\Delta t$ is set to $0.1$; and squared wave speed $\lambda$ is set to $3\cdot 10^{-3}$.
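A time-centered, space-centered discretization of (9) can be sketched as a leapfrog scheme. The treatment of the first step (a Taylor expansion using the zero initial velocity), the function name, and the initial profile are assumptions of this sketch; the parameter values are those stated above.

```python
import numpy as np

def wave_trajectory(u0, lam, h, dt, steps):
    """Leapfrog scheme for u_tt = lam * u_xx with u_t(x, 0) = 0 and fixed ends."""
    s = lam * (dt / h) ** 2            # squared Courant number

    def lap(u):
        out = np.zeros_like(u)         # end points stay untouched
        out[1:-1] = u[2:] - 2.0 * u[1:-1] + u[:-2]
        return out

    U = [u0.copy(), u0 + 0.5 * s * lap(u0)]   # first step: zero initial velocity
    for _ in range(steps - 1):
        U.append(2.0 * U[-1] - U[-2] + s * lap(U[-1]))
    return np.array(U)

c = 101
x = np.linspace(0.0, 1.0, c)
h = x[1] - x[0]
lam, dt = 3e-3, 0.1
courant = np.sqrt(lam) * dt / h        # about 0.55, below the stability limit of 1
U = wave_trajectory(np.sin(np.pi * x), lam, h, dt, steps=100)
```

Since the scheme uses two previous time levels, a one-step predictor for the wave equation needs some way of carrying that information; how this is handled in the trained networks is not detailed in the text.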

Similar to the previous example, we observe good agreement between the predicted and the true state (\cref{fig-linftynorm-wave-linear}) although only very few sensors are in use. In \cref{fig-snapshots-wave-IC1} and \cref{fig-snapshots-wave-IC3}, only 9% of the wave region is covered with sensors, and in \cref{fig-snapshots-wave-IC2}, this number is 14%. The reduction in cost over the iterations is shown in \cref{fig-cost-wave-linear}.

4.3 Two-dimensional heat equation

Consider the following heat equation in two space dimensions over $(x,y)\in[0,1]^{2}$

\begin{cases}u_{t}(x,y,t)=\kappa\left(u_{xx}(x,y,t)+u_{yy}(x,y,t)\right),\\ u(x,y,0)=u_{0},\\ u(0,y,t)=u_{x}(1,y,t)=0,\\ u_{y}(x,0,t)=u_{y}(x,1,t)=0.\end{cases}

(10)

The discretization is similar to that in Section 4.1. The mesh size is uniform with length $h=3\times 10^{-2}$; time increment $\Delta t$ is set to $0.1$; and conductivity $\kappa$ is set to $10^{-4}$. The initial condition is set to $u_{0}(x,y)=x^{2}y^{2}(x-1)^{2}(y-1)^{2}$, scaled to $[0,1]$.
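One way to see why a single $3\times 3$ convolution with zero padding is a natural predictor here: at interior nodes, one explicit Euler step of the five-point finite difference scheme is exactly such a convolution. The sketch below checks this with `scipy.signal.convolve2d`; it is an observation about the update rule, not a description of the trained filter, and zero padding does not reproduce the Neumann conditions in (10) at the boundary. The random test state is an assumption of this sketch.

```python
import numpy as np
from scipy.signal import convolve2d

kappa, dt, h = 1e-4, 0.1, 3e-2         # parameter values stated above
r = kappa * dt / h**2                  # about 0.011, below the 2D limit of 1/4
kernel = np.array([[0.0, r, 0.0],
                   [r, 1.0 - 4.0 * r, r],
                   [0.0, r, 0.0]])

def conv_step(U):
    """One explicit Euler step written as a zero-padded 3x3 convolution."""
    return convolve2d(U, kernel, mode="same", boundary="fill", fillvalue=0.0)

def stencil_step_interior(U):
    """The same step via the five-point stencil, interior nodes only."""
    lap = (U[2:, 1:-1] + U[:-2, 1:-1] + U[1:-1, 2:] + U[1:-1, :-2]
           - 4.0 * U[1:-1, 1:-1])
    return U[1:-1, 1:-1] + r * lap

rng = np.random.default_rng(0)
U = rng.uniform(0.0, 1.0, (10, 10))    # hypothetical state on a small grid
V = conv_step(U)
same_inside = np.allclose(V[1:-1, 1:-1], stencil_step_interior(U))
```

The two updates agree at all interior nodes; differences appear only in the outermost rows and columns, where the padding stands in for the boundary conditions.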

In this example, we vary the value of the sensor weight coefficient $\alpha$. Once again, we observe good agreement between the predicted and the true state (\cref{fig-norm-2D-heat}) although only few sensors are in use; see \cref{fig-optimal-sensor-2D-heat,fig-snapshots-2D-heat}. The reduction in cost over the iterations is shown in \cref{fig-cost-2D-heat}.

5 Conclusion

Optimal prediction for distributed parameter systems was discussed in this paper. The optimal predictor design involves two steps. In the first step, a neural-network predictor is designed. In the second step, sensor shapes are optimized using the gradient of the network. Computer simulations are conducted for several models including heat and wave equations in one and two space dimensions. The results show a significant reduction in the cost of prediction as well as an improvement in the predictor performance. The computer codes are accessible in the repository https://github.com/TUCMath/optimal-predictor.

References

Allaire et al. (2010) Allaire, G., Münch, A., and Periago, F. Long time behavior of a two-phase optimal design for the heat equation. SIAM Journal on Control and Optimization, 48(8):5333–5356, 2010. doi: 10.1137/090780481.
Candès & Wakin (2008) Candès, E. J. and Wakin, M. B. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008. doi: 10.1109/msp.2007.914731.
Candès et al. (2008) Candès, E. J., Wakin, M. B., and Boyd, S. P. Enhancing sparsity by reweighted $\ell_{1}$ minimization. Journal of Fourier Analysis and Applications, 14(5-6):877–905, 2008. doi: 10.1007/s00041-008-9045-x.
Edalatzadeh & Morris (2019) Edalatzadeh, M. S. and Morris, K. A. Optimal actuator design for semilinear systems. SIAM Journal on Control and Optimization, 57(4):2992–3020, 2019. doi: 10.1137/18m1171229.
Edalatzadeh et al. (2019) Edalatzadeh, M. S., Kalise, D., Morris, K. A., and Sturm, K. Optimal actuator design for vibration control based on LQR performance and shape calculus. arXiv: 1903.07572 (https://arxiv.org/abs/1903.07572), 2019.
Kalise et al. (2018) Kalise, D., Kunisch, K., and Sturm, K. Optimal actuator design based on shape calculus. Mathematical Models and Methods in Applied Sciences, 28(13):2667–2717, 2018. doi: 10.1142/s0218202518500586.
LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015. doi: 10.1038/nature14539.
Liu et al. (2010) Liu, R., Lin, Z., Zhang, W., and Su, Z. Learning PDEs for image restoration via optimal control. In Computer Vision ECCV 2010, pp. 115–128. Springer Berlin Heidelberg, 2010. doi: 10.1007/978-3-642-15549-9_9.
Liu et al. (2013) Liu, R., Lin, Z., Zhang, W., Tang, K., and Su, Z. Toward designing intelligent PDEs for computer vision: An optimal control approach. Image and Vision Computing, 31(1):43–56, 2013. doi: 10.1016/j.imavis.2012.09.004. arXiv: 1109.1057 (https://arxiv.org/abs/1109.1057).
Long et al. (2018) Long, Z., Lu, Y., Ma, X., and Dong, B. PDE-Net: Learning PDEs from data. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3208–3216, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR. URL http://proceedings.mlr.press/v80/long18a.html.
Lu et al. (2018) Lu, Y., Zhong, A., Li, Q., and Dong, B. Beyond finite layer neural networks: bridging deep architectures and numerical differential equations. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3276–3285, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR. URL http://proceedings.mlr.press/v80/lu18d.html.
Münch (2007) Münch, A. Optimal location of the support of the control for the 1-D wave equation: numerical investigations. Computational Optimization and Applications, 42(3):443–470, 2007. doi: 10.1007/s10589-007-9133-x.
Münch & Periago (2011) Münch, A. and Periago, F. Optimal distribution of the internal null control for the one-dimensional heat equation. Journal of Differential Equations, 250(1):95–111, 2011. doi: 10.1016/j.jde.2010.10.020.
Münch & Periago (2013) Münch, A. and Periago, F. Numerical approximation of bang–bang controls for the heat equation: An optimal design approach. Systems & Control Letters, 62(8):643–655, 2013. doi: 10.1016/j.sysconle.2013.04.009.
Nielsen-Englyst et al. (2018) Nielsen-Englyst, P., Høyer, J. L., Pedersen, L. T., Gentemann, C. L., Alerskans, E., Block, T., and Donlon, C. Optimal estimation of sea surface temperature from AMSR-E. Remote Sensing, 10(2):229, 2018. doi: 10.3390/rs10020229.
Park et al. (2019) Park, I., Kim, H. S., Lee, J., Kim, J. H., Song, C. H., and Kim, H. K. Temperature prediction using the missing data refinement model based on a long short-term memory neural network. Atmosphere, 10(11):718, 2019. doi: 10.3390/atmos10110718.
Privat et al. (2013) Privat, Y., Trélat, E., and Zuazua, E. Optimal location of controllers for the one-dimensional wave equation. Annales de l'Institut Henri Poincaré (C) Non Linear Analysis, 30(6):1097–1126, 2013. doi: 10.1016/j.anihpc.2012.11.005.
Privat et al. (2014) Privat, Y., Trélat, E., and Zuazua, E. Optimal shape and location of sensors for parabolic equations with random initial data. Archive for Rational Mechanics and Analysis, 216(3):921–981, 2014. doi: 10.1007/s00205-014-0823-0.
Privat et al. (2017) Privat, Y., Trélat, E., and Zuazua, E. Actuator design for parabolic distributed parameter systems with the moment method. SIAM Journal on Control and Optimization, 55(2):1128–1152, 2017. doi: 10.1137/16m1058418.
Vaddireddy et al. (2020) Vaddireddy, H., Rasheed, A., Staples, A. E., and San, O. Feature engineering and symbolic regression methods for detecting hidden physics from sparse sensor observation data. Physics of Fluids, 32(1):015113, 2020. doi: 10.1063/1.5136351.