Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights
In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: Can the data fed into transformers be recovered from their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data from given attention weights and outputs by minimizing a loss function $L(X)$ that captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for Large Language Models (LLMs), suggesting potential vulnerabilities in the model’s design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data.
1 Introduction
In the intricate and constantly evolving domain of deep learning, the transformer architecture has emerged as a game-changing innovation [98]. This novel architecture has propelled state-of-the-art performance in a myriad of tasks, and its potency lies in the underlying mechanism known as the “attention mechanism.” The essence of this mechanism can be distilled into the unique interaction between three distinct matrices: the Query ($Q$), the Key ($K$), and the Value ($V$). The Query matrix ($Q$) represents the questions or the aspects we are interested in, the Key matrix ($K$) denotes the elements against which these questions are compared or matched, and the Value matrix ($V$) encapsulates the information we want to retrieve based on the comparisons. These matrices are not mere multidimensional arrays; they play vital roles in encoding, comparing, and extracting pertinent information from the data.
Given this context, the attention mechanism can be mathematically captured as follows:
Definition 1.1 (Attention matrix computation).
Let $Q, K \in \mathbb{R}^{n \times d}$ be two matrices that respectively represent the query and the key. Similarly, for a matrix $V \in \mathbb{R}^{n \times d}$ denoting the value, the attention matrix is defined as
$$\mathrm{Att}(Q, K, V) := D^{-1} A V.$$
In this equation, two matrices are introduced: $A \in \mathbb{R}^{n \times n}$ and $D \in \mathbb{R}^{n \times n}$, defined as
$$A := \exp(QK^\top), \qquad D := \mathrm{diag}(A \mathbf{1}_n).$$
Here, the matrix $A$ represents the relationship scores between the query and the key, and $D$ ensures normalization, so that the attention weights in each row of $D^{-1}A$ sum to one. The computation $D^{-1} A V$ hence deftly combines these relationships with the value matrix to output the final attended representation.
In practical large-scale language models [16, 72], there may be multiple levels of attention computation. For such multi-level architectures, the feed-forward pass can be represented as
$$X_{i+1} = \mathrm{Att}(X_i Q_i, X_i K_i, X_i V_i),$$
where $X_i$ is the input of the $i$-th layer, $X_{i+1}$ is the output of the $i$-th layer, and $Q_i, K_i, V_i$ are the attention weights in the $i$-th layer.
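For concreteness, the following is a minimal NumPy sketch of the computation in Definition 1.1 and of stacking several layers as described above. The unnormalized exponential form $A = \exp(QK^\top)$, $D = \mathrm{diag}(A\mathbf{1}_n)$ and the way the per-layer weights act on the layer input reflect our reading of the definitions; all variable names and sizes are illustrative.

```python
import numpy as np

def attention(Q, K, V):
    """Single attention layer: Att(Q, K, V) = D^{-1} A V with A = exp(Q K^T)."""
    A = np.exp(Q @ K.T)                                 # relationship scores, shape (n, n)
    D_inv = np.diag(1.0 / (A @ np.ones(A.shape[1])))    # inverse of the row-sum normalizer
    return D_inv @ A @ V                                # each row of D_inv @ A sums to one

def multi_layer(X, weights):
    """Stack layers: layer i maps its input X_i to the output X_{i+1}."""
    for WQ, WK, WV in weights:
        X = attention(X @ WQ, X @ WK, X @ WV)
    return X

rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.standard_normal((n, d))
weights = [tuple(0.1 * rng.standard_normal((d, d)) for _ in range(3)) for _ in range(2)]
print(multi_layer(X, weights).shape)                    # (4, 3)
```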
This architecture has particularly played a pivotal role in driving progress across various sub-disciplines of natural language processing (NLP). It has profoundly influenced sectors such as machine translation [32, 13], sentiment analysis [96, 71], language modeling [65], and even the generation of creative text [16, 72]. This trajectory of influence is most prominently embodied by the creation and widespread adoption of Large Language Models (LLMs) like GPT [81] and BERT [25]. These models, along with their successive versions, e.g., GPT-2 [83], GPT-3 [10], PaLM [21], OPT [115], are hallmarks in the field due to their staggering number of parameters and complex architectural designs. These LLMs have achieved unparalleled performance levels, setting new standards in machine understanding and automated text generation [16, 72]. Moreover, their emergence has acted as a catalyst for rethinking what algorithms are capable of, spurring new lines of inquiry and scrutiny within both academic and industrial circles [79]. As these LLMs find broader application across an array of sectors, gaining a thorough understanding of their intricate internal mechanisms is evolving from a topic of scholarly interest into a crucial requirement for their effective and responsible deployment.
Yet, the very complexity and architectural sophistication that propel the success of transformers come with a host of consequential challenges, making their effective and responsible usage nontrivial. Prominent among these challenges is the overarching imperative of ensuring data security and privacy [77, 9, 55]. Within the corridors of the research community, an increasingly pertinent question is emerging regarding the inherent vulnerabilities of these architectures. Specifically,
is it possible to know the input data by analyzing the attention weights and model outputs?
To put it in mathematical terms, given a language model represented as $F$, if one has access to the output $Y = F(X)$ and the attention weights $W$, is it possible to mathematically invert the model to obtain the original input data $X$?
Addressing this line of inquiry extends far beyond the realm of academic speculation; it has direct and significant implications for practical, real-world applications. This is especially true when these transformer models interact with data that is either sensitive in nature, like personal health records [20], or proprietary, as in the financial sector [99]. With the broader deployment of Large Language Models (LLMs) into environments that adhere to stringent data confidentiality regulations, the mandate for achieving absolute data security becomes unequivocally critical. In this work, we aim to delve deeply into this paramount issue, striving to offer a nuanced understanding of these potential vulnerabilities while suggesting pathways for ensuring safety in the development, training, and utilization of transformer technologies.
In this study, we address a distinct problem that differs from the conventional task of finding optimal weights for a given input and output. Specifically, we assume that the weights are already known, and our objective is to invert the input to recover the original data. The key focus of our investigation lies in identifying the conditions under which successful inversion of the original input is feasible. This problem holds significant relevance in the context of addressing security concerns associated with attention networks.
To provide a formal definition of our training objective for data recovery, we aim to optimize a specific criterion that enables effective inversion of the input. By formulating and solving this objective, we aim to gain valuable insights into the security implications and vulnerabilities of attention networks.
Definition 1.2 (Regression model).
Given the attention weights $Q, K, V \in \mathbb{R}^{d \times d}$ and the output $B \in \mathbb{R}^{n \times d}$, the goal is to find $X \in \mathbb{R}^{n \times d}$ such that
$$\min_{X} \; \big\| D(X)^{-1} \exp(X Q K^\top X^\top) X V - B \big\|_F^2,$$
where
- $\exp(\cdot)$ is applied entry-wise, and
- $D(X) := \mathrm{diag}(\exp(X Q K^\top X^\top) \mathbf{1}_n)$ is the diagonal normalization matrix.
In order to establish an understanding of attacks on the above model, we present our main result in the following section.
1.1 Our Result
We state our result as follows:
Theorem 1.3 (Informal version of Theorem J.1).
Consider a model with several layers of attention, where each layer has its own attention weight parameters. Given a desired output $B$, let $X^*$ denote the training data input that produces it.
Next, we choose a good initial point $X_0$ that is close enough to $X^*$. Assume that there exists a scalar $R \geq 1$ bounding the magnitude of the weights and of every entry of $X^*$ and $B$.
Then, for any accuracy parameter $\epsilon > 0$ and failure probability $\delta \in (0,1)$, an algorithm based on the Newton method can be employed to recover the input data. The algorithm is guaranteed, within a bounded number of iterations, to output a matrix $\widetilde{X}$ satisfying $\|\widetilde{X} - X^*\|_F \leq \epsilon$ with probability at least $1 - \delta$.
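As a rough illustration of the kind of iteration the theorem refers to (not the paper's actual procedure, whose Hessian computation and guarantees are developed in the appendix), the following NumPy sketch runs a damped Newton loop from a given initial point; the damping parameter `lam`, the stopping rule, and the toy quadratic objective are our own choices.

```python
import numpy as np

def newton_recover(grad, hess, x0, lam=1e-3, eps=1e-8, max_iter=100):
    """Damped Newton iteration: solve (H + lam*I) dx = -g until the gradient is tiny."""
    x = x0.copy()
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < eps:
            break
        x = x + np.linalg.solve(H + lam * np.eye(len(x)), -g)
    return x

# Toy usage: recover x minimizing ||A x - b||^2, standing in for the attention loss.
A = np.array([[2.0, 0.5], [0.3, 1.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A.T @ (A @ x - b)
hess = lambda x: A.T @ A
print(newton_recover(grad, hess, np.zeros(2)))   # close to the least-squares solution
```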
Roadmap.
We arrange the rest of our paper as follows. In Section 2 we present work related to our topic. In Section 3 we provide preliminaries for our work. In Section 4, we give an overview of our techniques, summarizing the method we use to recover data via attention weights. We conclude our work and propose some future directions in Section 5.
2 Related Works
Attention Computation Theory.
Following the rise of LLMs, numerous studies have emerged on attention computation [53, 94, 18, 107, 95, 84, 75, 109, 3, 93, 27, 102, 54]. LSH techniques can approximate attention, and building on them, the KDEformer offers a notable dot-product attention approximation [107]. Recent works [6, 12, 28] explored diverse attention computation methods and strategies to enhance model efficiency. On the optimization front, [110] highlighted that adaptive methods excel over SGD due to heavy-tailed noise distributions. Other insights include the emergence of the KTIW property [91] and various regression problems inspired by attention computation [33, 63, 59], revealing deeper nuances of attention models.
Security concerns about LLM.
Amid LLM advancements, concerns about misuse have arisen [77, 9, 55, 52, 97, 23, 103, 38, 47, 48, 39, 88]. [77] assesses the privacy risks of capturing sensitive data with eight models and introduces defensive strategies, balancing performance and privacy. [9] asserts that current methods fall short of guaranteeing comprehensive privacy for language models, recommending training only on text intended for public release. [55] reveals that the vulnerability of large language models to privacy attacks is significantly tied to data duplication in training sets, emphasizing that deduplicating this data greatly boosts their resistance to such breaches. [52] devised a way to watermark LLM output without compromising quality or accessing LLM internals. Meanwhile, [97] introduced near access-freeness (NAF), ensuring that generative models, like transformers and image diffusion models, do not closely mimic copyrighted content by more than a bounded number of bits.
Inverting the neural network.
With the rapid development of deep learning, a series of works has focused on inverting neural networks [49, 56, 69, 24, 108]. [49] surveys various techniques for neural network inversion, which involves finding input values that produce desired outputs, and highlights its applications in query-based learning, sonar performance analysis, power system security assessment, control, and codebook vector generation. [56] presents a method for inverting trained neural networks by formulating the problem as a mathematical programming task, enabling various network inversions and enhancing generalization performance. [69] explores the reconstruction of image representations, including CNNs, to assess the extent to which it is possible to recreate the original image, revealing that certain layers in CNNs retain accurate visual information with varying degrees of geometric and photometric invariance. [108] presents a novel generative model-inversion attack method that can effectively reverse deep neural networks, particularly in the context of face image reconstruction, and explores the connection between a model’s predictive ability and vulnerability to such attacks while noting limitations in using differential privacy for defense.
Attacking the Neural Networks.
During the development of artificial intelligence, there have been many works on attacking neural networks [111, 100, 80, 46, 104, 43, 36]. Several studies [111, 100, 80, 104] have warned that local training data can be compromised using only exchanged gradient information. These methods start with dummy data and gradients, and through gradient descent, they empirically show that the original data can be fully reconstructed. A follow-up study [112] specifically focuses on classification tasks and finds that the real labels can also be accurately recovered. Other types of attacks include membership and property inference [86, 67], the use of Generative Adversarial Networks (GANs) [42, 34], and additional machine-learning techniques [68, 74]. A recent paper [101] uses tensor decomposition for gradient leakage attacks but is limited by its inefficiency and focus on over-parametrized networks.
Theoretical Approaches to Understanding LLMs.
Recent strides have been made in understanding and optimizing regression models using various activation functions. Research on over-parameterized neural networks has examined exponential and hyperbolic activation functions for their convergence properties and computational efficiency [33, 63, 27, 38, 61, 40, 90, 87, 23, 22, 85]. Modifications such as regularization terms and algorithmic innovations, like a convergent approximation Newton method, have been introduced to enhance their performance [63, 29]. Studies have also leveraged tensor tricks to vectorize regression models, allowing for advanced Lipschitz and time-complexity analyses [37, 26]. Simultaneously, the field is seeing innovations in optimization algorithms tailored for LLMs. Techniques like block gradient estimators have been employed for huge-scale optimization problems, significantly reducing computational complexity [17]. Unique approaches like Direct Preference Optimization bypass the need for reward models, fine-tuning LLMs based on human preference data [82]. Additionally, advancements in second-order optimizers have relaxed the conventional Lipschitz Hessian assumptions, providing more flexibility in convergence proofs [58]. Also, there is a series of work on understanding fine-tuning [64, 70, 76]. Collectively, these theoretical contributions are refining our understanding and optimization of LLMs, even as they introduce new techniques to address challenges such as non-guaranteed Hessian Lipschitz conditions.
Optimization and Convergence of Deep Neural Networks.
Prior research [57, 31, 7, 8, 1, 2, 89, 15, 113, 14, 105, 73, 51, 60, 45, 114, 11, 109, 92, 4, 66, 106, 33, 63, 78] on the optimization and convergence of deep neural networks has been crucial in understanding their exceptional performance across various tasks. These studies have also contributed to enhancing the safety and efficiency of AI systems. In [33] they define a neural function using an exponential activation function and apply the gradient descent algorithm to find optimal weights. In [63], they focus on the exponential regression problem inspired by the attention mechanism in large language models. They address the non-convex nature of standard exponential regression by considering a regularization version that is convex. They propose an algorithm that leverages input sparsity to achieve efficient computation. The algorithm has a logarithmic number of iterations and requires nearly linear time per iteration, making use of the sparsity of the input matrix.
3 Preliminary
In this section, we present the preliminary concepts and background that form the foundation of our paper. We begin by introducing the notation we use in Section 3.1. In Section 3.2, we introduce a standard method for attacking neural networks by inverting them from their weights and outputs. In Section 3.3, we use a regression form to simplify the training process when the transformer performs back-propagation.
3.1 Notations
We use $\mathbb{R}$ to denote the set of real numbers and $\mathbb{R}^{m \times n}$ to denote the set of $m \times n$ matrices whose entries are real numbers. For any positive integer $n$, we use $[n]$ to denote $\{1, 2, \dots, n\}$. For a matrix $A \in \mathbb{R}^{m \times n}$, we use $A_{i,j}$ to denote the entry of $A$ in the $i$-th row and $j$-th column, for each $i \in [m]$, $j \in [n]$. We use $E_{i,j}$ to denote a matrix whose entries all equal $0$ except for the $(i,j)$-th entry. We use $\mathbf{1}_n$ to denote the length-$n$ vector whose entries are all ones. For a vector $x \in \mathbb{R}^n$, we use $\mathrm{diag}(x)$ to denote the diagonal matrix with $\mathrm{diag}(x)_{i,i} = x_i$ and all off-diagonal entries zero. For a diagonal matrix $D$, we use $D^{-1}$ to denote the diagonal matrix whose $i$-th diagonal entry is $D_{i,i}^{-1}$, with all off-diagonal entries zero. Given two vectors $x, y \in \mathbb{R}^n$, we use $x \circ y$ to denote the length-$n$ vector whose $i$-th entry is $x_i y_i$. For a matrix $A$, we use $A^\top$ to denote the transpose of $A$. For a vector $x \in \mathbb{R}^n$, we use $\exp(x)$ to denote the length-$n$ vector with $\exp(x)_i = \exp(x_i)$ for all $i \in [n]$. For a matrix $A$, we use $\exp(A)$ to denote the matrix with $\exp(A)_{i,j} = \exp(A_{i,j})$. For any matrix $A$, we define the Frobenius norm $\|A\|_F := (\sum_{i,j} A_{i,j}^2)^{1/2}$. For a vector $x$, we use $\|x\|_2$ to denote its $\ell_2$ norm.
3.2 Model Inversion Attack
A model inversion attack is a type of adversarial attack in which a malicious user attempts to recover the private dataset used to train a supervised machine learning model $F$. The goal of a model inversion attack is to generate realistic and diverse samples that accurately describe each class in the private dataset.
The attacker typically has access to the trained model $F$ and can use it to make predictions on input data $x$. By carefully crafting input data and observing the model’s predictions, the attacker can infer information about the training data.
Model inversion attacks can be a significant privacy concern, as they can potentially reveal sensitive information about individuals or organizations. These attacks exploit vulnerabilities in the model’s behavior and can be used to extract information that was not intended to be disclosed.
Model inversion attacks can be formulated as an optimization problem. Given the output $y$, the model function $F_\theta$ with parameters $\theta$, and the loss function $L$, the objective of a model inversion attack is to find an input $x$ that minimizes the loss between the model’s prediction $F_\theta(x)$ and the target output $y$. Mathematically, this can be expressed as
$$x^* = \arg\min_{x} L(F_\theta(x), y).$$
Since the loss function is convex with respect to the optimization variable $x$, we can employ a simple gradient-based method for the model inversion attack, which involves the following steps:
1. Initialize a dummy input $x$.
2. Compute the gradient $\nabla_x L(F_\theta(x), y)$.
3. Update $x$ with a learning rate $\eta$ via $x \leftarrow x - \eta \nabla_x L(F_\theta(x), y)$.
This iterative process aims to find an input that minimizes the loss between the model’s prediction and the target output. By updating $x$ in the direction opposite to the gradient, the attack can potentially converge to an input that generates a prediction close to the desired output, thereby inverting the model. In this work, we focus our effort on attention models (which is natural given the explosive development of LLMs); in this case, the parameters of our model consist of the attention weights $Q$, $K$, and $V$. Throughout the paper, to avoid abuse of notation, we use $B$ to denote the ground-truth label.
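To make the three steps above concrete, here is a hedged PyTorch sketch of the inversion loop, with a small frozen feed-forward network standing in for the attention model; the architecture, learning rate, and iteration count are illustrative choices rather than the paper's setup.

```python
import torch

torch.manual_seed(0)
# A frozen toy network stands in for the attention model F_theta under attack.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Tanh(), torch.nn.Linear(8, 4))
for p in model.parameters():
    p.requires_grad_(False)            # the attacker only queries the model

x_true = torch.randn(1, 8)             # private input (unknown to the attacker)
y_target = model(x_true)               # observed output the attacker wants to match

x = torch.randn(1, 8, requires_grad=True)     # step 1: initialize a dummy input
lr = 0.1
for _ in range(500):
    loss = ((model(x) - y_target) ** 2).sum()
    grad, = torch.autograd.grad(loss, x)      # step 2: gradient of the loss w.r.t. x
    with torch.no_grad():
        x -= lr * grad                        # step 3: gradient descent update on x
print(float(loss))                            # near zero once the output is matched
```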
3.3 Regression Problem Inspired by Attention Computation
In this paper, we extend the prior work of [37] and focus on the training process of the attention mechanism in the context of the Transformer model. We decompose the training procedure into a regression form based on the insights provided by [27].
Specifically, we investigate the training process for a specific layer, denoted as the $i$-th layer, and consider the case of single-headed attention. In this setting, we have an input matrix $X \in \mathbb{R}^{n \times d}$, a target matrix $B \in \mathbb{R}^{n \times d}$, and trained weights $Q, K, V$ of the attention architecture. The objective of the training process in the Transformer model is to minimize the loss function by utilizing back-propagation.
The loss function, denoted as $L(X)$, is defined as follows:
$$L(X) := \big\| D(X)^{-1} \exp(X Q K^\top X^\top) X V - B \big\|_F^2,$$
where $D(X) := \mathrm{diag}(\exp(X Q K^\top X^\top) \mathbf{1}_n)$, so that each row of $D(X)^{-1}\exp(XQK^\top X^\top)$ corresponds to a softmax function.
The goal of minimizing this loss function is to align the predicted output, obtained by applying the attention mechanism, with the target matrix $B$.
4 Recovering Data via Attention Weights
In this section, we propose our theoretical method to recover the training data from trained transformer weights and outputs. We make the method tractable by proving that the Hessian of our training objective is Lipschitz continuous and positive definite. In Section 4.1, we provide a detailed description of our approach. In Section 4.2, we describe the Hessian decomposition. In Section 4.3, we show that the Hessian of the training objective is Lipschitz continuous. In Section 4.4, we show that the Hessian of the training objective is positive definite.
4.1 Training Objective of Attention Inversion Attack
In this study, we propose a novel technique for inverting the attention weights of a transformer model using Hessian decomposition. Our aim is to find the input $X$ that minimizes the Frobenius norm of the difference between the attention output $D(X)^{-1}\exp(XQK^\top X^\top)XV$ and the desired output $B$, where $Q, K, V$ represent the attention weights and $D(X)$ is a diagonal normalization matrix.
To achieve this, we introduce an algorithm that minimizes the loss function $L(X)$, defined as follows:
$$L(X) := \big\| D(X)^{-1} \exp(X Q K^\top X^\top) X V - B \big\|_F^2 + L_{\mathrm{reg}}(X), \qquad (1)$$
where $B$ is the matrix of target values and $L_{\mathrm{reg}}(X)$ captures any additional regularization terms. This loss function quantifies the discrepancy between the expected output and the actual output of the transformer.
In our approach, we leverage Hessian decomposition to efficiently compute the Hessian matrix and apply a second-order method to approximate the optimal input $X^*$. By utilizing the Hessian, we can gain insights into the curvature of the loss function and improve the efficiency of optimization. This approach enables us to efficiently find an approximate solution for the input that minimizes the loss function, thereby inverting the attention weights of the transformer model.
By integrating Hessian decomposition and second-order optimization techniques ([5, 62, 19, 50, 44, 35, 41]), our proposed algorithm provides a promising approach for addressing the challenging task of inverting attention weights in transformer models.
Due to the complexity of the loss function (Eq. (1)), directly computing its Hessian is challenging or even impossible. To simplify the computation, we introduce several notations (see Figure 2 for a visualization): the exponential function (Definition B.4), the softmax normalizer (Definition B.5), the softmax probability function (Definition B.6), the value function (Definition B.7), and the one-unit loss function (Definition B.8).
Using these terms, we can express the loss function as the sum over all elements:
This allows us to break down the computation into several steps. Specifically, we start by computing the gradients of the predefined terms. Given two integers $j_1 \in [n]$ and $i_1 \in [d]$, we define $E_{j_1,i_1}$ as a matrix whose entries are all zero except for the $(j_1, i_1)$ entry. Additionally, we denote $j_2$ and $i_2$ as two other integers, and use $x_{j_2,i_2}$ to represent the entry of $X$ in the $j_2$-th row and $i_2$-th column.
We can now express the gradient of $L$ with respect to a single entry of $X$ in two cases, depending on how the differentiation indices relate to the indices of the term under consideration; the precise case conditions and expressions are given in Appendix B.
By decomposing the Hessian into several cases (see Section F for details), we can calculate the final Hessian. Similar to the approach used when computing the gradients, we introduce two additional integers $j_2$ and $i_2$ and consider the Hessian entry associated with the pair of entries $x_{j_1,i_1}$ and $x_{j_2,i_2}$. The computation breaks down into four cases, according to which of the row and column indices of the two entries coincide; one of these cases turns out to be equivalent to another, so it does not need to be handled separately. By considering these cases, we can calculate the Hessian for each pair of entries of $X$. This allows us to gain further insights into the curvature of the loss function and optimize the parameters more effectively.
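Since the hand-derived Hessian is assembled entry by entry over these index cases, a quick numerical cross-check can be useful. The sketch below (our own illustration, not part of the paper's derivation) computes the Hessian of a loss of the assumed form $\|D(X)^{-1}\exp(XQK^\top X^\top)XV - B\|_F^2$ over the flattened input via `torch.autograd.functional.hessian`; the exact loss form, the sizes, and the factor $1/2$ are assumptions on our part.

```python
import torch

torch.manual_seed(0)
n, d = 3, 2
Q, K, V = (torch.randn(d, d) for _ in range(3))   # fixed attention weights
B = torch.randn(n, d)                             # target output

def loss_flat(x_flat):
    """An attention-inversion loss of the form assumed above, over vec(X)."""
    X = x_flat.reshape(n, d)
    A = torch.exp(X @ Q @ K.T @ X.T)              # unnormalized attention scores
    S = A / A.sum(dim=1, keepdim=True)            # row-wise softmax, i.e. D(X)^{-1} A
    return 0.5 * ((S @ X @ V - B) ** 2).sum()

x0 = torch.randn(n * d)
H = torch.autograd.functional.hessian(loss_flat, x0)   # (n*d) x (n*d) Hessian
# Entry H[j1*d + i1, j2*d + i2] is the second derivative w.r.t. X[j1, i1] and X[j2, i2],
# so the index cases discussed above correspond to blocks of this matrix.
print(H.shape, bool(torch.allclose(H, H.T, atol=1e-5)))   # the Hessian is symmetric
```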
4.2 Hessian Decomposition
By considering the different index conditions of the Hessian, we obtain the following decomposition.
Definition 4.1 (Hessian of functions of matrix).
We define the Hessian of $L$ by considering its Hessian with respect to the entries of $X$; that is, the Hessian is the matrix whose entries are the second partial derivatives of $L$ with respect to pairs of entries of $X$.
Definition 4.2 (Hessian split).
We split the Hessian of $L$ into the following cases, according to which equalities hold among the row and column indices of the two entries of $X$ being differentiated; the five cases are stated precisely in Definition E.2. In each case, the corresponding component is a matrix whose entries are the associated second partial derivatives of $L$.
Utilizing the above definitions, we split the Hessian into a block partition, with each block given by the component of the matching case.
Definition 4.3.
We define the resulting block Hessian as follows:
4.3 Hessian of $L$ is Lipschitz Continuous
We present our findings that establish the Lipschitz continuity of the Hessian of $L$, a highly desirable property in optimization. This property signifies that the second derivatives of $L$ change smoothly within a defined range. Leveraging this Lipschitz property enables us to employ gradient-based methods with guaranteed convergence rates and enhanced stability. Consequently, our results validate the feasibility of using the proposed training objective to achieve convergence in the model inversion attack. This finding holds significant promise for the development of efficient and effective optimization strategies in this context.
4.4 Hessian of $L$ is Positive Definite
After computing the Hessian of $L$, we now show that it is positive definite under proper regularization. Therefore, we can apply a modified Newton’s method to approach the optimal solution.
Lemma 4.5 (PSD bounds for ).
Therefore, we define the regularization term as follows to obtain the PSD guarantee.
Definition 4.6 (Regularization).
Let , we define
With the above properties of the loss function, we obtain the convergence result in Theorem 1.3.
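The role of the regularization term can be illustrated numerically: adding a sufficiently large diagonal term shifts every eigenvalue of the Hessian above zero, which is what makes the modified Newton step well-defined. The sketch below uses a random symmetric matrix as a stand-in for the Hessian of $L$, and the choice of the shift `lam` is ours, not the bound from Lemma 4.5.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
H = (M + M.T) / 2                        # symmetric stand-in for the loss Hessian
lam = abs(min(np.linalg.eigvalsh(H).min(), 0.0)) + 1e-3
H_reg = H + lam * np.eye(6)              # Hessian plus a diagonal regularization term

print(np.linalg.eigvalsh(H).min())       # typically negative: H alone is indefinite
print(np.linalg.eigvalsh(H_reg).min())   # strictly positive: Newton steps are well-defined
```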
5 Conclusion and Future Discussion
In this study, we have presented a theoretical approach for inverting input data using weights and outputs. Our investigation delved into the mathematical frameworks that underpin the attention mechanism, with the aim of determining whether knowledge of attention weights and model outputs could enable the reconstruction of sensitive information from the input data. The insights gained from this research are intended to deepen our understanding and facilitate the development of more secure and robust transformer models. By doing so, we strive to foster responsible and ethical advancements in the field of deep learning.
This work lays the groundwork for future research and development aimed at fortifying transformer technologies against potential threats and vulnerabilities. Our ultimate goal is to enhance the safety and effectiveness of these groundbreaking models across a wide range of applications. By addressing potential risks and ensuring the integrity of sensitive information, we aim to create a more secure and trustworthy environment for the deployment of transformer models.
Roadmap.
We arrange the appendix as follows. In Section A, we provide several preliminary notations. In Section B we provide details of computing the gradients. In Sections C and D we provide details of computing the Hessian for two cases. In Section E we show how to split the Hessian matrix. In Section F we combine the preceding results and compute the Hessian of the loss function. In Section G we bound the basic functions to be used later. In Section H we prove the Lipschitz property of the Hessian of the loss function. In Section I we prove the PSD bounds that establish strong convexity. We provide our final result in Section J.
Appendix A Notations
We use $\mathbb{R}$ to denote the set of real numbers and $\mathbb{R}^{m \times n}$ to denote the set of $m \times n$ matrices whose entries are real numbers. For any positive integer $n$, we use $[n]$ to denote $\{1, 2, \dots, n\}$. For a matrix $A \in \mathbb{R}^{m \times n}$, we use $A_{i,j}$ to denote the entry of $A$ in the $i$-th row and $j$-th column, for each $i \in [m]$, $j \in [n]$. We use $E_{i,j}$ to denote a matrix whose entries all equal $0$ except for the $(i,j)$-th entry. We use $\mathbf{1}_n$ to denote the length-$n$ vector whose entries are all ones. For a vector $x \in \mathbb{R}^n$, we use $\mathrm{diag}(x)$ to denote the diagonal matrix with $\mathrm{diag}(x)_{i,i} = x_i$ and all off-diagonal entries zero. For a diagonal matrix $D$, we use $D^{-1}$ to denote the diagonal matrix whose $i$-th diagonal entry is $D_{i,i}^{-1}$, with all off-diagonal entries zero. Given two vectors $x, y \in \mathbb{R}^n$, we use $x \circ y$ to denote the length-$n$ vector whose $i$-th entry is $x_i y_i$. For a matrix $A$, we use $A^\top$ to denote the transpose of $A$. For a vector $x \in \mathbb{R}^n$, we use $\exp(x)$ to denote the length-$n$ vector with $\exp(x)_i = \exp(x_i)$ for all $i \in [n]$. For a matrix $A$, we use $\exp(A)$ to denote the matrix with $\exp(A)_{i,j} = \exp(A_{i,j})$. For any matrix $A$, we define the Frobenius norm $\|A\|_F := (\sum_{i,j} A_{i,j}^2)^{1/2}$. For a vector $x$, we use $\|x\|_2$ to denote its $\ell_2$ norm.
Appendix B Gradients
Here in this section, we provide analysis for the gradient computation. In Section B.1 we state some facts to be used. In Section B.2 we provide some definitions. In Sections B.3, B.4, B.5, B.6, B.7, B.8 and B.9 we compute the gradient for the terms defined respectively. Finally in Section B.10 we compute the gradient for .
B.1 Facts
Fact B.1 (Basic algebra).
We have
-
•
.
-
•
-
•
Fact B.2 (Basic calculus rule).
We have
-
•
(here can be any variable)
-
•
-
•
-
•
where is a vector that only -th entry is and zero everywhere else.
-
•
Let , let be independent of , we have .
-
•
Let , we have
-
•
Let ,
-
•
Let , we have
B.2 Definitions
Definition B.3 (Simplified notations).
We have following definitions
-
•
We use to denote the -th entry of .
-
•
We use to denote the -th entry of .
-
•
We define to denote the -th row of . (In the proof, we treat as a column vector).
-
•
We define to denote the -th column of .
-
•
We define to denote the scalar equals to the entry in -th row, -th column of .
-
•
We define to denote the -th column of .
-
•
We define to denote the scalar equals to the entry in -th row, -th column of .
-
•
We define to denote the -th column of .
-
•
We define to denote the scalar equals to the entry in -th column, -th row of .
Definition B.4 (Exponential function ).
If the following conditions hold
-
•
Let
-
•
Let
For each , we define as follows
Definition B.5 (Sum function of softmax ).
If the following conditions hold
-
•
Let
-
•
Let be defined as Definition B.4
We define for all as follows
Definition B.6 (Softmax probability function ).
Definition B.7 (Value function ).
If the following conditions hold
-
•
Let
-
•
Let
We define for each as follows
Definition B.8 (One-unit loss function ).
Definition B.9 (Overall function ).
B.3 Gradient for each column of
Lemma B.10.
We have
-
•
Part 1. Let ,
-
•
Part 2 Let ,
Proof.
Proof of Part 1.
where the 1st step follows from Fact B.2, the 2nd step follows from the simple derivative rule, the 3rd step is simple algebra, and the 4th step follows by definition.
Proof of Part 2
where the 1st step follows from Fact B.2, the 2nd step follows from simple derivative rule, the 3rd is simple algebra. ∎
B.4 Gradient for
Lemma B.11.
Under following conditions
-
•
Let be defined as Definition B.4
We have
-
•
Part 1. For each ,
-
•
Part 2 For each ,
Proof.
Proof of Part 1
where the 1st step and the 3rd step follow from Definition of (see Definition B.4), the 2nd step follows from Fact B.2, the 4th step follows by Lemma B.10.
Proof of Part 2
where the 1st step and the 3rd step follow from Definition of (see Definition B.4), the 2nd step follows from Fact B.2, the 4th step follows by Lemma B.10.
∎
B.5 Gradient Computation for
Lemma B.12 (A generalization of Lemma 5.6 in [27]).
If the following conditions hold
-
•
Let be defined as Definition B.5
Then, we have
-
•
Part 1. For each ,
-
•
Part 2. For each ,
Proof.
Proof of Part 1.
where the 1st step follows from the definition of (see Definition B.5), the 2nd step follows from Fact B.2, the 3rd step follows from Lemma B.11, the 4th step is rearrangement, the 5th step is derived by Fact B.1, the last step is by the definition of .
Proof of Part 2.
where the 1st step follows from the definition of (see Definition B.5), the 2nd step follows from Fact B.2, the 3rd step follows from Lemma B.11, the 4th step is rearrangement, the 5th step is derived by Fact B.1.
∎
B.6 Gradient Computation for
B.7 Gradient for
Lemma B.14.
If the following conditions hold
-
•
Let be defined as Definition B.6
Then, we have
-
•
Part 1. For all ,
-
•
Part 2. For all ,
B.8 Gradient for
Lemma B.15.
Proof.
where the first step is by definition of (see Definition B.7), the 2nd and the 3rd step are by differentiation rules, the 4th step is by simple algebra. ∎
B.9 Gradient for
Lemma B.16.
If the following conditions hold
-
•
Let be defined as Definition B.8
-
•
Let
Then, we have
-
•
Part 1. For all ,
where we have definitions:
-
–
-
–
-
–
-
–
-
–
-
–
-
•
Part 2. For all ,
where we have definitions:
-
–
-
*
This is corresponding to
-
*
-
–
-
*
This is corresponding to
-
*
-
–
-
*
This is corresponding to
-
*
-
–
Proof.
Proof of Part 1
where the first step is by definition of (see Definition B.8), the 2nd step is because is independent of , the 3rd step is by Fact B.2, the 4th step uses Lemma B.15, the 5th step uses Lemma B.14, the 6th and 8th step are rearrangement of terms, the 7th step holds by the definition of (see Definition B.6).
B.10 Gradient for
Lemma B.17.
Proof.
The result directly follows by chain rule. ∎
Appendix C Hessian case 1:
Here in this section, we provide Hessian analysis for the first case. In Sections C.1, C.2, C.3, C.4, C.5, C.6 and C.8, we calculate the derivative for several important terms. In Section C.9, C.10, C.11, C.12 and C.13 we calculate derivative for and respectively. Finally in Section C.14 we calculate derivative of .
Now, we list some simplified notations which will be used in following sections.
Definition C.1.
We have following definitions to simplify the expression.
-
•
-
•
-
•
-
•
-
•
C.1 Derivative of Scalar Function
Lemma C.2.
We have
-
•
Part 1 For ,
-
•
Part 2 For ,
C.2 Derivative of Vector Function
Lemma C.3.
We have
-
•
Part 1 For ,
-
•
Part 2 For ,
C.3 Derivative of Scalar Function
Lemma C.4.
C.4 Derivative of Scalar Function
Lemma C.5.
C.5 Derivative of Scalar Function
Lemma C.6.
If the following holds:
-
•
Let be defined as Definition B.6
-
•
Let
-
•
Let
We have
-
•
Part 1 For ,
-
•
Part 2 For ,
C.6 Derivative of Scalar Function
Lemma C.7.
C.7 Derivative of Scalar Function
Lemma C.8.
C.8 Derivative of Vector Function
Lemma C.9.
C.9 Derivative of
| ID | Term | Symmetric? | Table Name |
|---|---|---|---|
| 1 | | Yes | N/A |
| 2 | | Yes | N/A |
| 3 | | No | Table 4: 1 |
| 4 | | No | Table 5: 1 |
| 5 | | Yes | N/A |
| 6 | | No | Table 2: 7 |
| 7 | | No | Table 2: 9 |
| 8 | | No | Table 2: 1 |
Lemma C.10.
If the following holds:
-
•
Let be defined as in Lemma B.16
-
•
Let
-
•
Let
We have
-
•
Part 1 For ,
-
•
Part 2 For ,
C.10 Derivative of
| ID | Term | Symmetric Terms | Table Name |
|---|---|---|---|
| 1 | | No | Table 1: 9 |
| 2 | | Yes | N/A |
| 3 | | No | Table 3: 3 |
| 4 | | No | Table 4: 2 |
| 5 | | No | Table 5: 2 |
| 6 | | Yes | N/A |
| 7 | | No | Table 1: 6 |
| 8 | | Yes | N/A |
| 9 | | No | Table 1: 7 |
Lemma C.11.
If the following holds:
-
•
Let be defined as in Lemma B.16
-
•
We define .
We have
-
•
Part 1 For ,
-
•
Part 2 For ,
C.11 Derivative of
| ID | Term | Symmetric Terms | Table Name |
|---|---|---|---|
| 1 | | Yes | N/A |
| 2 | | Yes | N/A |
| 3 | | No | Table 2: 3 |
| 4 | | No | Table 4: 3 |
| 5 | | No | Table 5: 3 |
| 6 | | No | Table 4: 5 |
Lemma C.12.
Proof.
Proof of Part 1
where the first step is by definition of (see Lemma B.16), the 2nd step is by Fact B.2, the 3rd step is by Lemma C.2, the 4th step is because Lemma C.7, the 5th step is a rearrangement.
Proof of Part 2
where the first step is by definition of (see Lemma B.16), the 2nd step is by Fact B.2, the 3rd step is by Lemma C.2, the 4th step is because Lemma C.7, the 5th step is a rearrangement.
∎
C.12 Derivative of
| ID | Term | Symmetric? | Table Name |
|---|---|---|---|
| 1 | | No | Table 1: 3 |
| 2 | | No | Table 2: 4 |
| 3 | | No | Table 3: 4 |
| 4 | | Yes | N/A |
| 5 | | No | Table 3: 6 |
| 6 | | No | Table 5: 4 |
Lemma C.13.
Proof.
Proof of Part 1
where the first step is by definition of (see Lemma B.16), the 2nd step is by Fact B.2, the 3rd step is by Lemma B.15, the 4th step is because Lemma C.9, the 5th step is a rearrangement.
Proof of Part 2
where the first step is by definition of (see Lemma B.16), the 2nd step is by Fact B.2, the 3rd step is by Lemma B.15, the 4th step is because Lemma C.9, the 5th step is a rearrangement.
∎
C.13 Derivative of
Lemma C.14.
C.14 Derivative of
Lemma C.15.
If the following holds:
-
•
Let be defined as in Definition B.8
We have
-
•
Part 1 For ,
where we have following definitions
-
•
Part 2 For ,
where we have following definitions
Proof.
The proof is a combination of derivatives of in this section.
Notice that the symmetry for Part 1 is verified by the tables in this section. ∎
Appendix D Hessian case 2:
In this section, we focus on the second case of Hessian. In Sections D.1, D.2, D.3, D.4 and D.5, we calculated derivative of some important terms. In Sections D.6, D.7 and D.8 we calculate derivative of , and respectively. And in Section D.9 we calculate the derivative of .
D.1 Derivative of scalar function
Lemma D.1.
If the following holds:
-
•
Let be defined as Definition B.6
-
•
For ,
We have
-
•
Part 1. For ,
-
•
Part 2. For ,
D.2 Derivative of scalar function
Lemma D.2.
If the following holds:
-
•
Let be defined as Definition B.7
-
•
For ,
We have
-
•
Part 1. For ,
-
•
Part 2. For ,
D.3 Derivative of scalar function
Lemma D.3.
D.4 Derivative of scalar function
Lemma D.4.
If the following holds:
-
•
Let be defined as Definition B.6
-
•
For ,
We have
-
•
Part 1. For ,
-
•
Part 2. For ,
Proof.
Proof of Part 1
where the first step follows from simple differential rule, the second step follows from Lemma D.1, the third step follows from , the last step follows from simple algebra.
Proof of Part 2
where the first step follows from simple differential rule, the second step follows from Lemma D.1, the third step follows from , the last step follows from simple algebra. ∎
D.5 Derivative of scalar function
Lemma D.5.
D.6 Derivative of
Lemma D.6.
If the following holds:
-
•
Let be defined as in Lemma B.16
-
•
For ,
We have
-
•
Part 1 For ,
-
•
Part 2 For ,
D.7 Derivative of
Lemma D.7.
D.8 Derivative of
Lemma D.8.
If the following holds:
-
•
Let be defined as in Lemma B.16
-
•
For ,
We have
-
•
Part 1. For ,
-
•
Part 2. For ,
D.9 Derivative of
Lemma D.9.
Proof.
Proof of Part 1.
where the first step follows from Lemma B.16, the second step follows from previous results in this section, the last step is a rearrangement.
Proof of Part 2.
where the first step follows from Lemma B.16, the second step follows from Lemma D.6, the third step follows from Part 2 of Lemma D.7, the last step follows from Lemma D.8.
Notice that, by our construction, Part 1 should be symmetric w.r.t. , Part 2 should be symmetric w.r.t. , which are all satisfied. ∎
Appendix E Hessian Reformulation
In this section, we provide a reformulation of the Hessian formula, which simplifies our calculation and analysis. In Section E.1 we show how we split the Hessian. In Sections E.2 through E.5 we give the decomposition for each part.
E.1 Hessian split
Definition E.1 (Hessian of functions of matrix).
We define the Hessian of by considering its Hessian with respect to . This means that, is a matrix with its -th entry being
Definition E.2 (Hessian split).
We split the hessian of into following cases
-
•
Part 1: :
-
•
Part 2: , :
-
•
Part 3: , :
-
•
Part 4: , , :
-
•
Part 5: , , :
In above, is a matrix with its -th entry being
Utilizing the above definitions, we split the Hessian into a partition whose components are given by the cases above.
Definition E.3.
We define to be as following
E.2 Decomposition Hessian : Part 1
Lemma E.4 (Helpful lemma).
Under following conditions
-
•
Let
-
•
Let
we have
-
•
Part 1:
-
•
Part 2:
Proof.
Proof of Part 1
where the first step is by definition, the 2nd and 3rd steps follow from linear algebra facts, and the 4th step is again by definition.
Proof of Part 2
where the first step is by definition, the 2nd, 3rd, and 4th steps follow from linear algebra facts, and the 5th step is again by definition. ∎
Lemma E.5.
Proof.
This lemma follows from Lemma E.4 and linear algebra facts. ∎
Based on the above auxiliary lemma, we have the following definition.
Definition E.6.
Under following conditions
-
•
Let
-
•
Let
We present the Case 1 component of Hessian to be
where we have
E.3 Decomposition Hessian: Part 2 and Part 3
Lemma E.7.
Proof.
This lemma follows from Lemma E.4 and linear algebra facts. ∎
Based on the above auxiliary lemma, we have the following definition.
Definition E.8.
Under following conditions
-
•
Let
-
•
Let
We present the Case 2 component of Hessian to be
where we have
Next, we define the third case by the symmetricity of Hessian.
Definition E.9.
We present the Case 3 component of Hessian to be
E.4 Decomposition Hessian : Part 4
Lemma E.10.
Proof.
This lemma follows from Lemma E.4 and linear algebra facts. ∎
Based on the above auxiliary lemma, we have the following definition.
Definition E.11.
Under following conditions
-
•
Let
-
•
Let
We present the Case 4 component of Hessian to be
where we have
E.5 Decomposition Hessian : Part 5
Lemma E.12.
Proof.
This lemma follows from Lemma E.4 and linear algebra facts. ∎
Based on the above auxiliary lemma, we have the following definition.
Definition E.13.
Under following conditions
-
•
Let
-
•
Let
We present the Case 5 component of Hessian to be
where we have
Appendix F Hessian of loss function
In this section, we provide the Hessian of our loss function.
Lemma F.1 (A single entry).
Proof.
Proof of Part 1:
where the first step is given by the chain rule, and the 2nd step is given by the product rule. ∎
Lemma F.2 (Matrix Representation of Hessian).
Proof.
This is directly given by the single-entry representation in Lemma F.1. ∎
Appendix G Bounds for basic functions
In this section, we prove the upper bound for each function, under the following assumption about the domain of the parameters. In Section G.1 we bound the basic terms. In Sections G.2 and G.3 we bound the gradients of two key functions, and in Section G.4 we bound the Hessian.
Assumption G.1 (Bounded parameters).
Let be defined as in Section B.2,
-
•
Let be some fixed constant satisfies
-
•
We have , , where is the matrix spectral norm
-
•
We have
G.1 Bounds for basic functions
Lemma G.2.
Under Assumption G.1, for all , we have following bounds:
-
•
Part 1
-
•
Part 2
-
•
Part 3
-
•
Part 4
-
•
Part 5
-
•
Part 6
-
•
Part 7
Proof.
Proof of Part 1
The proof is similar to [30], and hence is omitted here.
Proof of Part 2
where the first step is by Definition B.7, the 2nd step is by basic algebra, the 3rd follows by Assumption G.1.
Proof of Part 3
where the first step is by Definition B.8, the 2nd step uses the triangle inequality, the 3rd step uses the Cauchy-Schwarz inequality, and the 4th step is by Assumption G.1 and Part 2.
Proof of Part 5
where the first step is by definition, the 2nd step is the Cauchy-Schwarz inequality, and the 3rd step is by Assumption G.1.
Proof of Part 6
where the first step is by definition, the 2nd step is the Cauchy-Schwarz inequality, and the 3rd step is by Assumption G.1.
Proof of Part 7
where the first step is by definition, the 2nd step is the Cauchy-Schwarz inequality, and the 3rd step is by Part 1 and Part 2. ∎
G.2 Bounds for gradient of
Lemma G.3.
G.3 Bounds for gradient of
Lemma G.4.
G.4 Bounds for Hessian of
Lemma G.5.
Proof.
The proof is similar to that of Lemma G.4 and is hence omitted. ∎
Appendix H Lipschitz of Hessian
In Section H.1 we provide tools and facts. In Sections H.2, H.3, H.4, H.5, H.6, H.7 and H.8 we prove the Lipschitz property of several important terms. Finally, in Section H.9 we prove the Lipschitz property of the Hessian of $L$.
H.1 Facts and Tools
In this section, we introduce two tools for effectively calculating the Lipschitz constant of the Hessian.
Fact H.1 (Mean value theorem for vector function, Fact 34 in [30]).
Under following conditions,
-
•
Let where is an open convex domain
-
•
Let be a differentiable vector function on
-
•
Let for all , where denotes a matrix which its -th term is
then we have
Fact H.2 (Lipschitz for product of functions).
Under following conditions
-
•
Let be a sequence of function with same domain and range
-
•
For each we have
-
–
is bounded: with
-
–
is Lipschitz continuous:
-
–
Then we have
Proof.
We prove it by mathematical induction. The base case holds trivially.
Now assume the claim holds for the previous case. Considering the next one, we have
where the first step is by the triangle inequality, the 2nd step is by the property of norms, the 3rd step is by the upper bound on the functions, the 4th step is by the induction hypothesis, the 5th step is by the Lipschitz continuity of the last factor, the 6th step is by the stated bounds, and the 7th step is a rearrangement.
Since the claim holds in this case as well, the desired result follows by induction. ∎
H.2 Lipschitz for
Definition H.3 (Notation of norm).
For writing efficiency, we use to denote , which is equivalent to .
Lemma H.4.
H.3 Lipschitz for
Lemma H.5.
H.4 Lipschitz for
Lemma H.6.
H.5 Lipschitz for
Lemma H.7.
H.6 Lipschitz for
Lemma H.8.
H.7 Lipschitz for first order derivative of
Lemma H.9.
H.8 Lipschitz for second order derivative of
Lemma H.10.
H.9 Lipschitz for Hessian of
Lemma H.11.
Proof.
Recall that
For the first item , we have
where the 2nd step is by triangle inequality, the 3rd step is by Lemma G.4, the 4th step uses Lemma H.9.
For the 2nd item , we have
where the 2nd step is by triangle inequality, the 3rd step uses Lemma G.2, the 4th step uses Lemma H.5, the 5th step uses Lemma H.10, the last step uses Lemma G.5.
Combining the above 2 items, we have
Then, we have
where the 1st step is by matrix calculus, and the 2nd is by the Lipschitz property of each entry. ∎
Appendix I Strongly Convexity
In this section, we provide proofs of the PSD bounds for the Hessian of the loss function.
I.1 PSD bounds for Hessian of
Lemma I.1 (PSD bounds for ).
Proof.
We prove this statement by the definition of PSD. Let be a vector. Let , we use to denote the vector formed by the -th term to the -th term of vector .
I.2 PSD bounds for Hessian of loss
Lemma I.2 (PSD bound for ).
Appendix J Final Result
Theorem J.1 (Formal of Theorem 1.3, Main Result).
We assume our model satisfies the following conditions
-
•
Bounded parameters: there exists a constant $R$ such that
-
–
,
-
–
-
–
where denotes the -th entry of
-
–
-
•
Regularization: we consider the following problem:
-
•
Good initial point: We choose an initial point such that , where
Then, for any accuracy parameter $\epsilon > 0$ and failure probability $\delta \in (0,1)$, an algorithm based on the Newton method can be employed to recover the input data. The algorithm is guaranteed, within a bounded number of iterations, to output a matrix $\widetilde{X}$ satisfying $\|\widetilde{X} - X^*\|_F \leq \epsilon$ with probability at least $1 - \delta$. The execution time of each iteration is also bounded.
References
- [1] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
- [2] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. Advances in neural information processing systems, 32, 2019.
- AG [23] Sanjeev Arora and Anirudh Goyal. A theory for emergence of complex skills in language models. arXiv preprint arXiv:2307.15936, 2023.
- ALS+ [23] Josh Alman, Jiehao Liang, Zhao Song, Ruizhe Zhang, and Danyang Zhuo. Bypass exponential time preprocessing: Fast neural network training via weight-data correlation preprocessing. In NeurIPS. arXiv preprint arXiv:2211.14227, 2023.
- Ans [00] Kurt M Anstreicher. The volumetric barrier for semidefinite programming. Mathematics of Operations Research, 2000.
- AS [23] Josh Alman and Zhao Song. Fast attention requires bounded entries. In NeurIPS. arXiv preprint arXiv:2302.13214, 2023.
- [7] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International conference on machine learning, pages 242–252. PMLR, 2019.
- [8] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. Advances in neural information processing systems, 32, 2019.
- BLM+ [22] Hannah Brown, Katherine Lee, Fatemehsadat Mireshghallah, Reza Shokri, and Florian Tramèr. What does it mean for a language model to preserve privacy? In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2280–2292, 2022.
- BMR+ [20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- BPSW [20] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (overparametrized) neural networks in near-linear time. arXiv preprint arXiv:2006.11648, 2020.
- BSZ [23] Jan van den Brand, Zhao Song, and Tianyi Zhou. Algorithm and hardness for dynamic attention maintenance in large language models. arXiv preprint arXiv:2304.02207, 2023.
- CCB [18] Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. Fine-grained attention mechanism for neural machine translation. Neurocomputing, 284:171–176, 2018.
- CG [19] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in neural information processing systems, 32, 2019.
- CGH+ [19] Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. Gram-gauss-newton method: Learning overparameterized neural networks for regression problems. arXiv preprint arXiv:1905.11675, 2019.
- Cha [22] ChatGPT. Optimizing language models for dialogue. OpenAI Blog, November 2022.
- CLMY [21] HanQin Cai, Yuchen Lou, Daniel Mckenzie, and Wotao Yin. A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization. arXiv preprint arXiv:2102.10707, 2021.
- CLP+ [21] Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, and Christopher Re. Mongoose: A learnable lsh framework for efficient neural network training. In International Conference on Learning Representations, 2021.
- CLS [19] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In STOC, 2019.
- CMBB [23] Marco Cascella, Jonathan Montomoli, Valentina Bellini, and Elena Bignami. Evaluating the feasibility of chatgpt in healthcare: an analysis of multiple clinical and research scenarios. Journal of Medical Systems, 47(1):33, 2023.
- CND+ [22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- [22] Timothy Chu, Zhao Song, and Chiwun Yang. Fine-tune language models to approximate unbiased in-context learning. arXiv preprint arXiv:2310.03331, 2023.
- [23] Timothy Chu, Zhao Song, and Chiwun Yang. How to protect copyright data in optimization of large language models? arXiv preprint arXiv:2308.12247, 2023.
- DB [16] Alexey Dosovitskiy and Thomas Brox. Inverting visual representations with convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4829–4837, 2016.
- DCLT [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- DLMS [23] Yichuan Deng, Zhihang Li, Sridhar Mahadevan, and Zhao Song. Zero-th order algorithm for softmax attention optimization. arXiv preprint arXiv:2307.08352, 2023.
- DLS [23] Yichuan Deng, Zhihang Li, and Zhao Song. Attention scheme inspired softmax regression. arXiv preprint arXiv:2304.10411, 2023.
- DMS [23] Yichuan Deng, Sridhar Mahadevan, and Zhao Song. Randomized and deterministic attention sparsification algorithms for over-parameterized feature dimension. arxiv preprint: arxiv 2304.03426, 2023.
- DSW [22] Yichuan Deng, Zhao Song, and Omri Weinstein. Discrepancy minimization in input sparsity time. arXiv preprint arXiv:2210.12468, 2022.
- DSX [23] Yichuan Deng, Zhao Song, and Shenghao Xie. Convergence of two-layer regression with nonlinear units. arXiv preprint arXiv:2308.08358, 2023.
- DZPS [18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- FCB [16] Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv preprint arXiv:1601.01073, 2016.
- GMS [23] Yeqi Gao, Sridhar Mahadevan, and Zhao Song. An over-parameterized exponential regression. arXiv preprint arXiv:2303.16504, 2023.
- GPAM+ [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
- GS [22] Yuzhou Gu and Zhao Song. A faster small treewidth sdp solver. arXiv preprint arXiv:2211.06033, 2022.
- [36] Yeqi Gao, Zhao Song, and Shenghao Xie. In-context learning for attention scheme: from single softmax regression to multiple softmax regression via a tensor trick. arXiv preprint arXiv:2307.02419, 2023.
- [37] Yeqi Gao, Zhao Song, and Shenghao Xie. In-context learning for attention scheme: from single softmax regression to multiple softmax regression via a tensor trick. arXiv preprint arXiv:2307.02419, 2023.
- [38] Yeqi Gao, Zhao Song, and Xin Yang. Differentially private attention computation. arXiv preprint arXiv:2305.04701, 2023.
- [39] Yeqi Gao, Zhao Song, and Junze Yin. Gradientcoin: A peer-to-peer decentralized large language models. arXiv preprint arXiv:2308.10502, 2023.
- GSYZ [23] Yeqi Gao, Zhao Song, Xin Yang, and Ruizhe Zhang. Fast quantum algorithm for attention computation. arXiv preprint arXiv:2307.08045, 2023.
- GSZ [23] Yuzhou Gu, Zhao Song, and Lichen Zhang. A nearly-linear time algorithm for structured support vector machines. arXiv preprint arXiv:2307.07735, 2023.
- HAPC [17] Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. Deep models under the gan: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pages 603–618, 2017.
- HGS+ [21] Yangsibo Huang, Samyak Gupta, Zhao Song, Kai Li, and Sanjeev Arora. Evaluating gradient inversion attacks and defenses in federated learning. Advances in Neural Information Processing Systems, 34:7232–7241, 2021.
- HJS+ [22] Baihe Huang, Shunhua Jiang, Zhao Song, Runzhou Tao, and Ruizhe Zhang. Solving sdp faster: A robust ipm framework and efficient implementation. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 233–244. IEEE, 2022.
- HLSY [21] Baihe Huang, Xiaoxiao Li, Zhao Song, and Xin Yang. Fl-ntk: A neural tangent kernel-based framework for federated learning analysis. In International Conference on Machine Learning, pages 4423–4434. PMLR, 2021.
- HSLA [20] Yangsibo Huang, Zhao Song, Kai Li, and Sanjeev Arora. Instahide: Instance-hiding schemes for private distributed learning. In International conference on machine learning, pages 4507–4518. PMLR, 2020.
- HXL+ [22] Xuanli He, Qiongkai Xu, Lingjuan Lyu, Fangzhao Wu, and Chenguang Wang. Protecting intellectual property of language generation apis with lexical watermark. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10758–10766, 2022.
- HXZ+ [22] Xuanli He, Qiongkai Xu, Yi Zeng, Lingjuan Lyu, Fangzhao Wu, Jiwei Li, and Ruoxi Jia. Cater: Intellectual property protection on text generation apis via conditional watermarks. Advances in Neural Information Processing Systems, 35:5431–5445, 2022.
- JRM+ [99] Craig A Jensen, Russell D Reed, Robert Jackson Marks, Mohamed A El-Sharkawi, Jae-Byung Jung, Robert T Miyamoto, Gregory M Anderson, and Christian J Eggen. Inversion of feedforward neural networks: algorithms and applications. Proceedings of the IEEE, 87(9):1536–1549, 1999.
- JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. In STOC, 2021.
- JT [19] Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. arXiv preprint arXiv:1909.12292, 2019.
- KGW+ [23] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. arXiv preprint arXiv:2301.10226, 2023.
- KKL [20] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
- KMZ [23] Praneeth Kacham, Vahab Mirrokni, and Peilin Zhong. Polysketchformer: Fast transformers via sketches for polynomial kernels. arXiv preprint arXiv:2310.01655, 2023.
- KWR [22] Nikhil Kandpal, Eric Wallace, and Colin Raffel. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pages 10697–10707. PMLR, 2022.
- LKN [99] Bao-Liang Lu, Hajime Kita, and Yoshikazu Nishikawa. Inverting feedforward neural networks using linear and nonlinear programming. IEEE Transactions on Neural networks, 10(6):1271–1290, 1999.
- LL [18] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in neural information processing systems, 31, 2018.
- LLH+ [23] Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342, 2023.
- LLR [23] Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. arXiv preprint arXiv:2303.04245, 2023.
- LSS+ [20] Jason D Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, et al. Generalized leverage score sampling for neural networks. Advances in Neural Information Processing Systems, 33:10775–10787, 2020.
- LSX+ [23] Shuai Li, Zhao Song, Yu Xia, Tong Yu, and Tianyi Zhou. The closeness of in-context learning and weight shifting for softmax regression. arXiv preprint arXiv:2304.13276, 2023.
- LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In Conference on Learning Theory (COLT), pages 2140–2157. PMLR, 2019.
- LSZ [23] Zhihang Li, Zhao Song, and Tianyi Zhou. Solving regularized exp, cosh and sinh regression problems. arXiv preprint arXiv:2303.15725, 2023.
- MGN+ [23] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. arXiv preprint arXiv:2305.17333, 2023.
- MMS+ [19] Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de La Clergerie, Djame Seddah, and Benoit Sagot. Camembert: a tasty french language model. arXiv preprint arXiv:1911.03894, 2019.
- MOSW [22] Alexander Munteanu, Simon Omlor, Zhao Song, and David Woodruff. Bounding the width of neural networks via coupled initialization a worst case analysis. In International Conference on Machine Learning, pages 16083–16122. PMLR, 2022.
- MSDCS [19] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE symposium on security and privacy (SP), pages 691–706. IEEE, 2019.
- MSS [16] Richard McPherson, Reza Shokri, and Vitaly Shmatikov. Defeating image obfuscation with deep learning. arXiv preprint arXiv:1609.00408, 2016.
- MV [15] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5188–5196, 2015.
- MWY+ [23] Sadhika Malladi, Alexander Wettig, Dingli Yu, Danqi Chen, and Sanjeev Arora. A kernel-based view of language model fine-tuning. In International Conference on Machine Learning, pages 23610–23641. PMLR, 2023.
- NRMI [20] Usman Naseem, Imran Razzak, Katarzyna Musial, and Muhammad Imran. Transformer based deep intelligent contextual embedding for twitter sentiment analysis. Future Generation Computer Systems, 113:58–69, 2020.
- Ope [23] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- OS [20] Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1):84–105, 2020.
- PMJ+ [16] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P), pages 372–387. IEEE, 2016.
- PMXA [23] Abhishek Panigrahi, Sadhika Malladi, Mengzhou Xia, and Sanjeev Arora. Trainable transformer in transformer. arXiv preprint arXiv:2307.01189, 2023.
- PSZA [23] Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, and Sanjeev Arora. Task-specific skill localization in fine-tuned language models. arXiv preprint arXiv:2302.06600, 2023.
- PZJY [20] Xudong Pan, Mi Zhang, Shouling Ji, and Min Yang. Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy (SP), pages 1314–1331. IEEE, 2020.
- QSY [23] Lianke Qin, Zhao Song, and Yuanyuan Yang. Efficient sgd neural network training via sublinear activated neuron identification. arXiv preprint arXiv:2307.06565, 2023.
- Ray [23] Partha Pratim Ray. Chatgpt: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 2023.
- RG [20] Maria Rigaki and Sebastian Garcia. A survey of privacy attacks in machine learning. ACM Computing Surveys, 2020.
- RNS+ [18] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
- RSM+ [23] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- RWC+ [19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- SHT [23] Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Representational strengths and limitations of transformers. arXiv preprint arXiv:2306.02896, 2023.
- SMK [23] Lingfeng Shen, Aayush Mishra, and Daniel Khashabi. Do pretrained transformers really learn in-context by gradient descent? arXiv preprint arXiv:2310.08540, 2023.
- SSSS [17] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pages 3–18. IEEE, 2017.
- SSZ [23] Ritwik Sinha, Zhao Song, and Tianyi Zhou. A mathematical abstraction for balancing the trade-off between creativity and reality in large language models. arXiv preprint arXiv:2306.02295, 2023.
- SWX+ [23] Hanpu Shen, Cheng-Long Wang, Zihang Xiang, Yiming Ying, and Di Wang. Differentially private non-convex learning for multi-layer neural networks. arXiv preprint arXiv:2310.08425, 2023.
- SY [19] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix Chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
- SYZ [23] Zhao Song, Junze Yin, and Lichen Zhang. Solving attention kernel regression problem via pre-conditioner. arXiv preprint arXiv:2308.14304, 2023.
- SZKS [21] Charlie Snell, Ruiqi Zhong, Dan Klein, and Jacob Steinhardt. Approximating how single head attention learns. arXiv preprint arXiv:2103.07601, 2021.
- SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
- TBM+ [21] Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention for transformer models. In International conference on machine learning, pages 10183–10192. PMLR, 2021.
- TDA+ [20] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006, 2020.
- TLTO [23] Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, and Samet Oymak. Transformers as support vector machines. arXiv preprint arXiv:2308.16898, 2023.
- UAS+ [20] Mohd Usama, Belal Ahmad, Enmin Song, M Shamim Hossain, Mubarak Alrashoud, and Ghulam Muhammad. Attention-based sentiment analysis using convolutional and recurrent neural network. Future Generation Computer Systems, 113:571–578, 2020.
- VKB [23] Nikhil Vyas, Sham Kakade, and Boaz Barak. Provable copyright protection for generative models. arXiv preprint arXiv:2302.10870, 2023.
- VSP+ [17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- WIL+ [23] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023.
- WLL+ [20] Wenqi Wei, Ling Liu, Margaret Loper, Ka-Ho Chow, Mehmet Emre Gursoy, Stacey Truex, and Yanzhao Wu. A framework for evaluating gradient leakage attacks in federated learning. arXiv preprint arXiv:2004.10397, 2020.
- WLL [23] Zihan Wang, Jason Lee, and Qi Lei. Reconstructing training data from model gradient, provably. In International Conference on Artificial Intelligence and Statistics, pages 6595–6612. PMLR, 2023.
- XGZC [23] Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
- XZA+ [23] Zheng Xu, Yanxiang Zhang, Galen Andrew, Christopher A Choquette-Choo, Peter Kairouz, H Brendan McMahan, Jesse Rosenstock, and Yuanbo Zhang. Federated learning of gboard language models with differential privacy. arXiv preprint arXiv:2305.18465, 2023.
- YMV+ [21] Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M Alvarez, Jan Kautz, and Pavlo Molchanov. See through gradients: Image batch recovery via gradinversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16337–16346, 2021.
- ZG [19] Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. Advances in neural information processing systems, 32, 2019.
- Zha [22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.
- ZHDK [23] Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. Kdeformer: Accelerating transformers via kernel density estimation. arXiv preprint arXiv:2302.02451, 2023.
- ZJP+ [20] Yuheng Zhang, Ruoxi Jia, Hengzhi Pei, Wenxiao Wang, Bo Li, and Dawn Song. The secret revealer: Generative model-inversion attacks against deep neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 253–261, 2020.
- ZKV+ [20] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020.
- ZLH [19] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. Advances in neural information processing systems, 32, 2019.
- ZMB [20] Bo Zhao, Konda Reddy Mopuri, and Hakan Bilen. idlg: Improved deep leakage from gradients. arXiv preprint arXiv:2001.02610, 2020.
- ZMG [19] Guodong Zhang, James Martens, and Roger B Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- ZPD+ [20] Yi Zhang, Orestis Plevrakis, Simon S Du, Xingguo Li, Zhao Song, and Sanjeev Arora. Over-parameterized adversarial training: An analysis overcoming the curse of dimensionality. Advances in Neural Information Processing Systems, 33:679–688, 2020.
- ZRG+ [22] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.