
PRR: Customized Diffusion Models by Image Space Rank Reduction

David S. Hippocampus
Department of Computer Science
Cranberry-Lemon University
Pittsburgh, PA 15213
hippo@cs.cranberry-lemon.edu

Appendix A Theorem and Proof of PRR

Theorem 1.

For a matrix $W$ whose image space is a $d$-dimensional vector space $S_{d}$, if we have a matrix $Q=\begin{bmatrix}\vec{q}_{1}\ \vec{q}_{2}\ \dots\ \vec{q}_{r}\end{bmatrix}$ with $\vec{q}_{i}\in S_{d}$ and the vectors $\vec{q}_{i}$ mutually orthonormal, then $W-QQ^{T}W$ has a $(d-r)$-dimensional image space.

This is an intuitive result; to avoid any potential confusion, we provide a proof here.

Proof Sketch

The dimension of the image space of a matrix is the same as the dimension of its column space.

For the column vectors of $Q$ and $W$, $\{\vec{q}_{1},\vec{q}_{2},\dots,\vec{q}_{r},\vec{w}_{i}\}\subset S_{d}$, we perform the Gram-Schmidt process to obtain an orthonormal basis $\{\vec{q}_{1},\vec{q}_{2},\dots,\vec{q}_{r},\vec{v}_{1},\vec{v}_{2},\dots,\vec{v}_{d-r}\}$ of the image space of $W$.

$W-QQ^{T}W$ transforms each column vector $\vec{w}_{i}$ of $W$ into $\vec{w}_{i}-\sum_{j=1}^{r}(\vec{q}_{j}\cdot\vec{w}_{i})\vec{q}_{j}$, which is a linear combination of $\{\vec{v}_{1},\vec{v}_{2},\dots,\vec{v}_{d-r}\}$. Hence $W-QQ^{T}W$ has a $(d-r)$-dimensional image space.
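
As a sanity check (not part of the original derivation), the following NumPy sketch builds a random $W$ with a $d$-dimensional image space, takes $r$ orthonormal vectors from that image space as $Q$, and confirms numerically that $W-QQ^{T}W$ has rank $d-r$:

import numpy as np

rng = np.random.default_rng(0)
b, k, d, r = 64, 48, 20, 5                 # illustrative sizes, with r <= d <= min(b, k)

# W with a d-dimensional image space: product of a b x d and a d x k factor.
W = rng.standard_normal((b, d)) @ rng.standard_normal((d, k))

# Orthonormal vectors inside the image space of W (left singular vectors).
U = np.linalg.svd(W, full_matrices=False)[0]
Q = U[:, :r]

W_reduced = W - Q @ Q.T @ W
print(np.linalg.matrix_rank(W), np.linalg.matrix_rank(W_reduced))   # expect 20 and 15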

Detailed Proof

Assume $W$ is a $b\times k$ matrix. Denote $W-QQ^{T}W$ by $[\vec{u}_{1}\ \vec{u}_{2}\ \dots\ \vec{u}_{k}]$ and $W$ by $[\vec{w}_{1}\ \vec{w}_{2}\ \dots\ \vec{w}_{k}]$. In vector form, $W-QQ^{T}W$ is

[u1u2uk]=[w1w2wk][q1q2qr][q1Tq2TqrT][w1w2wk][\vec{u}_{1}\ \vec{u}_{2}...\ \vec{u}_{k}]=[\vec{w}_{1}\ \vec{w}_{2}...\ \vec{w}_{k}]-[\vec{q}_{1}\ \vec{q}_{2}...\ \vec{q}_{r}]\left[\begin{matrix}\vec{q}_{1}^{T}\\ \vec{q}_{2}^{T}\\ ...\\ \vec{q}_{r}^{T}\\ \end{matrix}\right][\vec{w}_{1}\ \vec{w}_{2}...\ \vec{w}_{k}]

=[w1w2wk][q1q2qr][q1Tw1q1Tw2q1Twkq2Tw1q2Tw2q2TwkqrTw1qrTw2qrTwk]=\left[\vec{w}_{1}\ \vec{w}_{2}...\ \vec{w}_{k}\right]-\left[\vec{q}_{1}\ \vec{q}_{2}...\ \vec{q}_{r}\right]\left[\begin{matrix}\vec{q}_{1}^{T}\vec{w}_{1}&\vec{q}_{1}^{T}\vec{w}_{2}&...&\vec{q}_{1}^{T}\vec{w}_{k}\\ \vec{q}_{2}^{T}\vec{w}_{1}&\vec{q}_{2}^{T}\vec{w}_{2}&...&\vec{q}_{2}^{T}\vec{w}_{k}\\ ...&...&...&...\\ \vec{q}_{r}^{T}\vec{w}_{1}&\vec{q}_{r}^{T}\vec{w}_{2}&...&\vec{q}_{r}^{T}\vec{w}_{k}\end{matrix}\right]

=[w1w2wk][i=1rqiqiTw1i=1rqiqiTw2i=1rqiqiTwk]=\left[\vec{w}_{1}\ \vec{w}_{2}...\ \vec{w}_{k}\right]-\left[\begin{matrix}\sum_{i=1}^{r}\vec{q}_{i}\vec{q}_{i}^{T}\vec{w}_{1}&\sum_{i=1}^{r}\vec{q}_{i}\vec{q}_{i}^{T}\vec{w}_{2}&...&\sum_{i=1}^{r}\vec{q}_{i}\vec{q}_{i}^{T}\vec{w}_{k}\end{matrix}\right]

For each column, $\vec{u}_{j}=\vec{w}_{j}-\sum_{i=1}^{r}\vec{q}_{i}\vec{q}_{i}^{T}\vec{w}_{j}$.

The image space of $W$ is a $d$-dimensional vector space $S_{d}$, so $d\leq b$ and $d\leq k$.

That $W-QQ^{T}W$ has a $(d-r)$-dimensional image space means:

$\exists\{\vec{g}_{1},\vec{g}_{2},\dots,\vec{g}_{d-r}\}$ with $\vec{g}_{i}\in S_{d}$, such that $\forall\vec{u}_{i}\in\{\vec{u}_{1},\vec{u}_{2},\dots,\vec{u}_{k}\}$, $\exists\{a_{1},\dots,a_{d-r}\}\subset\mathbb{R}$ with $\vec{u}_{i}=\sum_{j=1}^{d-r}a_{j}\vec{g}_{j}$.

Construct a set $P$ containing all column vectors of the matrices $Q$ and $W$: $P=\{\vec{q}_{1},\vec{q}_{2},\dots,\vec{q}_{r},\vec{w}_{1},\vec{w}_{2},\dots,\vec{w}_{k}\}$. Applying the Gram-Schmidt process to $P$ (and discarding the zero vectors produced by linearly dependent columns), we obtain $d$ orthogonal basis vectors $\{\vec{v}_{1},\vec{v}_{2},\dots,\vec{v}_{d}\}$ of $S_{d}$.

The Gram-Schmidt process proceeds as follows:

Step 1: $\vec{v}_{1}=\vec{q}_{1}$

Step 2: $\vec{v}_{2}=\vec{q}_{2}-\frac{\vec{q}_{1}\cdot\vec{q}_{2}}{\vec{q}_{1}\cdot\vec{q}_{1}}\vec{q}_{1}=\vec{q}_{2}$

Step 3: $\vec{v}_{3}=\vec{q}_{3}-\frac{\vec{q}_{1}\cdot\vec{q}_{3}}{\vec{q}_{1}\cdot\vec{q}_{1}}\vec{q}_{1}-\frac{\vec{q}_{2}\cdot\vec{q}_{3}}{\vec{q}_{2}\cdot\vec{q}_{2}}\vec{q}_{2}=\vec{q}_{3}$

...

Step r: $\vec{v}_{r}=\vec{q}_{r}$

Step r+1: $\vec{v}_{r+1}=\vec{w}_{1}-\sum_{i=1}^{r}\frac{\vec{q}_{i}\cdot\vec{w}_{1}}{\vec{q}_{i}\cdot\vec{q}_{i}}\vec{q}_{i}$

Step r+2: $\vec{v}_{r+2}=\vec{w}_{2}-\sum_{i=1}^{r}\frac{\vec{q}_{i}\cdot\vec{w}_{2}}{\vec{q}_{i}\cdot\vec{q}_{i}}\vec{q}_{i}-\frac{\vec{v}_{r+1}\cdot\vec{w}_{2}}{\vec{v}_{r+1}\cdot\vec{v}_{r+1}}\vec{v}_{r+1}$

Since $S_{d}$ is $d$-dimensional, only $d$ basis vectors survive, and we finally obtain the orthogonal basis $\{\vec{q}_{1},\vec{q}_{2},\dots,\vec{q}_{r},\vec{v}_{r+1},\dots,\vec{v}_{d}\}$.
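
The construction above can be mirrored in a few lines of NumPy (an illustrative sketch, not part of the proof): the columns of $Q$ are processed first, then the columns of $W$, and numerically dependent vectors are dropped, leaving exactly the $d$ orthogonal basis vectors $\{\vec{q}_{1},\dots,\vec{q}_{r},\vec{v}_{r+1},\dots,\vec{v}_{d}\}$.

import numpy as np

def gram_schmidt_basis(Q, W, tol=1e-10):
    # Orthogonalize the columns of Q (already orthonormal) followed by the
    # columns of W, discarding vectors that are numerically dependent.
    basis = []
    for col in np.hstack([Q, W]).T:
        v = col.astype(float).copy()
        for u in basis:
            v -= (u @ v) / (u @ u) * u     # subtract the projection onto u
        if np.linalg.norm(v) > tol:
            basis.append(v)
    return np.stack(basis, axis=1)         # columns: q_1..q_r, v_{r+1}..v_d

Applied to the $Q$ and $W$ from the numerical check after the proof sketch, this returns a matrix with exactly $d$ columns, the first $r$ of which coincide with the columns of $Q$.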

Expanding $\vec{w}_{j}$ in this basis, $\vec{w}_{j}=\sum_{l=1}^{r}a_{l}\vec{q}_{l}+\sum_{l=r+1}^{d}b_{l}\vec{v}_{l}$.

$\vec{u}_{j}=\vec{w}_{j}-\sum_{i=1}^{r}\vec{q}_{i}\vec{q}_{i}^{T}\vec{w}_{j}$

$=\sum_{l=1}^{r}a_{l}\vec{q}_{l}+\sum_{l=r+1}^{d}b_{l}\vec{v}_{l}-\sum_{i=1}^{r}\vec{q}_{i}\vec{q}_{i}^{T}\left(\sum_{l=1}^{r}a_{l}\vec{q}_{l}+\sum_{l=r+1}^{d}b_{l}\vec{v}_{l}\right)$

$=\sum_{l=1}^{r}a_{l}\vec{q}_{l}+\sum_{l=r+1}^{d}b_{l}\vec{v}_{l}-\sum_{l=1}^{r}a_{l}\vec{q}_{l}$ (since $\vec{q}_{i}^{T}\vec{q}_{l}=1$ if $i=l$, $\vec{q}_{i}^{T}\vec{q}_{l}=0$ if $i\neq l$, and $\vec{q}_{i}^{T}\vec{v}_{l}=0$ for $l\geq r+1$)

$=\sum_{l=r+1}^{d}b_{l}\vec{v}_{l}$

Recall that we want to prove: $\exists\{\vec{g}_{1},\vec{g}_{2},\dots,\vec{g}_{d-r}\}$ with $\vec{g}_{i}\in S_{d}$, such that $\forall\vec{u}_{i}\in\{\vec{u}_{1},\vec{u}_{2},\dots,\vec{u}_{k}\}$, $\exists\{a_{1},\dots,a_{d-r}\}\subset\mathbb{R}$ with $\vec{u}_{i}=\sum_{j=1}^{d-r}a_{j}\vec{g}_{j}$.

Taking $\{\vec{g}_{1},\vec{g}_{2},\dots,\vec{g}_{d-r}\}=\{\vec{v}_{r+1},\dots,\vec{v}_{d}\}$ satisfies this condition: each $\vec{g}_{i}\in S_{d}$, and by the computation above every $\vec{u}_{j}$ is a linear combination of these vectors with real coefficients. Moreover, the $\vec{v}_{l}$ are linearly independent and, by the Gram-Schmidt construction, each $\vec{v}_{l}$ with $l\geq r+1$ is itself a linear combination of the columns $\vec{u}_{j}$ (for example, $\vec{v}_{r+1}=\vec{u}_{1}$), so the image space of $W-QQ^{T}W$ is exactly $\mathrm{span}\{\vec{v}_{r+1},\dots,\vec{v}_{d}\}$.

Therefore, $W-QQ^{T}W$ has a $(d-r)$-dimensional image space.

Appendix B Comparisons of Generation Performance for Different Parameter Subsets

ExCA is the parameter subset chosen in our experiments. Here, we provide additional examples generated using other parameter subsets. We experimented with the eight parameter subsets listed below (a selection sketch follows the list).

  • Exclude Cross-Attention (ExCA)

    • This method trains all layers except cross-attention and time embedding layers.

  • Exclude Self-Attention (ExSA)

    • This method trains all layers except self-attention layers.

  • Self-Attention Only (SAO)

    • This method trains only the self-attention layers.

  • Cross-Attention Only (CAO)

    • This method trains only the cross-attention layers.

  • Full Model Training (FMT)

    • This method trains all layers of the model.

  • Strict Cross-Attention (SCA)

    • This method trains only the queries and keys within the cross-attention mechanisms.

  • Exclude Cross-Attention High-Level (ExCA-HL)

    • This method trains all layers except cross-attention layers, with an emphasis on high-level feature representations.

  • Exclude Cross-Attention High-Level Last (ExCA-HL-Last)

    • This method trains all layers except cross-attention layers, focusing on the final stages of the high-level feature space.
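
The subset names above can be mapped to concrete parameter groups by filtering module names. The sketch below is illustrative only and assumes a diffusers-style UNet, where `attn1` denotes self-attention, `attn2` denotes cross-attention, `time_embedding` is the time-embedding module, and `to_q`/`to_k` are the query/key projections; the high-level variants (ExCA-HL, ExCA-HL-Last) are omitted because their block-selection rule is not specified here.

def select_subset(unet, subset="ExCA"):
    # Return the named parameters belonging to the chosen subset.
    selected = {}
    for name, param in unet.named_parameters():
        is_self = "attn1" in name
        is_cross = "attn2" in name
        is_time = "time_embedding" in name
        if subset == "ExCA":
            keep = not (is_cross or is_time)
        elif subset == "ExSA":
            keep = not is_self
        elif subset == "SAO":
            keep = is_self
        elif subset == "CAO":
            keep = is_cross
        elif subset == "FMT":
            keep = True
        elif subset == "SCA":
            keep = is_cross and ("to_q" in name or "to_k" in name)
        else:
            raise ValueError(f"unknown subset: {subset}")
        if keep:
            selected[name] = param
    return selected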

As shown in Table 1, regardless of the subset used, the parameter count of PRR is significantly lower than that of LoRA. In Fig. 1, we use the example of a bear plushie, with $r=16$, to illustrate the effects of different parameter subsets. It can be observed that models with larger parameter counts, such as FMT, CAO, and ExSA, tend to align better with the training images. Conversely, models with smaller parameter counts may not align as well with the target subject but match the text more closely. The model with the largest parameter count, FMT, even produced a 'hybrid' result in the multi-subject example "A girl is holding a small bear [V]". Models like SAO and SCA strike a better balance. ExCA, being a well-performing subset, has numerous examples shown in other sections and is not repeated here.

Subset                                                 | PRR, r=2 | LoRA, r=2 | PRR, r=16 | LoRA, r=16 | PRR, r=128 | LoRA, r=128
Exclude Cross-Attention (ExCA)                         | 1.8 MB   | 4.8 MB    | 13 MB     | 33 MB      | 87 MB      | 190 MB
Exclude Self-Attention (ExSA)                          | 1.9 MB   | 4.8 MB    | 13 MB     | 33 MB      | 87 MB      | 190 MB
Self-Attention Only (SAO)                              | 1.6 MB   | 3.1 MB    | 11 MB     | 21 MB      | 82 MB      | 163 MB
Cross-Attention Only (CAO)                             | 1.6 MB   | 3.5 MB    | 11 MB     | 25 MB      | 82 MB      | 193 MB
Full Model Training (FMT)                              | 3.4 MB   | 8.2 MB    | 23 MB     | 58 MB      | 169 MB     | 382 MB
Strict Cross-Attention (SCA)                           | 1.2 MB   | 2.8 MB    | 7.9 MB    | 20 MB      | 62 MB      | 152 MB
Exclude Cross-Attention High-Level (ExCA-HL)           | 276 KB   | 700 KB    | 1.9 MB    | 5.0 MB     | 14 MB      | 30 MB
Exclude Cross-Attention High-Level Last (ExCA-HL-Last) | 6.9 KB   | 53 KB     | 42 KB     | 403 KB     | 82 KB      | 803 KB
Table 1: Fine-tuned parameter subsets in the UNet, compared with LoRA at ranks $r=2$, $16$, and $128$; entries are the corresponding model sizes.
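
The gap between the PRR and LoRA columns in Table 1 follows from a simple per-layer count (a rough sketch, assuming PRR stores only the $d\times r$ matrix $Q$ of $W-QQ^{T}W$ while LoRA stores both of its low-rank factors):

def prr_params(d, k, r):
    # PRR keeps only Q (d x r); the layer width k does not enter the count.
    return d * r

def lora_params(d, k, r):
    # LoRA keeps two factors: a d x r matrix and an r x k matrix.
    return r * (d + k)

# Example: a square 1280 x 1280 projection layer at rank r = 16.
print(prr_params(1280, 1280, 16), lora_params(1280, 1280, 16))   # 20480 vs 40960

For square layers this is roughly a factor of two, consistent with the ratios in Table 1.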


Figure 1: PRR single-subject generation with different parameter subsets.

Appendix C Proof of PRR Sequential Addition in PRR Combination

Here we prove that in PRR combination, $W_{0}-Q_{m}^{\prime}Q_{m}^{\prime T}W_{0}=W_{1}-Q_{2}Q_{2}^{T}W_{1}=W_{0}-Q_{1}Q_{1}^{T}W_{0}-Q_{2}Q_{2}^{T}W_{0}+Q_{2}Q_{2}^{T}Q_{1}Q_{1}^{T}W_{0}$, where $W_{1}=W_{0}-Q_{1}Q_{1}^{T}W_{0}$.

For two trained PRRs, we have parameters:

$Q_{1}=\begin{bmatrix}\vec{q}_{11}\ \vec{q}_{12}\ \dots\ \vec{q}_{1r_{1}}\end{bmatrix}_{d\times r_{1}}$

$Q_{2}=\begin{bmatrix}\vec{q}_{21}\ \vec{q}_{22}\ \dots\ \vec{q}_{2r_{2}}\end{bmatrix}_{d\times r_{2}}$.

For the standard combination of $Q_{1}$ and $Q_{2}$: $Q_{m}=\begin{bmatrix}\vec{q}_{11}\ \vec{q}_{12}\ \dots\ \vec{q}_{1r_{1}}\ \vec{q}_{21}\ \vec{q}_{22}\ \dots\ \vec{q}_{2r_{2}}\end{bmatrix}_{d\times(r_{1}+r_{2})}$

With the QR decomposition $Q_{m}^{\prime}R_{m}^{\prime}=Q_{m}$, and following the same argument as in Appendix A, we can find an orthonormal basis $\{\vec{q}_{11},\vec{q}_{12},\dots,\vec{q}_{1r_{1}},\vec{p}_{1},\vec{p}_{2},\dots,\vec{p}_{r_{m}-r_{1}}\}$ of $r_{m}$ vectors spanning the column space of $[Q_{1}\ Q_{2}]$.

$W_{m}=W_{0}-Q_{m}^{\prime}Q_{m}^{\prime T}W_{0}$ means that each column vector $\vec{w}_{i}$ of $W_{0}$ is transformed into

$\vec{w}_{i}-\sum_{j}^{r_{1}}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{1j}-\sum_{l}^{r_{m}-r_{1}}(\vec{p}_{l}\cdot\vec{w}_{i})\vec{p}_{l}$ (1)

For the practical (sequential) combination $W_{m}=W_{1}-Q_{2}Q_{2}^{T}W_{1}=W_{0}-Q_{1}Q_{1}^{T}W_{0}-Q_{2}Q_{2}^{T}W_{0}+Q_{2}Q_{2}^{T}Q_{1}Q_{1}^{T}W_{0}$, $\vec{w}_{i}$ is transformed into

$\vec{w}_{i}-\sum_{j}^{r_{1}}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{1j}-\sum_{s}^{r_{2}}(\vec{q}_{2s}\cdot\vec{w}_{i})\vec{q}_{2s}+\sum_{s}^{r_{2}}\Big(\big(\sum_{j}^{r_{1}}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{1j}\big)\cdot\vec{q}_{2s}\Big)\vec{q}_{2s}$ (2)

We have $\vec{q}_{2s}=\sum_{j}^{r_{1}}a_{sj}\vec{q}_{1j}+\sum_{l}^{r_{m}-r_{1}}b_{sl}\vec{p}_{l}$ with $a_{sj}=\vec{q}_{1j}\cdot\vec{q}_{2s}$ and $b_{sl}=\vec{p}_{l}\cdot\vec{q}_{2s}$; then Expression (2) becomes

$\vec{w}_{i}-\sum_{j}^{r_{1}}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{1j}-\sum_{s}^{r_{2}}\Big(\big(\sum_{j}^{r_{1}}a_{sj}\vec{q}_{1j}+\sum_{l}^{r_{m}-r_{1}}b_{sl}\vec{p}_{l}\big)\cdot\vec{w}_{i}\Big)\vec{q}_{2s}+\sum_{s}^{r_{2}}\Big(\big(\sum_{j}^{r_{1}}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{1j}\big)\cdot\vec{q}_{2s}\Big)\vec{q}_{2s}$ (3)

$=\vec{w}_{i}-\sum_{j}^{r_{1}}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{1j}-\sum_{s}^{r_{2}}\sum_{j}^{r_{1}}a_{sj}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{2s}-\sum_{s}^{r_{2}}\sum_{l}^{r_{m}-r_{1}}b_{sl}(\vec{p}_{l}\cdot\vec{w}_{i})\vec{q}_{2s}+\sum_{s}^{r_{2}}\Big(\big(\sum_{j}^{r_{1}}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{1j}\big)\cdot\vec{q}_{2s}\Big)\vec{q}_{2s}$ (4)

$=\dots-\sum_{s}^{r_{2}}\Big(\sum_{j}^{r_{1}}a_{sj}(\vec{q}_{1j}\cdot\vec{w}_{i})-\big(\sum_{j}^{r_{1}}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{1j}\big)\cdot\vec{q}_{2s}\Big)\vec{q}_{2s}-\sum_{s}^{r_{2}}\sum_{l}^{r_{m}-r_{1}}b_{sl}(\vec{p}_{l}\cdot\vec{w}_{i})\vec{q}_{2s}$ (5)

Since $a_{sj}=\vec{q}_{1j}\cdot\vec{q}_{2s}$, the bracketed term in Expression (5) vanishes, and substituting $b_{sl}=\vec{p}_{l}\cdot\vec{q}_{2s}$ gives

$\vec{w}_{i}-\sum_{j}^{r_{1}}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{1j}-\sum_{s}^{r_{2}}\sum_{l}^{r_{m}-r_{1}}(\vec{p}_{l}\cdot\vec{q}_{2s})(\vec{p}_{l}\cdot\vec{w}_{i})\vec{q}_{2s}$ (6)

$=\vec{w}_{i}-\sum_{j}^{r_{1}}(\vec{q}_{1j}\cdot\vec{w}_{i})\vec{q}_{1j}-\sum_{l}^{r_{m}-r_{1}}(\vec{p}_{l}\cdot\vec{w}_{i})\Big(\sum_{s}^{r_{2}}(\vec{p}_{l}\cdot\vec{q}_{2s})\vec{q}_{2s}\Big)$ (7)

The term $\sum_{s}^{r_{2}}(\vec{p}_{l}\cdot\vec{q}_{2s})\vec{q}_{2s}$ represents the total component of $\vec{p}_{l}$ in the column space of $Q_{2}$. In sequential PRR combination the columns of $Q_{2}$ are drawn from the image space of $W_{1}$ (as in Theorem 1), so they are orthogonal to the columns of $Q_{1}$; since $\vec{p}_{l}$ lies in the span of the columns of $Q_{1}$ and $Q_{2}$ and is orthogonal to $\{\vec{q}_{11},\vec{q}_{12},\dots,\vec{q}_{1r_{1}}\}$, it lies in the column space of $Q_{2}$, and therefore $\sum_{s}^{r_{2}}(\vec{p}_{l}\cdot\vec{q}_{2s})\vec{q}_{2s}=\vec{p}_{l}$. Hence Expression (1) and Expression (7) are equal.
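
A small numerical check of this identity (illustrative only; it assumes, as discussed above, that the columns of $Q_{2}$ are drawn from the image space of $W_{1}$):

import numpy as np

rng = np.random.default_rng(1)
b, k, d, r1, r2 = 64, 48, 20, 4, 3

W0 = rng.standard_normal((b, d)) @ rng.standard_normal((d, k))
Q1 = np.linalg.svd(W0, full_matrices=False)[0][:, :r1]   # orthonormal, in the image space of W0

W1 = W0 - Q1 @ Q1.T @ W0
Q2 = np.linalg.svd(W1, full_matrices=False)[0][:, :r2]   # orthonormal, in the image space of W1

Qm, _ = np.linalg.qr(np.hstack([Q1, Q2]))                # Q_m' from the QR decomposition of Q_m
lhs = W0 - Qm @ Qm.T @ W0
rhs = W1 - Q2 @ Q2.T @ W1
print(np.allclose(lhs, rhs))                             # expect True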

Appendix D Comparison of Different Rank Boundaries $\gamma$

Appendix E Different $r$ Values in Image Editing and LoRA Combination

Appendix F More Generation Results

We compare with DreamBooth and Textual Inversion in Fig. 2.


Figure 2: Comparison with DreamBooth and Textual Inversion.

Appendix G Image Editing Mathematical Discussion

The stability of PRR outputs is reflected in the fact that different Gaussian noise inputs tend to yield the same result, represented as:

$h=Wx=W(x+\Delta x)$ (8)

The equation holds when $W\Delta x=\mathbf{0}$, i.e., $\Delta x\in\mathrm{kernel}(W)$.

According to the rank-nullity theorem, for the linear transformation $W:X\to H$, $\mathrm{rank}(W)+\mathrm{nullity}(W)=\dim X$ (the nullity of $W$ is the dimension of the kernel of $W$). In PRR, the reduced rank is $r=\mathrm{rank}(W_{0})-\mathrm{rank}(W_{reduce})=\mathrm{nullity}(W_{reduce})-\mathrm{nullity}(W_{0})$. Since $\mathrm{rank}(W_{0})>\mathrm{rank}(W_{reduce})$, we have $\mathrm{nullity}(W_{reduce})>\mathrm{nullity}(W_{0})$: more directions $\Delta x$ leave the PRR output unchanged. The reduced rank $r$ thus captures the trade-off between faithful reconstruction and editability in image editing [meng2021sdedit]. As $r$ increases, the modifiable features decrease, making the reconstruction more faithful. As $r$ decreases, the modifiable features increase, bringing the diversity of the generated images closer to that of the underlying pre-trained generative model and improving editability. When a large $r$ is selected for training the PRR on a single image, the model generates images that closely resemble the training image even under varied text prompts. This enables direct modification of the text prompt to perform image editing on that single training image.
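
The kernel argument can be illustrated numerically (a sketch with random matrices, not the actual UNet weights): rank reduction enlarges the kernel by exactly $r$, and any direction $\Delta x$ added to the kernel leaves the reduced layer's output unchanged while the original layer still reacts to it.

import numpy as np

rng = np.random.default_rng(2)
b, k, d, r = 32, 24, 12, 4

W0 = rng.standard_normal((b, d)) @ rng.standard_normal((d, k))
Q = np.linalg.svd(W0, full_matrices=False)[0][:, :r]
W_reduce = W0 - Q @ Q.T @ W0

def nullity(M, tol=1e-10):
    s = np.linalg.svd(M, compute_uv=False)
    return M.shape[1] - int(np.sum(s > tol))

print(nullity(W0), nullity(W_reduce))        # expect 12 and 16: the kernel grows by r

# A direction the reduced layer ignores: W0 dx = q_1 lies in span(Q), so W_reduce dx = 0.
dx = np.linalg.pinv(W0) @ Q[:, 0]
x = rng.standard_normal(k)
print(np.allclose(W_reduce @ (x + dx), W_reduce @ x))    # True: same output after reduction
print(np.allclose(W0 @ (x + dx), W0 @ x))                # False: the original layer changes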

Appendix H Survey
