
Probabilistic 3D Regression with Projected Huber Distribution

David Mohlin davmo@kth.se
RPL, KTH / Tobii
Josephine Sullivan sullivan@kth.se
RPL, KTH
Abstract

Estimating probability distributions which describe where an object is likely to be from camera data is a task with many applications. In this work we describe properties which we argue such methods should conform to. We also design a method which conforms to these properties. In our experiments we show that our method produces uncertainties which correlate well with empirical errors. We also show that the mode of the predicted distribution outperforms our regression baselines. The code for our implementation is available online.

Figure 1: Overview of our method. Our method takes as input an image and the focal length which was used to capture the scene, and produces a log-concave probability distribution in world coordinates in a way which can model the ambiguities which are inherent to camera sensors. On the left we visualize the level curves in projected coordinates together with the projected ground truth center of the cylinder. On the right we show the level curves of the distribution from a bird's-eye view. The estimated mode is shown as a cross and the ground truth position as a dot.

1 Introduction

Estimating the 3d position of objects has many use cases. For example: 1) Estimating the position of a car or pedestrian is required to construct policies for how to traverse traffic for autonomous driving or ADAS. As a result this task is included in many such datasets, for example KITTI Geiger et al. (2012). 2) Body pose estimation can be seen as trying to find the 3d position of each joint of a human skeleton. Examples of such datasets are von Marcard et al. (2018) and Ionescu et al. (2013). 3) For many robotics applications estimating the 3d position of objects is relevant Bruns & Jensfelt (2022). For industrial robotics the shape of the object is often known, therefore the only sources of variation are the six degrees of freedom due to the orientation and position of the object.

Estimating uncertainties associated with the estimated position has applications for fusion of multiple estimates Mohlin et al. (2021). Another application for uncertainty estimates is temporal filters, such as Kalman filters. Uncertainties can also be used for application specific purposes. For example, in an ADAS setting it makes sense to not only avoid the most likely position of an object, but to also include a margin proportional to the uncertainty of the estimate. In robotic grasping a good policy could be to try to pick up objects directly if the estimated position is certain, while doing something else if the estimate is uncertain, such as changing the viewing angle.

2 Prior work

When estimating 3d position from camera data there is an inherent ambiguity due to scale/depth ambiguity. Many interesting types of objects have a scale ambiguity, such as people, cars and animals.

The existence of scale ambiguity is reflected in the performance on competitive datasets. For example, the errors of single view methods which estimate the position of people relative to the camera are on the order of 120mm, where 100mm is in the depth direction Moon et al. (2019). This is consistent with the presence of scale/depth ambiguity. In the multi view setting, where the depth ambiguity can be eliminated, the state of the art is on the order of 17mm Iskakov et al. (2019).

There are many ways to resolve the scale/depth ambiguity.

One way to solve the problem is to add a depth sensor such as a lidar Geiger et al. (2012) or structured light Choi et al. (2016). However, adding additional sensors comes with several drawbacks, such as price, complexity and limited range. For these reasons we will focus on methods which only use camera data.

For camera only methods there are two common ways to resolve the ambiguity. The first is only applicable if the object lies on a known plane; in practice this plane is often the ground plane, but it could also be, for example, the surface of a table Mills & Goldenberg (1989). Another approach is to do multi-view fusion. In this case the errors due to these ambiguities point in different directions and can therefore be cancelled out with a suitable method.

For this reason it is important that a method which estimates 3d position also fits into a framework which is able to resolve the scale/depth ambiguity.

It is well known that log concave densities are well suited for sensor fusion, since computing the optimal combination is a convex problem in this setting An (1997). In our work we show that both the ground plane assumption and multi view fusion turn into convex optimization problems if the model predicts a log concave probability distribution.

There are many other works which treat the problem of estimating 3d position in a probabilistic framework. For example Feng et al. (2019) models the position of an object with a normal distribution. However they also use a detector on lidar data to avoid large errors and thereby large gradients. Meyer et al. (2019) models the locations of the corners of a bounding box with Laplace distributions, but they also avoid the scale/depth ambiguity by using lidar data. Many works treat depth estimation as a classification problem where a final prediction is constructed by computing the expected value of this discrete distribution Wang et al. (2022), but the probabilities predicted by such methods are in general not log concave, since it is possible to model a multimodal distribution with these methods. Bertoni et al. (2019) model the reciprocal of the depth with a Laplace distribution. However they do not model the projected location in a probabilistic manner, which is not well suited for multi-view fusion. In section 3.3 we also show that modelling depth independently of the projected location, as they and many others do, always results in undesirable properties.

Some works investigate regressing 2d position by having neural networks estimate probability distributions parameterized by mean and covariance Kumar et al. (2020); Mohlin et al. (2021).

3 Motivation

This section focuses on the specific properties which data captured by cameras conforms to, and which a model should therefore take into account when doing probabilistic position estimation based on camera data.

In practice when applying this method the location of the object is modeled by a probability distribution which is parameterized by the output of a convolutional neural network.

3.1 Notation

We will investigate a setting where the model takes an input image which is produced by capturing a scene on an $S\times S$ sensor. The camera is a pin-hole camera without radial distortions and has the known intrinsics

\begin{bmatrix}f&0&S/2\\ 0&f&S/2\\ 0&0&1\end{bmatrix} (1)

Coordinates for this camera are denoted $v=(x,y,z)$, where the $z$ axis is aligned with the principal axis of the camera and $v=\bar{0}$ corresponds to the center of the camera.
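For concreteness, the following minimal sketch (our own illustration, assuming NumPy; the helper name project is hypothetical) maps a camera-frame point $v=(x,y,z)$ to pixel coordinates under the intrinsics in equation 1.

```python
import numpy as np

def project(v, f, S):
    """Project a camera-frame point v = (x, y, z) onto an S x S sensor
    using the pinhole intrinsics of equation 1. Returns pixel coordinates."""
    K = np.array([[f, 0.0, S / 2],
                  [0.0, f, S / 2],
                  [0.0, 0.0, 1.0]])
    u = K @ np.asarray(v, dtype=float)
    return u[:2] / u[2]  # perspective division by z

# Example: a point 4 m in front of the camera, slightly off-axis.
print(project((0.1, -0.05, 4.0), f=1550.0, S=224))
```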

The method will predict a probability distribution parameterized by $\theta$ based on the input image.

3.2 Desired constraints for estimated distribution

In this section we describe constraints which we argue a probability distribution estimated from camera data should conform to. We also motivate why these constraints should exist.

constraint 1: The model can express any variance for the projected coordinates on the image sensor and depth independently.

Formally:

\forall(a,A)\in\mathbb{R}^{+}\times\textbf{S}^{2}_{++}\ \exists\theta\text{ such that }Var[z|\theta]=a\text{ and }Var[x/z,y/z|\theta]=A (2)

Motivation:

For many tasks the error of depth estimation is inherently large due to scale/depth ambiguity. Therefore, for objects which have a variation in size, but where the absolute size is hard to estimate, there will be an inherent error proportional to the scale error and the distance to the object. Despite this, estimating the position in projected coordinates can often be done accurately. For example, assuming that the height of humans is between 1.5-1.9m and assuming that it is intrinsically hard to estimate the size of an unseen person, an uncertainty of 13% of the distance to the person is reasonable to expect. Despite this it is possible to estimate the position of keypoints in the projected coordinate space of an image accurately, often within a few pixels.

constraint 2: The model should not try to estimate the coordinates directly, but instead estimate the depth divided by the focal length and the projected position of the object.

Formally:

p_{cam}(x,y,z|\theta(image);f)=p_{cam}(fx/z,fy/z,z/f|\theta(image)) (3)

Motivation:

It is difficult to separate the quantities scale, depth and focal length, since the main effect of each variable is to change the apparent size of the object on the sensor. For this reason the focal length needs to be taken into account when estimating the depth of an object, otherwise the variation in focal length adds additional ambiguity. When scale/depth ambiguity is present it is not possible to estimate the absolute position for the x or y coordinate accurately, but it is often possible to do so for the projected position of the object. This motivates why a method should also estimate the location in image space. Methods which conform to these constraints are not uncommon. For example Zhen et al. (2020) estimates the depth after dividing by the focal length. It is also common to estimate the location of objects in image space.

constraint 3: The output probability should have support only for points in front of the camera.

Formally:

p_{cam}(x,y,z)=0\text{ if }z<0 (4)

Motivation:

Cameras only depict objects in front of them. The output distribution should reflect that.

constraint 4: The output probability should have support for all points in the field of view.

Formally: for a camera with focal length $f$ and sensor size $S$

\forall x,y,z\text{ such that }|fx/z|<S/2,\ |fy/z|<S/2\text{ and }z>0\text{ then }p_{cam}(x,y,z|\theta)>0 (5)

Motivation:

One standard way to optimize neural networks which output probability distributions is to minimize the negative log likelihood. Backpropagation is only possible if the predicted probability is larger than 0 for the ground truth position. Since there are no guarantees about what an untrained network predicts, the probability has to be positive for all locations which are reasonable without considering the input data. Since a camera only depicts objects in its field of view, this is a sufficient condition.

constraint 5: The negative log likelihood of the position is convex. Formally:

-\log(p_{cam}(v|\theta)) (6)

is convex with respect to $v$.

Motivation:

A common method to combine estimates which are expressed as probability distributions is by assuming that the two estimates have errors which are independent random variables and letting the output of the fusion be the maximum likelihood point under this assumption Mohlin et al. (2021). It is desirable if computing this combination is a convex optimization problem since it guarantees that it is possible to find the fusion quickly. In section 3.3 we show that a sufficient constraint for this property is that the negative log likelihood is convex with respect to position.

constraint 6: The probability for objects infinitely far away from the camera is 0. Formally:

\lim\limits_{r\rightarrow\infty}\max\limits_{\|v\|=r}p(v)=0 (7)
\forall x,y\in\mathbb{R}^{2}\ p(x,y,0)=0 (8)

Motivation:

Firstly objects of finite size are not visible if they are infinitely far away from the camera. Secondly we need this property to guarantee convergence when doing multi-view fusion.

constraint 7: $p_{cam}(v|\theta)$ is continuous with respect to $v$.

Motivation:

It is not reasonable that small changes in position change the density by a large amount.

3.3 Properties given these constraints

Proposition 1: All distributions which are twice differentiable at least at one point and decompose into

p_{composed}(x,y,z|\theta)=p_{projected}(x/z,y/z|\theta)\,p_{depth}(z|\theta) (9)

do not fulfill all of the above constraints. Proof in supplementary B.

Proposition 2: In multi view fusion, if the fields of view of the cameras have a non-empty intersection and each camera produces a probability estimate conforming to constraints 1-7, then a valid maximum likelihood fusion point exists and can be found by convex optimization. Proof in supplementary C.

Proposition 3: Imposing a ground plane constraint is a convex optimization problem. If any point of the ground plane is in the field of view of the camera, then the constraints also guarantee that a solution exists. Proof in supplementary C.

4 Method

In this work we first describe the Projected Huber Distribution, which we will show fulfills the constraints described in section 3.

4.1 Projected Huber Distribution

The distribution can be decomposed into one component $p_{proj}$ which mainly models the probability in the projected coordinates, and another component $p_{depth}$ which models the distribution in depth.

The component which mainly models the probability over the projected coordinates is

p_{proj}(x,y|z;\mu,A)=\dfrac{1}{K_{proj}(A,\mu_{z})}\exp\left(-h\left(\left\|A\begin{bmatrix}x/z-\mu_{x}\\ y/z-\mu_{y}\end{bmatrix}\right\|_{2}\dfrac{z}{\mu_{z}}\right)\right) (10)

By including $z$ in this distribution we can avoid the problem described in proposition 1. Note that conditioning on $z$ is necessary to get a proper distribution.

$(\mu_{x},\mu_{y})\in\mathbb{R}^{2}$ models the mean position on the camera sensor. $\mu_{z}\in\mathbb{R}^{+}$ models the estimated depth of the object and $A\in\textbf{S}^{2}_{++}$ models the precision. The function $h(\cdot)$ is a Huber function.

The distribution which models the depth is

p_{depth}(z;\mu_{z},a)=\begin{cases}\dfrac{1}{K_{depth}(a,\mu_{z})}\exp(-a\max(z/\mu_{z},\mu_{z}/z))&\text{ if }z>0\\ 0&\text{ otherwise}\end{cases} (11)

where $\mu_{z}\in\mathbb{R}^{+}$ models the estimated depth. Note that this is the same parameter as for $p_{proj}$. $a\in\mathbb{R}^{+}$ roughly models the precision; that is, a larger $a$ reduces the variance.
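As a minimal sketch (assuming NumPy; the function names are ours), the Huber function $h$, quadratic for arguments up to 1 and linear beyond as in the integrals of supplementary D, and the unnormalized depth factor can be written as:

```python
import numpy as np

def huber(r):
    """The Huber function h(.): quadratic for |r| <= 1, linear beyond,
    matching the case split used in supplementary D.3-D.4."""
    r = np.abs(r)
    return np.where(r <= 1.0, 0.5 * r**2, r - 0.5)

def depth_density_unnormalized(z, mu_z, a):
    """Unnormalized p_depth from equation 11: zero behind the camera,
    exp(-a * max(z/mu_z, mu_z/z)) in front of it."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.zeros_like(z)
    front = z > 0
    out[front] = np.exp(-a * np.maximum(z[front] / mu_z, mu_z / z[front]))
    return out
```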

By combining these distributions we get the Projected Huber Distribution

p_{combined}(x,y,z;\mu,A,a)=p_{proj}(x,y|z;A,\mu)\,p_{depth}(z;\mu_{z},a) (12)

for $x,y,z$ in the camera's coordinate system.

The normalizing factors are

K_{depth}(\mu_{z},a)=\mu_{z}(\exp(-a)/a+\Gamma(-1,a)a) (13)
K_{proj}(\mu_{z},A)=\dfrac{\mu_{z}^{2}}{|A|}2\pi(1+\exp(-1/2)) (14)
K_{combined}(\mu_{z},A,a)=K_{depth}(\mu_{z},a)K_{proj}(\mu_{z},A) (15)

where $K_{depth}$ is bounded by

\mu_{z}\dfrac{\exp(-a)}{a}\leq K_{depth}(\mu_{z},a)\leq\mu_{z}\exp(-a)(1/a+1) (16)

This distribution conforms to all constraints except constraint 2, which relates to how the parameters of the model are predicted. We show how to predict the parameters in a way which conforms to constraint 2 in subsection 4.4.
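A sketch of evaluating the full log density, including the normalizing factors in equations 13-15 (assuming NumPy and mpmath for the upper incomplete gamma function; the function names are ours):

```python
import numpy as np
import mpmath

K_HUBER = 2 * np.pi * (1 + np.exp(-0.5))  # norm factor of the 2d Huber density

def K_depth(mu_z, a):
    """Normalizing factor of p_depth, equation 13. Gamma(-1, a) is the
    upper incomplete gamma function, evaluated with mpmath.gammainc."""
    return mu_z * (np.exp(-a) / a + float(mpmath.gammainc(-1, a)) * a)

def log_p_combined(v, mu, A, a):
    """Log density of the Projected Huber Distribution, equations 10-12."""
    x, y, z = v
    mu_x, mu_y, mu_z = mu
    if z <= 0:
        return -np.inf  # constraint 3: no support behind the camera
    r = np.linalg.norm(A @ np.array([x / z - mu_x, y / z - mu_y]) * (z / mu_z))
    h = 0.5 * r**2 if r <= 1.0 else r - 0.5
    K_proj = mu_z**2 / np.linalg.det(A) * K_HUBER  # equation 14
    return (-h - np.log(K_proj)
            - a * max(z / mu_z, mu_z / z) - np.log(K_depth(mu_z, a)))
```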

Proposition 5: The moments of this distribution are

E\left[\begin{bmatrix}x/z\\ y/z\end{bmatrix}\Big|\mu_{p}\right]=\mu_{p} (17)
Var\left[\begin{bmatrix}x/z\\ y/z\end{bmatrix}\Big|A\right]=A^{-2}\dfrac{2+13\exp(-1/2)}{2+2\exp(-1/2)} (18)
E[z]=\mu_{z}\dfrac{\Gamma(2,a)/a^{2}+\Gamma(-2,a)a^{2}}{\Gamma(1,a)/a+\Gamma(-1,a)a} (19)
Var[z]=\mu_{z}^{2}\dfrac{(\Gamma(3,a)/a^{3}+\Gamma(-3,a)a^{3})(\Gamma(1,a)/a+\Gamma(-1,a)a)-(\Gamma(2,a)/a^{2}+\Gamma(-2,a)a^{2})^{2}}{(\Gamma(1,a)/a+\Gamma(-1,a)a)^{2}} (20)

The $\Gamma$ terms are not very intuitive, but the fraction for the expected value quickly decreases to 1, while the fraction for the variance behaves similarly to $1/a^{2}$, that is

E[z]\approx\mu_{z} (21)
Var[z]\approx\mu_{z}^{2}/a^{2} (22)

Proof in supplementary F.
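The exact depth moments can be checked numerically; a small sketch (assuming mpmath, whose gammainc accepts the negative orders appearing in equations 19-20; the helper name is ours):

```python
import mpmath

def depth_moments(mu_z, a):
    """Exact E[z] and Var[z] from equations 19-20, with
    Gamma(k, a) = mpmath.gammainc(k, a) the upper incomplete gamma."""
    G = lambda k: mpmath.gammainc(k, a)
    m0 = G(1) / a + G(-1) * a        # normalizer, cf. equation 13
    m1 = G(2) / a**2 + G(-2) * a**2
    m2 = G(3) / a**3 + G(-3) * a**3
    mean = mu_z * m1 / m0
    var = mu_z**2 * (m2 * m0 - m1**2) / m0**2
    return float(mean), float(var)

# For moderately large a the approximations above are already close:
print(depth_moments(4.0, 5.0))  # roughly (4.0, (4.0 / 5.0)**2)
```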

4.2 Proof for constraints

constraint 1: Proof: From proposition 5 we see that $A$ models the uncertainty in projected coordinates, while $a$ models the uncertainty in the depth coordinate.

constraint 3: $p(z)=0$ if $z<0$, from the definition in equation 12.

constraint 4: $p(z)>0$ if $z>0$, since the range of $\exp$ is $(0,\infty)$ and the expression can be evaluated for all $(x,y,z)\in\mathbb{R}^{2}\times\mathbb{R}^{+}$. The field of view of a camera is a subset of the half-space $z>0$.

constraint 5: Proof sketch: Show that both $p_{proj}$ and $p_{depth}$ have convex negative log likelihoods. An affine basis change turns the argument of $-\log(p_{proj})$ into a norm $h(\|q\|_{2})$, which is convex. $-\log(p_{depth})$ is also convex. Full proof in supplementary G.1.

constraint 6: Proof sketch: For the region $z\leq 0$ the proof is trivial. Then we prove it for the region $0<z\leq(x^{2}+y^{2})^{1/4}$. Here the $\exp(-h(\cdot))$ term goes to 0 as $\|v\|_{2}\rightarrow\infty$ while the other factors are bounded. For the region $z>(x^{2}+y^{2})^{1/4}$ the factor $\exp(-a\max(z/\mu_{z},\mu_{z}/z))$ goes to 0 while the other factors are bounded. Full proof in supplementary G.2.

constraint 7: Proof sketch: For $z>0$ the function is trivially continuous. For $z<0$ the function is also trivially continuous. When $z$ approaches 0 the factor $\exp(-a\max(z/\mu_{z},\mu_{z}/z))$ goes to 0. Full proof in G.3.

4.3 Parameter remapping

When the parameters $\theta$ of the Projected Huber Distribution are estimated by a neural network, it is necessary to turn the output of the neural network into a valid parameterization for the Projected Huber Distribution. Valid in this sense refers to conforming to the constraints $A\in\textbf{S}^{2}_{++}$, $a\in\mathbb{R}^{+}$ and $\mu_{z}\in\mathbb{R}^{+}$, that is, $A$ is positive definite while $a$ and $\mu_{z}$ are positive.

Neural networks give outputs in $\mathbb{R}^{d_{out}}$, where $d_{out}$ is decided by the architecture of the network. These outputs do not conform to our desired parameter constraints unless a suitable activation is applied.

To construct this activation we start by doing a basis change to conform to constraint 2 from the motivation.

v_{p}=\begin{bmatrix}x_{p}\\ y_{p}\end{bmatrix}=\begin{bmatrix}\dfrac{2fx}{zS}\\ \dfrac{2fy}{zS}\end{bmatrix} (23)
z_{p}=\dfrac{z}{\mu_{z0}f} (24)

where $\mu_{z0}$ is a bias term defined by

\mu_{z0}=\sqrt{\left(\max\limits_{z,f\in\textit{Dataset}}z/f\right)\left(\min\limits_{z,f\in\textit{Dataset}}z/f\right)} (25)

For the purpose of proving that our loss has bounded gradients we also need the constant $D$ defined by

D=\sqrt{\left(\max\limits_{z,f\in\textit{Dataset}}z/f\right)\bigg/\left(\min\limits_{z,f\in\textit{Dataset}}z/f\right)} (26)

These values can either be derived from known properties of the dataset or by computing these values for all samples in the training dataset.
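A sketch of computing $\mu_{z0}$ and $D$ from training samples (our own helper, assuming NumPy):

```python
import numpy as np

def normalization_constants(depths, focal_lengths):
    """mu_z0 and D from equations 25-26: the geometric mean and geometric
    spread of z/f over the training dataset."""
    ratios = np.asarray(depths) / np.asarray(focal_lengths)
    mu_z0 = np.sqrt(ratios.max() * ratios.min())
    D = np.sqrt(ratios.max() / ratios.min())
    return mu_z0, D

# With z in (3, 5) m and f in (1200, 2000) pixels/m as in section 5.1,
# this reproduces the values in equations 38-39 (about 2.5e-3 and 1.7).
z = np.random.uniform(3, 5, 10000)
f = np.random.uniform(1200, 2000, 10000)
print(normalization_constants(z, f))
```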

$f/z$ is proportional to the projected scale of an object. Even in the extreme case where the scale ambiguity is 40% and the projected size varies between 2 and 200 pixels, $D$ would be less than 20. Furthermore $D$ is only used to prove that the loss has bounded gradients.

Proposition 6: $\|v_{p}\|_{\infty}\leq 1$ and $1/D\leq z_{p}\leq D$ for points in the field of view of the camera with a depth conforming to equations 25-26.

Proof sketch: $v_{p}$ corresponds to the projected image coordinates, scaled to be between -1 and 1. $z_{p}$ is normalized by definitions 25-26. Full proof in supplementary H.

Now $v_{p}$ corresponds to the coordinates in the projected image and $z_{p}$ is the z coordinate after compensating for the scale change of the focal length.

Define

\nu_{p}=\begin{bmatrix}\nu_{x}\\ \nu_{y}\end{bmatrix}=\dfrac{f\mu_{z0}}{\mu_{z}}A\begin{bmatrix}\mu_{x}\\ \mu_{y}\end{bmatrix} (27)
\nu_{z}=\dfrac{\mu_{z}}{\mu_{z0}f} (28)
B=A\dfrac{Sf\mu_{z0}}{2\mu_{z}} (29)

The negative log likelihood of $p_{combined}$ in the original basis can be written as

L_{combined\_original}(\mu,A,a;x,y,z)=L_{proj\_original}(\mu,A;x,y,z)+L_{depth\_original}(\mu_{z},a;z) (30)
L_{proj\_original}=h\left(\left\|A\begin{bmatrix}x/z-\mu_{x}\\ y/z-\mu_{y}\end{bmatrix}\right\|_{2}\dfrac{z}{\mu_{z}}\right)-\log(|A|)+2\log(\mu_{z}) (31)
L_{depth\_original}=a\max(z/\mu_{z},\mu_{z}/z)+\log(\exp(-a)/a+\Gamma(-1,a)a)+\log(\mu_{z}) (32)

which in the new basis becomes

L_{combined}(\nu,B,a;v_{p},z_{p})=L_{proj}(B,\nu_{p};v_{p},z_{p})+L_{depth}(a,\nu_{z};z_{p}) (33)
L_{proj}(B,\nu_{p};v_{p},z_{p})=h\left(\left\|Bv_{p}-\nu_{p}\right\|_{2}z_{p}\right)-\log(|B|) (34)
L_{depth}(a,\nu_{z};z_{p})=a\max(z_{p}/\nu_{z},\nu_{z}/z_{p})+\log(\exp(-a)/a+\Gamma(-1,a)a)+\log(\nu_{z}) (35)

In these losses we exclude terms not required for computing gradients, such as terms only containing the constants $f,S,\mu_{z0}$.

$K_{depth}(a)$ and its gradients are computed by numerical integration. The full method is described in supplementary J.
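A sketch of evaluating the loss value in the new basis (assuming NumPy and mpmath; for training, the gradient of the $\Gamma(-1,a)$ term requires the numerical treatment of supplementary J, which is not reproduced here):

```python
import numpy as np
import mpmath

def huber(r):
    return 0.5 * r**2 if r <= 1.0 else r - 0.5

def loss_combined(B, nu_p, nu_z, a, v_p, z_p):
    """Negative log likelihood in the remapped basis, equations 33-35,
    with constant terms dropped as in the paper."""
    l_proj = huber(np.linalg.norm(B @ v_p - nu_p) * z_p) - np.log(np.linalg.det(B))
    log_K = np.log(np.exp(-a) / a + float(mpmath.gammainc(-1, a)) * a)
    l_depth = a * max(z_p / nu_z, nu_z / z_p) + log_K + np.log(nu_z)
    return l_proj + l_depth
```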

Estimating these parameters conforms to constraint 2, since $\nu_{p}$ and $B$ model the position and precision for the projected coordinates, while $\nu_{z}$ and $a$ model the depth and the precision of the depth estimate.

4.4 Enforcing constraints on distribution parameters

Designing the activation which outputs $B$ and $\nu_{p}$ is done in the same way as Mohlin et al. (2021).

Estimating $a$ and $\nu_{z}$ is done by starting with two real numbers $w_{1}$, $w_{2}$:

a(w_{1})=\begin{cases}\exp(w_{1})&\text{ if }w_{1}<0\\ w_{1}+1&\text{ otherwise}\end{cases} (36)
\nu_{z}(w_{1},w_{2})=\begin{cases}1+w_{2}/a(w_{1})&\text{ if }w_{2}>0\\ 1/(1-w_{2}/a(w_{1}))&\text{ otherwise}\end{cases} (37)

With this parameterization the loss is convex with respect to the network output when $a>1$. The gradients of the loss are bounded with respect to $w$. Proofs in supplementary K and I.

Having a loss which is Lipschitz-continuous should aid with stability during training, since it avoids back-propagating very large gradients.
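A minimal sketch of the mapping from raw network outputs to valid $(a,\nu_{z})$, equations 36-37 (our own helper, assuming NumPy):

```python
import numpy as np

def remap(w1, w2):
    """Map two raw network outputs to valid (a, nu_z), equations 36-37.
    The exponential branch keeps a positive for negative w1; the linear
    branch keeps the gradient of a bounded for positive w1."""
    a = np.exp(w1) if w1 < 0 else w1 + 1.0
    nu_z = 1.0 + w2 / a if w2 > 0 else 1.0 / (1.0 - w2 / a)
    return a, nu_z
```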

5 Experiments

Figure 2: Setup for the experiments. Images are fed into a standard ResNet-50, the output of which is sent through the activation, basis change and loss described in section 4.4.

In this section we show how the method described in section 4 can be applied in practice to estimate the position of an object.

5.1 Dataset

We construct a synthetic dataset by rendering objects in a similar way as Johnson et al. (2017). We choose our task to be estimating the location of a rendered cylinder. The cylinder is rendered with varying reflective material properties and color, and with random orientation.

The scene is rendered on a sensor of size $224\times 224$ pixels. The focal length of the camera is sampled from a uniform distribution $f\sim U(1200,2000)$ (pixels/m). The cylinder has a height of 0.2 meters and a radius of 0.1 meters. The depth is uniformly sampled from $z\sim U(3,5)$ m. The x and y coordinates are sampled from $\dfrac{Sz}{f}U(-0.5,0.5)$ for each axis.
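A sketch of the scene-parameter sampling only (the rendering itself is not reproduced; assuming NumPy, and the helper name is ours):

```python
import numpy as np

def sample_scene(S=224):
    """Sample one scene configuration as in section 5.1."""
    f = np.random.uniform(1200, 2000)   # focal length (pixels/m)
    z = np.random.uniform(3, 5)         # depth (m)
    # x and y are drawn so the object stays in the field of view.
    x = S * z / f * np.random.uniform(-0.5, 0.5)
    y = S * z / f * np.random.uniform(-0.5, 0.5)
    return f, np.array([x, y, z])
```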

In this experimental setup the normalizing constants are

\mu_{z0}=\sqrt{\left(\max\limits_{f\in(1200,2000),z\in(3,5)}z/f\right)\left(\min\limits_{f\in(1200,2000),z\in(3,5)}z/f\right)}\approx 2.5\cdot 10^{-3} (38)
D=\sqrt{\left(\max\limits_{f\in(1200,2000),z\in(3,5)}z/f\right)\bigg/\left(\min\limits_{f\in(1200,2000),z\in(3,5)}z/f\right)}\approx 1.7 (39)

The data is fitted by using the rendered image as input to a standard ResNet-50 with an output dimension of 7. Instead of applying a softmax on the output we use the mapping described in section 4.4 to predict the parameters of the Projected Huber Distribution. Applying a negative log likelihood loss gives equation 33, which is used as the loss when fitting the parameters of the network. The parameters are updated using Adam with default parameters. This setup is visualized in figure 2.
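A minimal sketch of this setup (assuming PyTorch and a recent torchvision; the activation, basis change and loss of section 4.4 plug in where indicated):

```python
import torch
import torchvision

# ResNet-50 backbone whose final layer emits the 7 raw parameters
# (3 for B, 2 for nu_p, and w1, w2 for a and nu_z).
model = torchvision.models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 7)
optimizer = torch.optim.Adam(model.parameters())  # default parameters

images = torch.randn(4, 3, 224, 224)  # a dummy batch of rendered scenes
raw = model(images)                   # shape (4, 7): fed to the remapping
```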

To showcase different cases we also generate additional datasets where the cylinder is scaled by a factor uniformly sampled from U(0.8,1.2). This dataset is constructed to showcase how the method performs when a scale/depth ambiguity is present.

We also generate a dataset where we add a cube and a sphere, to both add visual complexity and introduce occlusions which incentivize the model to predict different uncertainties for the case where the cylinder is occluded compared to when it is not. For this dataset no scale ambiguity is present. The cube and sphere have orientations, material properties and color sampled independently from the cylinder and each other.

Finally we generate a dataset where we render a scene with occluding objects and with scale ambiguity. The scaling factor is sampled independently for all objects.

Each dataset contains 8000 training samples and 2000 test samples.

5.2 Experimental result

5.2.1 Empirical error vs. estimated error

We investigate how the predicted uncertainty correlates with the empirical error by computing, for each sample, the empirical squared error for the projected coordinates and for the depth. We then compute the predicted variance for each sample based on $A$ and $a$ by using equations 18 and 20.

We check if the estimated uncertainties are well correlated by sorting the samples based on the predicted variance and applying a low-pass filter over 200 adjacent samples to both the predicted variance and the empirical squared error. For a perfect model these should be identical. For a good model they should be highly correlated.
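A sketch of this evaluation procedure (our own helper, assuming NumPy):

```python
import numpy as np

def calibration_curve(pred_var, sq_err, window=200):
    """Sort samples by predicted variance, then smooth both the predicted
    variance and the empirical squared error with a moving average over
    `window` adjacent samples, as used for figure 3."""
    order = np.argsort(pred_var)
    kernel = np.ones(window) / window
    smooth = lambda v: np.convolve(np.asarray(v)[order], kernel, mode='valid')
    return smooth(pred_var), smooth(sq_err)
```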

Figure 3: Correlation between predicted variance and empirical variance for depth (left) and projected coordinates (right). The red dashed line corresponds to the expected analytical variance and the empirical squared error being equal. A method which predicts uncertainty well should produce a curve which is close to this line. The unit for the left plot is meters; for the right plot the quantity is the unitless ratio between projected error and depth.
Figure 4: Average precision parameters during training. When there are no occluding objects the estimated precision in image coordinates increases quickly during training; with occlusion, estimating the position in image space is harder and the precision increases more slowly. The precision of the depth estimate increases throughout training when there is no scale ambiguity, but converges to a constant when there is scale ambiguity.

From figure 3 we see that the estimated errors correlate well with the empirical errors over all datasets, for both the depth and the projected component. This indicates that the uncertainty estimates which the model produces are useful. The estimated uncertainties also follow intuition: the dataset without occlusion or scale ambiguity has low predicted uncertainty for both the projected coordinates and the depth. For the dataset without occlusion but with scale ambiguity the depth uncertainty increases significantly, but the uncertainty for the projected coordinates remains low. When fitting the model on a dataset without scale ambiguity but with other objects which can occlude the cylinder, both the predicted uncertainty for the projected coordinates and for the depth increases, which is expected since it is harder to predict where objects are if they are partially or fully occluded.

We also show how the average predicted parameters $a$ and $A$ change over training in figure 4. From this figure we see that the predicted uncertainty decreases as the training proceeds. This is due to the model adapting to the fact that the error decreases during training.

5.2.2 Regression performance

To show that our method is able to estimate 3d position well we compare it against several regression baselines. The reference is to regress $v_{p}$ and $z_{p}$ directly. Some methods estimate the logarithm of depth instead of the raw depth, such as Lin & Lee (2020). For completeness we compare to estimating $\log(z_{p})$ as well. The results of our method and these baselines are shown in table 1. From this table we see that our method often exceeds the performance of both baselines. This improvement is not likely to be significant, but it does not need to be either, since our contribution is giving a probabilistic prediction with a convex negative log likelihood, not producing a better single point estimate. The errors are measured in the camera's coordinate system. For the baselines the predicted $(x,y,z)$ coordinates are computed by using the mapping in equation 23. For our method we use the mode as our single point prediction,

(x,y,z)=\mu_{z}(\mu_{x},\mu_{y},1) (40)

to construct a point estimate. There are other possible single point predictors which could be equally good, such as the geometric median or the mean. We did not evaluate these, but they should produce similar predictions.

We also do ablations where we ignore the focal length of the camera when estimating the position, thereby violating our proposed constraint 2. One such baseline is to let the network predict $x,y,z$ directly using an MAE loss. Another such baseline is to predict $v_{p}$ and $z_{p}$ but with a constant focal length of $f=\sqrt{f_{max}f_{min}}\approx 1550$ for the mapping between $x,y,z$ and $v_{p},z_{p}$. Note that the camera used to capture the scene had a different focal length for different samples; we only use the incorrect average focal length for the mapping in equation 23 for training and evaluation. The results for the baselines and our method on the dataset without occluding objects or scale ambiguity are shown in Table 2.

From this table we see that baselines which ignore the focal length perform significantly worse, likely due to the fact that in general it is not possible to measure depth from images, only scale; therefore ignoring the focal length introduces an irreducible error.

Finally we try to frame how large our errors are in image space. On average the focal length is approximately 1550 pixels. The object is on average 4000mm away. Therefore the average error of 2.4mm corresponds to approximately 0.9 pixels, which is quite good. Objects with a diameter of 0.2m will have a diameter of approximately 80 pixels when rendered with a focal length of 1550 at a depth of 4 meters. If estimating depth was done by the proxy task of estimating the diameter of objects, an error of 1 pixel would therefore result in a depth error of 50mm, which is on par with our average error when there are no occlusions or scale ambiguities, again indicating that our method works quite well.

Introducing a 20% scale ambiguity of the rendered object should result in irreducible errors which are on average 10% of the depth. The object was on average 4000mm from the camera, which should give an average error of approximately 400mm. The measured errors which we observe are similar to this, but slightly lower, possibly due to the network being able to infer some information about the scale through some non-intuitive shading effect, or due to the fact that $z_{p}$ does not follow a uniform distribution.

| Method | Metric | default | occl. | scale ambig. | occl. & scale ambig. |
| --- | --- | --- | --- | --- | --- |
| MAE, $v_p$, $z_p$ | projected error (mm) | 3.6 | 22.8 | 3.7 | 37.2 |
| | depth error (mm) | 69 | 218 | 359 | 386 |
| MAE, $v_p$, $\log(z_p)$ | projected error (mm) | 5.0 | 18.0 | 5.1 | 27.6 |
| | depth error (mm) | 36 | 178 | 355 | 355 |
| probabilistic (ours) | projected error (mm) | 2.4 | 13.0 | 1.6 | 17.0 |
| | depth error (mm) | 48 | 168 | 347 | 350 |

Table 1: Regression performance of our proposed method (bottom) compared to two plain regression baselines.
| Method | Projected error (mm) | Depth error (mm) |
| --- | --- | --- |
| MAE, $(x,y,z)$ | 13.0 | 327 |
| MAE, $v_p$, $z_p$, constant $f$ | 10.8 | 325 |
| probabilistic (ours) | 2.4 | 48 |

Table 2: Ablation on mean absolute error losses when 1) directly regressing $x,y,z$, 2) regressing $v_p,z_p$ where the focal length is approximated with a constant, and 3) using our method.

6 Discussion

The constraints which we described are well motivated for the general case, but there are cases where they are not necessary. For example, if scale ambiguity is not present in the dataset then constraint 1 is not necessary. If the focal length is the same for all pictures in the dataset then constraint 2 is not necessary. It is also possible to infer camera intrinsics from some scenes; if this is the case then constraint 2 might not be necessary either. If the uncertainty in depth is small, many distributions would assign a small probability that the object is behind the camera even if constraint 3 is not strictly true, indicating that this constraint is mainly important when the depth uncertainty is large. Constraint 5 is only necessary if the estimated probability distribution needs to be log concave. We have shown this is useful for multi view fusion and imposing ground plane constraints, but if the goal is only to produce as good a single point prediction of the position as possible, this constraint should not be necessary.

Furthermore our ablations indicate that using the correct focal length is important when estimating 3d position from camera data.

The focus of this work is to investigate the theoretical aspects of probabilistic 3d regression from camera data. To be able to highlight this the experiments are on synthetic data. Future work could be to verify that the proposed method works on more challenging real world data as well.

We also show that it should be easy to combine our method with multi-view fusion or the ground plane assumption in theory, but we do not implement it. This could also be future work.

7 Conclusion

In this work we have described constraints which we argue should be taken into account when designing a method for estimating probability distributions over 3d position from camera data. We have also designed a method which conforms to these constraints. In our experiments we show that the uncertainty estimates which our method produces correlate well with empirical errors. Our experiments also show that our method performs on par with or better than several regression baselines. In our ablations we show that in our experimental setting the performance decreases significantly if the camera intrinsics are ignored.

8 Acknowledgements

DM is an industrial PhD student at Tobii AB. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

References

  • An (1997) Mark Yuying An. Log-concave probability distributions: Theory and statistical testing. Duke University Dept of Economics Working Paper, (95-03), 1997.
  • Bertoni et al. (2019) Lorenzo Bertoni, Sven Kreiss, and Alexandre Alahi. Monoloco: Monocular 3d pedestrian localization and uncertainty estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • Bruns & Jensfelt (2022) Leonard Bruns and Patric Jensfelt. On the evaluation of rgb-d-based categorical pose and shape estimation. arXiv preprint arXiv:2202.10346, 2022.
  • Choi et al. (2016) Sungjoon Choi, Qian-Yi Zhou, Stephen Miller, and Vladlen Koltun. A large dataset of object scans. arXiv preprint arXiv:1602.02481, 2016.
  • Feng et al. (2019) Di Feng, Lars Rosenbaum, Claudius Glaeser, Fabian Timm, and Klaus Dietmayer. Can we trust you? on calibration of a probabilistic object detector for autonomous driving. arXiv preprint arXiv:1909.12358, 2019.
  • Geiger et al. (2012) Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • Ionescu et al. (2013) Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2013.
  • Iskakov et al. (2019) Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. Learnable triangulation of human pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7718–7727, 2019.
  • Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2901–2910, 2017.
  • Kumar et al. (2020) Abhinav Kumar, Tim K Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, and Chen Feng. Luvli face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  8236–8246, 2020.
  • Lin & Lee (2020) Jiahao Lin and Gim Hee Lee. Hdnet: Human depth estimation for multi-person camera-space localization. In European Conference on Computer Vision, pp.  633–648. Springer, 2020.
  • Meyer et al. (2019) Gregory P Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K Wellington. Lasernet: An efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  12677–12686, 2019.
  • Mills & Goldenberg (1989) James K Mills and Andrew A Goldenberg. Force and position control of manipulators during constrained motion tasks. IEEE Transactions on Robotics and Automation, 5(1):30–46, 1989.
  • Mohlin et al. (2021) David Mohlin, Gerald Bianchi, and Josephine Sullivan. Probabilistic regression with huber distributions. arXiv preprint arXiv:2111.10296, 2021.
  • Moon et al. (2019) Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  10133–10142, 2019.
  • von Marcard et al. (2018) Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In European Conference on Computer Vision (ECCV), sep 2018.
  • Wang et al. (2022) Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In Conference on Robot Learning, pp.  1475–1485. PMLR, 2022.
  • Zhen et al. (2020) Jianan Zhen, Qi Fang, Jiaming Sun, Wentao Liu, Wei Jiang, Hujun Bao, and Xiaowei Zhou. Smap: Single-shot multi-person absolute 3d pose estimation. In European Conference on Computer Vision, pp.  550–566. Springer, 2020.


Appendix B Independent projected and depth components result in finite precision

B.1 Variance decreases when reducing size of single tail

This section proves an intermediate result which we use for the main proof. Here we show that for random variables $Y,Z$, a constant $c>0$ and $p\in[0,1]$ with $E[Y]E[Z]<0$, if $X$ is defined by

X=\begin{cases}cY&\text{ w.p. }p\\ Z&\text{ otherwise}\end{cases} (41)

then $Var(X)$ decreases as $c$ decreases.

proof

E[X]=pcE[Y]+(1-p)E[Z] (42)
E[X^{2}]=pc^{2}E[Y^{2}]+(1-p)E[Z^{2}] (43)
Var[X]=c^{2}p(E[Y^{2}]-(E[Y])^{2})+c^{2}(p-p^{2})(E[Y])^{2}-2cp(1-p)E[Y]E[Z]+\ldots (44)

where $\ldots$ does not depend on $c$. The first term is $c^{2}$ times the variance of $Y$, which is positive; the second term is positive since $(p-p^{2})>0$. The third term is positive since $E[Y]E[Z]<0$ and the negations cancel out.

Therefore if $c>0$, $Var(X)$ increases when $c$ increases, which concludes the proof.

B.2 Main proof

Proof sketch: assume all constraints except constraint 1 hold, and show that the smallest possible variance for the projected coordinates is at least a constant which is larger than 0.

By considering the line $y=0$ we get

P(x,y=0,z)=p_{proj}(x/z,0)\,p_{depth}(z) (45)

The negative log likelihood is

-\log(p(x,y=0,z))=-\log(p_{proj}(x/z,0))-\log(p_{depth}(z)) (46)

For convenience we define

f(x/z)=-\log(p_{proj}(x/z,0)) (47)
g(z)=-\log(p_{depth}(z)) (48)

We compute the Hessian with respect to $x$ and $z$ of the negative log likelihood:

\dfrac{\partial}{\partial x}\left(-\log(p(x,y=0,z))\right)=f^{\prime}(x/z)/z (49)
\dfrac{\partial^{2}}{\partial x^{2}}\left(-\log(p(x,y=0,z))\right)=f^{\prime\prime}(x/z)/z^{2} (50)
\dfrac{\partial^{2}}{\partial x\partial z}\left(-\log(p(x,y=0,z))\right)=-f^{\prime}(x/z)/z^{2}-f^{\prime\prime}(x/z)x/z^{3} (51)
\dfrac{\partial}{\partial z}\left(-\log(p(x,y=0,z))\right)=g^{\prime}(z)-f^{\prime}(x/z)x/z^{2} (52)
\dfrac{\partial^{2}}{\partial z^{2}}\left(-\log(p(x,y=0,z))\right)=g^{\prime\prime}(z)+f^{\prime\prime}(x/z)x^{2}/z^{4}+2f^{\prime}(x/z)x/z^{3} (53)

If this function is convex then this Hessian has a nonnegative determinant. The determinant is

|H_{x,z}|=\dfrac{1}{z^{4}}\left(f^{\prime\prime}(x/z)g^{\prime\prime}(z)+(f^{\prime\prime}(x/z))^{2}(x/z)^{2}+2f^{\prime}(x/z)f^{\prime\prime}(x/z)x/z-(f^{\prime}(x/z)+f^{\prime\prime}(x/z)(x/z))^{2}\right) (54)
=\dfrac{1}{z^{4}}\left(f^{\prime\prime}(x_{p})g^{\prime\prime}(z)-(f^{\prime}(x_{p}))^{2}\right) (55)

The function is twice differentiable at least at one point; let $C=g^{\prime\prime}(z)$ for the $z$ value at this point. If $C<0$ then we reach a contradiction. If $C=0$ then $f^{\prime}(x_{p})=0$ where the function has support. The solution here would be $p_{proj}(x_{p})=1/S$ for $|x_{p}|\leq S/2$, which can not model arbitrary precision.

The solution of

Cf^{\prime\prime}(x_{p})-f^{\prime}(x_{p})^{2}=0 (56)

is

f(x_{p})=-a\log(c_{1}-c_{2}x) (57)

If the constraint is not tight, then the resulting function would increase faster than the solution of the differential equation as the distance to the mode increases. If we study the function for $x_{p}$ larger than the mode, then the derivative is

f^{\prime}(x_{p})=a\dfrac{c_{2}}{c_{1}-c_{2}x} (58)

Since we are to the right of the mode, $f^{\prime}(x_{p})>0$. $c_{1}-c_{2}x$ is also positive since the logarithm of this value is real. Thus in this region $c_{2}>0$. Therefore $f$ approaches infinity when $x\rightarrow c_{1}/c_{2}$, and therefore $c_{1}/c_{2}\geq S/2$.

The mode has to exist since the distribution is proper. The same analysis holds to show that for $x_{p}<m$

f(x_{p})=-C\log(b_{1}+b_{2}x) (59)

where $b_{2}>0$ and $b_{1}/b_{2}>S/2$.

The distribution will give a value either less than the mode, or larger than or equal to the mode. We can now apply the proof in B.1 to show $c_{1}/c_{2}=S/2$ and $b_{1}/b_{2}=S/2$, because if that was not the case it would be possible to reduce the variance by making the constraint active.

Furthermore $f$ is continuous since it is convex.

Thus the distribution which minimizes the projected variance while being consistent with the constraints has the form

p_{proj}(x_{p})\propto\begin{cases}(S/2-x)^{C}&\text{ if }x>m\\ \left(\dfrac{(S/2+x)(S/2-m)}{S/2+m}\right)^{C}&\text{ otherwise}\end{cases} (60)

which obviously can not model an arbitrarily small variance for the projected coordinates for a fixed $C$.

Finally we need to show that

g^{\prime\prime}(z)\geq C (61)

imposes constraints on which variances for $z$ can be modeled.

To maximize the variance we minimize $g$ while keeping the constraint tight. This gives

g(z)=Cz^{2}/2+k_{1}z+k_{2} (62)

which gives the distribution after reparameterization

p_{depth}(z)\propto\exp\left(-\dfrac{(z-\mu)^{2}}{2/C}\right) (63)

which is a normal distribution with variance $1/C$.

Therefore a low variance in the projected coordinates requires a large $C$, which requires a low variance for the depth estimate. This concludes the proof.

Appendix C Multi-view fusion and ground plane constraints

In this section we prove propositions 2 and 3.

C.1 Multi view

For this section we have $n$ cameras which produce predictions $p_{cam}(v_{i};\theta_{i})$, where $v_{i}$ are coordinates in the coordinate system of camera $i$ and $\theta_{i}$ are the parameters of the estimated probability distribution based on the image captured by camera $i$. The images are in theory captured at the same time, or in practice within a very short time period. The multi view fusion is constructed under the assumption that the errors of the individual estimates are independent. The affine transform $v_{1}=R_{i}v_{i}+t_{i}$ transforms coordinates from coordinate system $i$ to coordinate system $1$.

Definition 1: We define the maximum likelihood fusion given the estimates to be

\arg\sup\limits_{v_{1}\in\mathbb{R}^{3}}p(v_{1}|\theta_{1},\cdots\theta_{n})=\prod\limits_{i=1}^{n}p(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i}) (64)

We introduce the notion of a valid fusion as

Definition 2: A valid maximum likelihood fusion point exists if

(\exists v\in\mathbb{R}^{3}\text{ s.t. }p_{cam}(v|\theta_{1},\cdots\theta_{n})=\prod\limits_{i=1}^{n}p_{cam}(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i}))\land(\forall i\in\{1,\cdots n\}\ p_{cam}(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i})>0) (65)

This excludes the cases where the most likely fused point exists infinitely far away or where the fused distribution is improper.

Proposition 2 (existence): In multi view fusion, if the fields of view of the cameras have a non-empty intersection and each camera gives probability estimates conforming to constraints 1-7, there will exist a valid maximum likelihood fusion point.

Proof: Having a non-empty intersection of the fields of view of the cameras implies

\exists v_{1}\text{ such that }\forall i\in\{1,\cdots n\}\ v_{i}=R_{i}^{-1}(v_{1}-t_{i})\text{ and }v_{i}\in FOV_{camera,i} (66)

Constraint 4 implies that $p_{cam}(v_{i}|\theta_{i})=p_{cam}(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i})>0$ for all cameras, since $v_{i}$ is in the field of view of the camera.

Therefore

p(v_{1}|\theta_{1},\cdots\theta_{n})=\prod\limits_{i=1}^{n}p_{cam}(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i})>0 (67)

Constraint 6 implies that there exists an $R$ such that

\forall v\text{ such that if }\|v\|>R\text{ then }p(v|\theta_{1}\cdots\theta_{n})<p(v_{1}|\theta_{1},\cdots\theta_{n}) (68)

therefore

\sup\limits_{v}p(v|\theta_{1},\cdots\theta_{n})=\sup\limits_{v\text{ s.t. }\|v\|\leq R}p(v|\theta_{1},\cdots\theta_{n}) (69)
=\max\limits_{v\text{ s.t. }\|v\|\leq R}p(v|\theta_{1},\cdots\theta_{n}) (70)

The last step is due to: 1) $p(v|\theta_{1},\cdots\theta_{n})$ is continuous, since continuity is preserved when multiplying continuous functions.
2) A closed ball is compact.
3) The extreme value theorem states that the maximum of a continuous function over a compact set exists. Thus applying the extreme value theorem concludes the proof.

Proposition 2 (convexity): Finding the maximum likelihood fusion is a convex optimization problem.

Proof: Maximizing a probability is equivalent to minimizing a negative log likelihood.

Formally, given definition 1

p(v_{1}|\theta_{1},\cdots,\theta_{n})=\prod\limits_{i=1}^{n}p(R_{i}^{-1}(v_{1}-t_{i})|\theta(im_{i})) (71)

Here we have used the fact that the transform from $v_{1}$ to $v_{i}$ has a Jacobian with determinant 1.

Maximizing the probability is the same as minimizing a negative log likelihood, since logarithms are increasing and negating an increasing function results in a decreasing function, that is

\operatorname*{arg\,max}\limits_{v_{1}\in\mathbb{R}^{3}}\prod\limits_{i=1}^{n}p(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i})=\operatorname*{arg\,min}\limits_{v_{1}\in\mathbb{R}^{3}}-\log\left(\prod\limits_{i=1}^{n}p_{cam}(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i})\right) (72)
=\operatorname*{arg\,min}\limits_{v_{1}\in\mathbb{R}^{3}}\sum\limits_{i=1}^{n}-\log(p_{cam}(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i})) (73)

It is also well known that convexity is preserved under linear transformations of the argument.

This means that if $-\log(p_{cam}(v_{i}|\theta_{i}))$ is convex with respect to $v_{i}$, then $-\log(p(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i}))$ is convex with respect to $v_{1}$, since the change in coordinate system is a linear transform.

Convexity is also preserved under addition. Therefore if $-\log(p(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i}))$ is convex with respect to $v_{1}$ for all $i$, then $\sum\limits_{i=1}^{n}-\log(p(R_{i}^{-1}(v_{1}-t_{i})|\theta_{i}))$ is convex with respect to $v_{1}$.

Finding the minimum of a convex function is a convex optimization problem. This concludes the proof.

proposition 3: The constraints make finding the most likely point while enforcing a ground plane assumption, that is solving

\operatorname*{arg\,max}_{v\text{ s.t. }d^{T}v=c}p(v|\theta) (74)

where $d$ and $c$ define the ground plane, a convex optimization problem. If any point of the ground plane is in the field of view of the camera, then the constraints also guarantee that a solution exists.

proof

\operatorname*{arg\,max}_{v\text{ s.t. }d^{T}v=c}p(v|\theta)=\operatorname*{arg\,min}_{v\text{ s.t. }d^{T}v=c}-\log(p(v|\theta)) (75)

Optimizing a convex function with an affine constraint is a convex optimization problem.

If a point $v_{1}$ of the ground plane is in the field of view of the camera, then constraint 4 ensures $p(v_{1}|\theta)>0$.

Constraint 6 ensures

\exists R\text{ s.t. }\max\limits_{v\text{ s.t. }\|v\|>R}p(v|\theta)<p(v_{1}|\theta) (76)

Therefore

\operatorname*{arg\,max}\limits_{v}p(v|\theta)=\operatorname*{arg\,max}\limits_{v\text{ s.t. }\|v\|\leq R}p(v|\theta) (77)

which exists due to the extreme value theorem and constraint 7.

Appendix D Intermediate integrals

D.1 Double exponential integral

For convenience for future proofs we derive the integral of a double exponential

\int\limits_{0}^{\infty}\exp(ax-c\exp(x))dx=\Gamma(a,c)c^{-a} (78)

where $\Gamma$ is the upper incomplete gamma function.

Proof

\int\limits_{0}^{\infty}\exp(ax-c\exp(x))dx=\int\limits_{0}^{\infty}\exp(-a\log(c))\exp(a(x+\log(c)))\exp(-\exp(x+\log(c)))dx (79)
=c^{-a}\int\limits_{\log(c)}^{\infty}\exp(ay)\exp(-\exp(y))dy (80)
=c^{-a}\int\limits_{c}^{\infty}z^{a-1}\exp(-z)dz (81)
=c^{-a}\Gamma(a,c) (82)

The first step is rearranging.

The second step comes from the variable substitution

y=x+\log(c) (83)

The third step comes from the variable substitution

z=\exp(y) (84)

D.2 Moments of the depth distribution

Here we show

\int\limits_{0}^{\infty}z^{k}\exp(-a\max(\mu/z,z/\mu))dz=\mu^{k+1}(\Gamma(-k-1,a)a^{k+1}+\Gamma(k+1,a)a^{-k-1}) (85)

proof

\int\limits_{0}^{\infty}z^{k}\exp(-a\max(\mu/z,z/\mu))dz=\int\limits_{0}^{\infty}z^{k+1}\dfrac{1}{z}\exp(-a\exp(|\log(z/\mu)|))dz (86)
=\mu^{k+1}\int\limits_{0}^{\infty}\left(\dfrac{z}{\mu}\right)^{k+1}\dfrac{1}{z}\exp(-a\exp(|\log(z/\mu)|))dz (87)
=\mu^{k+1}\int\limits_{-\infty}^{\infty}\exp((k+1)s)\exp(-a\exp(|s|))ds (88)
=\mu^{k+1}\Big(\int\limits_{0}^{\infty}\exp(-(k+1)s)\exp(-a\exp(s))ds (89)
+\int\limits_{0}^{\infty}\exp((k+1)s)\exp(-a\exp(s))ds\Big) (90)
=\mu^{k+1}(\Gamma(-k-1,a)a^{k+1}+\Gamma(k+1,a)a^{-k-1}) (91)

The first step is rearranging. The second step is the variable substitution $s=\log(z/\mu)$. The third step is separating the range of the integral into positive and negative numbers. The final step is using equation 78.

D.3 Expected value of a positive huber distribution in 1 dimension

For use in future parts we derive the integral of

\int\limits_{0}^{\infty}r\exp(-h(r))dr=1+\exp(-1/2) (92)

Proof

\int\limits_{0}^{1}r\exp(-r^{2}/2)dr=\int\limits_{0}^{1/2}\exp(-y)dy (93)
=1-\exp(-1/2) (94)

The first step is the variable substitution $y=r^{2}/2$, which has a scale factor of $dy/dr=r$.

\int\limits_{1}^{\infty}r\exp(-r+1/2)dr=\left[-r\exp(-r+1/2)\right]_{1}^{\infty}+\int\limits_{1}^{\infty}\exp(-r+1/2)dr (95)
=2\exp(-1/2) (96)

The first step is integration by parts.

This gives

\int\limits_{0}^{\infty}r\exp(-h(r))dr=\int\limits_{0}^{1}r\exp(-r^{2}/2)dr+\int\limits_{1}^{\infty}r\exp(-r+1/2)dr (97)
=1+\exp(-1/2) (98)

D.4 Expected cube of positive Huber distribution in 1 dimension

For use in later parts we derive the integral

\int\limits_{0}^{\infty}r^{3}\exp(-h(r))\,dr=2+13\exp(-1/2) (99)

Proof

\int\limits_{0}^{\infty}r^{3}\exp(-h(r))\,dr =\int\limits_{0}^{1}r^{3}\exp(-r^{2}/2)\,dr+\int\limits_{1}^{\infty}r^{3}\exp(-r+1/2)\,dr (100)
\int\limits_{0}^{1}r^{3}\exp(-r^{2}/2)\,dr =\int\limits_{0}^{1}r^{2}\,r\exp(-r^{2}/2)\,dr (101)
=\int\limits_{0}^{1/2}2y\exp(-y)\,dy (102)
=\left[-2y\exp(-y)\right]_{0}^{1/2}+\int\limits_{0}^{1/2}2\exp(-y)\,dy (103)
=-\exp(-1/2)+2-2\exp(-1/2) (104)
=2-3\exp(-1/2) (105)
\int\limits_{1}^{\infty}r^{3}\exp(-r+1/2)\,dr =\exp(1/2)\left(\left[-r^{3}\exp(-r)\right]_{1}^{\infty}+\int\limits_{1}^{\infty}3r^{2}\exp(-r)\,dr\right) (106)
=\exp(1/2)\left(\exp(-1)+\left[-3r^{2}\exp(-r)\right]_{1}^{\infty}+\int\limits_{1}^{\infty}6r\exp(-r)\,dr\right) (107)
=\exp(1/2)\left(4\exp(-1)+\left[-6r\exp(-r)\right]_{1}^{\infty}+\int\limits_{1}^{\infty}6\exp(-r)\,dr\right) (108)
=\exp(1/2)\left(4\exp(-1)+6\exp(-1)+6\exp(-1)\right)=16\exp(-1/2) (109)

Each integration by parts contributes a boundary term at $r=1$, which must be kept.

Therefore

\int\limits_{0}^{\infty}r^{3}\exp(-h(r))\,dr=\left(2-3\exp(-1/2)\right)+16\exp(-1/2)=2+13\exp(-1/2) (110)
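
Both Huber moments can be verified numerically. A sketch using scipy (our choice of library; the split at $r=1$ matches the kink of $h$):

```python
import numpy as np
from scipy.integrate import quad

def h(r):
    # Huber function with threshold 1, as used throughout this appendix.
    return 0.5 * r**2 if r <= 1 else r - 0.5

def moment(k):
    # Split the integral at the kink r = 1 for accurate quadrature.
    lo, _ = quad(lambda r: r**k * np.exp(-h(r)), 0, 1)
    hi, _ = quad(lambda r: r**k * np.exp(-h(r)), 1, np.inf)
    return lo + hi

assert abs(moment(1) - (1 + np.exp(-0.5))) < 1e-9       # equation 92
assert abs(moment(3) - (2 + 13 * np.exp(-0.5))) < 1e-9  # equation 110
```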

D.5 Norm factor of 2d Huber distribution

K_{huber}=2\pi(1+\exp(-1/2)) (111)

Proof

K_{huber}=\int\limits_{\mathbb{R}^{2}}\exp(-h(\sqrt{x^{2}+y^{2}}))\,dx\,dy=2\pi\int\limits_{0}^{\infty}r\exp(-h(r))\,dr=2\pi(1+\exp(-1/2)) (112)

D.6 Covariance of 2d Huber distribution

E[vv^{T}]=I\,\dfrac{2+13\exp(-1/2)}{2+2\exp(-1/2)} (113)

Proof

E[\operatorname{tr}(vv^{T})]=\dfrac{1}{K_{huber}}\int_{\mathbb{R}^{2}}v^{T}v\exp(-h(\|v\|_{2}))\,dv (114)
=\dfrac{1}{K_{huber}}\int_{0}^{2\pi}\int_{0}^{\infty}r^{3}\exp(-h(r))\,dr\,d\theta=\dfrac{2+13\exp(-1/2)}{1+\exp(-1/2)} (115)

From rotational symmetry we know the second moment is a scaled identity matrix, whose trace equals the expression above.

Therefore

E[vv^{T}]=I\,\dfrac{2+13\exp(-1/2)}{2+2\exp(-1/2)} (116)

Appendix E Projected Huber Distribution components

E.1 Depth distribution

p_{depth}(z;a,\mu_{z})=\begin{cases}\dfrac{1}{K_{depth}(a,\mu_{z})}\exp(-a\max(z/\mu_{z},\mu_{z}/z))&\text{if }z>0\\ 0&\text{otherwise}\end{cases} (117)

with

K_{depth}(a,\mu_{z})=\mu_{z}(\exp(-a)/a+\Gamma(-1,a)a) (118)

Proof. The normalizing factor can be computed as

K_{depth}(a,\mu_{z}) =\int\limits_{0}^{\infty}\exp(-a\max(z/\mu_{z},\mu_{z}/z))\,dz (119)
=\mu_{z}(\Gamma(1,a)/a+\Gamma(-1,a)a) (120)
=\mu_{z}(\exp(-a)/a+\Gamma(-1,a)a) (121)

The first step is the definition. The second step uses equation 85 with $k=0$. The third step uses

\Gamma(1,a)=\int\limits_{a}^{\infty}t^{1-1}\exp(-t)\,dt=\exp(-a) (122)

E.1.1 Expected value of depth

E\left[z\right]=\mu_{z}\dfrac{\Gamma(2,a)a^{-2}+\Gamma(-2,a)a^{2}}{\exp(-a)a^{-1}+\Gamma(-1,a)a} (123)

This function is decreasing with respect to $a$, with limit $\mu_{z}$ as $a\rightarrow\infty$. When $a>1$, $\mu_{z}<E[z]<1.7\mu_{z}$.

Proof

The expression follows directly from $K_{depth}$ and equation 85.

The limit can be proven by multiplying the numerator and denominator by $a\exp(a)$. This gives

\lim\limits_{a\rightarrow\infty}E[z;a,\mu_{z}]=\lim\limits_{a\rightarrow\infty}\mu_{z}\dfrac{\Gamma(2,a)\exp(a)/a+\Gamma(-2,a)\exp(a)a^{3}}{1+\Gamma(-1,a)\exp(a)a^{2}}=\mu_{z}\dfrac{1+1}{1+1}=\mu_{z} (124)

since

\lim\limits_{x\rightarrow\infty}\Gamma(k,x)x^{1-k}\exp(x)=1 (125)

E.1.2 Variance of depth

The variance of the depth is

Var\left[z\right]=\mu_{z}^{2}\dfrac{(\Gamma(1,a)/a+\Gamma(-1,a)a)(\Gamma(3,a)a^{-3}+\Gamma(-3,a)a^{3})-(\Gamma(2,a)a^{-2}+\Gamma(-2,a)a^{2})^{2}}{(\Gamma(1,a)/a+\Gamma(-1,a)a)^{2}} (126)

This expression is not very intuitive, but it behaves like

Var\left[z\right]\approx\mu_{z}^{2}/a^{2} (127)

when $a>1$.

Proof

E\left[z^{2}\right]=\mu_{z}^{2}\dfrac{\Gamma(3,a)a^{-3}+\Gamma(-3,a)a^{3}}{\Gamma(1,a)/a+\Gamma(-1,a)a} (128)

follows directly from $K_{depth}$ and equation 85, and

Var\left[z\right]=E\left[z^{2}\right]-E\left[z\right]^{2} (129)

E.2 Projected distribution

The normalizing factor of the projected part of the distribution is

K_{proj}(A,\mu)=\dfrac{\mu_{z}^{2}}{|A|}2\pi(1+\exp(-1/2)) (130)

for a distribution over a given plane where $z$ is constant.

Proof

\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp\left(-h\left(\left\|A\begin{bmatrix}x/z-\mu_{x}\\ y/z-\mu_{y}\end{bmatrix}\right\|_{2}\dfrac{z}{\mu_{z}}\right)\right)dx\,dy =\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp\left(-h\left(\left\|A\begin{bmatrix}x/\mu_{z}-\mu_{x}z/\mu_{z}\\ y/\mu_{z}-\mu_{y}z/\mu_{z}\end{bmatrix}\right\|_{2}\right)\right)dx\,dy (131)
=\dfrac{\mu_{z}^{2}}{|A|}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp(-h(\sqrt{x_{p}^{2}+y_{p}^{2}}))\,dx_{p}\,dy_{p} (132)
=2\pi(1+\exp(-1/2))\dfrac{\mu_{z}^{2}}{|A|} (133)

The second step is the variable change

\begin{bmatrix}x_{p}\\ y_{p}\end{bmatrix}=A\begin{bmatrix}x/\mu_{z}-\mu_{x}z/\mu_{z}\\ y/\mu_{z}-\mu_{y}z/\mu_{z}\end{bmatrix} (134)

The last step comes from equation 92 after a change to polar coordinates.

E.3 Expected value of projected components

The expected value

E_{x,y}\begin{bmatrix}x/z\\ y/z\end{bmatrix}=\begin{bmatrix}\mu_{x}\\ \mu_{y}\end{bmatrix} (135)

for a fixed plane where $z$ is constant.

Proof

E_{x,y}\begin{bmatrix}x/z\\ y/z\end{bmatrix} =\dfrac{1}{K_{proj}(A,\mu)}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\begin{bmatrix}x/z\\ y/z\end{bmatrix}\exp\left(-h\left(\left\|A\begin{bmatrix}x/z-\mu_{x}\\ y/z-\mu_{y}\end{bmatrix}\right\|_{2}\dfrac{z}{\mu_{z}}\right)\right)dx\,dy (136)
=\dfrac{|A|}{K_{proj}(A,\mu)\mu_{z}^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left(\dfrac{\mu_{z}}{z}A^{-1}v_{p}+\mu_{p}\right)\exp\left(-h\left(\|v_{p}\|_{2}\right)\right)dx_{p}\,dy_{p} (137)
=\mu_{p} (138)

where the basis change is the same as when deriving the normalizing factor. The last step uses that $E[v_{p}]=\bar{0}$ due to symmetry.

E.4 Variance of projected components

The variance for a constant $z$ value is

Var_{x,y}\begin{bmatrix}x/z\\ y/z\end{bmatrix}=\dfrac{\mu_{z}^{2}}{z^{2}}A^{-2}\dfrac{2+13\exp(-1/2)}{2+2\exp(-1/2)} (139)

Proof

E_{x,y}\left[\begin{bmatrix}x/z\\ y/z\end{bmatrix}\begin{bmatrix}x/z\\ y/z\end{bmatrix}^{T}\right] =E_{x,y}\left[\left(\dfrac{\mu_{z}}{z}A^{-1}v_{p}+\mu_{p}\right)\left(\dfrac{\mu_{z}}{z}A^{-1}v_{p}+\mu_{p}\right)^{T}\right] (140)
=\dfrac{\mu_{z}^{2}}{z^{2}}A^{-1}E\left[v_{p}v_{p}^{T}\right]A^{-1}+\dfrac{\mu_{z}}{z}A^{-1}E\left[v_{p}\right]\mu_{p}^{T}+\dfrac{\mu_{z}}{z}\mu_{p}E\left[v_{p}^{T}\right]A^{-1}+\mu_{p}\mu_{p}^{T} (141)
=\dfrac{\mu_{z}^{2}}{z^{2}}A^{-1}A^{-1}\dfrac{2+13\exp(-1/2)}{2+2\exp(-1/2)}+\mu_{p}\mu_{p}^{T} (142)

Since

Var_{x,y}\begin{bmatrix}x/z\\ y/z\end{bmatrix}=E_{x,y}\left[\begin{bmatrix}x/z\\ y/z\end{bmatrix}\begin{bmatrix}x/z\\ y/z\end{bmatrix}^{T}\right]-E_{x,y}\begin{bmatrix}x/z\\ y/z\end{bmatrix}E_{x,y}\begin{bmatrix}x/z\\ y/z\end{bmatrix}^{T} (143)

We get

Var_{x,y}\begin{bmatrix}x/z\\ y/z\end{bmatrix}=\dfrac{\mu_{z}^{2}}{z^{2}}A^{-1}A^{-1}\dfrac{2+13\exp(-1/2)}{2+2\exp(-1/2)} (144)

Appendix F Projected Huber Distribution

F.1 Normalizing factor

In this section we show

K(\mu_{z},A,a)=K_{depth}(\mu_{z},a)K_{proj}(\mu_{z},A) (145)

Proof

K(\mu_{z},A,a) =\int\limits_{0}^{\infty}\int\limits_{-\infty}^{\infty}\int\limits_{-\infty}^{\infty}\tilde{p}_{proj}(x/z,y/z;\mu_{z},A)\,dx\,dy\;\tilde{p}_{depth}(z;\mu_{z},a)\,dz (146)
=\int\limits_{0}^{\infty}K_{proj}(\mu_{z},A)\,\tilde{p}_{depth}(z;\mu_{z},a)\,dz (147)
=K_{proj}(\mu_{z},A)K_{depth}(\mu_{z},a) (148)

where $\tilde{p}_{proj}$ and $\tilde{p}_{depth}$ denote the unnormalized kernels of the two factors. The first step uses that the integral of $\tilde{p}_{proj}$ over a plane of constant $z$ equals $K_{proj}(\mu_{z},A)$, which does not depend on $z$ (section E.2).

F.2 Expected projected coordinates

E\begin{bmatrix}x/z\\ y/z\end{bmatrix}=\mu_{p} (149)

Proof

E_{z}\left[E_{x,y}\begin{bmatrix}x/z\\ y/z\end{bmatrix}\right]=E_{z}\left[\mu_{p}\right]=\mu_{p} (150)

F.3 Variance of projected coordinates

The variance of the projected coordinates is

Var\begin{bmatrix}x/z\\ y/z\end{bmatrix}=A^{-2}\dfrac{2+13\exp(-1/2)}{2+2\exp(-1/2)} (151)

Proof

Taking the expectation of the conditional second moment over $z$ gives

E_{x,y,z}\left[\begin{bmatrix}x/z\\ y/z\end{bmatrix}\begin{bmatrix}x/z\\ y/z\end{bmatrix}^{T}\right] =E_{z}\left[E_{x,y}\left[\begin{bmatrix}x/z\\ y/z\end{bmatrix}\begin{bmatrix}x/z\\ y/z\end{bmatrix}^{T}\right]\right] (152)
=\mu_{p}\mu_{p}^{T}+\dfrac{2+13\exp(-1/2)}{2+2\exp(-1/2)}A^{-2}E_{z}\left[\mu_{z}^{2}/z^{2}\right] (153)
E\left[\mu_{z}^{2}/z^{2}\right] =\dfrac{\mu_{z}^{2}}{K_{depth}(\mu_{z},a)}\int_{0}^{\infty}\dfrac{1}{z^{2}}\exp(-a\exp(|\log(z/\mu_{z})|))\,dz (154)
=\dfrac{\mu_{z}^{2}}{\mu_{z}(\Gamma(1,a)/a+\Gamma(-1,a)a)}\mu_{z}^{-1}(\Gamma(1,a)/a+\Gamma(-1,a)a) (155)
=1 (156)

Therefore

Var\begin{bmatrix}x/z\\ y/z\end{bmatrix}=\dfrac{2+13\exp(-1/2)}{2+2\exp(-1/2)}A^{-2}+\mu_{p}\mu_{p}^{T}-\mu_{p}\mu_{p}^{T} (157)

F.4 Expected depth

The expected depth is

E\left[z\right]=\mu_{z}\dfrac{\Gamma(2,a)a^{-2}+\Gamma(-2,a)a^{2}}{\Gamma(1,a)a^{-1}+\Gamma(-1,a)a} (158)

Proof

E\left[z\right] =\dfrac{1}{K_{proj}(A,\mu)K_{depth}(a,\mu_{z})}\int\limits_{0}^{\infty}z\,\tilde{p}_{depth}(z;a,\mu_{z})\int\limits_{-\infty}^{\infty}\int\limits_{-\infty}^{\infty}\tilde{p}_{proj}(x/z,y/z;A,\mu)\,dx\,dy\,dz (159)
=\dfrac{1}{K_{proj}(A,\mu)K_{depth}(a,\mu_{z})}\int\limits_{0}^{\infty}z\,\tilde{p}_{depth}(z;a,\mu_{z})K_{proj}(A,\mu)\,dz=\mu_{z}\dfrac{\Gamma(2,a)a^{-2}+\Gamma(-2,a)a^{2}}{\Gamma(1,a)a^{-1}+\Gamma(-1,a)a} (160)

F.5 Depth variance

Var\left[z\right]=\mu_{z}^{2}\dfrac{(\Gamma(3,a)a^{-3}+\Gamma(-3,a)a^{3})(\Gamma(1,a)/a+\Gamma(-1,a)a)-(\Gamma(2,a)a^{-2}+\Gamma(-2,a)a^{2})^{2}}{(\Gamma(1,a)/a+\Gamma(-1,a)a)^{2}} (161)

Proof

E\left[z^{2}\right]=E_{z}\left[E_{x,y}\left[z^{2}\right]\right]=E_{z}\left[z^{2}\right]=\mu_{z}^{2}\dfrac{\Gamma(3,a)a^{-3}+\Gamma(-3,a)a^{3}}{\Gamma(1,a)/a+\Gamma(-1,a)a} (162)

and

Var[z]=E[z^{2}]-E[z]^{2} (163)

Appendix G Proof constraints

G.1 Proof constraint 5

We want to show that the negative log likelihood is convex with respect to $(x,y,z)$ for fixed $\mu,A,a$.

Since the distribution decomposes into

p_{combined}(x,y,z;\mu,A,a)=p_{proj}(x,y,z;\mu,A)\,p_{depth}(z;\mu_{z},a) (164)

we have

-\log(p_{combined}(x,y,z;\mu,A,a))=-\log(p_{proj}(x,y,z;\mu,A))-\log(p_{depth}(z;\mu_{z},a)) (165)

Since convexity is preserved under addition, it is sufficient to show that the negative log likelihoods of $p_{proj}$ and $p_{depth}$ are convex separately (the additive constants from the normalizing factors do not affect convexity).

-\log(p_{proj}(x,y,z;\mu,A))=h\left(\left\|A\begin{bmatrix}x/z-\mu_{x}\\ y/z-\mu_{y}\end{bmatrix}\right\|_{2}\dfrac{z}{\mu_{z}}\right) (166)

Doing the affine parameter change

q =A\begin{bmatrix}x-z\mu_{x}\\ y-z\mu_{y}\end{bmatrix}/\mu_{z} (167)
u =z/\mu_{z} (168)

gives

-\log(p_{proj}(x,y,z;\mu,A))=h(\|q\|_{2}) (169)

which is convex. Since convexity is preserved under composition with affine maps, this concludes the proof for $p_{proj}$.

The proof for $p_{depth}$ is even simpler:

-\log(p_{depth}(z;\mu_{z},a))=\max(az/\mu_{z},a\mu_{z}/z) (170)

The maximum of convex functions is convex, so it is sufficient to show that each expression in the max is convex. $az/\mu_{z}$ is linear, and linear functions are convex. For the second expression, on $z>0$,

\dfrac{\partial^{2}}{\partial z^{2}}\,a\mu_{z}/z=2a\mu_{z}/z^{3}>0 (171)

Therefore the second expression is convex as well. This concludes the proof.
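
The convexity of the depth term can also be spot-checked numerically with a randomized midpoint test; a minimal sketch (our addition, with arbitrary fixed parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
a, mu_z = 2.0, 1.3  # arbitrary fixed parameters

def nll_depth(z):
    # Negative log likelihood of the depth term, up to an additive constant.
    return np.maximum(a * z / mu_z, a * mu_z / z)

# Convexity: f((z1 + z2) / 2) <= (f(z1) + f(z2)) / 2 for all z1, z2 > 0.
z1, z2 = rng.uniform(0.01, 10.0, size=(2, 100_000))
lhs = nll_depth((z1 + z2) / 2)
rhs = (nll_depth(z1) + nll_depth(z2)) / 2
assert np.all(lhs <= rhs + 1e-12)
```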

G.2 Proof constraint 6

First consider the region $z\leq 0$. The statement is trivially true here since the density is identically 0.

Secondly, consider the region $0<z\leq\mu_{z}$. In this region $\|(x,y,z)\|\rightarrow\infty$ forces $x^{2}+y^{2}\rightarrow\infty$.

p(x,y,z) =\dfrac{1}{K(A,\mu,a)}\exp\left(-h\left(\left\|A\begin{bmatrix}x/z-\mu_{x}\\ y/z-\mu_{y}\end{bmatrix}\right\|_{2}\dfrac{z}{\mu_{z}}\right)\right)\exp(-a\mu_{z}/z) (172)
\leq\exp\left(-h\left(\left\|\dfrac{A}{\mu_{z}}\begin{bmatrix}x\\ y\end{bmatrix}\right\|_{2}-\left\|A\begin{bmatrix}\mu_{x}\\ \mu_{y}\end{bmatrix}\right\|_{2}\right)\right)\dfrac{\exp(-a)}{K(A,\mu,a)} (173)

The last step applies the triangle inequality to the $\|\cdot\|_{2}$ term and maximizes the depth and projected factors independently.

For the last expression the first factor converges to 0 since $\mu_{z}$ is positive, $A$ is positive definite and $\|\mu_{p}\|_{2}$ is finite. The second factor is finite. Therefore the condition holds for limits in this region.

In the region $\mu_{z}\leq z<(x^{2}+y^{2})^{1/4}$, $\sqrt{x^{2}+y^{2}}$ must also go to infinity.

In this region the following inequalities hold.

\max\limits_{\mu_{z}\leq z<(x^{2}+y^{2})^{1/4}}\exp(-az/\mu_{z})\leq e^{-a} (174)

since the depth kernel is decreasing in $z$; the bound relaxes the constraint $\mu_{z}\leq z<(x^{2}+y^{2})^{1/4}$ to $\mu_{z}\leq z$.

The following inequality holds in this region as well.

\min\limits_{\mu_{z}\leq z<(x^{2}+y^{2})^{1/4}}\left\|\dfrac{A}{\mu_{z}}\begin{bmatrix}x\\ y\end{bmatrix}-\begin{bmatrix}\mu_{x}\\ \mu_{y}\end{bmatrix}\dfrac{z}{\mu_{z}}\right\|_{2} \geq\min\limits_{\mu_{z}\leq z<(x^{2}+y^{2})^{1/4}}\left\|\dfrac{A}{\mu_{z}}\begin{bmatrix}x\\ y\end{bmatrix}\right\|_{2}-\left\|\begin{bmatrix}\mu_{x}\\ \mu_{y}\end{bmatrix}\dfrac{z}{\mu_{z}}\right\|_{2} (175)
\geq(x^{2}+y^{2})^{1/4}\left(\dfrac{\lambda_{min}(x^{2}+y^{2})^{1/4}-\|\mu_{p}\|}{\mu_{z}}\right) (176)

where $\lambda_{min}$ is the smallest eigenvalue of $A$. This bound goes to infinity as $x^{2}+y^{2}$ goes to infinity.

Therefore $p(x,y,z)$ goes to 0 in this region, since

\exp\left(-h\left(\left\|A\begin{bmatrix}x/z-\mu_{x}\\ y/z-\mu_{y}\end{bmatrix}\right\|_{2}\dfrac{z}{\mu_{z}}\right)\right) (177)

goes to 0 while

\dfrac{1}{K(A,\mu,a)}\exp(-az/\mu_{z}) (178)

is bounded by a constant.

For the region $(x^{2}+y^{2})^{1/4}<z$, the norm can only go to infinity if $z$ goes to infinity. The $\exp(-h(\cdot))$ term is at most 1 and

\lim\limits_{z\rightarrow\infty}\exp(-az/\mu_{z})=0 (179)

The union of these regions is $\mathbb{R}^{3}$, and the limit is 0 in each region; therefore the limit is 0 on the union, which concludes the proof.

G.3 Proof constraint 7

The density is continuous for all $z>0$ since it is a combination of continuous functions in this region, and continuous for $z<0$ since it is 0 there. It remains to check $z=0$. For the region $\mu_{z}>z>0$ we know

0<p_{combined}(x,y,z) =p_{proj}(x,y|z;\mu,A)\,p_{depth}(z;\mu_{z},a)\leq\dfrac{1}{K(\mu,A,a)}\exp(-a\mu_{z}/z) (180)

The normalizing factor is finite and the exponential term goes to 0 as $z\rightarrow 0$, so the density is continuous at $z=0$ as well.

Appendix H Ground truth bounded after basis change

z_{p} =\dfrac{z}{f\mu_{z0}} (181)
=\dfrac{z/f}{\sqrt{(\max\limits_{z,f\in Dataset}(z/f))(\min\limits_{z,f\in Dataset}(z/f))}} (182)
\leq\sqrt{\dfrac{\max\limits_{z,f\in Dataset}(z/f)}{\min\limits_{z,f\in Dataset}(z/f)}}=D (183)

and

1/z_{p} =\dfrac{f\mu_{z0}}{z} (184)
\leq\dfrac{\sqrt{(\max\limits_{z,f\in Dataset}(z/f))(\min\limits_{z,f\in Dataset}(z/f))}}{\min\limits_{z,f\in Dataset}(z/f)} (185)
=\sqrt{\dfrac{\max\limits_{z,f\in Dataset}(z/f)}{\min\limits_{z,f\in Dataset}(z/f)}}=D (186)

$x_{p}$ and $y_{p}$ are bounded by 1 because $(fx/z+S/2,\,fy/z+S/2)\in\left[0,S\right]\times\left[0,S\right]$ for objects in the field of view of the camera. Therefore

(fx/z,\,fy/z)\in\left[-S/2,S/2\right]\times\left[-S/2,S/2\right]\Rightarrow (187)
(2fx/Sz,\,2fy/Sz)\in\left[-1,1\right]\times\left[-1,1\right] (188)

which concludes the proof.
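
The basis change is straightforward to implement. The following sketch (our addition, with hypothetical depth and focal-length values) computes $\mu_{z0}$ and $D$ from a dataset and checks the bounds above:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(1.0, 20.0, size=1000)      # hypothetical metric depths
f = rng.uniform(400.0, 1200.0, size=1000)  # hypothetical focal lengths

ratio = z / f
mu_z0 = np.sqrt(ratio.max() * ratio.min())  # geometric mean of the extremes
D = np.sqrt(ratio.max() / ratio.min())

z_p = ratio / mu_z0
# Both z_p and 1/z_p are bounded by D, as shown above.
assert np.all(z_p <= D + 1e-12)
assert np.all(1.0 / z_p <= D + 1e-12)
```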

Appendix I Proof of bounded gradients for loss

This section contains a proof that the gradients are bounded with respect to the network output when the mapping in section 4.4 is used.

The proof for $L_{proj}$ is already given in Mohlin et al. (2021).

The mapping from $w_{1}$ to $a$ is a contraction. Therefore it is sufficient to show that the loss has bounded gradients with respect to $a$ and $w_{2}$ in order to prove bounded gradients with respect to $w_{1},w_{2}$.

For the region $w_{2}>0$ the regression part of the loss is

\max\left((a+w_{2})/z_{p},\;z_{p}\dfrac{a^{2}}{a+w_{2}}\right) (189)

The gradient of the first term is

\|\nabla_{a,w_{2}}(a+w_{2})/z_{p}\|_{2}=\|(1/z_{p},1/z_{p})\|_{2}\leq\sqrt{2}D (190)

The gradient of the second expression is

\left\|\nabla_{a,w_{2}}\,z_{p}\dfrac{a^{2}}{a+w_{2}}\right\|_{2}=z_{p}\left\|\begin{bmatrix}\dfrac{a}{a+w_{2}}\dfrac{a+2w_{2}}{a+w_{2}}\\ -\dfrac{a^{2}}{(a+w_{2})^{2}}\end{bmatrix}\right\|_{2}\leq D\sqrt{5} (191)

If $w_{2}\leq 0$ the loss is

\max\left(z_{p}(a-w_{2}),\;\dfrac{1}{z_{p}}\dfrac{a^{2}}{a-w_{2}}\right) (192)

The first expression has gradient norm

\|\nabla_{a,w_{2}}z_{p}(a-w_{2})\|_{2}=z_{p}\|(1,-1)\|_{2}\leq\sqrt{2}D (193)

The second expression has gradient norm

\left\|\nabla_{a,w_{2}}\dfrac{1}{z_{p}}\dfrac{a^{2}}{a-w_{2}}\right\|_{2}=\dfrac{1}{z_{p}}\left\|\begin{bmatrix}\dfrac{a}{a-w_{2}}\dfrac{a-2w_{2}}{a-w_{2}}\\ \dfrac{a^{2}}{(a-w_{2})^{2}}\end{bmatrix}\right\|_{2}\leq D\sqrt{5} (194)

which concludes the proof for the regression part of the loss.
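
A finite-difference spot check of these bounds (our sketch; the branch structure follows equations 189 and 192, and the value of $D$ here is an arbitrary test choice):

```python
import numpy as np

rng = np.random.default_rng(0)
D, eps = 2.0, 1e-6

def depth_loss(a, w2, z_p):
    # Regression part of the depth loss, equations 189 and 192.
    if w2 > 0:
        return max((a + w2) / z_p, z_p * a**2 / (a + w2))
    return max(z_p * (a - w2), a**2 / (z_p * (a - w2)))

for _ in range(10_000):
    a = rng.uniform(1.0, 5.0)
    w2 = rng.uniform(-3.0, 3.0)
    z_p = rng.uniform(1.0 / D, D)  # z_p and 1/z_p are bounded by D
    grad = np.array([
        depth_loss(a + eps, w2, z_p) - depth_loss(a - eps, w2, z_p),
        depth_loss(a, w2 + eps, z_p) - depth_loss(a, w2 - eps, z_p),
    ]) / (2 * eps)
    # The gradient norm should never exceed sqrt(5) * D.
    assert np.linalg.norm(grad) <= np.sqrt(5) * D + 1e-3
```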

The normalizing factor component of the depth loss is

\log(\exp(-a)/a+\Gamma(-1,a)a) (195)

The gradient of this is

\dfrac{\partial\log(\Gamma(1,a)/a+\Gamma(-1,a)a)}{\partial a}=\dfrac{1}{a}-2\dfrac{1+1/a}{1+a^{2}\Gamma(-1,a)\exp(a)} (196)

whose magnitude is less than 4 when $a>1$.

If $a<1$ the mapping between $w_{1}$ and $a$ is $a(w_{1})=\exp(w_{1})$. Therefore

\dfrac{\partial\log(\Gamma(1,a(w_{1}))/a(w_{1})+\Gamma(-1,a(w_{1}))a(w_{1}))}{\partial w_{1}} =\left(\dfrac{1}{a}-2\dfrac{1+1/a}{1+a^{2}\Gamma(-1,a)\exp(a)}\right)\dfrac{\partial a}{\partial w_{1}} (197)
=1-2\dfrac{a(w_{1})+1}{1+a(w_{1})^{2}\Gamma(-1,a(w_{1}))\exp(a(w_{1}))} (198)

whose magnitude is less than 4. Therefore the normalizing-factor part of the negative log likelihood has bounded gradients with respect to $w_{1}$.

Appendix J Computing normalizing factor

The normalizing factor can be rewritten as

\log(\Gamma(1,a)/a+\Gamma(-1,a)a)=-\log(a)-a+\log(1+\Gamma(-1,a)a^{2}\exp(a)) (199)

with gradient

\dfrac{\partial\log(\Gamma(1,a)/a+\Gamma(-1,a)a)}{\partial a}=\dfrac{1}{a}-2\dfrac{1+1/a}{1+a^{2}\Gamma(-1,a)\exp(a)} (200)

Therefore if $a^{2}\Gamma(-1,a)\exp(a)$ can be computed accurately, both the function and its gradient can be computed. We show how to do this in section J.1.

Proof of function evaluation

\log(\Gamma(1,a)/a+\Gamma(-1,a)a) =\log(\exp(-a)/a+\Gamma(-1,a)a) (201)
=\log\left(\exp(-a)/a\,(1+\Gamma(-1,a)a^{2}\exp(a))\right) (202)
=\log(1+\Gamma(-1,a)a^{2}\exp(a))-a-\log(a) (203)

Proof of gradient

\dfrac{\partial\left(\log(1+\Gamma(-1,a)a^{2}\exp(a))-a-\log(a)\right)}{\partial a} =-1-\dfrac{1}{a}+\dfrac{(2a+a^{2})\Gamma(-1,a)\exp(a)+a^{2}\exp(a)\dfrac{\partial\Gamma(-1,a)}{\partial a}}{1+\Gamma(-1,a)a^{2}\exp(a)} (204)
=-1-\dfrac{1}{a}+\dfrac{(2/a+1)a^{2}\Gamma(-1,a)\exp(a)-1}{1+\Gamma(-1,a)a^{2}\exp(a)} (205)
=-1-\dfrac{1}{a}+\dfrac{(2/a+1)(a^{2}\Gamma(-1,a)\exp(a)+1)-(2/a+1)-1}{1+\Gamma(-1,a)a^{2}\exp(a)} (206)
=-1-\dfrac{1}{a}+(2/a+1)-\dfrac{2/a+2}{1+\Gamma(-1,a)a^{2}\exp(a)} (207)
=\dfrac{1}{a}-2\dfrac{1/a+1}{1+\Gamma(-1,a)a^{2}\exp(a)} (208)

J.1 Numerical integration of $a^{2}\Gamma(-1,a)\exp(a)$

Here we show that

a^{2}\Gamma(-1,a)\exp(a)=\int\limits_{0}^{1}\dfrac{1}{(1+\log(1/y)/a)^{2}}\,dy (209)

Proof

a^{2}\Gamma(-1,a)\exp(a) =\int\limits_{a}^{\infty}\dfrac{a^{2}}{t^{2}}\exp(a-t)\,dt (210)
=\int\limits_{0}^{1}\dfrac{1}{(\log(1/y)/a+1)^{2}}\,dy (211)

where the variable change $y=\exp(a-t)$ is used. The integrand is bounded by 1 on $(0,1]$, so standard quadrature applies directly.
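
The following sketch (our addition) verifies equation 209 for a few values of $a$, using scipy for the bounded integral and mpmath for a reference value of the incomplete gamma function with negative first argument:

```python
import numpy as np
import mpmath as mp
from scipy.integrate import quad

for a in (0.5, 1.0, 4.0, 16.0):
    # Right-hand side of equation 209: a bounded integrand on (0, 1).
    val, _ = quad(lambda y: 1.0 / (1.0 + np.log(1.0 / y) / a) ** 2, 0, 1)
    # Reference value of a^2 * Gamma(-1, a) * exp(a) via mpmath.
    ref = float(a**2 * mp.gammainc(-1, a, mp.inf) * mp.exp(a))
    assert abs(val - ref) < 1e-6, (a, val, ref)
```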

Appendix K Convex loss with respect to network output

Here we show that the loss is convex with respect to the network outputs.

Proof. In the following sections we prove that the projected and depth parts of the loss are each convex with respect to their parameters; since a sum of convex functions is convex, this is sufficient.

K.1 Projected term

We use the same loss and the same mapping to parameterize $B$ and $\nu_{p}$ as Mohlin et al. (2021). In that work they prove that this loss is convex when $B\succ\theta$. We use $\theta=1$, which concludes the proof.

K.2 Depth Loss

K.2.1 Rewriting argument of logarithm of depth normalizing factor

In this section we show that

1/a+a\Gamma(-1,a)\exp(a)=\int\limits_{0}^{1}\dfrac{2a^{2}+2a\log(1/y)+\log^{2}(1/y)}{a(a+\log(1/y))^{2}}\,dy (212)

Proof

1/a+a\Gamma(-1,a)\exp(a) =\int\limits_{0}^{1}\dfrac{1}{a}+\dfrac{1}{a}\dfrac{1}{(1+\log(1/y)/a)^{2}}\,dy (213)
=\int\limits_{0}^{1}\dfrac{1}{a}+\dfrac{a}{(a+\log(1/y))^{2}}\,dy (214)
=\int\limits_{0}^{1}\dfrac{a^{2}+2a\log(1/y)+\log^{2}(1/y)+a^{2}}{a(a+\log(1/y))^{2}}\,dy (215)

K.2.2 Log-convexity of integrand in equation 212

Here we show that for $a>0$ and $y\in(0,1]$

q_{integrand}(y,a)=\dfrac{2a^{2}+2a\log(1/y)+\log^{2}(1/y)}{a(a+\log(1/y))^{2}} (216)

is log-convex with respect to $a$.

Proof

Since $\log(1/y)$ is constant throughout this proof we denote its value by $k$.

\dfrac{\partial^{2}}{\partial a^{2}}\log\dfrac{2a^{2}+2ak+k^{2}}{a(a+k)^{2}} =\dfrac{\partial^{2}}{\partial a^{2}}\left(\log(2a^{2}+2ak+k^{2})-\log(a)-2\log(a+k)\right) (217)
=\dfrac{\partial}{\partial a}\left(\dfrac{4a+2k}{2a^{2}+2ak+k^{2}}-\dfrac{1}{a}-\dfrac{2}{a+k}\right) (218)
=\dfrac{4(a^{2}+(a+k)^{2})-(4a+2k)^{2}}{(a^{2}+(a+k)^{2})^{2}}+\dfrac{1}{a^{2}}+\dfrac{2}{(a+k)^{2}} (219)

using $2a^{2}+2ak+k^{2}=a^{2}+(a+k)^{2}$. To show log-convexity we need to show that this expression is nonnegative. To make notation slightly easier we denote $t=a+k$; since both $a$ and $k$ are positive, $t$ is positive as well.

\dfrac{\partial^{2}}{\partial a^{2}}\log\dfrac{2a^{2}+2ak+k^{2}}{a(a+k)^{2}} =\dfrac{4(a^{2}+t^{2})-(2a+2t)^{2}}{(a^{2}+t^{2})^{2}}+\dfrac{1}{a^{2}}+\dfrac{2}{t^{2}} (220)
=\dfrac{-8at}{(a^{2}+t^{2})^{2}}+\dfrac{1}{a^{2}}+\dfrac{2}{t^{2}} (221)
=\dfrac{-8a^{3}t^{3}+(t^{2}+2a^{2})(a^{2}+t^{2})^{2}}{a^{2}t^{2}(a^{2}+t^{2})^{2}} (222)
=\dfrac{t^{6}+4t^{4}a^{2}-8a^{3}t^{3}+5t^{2}a^{4}+2a^{6}}{a^{2}t^{2}(a^{2}+t^{2})^{2}} (223)
=\dfrac{t^{6}+4t^{2}a^{2}(a-t)^{2}+t^{2}a^{4}+2a^{6}}{a^{2}t^{2}(a^{2}+t^{2})^{2}}\geq 0 (224)

In the last step every term in both the numerator and the denominator is nonnegative.
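
The polynomial manipulation in the last three steps is easy to get wrong, so a symbolic check is useful; a sketch using sympy (our choice of library):

```python
import sympy as sp

a, t = sp.symbols("a t", positive=True)

# Numerator before and after the decomposition in the final step.
before = (t**2 + 2 * a**2) * (a**2 + t**2) ** 2 - 8 * a**3 * t**3
after = t**6 + 4 * t**2 * a**2 * (a - t) ** 2 + t**2 * a**4 + 2 * a**6

assert sp.expand(before - after) == 0
```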

K.2.3 Log-convex integrals of log-convex integrands

For a function

g(a)=\int f(x,a)\,dx (225)

where $f$ is continuous and log-convex with respect to $a$, the function $g(a)$ is log-convex as well.

Proof. First we show

\log(g(a))/2+\log(g(b))/2\geq\log(g((a+b)/2)) (226)

which is equivalent to

g(a)g(b)\geq(g((a+b)/2))^{2} (227)

since $\log$ is increasing.

(g((a+b)/2))^{2} =\left(\int f(x,(a+b)/2)\,dx\right)^{2} (228)
=\left(\int\exp(\log(f(x,(a+b)/2)))\,dx\right)^{2} (229)
\leq\left(\int\exp(\log(f(x,a))/2+\log(f(x,b))/2)\,dx\right)^{2} (230)
=\left(\int\sqrt{f(x,a)}\sqrt{f(x,b)}\,dx\right)^{2} (231)
\leq\int f(x,a)\,dx\int f(x,b)\,dx=g(a)g(b) (232)

The first inequality uses the log-convexity of $f$; the second is the Cauchy–Schwarz inequality.

The proof extends to all rational $\theta$ between 0 and 1 whose denominator is a power of 2 by recursive bisection:

\log(g(a))\theta+\log(g(b))(1-\theta)\geq\log(g(a\theta+(1-\theta)b)) (233)

Using continuity extends this to all $\theta\in[0,1]$, which concludes the proof.

K.2.4 Convexity of $\log(\Gamma(1,a)/a+\Gamma(-1,a)a)$

In this section we show that

\log(\Gamma(1,a)/a+\Gamma(-1,a)a) (234)

is convex with respect to $a$.

Proof

This is the same as

1/a+\Gamma(-1,a)a\exp(a) (235)

being log-convex, since the two expressions differ only by the linear term $-a$ (the factor $\exp(-a)$ pulled out of the logarithm), and adding a linear function preserves convexity.

The claim now follows from sections K.2.1, K.2.2 and K.2.3.

K.2.5 Depth loss error term

Here we show that

a\max(z/\mu,\mu/z) (236)

is convex when

\mu=\begin{cases}u/a+1&\text{ if }u>0\\ 1/(1-u/a)&\text{ otherwise}\end{cases} (237)

Proof. If $u\geq 0$ then

a\mu/z =(u+a)/z (238)
az/\mu =za^{2}/(u+a) (239)

The first expression is linear in $(a,u)$ and therefore convex ($z$ is a positive constant here).

The Hessian of the second expression is

\dfrac{\partial^{2}}{\partial a^{2}}\,\dfrac{za^{2}}{u+a} =\dfrac{\partial}{\partial a}\,\dfrac{z(2ua+a^{2})}{(u+a)^{2}} (240)
=\dfrac{z\left(2(u+a)(u+a)-2(2ua+a^{2})\right)}{(u+a)^{3}} (241)
=\dfrac{2z(u^{2}+2au+a^{2}-2ua-a^{2})}{(u+a)^{3}} (242)
=\dfrac{2zu^{2}}{(u+a)^{3}} (243)
\dfrac{\partial^{2}}{\partial a\partial u}\,\dfrac{za^{2}}{u+a} =\dfrac{z\left(2a(u+a)-2(2ua+a^{2})\right)}{(u+a)^{3}} (244)
=-\dfrac{2zua}{(u+a)^{3}} (245)
\dfrac{\partial^{2}}{\partial u^{2}}\,\dfrac{za^{2}}{u+a} =\dfrac{\partial}{\partial u}\left(-\dfrac{za^{2}}{(u+a)^{2}}\right) (246)
=\dfrac{2za^{2}}{(u+a)^{3}} (247)

which has the eigenvalues 0 and $2z(a^{2}+u^{2})/(a+u)^{3}>0$. The denominator is positive since both $a$ and $u$ are positive. Therefore the Hessian is positive semidefinite and the expression is convex.
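
A symbolic check of this Hessian and its eigenvalues (our sketch, using sympy):

```python
import sympy as sp

a, u, z = sp.symbols("a u z", positive=True)
f = z * a**2 / (u + a)

H = sp.hessian(f, (a, u))
eigenvalues = list(H.eigenvals())

# One eigenvalue is 0, the other is 2 z (a^2 + u^2) / (a + u)^3.
target = 2 * z * (a**2 + u**2) / (a + u) ** 3
assert any(sp.simplify(e) == 0 for e in eigenvalues)
assert any(sp.simplify(e - target) == 0 for e in eigenvalues)
```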

If $u<0$ then the expression $az/\mu=z(a-u)$, which is linear and therefore convex. The expression

a\mu/z=\dfrac{a^{2}}{z(a-u)} (248)

after the variable change $v=-u$ turns into

a\mu/z=\dfrac{a^{2}}{z(a+v)} (249)

which is a positive multiple of the expression already shown to be convex for $v>0$, so it is convex in this region as well. The max of convex expressions is convex, so the function is convex in each region. The mapping between $u$ and $\mu$ has a continuous gradient at the boundary $u=0$, therefore the function is convex for all $a>0$, $u\in\mathbb{R}$. Since the mapping between the network output and $a$ is linear in the region $a>1$, the loss is convex in this region.

This concludes the proof that the loss is convex in the region $A\succ 1$, $a>1$.