Probabilistic 3d regression with the projected Huber distribution
Abstract
Estimating probability distributions which describe where an object is likely to be, given camera data, is a task with many applications. In this work we describe properties which we argue such methods should conform to. We also design a method which conforms to these properties. In our experiments we show that our method produces uncertainties which correlate well with empirical errors. We also show that the mode of the predicted distribution outperforms our regression baselines. The code for our implementation is available online.

1 Introduction
Estimating the 3d position of objects has many use cases. For example: 1) Estimating the position of a car or pedestrian is required to construct policies for how to traverse traffic in autonomous driving or ADAS. As a result this task is included in many such datasets, for example KITTI Geiger et al. (2012). 2) Body pose estimation can be seen as trying to find the 3d position of each joint of a human skeleton. Examples of such datasets are von Marcard et al. (2018) and Ionescu et al. (2013). 3) For many robotics applications estimating the 3d position of objects is relevant Bruns & Jensfelt (2022). For industrial robotics the shape of the object is often known, so the only source of variation is the six degrees of freedom due to the orientation and position of the object.
Estimating uncertainties associated with the estimated position has applications for fusion of multiple estimates Mohlin et al. (2021). Another application for uncertainty estimates is temporal filters, such as Kalman filters. Uncertainties can also be used for application-specific purposes. For example, in an ADAS setting it makes sense to not only avoid the most likely position of an object, but to also include a margin proportional to the uncertainty of the estimate. In robotic grasping a good policy could be to try to pick up objects directly if the estimated position is certain, and to do something else if the estimate is uncertain, such as changing the viewing angle.
2 Prior work
When estimating 3d position from camera data there is an inherent ambiguity due to scale/depth ambiguity. Many interesting types of objects have a scale ambiguity, such as people, cars and animals.
The existence of scale ambiguity is reflected in the performance on competitive datasets. For example, the errors of single-view methods which estimate the position of people relative to the camera are on the order of 120mm, of which roughly 100mm is in the depth direction Moon et al. (2019). This is consistent with the presence of scale/depth ambiguity. In the multi-view setting, where the depth ambiguity can be eliminated, the state of the art is on the order of 17mm Iskakov et al. (2019).
There are many ways to resolve the scale/depth ambiguity.
One way to solve the problem is to add a depth sensor such as a lidar Geiger et al. (2012) or structured light Choi et al. (2016). However, adding additional sensors comes with several drawbacks, such as price, complexity and limited range. For these reasons we focus on methods which only use camera data.
For camera-only methods there are two common ways to resolve the ambiguity. The first is only applicable if the object lies on a known plane; in practice this plane is often the ground plane, but it could also be, for example, the surface of a table Mills & Goldenberg (1989). The other approach is multi-view fusion. In this case the errors due to these ambiguities point in different directions and can therefore be cancelled out with a suitable method.
For this reason it is important that a method for estimating 3d position also fits into a framework which is able to resolve the scale/depth ambiguity.
It is well known that log-concave densities are well suited for sensor fusion, since computing the optimal combination is a convex problem in this setting An (1997). In our work we show that both the ground plane assumption and multi-view fusion turn into convex optimization problems if the model predicts a log-concave probability distribution.
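To make this concrete, the following is a minimal sketch (our own illustration, not the implementation used in this work) of fusing two independent estimates whose negative log likelihoods are convex; Gaussians stand in for any log-concave density and all numbers are made up.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_nll(p, mean, prec):
    # Negative log likelihood of a Gaussian up to constants; convex in p.
    d = p - mean
    return 0.5 * d @ prec @ d

# Two hypothetical single-view estimates expressed in a common frame.
mean_a, prec_a = np.array([0.1, 0.0, 4.0]), np.diag([100.0, 100.0, 0.1])
mean_b, prec_b = np.array([0.0, 0.2, 3.5]), np.diag([0.1, 100.0, 100.0])

# Assuming independent errors, the fused estimate minimizes the summed NLL.
fused_nll = lambda p: gaussian_nll(p, mean_a, prec_a) + gaussian_nll(p, mean_b, prec_b)
fused = minimize(fused_nll, x0=np.zeros(3))
print(fused.x)  # convexity guarantees this local minimum is the global one
```

Because the sum of convex functions is convex, any log-concave predicted density can be dropped into the same scheme.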
There are many other works which treat the problem of estimating 3d position in a probabilistic framework. For example, Feng et al. (2019) model the position of an object with a normal distribution. However they also use a detector on lidar data to avoid large errors and thereby large gradients. Meyer et al. (2019) model the locations of the corners of a bounding box with Laplace distributions, but they also avoid the scale/depth ambiguity by using lidar data. Many works treat depth estimation as a classification problem where a final prediction is constructed by computing the expected value of this discrete distribution Wang et al. (2022), but the probabilities predicted by such methods are in general not log-concave, since it is possible to model a multimodal distribution with these methods. Bertoni et al. (2019) model the reciprocal of the depth with a Laplace distribution. However, they do not model the projected location in a probabilistic manner, which is not well suited for multi-view fusion. In section 3.3 we also show that modelling depth independently of the projected location, as they and many others do, always results in undesirable properties.
3 Motivation
This section focuses on the specific properties which data captured by cameras conforms to, and which a model should therefore take into account when doing probabilistic position estimation from camera data.
In practice when applying this method the location of the object is modeled by a probability distribution which is parameterized by the output of a convolutional neural network.
3.1 Notation
We will investigate a setting where the model takes an input image which is produced by capturing a scene on a sensor. The camera is a pin-hole camera without radial distortions and has the known intrinsics
(1) |
Coordinates for this camera are denoted where the axis is aligned with the principal axis of the camera and correspond to the center of the camera.
The method will predict a probability distribution parameterized by based on the input image.
3.2 Desired constraints for estimated distribution
In this section we describe constraints which we argue a probability distribution estimated from camera data should conform to. We also motivate why these constraints should exist.
constraint 1: The model can express any variance for the projected coordinates on the image sensor and depth independently.
Formally:
(2) |
Motivation
For many tasks the error of depth estimation is inherently large due to scale/depth ambiguity. Therefore, for objects which vary in size but whose absolute size is hard to estimate, there will be an inherent error proportional to the scale error and the distance to the object. Despite this, estimating the position in projected coordinates can often be done accurately.
For example, assuming that the height of humans is between 1.5 and 1.9m, and that it is intrinsically hard to estimate the size of an unseen person, a depth uncertainty of roughly 13% of the distance to the person is reasonable to expect.
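A back-of-the-envelope version of this argument (the reference height of 1.5m below is our own choice):

```latex
\frac{(1.9\,\mathrm{m} - 1.5\,\mathrm{m})/2}{1.5\,\mathrm{m}} \approx 0.13
```

Since the image only constrains the ratio between physical size and depth, this relative size uncertainty carries over directly to the distance estimate.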
Despite this it is possible to estimate the position of keypoints in the projected coordinate space of an image accurately, often within a few pixels.
constraint 2: The model should not try to estimate the coordinates directly, but instead estimate the depth divided by the focal length and the projected position of the object.
Formally:
(3) |
Motivation:
It is difficult to separate the quantities scale, depth and focal length, since the main effect of each variable is to change the apparent size of the object on the sensor. For this reason the focal length needs to be taken into account when estimating the depth of an object, otherwise the variation in focal length adds additional ambiguity. When scale/depth ambiguity is present it is not possible to estimate the absolute position of the x or y coordinate accurately, but it is often possible to do so for the projected position of the object. This motivates why a method should also estimate the location in image space. Methods which conform to these constraints are not uncommon. For example, Zhen et al. (2020) estimate the depth after dividing by the focal length. It is also common to estimate the location of objects in image space.
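As an illustration of the parameterization suggested by constraint 2 (a sketch with our own variable names, not the paper's notation): the network predicts the projected position on the sensor and the depth divided by the focal length, and the camera coordinates follow from the pinhole model.

```python
def backproject(u, v, z_over_f, focal_length):
    """Recover camera coordinates from the projected position and depth/focal-length.

    (u, v) are sensor coordinates relative to the principal point, and
    z_over_f is the quantity constraint 2 suggests estimating.
    """
    z = z_over_f * focal_length  # depth, compensated for the focal length
    x = u * z / focal_length     # pinhole model: u = f * x / z
    y = v * z / focal_length     # pinhole model: v = f * y / z
    return x, y, z
```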
constraint 3: The output probability should have support only for points in front of the camera.
Formally:
(4) |
Motivation:
Cameras only depict objects in front of them. The output distribution should reflect that.
constraint 4: The output probability should have support for all points in the field of view.
Formally: for a camera with focal length and sensor size
(5) |
Motivation:
One standard way to optimize neural networks which output probability distributions is to minimize the negative log likelihood. Backpropagation is only possible if the predicted probability is larger than 0 for the ground truth position. Since there are no guarantees on what an untrained network predicts, the probability has to be positive for all locations which are reasonable without considering the input data. Since a camera only depicts objects in its field of view, this is a sufficient condition.
constraint 5: The negative log likelihood of the position is convex. Formally:
(6) |
is convex with respect to
Motivation:
A common method to combine estimates which are expressed as probability distributions is by assuming that the two estimates have errors which are independent random variables and letting the output of the fusion be the maximum likelihood point under this assumption Mohlin et al. (2021). It is desirable if computing this combination is a convex optimization problem since it guarantees that it is possible to find the fusion quickly. In section 3.3 we show that a sufficient constraint for this property is that the negative log likelihood is convex with respect to position.
constraint 6: The probability for objects infinitely far away from the camera is 0. Formally:
(7) |
(8) |
Motivation:
Firstly objects of finite size are not visible if they are infinitely far away from the camera. Secondly we need this property to guarantee convergence when doing multi-view fusion.
constraint 7: is continuous with respect to .
Motivation:
It is not reasonable that small changes in position change the density by a large amount.
3.3 Properties given these constraints
Proposition 1: All distributions which are twice differentiable at least at one point and decompose into
(9) |
do not fulfill all of the above constraints. Proof in supplementary B.
Proposition 2: In multi-view fusion, if the fields of view of the cameras have a non-empty intersection and each camera produces probability estimates conforming to constraints 1-7, then a valid maximum likelihood fusion point exists and can be found by convex optimization. Proof in supplementary C.
Proposition 3: Imposing a ground plane constraint is a convex optimization problem. If any point of the ground plane is in the field of view of the camera, then the constraints also guarantee that a solution exists. Proof in supplementary C.
4 Method
In this work we first describe the Projected Huber Distribution, which we will show fulfills the constraints described in section 3.
4.1 Projected Huber Distribution
The distribution can be decomposed into one component which mainly models the probability in the projected coordinates and another component which models the distribution in depth.
The component which mainly models the probability over the projected coordinates is
(10) |
By including in this distribution we can avoid the problem described in proposition 1. Note that conditioning on is necessary to get a proper distribution.
, model the mean position on the camera sensor. models the estimated depth of the object and models the precision. The function is a Huber function.
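For reference, the standard Huber function is quadratic near zero and linear in the tails; a sketch is given below (the threshold and any scaling used in this work may differ).

```python
import torch

def huber(r, delta=1.0):
    # Standard Huber function: quadratic for |r| <= delta, linear beyond.
    abs_r = torch.abs(r)
    return torch.where(abs_r <= delta,
                       0.5 * r ** 2,
                       delta * (abs_r - 0.5 * delta))
```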
The distribution which models the depth is
(11) |
Where models the estimated depth. Note that this is the same parameter as for . roughly models the precision, that is a larger reduces the variance.
By combining these distributions we get the Projected Huber Distribution
(12) |
for x, y, z in the camera's coordinate system.
The normalizing factors are
(13)
(14)
(15)
Where is bounded by
(16) |
This distribution will conform to all constraints except constraint 2, which relates to how the parameters of the model are predicted. We show how to predict the parameters in a way which conforms to constraint 2 in subsection 4.4.
Proposition 5: The moments of this distribution are
(17)
(18)
(19)
(20)
The terms are not very intuitive, but the fraction for the expected value quickly decreases to 1, while the fraction for the variance behaves similarly to , that is
(21)
(22)
Proof in supplementary F.
4.2 Proof for constraints
constraint 1: Proof: From proposition 5 we see that models the uncertainty in projected coordinates, while models the uncertainty in depth coordinates.
constraint 3: if , from definition in equation 12
constraint 4: if since the range of is the and the expression can be evaluated for all values of . The field of view for a camera is a subset of the half plane .
constraint 5: Proof sketch: Show that both and have a convex negative log likelihood. An affine basis change turns the argument of the into a norm, which is convex. is also convex. Full proof in supplementary G.1.
constraint 6: Proof sketch: For the region the proof is trivial. Then we prove it for the region . Here the term goes to 0 as , while the other factors are bounded. For the region the factor goes to 0 while the other factors are bounded. Full proof in supplementary G.2.
constraint 7: Proof sketch: For the function is trivially continuous. For the function is also trivially continuous. When z approaches 0 the factor goes to 0. Full proof in G.3.
4.3 Parameter remapping
When the parameters of the Projected Huber Distribution are estimated by a neural network it is necessary to turn the output of the network into a valid parameterization for the Projected Huber Distribution. Valid in this sense refers to conforming to the constraints , and , that is, is positive definite while a and are positive.
Neural networks give outputs in where is decided by the architecture of the network. These outputs do not conform to our desired parameter constraints unless a suitable activation is applied.
To construct this activation we start by doing a basis change to conform to constraint 2 from the motivation.
(23)
(24)
Where is a bias term defined by
(25) |
For the purpose of proving that our loss has bounded gradients we also need the constant defined by
(26) |
These values can either be derived from known properties of the dataset or by computing these values for all samples in the training dataset.
proportional to the projected scale of an object. Even in the extreme case where the scale ambiguity is 40% and the projected size varies between 2 and 200 pixels would be less than 20. Furthermore is only used to prove that the loss has bounded gradients.
Proposition 5: and for points in the field of view of the camera with a depth conforming to equation 26.
Proof sketch: correspond to the projected image coordinates, scaled to be between -1 and 1, and is normalized by definitions 25-26. Full proof in supplementary H.
Now correspond to the coordinates in the projected image and is the z coordinate after compensating for the scale change of the focal length and applying a logarithm to map to
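A hedged sketch of this basis change (equations 23-26 are not reproduced above, so the exact normalization constants below are assumptions on our part): projected coordinates scaled by the sensor half-size to lie roughly in [-1, 1], and the depth divided by the focal length, mapped through a logarithm and shifted by the dataset-dependent bias b.

```python
import math

def remap(x, y, z, focal_length, sensor_half_size, bias_b):
    # Projected coordinates, normalized to roughly [-1, 1] (assumed normalization).
    u = (focal_length * x / z) / sensor_half_size
    v = (focal_length * y / z) / sensor_half_size
    # Depth compensated for the focal length, mapped to all of R by the logarithm.
    z_tilde = math.log(z / focal_length) - bias_b
    return u, v, z_tilde
```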
Define
(27)
(28)
(29)
Using a negative log likelihood, the loss in the original basis can be written as
(30)
(31)
(32)
which in the new basis becomes
(33)
(34)
(35)
In these losses we exclude terms not required for computing gradients, such as terms only containing constants.
The normalizing factor and its gradients are computed by numerical integration. The full method is described in supplementary J.
Estimating these parameters conforms to constraint 2, since and model the position and precision for the projected coordinates, while and model the depth and the precision of the depth estimate.
4.4 Enforcing constraints on distribution parameters
Designing the activation which outputs and is done in the same way as Mohlin et al. (2021).
Estimating and is done by starting with two real numbers ,
(36) |
(37) |
With this parameterization the loss is convex with respect to the network output when . The gradients of the loss will be bounded with respect to . Proofs in supplementary K and I.
Having a loss which is Lipschitz-continuous should aid with stability during training since it avoids back-propagating very large gradients.
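A generic sketch of such an activation (equations 36-37 are not reproduced above, so this is not the paper's exact mapping): unconstrained network outputs are passed through a smooth, strictly positive function so that every output is a valid parameterization.

```python
import torch.nn.functional as F

def to_positive_params(raw_a, raw_b, eps=1e-6):
    # Softplus maps any real network output to a strictly positive value,
    # so a and b are always valid distribution parameters.
    a = F.softplus(raw_a) + eps
    b = F.softplus(raw_b) + eps
    return a, b
```

Other choices, such as an exponential, enforce the same positivity; the specific mapping used in this work is chosen so that the resulting loss is convex and has bounded gradients.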
5 Experiments

In this section we show how the method described in section 4 can be applied in practice to estimate the position of an object.
5.1 Dataset
We construct a synthetic dataset by rendering objects in a similar way as Johnson et al. (2017). We choose our task to be to estimate the location of a rendered cylinder. The cylinder is rendered with different reflective material properties, colors and with random orientation.
The scene is rendered on a sensor of size pixels, and the focal length of the camera is sampled from a uniform distribution . The cylinder has a height of 0.2 meters and a radius of 0.1 meters. The depth is uniformly sampled from , and the x and y coordinates are sampled from for each axis.
In this experimental setup the normalizing constants are
(38)
(39)
The data is fitted by using the rendered image as input to a standard ResNet-50 with an output dimension of 7. Instead of applying a softmax to the output we use the mapping described in section 4.4 to predict the parameters of the Projected Huber Distribution. Applying a negative log likelihood loss gives equation 33, which is used as the loss when fitting the parameters of the network. The parameters are updated using Adam with default parameters. This setup is visualized in figure 2.
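A sketch of this training setup (the functions `params_from_output` and `projected_huber_nll` below are placeholders for the mapping of section 4.4 and the loss of equation 33, not provided implementations):

```python
import torch
import torchvision

model = torchvision.models.resnet50(num_classes=7)   # 7 unconstrained outputs
optimizer = torch.optim.Adam(model.parameters())     # Adam with default parameters

def training_step(images, gt_positions):
    raw = model(images)                          # (batch, 7)
    params = params_from_output(raw)             # enforce valid parameters (section 4.4)
    loss = projected_huber_nll(params, gt_positions).mean()  # equation 33
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```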
To showcase different cases we also generate an additional dataset where the cylinder is scaled by a factor uniformly sampled from U(0.8, 1.2). This dataset is constructed to showcase how the method performs when a scale/depth ambiguity is present.
We also generate a dataset where we add a cube and a sphere to both add visual complexity and introduce occlusions, which incentivizes the model to predict different uncertainties for the case where the cylinder is occluded compared to when it is not. For this dataset no scale ambiguity is present. The cube and sphere have orientations, material properties and colors sampled independently from the cylinder and from each other.
Finally we generate a dataset where we render a scene with occluding objects and with scale ambiguity. The scaling factor is sampled independently for all objects.
Each dataset contains 8000 training samples and 2000 test samples.
5.2 Experimental result
5.2.1 Empirical error vs. estimated error
We investigate how the predicted uncertainty correlates with the empirical error by computing, for each sample, the empirical squared error for the projected coordinates and for the depth. We then compute the predicted variance for each sample based on and by using equations 18 and 20.
We check whether the estimated uncertainties are well correlated by sorting the samples based on the predicted variance and applying a low-pass filter over 200 adjacent samples to both the predicted variance and the empirical squared error. For a perfect model these should be identical. For a good model they should be highly correlated.
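A sketch of this evaluation procedure (our own phrasing of the steps described above):

```python
import numpy as np

def calibration_curve(predicted_var, squared_error, window=200):
    # Sort by predicted variance, then smooth both the predicted variance and
    # the empirical squared error with a moving average over `window` samples.
    order = np.argsort(predicted_var)
    kernel = np.ones(window) / window
    smoothed_pred = np.convolve(np.asarray(predicted_var)[order], kernel, mode="valid")
    smoothed_emp = np.convolve(np.asarray(squared_error)[order], kernel, mode="valid")
    return smoothed_pred, smoothed_emp  # identical curves for a perfect model
```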


From figure 3 we see that the estimated errors correlate well with the empirical errors over all datasets, for both the depth and the projected component. This indicates that the uncertainty estimates which the model produces are useful. The estimated uncertainties also follow intuition: the dataset without occlusion or scale ambiguity has low predicted uncertainty for both the projected coordinates and the depth. For the dataset without occlusion but with scale ambiguity, the depth uncertainty increases significantly, but the uncertainty for the projected coordinates remains low. When fitting the model on a dataset without scale ambiguity but with other objects which can occlude the cylinder, both the predicted uncertainty for the projected coordinates and for the depth increase, which is expected since it is harder to predict where objects are if they are partially or fully occluded.
We also show how the average predicted parameters and change over training in figure 4. From this figure we see that the predicted uncertainty decreases as the training proceeds. This is due to the model adapting to the fact that the error decreases during training.
5.2.2 Regression performance
To show that our method is able to estimate 3d position well we compare it against several regression baselines. The reference is to regress and directly. Some methods estimate the logarithm of depth instead of the raw depth, such as Lin & Lee (2020). For completeness we compare to estimating as well. The results of our method and these baselines are shown in table 1. From this table we see that our method often exceeds the performance of both baselines; however, this improvement is not likely to be significant, but it does not need to be either, since our contribution is a probabilistic prediction with a convex negative log likelihood, not a better single point estimate. The errors are measured in the camera's coordinate system. For the baselines the predicted coordinates are computed by using the mapping in equation 23. For our method we use the mode as our single point prediction
(40) |
to construct a point estimate. There are other possible single point predictors which could be equally good, such as the geometric median or the mean; we did not evaluate these, but they should produce similar predictions.
We also do ablations where we ignore the focal length of the camera when estimating the position, thereby violating our proposed constraint 2. One such baseline is to let the network predict directly using an MAE loss. Another such baseline is to predict and but with a constant focal length of for the mapping between and . Note that the camera used to capture the scene had different focal lengths for different samples; we only use an incorrect average focal length for the mapping in equation 23 for training and evaluation. The results for the baselines and our method on the dataset without occluding objects or scale ambiguity are shown in Table 2.
From this table we see that baselines which ignore the focal length perform significantly worse, likely due to the fact that in general it is not possible to measure depth from images, only scale; ignoring the focal length therefore introduces an irreducible error.
Finally we try to frame how large our errors are in image space. On average the focal length is approximately 1550 pixels. The object is on average 4000 mm away. Therefore the average error of 2.4mm corresponds to approximately 1 pixel, which is quite good. Objects with a diameter of 0.2m will have a diameter of approximately 80 pixels when rendered with a focal length of 1550 at a depth of 4 meters. If depth estimation were done via the proxy task of estimating the diameter of objects, an error of 1 pixel would therefore result in a depth error of 50mm, which is on par with our average error when there are no occlusions or scale ambiguities, again indicating that our method works quite well.
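Writing out these conversions (numbers from the text, rounding ours):

```latex
\frac{2.4\,\mathrm{mm}}{4000\,\mathrm{mm}} \cdot 1550\,\mathrm{px} \approx 0.9\,\mathrm{px},
\qquad
\frac{0.2\,\mathrm{m}}{4\,\mathrm{m}} \cdot 1550\,\mathrm{px} \approx 80\,\mathrm{px},
\qquad
\frac{4000\,\mathrm{mm}}{80\,\mathrm{px}} \approx 50\,\mathrm{mm/px}.
```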
Introducing a 20% scale ambiguity for the rendered object should result in irreducible errors which are on average 10% of the depth. The object was on average 4000mm from the camera, which should give an average error of approximately 400mm. The measured errors which we observe are similar to this, but slightly lower, possibly due to the network being able to infer some information about the scale through some non-intuitive shading effect, or due to the fact that does not follow a uniform distribution.
Method | Metric | default | occl. | scale ambig. | occl. & scale ambig.
---|---|---|---|---|---
MAE, , | projected error (mm) | 3.6 | 22.8 | 3.7 | 37.2
 | depth error (mm) | 69 | 218 | 359 | 386
MAE, , | projected error (mm) | 5.0 | 18.0 | 5.1 | 27.6
 | depth error (mm) | 36 | 178 | 355 | 355
probabilistic (ours) | projected error (mm) | 2.4 | 13.0 | 1.6 | 17.0
 | depth error (mm) | 48 | 168 | 347 | 350
Method | Projected error (mm) | Depth error (mm)
---|---|---
MAE, (x,y,z) | 13.0 | 327
MAE, , , constant | 10.8 | 325
probabilistic (ours) | 2.4 | 48
6 Discussion
The constraints which we described are well motivated for the general case, but there are cases where they are not necessary. For example, if scale ambiguity is not present in the dataset then constraint 1 is not necessary. If the focal length is the same for all pictures in the dataset then constraint 2 is not necessary. It is also possible to infer camera intrinsics from some scenes; if this is the case then constraint 2 might not be necessary either. If the uncertainty in depth is small, many distributions would assign only a small probability to the object being behind the camera even if constraint 3 is not strictly fulfilled, indicating that this constraint is mainly important when the depth uncertainty is large. Constraint 5 is only necessary if the estimated probability distribution needs to be log-concave. We have shown that this is useful for multi-view fusion and for imposing ground plane constraints, but if the goal is only to produce as good a single point prediction of the position as possible, this constraint should not be necessary.
Furthermore our ablations indicate that using the correct focal length is important when estimating 3d position from camera data.
The focus of this work is to investigate the theoretical aspects of probabilistic 3d regression from camera data. To be able to highlight this the experiments are on synthetic data. Future work could be to verify that the proposed method works on more challenging real world data as well.
We also show that it should be easy to combine our method with multi-view fusion or the ground plane assumption in theory, but we do not implement it. This could also be future work.
7 Conclusion
In this work we have described constraints which we argue should be taken into account when designing a method for estimating probability distributions over 3d position from camera data. We have also designed a method which conforms to these constraints. In our experiments we show that the uncertainty estimates which our method produces correlate well with empirical errors. Our experiments also show that our method performs on par with or better than several regression baselines. In our ablations we show that in our experimental setting the performance decreases significantly if the camera intrinsics are ignored.
8 Acknowledgements
DM is an industrial PhD student at Tobii AB. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.
References
- An (1997) Mark Yuying An. Log-concave probability distributions: Theory and statistical testing. Duke University Dept of Economics Working Paper, (95-03), 1997.
- Bertoni et al. (2019) Lorenzo Bertoni, Sven Kreiss, and Alexandre Alahi. Monoloco: Monocular 3d pedestrian localization and uncertainty estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
- Bruns & Jensfelt (2022) Leonard Bruns and Patric Jensfelt. On the evaluation of rgb-d-based categorical pose and shape estimation. arXiv preprint arXiv:2202.10346, 2022.
- Choi et al. (2016) Sungjoon Choi, Qian-Yi Zhou, Stephen Miller, and Vladlen Koltun. A large dataset of object scans. arXiv preprint arXiv:1602.02481, 2016.
- Feng et al. (2019) Di Feng, Lars Rosenbaum, Claudius Glaeser, Fabian Timm, and Klaus Dietmayer. Can we trust you? on calibration of a probabilistic object detector for autonomous driving. arXiv preprint arXiv:1909.12358, 2019.
- Geiger et al. (2012) Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- Ionescu et al. (2013) Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7):1325–1339, 2013.
- Iskakov et al. (2019) Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. Learnable triangulation of human pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7718–7727, 2019.
- Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2901–2910, 2017.
- Kumar et al. (2020) Abhinav Kumar, Tim K Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, and Chen Feng. Luvli face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8236–8246, 2020.
- Lin & Lee (2020) Jiahao Lin and Gim Hee Lee. Hdnet: Human depth estimation for multi-person camera-space localization. In European Conference on Computer Vision, pp. 633–648. Springer, 2020.
- Meyer et al. (2019) Gregory P Meyer, Ankit Laddha, Eric Kee, Carlos Vallespi-Gonzalez, and Carl K Wellington. Lasernet: An efficient probabilistic 3d object detector for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12677–12686, 2019.
- Mills & Goldenberg (1989) James K Mills and Andrew A Goldenberg. Force and position control of manipulators during constrained motion tasks. IEEE Transactions on Robotics and Automation, 5(1):30–46, 1989.
- Mohlin et al. (2021) David Mohlin, Gerald Bianchi, and Josephine Sullivan. Probabilistic regression with huber distributions. arXiv preprint arXiv:2111.10296, 2021.
- Moon et al. (2019) Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 10133–10142, 2019.
- von Marcard et al. (2018) Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In European Conference on Computer Vision (ECCV), sep 2018.
- Wang et al. (2022) Tai Wang, ZHU Xinge, Jiangmiao Pang, and Dahua Lin. Probabilistic and geometric depth: Detecting objects in perspective. In Conference on Robot Learning, pp. 1475–1485. PMLR, 2022.
- Zhen et al. (2020) Jianan Zhen, Qi Fang, Jiaming Sun, Wentao Liu, Wei Jiang, Hujun Bao, and Xiaowei Zhou. Smap: Single-shot multi-person absolute 3d pose estimation. In European Conference on Computer Vision, pp. 550–566. Springer, 2020.
Appendix A Appendix
Appendix B Projected and depth independent result in finite precision
B.1 Variance decreases when reducing size of single tail
This section proves an intermediate result which we use for the main proof. Here we show that for random variables Y, Z and a variable , if , then
(41) |
Then decreases in decreases
proof
(42) | ||||
(43) | ||||
(44) |
where … does not depend on c. The first term is times the variance of Y, which is positive; the second term is positive since ; the third term is positive since and the negations cancel out.
Therefore increases when increases, which concludes the proof.
B.2 Main proof
Proof sketch: assume that all constraints except constraint 1 hold, and show that the smallest possible variance for the projected coordinates is at least a constant which is larger than 0.
By considering the line we get
(45) |
The negative log likelihood is
(46) |
For convenience we define
(47) | ||||
(48) |
We compute the hessian with respect to and of
(49) | ||||
(50) | ||||
(51) | ||||
(52) | ||||
(53) |
If this function is convex then this hessian has a positive determinant. The determinant is
(54) | ||||
(55) |
The function is twice differentiable at at least one point; let exist for the value at this point. If then we reach a contradiction; if then where the function has support. The solution here would be , which can not model arbitrary precision.
The solution of
(56) |
is
(57) |
If the constraint is not tight, then the resulting function would increase faster than the solution of the differential equation as the distance to the mode increases. If we study the function for which are larger than the mode then the derivative is
(58) |
Since we are to the right of the mode , is also positive since the logarithm of this value is real. Thus in this region Therefore approach infinity when . Therefore .
The mode has to exist since the distribution is proper. The same analysis holds to show that for
(59) |
Where and .
The distribution will either give a value less than the mode, or larger than or equal to the mode. We can now apply the proof in B.1 to show and , because if that was not the case it would be possible to reduce the variance by making the constraint active.
Furthermore is continuous since it is convex.
Thus the distribution which minimizes the projected variance while being consistent with the constraints has the form
(60) |
Which obviously can not model an arbitrary small variance for the projected coordinates for a fixed C.
Finally we need to show that
(61) |
Imposes constraints on what variance for z can be modeled
To maximize the variance we minimize g while keeping the constraint tight. This gives
(62) |
which gives the distribution after reparameterization
(63) |
Which is a normal distribution with variance
Therefore a low variance in the projected coordinates requires a large which requires a low variance for the depth estimate. This concludes the proof.
Appendix C Multi-view fusion and ground plane constraints
In this section we prove proposition 2 and 3.
C.1 Multi view
For this section we have cameras which produce predictions , where are coordinates in the coordinate system of camera and are the parameters of the estimated probability distribution based on the image captured by camera . The images are in theory captured at the same time, or in practice within a very short time period. The multi-view fusion is constructed under the assumption that the errors of the individual estimates are independent. The affine transform transforms coordinates from coordinate system to coordinate system .
Definition 1: We define the maximum likelihood fusion given the estimates to be
(64) |
We introduce the notion of a valid fusion as
Definition 2: A valid maximum likelihood fusion point exist if
(65) |
This excludes the cases where the most likely fused point lies infinitely far away or where the fused distribution is improper.
Proposition 2 (existence): In multi-view fusion, if the fields of view of the cameras have a non-empty intersection and each camera gives probability estimates conforming to constraints 1-7, then there exists a valid maximum likelihood fusion point.
Proof Having a non empty intersection of the field of views for the cameras implies
(66) |
Constraint 4 implies that for all cameras, since is in the field of view of the camera.
Therefore
(67) |
Constraint 6 implies that there exists a such that
(68) |
therefore
(69) | ||||
(70) |
The last step is due to
1) is continuous since continuity is preserved when multiplying continuous functions.
2) A closed ball is compact.
3) The extreme value theorem states that the maximum of a continuous function over a compact set exists. Thus applying the extreme value theorem concludes the proof.
Proposition 2 (convexity): Finding the maximum likelihood fusion is a convex optimization problem.
Proof: Maximizing a probability is equivalent to minimizing a negative log likelihood.
Formally, given definition 1
(71) |
Here we have used the fact that the transform from to has a jacobian with determinant 1.
Maximizing the probability is the same as minimizing a negative log likelihood, since logarithms are increasing and negating an increasing function results in a decreasing function, that is
(72) | ||||
(73) |
It is also well known that convexity is preserved under linear transformations of the argument
This means that if is convex with respect to then is convex with respect to since the change in coordinate system is a linear transform.
Convexity is also preserved under addition therefore if is convex with respect to for all , then is convex with respect to .
Finding the minimum of a convex function is a convex optimization problem. This concludes the proof.
Proposition 3: The constraints make finding the most likely point while enforcing a ground plane assumption, that is solving
(74) |
where and define the ground plane is a convex optimization problem. If any point of the ground plane is in the field of view of the camera, then the constraints also guarantee that a solution exists.
proof
(75) |
Optimizing a convex function subject to an affine constraint is a convex optimization problem.
If a point of the ground plane is in the field of view of the camera, then constraint 4 ensures
Constraint 6 ensures
(76) |
Therefore
(77) |
which exists due to the extreme value theorem and constraint 7.
Appendix D Intermediate integrals
D.1 Double exponential integral
For convenience for future proofs we derive the integral of a double exponential
(78) |
Where is the upper incomplete gamma function
Proof
(79) | ||||
(80) | ||||
(81) | ||||
(82) |
The first step is rearranging.
The second step comes from the variable substitution
(83) |
The third step comes from the variable substitution
(84) |
D.2 Moments of the depth distribution
Here we show
(85) |
proof
(86) | ||||
(87) | ||||
(88) | ||||
(89) | ||||
(90) | ||||
(91) |
The first step is rearranging. The second step is the variable substitution . The third step is separating the range of the integral into positive and negative numbers. The final step is using equation 78.
D.3 Expected value of a positive huber distribution in 1 dimension
For use in future parts we derive the integral of
(92) |
Proof
(93) | ||||
(94) |
The first step is the variable substitution , which has a scale factor of
(95) | ||||
(96) |
The first step is integration by parts.
This gives
(97) | ||||
(98) |
D.4 Expected cube of positive Huber distribution in 1 dimension
For use in future parts we derive the integral of
(99) |
Proof
(100) |
(101) | ||||
(102) | ||||
(103) | ||||
(104) | ||||
(105) |
(106) | ||||
(107) | ||||
(108) | ||||
(109) |
Therefore
(110) |
D.5 Norm factor of 2d Huber distribution
(111) |
proof
(112) |
D.6 Covariance of 2d Huber distribution
(113) |
proof
(114) | |||
(115) |
From rotation symmetry we know the variance will be a scaled identity matrix
Therefore
(116) |
Appendix E Projected Huber Distribution components
E.1 Depth distribution
(117) |
with
(118) |
proof The normalizing factor can be computed as
(119) | ||||
(120) | ||||
(121) |
The first step is from definition. The second step is using equation 85. The third step is
(122) |
E.1.1 Expected value of depth
(123) |
This function is decreasing with respect to a with limit as When ,
proof
Follows directly from and equation 85
The limit can be proven by multiplying numerator and denominator by This gives
(124) |
Since
(125) |
E.1.2 Variance of depth
The variance of the depth is
(126) |
This expression is not very intuitive, but it behaves like
(127) |
when
proof
(128) |
Follows directly from and equation 85
(129) |
E.2 Projected distribution
The normalizing factor of the projected part of the distribution is
(130) |
For a distribution over a given plane where z is constant.
Proof
(131) | ||||
(132) | ||||
(133) |
The second step is the variable change
(134) |
The last step comes from 92 after a variable change to polar coordinates.
E.3 Expected value of projected components
The expected value
(135) |
for a fixed plane where is constant.
Proof
(136) | ||||
(137) | ||||
(138) |
where the basis change is the same as when deriving the normalizing factor. The last step is recognizing that the expected value of due to symmetry.
E.4 variance of projected components
The for a constant z value is
(139) |
proof
(140) | ||||
(141) | ||||
(142) |
Since
(143) |
We get
(144) |
Appendix F Projected Huber Distribution
F.1 Normalizing factor
In this section we show
(145) |
proof
(146) | ||||
(147) | ||||
(148) |
Since the normalizing factor of does not depend on
F.2 Expected projected coordinates
(149) |
proof
(150) |
F.3 Variance of projected coordinates
The variance of the projected coordinates is
(151) |
proof
When integrating this over z we get
(152) | ||||
(153) |
(154) | ||||
(155) | ||||
(156) |
Therefore
(157) |
F.4 Expected depth
The expected depth is
(158) |
proof
(159) | ||||
(160) |
F.5 Depth variance
(161) |
proof
(162) |
and
(163) |
Appendix G Proof constraints
G.1 Proof constraint 5
We want to show that the negative log likelihood is convex with respect to for fixed .
Since the distribution decomposes into
(164) |
Therefore
(165) |
Since convexity is preserved under addition it is sufficient to show that the negative log likelihood of and are convex separately.
(166) |
Doing the affine parameter change
(167) | ||||
(168) |
gives
(169) |
which is convex. Since convexity is preserved under affine transforms this concludes the proof for
The proof for is even simpler
(170) |
maximum of convex functions is convex. Therefore it is sufficient to show that each expression in the max is convex.
is linear, linear functions are convex
(171) |
Therefore the second expression is convex as well. This concludes the proof.
G.2 Proof constraint 6
First consider the region . The statement is trivially true here since 0 converges to 0.
Secondly consider the region
in this region
(172) | ||||
(173) |
The last step comes from applying the triangle inequality on the term and maximizing the depth and projected terms independently.
For the last expression the first factor converges to 0 since is positive, is positive definite and is finite. The second factor is finite. Therefore for limits in this region the condition holds.
In the region also need to go to infinity.
In this region the following inequalities hold.
(174) |
Since the depth distribution is decreasing where the first step relaxes the constraint to and does the variable change
The following inequality holds in this region as well.
(175) | ||||
(176) |
which goes to infinity as goes to infinity.
Therefore goes to 0 in this region since
(177) |
goes to 0 while
(178) |
is bounded by a constant.
For the region , z will need to go to infinity. The term is smaller than 1 and
(179) |
The union of these regions is . The limit is also the same for each region, therefore the limit is the same for the union of the regions, which concludes the proof.
G.3 Proof constraint 7
The function is continuous for all values when due to the fact that the density is a combination of continuous functions for this region. The function is also continuous when since the function is 0 in this region. For the regions we know
(180) |
The normalizing factor is finite and the exponential term goes to 0 as
Appendix H Ground truth bounded after basis change
(181) | ||||
(182) | ||||
(183) |
and
(184) | ||||
(185) | ||||
(186) |
and are bounded by 1 because for objects in the field of view of the camera. Therefore
(187) | |||
(188) |
Which concludes the proof
Appendix I Proof of bounded gradients for loss
This section contains a proof that the gradients are bounded with respect to the network output when the mapping in section 4.4 is used.
The proof for is already done in Mohlin et al. (2021)
The mapping from to is a contraction. Therefore it is sufficient to show that the loss has bounded gradients with respect to and to prove bounded gradients with respect to
For the region
the regression part of the loss is
(189) |
The gradient of the first term is
(190) |
The gradient of the second expression is
(191) |
If the loss is
(192) |
The first expression has gradient norm
(193) |
The second expression has gradient norm
(194) |
which concludes the proof for the regression part of the loss.
The normalizing factor component of the depth loss is
(195) |
for The gradient of this is
(196) |
Which is less than when
If the mapping between and is Therefore
(197) | ||||
(198) |
Which is less than . Therefore the negative log likelihood of the normalizing factor has bounded gradients with respect to
Appendix J Computing normalizing factor
The normalizing factor can be rewritten as
(199) |
with gradient
(200) |
Therefore if can be computed accurately both the function and its gradient can be computed. We show how to do this in section J.1
Proof of function evaluation
(201) | ||||
(202) | ||||
(203) |
Proof of gradient
(204) | ||||
(205) | ||||
(206) | ||||
(207) | ||||
(208) |
J.1 numerical integration of
Here we show that
(209) |
proof
(210) | ||||
(211) |
where that basis change is used.
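A generic sketch of this numerical scheme (the concrete integrand of equation 209 is not reproduced above, so `integrand` and `d_integrand_da` are placeholders): the integral and its parameter gradient are both evaluated with the trapezoidal rule, differentiating the integrand under the integral sign.

```python
import numpy as np

def trapezoid(y, t):
    # Trapezoidal rule on a fixed grid t with samples y.
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def integral_and_grad(integrand, d_integrand_da, a, t_max=50.0, n=10_000):
    t = np.linspace(0.0, t_max, n)
    value = trapezoid(integrand(t, a), t)       # normalizing integral
    grad = trapezoid(d_integrand_da(t, a), t)   # d/da under the integral sign
    return value, grad
```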
Appendix K Convex loss with respect to network output
Here we show that the loss is convex with respect to the network outputs.
Proof: In the following sections we will prove that both the projected and the depth parts of the loss are convex with respect to their parameters. To prove that the depth loss is convex we proceed through the steps in sections K.2.1-K.2.5.
K.1 Projected term
We have the same loss and the same mapping to parameterize and as in Mohlin et al. (2021). In that work they prove that this loss is convex when . We use , which concludes the proof.
K.2 Depth Loss
K.2.1 Rewriting argument of logarithm of depth normalizing factor
In this section we show that
(212) |
proof
(213) | ||||
(214) | ||||
(215) |
K.2.2 Log-convexity of integrand in equation 212
Here we show that for and
(216) |
is log-convex with respect to
proof
since is a constant for this proof we denote this value as
(217) | ||||
(218) | ||||
(219) |
To show log convexity we need to show that this expression is larger than 0. To make notation slighly easier we denote , since both a and k are positive t is positve as well.
(220) | ||||
(221) | ||||
(222) | ||||
(223) | ||||
(224) |
In the last step every term is positive in both numerator and denominator.
K.2.3 log-convex integral for log-convex integrands
For a function
(225) |
where is continuous and log-convex with respect to a, then g(a) is log-convex as well.
proof First we show
(226) |
which is equivalent to
(227) |
Since is increasing
(228) | ||||
(229) | ||||
(230) | ||||
(231) | ||||
(232) |
The first inequality is Cauchy-Schwarz, the second comes from the fact that is log-convex.
The proof can be extended for all rational between 0 and 1 where the denominator is a power of 2 by recursive bisection.
(233) |
Using continuity completes the proof for all
Which concludes the proof.
K.2.4 convexity of
In this section we show that
(234) |
is convex with respect to
proof
This is the same as
(235) |
being log convex.
K.2.5 depth loss error term
here we show that
(236) |
is convex when
(237) |
proof if then
(238) | ||||
(239) |
The first expression is linear therefore convex.
The hessian of the second expression is
(240) | ||||
(241) | ||||
(242) | ||||
(243) | ||||
(244) | ||||
(245) | ||||
(246) | ||||
(247) | ||||
(248) |
which has the eigenvalues 0 and . The denominator is positive since both and are positive. Therefore this expression is positive semi-definite, which is sufficient for convexity.
If , then the expression is linear and therefore convex. The expression
(249) |
Doing the variable change turns this into the expression
(250) |
which we have already shown is convex for . Therefore this expression is convex in this region. The max of convex expressions is convex, therefore the function is convex in each region. The mapping between and has a continuous gradient at the boundary. Therefore the function is convex for all . Since the mapping between the network output and is linear for the region , the loss is convex for this region.
This concludes the proof that the loss is convex in the region