
An Algorithm for Finding Positive
Solutions to Polynomial Equations

Dustin Cartwright Dept. of Mathematics, University of California, Berkeley, CA 94720, USA dustin@math.berkeley.edu
Abstract.

We present a numerical algorithm for finding real non-negative solutions to polynomial equations. Our methods are based on the expectation maximization and iterative proportional fitting algorithms, which are used in statistics to find maximum likelihood parameters for certain classes of statistical models. Since our algorithm works by iteratively improving an approximate solution, we find approximate solutions in the cases when there are no exact solutions, such as overconstrained systems.

We present an iterative numerical method for finding non-negative solutions and approximate solutions to systems of polynomial equations. We require two assumptions about our system of equations. First, for each equation, all the coefficients other than the constant term must be non-negative. Second, there is a technical assumption on the exponents, described at the beginning of Section 1, which, for example, is satisfied if all non-constant terms have the same total degree. In Section 3, there is a discussion of the range of possible systems which can arise under these hypotheses.

Because of the assumption on signs, we can write our system of equations as,

(1) \sum_{\alpha\in S}a_{i\alpha}x^{\alpha}=b_{i}\quad\mbox{for }i=1,\ldots,\ell,

where the coefficients $a_{i\alpha}$ are non-negative and the $b_{i}$ are positive, and $S\subset\mathbb{R}_{\geq 0}^{n}$ is a finite set of possibly non-integer multi-indices. Our algorithm works by iteratively decreasing the generalized Kullback-Leibler divergence of the left-hand side and right-hand side of (1). The generalized Kullback-Leibler divergence of two positive vectors $a$ and $b$ is defined to be

(2) D(a\|b):=\sum_{i}\left(b_{i}\log\left(\frac{b_{i}}{a_{i}}\right)-b_{i}+a_{i}\right).

The standard Kullback-Leibler divergence consists only of the first term and is defined only for probability distributions, i.e. vectors whose entries sum to $1$. The last two terms are necessary so that the generalized divergence has, for arbitrary positive vectors $a$ and $b$, the property of being non-negative and zero exactly when $a$ and $b$ are equal (Proposition 3).
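As a concrete illustration (our own, not part of the paper), the generalized divergence (2) can be evaluated in a few lines; the following Python sketch assumes NumPy and positive input vectors:

    import numpy as np

    def generalized_kl(a, b):
        # Generalized Kullback-Leibler divergence D(a || b) from equation (2);
        # a and b must be positive vectors of the same length.
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(np.sum(b * np.log(b / a) - b + a))

In accordance with Proposition 3, generalized_kl(v, v) is zero for any positive vector v, and the value is strictly positive otherwise.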

Our algorithm converges to local minima of the Kullback-Leibler divergence, including exact solutions to the system (1). In order to find multiple local minima, we can repeat the algorithm for randomly chosen starting points. For finding approximate solutions, this may be sufficient. However, there are no guarantees of completeness for the exact solutions obtained in this way. Nonetheless, we hope that in certain situations, our algorithm will find applications both for finding exact and approximate solutions.

Lee and Seung applied the EM algorithm to the problem of non-negative matrix factorization [6]. They introduced the generalized Kullback-Leibler divergence in (2) and used it to find approximate non-negative matrix factorizations. Since the product of two matrices can be expressed by polynomials in the entries of the matrices, matrix-factorization is a special case of the equations in (1).
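For instance (an illustration of this reduction, not spelled out in the paper), approximately factoring a $p\times q$ matrix $V$ with positive entries as $WH$, with $W\in\mathbb{R}_{\geq 0}^{p\times r}$ and $H\in\mathbb{R}_{\geq 0}^{r\times q}$, amounts to the system $\sum_{j=1}^{r}w_{ij}h_{jk}=v_{ik}$ for all $i,k$. This system has the form (1): the variables are the entries of $W$ and $H$, every coefficient is $1$, the right-hand sides are the positive entries $v_{ik}$, and every monomial $w_{ij}h_{jk}$ is multilinear of total degree two.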

For finding exact solutions to arbitrary systems of polynomials, there are a variety of approaches which find all complex or all real solutions. Homotopy continuation methods find all complex roots of a system of equations [8]. Lasserre, Laurent, and Rostalski have applied semidefinite programming to find all real solutions to a system of equations, and a slight modification of their algorithm will find all positive real solutions [4, 5]. Even when only the positive roots are wanted, these two methods find all complex or all real solutions, respectively. Moreover, neither of these methods has any notion of approximate solutions.

For directly finding only positive real solutions, Bates and Sottile have proposed an algorithm based on fewnomial bounds on the number of real solutions [1]. However, their method is only effective when the number of monomials (the set $S$ in our notation) is only slightly larger than the number of variables. Our method makes only weak assumptions on the set of monomials, but stronger assumptions on the coefficients.

Our inspiration comes from tools for maximum likelihood estimation in statistics. Parameters which maximize the likelihood are exactly the parameters such that the model probabilities are closest to the empirical probabilities, in the sense of minimizing Kullback-Leibler divergence. Expectation-Maximization [7, Sec. 1.3] and Iterative Proportional Fitting [3] are well-known iterative methods for maximum likelihood estimation. We re-interpret these algorithms as methods for approximating solutions to polynomial equations, in which case their applicability can be somewhat generalized.

The impetus behind the work in this paper was the need to find approximate positive solutions to systems of bilinear equations in [2]. In this case the variables represented expression levels, which only made sense as positive parameters. Moreover, in order to accommodate noise in the data, there were more equations than variables, so it was necessary to find approximate solutions. Thus, the algorithm described in this paper was the most appropriate tool. Here, we generalize beyond bilinear equations and present proofs.

An implementation of our algorithm in the C programming language is freely available at http://math.berkeley.edu/~dustin/pos/.

In Section 1, we describe the algorithm and the connection to maximum likelihood estimation. In Section 2, we prove the convergence results needed for our algorithm. Finally, in Section 3, we show that even with our restrictions on the form of the equations, there can be exponentially many positive real solutions.

1. Algorithm

We make the assumption that we have an $s\times n$ non-negative matrix $g$, with no column identically zero, and positive real numbers $d_{j}$ for $1\leq j\leq s$ such that for each $\alpha\in S$ and each $j\leq s$, $\sum_{i=1}^{n}g_{ji}\alpha_{i}$ is either $0$ or $d_{j}$. For example, if all the monomials $x^{\alpha}$ have the same total degree $d_{1}$, we can take $s=1$ and $g_{1i}=1$ for all $i$. The other case of particular interest is multilinear systems of equations, in which each $\alpha_{i}$ is at most one. In this case the variables can be partitioned into sets such that the equations are linear in each set of variables, so we can take $d_{j}=1$ for all $j$. Note that because $d_{j}$ is in the denominator in (4), convergence is fastest when the $d_{j}$ are small, such as in the multilinear case. We also note that, for an arbitrary set of exponents $S$, there may not exist such a $g$.
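As a small concrete instance (our own illustration), consider a bilinear system in the variables $x_{1},x_{2},y_{1},y_{2}$ whose monomials are the products $x_{i}y_{j}$. Ordering the variables as $(x_{1},x_{2},y_{1},y_{2})$, each exponent vector $\alpha$ has a single $1$ in the first block and a single $1$ in the second block, so we may take $s=2$, $d_{1}=d_{2}=1$, and

g=\begin{pmatrix}1&1&0&0\\ 0&0&1&1\end{pmatrix},

since then $\sum_{i}g_{ji}\alpha_{i}=1$ for every monomial and each $j$.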

The algorithm begins with a randomly chosen starting vector and iteratively improves it through two nested iterations:

  •  Initialize $x$ with $n$ randomly chosen positive real numbers.

  •  Loop until the vector $x$ stabilizes:

    (a)  For all $\alpha\in S$, compute

      (3) w_{\alpha}:=\sum_{i}b_{i}\frac{a_{i\alpha}x^{\alpha}}{\sum_{\beta}a_{i\beta}x^{\beta}}.

    (b)  Loop until the vector $x$ stabilizes:

      (i)  Loop for $j$ from $1$ to $s$:

        (A)  Simultaneously update all entries of $x$:

          (4) x_{i}\leftarrow x_{i}\left(\frac{\sum_{\alpha}\alpha_{i}g_{ji}w_{\alpha}}{\sum_{\alpha}\alpha_{i}g_{ji}a_{\alpha}x^{\alpha}}\right)^{g_{ji}/d_{j}}\quad\mbox{where }a_{\alpha}=\sum_{i}a_{i\alpha}.

Because there is no subtraction, it is clear that the entries of $x$ remain positive throughout this algorithm.
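To make the two nested loops concrete, here is a minimal Python/NumPy sketch of the iteration above (our own transcription, not the author's C implementation); the function name positive_solve, the dense array encoding of the data, and the stopping tolerances are our choices:

    import numpy as np

    def positive_solve(A, b, S, g, d, max_outer=500, max_inner=100, tol=1e-9, seed=None):
        # A[i, k]: non-negative coefficient a_{i,alpha} of the monomial with exponent
        #          vector S[k] in equation i; b[i]: the positive right-hand side b_i.
        # S: (number of monomials) x n array of exponent vectors alpha; g (s x n) and
        # d (length s) are the data described at the beginning of this section.
        A, b, S, g, d = (np.asarray(M, dtype=float) for M in (A, b, S, g, d))
        n = S.shape[1]
        rng = np.random.default_rng(seed)
        x = rng.uniform(0.5, 1.5, size=n)            # random positive starting vector
        a_col = A.sum(axis=0)                        # a_alpha = sum_i a_{i,alpha}
        for _ in range(max_outer):
            x_outer = x.copy()
            mono = np.prod(x ** S, axis=1)           # x^alpha for each alpha in S
            lhs = A @ mono                           # sum_alpha a_{i,alpha} x^alpha
            w = (A.T @ (b / lhs)) * mono             # equation (3)
            for _ in range(max_inner):               # inner, IPF-style loop
                x_inner = x.copy()
                for j in range(g.shape[0]):
                    mono = np.prod(x ** S, axis=1)
                    num = S.T @ w                    # sum_alpha alpha_i w_alpha
                    den = S.T @ (a_col * mono)       # sum_alpha alpha_i a_alpha x^alpha
                    # g_{ji} cancels from numerator and denominator when non-zero,
                    # and the exponent g_{ji}/d_j is zero when g_{ji} = 0.
                    ratio = np.divide(num, den, out=np.ones_like(num), where=den > 0)
                    x = x * ratio ** (g[j] / d[j])   # equation (4)
                if np.max(np.abs(x - x_inner)) < tol:
                    break
            if np.max(np.abs(x - x_outer)) < tol:
                break
        return x

When all monomials share the same total degree, one can pass g = np.ones((1, n)) and d = [deg], where deg is that common degree, as noted above.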

Our method is inspired by interpreting the equations in (1) as a maximum likelihood problem for a statistical model and applying the well-known methods of Expectation-Maximization (EM) and Iterative Proportional Fitting (IPF). Here, we assume that all the monomials $x^{\alpha}$ have the same total degree. Our statistical model is that a hidden process generates an integer $i\leq\ell$ and an exponent vector $\alpha$ with joint probability $a_{i\alpha}x^{\alpha}$. The vector $x$ contains $n$ positive parameters for the model, restricted such that the total probability $\sum_{i,\alpha}a_{i\alpha}x^{\alpha}$ is $1$. The empirical data consists of repeated observations of the integer $i$, but not the exponent $\alpha$, and $b_{i}$ is the proportion of observations of $i$. In this situation, the vector $x$ which minimizes the divergence between the two sides of (1) gives the maximum likelihood parameters for the empirical distribution $b_{i}$. The inner loop of the algorithm consists of using IPF to solve the log-linear hidden model and the outer loop consists of using EM to estimate the distribution on the hidden states.

2. Proof of convergence

In this section we prove our main theorem:

Theorem 1.

The Kullback-Leibler divergence

(5) \sum_{i=1}^{\ell}D\left(\sum_{\alpha\in S}a_{i\alpha}x^{\alpha}\|b_{i}\right)

is weakly decreasing during the algorithm in Section 1. Moreover, assuming that the set $S$ contains a multiple of each unit vector $e_{i}$, i.e. some power of each $x_{i}$ appears in the system of equations, the vector $x$ converges to a critical point of the function (5) or to the boundary of the positive orthant.

Remark 2.

The condition on $S$ is necessary to ensure that the vector $x$ remains bounded during the algorithm.

We begin by establishing several basic properties of the generalized Kullback-Leibler divergence in Proposition 3 and Lemmas 4 and 5. Note that $D\left(a\|b\right)=\sum_{i=1}^{m}D\left(a_{i}\|b_{i}\right)$ and thus, we will check these basic properties by reducing to the case where $a$ and $b$ are scalars.

The proof of Theorem 1 itself is divided into two parts, corresponding to the two nested iterative loops. The first step is to prove that the updates (4) in the inner loop converge to a local minimum of the divergence $\sum_{\alpha}D\left(a_{\alpha}x^{\alpha}\|w_{\alpha}\right)$. The second step is to show that this implies that the outer loop strictly decreases the divergence function (5) except at a critical point.

Proposition 3.

For $a$ and $b$ vectors of positive real numbers, the divergence $D\left(a\|b\right)$ is non-negative with $D\left(a\|b\right)=0$ if and only if $a=b$.

Proof.

It suffices to prove this when $a$ and $b$ are scalars. We let $t=b/a$, and

D\left(a\|b\right)=b\log\left(\frac{b}{a}\right)-b+a=a(t\log t-t+1)=a\int_{1}^{t}\log s\,ds.

It is clear that $\int_{1}^{t}\log s\,ds$ is non-negative and equal to zero if and only if $t=1$. ∎

Lemma 4.

Suppose that $a$ and $b$ are vectors of $m$ positive real numbers. Let $t$ be any positive real number, and then

D\left(ta\|b\right)=D\left(a\|b\right)+(t-1)\sum_{i=1}^{m}a_{i}-\sum_{i=1}^{m}b_{i}\log t.
Proof.

Again, we assume that $a$ and $b$ are scalars and then it becomes a straightforward computation. ∎
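Spelled out in the scalar case, the computation referred to in the proof is

D\left(ta\|b\right)=b\log\left(\frac{b}{ta}\right)-b+ta=\left(b\log\left(\frac{b}{a}\right)-b+a\right)+(t-1)a-b\log t=D\left(a\|b\right)+(t-1)a-b\log t,

and summing this identity over the coordinates gives the statement of Lemma 4.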

Lemma 5.

If $a$ and $b$ are vectors of $m$ positive real numbers, then we can relate their divergence to the divergence of their sums by

D\left(a\|b\right)=D\left(\textstyle\sum_{i=1}^{m}a_{i}\|\textstyle\sum_{i=1}^{m}b_{i}\right)+D\left(\frac{\sum_{i=1}^{m}b_{i}}{\sum_{i=1}^{m}a_{i}}a\|b\right).
Proof.

We let $A=\sum_{i=1}^{m}a_{i}$ and $B=\sum_{i=1}^{m}b_{i}$, and apply Lemma 4 to the last term:

\displaystyle D\left(\frac{B}{A}a\|b\right)=D\left(a\|b\right)+\left(\frac{B}{A}-1\right)A-B\log\frac{B}{A}
\displaystyle=D\left(a\|b\right)-B\log\frac{B}{A}+B-A=D\left(a\|b\right)-D\left(A\|B\right).

After rearranging, we get the desired expression. ∎

Lemma 6.

The update rule (4) weakly decreases the divergence. If

x_{i}^{\prime}=x_{i}\left(\frac{\sum_{\alpha}\alpha_{i}g_{ji}w_{\alpha}}{\sum_{\alpha}\alpha_{i}g_{ji}a_{\alpha}x^{\alpha}}\right)^{g_{ji}/d_{j}},

then

(6) \sum_{\alpha}D\left(a_{\alpha}(x^{\prime})^{\alpha}\|w_{\alpha}\right)\leq\sum_{\alpha}D\left(a_{\alpha}x^{\alpha}\|w_{\alpha}\right)-\frac{1}{d_{j}}\sum_{i=1}^{n}D\left(\sum_{\alpha}\alpha_{i}g_{ji}a_{\alpha}x^{\alpha}\|\sum_{\alpha}\alpha_{i}g_{ji}w_{\alpha}\right).
Proof.

First, since the statement only depends on the $j$th row of the matrix $g$, we can assume that $g$ is a row vector and we drop $j$ from future subscripts. Second, we can assume that $d=1$ by replacing $g_{i}$ with $g_{i}/d$.

Third, we reduce to the case when $g_{i}=1$ for all $i$. We define a new set of exponents $\tilde{\alpha}$ and coefficients $\tilde{a}_{\tilde{\alpha}}$ by $\tilde{\alpha}_{i}=g_{i}\alpha_{i}$ and $\tilde{a}_{\tilde{\alpha}}=a_{\alpha}\prod x_{i}^{\alpha_{i}}$, where the product is taken over all indices $i$ such that $g_{i}=0$. We take $\tilde{x}$ to be a vector indexed by those $i$ such that $g_{i}\neq 0$. Then, under the change of coordinates $\tilde{x}_{i}=x_{i}^{1/g_{i}}$, we have $a_{\alpha}x^{\alpha}=\tilde{a}_{\tilde{\alpha}}\tilde{x}^{\tilde{\alpha}}$ and the update rule in (4) is the same for the new system with coefficients $\tilde{a}_{\tilde{\alpha}}$ and exponents $\tilde{\alpha}$. Furthermore, if all entries of $\tilde{\alpha}$ are zero, then $\tilde{x}^{\tilde{\alpha}}=1$ for all vectors $\tilde{x}$ and so we can drop $\tilde{\alpha}$ from our exponent set. Therefore, for the rest of the proof, we drop the tildes, and assume that $\sum_{i}\alpha_{i}=1$ for all $\alpha\in S$ and $g_{i}=1$ for all $i$, in which case $g$ drops out of the equations.

To prove the desired inequality, we substitute the updated assignment $x^{\prime}$ into the definition of Kullback-Leibler divergence:

\displaystyle D\left(a_{\alpha}(x^{\prime})^{\alpha}\|w_{\alpha}\right)=w_{\alpha}\log\left(\frac{w_{\alpha}}{a_{\alpha}x^{\alpha}\prod_{i=1}^{n}\left(\frac{\sum_{\beta}\beta_{i}w_{\beta}}{\sum_{\beta}\beta_{i}a_{\beta}x^{\beta}}\right)^{\alpha_{i}}}\right)-w_{\alpha}+a_{\alpha}(x^{\prime})^{\alpha}
\displaystyle=w_{\alpha}\log\frac{w_{\alpha}}{a_{\alpha}x^{\alpha}}-\sum_{i=1}^{n}\alpha_{i}w_{\alpha}\log\frac{\sum_{\beta}\beta_{i}w_{\beta}}{\sum_{\beta}\beta_{i}a_{\beta}x^{\beta}}-w_{\alpha}+a_{\alpha}(x^{\prime})^{\alpha}
(7) \displaystyle=D\left(a_{\alpha}x^{\alpha}\|w_{\alpha}\right)-\sum_{i=1}^{n}\alpha_{i}w_{\alpha}\log\frac{\sum_{\beta}\beta_{i}w_{\beta}}{\sum_{\beta}\beta_{i}a_{\beta}x^{\beta}}-a_{\alpha}x^{\alpha}+a_{\alpha}(x^{\prime})^{\alpha}.

On the other hand, let $C$ denote the last term of (6), which we can expand as,

\displaystyle C=\sum_{i=1}^{n}D\left(\sum_{\alpha}\alpha_{i}a_{\alpha}x^{\alpha}\|\sum_{\alpha}\alpha_{i}w_{\alpha}\right)
\displaystyle=\sum_{i=1}^{n}\left(\left(\sum_{\alpha}\alpha_{i}w_{\alpha}\right)\log\frac{\sum_{\alpha}\alpha_{i}w_{\alpha}}{\sum_{\alpha}\alpha_{i}a_{\alpha}x^{\alpha}}-\sum_{\alpha}\alpha_{i}w_{\alpha}+\sum_{\alpha}\alpha_{i}a_{\alpha}x^{\alpha}\right)
\displaystyle=\sum_{i=1}^{n}\sum_{\alpha}\left(\alpha_{i}w_{\alpha}\log\frac{\sum_{\beta}\beta_{i}w_{\beta}}{\sum_{\beta}\beta_{i}a_{\beta}x^{\beta}}-\alpha_{i}w_{\alpha}+\alpha_{i}a_{\alpha}x^{\alpha}\right)
(8) \displaystyle=\sum_{\alpha}\left(\sum_{i=1}^{n}\alpha_{i}w_{\alpha}\log\frac{\sum_{\beta}\beta_{i}w_{\beta}}{\sum_{\beta}\beta_{i}a_{\beta}x^{\beta}}-w_{\alpha}+a_{\alpha}x^{\alpha}\right),

where the last step follows from the assumption that $\sum_{i}\alpha_{i}=1$ for all $\alpha\in S$. We take the sum of (7) over all $\alpha\in S$ and add it to (8) to get,

(9) \sum_{\alpha}D\left(a_{\alpha}(x^{\prime})^{\alpha}\|w_{\alpha}\right)+C=\sum_{\alpha}D\left(a_{\alpha}x^{\alpha}\|w_{\alpha}\right)-\sum_{\alpha}w_{\alpha}+\sum_{\alpha}a_{\alpha}(x^{\prime})^{\alpha}.

Finally, we expand the last term of (9) using the definition of $x^{\prime}$ and apply the arithmetic-geometric mean inequality,

\displaystyle\sum_{\alpha}a_{\alpha}(x^{\prime})^{\alpha}=\sum_{\alpha}a_{\alpha}x^{\alpha}\prod_{i=1}^{n}\left(\frac{\sum_{\beta}\beta_{i}w_{\beta}}{\sum_{\beta}\beta_{i}a_{\beta}x^{\beta}}\right)^{\alpha_{i}}
\displaystyle\leq\sum_{\alpha}a_{\alpha}x^{\alpha}\sum_{i=1}^{n}\alpha_{i}\frac{\sum_{\beta}\beta_{i}w_{\beta}}{\sum_{\beta}\beta_{i}a_{\beta}x^{\beta}}
\displaystyle=\sum_{i=1}^{n}\left(\sum_{\alpha}\alpha_{i}a_{\alpha}x^{\alpha}\right)\frac{\sum_{\beta}\beta_{i}w_{\beta}}{\sum_{\beta}\beta_{i}a_{\beta}x^{\beta}}
\displaystyle=\sum_{i=1}^{n}\sum_{\beta}\beta_{i}w_{\beta}=\sum_{\beta}w_{\beta}.

Together with (9), this gives the desired inequality. ∎

Proposition 7.

A positive vector $x$ is a fixed point of the update rule (4) for all $1\leq j\leq s$ if and only if $x$ is a critical point of the divergence function $\sum_{\alpha}D\left(a_{\alpha}x^{\alpha}\|w_{\alpha}\right)$.

Proof.

For $x$ to be a fixed point of the update rule means that the numerator and denominator in (4) are equal, i.e.

(10) \sum_{\alpha}\alpha_{i}g_{ji}a_{\alpha}x^{\alpha}=\sum_{\alpha}\alpha_{i}g_{ji}w_{\alpha}\quad\mbox{for all $i$ and $j$.}

By our assumption on $g$, for each $i$, some $g_{ji}$ is non-zero, so (10) is equivalent to

(11) \sum_{\alpha}\alpha_{i}a_{\alpha}x^{\alpha}=\sum_{\alpha}\alpha_{i}w_{\alpha}\quad\mbox{for all }i.

On the other hand, we compute the partial derivative

\frac{\partial}{\partial x_{i}}\sum_{\alpha}D\left(a_{\alpha}x^{\alpha}\|w_{\alpha}\right)=\sum_{\alpha}\left(-w_{\alpha}\frac{\alpha_{i}}{x_{i}}+\alpha_{i}a_{\alpha}\frac{x^{\alpha}}{x_{i}}\right).

Since each $x_{i}$ is assumed to be non-zero, it is clear that all partial derivatives being zero is equivalent to (11). ∎

Lemma 8.

If we define $w_{\alpha}$ as in (3), then

\sum_{i=1}^{\ell}D\left(\sum_{\alpha}a_{i\alpha}(x^{\prime})^{\alpha}\|b_{i}\right)-\sum_{i=1}^{\ell}D\left(\sum_{\alpha}a_{i\alpha}x^{\alpha}\|b_{i}\right)\leq\sum_{\alpha}D\left(a_{\alpha}(x^{\prime})^{\alpha}\|w_{\alpha}\right)-\sum_{\alpha}D\left(a_{\alpha}x^{\alpha}\|w_{\alpha}\right).

Moreover, a positive vector $x$ is a fixed point if and only if $x$ is a critical point of the divergence function.

Proof.

We consider

(12) \sum_{i,\alpha}D\left(a_{i\alpha}(x^{\prime})^{\alpha}\|w_{i\alpha}\right)\qquad\mbox{where }w_{i\alpha}=\frac{b_{i}a_{i\alpha}x^{\alpha}}{\sum_{\beta}a_{i\beta}x^{\beta}},

and apply Lemma 5 in two different ways. First, by applying Lemma 5 to each group of (12) with fixed $\alpha$, we get

\sum_{i,\alpha}D\left(a_{i\alpha}(x^{\prime})^{\alpha}\|w_{i\alpha}\right)=\sum_{\alpha}D\left(a_{\alpha}(x^{\prime})^{\alpha}\|w_{\alpha}\right)+\sum_{i,\alpha}D\left(\frac{\sum_{j}w_{j\alpha}}{\sum_{j}a_{j\alpha}(x^{\prime})^{\alpha}}a_{i\alpha}(x^{\prime})^{\alpha}\|w_{i\alpha}\right).

In the last term, the monomials $(x^{\prime})^{\alpha}$ cancel and so it is a constant independent of $x^{\prime}$, which we denote $E$. On the other hand, we can apply Lemma 5 to each group in (12) with fixed $i$,

\sum_{i,\alpha}D\left(a_{i\alpha}(x^{\prime})^{\alpha}\|w_{i\alpha}\right)=\sum_{i}D\left(\sum_{\alpha}a_{i\alpha}(x^{\prime})^{\alpha}\|b_{i}\right)+\sum_{i,\alpha}D\left(\frac{b_{i}a_{i\alpha}(x^{\prime})^{\alpha}}{\sum_{\beta}a_{i\beta}(x^{\prime})^{\beta}}\|w_{i\alpha}\right).

We can combine these equations to get

(13) \sum_{i}D\left(\sum_{\alpha}a_{i\alpha}(x^{\prime})^{\alpha}\|b_{i}\right)=\sum_{\alpha}D\left(a_{\alpha}(x^{\prime})^{\alpha}\|w_{\alpha}\right)+E-\sum_{i,\alpha}D\left(\frac{b_{i}a_{i\alpha}(x^{\prime})^{\alpha}}{\sum_{\beta}a_{i\beta}(x^{\prime})^{\beta}}\|w_{i\alpha}\right).

By Proposition 3, the last term of (13) is non-negative, and by the definition of $w_{i\alpha}$, it is zero for $x^{\prime}=x$. Therefore, any value of $x^{\prime}$ which decreases the first term on the right compared to $x$ will also decrease the left-hand side by at least as much, which is the desired inequality.

In order to prove the statement about the derivative, we consider the derivative of (13) at $x^{\prime}=x$. Because the last term is minimized at $x^{\prime}=x$, its derivative is zero, so

\left.\frac{\partial}{\partial x_{j}^{\prime}}\right|_{x^{\prime}=x}\sum_{i}D\left(\sum_{\alpha}a_{i\alpha}(x^{\prime})^{\alpha}\|b_{i}\right)=\left.\frac{\partial}{\partial x_{j}^{\prime}}\right|_{x^{\prime}=x}\sum_{i,\alpha}D\left(a_{i\alpha}(x^{\prime})^{\alpha}\|w_{i\alpha}\right).

By Proposition 7, a positive vector $x$ is a fixed point of the inner loop if and only if these partial derivatives on the right are zero for all indices $j$, which is the definition of a critical point. ∎

Proof of Theorem 1.

The Kullback-Leibler divergence $\sum_{\alpha}D\left(a_{\alpha}x^{\alpha}\|w_{\alpha}\right)$ decreases at each step of the inner loop by Lemma 6. Thus, by Lemma 8, the divergence

(14) \sum_{i=1}^{\ell}D\left(\sum_{\alpha}a_{i\alpha}x^{\alpha}\|b_{i}\right)

decreases at least as much. However, the divergence (14) is non-negative according to Proposition 3. Therefore, the magnitude of the decreases in divergence must approach zero over the course of the algorithm. By Lemma 6, this means that the quantity $C$ in the proof of that lemma must approach zero. By Proposition 3, this means that the quantities in that divergence approach each other. However, up to a power, these are the numerator and denominator of the factor in the update rule (4), so the difference between consecutive vectors $x$ approaches zero.

Thus, we just need to show that $x$ remains bounded. However, since some power of each variable $x_{i}$ occurs in some equation, as $x_{i}$ gets large, the divergence for that equation also gets arbitrarily large. Therefore, each $x_{i}$ must remain bounded, so the vector $x$ must have a limit as the algorithm is iterated. If this limit is in the interior of the positive orthant, then it must be a fixed point. By Lemma 8 and Proposition 7, this fixed point must be a critical point of the divergence (5). ∎
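As a practical aside (our own suggestion, not part of the paper), the monotone quantity (14) is a convenient diagnostic when repeating the algorithm from several random starting points, as described in the introduction: one keeps the run with the smallest final value. A self-contained evaluation in the same Python/NumPy style as before, with the hypothetical name divergence_to_target, is:

    import numpy as np

    def divergence_to_target(A, b, S, x):
        # Value of (14): sum_i D( sum_alpha a_{i,alpha} x^alpha || b_i ).
        A, b, S, x = (np.asarray(M, dtype=float) for M in (A, b, S, x))
        lhs = A @ np.prod(x ** S, axis=1)
        return float(np.sum(b * np.log(b / lhs) - b + lhs))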

3. Universality

Although the restriction on the exponents and especially the positivity of the coefficients seem like strong conditions, such systems can nonetheless be quite complex. In this section, we investigate the breadth of such equations.

Proposition 9.

For any system of $\ell$ real polynomials in $n$ variables, there exists a system of $\ell+1$ equations in $n+1$ variables, in the form (1), such that the positive solutions $(x_{1},\ldots,x_{n})$ to the former system are in bijection with the positive solutions $(x_{1}^{\prime},\ldots,x_{n+1}^{\prime})$ of the latter, with $x_{i}=x_{i}^{\prime}/x_{n+1}^{\prime}$.

Proof.

We write our system of equations as $\sum_{\alpha\in S}a_{i\alpha}x^{\alpha}=0$ for $1\leq i\leq\ell$, where $S\subset\mathbb{N}^{n}$ is an arbitrary finite set of exponents and the $a_{i\alpha}$ are any real numbers. We let $d$ be the maximum degree of any monomial $x^{\alpha}$ for $\alpha\in S$. We homogenize the equations with a new variable $x_{n+1}$. Explicitly, define $S^{\prime}\subset\mathbb{N}^{n+1}$ to consist of $\alpha^{\prime}=(\alpha,d-\sum_{i}\alpha_{i})$ for all $\alpha$ in $S$ and we write $a_{i\alpha^{\prime}}=a_{i\alpha}$. We add a new equation with coefficients $a_{\ell+1,\alpha}=1$ for all $\alpha\in S^{\prime}$ and $b_{\ell+1}=1$. For this system, we can clearly take $g_{1i}=1$ and $d_{1}=d$ to satisfy the condition on exponents. Furthermore, for any positive solution $(x_{1},\ldots,x_{n})$ to the original system of equations, $(x_{1}^{\prime},\ldots,x_{n+1}^{\prime})$ with $x_{i}^{\prime}=x_{i}/\big(\sum_{\alpha}x^{\alpha}\big)^{1/d}$ and $x_{n+1}^{\prime}=1/\big(\sum_{\alpha}x^{\alpha}\big)^{1/d}$ is a solution to the homogenized system of equations.

Next, we add a multiple of the last equation to each of the others in order to make all the coefficients positive. For each $1\leq i\leq\ell$, choose a positive $b_{i}>-\min_{\alpha}\{a_{i\alpha}\mid\alpha\in S^{\prime}\}$, and define $a_{i\alpha}^{\prime}=a_{i\alpha}+b_{i}$. By construction, the resulting system has all positive coefficients, and since the equations are formed from the previous equations by elementary linear transformations, the set of solutions is the same. ∎
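The construction in this proof is easy to mechanize. The following sketch (our own, with the hypothetical helper name to_positive_form and a dense array encoding) homogenizes an arbitrary system, appends the normalizing equation, and shifts the coefficients to be positive:

    import numpy as np

    def to_positive_form(A, S):
        # A[i, k]: arbitrary real coefficient of the monomial with exponent S[k]
        #          in the equation sum_k A[i, k] x^S[k] = 0, with S[k] in N^n.
        # Returns (A2, b2, S2) in the form (1): homogenize with a new variable
        # x_{n+1}, append the equation "sum of all monomials = 1", and add b_i
        # times that equation to equation i so every coefficient becomes positive.
        A, S = np.asarray(A, dtype=float), np.asarray(S, dtype=int)
        l, m = A.shape
        d = int(S.sum(axis=1).max())                       # maximum total degree
        S2 = np.hstack([S, (d - S.sum(axis=1))[:, None]])  # exponent of x_{n+1}
        b = np.maximum(1.0, 1.0 - A.min(axis=1))           # b_i > -min_alpha a_{i,alpha}
        A2 = np.vstack([A + b[:, None], np.ones((1, m))])  # shifted rows plus new equation
        b2 = np.append(b, 1.0)                             # right-hand sides
        return A2, b2, S2

    # For the transformed system one can take g = np.ones((1, n + 1)) and d_1 = d,
    # since every monomial now has total degree d.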

The practical use of the construction in the proof of Proposition 9 is mixed. The first step, of homogenizing to deal with arbitrary sets of exponents, is a straightforward way of guaranteeing the existence of the matrix gg. However, for large systems, the second step tends to produce an ill-conditioned coefficient matrix. In these cases, our algorithm converges very slowly. Nonetheless, Proposition 9 shows that in the worst case, systems satisfying our hypotheses can be as complicated as arbitrary polynomial systems.

Proposition 10.

There exist bilinear equations in $2m$ variables with ${2m-2\choose m-1}$ positive real solutions.

Proof.

We use a variation on the technique used to prove Proposition 9.

First, we pick $2m-2$ generic homogeneous linear functions $b_{1},\ldots,b_{2m-2}$ on $m$ variables. By generic, we mean that for any $m$ of the $b_{k}$, the only simultaneous solution of all $m$ linear equations is the trivial one. This genericity implies that any $m-1$ of the $b_{k}$ define a point in $\mathbb{P}^{m-1}$. By taking a linear change of coordinates in each set of variables, we can assume that all of these points are positive, i.e. have a representative consisting of all positive real numbers.

Then we consider the system of equations

(15) b_{k}(x_{1},\ldots,x_{m})\cdot b_{k}(x_{m+1},\ldots,x_{2m})=0,\quad\mbox{for }1\leq k\leq 2m-2
(16) (x_{1}+\ldots+x_{m})(x_{m+1}+\ldots+x_{2m})=1
(17) x_{1}+\ldots+x_{m}=1.

The equations (15) are bihomogeneous and so we can think of their solutions in $\mathbb{P}^{m-1}\times\mathbb{P}^{m-1}$. There are exactly ${2m-2\choose m-1}$ positive real solutions, corresponding to the subsets $A\subset[2m-2]$ of size $m-1$. For any such $A$, there is a unique, distinct solution satisfying $b_{k}(x_{1},\ldots,x_{m})=0$ for all $k$ in $A$ and $b_{k}(x_{m+1},\ldots,x_{2m})=0$ for all $k$ not in $A$. By assumption, for each solution, all the coordinates can be chosen to be positive. The last two equations (16) and (17) dehomogenize the system in a way such that there are ${2m-2\choose m-1}$ positive real solutions. Finally, as in the last paragraph of the proof of Proposition 9, we can add multiples of (16) to the equations (15) in order to make all the coefficients positive. ∎

Acknowledgments

We thank Bernd Sturmfels for reading a draft of this manuscript. This work was supported by the Defense Advanced Research Projects Agency project “Microstates to Macrodynamics: A New Mathematics of Biology” and the U.S. National Science Foundation (DMS-0456960).

References

  • [1] Dan Bates and Frank Sottile. Khovanskii-Rolle continuation for real solutions. arXiv:0908.4579, 2009.
  • [2] Dustin A. Cartwright, Siobhan M. Brady, David A. Orlando, Bernd Sturmfels, and Philip N. Benfey. Reconstructing spatiotemporal gene expression data from partial observations. Bioinformatics, 25(19):2581–2587, 2009.
  • [3] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. Annals of Math. Stat., 43(5):1470–1480, 1972.
  • [4] Jean B. Lasserre, Monique Laurent, and Philipp Rostalski. Semidefinite characterization and computation of real radical ideals. Foundations of Computational Mathematics, 8(5):607–647, 2008.
  • [5] Jean B. Lasserre, Monique Laurent, and Philipp Rostalski. A prolongation-projection algorithm for computing the finite real variety of an ideal. Theoretical Computer Science, 410(27–29):2685–2700, 2009.
  • [6] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. Adv. Neural Info. Proc. Syst., 13:556–562, 2001.
  • [7] Lior Pachter and Bernd Sturmfels. Algebraic Statistics for Computational Biology. Cambridge University Press, 2005.
  • [8] Jan Verschelde. Algorithm 795: PHCpack: a general-purpose solver for polynomial systems by homotopy continuation. ACM Transactions on Mathematical Software, 25(2):251–276, 1999.