
Online algorithms for Nonnegative Matrix Factorization with the Itakura-Saito divergence

Augustin Lefèvre (augustin.lefevre@inria.fr), INRIA/ENS Sierra project, 23, avenue d'Italie, 75013 Paris
Francis Bach (francis.bach@ens.fr), INRIA/ENS Sierra project, 23, avenue d'Italie, 75013 Paris
Cédric Févotte (fevotte@telecom-paristech.fr), CNRS LTCI; Télécom ParisTech, 37-39, rue Dareau, 75014 Paris
Abstract

Nonnegative matrix factorization (NMF) is now a common tool for audio source separation. When learning NMF on large audio databases, one major drawback is that the dictionary update has time complexity O(FKN) (where (F,N) are the dimensions of the input power spectrograms and K the number of basis spectra), which forbids its application to signals longer than an hour. We provide an online algorithm with O(FK) time and memory complexity for dictionary updates. We show on audio simulations that the online approach is faster for short audio signals and makes it possible to analyze audio signals of several hours.

1 Introduction

In audio source separation, nonnegative matrix factorization (NMF) is a popular technique for building a high-level representation of complex audio signals with a small number of basis spectra, forming together a dictionary (Smaragdis et al., 2007; Févotte et al., 2009; Virtanen et al., 2008). Using the Itakura-Saito divergence as a measure of fit of the dictionary to the training set makes it possible to capture fine structure in the power spectrum of audio signals, as shown in (Févotte et al., 2009).

However, estimating the dictionary can be quite slow for long audio signals, and indeed intractable for training sets of more than a few hours. We propose an algorithm to estimate Itakura-Saito NMF (IS-NMF) on audio signals of possibly infinite duration with tractable memory and time complexity. This article is organized as follows: in Section 2, we summarize Itakura-Saito NMF, propose an algorithm for online NMF, and discuss implementation details. In Section 3, we test our algorithms on real audio signals of short, medium and long durations, and show that our approach outperforms regular batch NMF in terms of computer time.

2 Online Dictionary Learning for the Itakura-Saito divergence

Various methods have recently been proposed for online dictionary learning (Mairal et al., 2010; Hoffman et al., 2010; Bucak and Gunsel, 2009). However, to the best of our knowledge, no algorithm exists for online dictionary learning with the Itakura-Saito divergence. In this section we summarize IS-NMF, then introduce our algorithm for online NMF and briefly explain the mathematical framework.

2.1 Itakura-Saito NMF

Define the Itakura-Saito divergence as d_{IS}(y,x)=\sum_{i}\left(\frac{y_{i}}{x_{i}}-\log\frac{y_{i}}{x_{i}}-1\right). Given a data set V=(v_{1},\dots,v_{N})\in\mathbb{R}_{+}^{F\times N}, Itakura-Saito NMF consists in finding W\in\mathbb{R}_{+}^{F\times K} and H=(h_{1},\dots,h_{N})\in\mathbb{R}_{+}^{K\times N} that minimize the following objective function:

\mathcal{L}_{H}(W)=\frac{1}{N}\sum_{n=1}^{N}d_{IS}(v_{n},Wh_{n})\,. (1)
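For concreteness, here is a minimal NumPy sketch of the divergence and of the objective in Eq. (1); the function names are ours and purely illustrative.

import numpy as np

def is_divergence(y, x):
    """Itakura-Saito divergence d_IS(y, x) = sum_i (y_i / x_i - log(y_i / x_i) - 1)."""
    r = y / x
    return np.sum(r - np.log(r) - 1.0)

def is_nmf_objective(V, W, H):
    """Average IS divergence between the columns of V and those of W @ H, as in Eq. (1)."""
    return is_divergence(V, W @ H) / V.shape[1]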

The standard approach to solving IS-NMF is to optimize alternately in W and H and use majorization-minimization (Févotte and Idier, in press). At each step, the objective function is replaced by an auxiliary function of the form \mathcal{L}_{H}(W,\underline{W}) such that \mathcal{L}_{H}(W)\leq\mathcal{L}_{H}(W,\underline{W}), with equality if W=\underline{W}:

\mathcal{L}_{H}(W,\underline{W})=\sum_{fk}\left(A_{fk}\frac{1}{W_{fk}}+B_{fk}W_{fk}\right)+c\,, (2)

where A,B\in\mathbb{R}_{+}^{F\times K} and c\in\mathbb{R} are given by:

\begin{array}{rl}A_{fk}&=\sum_{n=1}^{N}H_{kn}V_{fn}(\underline{W}H)^{-2}_{fn}\,\underline{W}_{fk}^{2}\,,\\[2mm] B_{fk}&=\sum_{n=1}^{N}H_{kn}(\underline{W}H)^{-1}_{fn}\,,\\[2mm] c&=\sum_{f=1}^{F}\sum_{n=1}^{N}\log\frac{V_{fn}}{(\underline{W}H)_{fn}}-F\,.\end{array} (3)

Thus, updating W by W_{fk}=\sqrt{A_{fk}/B_{fk}} yields a descent algorithm. Similar updates can be found for h_{n}, so that the whole process defines a descent algorithm in (W,H) (for more details see, e.g., Févotte and Idier, in press). In a nutshell, batch IS-NMF works in cycles: at each cycle, all sample points are visited, the whole matrix H is updated, the auxiliary function in Eq. (2) is re-computed, and W is then updated. We now turn to the description of online NMF.
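For reference, here is a minimal NumPy sketch of one such MM update of W, following Eqs. (2)-(3) directly (the function name is ours); the update of H is analogous with the roles of W and H exchanged.

import numpy as np

def batch_update_W(V, W, H):
    """One majorization-minimization update of the dictionary: W <- sqrt(A / B), Eqs. (2)-(3)."""
    WH = W @ H                            # F x N reconstruction
    A = ((V / WH ** 2) @ H.T) * W ** 2    # A_fk = sum_n H_kn V_fn (WH)_fn^{-2} W_fk^2
    B = (1.0 / WH) @ H.T                  # B_fk = sum_n H_kn (WH)_fn^{-1}
    return np.sqrt(A / B)

The two products against H^{\top} are exactly the O(FKN) operations that motivate the online algorithm of the next section.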

2.2 An algorithm for online NMF

When N is large, multiplicative update algorithms for IS-NMF become expensive because the dictionary update step involves large matrix multiplications with time complexity O(FKN) (computation of the matrices A and B). We present here an online version of the classical multiplicative update algorithm, in the sense that only a subset of the training data is used at each step of the algorithm.

Suppose that at each iteration of the algorithm we are provided a new data point v_{t}, and we are able to find h_{t} that minimizes d_{IS}(v_{t},W^{(t)}h_{t}). Let us rewrite the updates in Eq. (3). Initialize A^{(0)},B^{(0)},W^{(0)} and at each step compute:

\begin{array}{rl}A^{(t)}&=A^{(t-1)}+\left(\frac{v_{t}}{(W^{(t-1)}h_{t})^{2}}h_{t}^{\top}\right)\cdot(W^{(t-1)})^{2}\,,\\[2mm] B^{(t)}&=B^{(t-1)}+\frac{1}{W^{(t-1)}h_{t}}h_{t}^{\top}\,,\\[2mm] W^{(t)}&=\sqrt{\frac{A^{(t)}}{B^{(t)}}}\,.\end{array} (4)
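Read concretely, one step of Eq. (4) consists of two rank-one updates of the sufficient statistics followed by a closed-form dictionary update; a minimal NumPy sketch (names are ours):

import numpy as np

def online_step(A, B, W, v_t, h_t):
    """One step of Eq. (4): accumulate rank-one statistics for (v_t, h_t), then update W."""
    Wh = W @ h_t                                    # F-dimensional model of the new frame
    A = A + np.outer(v_t / Wh ** 2, h_t) * W ** 2   # A <- A + (v_t (W h_t)^{-2} h_t^T) . W^2
    B = B + np.outer(1.0 / Wh, h_t)                 # B <- B + (W h_t)^{-1} h_t^T
    W = np.sqrt(A / B)
    return A, B, W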

Now we may update W each time a new data point v_{t} is visited, instead of visiting the whole data set. This differs from batch NMF in the following sense: suppose we replace the objective function in Eq. (1) by

L_{T}(W)=\frac{1}{T}\sum_{t=1}^{T}d_{IS}(v_{t},Wh_{t})\,, (5)

where (v_{1},v_{2},\dots,v_{t},\dots) is an infinite sequence of data points, and the sequence (h_{1},\dots,h_{t},\dots) is such that h_{t} minimizes d_{IS}(v_{t},W^{(t)}h). Then we may show that the modified sequence of updates corresponds to minimizing the following auxiliary function:

\hat{L}_{T}(W)=\sum_{k}\sum_{f}\left(A^{(T)}_{fk}\frac{1}{W_{fk}}+B^{(T)}_{fk}W_{fk}\right)+c\,. (6)

If T is fixed, this problem is exactly equivalent to IS-NMF on a finite training set. Whereas in the batch algorithm described in Section 2.1 the whole of H is updated once and then the whole of W, in online NMF each new h_{t} is estimated exactly and then W is updated once. Another way to see it is that in standard NMF the auxiliary function is updated at each pass through the whole data set from the most recent updates of H, whereas in online NMF the auxiliary function takes into account all updates starting from the first one.

Extensions

Prior information on H or W is often useful for imposing structure in the factorization (Lefèvre et al., 2011; Virtanen, 2007; Smaragdis et al., 2007). Our framework for online NMF easily accommodates penalties such as:

  • Penalties depending on the dictionary W only.

  • Penalties on H that are decomposable and expressed in terms of a concave increasing function \psi (Lefèvre et al., 2011): \Psi(H)=\sum_{n=1}^{N}\psi(\sum_{k}H_{kn}).

2.3 Practical online NMF

Algorithm 1 Online Algorithm for IS-NMF
Input: training set, W^{(0)}, A^{(0)}, B^{(0)}, \rho, \beta, \eta, \varepsilon.
t \leftarrow 0
repeat
    t \leftarrow t+1
    draw v_{t} from the training set
    h_{t} \leftarrow \arg\min_{h} d_{IS}(\varepsilon+v_{t},\varepsilon+Wh)
    a^{(t)} \leftarrow \left(\frac{\varepsilon+v_{t}}{(\varepsilon+Wh_{t})^{2}}h_{t}^{\top}\right)\cdot W^{2}
    b^{(t)} \leftarrow \frac{1}{\varepsilon+Wh_{t}}h_{t}^{\top}
    if t \equiv 0 \pmod{\beta} then
      A^{(t)} \leftarrow \rho\,A^{(t-\beta)}+\sum_{s=t-\beta+1}^{t}a^{(s)}
      B^{(t)} \leftarrow \rho\,B^{(t-\beta)}+\sum_{s=t-\beta+1}^{t}b^{(s)}
      W^{(t)} \leftarrow \sqrt{A^{(t)}/B^{(t)}}
      for k=1\dots K
        s \leftarrow \sum_{f}W_{fk}, \quad W_{fk} \leftarrow W_{fk}/s
        A_{fk} \leftarrow A_{fk}/s, \quad B_{fk} \leftarrow B_{fk}\times s
      end for
    end if
until \|W^{(t)}-W^{(t-1)}\|_{F}<\eta

We have described a pure version of online NMF; we now discuss several extensions that are commonly used in online algorithms and allow for considerable gains in speed.

Finite data sets.

When working on finite training sets, we cycle over the training set several times, and randomly permute the samples at each cycle.

Sampling method for infinite data sets.

When dealing with large (or infinite) training sets, samples may be drawn in batches and then permuted at random to avoid local correlations of the input.

Fresh or warm restarts.

Minimizing d_{IS}(v_{t},Wh_{t}) over h_{t} is an inner loop in our algorithm. Finding an exact solution h_{t} for each new sample may be costly (a rule of thumb is 100 iterations from a random starting point). A shortcut is to stop the inner loop before convergence, which amounts to computing only an upper bound of d_{IS}(v_{t},Wh_{t}). Another shortcut is to warm-restart the inner loop, at the cost of keeping all the most recent regression weights H=(h_{1},\dots,h_{N}) in memory. For small data sets, this allows online NMF to be run very similarly to batch NMF: each time a sample is visited, h_{t} is updated only once, and then W is updated. When using warm restarts, the time complexity of the algorithm is unchanged, but the memory requirement becomes O((F+N)K).
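As an illustration of the warm-restart variant, here is a minimal sketch (the names are ours, and the square-root multiplicative update mirrors the one used for W above): the most recent coefficients of each sample are stored in a K x N matrix and reused as the starting point, typically with a single update per visit.

import numpy as np

def warm_restart_h(v, W, h_prev, eps=1e-12, n_iter=1):
    """A few multiplicative updates for h, initialized at the value kept from the previous visit.
    Storing all h_n costs O(KN) extra memory, hence the O((F + N)K) total mentioned above."""
    h = h_prev.copy()
    for _ in range(n_iter):
        Wh = eps + W @ h
        h *= np.sqrt((W.T @ ((eps + v) / Wh ** 2)) / (W.T @ (1.0 / Wh)))
    return h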

Mini-batch.

Updating W every time a sample is drawn costs O(FK): as shown in the simulations, we may save some time by updating W only every \beta samples, i.e., by drawing samples in batches and then updating W. This also helps stabilize the updates.

Scaling past data.

In order to speed up the online algorithm, it is possible to scale past information so that newer information is given more importance:

\begin{array}{rl}A^{(t+\beta)}&=\rho\,A^{(t)}+\sum_{s=t+1}^{t+\beta}a^{(s)}\,,\\[2mm] B^{(t+\beta)}&=\rho\,B^{(t)}+\sum_{s=t+1}^{t+\beta}b^{(s)}\,,\end{array} (7)

where we choose \rho=r^{\beta/N}. This particular form ensures that \rho\rightarrow 1 when N\rightarrow+\infty. Moreover, the exponent is proportional to \beta so that after one full pass over the data (N/\beta dictionary updates) past statistics are discounted by \rho^{N/\beta}=r, which makes performance comparable across mini-batch sizes for the same parameter r. In principle, this rescaling of past information amounts to discounting older samples geometrically, thus replacing the objective function in Eq. (5) by:

\frac{1}{\sum_{t=1}^{T}r^{t}}\sum_{t=1}^{T}r^{T+1-t}\,l(v_{t},W)\,, (8)

where l(v_{t},W) denotes the fit term d_{IS}(v_{t},Wh_{t}) of Eq. (5).

Rescaling W.

In order to avoid the scaling ambiguity, each time W is updated we rescale W^{(t)} so that its columns have unit norm; A^{(t)} and B^{(t)} must be rescaled accordingly (as well as H when using warm restarts). This does not change the result and avoids numerical instabilities when computing the product WH.

Dealing with small amplitude values.

The Itakura-Saito divergence d_{IS}(y,x) is badly behaved when either y=0 or x=0. As a remedy, we replace it in our algorithm by d_{IS}(\varepsilon+y,\varepsilon+x). The updates in Algorithm 1 are modified accordingly.

Overview.

Algorithm 1 summarizes our procedure. The two parameters of interest are the mini-batch size \beta and the forgetting factor r. Note that when \beta=N and r=0, the online algorithm is equivalent to the batch algorithm.
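To make the whole procedure concrete, here is a minimal NumPy sketch of Algorithm 1 with the extensions above (fresh restarts, mini-batch size \beta, forgetting factor \rho=r^{\beta/N} applied to past statistics, column rescaling, and the \varepsilon offset). All names are ours, the initialization is simplified, and a fixed number of passes replaces the stopping test on \|W^{(t)}-W^{(t-1)}\|_{F}; it is a sketch under these assumptions, not the authors' released code.

import numpy as np

def solve_h(v, W, eps=1e-12, n_iter=100, rng=None):
    """Inner loop (fresh restart): multiplicative updates for min_h d_IS(eps + v, eps + W h)."""
    rng = np.random.default_rng() if rng is None else rng
    h = rng.random(W.shape[1])
    for _ in range(n_iter):
        Wh = eps + W @ h
        h *= np.sqrt((W.T @ ((eps + v) / Wh ** 2)) / (W.T @ (1.0 / Wh)))
    return h

def online_is_nmf(V, K, r=0.7, beta=1000, n_passes=1, eps=1e-12, inner_iter=100, seed=0):
    """Mini-batch online IS-NMF with a forgetting factor (sketch of Algorithm 1).
    V is an F x N nonnegative power spectrogram; returns the learned dictionary W."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = V[:, rng.choice(N, size=K, replace=False)] + eps   # initialize from random frames
    W /= W.sum(axis=0)
    A = np.zeros((F, K))
    B = np.zeros((F, K))
    a_acc = np.zeros((F, K))                                # mini-batch accumulators
    b_acc = np.zeros((F, K))
    rho = r ** (beta / N)                                   # per-mini-batch forgetting factor
    t = 0
    for _ in range(n_passes):
        for n in rng.permutation(N):                        # random order at each pass
            t += 1
            v = V[:, n]
            h = solve_h(v, W, eps=eps, n_iter=inner_iter, rng=rng)
            Wh = eps + W @ h
            a_acc += np.outer((eps + v) / Wh ** 2, h) * W ** 2
            b_acc += np.outer(1.0 / Wh, h)
            if t % beta == 0:                               # dictionary update every beta samples
                A = rho * A + a_acc                         # discount past statistics
                B = rho * B + b_acc
                a_acc[:] = 0.0
                b_acc[:] = 0.0
                W = np.sqrt(A / B)
                s = W.sum(axis=0)                           # rescale columns of W to sum to one
                W /= s
                A /= s
                B *= s
    return W

With \beta=N and r=0, this sketch performs one dictionary update per pass using only the statistics of that pass, which matches the equivalence with batch NMF noted above.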

3 Experimental study

In this section we validate the online algorithm and compare it with its batch counterpart. A natural criterion is to train both on the same data with the same initial parameters W^{(0)} (and H^{(0)} when applicable) and compare their respective fits to a held-out test set, as a function of the computer time available for learning. The input data are power spectrograms extracted from single-channel audio tracks, with analysis windows of 512 samples and 256 samples of overlap. All silent frames were discarded.
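For reference, the following sketch shows how such input features can be computed with SciPy under the stated settings (512-sample windows, 256-sample overlap); the -60 dB silence threshold is our own illustrative choice, not a value from the paper.

import numpy as np
from scipy.signal import stft

def power_spectrogram(x, fs, n_fft=512, overlap=256, silence_db=-60.0):
    """Power spectrogram with 512-sample analysis windows and 256-sample overlap,
    with (approximately) silent frames removed."""
    _, _, Z = stft(x, fs=fs, nperseg=n_fft, noverlap=overlap)
    V = np.abs(Z) ** 2                        # F x N power spectrogram
    frame_power = V.sum(axis=0)
    keep = frame_power > frame_power.max() * 10 ** (silence_db / 10)
    return V[:, keep]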

We make the comparison for small, medium, and large audio tracks (resp. 10^{3}, 10^{4} and 10^{5} time windows). W is initialized with random samples from the training set. For each process, several seeds were tried, and the best seed (in terms of objective function value) is shown. Finally, we use \varepsilon=10^{-12}, which is well below the hearing threshold.

Small data set (30 seconds).

We ran online NMF with warm restarts and one update of h per sample. From Figure 1, we can see that there is a restriction on the values of (\beta,r) that we can use: if r<1, then \beta should be chosen larger than 1. On the other hand, as long as r>0.5, the stability of the algorithm is not affected by the value of \beta. In terms of speed, setting r<1 is clearly crucial for the online algorithm to compete with its batch counterpart. There is then a tradeoff in \beta: it should be picked larger than 1 to avoid instabilities, and smaller than the size of the training set for faster learning (this was also shown in (Mairal et al., 2010) for the square loss).

Figure 1: Comparison of the online and batch algorithms on a thirty-second audio track.

Medium data set (4 minutes).

We ran online NMF with warm restarts and one update of h per sample. The same remarks apply as before; moreover, Figure 2 shows that the online algorithm outperforms its batch counterpart by several orders of magnitude in terms of computer time for a wide range of parameter values.

Figure 2: Comparison of the online and batch algorithms on a three-minute audio track.

Large data set (1 hour 20 minutes).

For the large data set, we use fresh restarts and 100 updates of h for every sample. Since batch NMF no longer fits into memory, we compare online NMF with batch NMF learnt on a subset of the training set. In Figure 3, we see that running online NMF on the whole training set yields a more accurate dictionary in a fraction of the time that batch NMF takes to run on a subset of the training set. We stress that we used fresh restarts, so that there is no need to store H offline.

Figure 3: Comparison of the online and batch algorithms on an album of Django Reinhardt (1 hour 20 minutes).

Summary.

The online algorithm we propose is stable under minimal restrictions on the values of the parameters (r,\beta): if r=1, then any value of \beta is stable; if r<1, then \beta should be chosen large enough. There is clearly a tradeoff in choosing the mini-batch size \beta, which is explained by the way the algorithm works: when \beta is small, frequent updates of W are an additional cost compared with batch NMF, but a mini-batch much smaller than the training set lets us take advantage of its redundancy. From our experiments, we find that choosing r=0.7 and \beta=10^{3} yields satisfactory performance.
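As a usage note, under the assumptions of the online_is_nmf sketch given at the end of Section 2 (and with synthetic data standing in for a real spectrogram), the recommended settings translate into something like:

import numpy as np

# Synthetic stand-in for a power spectrogram: F = 257 frequency bins, N = 10^4 frames.
rng = np.random.default_rng(0)
V = rng.gamma(shape=1.0, scale=1.0, size=(257, 10_000))

# Settings found satisfactory in the experiments: r = 0.7, mini-batch size beta = 10^3.
W = online_is_nmf(V, K=50, r=0.7, beta=1000, eps=1e-12)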

4 Conclusion

In this paper we make several contributions: we provide an algorithm for online IS-NMF with O(FK) time and memory complexity for dictionary updates; we propose several extensions that speed up online NMF and summarize them in a concise algorithm (code will be made available soon); and we show that online NMF competes with its batch counterpart on small data sets, while on large data sets it outperforms it by several orders of magnitude. In a pure online setting, data samples are processed only once, with constant time and memory cost. Thus, online NMF algorithms may be run on data sets of potentially infinite size, which opens up many possibilities for audio source separation.

References

  • Bucak and Gunsel [2009] S. Bucak and B. Gunsel. Incremental subspace learning via non-negative matrix factorization. Pattern Recognition, 42(5):788–797, 2009.
  • Févotte and Idier [in press] C. Févotte and J. Idier. Algorithms for nonnegative matrix factorization with the beta-divergence. Neural Computation, in press.
  • Févotte et al. [2009] C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Comput., 21(3):793–830, 2009.
  • Hoffman et al. [2010] M. D. Hoffman, D. M. Blei, and F. Bach. Online learning for latent Dirichlet allocation. In Neural Information Processing Systems, 2010.
  • Lefèvre et al. [2011] A. Lefèvre, F. Bach, and C. Févotte. Itakura-Saito nonnegative matrix factorization with group sparsity. In Proc. Int. Conf. on Acous., Speech, and Sig. Proc. (ICASSP), 2011.
  • Mairal et al. [2010] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
  • Smaragdis et al. [2007] P. Smaragdis, B. Raj, and M.V. Shashanka. Supervised and semi-supervised separation of sounds from single-channel mixtures. In Proc. Int. Conf. on ICA and Signal Separation. London, UK, September 2007.
  • Virtanen et al. [2008] T. O. Virtanen, A. T. Cemgil, and S. J. Godsill. Bayesian extensions to nonnegative matrix factorisation for audio signal modelling. In Proc. of IEEE ICASSP, Las Vegas, 2008. IEEE.
  • Virtanen [2007] T.O. Virtanen. Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. on Audio, Speech, and Lang. Proc., 15(3), March 2007.