
Frame Theory for Signal Processing in Psychoacoustics

Peter Balazs, Nicki Holighaus, Thibaud Necciari, and Diana Stoeva. Acoustics Research Institute, Austrian Academy of Sciences, Wohllebengasse 12-14, 1040 Wien, Austria, e-mail: peter.balazs@oeaw.ac.at, nicki.holighaus@oeaw.ac.at, thibaud.necciari@oeaw.ac.at, diana.stoeva@oeaw.ac.at
Abstract.

This review chapter aims to strengthen the link between frame theory and signal processing tasks in psychoacoustics. On the one hand, the basic concepts of frame theory are presented, and selected proofs are provided to explain those concepts in detail. The goal is to reveal to hearing scientists how this mathematical theory could be relevant for their research. In particular, we focus on frame theory from a filter bank point of view, which is probably the most relevant one for audio signal processing. On the other hand, basic psychoacoustic concepts are presented to stimulate mathematicians to apply their knowledge in this field.

1. Introduction

In the fields of audio signal processing and hearing research, continuous research efforts are dedicated to the development of optimal representations of sound signals, suited for particular applications. However, each application and each of these two disciplines has specific requirements with respect to optimality of the transform.

For researchers in audio signal processing, an optimal signal representation should allow one to extract, process, and re-synthesize relevant information, and avoid any useless inflation of the data, while at the same time being easily interpretable. In addition, although not a formal requirement, since most audio signals are targeted at humans, the representation should take human auditory perception into account. Common tools used in signal processing are linear time-frequency analysis methods that are mostly implemented as filter banks.

For hearing scientists, an optimal signal representation should allow the extraction of the perceptually relevant information in order to better understand sound perception. In other terms, the representation should reflect the peripheral “internal” representation of sounds in the human auditory system. The tools used in hearing research are computational models of the auditory system. Those models come in various flavors, but their initial steps in the analysis process usually consist of several parallel bandpass filters followed by one or more nonlinear and signal-dependent processing stages. The first stage, implemented as a (linear) filter bank, aims to account for the spectro-temporal analysis performed in the cochlea. The subsequent nonlinear stages aim to account for the various nonlinearities that occur in the periphery (e.g. cochlear compression) and at more central processing stages of the nervous system (e.g. neural adaptation). A popular auditory model, for instance, is the compressive gammachirp filter bank (see Sec. 2.2). In this model, a linear prototype filter is followed by a nonlinear and level-dependent compensation filter to account for cochlear compression. Because auditory models are mostly intended as perceptual analysis tools, they do not feature a synthesis stage, i.e. they are not necessarily invertible. Note that a few models do allow for an approximate reconstruction, though.

It becomes clear that filter banks play a central role in hearing research and audio signal processing alike, although the requirements of the two disciplines differ. This divergence of requirements, in particular the need for signal-dependent nonlinear processing in auditory models, may contrast with the needs of signal processing applications. But even within each of those fields, demands on the properties of transforms are diverse, as is evident from the many already existing methods. Therefore, it can be expected that the perfect signal representation, i.e. one that would have all desired properties for arbitrary applications in one or even both fields, does not exist.

This manuscript demonstrates how frame theory can be considered a particularly useful conceptual background for scientists in both hearing and audio processing, and presents some first motivating applications. Frames provide the following general properties: perfect reconstruction, stability, redundancy, and a signal-independent, linear inversion procedure. In particular, frame theory can be used to analyze any filter bank, thereby providing useful insight into its structure and properties. In practice, if a filter bank construction (i.e. including both the analysis and synthesis filter banks) satisfies the frame condition (see Sec. 4), it benefits from all the frame properties mentioned above. Why are those properties essential to researchers in audio signal processing and hearing science?

Perfect reconstruction property: With the possible exception of frequencies outside the audible range, a non-adaptive analysis filter bank, i.e. one that is general and not signal-dependent, has no means of determining and extracting exactly the perceptually relevant information. For such an extraction, signal-dependent information would be crucial. Therefore, the only way to ensure that a linear, signal-independent analysis stage (as given by any fixed analysis filter bank), possibly followed by a nonlinear processing stage, captures all perceptually relevant signal components is to ensure that it does not lose any information at all. This, in fact, is equivalent to being perfectly invertible, i.e. having the perfect reconstruction property. Thus, this property benefits the user even when reconstruction is not intended per se. Note that in general “being perfectly invertible” does not necessarily imply that a concrete inversion procedure is known. In the frame case, however, a constructive method exists.

Stability: For sound processing, stability is essential in the following sense: for the analysis stage, when two signals are similar (i.e., their difference is small), the difference between their corresponding analysis coefficients should also be small. For the synthesis stage, a signal reconstructed from slightly distorted coefficients should be relatively close to the original signal, that is, the one reconstructed from undistorted coefficients. From an energy point of view, signals which are similar in energy should provide analysis coefficients whose energy is also similar, so that the respective energies remain roughly proportional. In particular, considering a signal mixture, the combination of stability and linearity ensures that every signal component is represented and weighted according to its original energy. In other terms, individual signal components are represented proportionally to their energy, which is very important for, e.g., visualization. Even in a perceptual analysis, where inaudible components should not be visualized equally to audible components having the same energy, this stability property is important. To illustrate this, recall that the nonlinear post-processing stages in auditory models are signal-dependent. That is, even inaudible information can be essential to properly characterize the nonlinearity. For instance, consider again the setup of the compressive gammachirp model, where an intermediate representation is obtained through the application of a linear analysis filter bank to the input signal. The result of this linear transform determines the shape of the subsequent nonlinear compensation filter. Note that the whole intermediate representation is used. Consequently, the proper estimation of the nonlinearity crucially relies on the signal representation being accurate, i.e. on all signal components being represented and appropriately weighted. This accuracy comes for free if the analysis filter bank forms a frame.

Signal-independent, linear inversion: A consistent (i.e. signal-independent) inversion procedure is of great benefit in signal processing applications. It implies that a single algorithm/implementation can perform all the necessary synthesis tasks. For nonlinear representations, finding a signal-independent procedure which provides a stable reconstruction is a highly nontrivial affair, if it is at all possible. With linear representations, such a procedure is easier to determine and this can be seen as an advantage of the linearity. The linearity provided by the reconstruction algorithm also significantly simplifies separation tasks. In a linear representation, a separation in the coefficient (time-frequency) domain, i.e. before synthesis, is equivalent to a separation in the signal domain. Such a property is highly relevant, for instance, to computational auditory scene analysis systems that, to some extent, are sound source separators (see Sec. 2.4).

Redundancy: Representations which are sampled at critical density are often unsuitable for visualization, since they lead to a low resolution, which may lead to many distinct signal components being integrated into a single coefficient of the transform. Thus, the individual coefficients may contain information from a lot of different sources, which makes them hard to interpret. Still, the whole set of coefficients captures all the desired signal information if (and only if) the transform is invertible. Redundancy provides higher resolution and so components that are separated in time or in frequency can be separated in the transform domain. Furthermore, redundant representations are smoother and therefore easier to read than their critically sampled counterparts.

Moreover, redundant representations provide some resistance against noise and errors, in contrast to non-redundant systems, where distortions cannot be compensated for. This property is exploited in de-noising approaches. In particular, if a signal is synthesized in a straightforward way from noisy (redundant) coefficients, the synthesis process has the tendency to reduce the energy of the noise, i.e. there is some noise cancellation.
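This noise-cancellation effect is easy to observe numerically. The following minimal example (our own illustration, with arbitrary parameters) uses an oversampled DFT frame, whose canonical dual synthesis is implemented by the inverse FFT; the reconstruction error from noisy coefficients shrinks by roughly the redundancy factor:

```python
import numpy as np

rng = np.random.default_rng(0)
n, redundancy = 256, 4        # signal length and oversampling factor (arbitrary)
N = n * redundancy            # number of frame coefficients
sigma = 0.1                   # noise level per coefficient

x = rng.standard_normal(n)

# Analysis: N-point DFT of the zero-padded signal. The rows of the
# oversampled DFT matrix form a tight frame for C^n, and the inverse
# FFT implements the canonical dual synthesis.
coeffs = np.fft.fft(x, N)
assert np.allclose(np.fft.ifft(coeffs)[:n].real, x)   # perfect reconstruction

def recon_error(num_coeffs, trials=200):
    """Mean reconstruction-error energy when complex white noise of
    variance sigma^2 is added to each analysis coefficient."""
    c = np.fft.fft(x, num_coeffs)
    total = 0.0
    for _ in range(trials):
        e = sigma * (rng.standard_normal(num_coeffs)
                     + 1j * rng.standard_normal(num_coeffs)) / np.sqrt(2)
        total += np.sum(np.abs(np.fft.ifft(c + e)[:n] - x) ** 2)
    return total / trials

# Redundant synthesis averages the noise down by the redundancy factor.
ratio = recon_error(N) / recon_error(n)
print(ratio)   # close to 1 / redundancy = 0.25
```

The critically sampled case (second call) suffers the full noise energy, while the redundant synthesis projects the noise onto the (much smaller) signal space.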

Besides the above properties, which are direct consequences of the frame inequalities, the generality of frame theory enables the consideration of additional important properties. In the setting of perceptually motivated audio signal analysis and processing, these include:

Perceptual relevance: We have stressed that the only way to ensure that all perceptually relevant information is kept is to accurately capture all the information by using a stable and perfectly invertible system for analysis. However, in an auditory model or in perceptually motivated signal processing, perceptually irrelevant components should be discarded at some point. If only a linear signal processing framework is desired, this can be achieved by applying a perceptual weighting (different frequency ranges are given varying importance in the auditory system) and a masking model, see Sec. 2. If a nonlinear auditory model like the compressive gammachirp filter bank is used, recall that the nonlinear stage is mostly determined by the coefficients at the output of the linear stage. Therefore, all information should be kept up to the nonlinear stage. In other words, discarding information already in the analysis stage might falsify the estimation of the nonlinear stage, thereby resulting in an incorrect perceptual analysis. We want to stress here the importance of being able to selectively discard unnecessary information, in contrast to information being involuntarily lost during the analysis and/or synthesis procedures.

A flexible signal processing framework: All stable and invertible filter banks form frames and therefore benefit from the frame properties discussed above. In addition, using filter banks that are frames allows for flexibility. For instance, one can gradually tune properties of the signal representation, such as the time-frequency resolution, the analysis filters’ shape and bandwidth, the frequency scale, or the sampling density, while at the same time retaining the crucial frame properties. It can be tremendously useful to have a single, adaptable framework that allows one to switch model parameters and/or transition between them. By staying in the common general setting of filter bank frames, the linear filter bank analysis in an auditory model or signal processing scheme can be seen as an exchangeable, practically self-contained block in the scheme. Thus, the filter bank parameters, e.g. those mentioned before, can be tuned by scientists according to their preference, without the need to redesign the remainder of the model/scheme. Such a common background leads to results being more comparable across research projects and thus benefits not only the individual researcher, but the whole field. Two main advantages of a common background are the following: first, the properties and parameters of various models can be easily interpreted and compared across contributions; second, when a linear model is adapted to obtain a nonlinear model, the new model parameters remain interpretable.

Ease of integration: Filter banks are already a common tool in both hearing science and signal processing. Integrating a filter bank frame into an existing analysis/processing framework will often only require minor modifications of existing approaches. Thus, frames provide a theoretically sound foundation without the need to fundamentally re-design the remainder of your analysis (or processing) framework.

In some cases, you might already implicitly use frames without knowing it. In that case, we provide here the conceptual background necessary to unlock the full potential of your method.

The rest of this chapter is organized as follows: In Section 2, we provide basic information about the human auditory system and introduce some psychoacoustic concepts. In Section 3 we present the basics of frame theory providing the main definitions and a few crucial mathematical statements. In Section 4 we provide some details on filter bank frames. The chapter concludes with Section 5 where some examples are given for the application of frame theory to signal processing in psychoacoustics.

2. The auditory analysis of sounds

This section provides a brief introduction to the human auditory system. Important concepts that are relevant to the problems treated in this chapter are then introduced, namely auditory filtering and auditory masking. For a more complete description of the hearing organ, the interested reader is referred to e.g. [32, 73].

2.1. Ear’s anatomy

The human ear is a very sensitive and complex organ whose function is to transform pressure variations in the air into the percept of sound. To do so, sound waves must be converted into a form interpretable by the brain, specifically into neural action potentials. Fig. 1 shows a simplified view of the ear’s anatomy. Incoming sound waves are guided by the pinna into the ear canal and cause the eardrum to vibrate. Eardrum vibrations are then transmitted to the cochlea by three tiny bones that constitute the ossicular chain: the malleus, incus, and stapes. The ossicular chain acts as an impedance matcher. Its function is to ensure efficient transmission of pressure variations in the air into pressure variations in the fluids present in the cochlea. The cochlea is the most important part of the auditory system because it is where pressure variations are converted into neural action potentials.

The cochlea is a rolled-up tube filled with fluids and divided along its length by two membranes, the Reissner’s membrane and the basilar membrane (BM). A schematic view of the unrolled cochlea is shown in Fig. 1 (the Reissner’s membrane is not represented). It is the response of the BM to pressure variations transmitted through the ossicular chain that is of primary importance. Because the mechanical properties of the BM vary along its length (precisely, there is a gradation of stiffness from base to apex), BM stimulation results in a complex movement of the membrane. In the case of a sinusoidal stimulation, this movement is described as a traveling wave. The position of the peak in the pattern of vibration depends on the frequency of the stimulation. High-frequency sounds produce maximum displacement of the BM near the base with little movement on the rest of the membrane. Low-frequency sounds rather produce a pattern of vibration which extends all the way along the BM but reaches a maximum before the apex. The frequency that gives the maximum response at a particular point on the BM is called the “characteristic frequency” (CF) of that point. In the case of a broadband stimulation (e.g. an impulsive sound like a click), all points on the BM will oscillate. In short, the BM separates out the spectral components of a sound, similarly to a Fourier analyzer.

The last step of peripheral processing is the conversion of BM vibrations into neural action potentials. This is achieved by the inner hair cells that sit on top of the BM. There are about 3500 inner hair cells along the length of the cochlea (approximately 35 mm in humans). The tip of each cell is covered with sensor hairs called stereocilia. The base of each cell directly connects to auditory nerve fibers. When the BM vibrates, the stereocilia are set in motion, which results in a bio-electrical process in the inner hair cells and, finally, in the initiation of action potentials in auditory nerve fibers. Those action potentials are then coded in the auditory nerve and conveyed to the central system where they are further processed to end up in a sound percept. Because the response of auditory nerve fibers is also frequency specific and the action potentials vary over time, the “internal representation” of a sound signal in the auditory nerve can be likened to a time-frequency representation.

Figure 1. Anatomy of the human ear with a schematic view of the unrolled cochlea. Adapted from [52].

2.2. The auditory filters concept

Because of the frequency-to-place transformation (also called tonotopic organization) in the cochlea, and the transmission of time-dependent neural signals, the BM can be modeled in a first linear approximation as a bank of overlapping bandpass filters, named “critical bands” or “auditory filters”. The center frequencies and bandwidths of the auditory filters approximate, respectively, the CFs and widths of excitation on the BM. Notably, the width of excitation also depends on level: patterns become wider and asymmetric as sound level increases (e.g. [37]). Several auditory filter models have been proposed based on the results of psychoacoustic experiments on masking (see e.g. [59] and Sec. 2.3). A popular auditory filter model is the gammatone filter [71] (see Fig. 2). Although gammatone filters do not capture the level dependency of the actual auditory filters, their ease of implementation has made them popular in audio signal processing (e.g. [90, 96]). More realistic auditory filter models are, for instance, the roex and gammachirp filters [37, 88]. Other level-dependent and more complex auditory filter banks include, for example, the dual resonance non-linear filter bank [58] and the dynamic compressive gammachirp filter bank [49]. The two approaches in [58, 49] feature a linear filter bank followed by a signal-dependent nonlinear stage. As mentioned in the introduction, this is a particular way of describing a nonlinear system by modifying a linear system. Finally, it is worth noting that besides psychoacoustically driven auditory models, mathematically founded models of the auditory periphery have been proposed. Those include, for instance, the wavelet auditory model [12] and the “EarWig” time-frequency distribution [67].

Figure 2. A popular auditory filter model: the gammatone filter bank. The magnitude responses (in dB) of 16 gammatone filters in the frequency range 300-8000 Hz are represented on a linear frequency scale.
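For illustration, the gammatone impulse response g(t) = t^{n-1} e^{-2\pi b\,BW_{\mathrm{ERB}}(\xi_c)t}\cos(2\pi\xi_c t) is simple to generate directly, which partly explains the model's popularity. The following sketch (with the standard choices n = 4 and b = 1.019; the sampling rate, duration, normalization, and the logarithmic spacing of the center frequencies are our own simplifications, not the exact design of Fig. 2) produces a small bank of impulse responses:

```python
import numpy as np

def erb(fc):
    # Equivalent rectangular bandwidth in Hz, cf. Eq. (2)
    return 24.7 + fc / 9.265

def gammatone_ir(fc, fs, order=4, b=1.019, dur=0.05):
    """Impulse response of a gammatone filter with center frequency fc (Hz),
    sampled at fs (Hz); order=4 and b=1.019 are the standard choices."""
    t = np.arange(int(dur * fs)) / fs
    g = (t ** (order - 1)
         * np.exp(-2 * np.pi * b * erb(fc) * t)
         * np.cos(2 * np.pi * fc * t))
    return g / np.max(np.abs(g))      # crude peak normalization for display

fs = 32000
centers = np.geomspace(300.0, 8000.0, 16)        # 16 filters, 300-8000 Hz
bank = [gammatone_ir(fc, fs) for fc in centers]
```

Filtering a signal with each impulse response (e.g. by convolution) then yields a simple linear auditory analysis.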

The bandwidth of the auditory filters has been determined based on psychoacoustic experiments. The estimation of bandwidth based on loudness perception experiments gave rise to the concept of Bark bandwidth defined by [98]

(1)  BW_{\mathrm{Bark}} = 25 + 75\left(1 + 1.4\times 10^{-6}\,\xi^{2}\right)^{0.69}

where \xi denotes the frequency and BW the bandwidth, both in Hz. Another popular concept is the equivalent rectangular bandwidth (ERB), that is, the bandwidth of a rectangular filter having the same peak output and energy as the auditory filter. The estimates of ERB are based on masking experiments. The ERB is given by [37]

(2)  BW_{\mathrm{ERB}} = 24.7 + \frac{\xi}{9.265}\,.

BW_{\mathrm{Bark}} and BW_{\mathrm{ERB}} are commonly used in psychoacoustics and signal processing to approximate the auditory spectral resolution at low to moderate sound pressure levels (i.e. 30–70 dB), where the auditory filters’ shape remains symmetric and constant. See for example [37, 88] for the variation of BW_{\mathrm{ERB}} with level.
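Both bandwidth formulas are straightforward to evaluate; the following lines merely transcribe (1) and (2):

```python
def bw_bark(xi):
    # Eq. (1): critical bandwidth in Hz, xi in Hz
    return 25 + 75 * (1 + 1.4e-6 * xi ** 2) ** 0.69

def bw_erb(xi):
    # Eq. (2): equivalent rectangular bandwidth in Hz, xi in Hz
    return 24.7 + xi / 9.265

# The two estimates differ noticeably, e.g. at 1 kHz:
print(bw_bark(1000.0), bw_erb(1000.0))   # roughly 162 Hz vs. 133 Hz
```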

Based on the concepts of Bark and ERB bandwidths, corresponding frequency scales have been proposed to represent and analyze data on a scale related to perception. To describe the different mappings between the linear frequency domain and the nonlinear perceptual domain, we introduce the function F_{\mathrm{AUD}}:\xi\rightarrow\mathrm{AUD}, where \mathrm{AUD} is an auditory unit that depends on the scale. The Bark scale is [98]

(3)  F_{\mathrm{Bark}}(\xi) = 13\arctan(0.00076\,\xi) + 3.5\arctan\left((\xi/7500)^{2}\right)

and the ERB scale is [37]

(4)  F_{\mathrm{ERB}}(\xi) = 9.265\,\ln\left(1 + \frac{\xi}{228.8455}\right)\,.

Both auditory scales are connected to the ear’s anatomy: one AUD unit corresponds to a constant distance along the BM. 1 \mathrm{Bark} corresponds to 1.3 mm [32], while 1 \mathrm{ERB} corresponds to 0.9 mm [37, 38].
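The scale formulas (3) and (4) can be transcribed just as directly. Since F_{\mathrm{ERB}} is invertible in closed form, it can also be used to place filter center frequencies uniformly on the ERB scale, as is common in auditory filter bank design (the inverse below follows by solving (4) for \xi; the 16-filter, 300 Hz to 8 kHz layout is just an example):

```python
import numpy as np

def f_bark(xi):
    # Eq. (3): frequency (Hz) -> Bark
    return 13 * np.arctan(0.00076 * xi) + 3.5 * np.arctan((xi / 7500.0) ** 2)

def f_erb(xi):
    # Eq. (4): frequency (Hz) -> ERB number
    return 9.265 * np.log(1 + xi / 228.8455)

def f_erb_inv(e):
    # Inverse of Eq. (4): ERB number -> frequency (Hz)
    return 228.8455 * (np.exp(e / 9.265) - 1)

# Center frequencies of 16 filters equally spaced on the ERB scale
# between 300 Hz and 8 kHz:
centers = f_erb_inv(np.linspace(f_erb(300.0), f_erb(8000.0), 16))
```

A frequency axis warped by f_erb (or f_bark) is thus approximately uniformly spaced in cochlear position.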

2.3. Auditory masking

The phenomenon of masking is highly related to the spectro-temporal resolution of the ear and has been the focus of many psychoacoustic studies over the last 70 years. Auditory masking refers to the increase in the detection threshold of a sound signal (referred to as the “target”) due to the presence of another sound (the “masker”). Masking is quantified by measuring the detection thresholds of the target in the presence and absence of the masker; the difference in thresholds (in dB) then corresponds to the amount of masking. In the literature, masking has been extensively investigated in the spectral or temporal domain. The results were used to develop models of spectral or temporal masking that are currently implemented in audio applications like perceptual coding (e.g. [70, 76]) or sound processing (e.g. [9, 41]). Only a few studies investigated masking in the joint time-frequency domain. We present below some typical psychoacoustic results on spectral, temporal, and spectro-temporal masking. For more results and discussion on the origins of masking, the interested reader is referred to e.g. [32, 64, 62].

In the following, we denote by \xi_{\{M,T\}}, D_{\{M,T\}}, and L_{\{M,T\}} the frequency, duration, and level, respectively, of the masker or target. Those signal parameters are fixed by the experimenter, i.e. they are known. The frequency shift between masker and target is \Delta\xi = \xi_{T}-\xi_{M}, and the time shift \Delta T is defined as the onset delay between masker and target. Finally, AM denotes the amount of masking in dB.

2.3.1. Spectral masking

To study spectral masking, masker and target are presented simultaneously (since usually D_{M}>D_{T}, this is equivalent to saying that 0\leq\Delta T<D_{M}-D_{T}) and \Delta\xi is varied. There are two ways to vary \Delta\xi: either fix \xi_{T} and vary \xi_{M}, or vice versa. Similarly, one can fix L_{M} and vary L_{T}, or vice versa. In short, various types of masking curves can be obtained depending on the signal parameters. A common spectral masking curve is the masking pattern, which represents L_{T} or AM as a function of \xi_{T} or \Delta\xi (see Fig. 3). To measure masking patterns, \xi_{M} and L_{M} are fixed and AM is measured for various \Delta\xi. Under the assumption that AM(\xi_{T}) corresponds to a certain ratio of masker-to-target energy at the output of the auditory filter centered at \xi_{T}, masking patterns measure the responses of the auditory filters centered at the individual \xi_{T}s. Thus, masking patterns can be used as an indicator of the spectral spread of masking of the masker or, in other terms, of the spread of excitation of the masker on the BM. This spectral spread can in turn be used to derive a masking threshold, as used for example in audio codecs [70]. See also Sec. 5.2.

Fig. 3 shows typical masking patterns measured for narrow-band noise maskers of different levels (L_{M} = 45, 65, and 85 dB SPL, as indicated by the different lines) and frequencies (\xi_{M} = 0.25, 1, and 4 kHz, as indicated by the different vertical dashed lines). In this study, D_{M}=D_{T} = 200 ms. The masker was an 80-Hz-wide band of Gaussian noise centered at \xi_{M}. The target was also an 80-Hz-wide band of noise, centered at \xi_{T}. The main properties to be observed here are:

  (i) For a given masker (i.e. a pair of \xi_{M} and L_{M}), AM is maximum for \Delta\xi = 0 and decreases as |\Delta\xi| increases. This reflects the decay of masker excitation on the BM.

  (ii) Masking patterns broaden with increasing level. This reflects the broadening of auditory filters with increasing level [37].

  (iii) Masking patterns are broader at low than at high frequencies (see (1)-(2)). This reflects the fact that the density of auditory filters is higher at low than at high frequencies. Consequently, a masker with a given bandwidth will excite more auditory filters at low frequencies.

Figure 3. Masking patterns for narrow-band noise maskers of different levels and frequencies. L_{T} (in dB SPL) is plotted as a function of \xi_{T} (in Hz) on a logarithmic scale. The gray dotted curve indicates the threshold in quiet. The difference between any of the colored curves and the gray curve thus corresponds to AM, as indicated by the arrow. Source: mean data for listeners JA and AO in [63, Experiment 3, Figs. 5-6].

2.3.2. Temporal masking

By analogy with spectral masking, temporal masking is measured by setting \Delta\xi = 0 and varying \Delta T. Backward masking is observed for \Delta T < 0, that is, when the target precedes the masker in time. Forward masking is observed for \Delta T\geq D_{M}, that is, when the target follows the masker. Backward masking is hardly observed for \Delta T < -20 ms and is mainly thought to result from attentional effects [79, 32]. In contrast, forward masking can be observed for \Delta T up to D_{M} + 200 ms. Therefore, in the following we focus on forward masking.

Typical forward masking curves are represented in Fig. 4. The left panel shows the effect of L_{M} for \xi_{M}=\xi_{T} = 4 kHz (mean data from [51]). In this study, masker and target were sinusoids (D_{M} = 300 ms, D_{T} = 20 ms). The main features to be observed here are: (i) the temporal decay of forward masking is a linear function of \log(\Delta T), and (ii) the rate of this decay strongly depends on L_{M}. The right panel shows the effect of D_{M} for \xi_{T} = 2 kHz and L_{M} = 60 dB SPL (mean data from [97]). In this study, the masker was a pulse of uniformly masking noise (i.e. a broad-band noise producing the same AM at all frequencies in the range 0–20 kHz, see [32]). The target was a sinusoid with D_{T} = 5 ms. It can be seen that AM (i.e. the difference between the connected symbols and the star) at a given \Delta T increases with increasing D_{M}, at least for \Delta T-D_{M} < 100 ms. Finally, a comparison of the two panels in Fig. 4 for L_{M} = 60 dB indicates that, for \Delta T-D_{M}\leq 50 ms, the 300-ms sinusoidal masker (empty diamonds, left) produces more masking than the 200-ms broad-band noise masker (empty squares, right). Despite the difference in D_{M}, increasing the duration of the noise masker to 300 ms is not expected to account for the difference in AM of up to 20 dB observed here [32, 97].

Figure 4. Temporal (forward) masking curves for sinusoidal (left) and broadband noise maskers (right). L_{T} (in dB SPL) is plotted as a function of the temporal gap between masker offset and target onset, i.e. \Delta T-D_{M} (in ms), on a logarithmic scale. Left panel: masking curves for various L_{M}s and D_{M} = 300 ms (adapted from [51]). Right panel: masking curves for various D_{M}s and L_{M} = 60 dB (adapted from [97]). Stars indicate the target thresholds in quiet.

2.3.3. Time-frequency masking

Only a few studies measured spectro-temporal masking patterns, that is, with \Delta T and \Delta\xi both systematically varied (e.g. [53, 79]). Those studies mostly involved long (D_{M}\geq 100 ms) sinusoidal maskers. In other words, those studies provide data on the time-frequency spread of masking for long and narrow-band maskers. In the context of time-frequency decompositions, a set of elementary functions, or “atoms”, with good localization in the time-frequency domain (i.e. short and narrow-band) is usually chosen, see Sec. 3. To best predict masking in the time-frequency decompositions of sounds, it seems intuitive to have data on the time-frequency spread of masking for such elementary atoms, as this will provide a good match between the masking model and the sound decomposition. This has been investigated in [64]. Precisely, spectral, forward, and time-frequency masking have been measured using Gabor atoms of the form s_{i}(t)=\sin(2\pi\xi_{i}t+\pi/4)\,e^{-\pi(\Gamma t)^{2}} with \Gamma=600~\mathrm{s}^{-1} as masker and target. According to the definition of Gabor atoms in (7), the masker was defined by s_{M}(t)=\Im\{e^{{\rm i}\pi/4}g_{\xi_{M},0}\}, where \Im denotes the imaginary part, with a Gaussian window \gamma(t)=e^{-\pi(\Gamma t)^{2}} and \xi_{M} = 4 kHz. The masker level was fixed at L_{M} = 80 dB. The target was defined by s_{T}(t+\Delta T)=\Im\{e^{{\rm i}(\pi/4+2\pi\xi_{T}\Delta T)}\gamma_{\xi_{T},-\Delta T}\} with \xi_{T}=\xi_{M}+\Delta\xi. The set of time-frequency conditions measured in [64] is illustrated in Fig. 5a. Because in this particular case \xi_{T}\Delta T\in\mathbb{N}, the target term reduces to s_{T}(t+\Delta T)=\Im\{e^{{\rm i}\pi/4}\gamma_{\xi_{T},-\Delta T}\}. The mean masking data are summarized in Fig. 5b. These data, together with those collected by Laback et al. on the additivity of spectral [56] and temporal masking [55] for the same Gabor atoms, constitute a crucial basis for the development of an accurate time-frequency masking model to be used in audio applications like audio coding or audio processing (see Sec. 5).
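Such Gabor atom stimuli are easy to synthesize digitally. The following sketch (the sampling rate and the target's shift of +1 ERB bandwidth at 10 ms delay are our own example choices, not the exact conditions of [64]) generates the masker and one time-frequency shifted target:

```python
import numpy as np

fs = 48000                    # sampling rate (an assumption for this sketch)
Gamma = 600.0                 # envelope parameter, Gamma = 600 / s
t = np.arange(-0.02, 0.02, 1 / fs)   # 40 ms is ample support for this window

def gabor_atom(xi, dt=0.0):
    """Time-frequency shifted Gabor atom
    s(t) = sin(2 pi xi (t - dt) + pi/4) * exp(-pi (Gamma (t - dt))^2)."""
    tau = t - dt
    return (np.sin(2 * np.pi * xi * tau + np.pi / 4)
            * np.exp(-np.pi * (Gamma * tau) ** 2))

masker = gabor_atom(4000.0)            # xi_M = 4 kHz, Delta T = 0
# Target shifted by Delta T = 10 ms and Delta xi = 456 Hz,
# i.e. roughly +1 ERB bandwidth at 4 kHz (cf. Eq. (2)):
target = gabor_atom(4456.0, dt=0.010)
```

Up to the time-shift convention, this matches the reduced target expression given above, with the sine carrying the \pi/4 phase at the atom's center.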

(a) Experimental conditions
(b) Mean results
Figure 5. (a) Conditions measured in [64], illustrated in the time-F_{\mathrm{ERB}} plane. The gray circle symbolizes the masker atom s_{M}(t). The blue circles symbolize the target atoms s_{T}(t+\Delta T). The values of \Delta\xi were -4, -2, -1, 0, +1, +2, +4, and +6 F_{\mathrm{ERB}}. The values of \Delta T were 0, 5, 10, 20, and 30 ms. (b) Mean data interpolated based on a cubic spline fit along the time-frequency plane. The \Delta T axis was sampled at a step of 1 ms and the \Delta\xi axis at a step of 0.25 F_{\mathrm{ERB}}. For \Delta\xi coordinates outside the range of measurements, a value of AM = 0 was used.

2.4. Computational auditory scene analysis

The term auditory scene analysis (ASA), introduced by Bregman [16], refers to the perceptual organization of auditory events into auditory streams. It is assumed that this perceptual organization constitutes the basis for the remarkable ability of the auditory system to separate sound sources, especially in noisy environments. A demonstration of this ability is the so-called “cocktail party effect”, i.e. the ability to concentrate on and follow a single speaker against a highly competing background (e.g. many concurrent speakers combined with cutlery and glass sounds). The term computational auditory scene analysis (CASA) thus refers to the study of ASA by computational means [92]. The CASA problem is closely related to the problem of source separation. Generally speaking, CASA systems can be considered as perceptually motivated sound source separators. The basic workflow of a CASA system is to first compute an auditory-based time-frequency transform (most systems use a gammatone filter bank, but any auditory representation that allows reconstruction can be used, see Sec. 5.1). Second, acoustic features like periodicity, pitch, and amplitude and frequency modulations are extracted so as to build the perceptual organization (i.e. constitute the streams). Then, stream separation is achieved using so-called “time-frequency masks”. These masks are applied directly to the perceptual representation; they retain the “target” regions (mask = 1) and suppress the background (mask = 0). Such masks can be binary or real-valued, see e.g. [92, 96]. The target regions are then re-synthesized by applying the inverse transform to obtain the signal of interest. Notably, a perfect-reconstruction transform is important here. Furthermore, the linearity and stability of the transform allow a separation of the audio streams directly in the transform domain. Most gammatone filter banks implemented in CASA systems are only approximately invertible, though.
This is due to the fact that such systems implement gammatone filters in the analysis stage and their time-reversed impulse responses in the synthesis stage. For perfect reconstruction, this setting requires that the frequency response of the gammatone filter bank have an all-pass characteristic and feature no ripple (equivalently, in the frame context, that the system be tight, see Sec. 4.3). In practice, however, gammatone filter banks usually consider only a limited range of frequencies (typically the interval 0.1–4 kHz for speech processing) and the frequency response features ripples if the density of filters is not high enough. If a high density of filters is used, the audio quality of the reconstruction is rather good [85, 96]. Still, reconstruction can be made perfect by using frame theory [66]. For instance, one could render the gammatone system tight (see Proposition 2) or use its dual frame (see Sec. 3.1.2).

The use of binary masks in CASA is directly motivated by the phenomenon of auditory masking explained above. However, time-frequency masking is hardly considered in CASA systems. As a final remark, an analogy can be established between the (binary) masks used in CASA and the concept of frame multipliers defined in Sec. 3.2. Specifically, the masks used in CASA systems correspond to the symbol m in (15). This analogy is not considered in most CASA studies, though, and it offers an opportunity for future research connecting acoustics and frame multipliers.

3. Frame theory

What is an appropriate setting for the mathematical background of audio signal processing? Since real-world signals are usually considered to have finite energy and are technically represented as functions of some variable (e.g. time), it is natural to think of them as elements of the space L^{2}(\mathbb{R}). Roughly speaking, L^{2}(\mathbb{R}) contains all functions x(t) with finite energy, i.e. with \|x\|^{2}=\int_{-\infty}^{+\infty}|x(t)|^{2}\,{\rm d}t<\infty. For working with sampled signals, the appropriate analogue is the space \ell^{2}(K) (K denoting a countable index set), which consists of the sequences c=(c_{k})_{k\in K} with finite energy, i.e. \|c\|^{2}=\sum_{k\in K}|c_{k}|^{2}<\infty.

Both spaces L^{2}(\mathbb{R}) and \ell^{2}(K) are Hilbert spaces, so one may draw on the rich theory made available by the inner product, which serves as a measure of correlation and is used to define orthogonality of elements in the Hilbert space. In particular, the inner product enables the representation of all elements of \mathcal{H} in terms of their inner products with a set of reference functions. A standard approach for such representations uses orthonormal bases (ONBs), see e.g. [42]. Every separable Hilbert space \mathcal{H} has an ONB (e_{k})_{k\in K} and every element x\in\mathcal{H} can be written as

(5) x=\sum_{k\in K}\langle x,e_{k}\rangle e_{k}

with uniqueness of the coefficients \langle x,e_{k}\rangle, k\in K. The convenience of this approach is that there is a clear (and efficient) way of calculating the coefficients in the representation, using the same orthonormal sequence. Even more, the energy in the coefficient domain (i.e., the square of the \ell^{2}-norm) is exactly the energy of the element x:

(Parseval equality) \sum_{k\in K}|\langle x,e_{k}\rangle|^{2}=\|x\|^{2}.

Furthermore, the representation (5) is stable - if the coefficients (x,ek)kK(\langle x,e_{k}\rangle)_{k\in K} are slightly changed to (ak)kK2(a_{k})_{k\in K}\in\ell^{2}, one obtains an element x~=kKakek\widetilde{x}=\sum_{k\in K}a_{k}e_{k} close to the original one xx.
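In the finite-dimensional case these properties are easy to check numerically. The following sketch (Python/NumPy; illustrative, not part of the original chapter) uses the normalized DFT vectors as an ONB of \mathbb{C}^{N} and verifies the reconstruction formula (5) and the Parseval equality:

```python
import numpy as np

# The normalized DFT vectors e_k[n] = exp(2*pi*i*k*n/N)/sqrt(N) form an ONB of C^N.
N = 8
n = np.arange(N)
E = np.exp(2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)   # columns are e_k

x = np.random.default_rng(0).standard_normal(N)

coeffs = E.conj().T @ x      # analysis: <x, e_k> for all k
x_rec = E @ coeffs           # synthesis: sum_k <x, e_k> e_k, cf. (5)

print(np.allclose(x_rec, x))                                             # True
print(np.allclose(np.sum(np.abs(coeffs) ** 2), np.linalg.norm(x) ** 2))  # True (Parseval)
```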

However, the use of ONBs has several disadvantages. Often the construction of orthonormal bases satisfying given side constraints is difficult or even impossible (see below). Small perturbations of the orthonormal basis’ elements may destroy the orthonormal structure [95]. Finally, the uniqueness of the coefficients in (5) leads to a lack of exact reconstruction when some of these coefficients are lost or disturbed during transmission.

This naturally leads to the question of how the concept of ONBs could be generalized to overcome those disadvantages. As an extension of the above-mentioned Parseval equality for ONBs, one could consider inequalities instead of an equality, i.e. boundedness from above and below (see Def. 1). This leads to the concept of frames, which was introduced by Duffin and Schaeffer [29] in 1952. It took several decades for scientists to realize the importance and applicability of frames. Popularized in the 1990s in the wake of wavelet theory [27, 43, 26], frames have seen increasing interest and extensive investigation by many researchers ever since. Frame theory is both a beautiful abstract mathematical theory and a concept applicable in many other disciplines, e.g. engineering, medicine, and psychoacoustics, see Sec. 5.

Via frames, one can avoid the restrictions of ONBs while keeping their important properties. Frames still allow perfect and stable reconstruction of all the elements of the space, though the representation formulas are in general not as simple as the ones via an ONB (see Sec. 3.1.2). Compared to orthonormal bases, the frame property itself is much more stable under perturbations (see, e.g., [22, Sec. 15]). Also, in contrast to orthonormal bases, frames allow redundancy, which is desirable e.g. in signal transmission, for reconstructing signals when some coefficients are lost, and for noise reduction. Via redundant frames one has multiple representations, which allows one to choose appropriate coefficients fulfilling particular constraints, e.g. when aiming at sparse representations. Furthermore, frames can be easier and faster to construct than ONBs. Some advantageous side constraints can only be fulfilled by frames. For example, Gabor frames provide convenient and efficient signal processing tools, but good localization in both time and frequency can never be achieved if the Gabor frame is an ONB or even a Riesz basis (cf. the Balian-Low theorem, see e.g. [22, Theor. 4.1.1]), while redundant Gabor frames with this property are easily constructed (for example using the Gaussian function). See Sec. 2.3.3 for why good localization in time and frequency is important in masking experiments.

Some of the main properties of frames were already obtained in the first paper [29]. For an extensive presentation of frame theory, we refer to [18, 40, 22, 42].

In this section we collect the basics of frame theory relevant to the topic of the current paper. All the statements presented here are well known. Proofs are given just to make the paper self-contained, for convenience of the readers, and to facilitate a better understanding of the mathematical concepts. They are mostly based on [22, 29, 40]. Throughout the rest of the section, \mathcal{H} denotes a separable Hilbert space with inner product ,\langle\cdot,\cdot\rangle, Id{\rm Id}_{\mathcal{H}} - the identity operator on \mathcal{H}, KK - a countable index set, and Φ\Phi (resp. Ψ\Psi) - a sequence (ϕk)kK(\phi_{k})_{k\in K} (resp. (ψk)kK(\psi_{k})_{k\in K}) with elements from \mathcal{H}. The term operator is used for a linear mapping. Readers not familiar with Hilbert space theory can simply assume =𝐋2()\mathcal{H}=\mathbf{L}^{2}(\mathbb{R}) for the remainder of this section.

3.1. Frames: A Mathematical viewpoint

The frame concept naturally extends the Parseval equality by permitting inequalities, i.e., the ratio of the energy in the coefficient domain to the energy of the signal may be bounded from above and below instead of being necessarily one:

Definition 1.

A countable sequence Φ=(ϕk)kK\Phi=(\phi_{k})_{k\in K} is called a frame for the Hilbert space \mathcal{H} if there exist positive constants AA and BB such that

(6) A\cdot\left\|x\right\|_{\mathcal{H}}^{2}\leq\sum\limits_{k\in K}\left|\left<x,\phi_{k}\right>\right|^{2}\leq B\cdot\left\|x\right\|_{\mathcal{H}}^{2},\ \forall\ x\in\mathcal{H}.

The constant AA (resp. BB) is called a lower (resp. upper) frame bound of Φ\Phi. A frame is called tight with frame bound AA if AA is both a lower and an upper frame bound. A tight frame with bound 11 is called a Parseval frame.

Clearly, every ONB is a frame, but not vice-versa. Frames can naturally be split into two classes - the frames which still fulfill a basis-property, and the ones that do not:

Definition 2.

A frame \Phi for \mathcal{H} which is a Schauder basis for \mathcal{H} is called a Riesz basis for \mathcal{H}. (A sequence \Phi is called a Schauder basis for \mathcal{H} if every element x\in\mathcal{H} can be written as x=\sum_{k\in K}c_{k}\phi_{k} with unique coefficients (c_{k})_{k\in K}.) A frame for \mathcal{H} which is not a Schauder basis for \mathcal{H} is called redundant (also called overcomplete).

Note that Riesz bases were introduced by Bari [11] in different but equivalent ways. Riesz bases also extend ONBs, but contrary to frames, Riesz bases still have the disadvantages resulting from the basis-property, as they do not allow redundancy. For more on Riesz bases, see e.g. [95]. As an illustration of the concepts of ONBs, Riesz bases, and redundant frames in a simple setting, consider examples in the Euclidean plane, see Fig. 6.

(a) ONB (e_{1},e_{2}) for \mathbb{R}^{2}       (b) unique representation of x via (e_{1},e_{2})
(c) Riesz basis (\phi_{1},\phi_{2}) for \mathbb{R}^{2}     (d) unique representation of x via (\phi_{1},\phi_{2})
(e) frame (\phi_{1},\phi_{2},\phi_{3}) for \mathbb{R}^{2}      (f) non-unique representation of x via (\phi_{1},\phi_{2},\phi_{3})

Figure 6. Examples in 2\mathbb{R}^{2}: ONB (a,b), Riesz basis (c,d), frame (e,f)

Note that in a finite dimensional Hilbert space, considering only finite sequences, frames are precisely the complete sequences (see, e.g., [22, Sec. 1.1]), i.e., the sequences which span the whole space. However, this is not the case in infinite-dimensional Hilbert spaces - every frame is complete, but completeness is not sufficient to establish the frame property [29]. For results focused on frames in finite dimensional spaces, refer to [4, 17].
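As a concrete finite-dimensional example, three unit vectors at mutual angles of 120° (in the spirit of the frame in Fig. 6(e)) span \mathbb{R}^{2} and hence form a redundant frame. A small Python/NumPy sketch (illustrative, not from the original text) computes the optimal frame bounds as the extreme eigenvalues of the frame operator:

```python
import numpy as np

# The "Mercedes-Benz" frame: three unit vectors at 120-degree angles in R^2.
angles = np.pi / 2 + np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
Phi = np.stack([np.cos(angles), np.sin(angles)])   # 2 x 3, columns phi_k

S = Phi @ Phi.T                                    # frame operator S = sum_k phi_k phi_k^T
A, B = np.linalg.eigvalsh(S)[[0, -1]]              # optimal frame bounds
print(A, B)   # both equal 3/2: the frame is tight with bound A = B = 1.5
```

Since A = B, this frame is tight, so (as shown in Sec. 3.1.2) reconstruction only requires rescaling the analysis coefficients by 1/A.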

As a non-trivial example, let us mention a specific type of frame often used in signal processing applications, namely Gabor frames. A Gabor system consists of atoms of the form

(7) g_{\omega,\tau}(t)=e^{2\pi{\rm i}\omega t}g(t-\tau),

where the function g\in L^{2}(\mathbb{R}) is called the (generating) window and \tau,\omega\in\mathbb{R} are the time and frequency shift, respectively. To allow perfect and stable reconstruction, the Gabor system (g_{\omega,\tau})_{\omega,\tau\in K(\subset\mathbb{R}^{2})} is assumed to have the frame property, in which case it is called a Gabor frame. Note that the analysis operator of a Gabor frame corresponds to a sampled short-time Fourier transform (see, e.g., [40]), also referred to as a Gabor transform.

Most commonly, regular Gabor frames are used; these are frames of the form (g_{k,l})_{k,l\in\mathbb{Z}}=\left(e^{2\pi{\rm i}kb\cdot}g(\cdot-la)\right)_{k,l\in\mathbb{Z}} for some positive a and b necessarily (but in general not sufficiently) satisfying ab\leq 1. To mention a concrete example: for the Gaussian g(t)=e^{-t^{2}}, the respective regular Gabor system (g_{k,l})_{k,l\in\mathbb{Z}} is a frame for L^{2}(\mathbb{R}) if and only if ab<1 (see, e.g., [40, Sec. 7.5] and references therein).

Other possibilities include using alternative sampling structures, on subgroups [94] or irregular sets [19]. If the window is allowed to change with time (or frequency) one obtains the non-stationary Gabor transform [6]. There it becomes apparent that frames allow to create adaptive and adapted transforms [7], while still guaranteeing perfect reconstruction.

If sampled rather than continuous signals are considered, Gabor theory works similarly. Discrete Gabor frames can be defined analogously, namely as frames of the form \left(e^{2\pi{\rm i}k/M\cdot}h[\cdot-la]\right)_{l\in\mathbb{Z},k=0,1,\ldots,M-1} for h\in\ell^{2}(\mathbb{Z}) with a,M\in\mathbb{N}, where a/M\leq 1 is necessary for the frame property. For readers interested in the theory of Gabor frames on \ell^{2}(\mathbb{Z}), see, e.g., [91]. For constructions of discrete Gabor frames from Gabor frames for L^{2}(\mathbb{R}) through sampling, refer to [50, 81].
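As a numerical illustration (a Python/NumPy sketch with illustrative parameters, not from the original text), one can build a finite, periodized discrete Gabor system with a sampled Gaussian window on \mathbb{C}^{L} and check the frame property via the eigenvalues of the frame operator:

```python
import numpy as np

# Finite, periodized discrete Gabor system on C^L with a sampled Gaussian window:
# atoms g_{m,n}[l] = exp(2*pi*i*m*l/M) * g[(l - n*a) mod L].
# M >= a is necessary for the frame property; here M = 2*a gives redundancy 2.
L, a, M = 144, 12, 24
l = np.arange(L)
g = np.exp(-np.pi * ((l - L // 2) / np.sqrt(L)) ** 2)   # Gaussian centered at L/2

atoms = []
for n in range(L // a):                      # time shifts by multiples of a
    gn = np.roll(g, n * a + L // 2)          # window re-centered at n*a (mod L)
    for m in range(M):                       # M modulations
        atoms.append(np.exp(2j * np.pi * m * l / M) * gn)
Phi = np.array(atoms).T                      # L x (M*L/a), columns are the atoms

eigs = np.linalg.eigvalsh(Phi @ Phi.conj().T)    # spectrum of the frame operator
A, B = eigs[0], eigs[-1]
print(A > 0, B / A)   # positive lower frame bound: the system is a frame
```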

3.1.1. Frame-related operators

Given a frame Φ\Phi for \mathcal{H}, consider the following linear mappings:

Analysis operator: \mathbf{C}_{\Phi}:\mathcal{H}\rightarrow\ell^{2}(K), \mathbf{C}_{\Phi}x:=(\left<x,\phi_{k}\right>)_{k\in K};
Synthesis operator: \mathbf{D}_{\Phi}:\ell^{2}(K)\rightarrow\mathcal{H}, \mathbf{D}_{\Phi}(c_{k})_{k\in K}:=\sum_{k\in K}c_{k}\phi_{k};
(8) Frame operator: \mathbf{S}_{\Phi}:\mathcal{H}\rightarrow\mathcal{H}, \mathbf{S}_{\Phi}x:=\mathbf{D}_{\Phi}\mathbf{C}_{\Phi}x=\sum_{k\in K}\left<x,\phi_{k}\right>\phi_{k}.
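In the finite-dimensional case, where the frame vectors form the columns of a matrix, these three operators are plain matrix products. A minimal Python/NumPy sketch (illustrative, with a random frame for \mathbb{R}^{4}):

```python
import numpy as np

# For a frame given by the columns of Phi (real case): analysis C maps x to the
# coefficients <x, phi_k>, synthesis D = C^T sums them back, and S = D C, cf. (8).
rng = np.random.default_rng(1)
d, K = 4, 7
Phi = rng.standard_normal((d, K))   # 7 generic vectors spanning R^4

C = Phi.T                           # analysis operator
D = Phi                             # synthesis operator (the adjoint of C)
S = D @ C                           # frame operator

x = rng.standard_normal(d)
print(np.allclose(S @ x, sum(np.dot(x, Phi[:, k]) * Phi[:, k] for k in range(K))))

# The optimal frame bounds in (6) are the extreme eigenvalues of S:
A, B = np.linalg.eigvalsh(S)[[0, -1]]
energy = np.sum((C @ x) ** 2)       # sum_k |<x, phi_k>|^2
print(A * (x @ x) <= energy + 1e-12 and energy <= B * (x @ x) + 1e-12)   # True
```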

These operators are tremendously important for the theoretical investigation of frames as well as for signal processing. As one can observe, the analysis (resp. synthesis, frame) operator corresponds to analyzing (resp. synthesizing, analyzing and re-synthesizing) a signal. In the following statement the main properties of the frame-related operators are listed.

Theorem 1.

(e.g. [22, Sec. 5]) Let Φ\Phi be a frame for \mathcal{H} with frame bounds AA and BB (ABA\leq B). Then the following holds.

  • (a) \mathbf{C}_{\Phi} is a bounded injective operator with \|\mathbf{C}_{\Phi}\|\leq\sqrt{B}.

  • (b) \mathbf{D}_{\Phi} is a bounded surjective operator with \|\mathbf{D}_{\Phi}\|\leq\sqrt{B} and \mathbf{D}_{\Phi}=\mathbf{C}_{\Phi}^{*}.

  • (c) \mathbf{S}_{\Phi} is a bounded bijective positive self-adjoint operator with \|\mathbf{S}_{\Phi}\|\leq B.

  • (d) (\mathbf{S}_{\Phi}^{-1}\phi_{k})_{k\in K} is a frame for \mathcal{H} with frame bounds 1/B, 1/A.

Proof.

(a) By the frame inequalities (6) we have \sqrt{A}\|x\|_{\mathcal{H}}\leq\|\mathbf{C}_{\Phi}x\|_{\ell^{2}}\leq\sqrt{B}\|x\|_{\mathcal{H}} for every x\in\mathcal{H}; the upper inequality implies the boundedness, and the lower one the injectivity, i.e. that the operator is one-to-one.

(b) We first show that \mathbf{D}_{\Phi} is well defined, i.e., that \sum_{k\in K}c_{k}\phi_{k} converges for every (c_{k})_{k\in K}\in\ell^{2}(K). Without loss of generality, to simplify notation, we may assume K=\mathbb{N}. Fix an arbitrary (c_{k})_{k\in\mathbb{N}}\in\ell^{2}. For every p,q\in\mathbb{N}, p>q,

\|\sum_{k=1}^{p}c_{k}\phi_{k}-\sum_{k=1}^{q}c_{k}\phi_{k}\|_{\mathcal{H}}=\sup_{x\in\mathcal{H},\|x\|_{\mathcal{H}}=1}|\langle\sum_{k=q+1}^{p}c_{k}\phi_{k},x\rangle|\leq\sup_{x\in\mathcal{H},\|x\|_{\mathcal{H}}=1}(\sum_{k=q+1}^{p}|c_{k}|^{2})^{1/2}(\sum_{k=q+1}^{p}|\langle\phi_{k},x\rangle|^{2})^{1/2}\leq\sqrt{B}\,(\sum_{k=q+1}^{p}|c_{k}|^{2})^{1/2}\xrightarrow[p,q\to\infty]{}0,

which implies that k=1pckϕk\sum_{k=1}^{p}c_{k}\phi_{k} converges in \mathcal{H} as pp\to\infty. Using the adjoint of 𝐂Φ\mathbf{C}_{\Phi}, for every (ck)k=12(c_{k})_{k=1}^{\infty}\in\ell^{2} and every yy\in\mathcal{H}, one has that

\langle\mathbf{C}^{*}_{\Phi}(c_{k})_{k=1}^{\infty},y\rangle=\langle(c_{k})_{k=1}^{\infty},\mathbf{C}_{\Phi}y\rangle=\sum_{k=1}^{\infty}c_{k}\overline{\langle y,\phi_{k}\rangle}=\sum_{k=1}^{\infty}c_{k}\langle\phi_{k},y\rangle=\langle\sum_{k=1}^{\infty}c_{k}\phi_{k},y\rangle.

Therefore \mathbf{D}_{\Phi}=\mathbf{C}_{\Phi}^{*}, which also implies the boundedness of \mathbf{D}_{\Phi}.

For every xx\in\mathcal{H}, we have 𝐃Φx2=𝐂Φx2Ax\|\mathbf{D}_{\Phi}^{*}x\|_{\ell^{2}}=\|\mathbf{C}_{\Phi}x\|_{\ell^{2}}\geq\sqrt{A}\|x\|, which implies (see, e.g., [78, Theorem 4.15]) that 𝐃Φ\mathbf{D}_{\Phi} is surjective, i.e. it maps onto the whole space \mathcal{H}.

(c) The boundedness and self-adjointness of \mathbf{S}_{\Phi} follow from (a) and (b). Since \langle\mathbf{S}_{\Phi}x,x\rangle=\sum_{k\in K}|\langle x,\phi_{k}\rangle|^{2}, the operator \mathbf{S}_{\Phi} is positive and the frame inequalities (6) mean that

(9) A\|x\|_{\mathcal{H}}^{2}\leq\langle\mathbf{S}_{\Phi}x,x\rangle\leq B\|x\|_{\mathcal{H}}^{2},\ \forall x\in\mathcal{H},

implying that 0(Id1B𝐒Φ)x,xBABx20\leq\langle({\rm Id}_{\mathcal{H}}-\frac{1}{B}\mathbf{S}_{\Phi})x,x\rangle\leq\frac{B-A}{B}\|x\|^{2}_{\mathcal{H}} for all xx\in\mathcal{H}. Then the norm of the bounded self-adjoint operator Id1B𝐒Φ{\rm Id}_{\mathcal{H}}-\frac{1}{B}\mathbf{S}_{\Phi} satisfies

\|{\rm Id}_{\mathcal{H}}-\frac{1}{B}\mathbf{S}_{\Phi}\|=\sup_{x\in\mathcal{H},\|x\|_{\mathcal{H}}=1}\langle({\rm Id}_{\mathcal{H}}-\frac{1}{B}\mathbf{S}_{\Phi})x,x\rangle\leq\frac{B-A}{B}<1,

which by the Neumann theorem (see, e.g., [45, Theor. 8.1]) implies that 𝐒Φ\mathbf{S}_{\Phi} is bijective.

(d) As a consequence of (c), SΦ1S_{\Phi}^{-1} is bounded, self-adjoint, and positive. In the language of partial ordering of self-adjoint operators (see, e.g., [45, Sec. 68]), (9) can be written as

(10) A\cdot{\rm Id}_{\mathcal{H}}\leq\mathbf{S}_{\Phi}\leq B\cdot{\rm Id}_{\mathcal{H}}.

Since 𝐒Φ1\mathbf{S}_{\Phi}^{-1} is positive and commutes with 𝐒Φ\mathbf{S}_{\Phi} and Id{\rm Id}_{\mathcal{H}}, one can multiply the inequalities in (10) with 𝐒Φ1\mathbf{S}_{\Phi}^{-1} (see, e.g., [45, Prop. 68.9]) and obtain

\frac{1}{B}{\rm Id}_{\mathcal{H}}\leq\mathbf{S}_{\Phi}^{-1}\leq\frac{1}{A}{\rm Id}_{\mathcal{H}},

which means that

(11) \frac{1}{B}\|x\|_{\mathcal{H}}^{2}\leq\langle\mathbf{S}_{\Phi}^{-1}x,x\rangle\leq\frac{1}{A}\|x\|^{2}_{\mathcal{H}},\ \forall x\in\mathcal{H}.

For every xx\in\mathcal{H}, denote yx=𝐒Φ1xy_{x}=\mathbf{S}_{\Phi}^{-1}x and use the fact that 𝐒Φ1\mathbf{S}_{\Phi}^{-1} is self-adjoint to obtain

\sum_{k\in K}|\langle x,\mathbf{S}_{\Phi}^{-1}\phi_{k}\rangle|^{2}=\sum_{k\in K}|\langle y_{x},\phi_{k}\rangle|^{2}=\langle y_{x},\mathbf{S}_{\Phi}y_{x}\rangle=\langle\mathbf{S}_{\Phi}^{-1}x,x\rangle.

Now (11) completes the conclusion that (𝐒Φ1ϕk)kK(\mathbf{S}_{\Phi}^{-1}\phi_{k})_{k\in K} is a frame for \mathcal{H} with frame bounds 1/B1/B, 1/A1/A. ∎

3.1.2. Perfect reconstruction via frames

Here we consider one of the most important properties of frames, namely, the possibility to have perfect reconstruction of all the elements in the space.

Theorem 2.

(e.g. [40, Corol. 5.1.3]) Let Φ\Phi be a frame for \mathcal{H}. Then there exists a frame Ψ\Psi for \mathcal{H} such that

(12) x=\sum_{k\in K}\langle x,\psi_{k}\rangle\phi_{k}=\sum_{k\in K}\langle x,\phi_{k}\rangle\psi_{k},\ \forall x\in\mathcal{H}.
Proof.

By Theorem 1(d), the sequence (𝐒Φ1ϕk)kK(\mathbf{S}_{\Phi}^{-1}\phi_{k})_{k\in K} is a frame for \mathcal{H}. Take Ψ:=(𝐒Φ1ϕk)kK\Psi:=(\mathbf{S}_{\Phi}^{-1}\phi_{k})_{k\in K}. Using the boundedness and the self-adjointness of SΦS_{\Phi}, for every xx\in\mathcal{H},

\sum_{k\in K}\langle x,\phi_{k}\rangle\psi_{k}=\sum_{k\in K}\langle x,\phi_{k}\rangle\mathbf{S}_{\Phi}^{-1}\phi_{k}=\mathbf{S}_{\Phi}^{-1}\sum_{k\in K}\langle x,\phi_{k}\rangle\phi_{k}=\mathbf{S}_{\Phi}^{-1}\mathbf{S}_{\Phi}x=x,
\sum_{k\in K}\langle x,\psi_{k}\rangle\phi_{k}=\sum_{k\in K}\langle x,\mathbf{S}_{\Phi}^{-1}\phi_{k}\rangle\phi_{k}=\sum_{k\in K}\langle\mathbf{S}_{\Phi}^{-1}x,\phi_{k}\rangle\phi_{k}=\mathbf{S}_{\Phi}\mathbf{S}_{\Phi}^{-1}x=x. ∎

Let \Phi be a frame for \mathcal{H}. Any frame \Psi for \mathcal{H} which satisfies (12) is called a dual frame of \Phi. By the above theorem, every frame has at least one dual frame, namely, the sequence

(13) (\mathbf{S}_{\Phi}^{-1}\phi_{k})_{k\in K},

called the canonical dual of \Phi. When the frame is a Riesz basis, the coefficient representation is unique and thus there is only one dual frame, the canonical dual. When the frame is redundant, there are other dual frames different from the canonical dual (see, e.g., [22, Lemma 5.6.1]), in fact infinitely many. This provides multiple choices for the coefficients in the frame representations, which is desirable in some applications (see, e.g., [7]). The canonical dual has a minimizing property in the sense that the coefficients (\langle x,\mathbf{S}_{\Phi}^{-1}\phi_{k}\rangle)_{k\in K} in the representation x=\sum_{k\in K}\langle x,\mathbf{S}_{\Phi}^{-1}\phi_{k}\rangle\phi_{k} have the minimal \ell^{2}-norm among the coefficients (c_{k})_{k\in K} of all possible representations x=\sum_{k\in K}c_{k}\phi_{k}. However, for certain applications other constraints are of interest, e.g. sparsity, efficient algorithms for the representations, or particular shape restrictions on the dual window [93, 72]. The canonical dual is not always efficient to calculate, nor does it always have the desired structure; in such cases other dual frames are of interest [23, 57, 15]. The particular case of tight frames is very convenient for efficient reconstruction, because the canonical dual is simple and does not require operator inversion:

Corollary 1.

(e.g. [22, Sec. 5.7]) The canonical dual of a tight frame (ϕk)kK(\phi_{k})_{k\in K} with frame bound AA is the sequence (1Aϕk)kK(\frac{1}{A}\phi_{k})_{k\in K}.

Proof.

Let \Phi be a tight frame for \mathcal{H} with frame bound A. It follows from (10) that \mathbf{S}_{\Phi}=A\cdot{\rm Id}_{\mathcal{H}} and thus the canonical dual of \Phi is (\mathbf{S}_{\Phi}^{-1}\phi_{k})_{k\in K}=(\frac{1}{A}\phi_{k})_{k\in K}. ∎
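For a non-tight frame, the canonical dual (13) requires inverting the frame operator; in finite dimensions this is a linear solve. The following Python/NumPy sketch (illustrative, with a random frame for \mathbb{R}^{3}) verifies both reconstruction formulas in (12):

```python
import numpy as np

# Canonical dual frame: psi_k = S^{-1} phi_k.
rng = np.random.default_rng(2)
d, K = 3, 5
Phi = rng.standard_normal((d, K))      # columns phi_k; generically a frame for R^3
S = Phi @ Phi.T                        # frame operator
Psi = np.linalg.solve(S, Phi)          # columns S^{-1} phi_k

x = rng.standard_normal(d)
print(np.allclose(Phi @ (Psi.T @ x), x))   # x = sum_k <x, psi_k> phi_k
print(np.allclose(Psi @ (Phi.T @ x), x))   # x = sum_k <x, phi_k> psi_k
```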

In acoustic applications, it can be a great advantage not to have to distinguish between analysis and synthesis atoms. Thus, one may aim to perform analysis and synthesis with the same sequence, in analogy to the case of ONBs. However, such an analysis-synthesis strategy perfectly reconstructs all the elements of the space if and only if this sequence is a Parseval frame:

Proposition 1.

(e.g. [22, Lemma 5.7.1]) The sequence Φ\Phi satisfies

(14) x=\sum_{k\in K}\langle x,\phi_{k}\rangle\phi_{k},\ \forall x\in\mathcal{H},

if and only if it is a Parseval frame for \mathcal{H}.

Proof.

Let Φ\Phi be a Parseval frame for \mathcal{H}. By Corollary 1, the canonical dual of Φ\Phi is the same sequence Φ\Phi, which implies that (14) holds. Now assume that (14) holds. Then for every xx\in\mathcal{H},

\|x\|^{2}=\langle\sum_{k\in K}\langle x,\phi_{k}\rangle\phi_{k},x\rangle=\sum_{k\in K}\langle x,\phi_{k}\rangle\langle\phi_{k},x\rangle=\sum_{k\in K}|\langle x,\phi_{k}\rangle|^{2},

which means that Φ\Phi is a Parseval frame for \mathcal{H}. ∎

The above statement characterizes the sequences which provide reconstruction exactly as ONBs do: these are precisely the Parseval frames. A trivial example of such a frame which is not an ONB is the sequence (e_{1},e_{2}/\sqrt{2},e_{2}/\sqrt{2},e_{3}/\sqrt{3},e_{3}/\sqrt{3},e_{3}/\sqrt{3},\ldots), where (e_{k})_{k=1}^{\infty} denotes an ONB of \mathcal{H}. Clearly, any tight frame with frame bound A is easily converted into a Parseval frame by dividing the frame elements by \sqrt{A}. Given any frame, one can always construct a Parseval frame as follows:

Proposition 2.

(e.g. [22, Theor. 5.3.4]) Let Φ\Phi be a frame for \mathcal{H}. Then 𝐒Φ1\mathbf{S}_{\Phi}^{-1} has a positive square root and (𝐒Φ1/2ϕk)kK(\mathbf{S}_{\Phi}^{-1/2}\phi_{k})_{k\in K} forms a Parseval frame for \mathcal{H}.

Proof.

Since 𝐒Φ1\mathbf{S}_{\Phi}^{-1} is a bounded positive self-adjoint operator, there is a unique bounded positive self-adjoint operator, which is denoted by 𝐒Φ1/2\mathbf{S}_{\Phi}^{-1/2}, with 𝐒Φ1=𝐒Φ1/2𝐒Φ1/2\mathbf{S}_{\Phi}^{-1}=\mathbf{S}_{\Phi}^{-1/2}\mathbf{S}_{\Phi}^{-1/2}. Furthermore, 𝐒Φ1/2\mathbf{S}_{\Phi}^{-1/2} commutes with 𝐒Φ\mathbf{S}_{\Phi}. For every xx\in\mathcal{H},

\sum\limits_{k\in K}\langle x,\mathbf{S}_{\Phi}^{-1/2}\phi_{k}\rangle\mathbf{S}_{\Phi}^{-1/2}\phi_{k}=\mathbf{S}_{\Phi}^{-1/2}\sum\limits_{k\in K}\langle\mathbf{S}_{\Phi}^{-1/2}x,\phi_{k}\rangle\phi_{k}=\mathbf{S}_{\Phi}^{-1/2}\mathbf{S}_{\Phi}\mathbf{S}_{\Phi}^{-1/2}x=\mathbf{S}_{\Phi}^{-1}\mathbf{S}_{\Phi}x=x.

By Proposition 1 this means that (𝐒Φ1/2ϕk)kK(\mathbf{S}_{\Phi}^{-1/2}\phi_{k})_{k\in K} is a Parseval frame for \mathcal{H}. ∎
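In finite dimensions, the operator \mathbf{S}_{\Phi}^{-1/2} can be computed from an eigendecomposition of the frame operator. A Python/NumPy sketch of the construction in Proposition 2 (illustrative, with a random frame):

```python
import numpy as np

# Canonical Parseval frame: replace phi_k by S^{-1/2} phi_k.
rng = np.random.default_rng(3)
d, K = 3, 6
Phi = rng.standard_normal((d, K))
S = Phi @ Phi.T

w, V = np.linalg.eigh(S)                    # S = V diag(w) V^T with w > 0
S_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T   # the positive square root of S^{-1}
P = S_inv_sqrt @ Phi                        # columns S^{-1/2} phi_k

print(np.allclose(P @ P.T, np.eye(d)))      # frame operator = Id, i.e. a Parseval frame
```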

Finally, note that frames guarantee stability. Let \Phi be a frame for \mathcal{H} with frame bounds A,B. Then \sqrt{A}\|x-y\|\leq\|(\langle x,\phi_{k}\rangle-\langle y,\phi_{k}\rangle)_{k\in K}\|_{\ell^{2}}\leq\sqrt{B}\|x-y\| for x,y\in\mathcal{H}, which implies that close signals lead to close analysis coefficients and vice versa. Furthermore, the representations via \Phi and a dual frame \Psi are stable. If a signal x is transmitted via the coefficients (\langle x,\psi_{k}\rangle)_{k\in K} but, during transmission, the coefficients are slightly disturbed (i.e. modified to a sequence (a_{k})_{k\in K}\in\ell^{2} with small \ell^{2}-difference), then by Theorem 1(b) the “reconstructed” signal y=\sum_{k\in K}a_{k}\phi_{k} will be close to x: \|x-y\|=\|\sum_{k\in K}(\langle x,\psi_{k}\rangle-a_{k})\phi_{k}\|\leq\sqrt{B}\|(\langle x,\psi_{k}\rangle-a_{k})_{k\in K}\|_{\ell^{2}}.

3.2. Frame multipliers

Multipliers have been used implicitly in applications for quite some time, as time-variant filters, see e.g. [60]. The first systematic theoretical development of Gabor multipliers appeared in [33]. An extension of the multiplier concept to general frames in Hilbert spaces was given in [3]; it can be derived as an easy consequence of Theorem 1:

Proposition 3.

[3] Let Φ\Phi and Ψ\Psi be frames for \mathcal{H} and let m=(mk)kKm=(m_{k})_{k\in K} be a complex scalar sequence in (K)\ell^{\infty}(K). Then the series kKmkx,ψkϕk\sum_{k\in K}m_{k}\langle x,\psi_{k}\rangle\phi_{k} converges for every xx\in\mathcal{H} and determines a bounded operator on \mathcal{H}.

Proof.

For every x\in\mathcal{H}, Theorem 1(a) implies that (\langle x,\psi_{k}\rangle)_{k\in K}\in\ell^{2} and thus (m_{k}\langle x,\psi_{k}\rangle)_{k\in K}\in\ell^{2}, which by Theorem 1(b) implies that the series \sum_{k\in K}m_{k}\langle x,\psi_{k}\rangle\phi_{k} converges. Thus, the mapping \mathbf{M}_{m,\Phi,\Psi} determined by \mathbf{M}_{m,\Phi,\Psi}x:=\sum_{k\in K}m_{k}\langle x,\psi_{k}\rangle\phi_{k} is well defined on \mathcal{H} and furthermore linear. For every x\in\mathcal{H},

\|\mathbf{M}_{m,\Phi,\Psi}x\|_{\mathcal{H}}=\|\mathbf{D}_{\Phi}(m_{k}\langle x,\psi_{k}\rangle)_{k\in K}\|_{\mathcal{H}}\leq\|\mathbf{D}_{\Phi}\|\cdot\|(m_{k}\langle x,\psi_{k}\rangle)_{k\in K}\|_{\ell^{2}}\leq\|\mathbf{D}_{\Phi}\|\cdot\|m\|_{\infty}\cdot\|\mathbf{C}_{\Psi}\|\cdot\|x\|_{\mathcal{H}},

implying the boundedness of 𝐌m,Φ,Ψ\mathbf{M}_{m,\Phi,\Psi}. ∎

Due to the above proposition, frame multipliers can be defined as follows:

Definition 3.

Given frames \Phi and \Psi for \mathcal{H} and a complex scalar sequence m=(m_{k})_{k\in K}\in\ell^{\infty}(K), the operator \mathbf{M}_{m,\Phi,\Psi} determined by

(15) \mathbf{M}_{m,\Phi,\Psi}x:=\sum_{k\in K}m_{k}\langle x,\psi_{k}\rangle\phi_{k},\ x\in\mathcal{H},

is called a frame multiplier with symbol m.

Thus, frame multipliers extend the frame operator, allowing different frames for the analysis and synthesis steps, and a modification in between (for an illustration, see Figure 7). However, in contrast to frame operators, multipliers in general lose bijectivity (as well as self-adjointness and positivity). For some applications it may be necessary to invert multipliers, which motivates the study of bijective multipliers and of formulas for their inverses; we refer interested readers to [10, 82, 83, 84] for some investigations in this direction.

Figure 7. An illustrative example to visualize a multiplier (taken from [10]). (TOP LEFT) The time-frequency representation of the music signal ff. (TOP RIGHT) The symbol mm, found by a (manual) estimation of the time-frequency region of the singer’s voice. (BOTTOM) Time-frequency representation of Mm,Ψ~,ΨfM_{m,\widetilde{\Psi},\Psi}f.

In the language of signal processing, Gabor filters [61] are a particular way to perform time-variant filtering. In fact, Gabor filters are nothing but frame multipliers associated to a Gabor frame. A signal x is transformed to the time-frequency domain (with a Gabor frame \Phi), then modified there by point-wise multiplication with the symbol m, and then re-synthesized via some Gabor frame \Psi, providing a modified signal. If some elements m_{k} of the symbol m are zero, the corresponding coefficients are removed, as sometimes done in applications like CASA or perceptual sparsity, see Secs. 2.4 and 5.2.
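In the finite setting, a frame multiplier is simply "analyze, multiply pointwise, re-synthesize", i.e. the matrix \mathbf{D}_{\Phi}\,\mathrm{diag}(m)\,\mathbf{C}_{\Psi}. A Python/NumPy sketch with a binary symbol (all parameter values illustrative):

```python
import numpy as np

# Frame multiplier M_{m,Phi,Psi}, cf. (15), with a binary ("masking") symbol m.
rng = np.random.default_rng(4)
d, K = 4, 8
Phi = rng.standard_normal((d, K))        # synthesis frame
Psi = np.linalg.solve(Phi @ Phi.T, Phi)  # canonical dual, used as analysis frame

m = np.zeros(K)
m[:5] = 1.0                              # keep 5 coefficients, suppress the rest
M = Phi @ np.diag(m) @ Psi.T             # the multiplier as a d x d matrix

x = rng.standard_normal(d)
y = M @ x                                # "time-variant filtered" signal
print(np.allclose(y, Phi @ (m * (Psi.T @ x))))   # analyze, mask, re-synthesize
print(np.allclose(Phi @ Psi.T, np.eye(d)))       # symbol m = 1 gives the identity
```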

3.2.1. Implementation

In the finite-dimensional case, frames lend themselves easily to implementation in computer codes [4]. The Large Time-Frequency Analysis Toolbox (LTFAT) [80], see http://ltfat.github.io/, is an open-source Matlab/Octave toolbox intended for time-frequency analysis, synthesis and processing, including multipliers. It provides robust and efficient implementations for a variety of frame-related operators for generic frames and several special types, e.g. Gabor and filter bank frames.

In a recent release, reported in [74], a ’frames framework’ was implemented, which models the abstract frame concept in an object-oriented approach. In this setting any algorithm can be designed to use a general frame. If a structured frame, e.g. of Gabor or wavelet type, is used, more efficient algorithms are automatically selected.

4. Filter bank frames: a signal processing viewpoint

Linear time-invariant filter banks (FB) are a classical signal analysis and processing tool. Their general, potentially non-uniform structure provides the natural setting for the design of flexible, frequency-adaptive time-frequency signal representations [7]. In this section, we recall some basics of FB theory and consider the relation of perfect reconstruction FBs to certain frame systems.

4.1. Basics of filter banks

In the following, we consider discrete signals with finite energy (x\in\ell^{2}(\mathbb{Z})), interpreted as samples of a continuous signal, sampled at sampling frequency \xi_{s}, i.e. the signal was sampled every 1/\xi_{s} seconds. Bold italic letters indicate matrices (upper case), e.g. \mathbfit{G}, and vectors (lower case), e.g. \mathbfit{h}. We denote by W_{N}=e^{2i\pi/N} the N-th root of unity and by \delta_{k}=\delta_{0}[\cdot-k] the (discrete) Dirac symbol, with \delta_{k}[n]=1 for n=k and 0 otherwise. Observe that for q=D/d we have

(16) \sum_{l=0}^{q-1}W_{D}^{jld}=\sum_{l=0}^{q-1}e^{2\pi ijl/q}=\begin{cases}q&\text{if }j\text{ is a multiple of }q,\\ 0&\text{otherwise.}\end{cases}
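As a quick numerical illustration (a self-contained sketch of our own, not part of the original text; the function name is hypothetical), Eq. (16) can be verified directly:

```python
import numpy as np

# Check Eq. (16): for q = D/d, the sum of the powers W_D^{jld},
# l = 0, ..., q-1, equals q when j is a multiple of q, and 0 otherwise.
def root_of_unity_sum(D, d, j):
    q = D // d
    W_D = np.exp(2j * np.pi / D)          # the D-th root of unity W_D
    return sum(W_D ** (j * l * d) for l in range(q))

D, d = 12, 3                              # q = D/d = 4
vals = [root_of_unity_sum(D, d, j) for j in range(D)]
# vals[j] is (numerically) 4 for j = 0, 4, 8 and 0 for all other j.
```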

The z-transform maps a (discrete-)time domain signal x to its frequency domain representation X by

\mathcal{Z}:\,x[n]\mapsto X(z)=\sum_{n\in\mathbb{Z}}x[n]z^{n},\text{ for all }z\in\mathbb{C}.

By setting z=e^{2\pi i\xi} for \xi\in\mathbb{T}, the z-transform equals the discrete-time Fourier transform (\mathrm{DTFT}). Note that the z-transform is uniquely determined by its values on the complex unit circle [68]. It is easy to see that \mathcal{Z}\left(\delta_{k}\right)=z^{k}, a property that we will use later on.

The application of a filter to a signal x is given by the convolution of x with the time domain representation, or impulse response, h\in\ell^{2}(\mathbb{Z}) of the filter

(17) y[n]=x\ast h[n]=\sum_{l\in\mathbb{Z}}x[l]h[n-l],\ \forall\,n\in\mathbb{Z},

or equivalently by multiplication in the frequency domain, Y(z)=X(z)H(z), where H(z) is the transfer function, or frequency domain representation, of the filter.
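This equivalence is easy to check numerically; the following sketch (our own, with hypothetical variable names) compares direct convolution with multiplication of zero-padded DFTs, which sample the z-transform on the unit circle:

```python
import numpy as np

# Linear convolution (Eq. (17)) versus frequency-domain multiplication
# Y = X * H, using DFTs with enough points to avoid circular wrap-around.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
h = rng.standard_normal(5)                 # a short FIR impulse response

y_time = np.convolve(x, h)                 # direct convolution, length 20
N = len(x) + len(h) - 1                    # DFT length for linear convolution
y_freq = np.fft.ifft(np.fft.fft(x, N) * np.fft.fft(h, N)).real
# y_time and y_freq agree up to rounding error.
```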

Furthermore, define the downsampling and upsampling operators \downarrow_{d} and \uparrow_{d} by

(18) \downarrow_{d}\left\{x\right\}[n]=x[d\cdot n]\quad\text{and}\quad\uparrow_{d}\left\{x\right\}[n]=\begin{cases}x[n/d]&\text{ if }n\in d\mathbb{Z},\\ 0&\text{ otherwise.}\end{cases}

Here, d\in\mathbb{N} is called the downsampling or upsampling factor, respectively. In the frequency domain, the effect of down- and upsampling is the following [69]:

(19) \mathcal{Z}(\downarrow_{d}\left\{x\right\})(z)=d^{-1}\sum_{j=0}^{d-1}X(W_{d}^{j}z^{1/d})\quad\text{and}\quad\mathcal{Z}(\uparrow_{d}\left\{x\right\})(z)=X(z^{d}).

In words, downsampling a signal by d results in the dilation of its spectrum by d and the addition of (d-1) copies of the dilated spectrum. These copies of the spectrum (the terms X(W_{d}^{j}z^{1/d}) for j\neq 0 in the sum above) are called aliasing terms. Conversely, upsampling a signal by d results in the contraction of its spectrum by d.
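Both identities in Eq. (19) have a finite, circular analogue that can be verified with the DFT. The following sketch (our own construction, not from the original text) shows the d periodic spectral copies created by upsampling and the aliasing sum created by downsampling:

```python
import numpy as np

def downsample(x, d):
    return x[::d]

def upsample(x, d):
    y = np.zeros(len(x) * d, dtype=x.dtype)
    y[::d] = x
    return y

rng = np.random.default_rng(1)
N, d = 12, 3
x = rng.standard_normal(N)
X = np.fft.fft(x)

# Upsampling: the length-(d*N) DFT of the upsampled signal consists of
# d periodic copies of X (spectrum contracted by d).
X_up = np.fft.fft(upsample(x, d))

# Downsampling: the length-(N/d) DFT is the dilated spectrum plus (d-1)
# aliasing terms: X_down[k] = (1/d) * sum_j X[k + j*N/d].
X_down = np.fft.fft(downsample(x, d))
X_alias = sum(X[np.arange(N // d) + j * (N // d)] for j in range(d)) / d
```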

An FB is a collection of analysis filters H_{k}(z), synthesis filters G_{k}(z), and downsampling and upsampling factors d_{k}, k\in\{0,\ldots,K\}, see Fig. 8. An FB is called uniform if all filters have the same downsampling factor, i.e. d_{k}=D for all k.

Figure 8. General structure of a non-uniform analysis-synthesis FB.

The sub-band components y_{k}[n] of the system represented in Fig. 8 are given in the time domain by

(20) y_{k}[n]=\downarrow_{d_{k}}\left\{h_{k}*x\right\}[n].

The output signal is \tilde{x}[n]=\sum_{k=0}^{K}\left(g_{k}*\uparrow_{d_{k}}\left\{y_{k}\right\}\right)[n]. When analyzing the properties of a filter (bank), it is often useful to transform the expression for \tilde{x} to the frequency domain. First, apply the z-transform to the output of a single analysis/synthesis branch, obtaining

(21) \mathcal{Z}\left(g_{k}*\uparrow_{d_{k}}\left\{y_{k}\right\}\right)(z)=d_{k}^{-1}[X(W^{0}_{d_{k}}z),\ldots,X(W^{d_{k}-1}_{d_{k}}z)]\left[\begin{matrix}H_{k}(W^{0}_{d_{k}}z)\\ \vdots\\ H_{k}(W^{d_{k}-1}_{d_{k}}z)\end{matrix}\right]G_{k}(z),

where the down- and upsampling properties of the z-transform were applied, see Eq. (19). Now let D=\operatorname{lcm}(d_{0},\ldots,d_{K}) be the least common multiple of the downsampling factors, and q_{k}=D/d_{k}. Then (21) gives

(22) \mathcal{Z}\left(g_{k}*\uparrow_{d_{k}}\left\{y_{k}\right\}\right)(z)=D^{-1}[X(W^{0}_{D}z),\ldots,X(W^{D-1}_{D}z)]\,\mathbfit{h}_{k}(z)G_{k}(z),

where

\mathbfit{h}_{k}(z)=q_{k}\cdot\Big[H_{k}(z),\underbrace{0,\ldots,0}_{q_{k}-1\text{ zeros}},H_{k}(W_{D}^{q_{k}}z),\underbrace{0,\ldots,0}_{q_{k}-1\text{ zeros}},\ldots,H_{k}(W_{D}^{(d_{k}-1)q_{k}}z),\underbrace{0,\ldots,0}_{q_{k}-1\text{ zeros}}\Big]^{T}.

The relevance of this equality becomes clear if we use linearity of the z-transform to obtain a frequency domain representation of the full FB output, also called the alias domain representation [89]

(23) \tilde{X}(z)=\sum_{k=0}^{K}\mathcal{Z}\left(g_{k}*\uparrow_{d_{k}}\left\{y_{k}\right\}\right)(z)
=D^{-1}[X(W^{0}_{D}z),\ldots,X(W^{D-1}_{D}z)]\left[\mathbfit{h}_{0}(z),\ldots,\mathbfit{h}_{K}(z)\right]\left[\begin{matrix}G_{0}(z)\\ \vdots\\ G_{K}(z)\end{matrix}\right]
=D^{-1}[X(W^{0}_{D}z),\ldots,X(W^{D-1}_{D}z)]\,\mathbfit{H}(z)\mathbfit{G}(z),

where \mathbfit{H}(z)=\left[\mathbfit{h}_{0}(z),\ldots,\mathbfit{h}_{K}(z)\right] is the D\times(K+1) alias component matrix [89] and \mathbfit{G}(z)=\left[G_{0}(z),\ldots,G_{K}(z)\right]^{T}.

An FB system is undersampled, critically sampled or oversampled if R=\sum_{k=0}^{K}d_{k}^{-1} is smaller than, equal to, or larger than 1, respectively. Consequently, a uniform FB is critically sampled if it has exactly D subbands. For a deeper treatment of FBs, see e.g. [54, 89].

Perfect reconstruction FBs: An FB is said to provide perfect reconstruction if \tilde{x}[n]=x[n-l] for all x\in\ell^{2}(\mathbb{Z}) and some fixed l\in\mathbb{Z}. In the case l\neq 0, the FB output is delayed by l samples. Using the alias domain representation of the FB, the perfect reconstruction condition can be expressed as

(24) \mathbfit{H}(z)\,\mathbfit{G}(z)=z^{l}\left[D\;0\cdots 0\right]^{T},

for some l\in\mathbb{Z}, as this condition is equivalent to \tilde{X}(z)=z^{l}X(z)=\mathcal{Z}(x*\delta_{l})(z). From this vantage point, the perfect reconstruction condition can be interpreted as follows: all the alias components (i.e. the 2nd to D-th entries) of \mathbfit{H}(z) are uniformly canceled over all z\in\mathbb{C} by the synthesis filters \mathbfit{G}(z), while the first component of \mathbfit{H}(z) remains constant over all z\in\mathbb{C} (up to a fixed power of z). The perfect reconstruction condition is of tremendous importance for determining whether an FB, comprising both analysis and synthesis steps, provides perfect reconstruction. However, given a fixed analysis FB, the alias domain representation may fail to provide straightforward or efficient ways to find suitable synthesis filters that provide perfect reconstruction. It can sometimes be used to determine whether such a system can exist, although the process is far from intuitive [46]. Consequently, non-uniform perfect reconstruction FBs are still not completely investigated, and frame theory may provide valuable new insights here. For uniform FBs, on the other hand, the perfect reconstruction conditions have been treated extensively in the literature [54, 89]. Therefore, before we delve into the frame theory of FBs, we show how a non-uniform FB can be decomposed into an equivalent uniform FB. Such a uniform equivalent always exists [54, 1] and can be obtained as shown in Fig. 9 and described below.
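As a minimal concrete example (our own sketch, using circular convolution on \mathbb{C}^{L} instead of \ell^{2}(\mathbb{Z})), the critically sampled two-channel FB with the Haar pair h_{0}=[1,1]/\sqrt{2}, h_{1}=[1,-1]/\sqrt{2} and d_{k}=2 provides perfect reconstruction with l=0 when synthesizing with the time-reversed, conjugate analysis filters; this works here because this particular system happens to be a Parseval frame (cf. Sec. 4.3):

```python
import numpy as np

L, d = 16, 2
h = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)   # Haar analysis pair

def circ_conv(x, g):                     # circular convolution on C^L
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(g, len(x)))

rng = np.random.default_rng(2)
x = rng.standard_normal(L)

# Analysis, Eq. (20): y_k[n] = (h_k * x)[d n]
y = [circ_conv(x, hk)[::d] for hk in h]

# Synthesis with g_k[n] = conj(h_k[-n]) (circular time reversal)
x_rec = np.zeros(L, dtype=complex)
for hk, yk in zip(h, y):
    hpad = np.zeros(L)
    hpad[:len(hk)] = hk
    gk = np.conj(hpad[(-np.arange(L)) % L])
    up = np.zeros(L, dtype=complex)
    up[::d] = yk
    x_rec += circ_conv(up, gk)
x_rec = x_rec.real                       # perfect reconstruction: x_rec == x
```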

4.2. The equivalent uniform filter bank

Figure 9. The equivalent uniform FB [1] corresponding to the non-uniform FB in Fig. 8. The terms H_{k}^{(l)} and G_{k}^{(l)} in (b) correspond to the z-transforms of the terms h_{k}^{(l)} and g_{k}^{(l)} defined in (25).

To construct the equivalent uniform FB for a general FB specified by analysis filters H_{k}(z), synthesis filters G_{k}(z), and downsampling and upsampling factors d_{k}, k\in\{0,\ldots,K\}, start by denoting again D=\operatorname{lcm}(d_{0},\ldots,d_{K}). We first construct the desired uniform FB, before showing that it is in fact equivalent to the given non-uniform FB. For every filter h_{k},g_{k} in the non-uniform FB, introduce q_{k}=D/d_{k} filters, given by specific delayed versions of h_{k},g_{k}:

(25) h_{k}^{(l)}[n]=h_{k}\ast\delta_{ld_{k}}=h_{k}\left[n-ld_{k}\right]\quad\text{and}\quad g_{k}^{(l)}[n]=g_{k}\ast\delta_{-ld_{k}}=g_{k}\left[n+ld_{k}\right],

for l=0,\ldots,q_{k}-1. It is easily seen from the definition of the convolution operation (17) that convolution with \delta_{k} equals translation by k samples. Consequently, the sub-band components are

(26) y_{k}^{(l)}[n]=y_{k}[nq_{k}-l]=\downarrow_{D}\{\underbrace{h_{k}*\delta_{ld_{k}}}_{:=h_{k}^{(l)}}*x\}[n],

where y_{k} is the k-th sub-band component with respect to the non-uniform FB. Thus, by grouping the corresponding q_{k} sub-bands, we obtain

y_{k}[n]=\sum_{l=0}^{q_{k}-1}\uparrow_{q_{k}}\left\{y_{k}^{(l)}\right\}[n+l].

In the frequency domain, the filters h_{k}^{(l)},g_{k}^{(l)} are given by

H_{k}^{(l)}(z)=z^{ld_{k}}H_{k}(z)\quad\text{and}\quad G_{k}^{(l)}(z)=z^{-ld_{k}}G_{k}(z).

Similar to before, the output of the FB can be written as

(27) \widetilde{X}(z)=D^{-1}\sum_{k=0}^{K}\sum_{j=0}^{D-1}\sum_{l=0}^{q_{k}-1}G_{k}^{(l)}(z)H_{k}^{(l)}\left(W_{D}^{j}z\right)X\left(W_{D}^{j}z\right)
=D^{-1}\sum_{k=0}^{K}\sum_{j=0}^{D-1}G_{k}(z)H_{k}\left(W_{D}^{j}z\right)X\left(W_{D}^{j}z\right)\sum_{l=0}^{q_{k}-1}W_{D}^{jld_{k}}.

To obtain the second equality, we have used that G_{k}^{(l)}(z)H_{k}^{(l)}\left(W_{D}^{j}z\right)=W_{D}^{jld_{k}}G_{k}(z)H_{k}\left(W_{D}^{j}z\right). Inserting Eq. (16) into (27) yields

(28) \widetilde{X}(z)=D^{-1}\sum_{k=0}^{K}\sum_{j=0}^{d_{k}-1}q_{k}G_{k}(z)H_{k}\left(W_{D}^{jq_{k}}z\right)X\left(W_{D}^{jq_{k}}z\right)
=D^{-1}\sum_{k=0}^{K}[X(W_{D}^{0}z),\ldots,X(W_{D}^{D-1}z)]\,\mathbfit{h}_{k}(z)G_{k}(z)
=D^{-1}[X(W^{0}_{D}z),\ldots,X(W^{D-1}_{D}z)]\,\mathbfit{H}(z)\mathbfit{G}(z),

which is exactly the output of the non-uniform FB specified by the h_{k}'s, g_{k}'s and d_{k}'s, see (23). Therefore, an equivalent uniform FB for every non-uniform FB is obtained by decomposing each k-th channel of the non-uniform system into q_{k} channels. The uniform system then features \sum_{k=0}^{K}q_{k} channels in total, with downsampling factor D=\operatorname{lcm}(d_{0},\ldots,d_{K}) in all channels.
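The decomposition of Eqs. (25)-(26) can be checked numerically. The following sketch (our own, a circular variant on \mathbb{C}^{L}) verifies that delaying h_{k} by ld_{k} and downsampling by D reproduces every q_{k}-th sample of the original sub-band:

```python
import numpy as np

L = 24
d_k, D = 2, 6                            # q_k = D/d_k = 3
q_k = D // d_k
rng = np.random.default_rng(3)
x = rng.standard_normal(L)
h_k = rng.standard_normal(L)             # some (circular) analysis filter

def circ_conv(a, b):
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

y_k = circ_conv(h_k, x)[::d_k]           # Eq. (20): y_k[n] = (h_k * x)[n d_k]

checks = []
for l in range(q_k):
    h_kl = np.roll(h_k, l * d_k)         # h_k^{(l)} = h_k[. - l d_k], Eq. (25)
    y_kl = circ_conv(h_kl, x)[::D]       # sub-band of the equivalent uniform FB
    # Eq. (26): y_k^{(l)}[n] = y_k[n q_k - l], indices taken mod L/d_k
    ref = y_k[(np.arange(L // D) * q_k - l) % (L // d_k)]
    checks.append(np.allclose(y_kl, ref))
```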

4.3. Connection to Frame Theory

We will now describe in detail the connection between non-uniform FBs and frame theory. The main difference to previous work in this direction, cf. [14, 25, 20, 34], is that we do not restrict to the case of uniform FBs. The results in this section are not new, but this presentation is their first appearance in the context of non-uniform FBs. Besides using the equivalent uniform FB representation, see Fig. 9, we transfer results previously obtained for generalized shift-invariant systems [77, 44] and nonstationary Gabor systems [6, 48, 47] to the non-uniform FB setting. For that purpose, we consider frames over the Hilbert space \mathcal{H}=\ell^{2}(\mathbb{Z}) of finite energy sequences. Moreover, we consider only FBs with a finite number K+1\in\mathbb{N} of channels, a setup naturally satisfied in every real-world application. The central observation linking FBs to frames is that the convolution can be expressed as an inner product:

y_{k}[n]=\downarrow_{d_{k}}\left\{h_{k}*x\right\}[n]=\langle x,\overline{h_{k}[nd_{k}-\cdot]}\rangle,

where the bar denotes the complex conjugate. Hence, the sub-band components with respect to the filters h_{k} and downsampling factors d_{k} equal the frame coefficients of the system \Phi=\left(\overline{h_{k}[nd_{k}-\cdot]}\right)_{k,n}. Note that the upper frame inequality, see Eq. (6), is equivalent to the h_{k}'s and d_{k}'s defining a system where bounded energy of the input implies bounded energy of the output. We will investigate the frame properties of this system by transference to the Fourier domain [5]; we consider \widehat{\Phi}=\left(\mathbf{E}_{-nd_{k}}\widehat{h_{k}}\right)_{k,n}, where \widehat{h_{k}}(\xi)=\overline{H_{k}(e^{2\pi i\xi})} denotes the Fourier transform of \overline{h_{k}[-\cdot]} and the operator \mathbf{E}_{\omega} denotes modulation, i.e. \mathbf{E}_{-nd_{k}}\widehat{h_{k}}(\xi)=\widehat{h_{k}}(\xi)e^{-2\pi ind_{k}\xi}.
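The identity above can be verified on finite sequences; the following sketch (our own, with hypothetical variable names) compares the downsampled convolution with explicit inner products against the atoms \overline{h_{k}[nd_{k}-\cdot]}:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(32) + 1j * rng.standard_normal(32)
hk = rng.standard_normal(6) + 1j * rng.standard_normal(6)
dk = 4

y_direct = np.convolve(x, hk)[::dk]      # d_k-downsampled convolution, Eq. (20)

# Frame atoms phi_{k,n}[l] = conj(h_k[n d_k - l]), restricted to the
# support of x; <x, phi> = sum_l x[l] * conj(phi[l]).
Lx, Lh = len(x), len(hk)
y_frame = []
for n in range((Lx + Lh - 1 + dk - 1) // dk):
    atom = np.zeros(Lx, dtype=complex)
    for l in range(Lx):
        m = n * dk - l
        if 0 <= m < Lh:
            atom[l] = np.conj(hk[m])
    y_frame.append(np.vdot(atom, x))     # np.vdot conjugates its first argument
y_frame = np.array(y_frame)
# y_frame equals y_direct entry by entry.
```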

If \Phi satisfies at least the upper frame inequality in Eq. (6), then the frame operators \mathbf{S}_{\Phi} and \mathbf{S}_{\widehat{\Phi}} are related by the matrix Fourier transform [2]:

\mathbf{S}_{\widehat{\Phi}}=\mathcal{F}_{DT}\,\mathbf{S}_{\Phi}\,\mathcal{F}_{DT}^{-1},

where \mathcal{F}_{DT} denotes the discrete-time Fourier transform. Since the matrix Fourier transform is a unitary operation, the study of the frame properties of \Phi reduces to the study of the operator \mathbf{S}_{\widehat{\Phi}}. In the context of FBs, the frame operator can be expressed as the action of an FB with analysis filters h_{k}, downsampling and upsampling factors d_{k}, and synthesis filters \overline{h_{k}[-\cdot]}. That is, the synthesis filters are given by the time-reversed, conjugate impulse responses of the analysis filters. This is a very common approach to FB synthesis, but note that it only gives perfect reconstruction if the system constitutes a Parseval frame, see Prop. 1. The z-transform of a time-reversed, conjugated signal is given by \mathcal{Z}(\overline{h[-\cdot]})(z)=\overline{\mathcal{Z}(h)(1/\overline{z})}. Inserting this into the alias domain representation of the FB (23) yields

(32) \mathbf{S}_{\widehat{\Phi}}X(z)=\frac{1}{D}\left[X(W_{D}^{0}z)\cdots X(W_{D}^{D-1}z)\right]\mathbfit{H}(z)\left[\begin{array}{c}\overline{H_{0}(1/\overline{z})}\\ \vdots\\ \overline{H_{K}(1/\overline{z})}\end{array}\right]

or, restricted to the Fourier domain

(33) \mathbf{S}_{\widehat{\Phi}}X(e^{2\pi i\xi})=[X(e^{2\pi i(\xi+0/D)})\cdots X(e^{2\pi i(\xi+(D-1)/D)})]\,\mathcal{H}(\xi),

with

(34) \mathcal{H}(\xi):=[\mathcal{H}_{0}(\xi),\ldots,\mathcal{H}_{D-1}(\xi)]^{T}:=\frac{1}{D}\mathbfit{H}(e^{2\pi i\xi})\left[\overline{H_{0}(e^{2\pi i\xi})},\ldots,\overline{H_{K}(e^{2\pi i\xi})}\right]^{T},

for \xi\in\mathbb{T}=\mathbb{R}/\mathbb{Z}. Here, we used \overline{1/e^{2\pi i\omega}}=e^{2\pi i\omega} for all \omega\in\mathbb{R}. We call \mathcal{H}_{0} the frequency response and \mathcal{H}_{n}, n=1,\ldots,D-1, the alias components of the FB.

Another way to derive Eq. (33) is by using the Walnut representation of the frame operator for the nonstationary Gabor frame \widehat{\Phi}=\left(\mathbf{E}_{-nd_{k}}\widehat{h_{k}}\right)_{k,n}, first introduced in [28] for the continuous setting.

Proposition 4.

Let \widehat{\Phi}=\left(\mathbf{E}_{-nd_{k}}\widehat{h_{k}}\right)_{k\in\{0,\ldots,K\},n\in\mathbb{Z}}, with \widehat{h_{k}}\in L^{2}(\mathbb{T}) being (essentially) bounded and d_{k}\in\mathbb{N}. Then the frame operator \mathbf{S}_{\widehat{\Phi}} admits the Walnut representation

(35) \mathbf{S}_{\widehat{\Phi}}\widehat{x}(\xi)=\sum_{k=0}^{K}\sum_{n=0}^{d_{k}-1}d_{k}^{-1}\widehat{h_{k}}(\xi)\overline{\widehat{h_{k}}(\xi-nd_{k}^{-1})}\widehat{x}(\xi-nd_{k}^{-1}),

for almost every \xi\in\mathbb{T} and all \widehat{x}\in L^{2}(\mathbb{T}).

Proof.

By the definition of the frame operator, see Eq. (8), we have

\mathbf{S}_{\widehat{\Phi}}\widehat{x}(\xi)=\sum\limits_{k,n}\left<\widehat{x},\widehat{h_{k}}e^{-2\pi ind_{k}\xi}\right>\widehat{h_{k}}(\xi)e^{-2\pi ind_{k}\xi}.

Note that

\sum_{n\in\mathbb{Z}}\langle\widehat{x},e^{-2\pi i\xi nd_{k}}\widehat{h_{k}}\rangle e^{-2\pi i\xi nd_{k}}=\sum_{n\in\mathbb{Z}}\mathcal{F}_{DT}^{-1}(\widehat{x}\overline{\widehat{h_{k}}})[nd_{k}]e^{-2\pi i\xi nd_{k}}.

Applying Poisson’s summation formula, see e.g. [40], to the right-hand side then yields the result. ∎

The sums in (35) can be reordered to obtain

\sum_{n=0}^{D-1}\widehat{x}(\xi-nD^{-1})\sum_{k\in K_{n}}d_{k}^{-1}\widehat{h_{k}}(\xi)\overline{\widehat{h_{k}}(\xi-nD^{-1})},

where K_{n}=\{k\in\{0,\ldots,K\}:nD^{-1}=jd_{k}^{-1}\text{ for some }j\in\mathbb{N}\}. Inserting \widehat{h_{k}}(\xi)=\overline{H_{k}(e^{2\pi i\xi})} and comparing with the definition of \mathcal{H}_{n} in (34), we see that

\sum_{k\in K_{n}}d_{k}^{-1}\widehat{h_{k}}(\xi)\overline{\widehat{h_{k}}(\xi-nD^{-1})}=\sum_{k\in K_{n}}d_{k}^{-1}\overline{H_{k}(e^{2\pi i\xi})}H_{k}(e^{2\pi i(\xi-n/D)})=\mathcal{H}_{n}(\xi)

for almost every \xi\in\mathbb{T} and all n=0,\ldots,D-1. Hence, we recover the representation of the frame operator in (33), as expected. What makes Proposition 4 so interesting is that it facilitates the derivation of some important sufficient frame conditions. The first is a generalization of the theory of painless non-orthogonal expansions by Daubechies et al. [27], see also [6] for a direct proof.

Corollary 2.

Let \widehat{\Phi}=\left(\mathbf{E}_{-nd_{k}}\widehat{h_{k}}\right)_{k\in\{0,\ldots,K\},n\in\mathbb{Z}}, with \widehat{h_{k}}\in L^{2}(\mathbb{T}) and d_{k}\in\mathbb{N}. Assume that, for all 0\leq k\leq K, there is I_{k}\subseteq\mathbb{T} with |I_{k}|\leq d_{k}^{-1} and \widehat{h_{k}}(\xi)=0 for almost every \xi\in\mathbb{T}\setminus I_{k}. Then \widehat{\Phi} is a frame if and only if there are A,B such that

(36) 0<A\leq\sum_{k=0}^{K}d_{k}^{-1}|\widehat{h_{k}}|^{2}=\mathcal{H}_{0}\leq B<\infty,\text{ a.e. }

Moreover, a dual frame for \widehat{\Phi} is given by \widehat{\Psi}=\left(\mathbf{E}_{-nd_{k}}\widehat{g_{k}}\right)_{k\in\{0,\ldots,K\},n\in\mathbb{Z}}, where

(37) \widehat{g_{k}}(\xi)=\frac{\widehat{h_{k}}(\xi)}{\mathcal{H}_{0}(\xi)}\text{ a.e.}
Proof.

First, note that the existence of the upper bound B is equivalent to \widehat{h_{k}}\in L^{\infty}(\mathbb{T}) for all k=0,\ldots,K. It is easy to see that, under the given assumptions, Eq. (35) equals

\mathbf{S}_{\widehat{\Phi}}\widehat{x}(\xi)=\widehat{x}(\xi)\sum_{k=0}^{K}d_{k}^{-1}|\widehat{h_{k}}|^{2}(\xi)=\widehat{x}(\xi)\cdot\mathcal{H}_{0}(\xi).

Hence, \mathbf{S}_{\widehat{\Phi}} is invertible if and only if \mathcal{H}_{0} is bounded above and below, proving the first part. Moreover, \mathbf{S}_{\widehat{\Phi}}^{-1} is given by pointwise multiplication with 1/\mathcal{H}_{0} and therefore the elements of the canonical dual frame for \widehat{\Phi}, defined in Eq. (13), are given by

\mathbf{S}_{\widehat{\Phi}}^{-1}\mathbf{E}_{-nd_{k}}\widehat{h_{k}}=\frac{\mathbf{E}_{-nd_{k}}\widehat{h_{k}}}{\mathcal{H}_{0}}=\mathbf{E}_{-nd_{k}}\frac{\widehat{h_{k}}}{\mathcal{H}_{0}}=\widehat{g_{k}}. ∎

In other words, recalling \widehat{h_{k}}(\xi)=\overline{H_{k}(e^{2\pi i\xi})}: if the filters h_{k} are strictly band-limited, the downsampling factors d_{k} are small enough, and 0<A\leq\mathcal{H}_{0}\leq B<\infty almost everywhere, then we obtain a perfect reconstruction system with synthesis filters g_{k} defined by

G_{k}(e^{2\pi i\xi})=\frac{\overline{H_{k}(e^{2\pi i\xi})}}{\mathcal{H}_{0}(\xi)}.
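A finite, circular illustration of the painless case (our own sketch on \mathbb{C}^{L}, not the LTFAT implementation): when each filter is supported on at most L/d DFT bins, the frame operator acts as pointwise multiplication by \mathcal{H}_{0}, and dividing by \mathcal{H}_{0} yields dual filters with perfect reconstruction:

```python
import numpy as np

L, d, K = 32, 4, 4                     # each filter occupies L/d = 8 DFT bins
H = np.zeros((K, L))
for k in range(K):
    H[k, 8 * k:8 * k + 8] = 1.0        # flat, non-overlapping pass-bands

H0 = (np.abs(H) ** 2).sum(axis=0) / d  # frequency response of the frame operator
G = np.conj(H) / H0                    # dual filters, cf. Eq. (37)

rng = np.random.default_rng(5)
x = rng.standard_normal(L)
X = np.fft.fft(x)

# Analysis: y_k = downsample(ifft(H_k . X), d)
Y = [np.fft.ifft(Hk * X)[::d] for Hk in H]

# Synthesis: x_rec = sum_k ifft(G_k . fft(upsample(y_k, d)))
x_rec = np.zeros(L, dtype=complex)
for Gk, yk in zip(G, Y):
    up = np.zeros(L, dtype=complex)
    up[::d] = yk
    x_rec += np.fft.ifft(Gk * np.fft.fft(up))
x_rec = x_rec.real                     # perfect reconstruction: x_rec == x
```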

The second, more general and more interesting condition can be likened to a diagonal dominance result: if the main term \mathcal{H}_{0} dominates the sum of the magnitudes of the alias components \mathcal{H}_{n}, n=1,\ldots,D-1, then the FB analysis provided by the filters h_{k} and downsampling factors d_{k} is invertible.

Proposition 5.

Let \widehat{\Phi}=\left(\mathbf{E}_{-nd_{k}}\widehat{h_{k}}\right)_{k\in\{0,\ldots,K\},n\in\mathbb{Z}}, with \widehat{h_{k}}\in L^{2}(\mathbb{T}) and d_{k}\in\mathbb{N}. If there are 0<A\leq B<\infty with

(38) A\leq\sum_{k=0}^{K}d_{k}^{-1}|\widehat{h_{k}}|^{2}(\xi)\pm\sum_{k=0}^{K}\sum_{n=1}^{d_{k}-1}d_{k}^{-1}\left|\widehat{h_{k}}(\xi)\widehat{h_{k}}(\xi-nd_{k}^{-1})\right|\leq B,

for almost every \xi\in\mathbb{T}, then \widehat{\Phi} forms a frame with frame bounds A,B.

Note that (38) implies \widehat{h_{k}}\in L^{\infty}(\mathbb{T}) for all k\in\{0,\ldots,K\}. Therefore, Proposition 4 applies to any FB that satisfies (38). The proof of Proposition 5 is somewhat lengthy and we omit it here. It is very similar to the proof of the analogous conditions for Gabor and wavelet frames that can be found in [26] for the continuous case. It can also be seen as a corollary of [24, Theorem 3.4], covering a more general setting. A few things should be noted regarding Proposition 5.

(a) As mentioned before, this is a sort of diagonal dominance result. While the sum \sum_{k=0}^{K}d_{k}^{-1}|\widehat{h_{k}}|^{2}(\xi) corresponds to \mathcal{H}_{0}, we have

\sum_{k=0}^{K}\sum_{n=1}^{d_{k}-1}d_{k}^{-1}\left|\widehat{h_{k}}(\xi)\widehat{h_{k}}(\xi-nd_{k}^{-1})\right|=\sum_{n=1}^{D-1}|\mathcal{H}_{n}|(\xi).

Since the finite number of channels guarantees the existence of B if and only if \widehat{h_{k}}\in L^{\infty}(\mathbb{T}) for all k=0,\ldots,K, the result implies that the FB analysis provided by the h_{k}'s and d_{k}'s is invertible whenever

\mathcal{H}_{0}-\sum_{n=1}^{D-1}|\mathcal{H}_{n}|\geq A>0,\text{ almost everywhere.}

(b) No explicit dual frame is provided by Proposition 5. So, while we can determine invertibility quite easily, provided the Fourier transforms of the filters can be computed, the actual inversion process remains open. In fact, it is unclear whether there are synthesis filters g_{k} such that the h_{k}'s and g_{k}'s form a perfect reconstruction system with down-/upsampling factors d_{k}. We consider here two possible means of recovering the original signal X from the sub-band components Y_{k}.

First, the equivalent uniform FB, comprised of the filters h^{(l)}_{k}, for l\in\{0,\ldots,q_{k}-1\} and all k\in\{0,\ldots,K\}, with downsampling factor D=\operatorname{lcm}(d_{k}:k\in\{0,\ldots,K\}) can be constructed. Since the non-uniform FB forms a frame, so does its uniform equivalent, and hence the existence of a dual FB g^{(l)}_{k}, for l\in\{0,\ldots,q_{k}-1\} and all k\in\{0,\ldots,K\}, is guaranteed. Note that the g^{(l)}_{k} are not necessarily delayed versions of g^{(0)}_{k}, as is the case for the h^{(l)}_{k}. Then, the structure of the alias domain representation in (23) with g_{k}=\overline{h_{k}[-\cdot]} can be exploited [14] to obtain perfect reconstruction synthesis. In the finite, discrete setting, i.e. when considering signals in \mathbb{R}^{L} (\mathbb{C}^{L}), a dual FB can be computed explicitly and efficiently by a generalization of the methods presented by Strohmer [86], see also [75]. In practice, both the storage and time efficiency of computing the dual uniform FB rely crucially on D=\operatorname{lcm}(d_{k}:k\in\{0,\ldots,K\}) being small, i.e. \sum_{k}q_{k} not being much larger than K+1.

If that is not the case, the frame property of \widehat{\Phi}=\left(\mathbf{E}_{-nd_{k}}\widehat{h_{k}}\right)_{k\in\{0,\ldots,K\},n\in\mathbb{Z}} guarantees the convergence of the Neumann series

(39) \mathbf{S}_{\widehat{\Phi}}^{-1}=\frac{2}{A_{0}+B_{0}}\sum_{l=0}^{\infty}\left(\mathbf{I}-\frac{2}{A_{0}+B_{0}}\mathbf{S}_{\widehat{\Phi}}\right)^{l},

where 0<A_{0}\leq B_{0}<\infty are the optimal frame bounds of \widehat{\Phi}. Instead of computing the elements of any dual frame explicitly, we can apply the inverse frame operator to the FB output

(40) \tilde{X}(z)=\mathbf{S}_{\widehat{\Phi}}X(z)=\sum_{k=0}^{K}Y_{k}(z^{d_{k}})H_{k}(z),

obtaining \mathbf{S}_{\widehat{\Phi}}^{-1}\tilde{X}=X. This can be implemented with the frame algorithm [29, 39]. However, any frame operator is positive definite and self-adjoint, allowing for extremely efficient implementation via the conjugate gradients (CG) algorithm [39, 87]. In addition to a significant boost in efficiency compared to the frame algorithm, the CG algorithm does not require an estimate of the optimal frame bounds A_{0},B_{0}, and its convergence speed depends solely on the condition number of \mathbf{S}_{\widehat{\Phi}}. It provides guaranteed, exact convergence in L steps for signals in \mathbb{C}^{L}, where every step essentially comprises one analysis and one synthesis step with the filters h_{k} and g_{k}=\overline{h_{k}[-\cdot]}, respectively. If, furthermore, \mathcal{H}_{0}\gg\sum_{n=1}^{D-1}|\mathcal{H}_{n}|, then the convergence speed can be further increased by preconditioning [8], considering instead the operator defined by

\widetilde{\mathbf{S}_{\widehat{\Phi}}}X(e^{2\pi i\xi})=\mathcal{H}_{0}(\xi)^{-1}\mathbf{S}_{\widehat{\Phi}}X(e^{2\pi i\xi}).

Recall the analysis/synthesis operators \mathbf{C}_{\Phi},\mathbf{D}_{\Phi} (see Sec. 3.1.1) associated to a frame \Phi, which are equivalent to the analysis/synthesis stages of the FB. More specifically, the CG algorithm is employed to solve the system \mathbf{D}_{\Phi}c=\mathbf{S}_{\Phi}x for x, given the coefficients c. The preconditioned case can be implemented most efficiently by precomputing an approximate dual FB, defined by G_{k}(e^{2\pi i\xi})=\mathcal{H}_{0}(\xi)^{-1}H_{k}(e^{2\pi i\xi}), and solving instead

\mathbf{D}_{\Psi}c=\mathcal{F}^{-1}\mathcal{H}_{0}(\xi)^{-1}\mathbf{S}_{\widehat{\Phi}}\mathcal{F}x=\mathbf{D}_{\Psi}\mathbf{C}_{\Phi}x,\ \text{where }\Psi=\{\overline{g_{k}[nd_{k}-\cdot]}\}_{k,n},

for xx, given the coefficients cc. Algorithm 1 shows a pseudo-code implementation of such a preconditioned CG scheme, available in the LTFAT Toolbox as the routine ifilterbankiter.

Algorithm 1 Iterative synthesis: \tilde{x}=\textbf{FBSYN}^{it}(c,(h_{k},g_{k},d_{k})_{k},\lambda)
1: Initialize x_{0}=0, k=0
2: b\leftarrow\mathbf{D}_{\Psi}c
3: r_{0}\leftarrow b
4: h_{0},p_{0}\leftarrow r_{0}
5: repeat
6:   q_{k}\leftarrow\mathbf{D}_{\Psi}(\mathbf{C}_{\Phi}p_{k})
7:   \alpha_{k}\leftarrow\frac{\langle r_{k},h_{k}\rangle}{\langle p_{k},q_{k}\rangle}
8:   x_{k+1}\leftarrow x_{k}+\alpha_{k}p_{k}
9:   r_{k+1}\leftarrow r_{k}-\alpha_{k}q_{k}
10:  h_{k+1}\leftarrow r_{k+1}
11:  \beta_{k}\leftarrow\frac{\langle r_{k+1},h_{k+1}\rangle}{\langle r_{k},h_{k}\rangle}
12:  p_{k+1}\leftarrow h_{k+1}+\beta_{k}p_{k}
13:  k\leftarrow k+1
14: until \|r_{k}\|\leq\lambda
15: \tilde{x}\leftarrow x_{k}
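For illustration, the following Python sketch (our own; the unpreconditioned variant with \Psi=\Phi, circular on \mathbb{C}^{L}, and all names hypothetical) implements the CG loop of Algorithm 1 for a small painless FB and recovers a signal from its sub-band coefficients:

```python
import numpy as np

L, d = 32, 4
H = np.array([np.where(np.arange(L) // 8 == k, 1.0 + 0.5 * k, 0.0)
              for k in range(4)])      # flat pass-bands with unequal gains

def C(x):                              # FB analysis: all sub-band components y_k
    X = np.fft.fft(x)
    return [np.fft.ifft(Hk * X)[::d] for Hk in H]

def D(y):                              # FB synthesis with conj(h_k[-.]): adjoint of C
    out = np.zeros(L, dtype=complex)
    for Hk, yk in zip(H, y):
        up = np.zeros(L, dtype=complex)
        up[::d] = yk
        out += np.fft.ifft(np.conj(Hk) * np.fft.fft(up))
    return out

def cg_synthesis(c, tol=1e-10, maxit=100):
    """Recover x from c = C(x) by CG on S x = D(c), with S = D o C."""
    x = np.zeros(L, dtype=complex)
    r = D(c)                           # residual b - S x_0 with x_0 = 0
    p = r.copy()
    rs = np.vdot(r, r).real
    for _ in range(maxit):
        q = D(C(p))                    # one analysis + one synthesis per step
        alpha = rs / np.vdot(p, q).real
        x += alpha * p
        r -= alpha * q
        rs_new = np.vdot(r, r).real
        if np.sqrt(rs_new) <= tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(6)
x = rng.standard_normal(L)
x_rec = cg_synthesis(C(x)).real        # recovers x from its FB coefficients
```

Since the filters here are painless with only a few distinct values of \mathcal{H}_{0}, CG converges in a handful of iterations.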

5. Frame Theory: Psychoacoustics-motivated Applications

5.1. A perfectly invertible, perceptually-motivated filter bank

The concept of auditory filters lends itself nicely to implementation as an FB. As motivated in Sec. 1, it can be expected that many audio signal processing applications greatly benefit from an invertible FB representation adapted to the auditory time-frequency resolution. Although the auditory system exhibits significant nonlinear behavior, a linear representation is desirable because its results are much more predictable than those of representations accounting for nonlinear effects. We call such a system a perceptually-motivated FB, to distinguish it from auditory FBs that attempt to mimic the nonlinearities of the auditory system. Note that, as mentioned in Section 2.2, the first step in many auditory FBs is the computation of a perceptually-motivated FB, see e.g. [49]. The AUDlet FBs we present here are a family of perceptually-motivated FBs that satisfy a perfect reconstruction property, offer flexible redundancy, and enable efficient implementation. They were introduced in [65, 66] and an implementation is available in the LTFAT Toolbox [80].

The AUDlet FB has a general non-uniform structure as presented in Fig. 8, with analysis filters H_{k}(z), synthesis filters G_{k}(z), and downsampling and upsampling factors d_{k}. Considering only real-valued signals allows us to deal with symmetric \mathcal{F}_{DT}s and process only the positive-frequency range. Therefore, let K denote the number of filters in the frequency range [f_{\mathrm{min}},f_{\mathrm{max}}]\cap[0,f_{s}/2[, where 0\leq f_{\mathrm{min}}<f_{\mathrm{max}}\leq f_{s}/2 and f_{s}/2 is the Nyquist frequency, i.e. half the sampling frequency. If f_{\mathrm{min}}>0, this range is complemented by an additional filter at the zero frequency. Furthermore, another filter is always positioned at the Nyquist frequency to ensure that the full frequency range is covered. Thus, all FBs below feature K+1 filters in total and their redundancy is given by R=d_{0}^{-1}+2\sum_{k=1}^{K-1}d_{k}^{-1}+d_{K}^{-1}, since the coefficients in the 1st to (K-1)-th subbands are complex-valued.

Parameter | Role | Information
$f_{\mathrm{min}}$ | minimum frequency in Hz | $f_{\mathrm{min}} \in [0, f_s/2[$, $f_{\mathrm{min}} < f_{\mathrm{max}}$
$f_{\mathrm{max}}$ | maximum frequency in Hz | $f_{\mathrm{max}} \in ]0, f_s/2[$, $f_{\mathrm{max}} > f_{\mathrm{min}}$
$f_k$ | center frequencies in Hz | $f_k = F_{\mathrm{AUD}}^{-1}\left(F_{\mathrm{AUD}}(f_0) + k/V\right)$
$K$ | (essential) number of channels | $K = V\left(F_{\mathrm{AUD}}(\xi_{\mathrm{max}}) - F_{\mathrm{AUD}}(f_{\mathrm{min}})\right) + (1 - \delta_{0, f_{\mathrm{min}}})$
$V$ | channels per scale unit | $V = \left(F_{\mathrm{AUD}}(f_{k+1}) - F_{\mathrm{AUD}}(f_k)\right)^{-1}$, $k \in [1, K-2]$
$w$ | frequency domain filter prototype | $w \in L^2(\mathbb{T})$
$\Gamma_k$ | dilation factors | $\Gamma_k = r_{bw}\, BW_{\mathrm{AUD}}(f_k)$, $r_{bw} > 0$ (default $= 1$)
$H_k$ | filter transfer functions | $H_k(e^{2i\pi\xi}) = \Gamma_k^{-1/2}\, w\left(\frac{f_s \xi - f_k}{\Gamma_k}\right)$
$d_k$ | downsampling factors | $d_k = r_d\, BW_{\mathrm{AUD}}^{-1}(\xi_k)$, $r_d > 0$ (default non-uniform $= 1$)
$R$ | redundancy | $R = d_0^{-1} + 2\sum_{k=1}^{K-1} d_k^{-1} + d_K^{-1}$
Table 1. Parameters of the perceptually-motivated AUDlet FB
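The redundancy formula in Table 1 is easy to check numerically. The following minimal sketch counts the coefficients produced per input sample; the downsampling factors used in the example are illustrative values, not the defaults of the AUDlet implementation:

```python
def fb_redundancy(d):
    """Redundancy R = 1/d_0 + 2 * sum_{k=1}^{K-1} 1/d_k + 1/d_K for a FB with
    K+1 channels whose first and last subbands are real-valued and whose
    remaining subbands are complex-valued (hence the factor 2)."""
    K = len(d) - 1
    return 1.0 / d[0] + 2.0 * sum(1.0 / dk for dk in d[1:K]) + 1.0 / d[K]

# Example: uniform downsampling by 8 in K+1 = 11 channels.
d = [8] * 11
print(fb_redundancy(d))  # -> 0.125 + 2 * 9/8 + 0.125 = 2.5
```

Without downsampling (all $d_k = 1$), the redundancy is simply $2K$, reflecting the doubling from the complex-valued subbands.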

The AUDlet filters $H_k$, $k \in \{0, \ldots, K\}$, are constructed in the frequency domain by

(41)  $H_k(e^{2i\pi\xi}) = \Gamma_k^{-\frac{1}{2}}\, w\!\left(\frac{f_s\, \xi - f_k}{\Gamma_k}\right)$

where $w(\xi)$ is a prototype filter shape with bandwidth $1$ and center frequency $0$. Here, the shape factor $\Gamma_k$ controls the effective bandwidth of $H_k$ and $f_k$ determines its center frequency. The factor $\Gamma_k^{-1/2}$ ensures that all filters (i.e., for all $k$) have the same energy. To obtain filters equidistantly spaced on a perceptual frequency scale, the sets $\{f_k\}$ and $\{\Gamma_k\}$ are calculated using the corresponding $F_{\mathrm{AUD}}$ and $BW_{\mathrm{AUD}}$ formulas; see Tab. 1 for more information on the AUDlet parameters and their relations. Since we emphasize inversion, the default analysis parameters are chosen such that the filters $H_k$ and downsampling factors $d_k$ form a frame. As an example, the AUDlet (a) and gammatone (b) analyses of a speech signal are shown in Fig. 10 using AUD = ERB and $V = 6$ filters per ERB. The filter prototype $w$ for the AUDlet was a Hann window. The two signal representations are very similar over the whole time-frequency plane. Since the gammatone filter is an acknowledged auditory filter model, this indicates that the time-frequency resolution of the AUDlet approximates the auditory resolution well.
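The construction in Eq. (41) can be sketched numerically. The snippet below is a sketch under stated assumptions: the standard ERB-rate and ERB-bandwidth formulas of Glasberg and Moore [37] are used for $F_{\mathrm{AUD}}$ and $BW_{\mathrm{AUD}}$, the prototype $w$ is a Hann window with bandwidth $1$ centered at $0$, $r_{bw} = 1$, and the values of $f_s$, $f_{\mathrm{min}}$, $f_{\mathrm{max}}$, and $V$ are example parameters (with $f_{\mathrm{max}}$ chosen low enough that all filters fit below the Nyquist frequency):

```python
import numpy as np

def F_aud(f):
    """ERB-rate scale (Glasberg & Moore): Hz -> ERB units."""
    return 21.4 * np.log10(1 + 0.00437 * f)

def F_aud_inv(E):
    """Inverse ERB-rate scale: ERB units -> Hz."""
    return (10 ** (E / 21.4) - 1) / 0.00437

def BW_aud(f):
    """ERB bandwidth in Hz at center frequency f."""
    return 24.7 * (4.37 * f / 1000 + 1)

def w(x):
    """Hann prototype with bandwidth 1 and center frequency 0."""
    return np.where(np.abs(x) <= 0.5, np.cos(np.pi * x) ** 2, 0.0)

fs, f_min, f_max, V = 16000, 100.0, 6000.0, 6     # example parameters
# Center frequencies f_k equidistant on the ERB scale, 1/V ERB apart.
E = np.arange(F_aud(f_min), F_aud(f_max), 1.0 / V)
f_c = F_aud_inv(E)
Gamma = BW_aud(f_c)                                # dilation factors (r_bw = 1)
xi = np.linspace(0.0, 0.5, 4096)                   # normalized frequency axis
# Eq. (41): H_k(e^{2 i pi xi}) = Gamma_k^{-1/2} * w((fs*xi - f_k) / Gamma_k)
H = np.stack([g ** -0.5 * w((fs * xi - f) / g) for f, g in zip(f_c, Gamma)])
```

Up to discretization of the frequency axis, all rows of `H` carry the same energy, since the factor $\Gamma_k^{-1/2}$ compensates for the dilation.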

Figure 10. Analyses of a female speech signal taken from the TIMIT database [36] by (a) the AUDlet FB and (b) the gammatone FB using $V = 6$ filters per ERB ($K = 201$). It can be seen that the two signal representations are very similar over the whole time-frequency plane.

5.2. Perceptual Sparsity

As discussed in Sec. 2.3, not all components of a sound are perceived. This effect can be described by masking models and naturally leads to the following question: given a time-frequency representation, or any representation linked to audio, how can we apply that knowledge to include only the audible coefficients in the synthesis? In an attempt to answer this question, efforts were made to combine frame theory and masking models into a concept called the Irrelevance Filter. This concept is related to the currently very prominent sparsity and compressed sensing approaches, see e.g. [35, 31] for an overview. To reduce the number of non-zero coefficients, the irrelevance filter uses a perceptual measure of sparsity, hence perceptual sparsity. Perceptual and compressed sparsity can certainly be combined, see e.g. [21]. As in the methods used in compressed sensing, a redundant representation offers an advantage for perceptual sparsity as well, since the same signal can be reconstructed from several sets of coefficients.

The concept of the irrelevance filter was first introduced in [30] and fully developed in [9]. It consists in removing the inaudible atoms in a Gabor transform while causing no audible difference to the original sound after re-synthesis. More precisely, an adaptive threshold function is calculated for each spectrum (i.e. at each time slice) of the Gabor transform using a simple model of spectral masking (see Sec. 2.3.1), resulting in the so-called irrelevance threshold. Then, the amplitudes of all atoms falling below the irrelevance threshold are set to zero and the inverse transform is applied to the set of modified Gabor coefficients. This corresponds to an adaptive Gabor frame multiplier with coefficients in $\{0,1\}$. The application of the irrelevance filter to a musical signal sampled at 16 kHz is shown in Fig. 11. A Matlab implementation of the algorithm proposed in [9] was used. All Gabor transform and filter parameters were identical to those mentioned in [9]. Notably, the offset parameter $o$ was set to -2.59 dB. In this particular example, about 48% of the components were removed without causing any audible difference to the original sound after re-synthesis (as judged by informal listening by the authors). A formal listening test performed in [9] with 36 normal-hearing listeners and various musical and speech signals indicated that, on average, 36% of the coefficients can be removed without causing any audible artifact in the re-synthesis.
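The masking step itself is simple to express: given the Gabor coefficients and an irrelevance threshold, build a binary mask and multiply pointwise. The sketch below illustrates only this mechanism; the per-frame threshold used here is a crude stand-in (a fixed offset below each frame's spectral maximum), not the spectral masking model of [9], and the coefficient matrix is random placeholder data:

```python
import numpy as np

def irrelevance_mask(c, offset_db=-10.0):
    """Zero out coefficients below a per-time-slice threshold.

    c: complex Gabor/STFT coefficients, shape (frequency, time).
    offset_db: threshold offset below each frame's maximum (toy model, NOT
               the spectral masking model of the actual irrelevance filter).
    Returns the masked coefficients and the binary {0, 1} mask.
    """
    mag_db = 20.0 * np.log10(np.abs(c) + 1e-12)
    threshold = mag_db.max(axis=0, keepdims=True) + offset_db
    mask = (mag_db >= threshold).astype(float)
    return c * mask, mask

# Placeholder coefficients standing in for a real Gabor transform.
rng = np.random.default_rng(0)
c = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
c_masked, mask = irrelevance_mask(c)
removed = 1.0 - mask.mean()   # fraction of coefficients set to zero
```

Applying the inverse transform to `c_masked` then yields the re-synthesized signal; with a Gabor frame, this pointwise multiplication is exactly a Gabor frame multiplier with symbol in $\{0,1\}$.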

(a) Original Gabor transform
(b) Binary mask
(c) Masked Gabor transform
(d) Irrelevance threshold
Figure 11. Example application of the irrelevance filter as implemented in [9] to a music signal (excerpt from the song “Heart of Steel” from Manowar). (a) Squared magnitude of the Gabor transform (in dB). (b) Binary mask estimated from the irrelevance threshold. White = 1, black = 0. (c) Squared magnitude (in dB) of the masked Gabor transform, i.e. the result of the point-wise multiplication between the original transform and the binary mask. (d) Amplitudes (in dB) of the irrelevance threshold (bold straight line) and original spectrum (dashed line) at a given time slice.

The irrelevance filter as depicted here has shown very promising results, but the approach could be improved. Specifically, the main limitations of the algorithm are the fixed resolution of the Gabor transform and the use of a simple spectral masking model to predict masking in the time-frequency domain. Combining an invertible perceptually-motivated transform like the AUDlet FB (Sec. 5.1) with a model of time-frequency masking (Sec. 2.3.3) is expected to improve the performance of the filter. This is work in progress. Potential applications of perceptual sparsity include, for instance:

(1) Sound / Data Compression: For applications where perception is relevant, there is no need to encode perceptually irrelevant information. Data that cannot be heard should simply be omitted. A similar algorithm is, for example, used in the MP3 codec. If "over-masking" is used, i.e. the threshold is moved beyond the level of relevance, a higher compression rate can be reached [70].

(2) Sound Design: For the visualization of sounds, the perceptually irrelevant part can be disregarded. This is, for example, used in car sound design [13].

6. Conclusion

In this chapter, we have discussed some important concepts from hearing research and perceptual audio signal processing, such as auditory masking and auditory filter banks. Natural and important considerations served as a strong indicator that frame theory provides a solid foundation for the design of robust representations for perceptual signal analysis and processing. This connection was further reinforced by exposing the similarity between some concepts arising naturally in frame theory and signal processing, e.g. between frame multipliers and time-variant filters. Finally, we have shown how frame theory can be used to analyze and implement invertible filter banks, in a quite general setting where previous synthesis methods might fail or be highly inefficient. The Matlab/Octave code to reproduce the results presented in Secs. 3 and 5 of this chapter is available for download on the companion webpage https://www.kfs.oeaw.ac.at/frames_for_psychoacoustics.

It is likely that readers of this contribution who are researchers in psychoacoustics or audio signal processing have already used frames without being aware of the fact. We hope that such readers will, to some extent, grasp the basic principles of the rich mathematical background provided by frame theory and its importance to fundamental issues of signal analysis and processing. With that knowledge, we believe, they will be able to better understand the signal analysis tools they use and might even be able to design new techniques that further elevate their research.

On the other hand, researchers in applied mathematics or signal processing have been supplied with basic knowledge of some central psychoacoustics concepts. We hope that our short excursion piqued their interest and will serve as a starting point for applying their knowledge in the rich and various fields of psychoacoustics or perceptual signal processing.

Acknowledgments The authors acknowledge support from the Austrian Science Fund (FWF) START-project FLAME ("Frames and Linear Operators for Acoustical Modeling and Parameter Estimation"; Y 551-N13) and the French-Austrian ANR-FWF project POTION ("Perceptual Optimization of Time-Frequency Representations and Audio Coding"; I 1362-N30). They thank B. Laback for discussions and W. Kreuzer for help with graphics software.

References

  • [1] S. Akkarakaran and P. Vaidyanathan. Nonuniform filter banks: new results and open problems. In Beyond wavelets, volume 10 of Studies in Computational Mathematics, pages 259–301. Elsevier, 2003.
  • [2] P. Balazs. Regular and Irregular Gabor Multipliers with Application to Psychoacoustic Masking. PhD thesis, University of Vienna, 2005.
  • [3] P. Balazs. Basic definition and properties of Bessel multipliers. J. Math. Anal. Appl., 325(1):571–585, 2007.
  • [4] P. Balazs. Frames and finite dimensionality: frame transformation, classification and algorithms. Applied Mathematical Sciences, 2(41–44):2131–2144, 2008.
  • [5] P. Balazs, C. Cabrelli, S. B. Heineken, and U. Molter. Frames by multiplication. Current Development in Theory and Applications of Wavelets, 5(2-3):165–186, 2011.
  • [6] P. Balazs, M. Dörfler, F. Jaillet, N. Holighaus, and G. A. Velasco. Theory, implementation and applications of nonstationary Gabor frames. J. Comput. Appl. Math., 236(6):1481–1496, 2011.
  • [7] P. Balazs, M. Dörfler, M. Kowalski, and B. Torrésani. Adapted and adaptive linear time-frequency representations: a synthesis point of view. IEEE Signal Processing Magazine, 30(6):20–31, 2013.
  • [8] P. Balazs, H. G. Feichtinger, M. Hampejs, and G. Kracher. Double preconditioning for Gabor frames. IEEE Trans. Signal Process., 54(12):4597–4610, December 2006.
  • [9] P. Balazs, B. Laback, G. Eckel, and W. A. Deutsch. Time-frequency sparsity by removing perceptually irrelevant components using a simple model of simultaneous masking. IEEE Trans. Audio, Speech, Language Process., 18(1):34–49, 2010.
  • [10] P. Balazs and D. T. Stoeva. Representation of the inverse of a frame multiplier. J. Math. Anal. Appl., 422(2):981–994, 2015.
  • [11] N. K. Bari. Biorthogonal systems and bases in Hilbert space. Uch. Zap. Mosk. Gos. Univ., 148:69–107, 1951.
  • [12] J. J. Benedetto and A. Teolis. A wavelet auditory model and data compression. Appl. Comput. Harmon. Anal., 01.Jän:3–28, 1994.
  • [13] M. Bézat, V. Roussarie, T. Voinier, R. Kronland-Martinet, and S. Ystad. Car door closure sounds: Characterization of perceptual properties through analysis-synthesis approach. In Proceedings of the 19th International Congress on Acoustics (ICA), Madrid, Spain, September 2007.
  • [14] H. Bölcskei, F. Hlawatsch, and H. Feichtinger. Frame-theoretic analysis of oversampled filter banks. IEEE Trans. Signal Process., 46(12):3256–3268, December 1998.
  • [15] M. Bownik and J. Lemvig. The canonical and alternate duals of a wavelet frame. Appl. Comput. Harmon. Anal., 23(2):263–272, 2007.
  • [16] A. Bregman. Auditory Scene Analysis: The perceptual organization of sound. MIT Press, Cambridge, MA, USA, 1990.
  • [17] P. Casazza and G. Kutyniok. Finite Frames: Theory and Applications. Applied and Numerical Harmonic Analysis. Birkhäuser Boston, 2012.
  • [18] P. G. Casazza. The art of frame theory. Taiwanese J. Math., 4(2):129–201, 2000.
  • [19] P. G. Casazza and O. Christensen. Gabor frames over irregular lattices. Adv. Comput. Math., 18(2-4):329–344, 2003.
  • [20] L. Chai, J. Zhang, C. Zhang, and E. Mosca. Bound ratio minimization of filter bank frames. Signal Processing, IEEE Transactions on, 58(1):209 –220, 2010.
  • [21] G. Chardon, T. Necciari, and P. Balazs. Perceptual matching pursuit with Gabor dictionaries and time-frequency masking. In Proceedings of the 39th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), 2014.
  • [22] O. Christensen. An Introduction to Frames and Riesz Bases. Applied and Numerical Harmonic Analysis. Birkhäuser, Boston, 2003.
  • [23] O. Christensen. Pairs of dual Gabor frame generators with compact support and desired frequency localization. Appl. Comput. Harmon. Anal., 20(3):403–410, 2006.
  • [24] O. Christensen and S. S. Goh. Fourier-like frames on locally compact abelian groups. Journal of Approximation Theory, 192:82–101, 2015.
  • [25] Z. Cvetković and M. Vetterli. Oversampled filter banks. IEEE Trans. Signal Process., 46(5):1245–1255, 1998.
  • [26] I. Daubechies. Ten Lectures on Wavelets, volume 61 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, PA, 1992.
  • [27] I. Daubechies, A. Grossmann, and Y. Meyer. Painless nonorthogonal expansions. J. Math. Phys., 27(5):1271–1283, May 1986.
  • [28] M. Dörfler and E. Matusiak. Nonstationary Gabor frames - existence and construction. IJWMIP, 12(3), 2014.
  • [29] R. J. Duffin and A. C. Schaeffer. A class of nonharmonic Fourier series. Trans. Amer. Math. Soc., 72:341–366, 1952.
  • [30] G. Eckel. Ein Modell der Mehrfachverdeckung für die Analyse musikalischer Schallsignale. PhD thesis, University of Vienna, 1989.
  • [31] M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.
  • [32] H. Fastl and E. Zwicker. Psychoacoustics — Facts and Models. Springer, third edition, 2006.
  • [33] H. G. Feichtinger and K. Nowak. A first survey of Gabor multipliers. In H. G. Feichtinger and T. Strohmer, editors, Advances in Gabor Analysis, Appl. Numer. Harmon. Anal., pages 99–128. Birkhäuser, 2003.
  • [34] M. Fickus, M. L. Massar, and D. G. Mixon. Finite frames and filter banks. In Finite Frames, Applied and Numerical Harmonic Analysis, pages 337–379. Birkhäuser Boston, Cambridge, MA, USA, 2013.
  • [35] M. Fornasier. Theoretical Foundations and Numerical Methods for Sparse Recovery. Radon Series on Computational and Applied Mathematics 9. Walter de Gruyter, 2010.
  • [36] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Philadelphia: Linguistic Data Consortium, 1993.
  • [37] B. R. Glasberg and B. C. J. Moore. Derivation of auditory filter shapes from notched-noise data. Hear. Res., 47:103–138, 1990.
  • [38] D. D. Greenwood. A cochlear frequency-position function for several species—29 years later. J. Acoust. Soc. Am., 87(6):2592–2605, June 1990.
  • [39] K. Gröchenig. Acceleration of the frame algorithm. IEEE Trans. Signal Process., 41(12):3331–3340, December 1993.
  • [40] K. Gröchenig. Foundations of Time-Frequency Analysis. Appl. Numer. Harmon. Anal. Birkhäuser, Boston, MA, 2001.
  • [41] T. S. Gunawan, E. Ambikairajah, and J. Epps. Perceptual speech enhancement exploiting temporal masking properties of human auditory system. Speech Commun., 52(5):381 – 393, 2010.
  • [42] C. Heil. A Basis Theory Primer. Expanded ed. Applied and Numerical Harmonic Analysis. Birkhäuser, Basel, 2011.
  • [43] C. Heil and D. F. Walnut. Continuous and discrete wavelet transforms. SIAM Rev., 31:628–666, 1989.
  • [44] E. Hernández, D. Labate, and G. Weiss. A unified characterization of reproducing systems generated by a finite family. II. J. Geom. Anal., 12(4):615–662, 2002.
  • [45] H. G. Heuser. Functional analysis. Transl. by John Horvath. A Wiley-Interscience Publication. Chichester etc.: John Wiley & Sons. XV, 408 p., 1982.
  • [46] P. Q. Hoang and P. P. Vaidyanathan. Non-uniform multirate filter banks: theory and design. In Proceedings of the IEEE International Symposium on Circuits and Systems, pages 371–374, May 1989.
  • [47] N. Holighaus. Structure of nonstationary Gabor frames and their dual systems. Appl. Comput. Harmon. Anal., 37(3):442–463, November 2014.
  • [48] N. Holighaus, M. Dörfler, G. Velasco, and T. Grill. A framework for invertible, real-time constant-Q transforms. IEEE Audio, Speech, Language Process., 21(4):775–785, April 2013.
  • [49] T. Irino and R. D. Patterson. A dynamic compressive gammachirp auditory filterbank. IEEE Audio, Speech, Language Process., 14(6):2222–2232, November 2006.
  • [50] A. Janssen. From continuous to discrete Weyl-Heisenberg frames through sampling. J. Fourier Anal. Appl., 3(5):583–596, 1997.
  • [51] W. Jesteadt, S. P. Bacon, and J. R. Lehman. Forward masking as a function of frequency, masker level, and signal delay. J. Acoust. Soc. Am., 71(4):950–962, April 1982.
  • [52] A. Kern, C. Heid, W.-H. Steeb, N. Stoop, and R. Stoop. Biophysical parameters modification could overcome essential hearing gaps. PLoS Comput Biol, 4(8):e1000161, August 2008.
  • [53] G. Kidd Jr. and L. L. Feth. Patterns of residual masking. Hear. Res., 5:49–67, 1981.
  • [54] J. Kovačević and M. Vetterli. Perfect reconstruction filter banks with rational sampling factors. IEEE Trans. Signal Process., 41(6):2047–2066, 1993.
  • [55] B. Laback, P. Balazs, T. Necciari, S. Savel, S. Meunier, S. Ystad, and R. Kronland-Martinet. Additivity of nonsimultaneous masking for short Gaussian-shaped sinusoids. J. Acoust. Soc. Am., 129(2):888–897, February 2011.
  • [56] B. Laback, T. Necciari, P. Balazs, S. Savel, and S. Ystad. Simultaneous masking additivity for short Gaussian-shaped tones: Spectral effects. J. Acoust. Soc. Am., 134(2):1160–1171, August 2013.
  • [57] J. Leng, D. Han, and T. Huang. Optimal dual frames for communication coding with probabilistic erasures. IEEE Trans. Signal Process., 59(11):5380–5389, November 2011.
  • [58] E. A. Lopez-Poveda and R. Meddis. A human nonlinear filterbank. J. Acoust. Soc. Am., 110(6):3107–3118, December 2001.
  • [59] R. Lyon, A. Katsiamis, and E. Drakakis. History and future of auditory filter models. In Proc. ISCAS, pages 3809–3812, Paris, France, June 2010. IEEE.
  • [60] G. Matz and F. Hlawatsch. Time-frequency transfer function calculus (symbolic calculus) of linear time-varying systems (linear operators) based on a generalized underspread theory. J. Math. Phys., 39(8):4041–4070, 1998.
  • [61] G. Matz and F. Hlawatsch. Linear Time-Frequency Filters: On-line Algorithms and Applications, chapter 6 in ’Application in Time-Frequency Signal Processing’, pages 205–271. eds. A. Papandreou-Suppappola, Boca Raton (FL): CRC Press, 2002.
  • [62] B. C. J. Moore. An introduction to the psychology of hearing. Emerald Group Publishing, Bingley, UK, sixth edition, 2012.
  • [63] B. C. J. Moore, J. I. Alcántara, and T. Dau. Masking patterns for sinusoidal and narrow-band noise maskers. J. Acoust. Soc. Am., 104(2):1023–1038, August 1998.
  • [64] T. Necciari. Auditory time-frequency masking: Psychoacoustical measures and application to the analysis-synthesis of sound signals. PhD thesis, Aix-Marseille University, France, 2010.
  • [65] T. Necciari, P. Balazs, N. Holighaus, and P. Søndergaard. The ERBlet transform: An auditory-based time-frequency representation with perfect reconstruction. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), pages 498–502, 2013.
  • [66] T. Necciari, N. Holighaus, P. Balazs, Z. Průša, and P. Majdak. Frame-theoretic recipe for the construction of gammatone and perceptually motivated filter banks with perfect reconstruction. http://arxiv.org/abs/1601.06652.
  • [67] J. J. O’Donovan and D. J. Furlong. Perceptually motivated time-frequency analysis. J. Acoust. Soc. Am., 117(1):250–262, January 2005.
  • [68] A. V. Oppenheim and R. W. Schafer. Discrete-time Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1989.
  • [69] A. V. Oppenheim, R. W. Schafer, J. R. Buck, et al. Discrete-time signal processing, volume 2. Prentice hall Englewood Cliffs, NJ, 1989.
  • [70] T. Painter and A. Spanias. Perceptual coding of digital audio. In Proc. IEEE, volume 88, pages 451–515, April 2000.
  • [71] R. D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. H. Allerhand. Complex sounds and auditory images. In Auditory physiology and perception, Proceedings of the 9th International Symposium on Hearing, pages 429–446, Oxford, UK, 1992. Pergamond.
  • [72] N. Perraudin, N. Holighaus, P. Søndergaard, and P. Balazs. Gabor dual windows using convex optimization. In Proceeedings of the 10th International Conference on Sampling theory and Applications (SAMPTA 2013), 2013.
  • [73] C. J. Plack. The sense of hearing. Psychology Press, second edition, 2013.
  • [74] Z. Průša, P. Søndergaard, N. Holighaus, C. Wiesmeyr, and P. Balazs. The Large Time-Frequency Analysis Toolbox 2.0. In M. Aramaki, O. Derrien, R. Kronland-Martinet, and S. Ystad, editors, Sound, Music, and Motion, Lecture Notes in Computer Science, pages 419–442. Springer International Publishing, 2014.
  • [75] Z. Průša, P. Søndergaard, and P. Rajmic. Discrete Wavelet Transforms in the Large Time-Frequency Analysis Toolbox for Matlab/GNU Octave. ACM Trans. Math. Softw. To appear. Preprint available from http://ltfat.github.io/notes/ltfatnote038.pdf.
  • [76] E. Ravelli, G. Richard, and L. Daudet. Union of MDCT bases for audio coding. IEEE Trans. Audio, Speech, Language Process., 16(8):1361–1372, 2008.
  • [77] A. Ron and Z. Shen. Generalized shift-invariant systems. Constr. Approx., pages OF1–OF45, 2004.
  • [78] W. Rudin. Functional Analysis. McGraw-Hill Series in Higher Mathematics. New York etc.: McGraw-Hill Book Comp. XIII, 397 p. , 1973.
  • [79] D. Soderquist, A. Carstens, and G. Frank. Backward, simultaneous, and forward masking as a function of signal delay and frequency. The Journal of Auditory Research, 21:227–245, 1981.
  • [80] P. Søndergaard, B. Torrésani, and P. Balazs. The Linear Time Frequency Analysis Toolbox. International Journal of Wavelets, Multiresolution and Information Processing, 10(4):1250032, 2012.
  • [81] P. Søndergaard. Gabor frames by sampling and periodization. Adv. Comput. Math., 27(4):355–373, 2007.
  • [82] D. T. Stoeva and P. Balazs. Invertibility of multipliers. Appl. Comput. Harmon. Anal., 33(2):292–299, 2012.
  • [83] D. T. Stoeva and P. Balazs. Canonical forms of unconditionally convergent multipliers. J. Math. Anal. Appl., 399:252–259, 2013.
  • [84] D. T. Stoeva and P. Balazs. Riesz bases multipliers. In M. Cepedello Boiso, H. Hedenmalm, M. A. Kaashoek, A. Montes-Rodríguez, and S. Treil, editors, Concrete Operators, Spectral Theory, Operators in Harmonic Analysis and Approximation, volume 236 of Operator Theory: Advances and Applications, pages 475–482. Birkhäuser, Springer Basel, 2014.
  • [85] S. Strahl and A. Mertins. Analysis and design of gammatone signal models. J. Acoust. Soc. Am., 126(5):2379–2389, November 2009.
  • [86] T. Strohmer. Numerical algorithms for discrete Gabor expansions. In H. G. Feichtinger and T. Strohmer, editors, Gabor Analysis and Algorithms: Theory and Applications, Appl. Numer. Harmon. Anal., pages 267–294. Birkhäuser Boston, Boston, 1998.
  • [87] L. N. Trefethen and D. Bau III. Numerical Linear Algebra. SIAM, Philadelphia, PA, USA, 1997.
  • [88] M. Unoki, T. Irino, B. Glasberg, B. C. J. Moore, and R. D. Patterson. Comparison of the roex and gammachirp filters as representations of the auditory filter. J. Acoust. Soc. Am., 120(3):1474–1492, September 2006.
  • [89] P. Vaidyanathan. Multirate Systems And Filter Banks. Electrical engineering. Electronic and digital design. Prentice Hall, Englewood Cliffs, NJ, USA, 1993.
  • [90] X. Valero and F. Alias. Gammatone cepstral coefficients: Biologically inspired features for non-speech audio classification. IEEE Trans. Multimedia, 14(6):1684–1689, Dec 2012.
  • [91] M. Vetterli and J. Kovacević. Wavelets and Subband Coding. Prentice Hall, Englewood Cliffs, NJ, 1995.
  • [92] D. Wang and G. J. Brown. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press, 2006.
  • [93] T. Werther, Y. C. Eldar, and N. K. Subbana. Dual Gabor frames: theory and computational aspects. IEEE Trans. Signal Process., 53(11):4147– 4158, 2005.
  • [94] C. Wiesmeyr, N. Holighaus, and P. Søndergaard. Efficient algorithms for discrete Gabor transforms on a nonseparable lattice. IEEE Trans. Signal Process., 61(20):5131 – 5142, 2013.
  • [95] R. M. Young. An introduction to nonharmonic Fourier series. Pure and Applied Mathematics, 93. New York etc.: Academic Press, a Subsidiary of Harcourt Brace Jovanovich, Publishers. X, 246 p. , 1980.
  • [96] X. Zhao, Y. Shao, and D. Wang. Casa-based robust speaker identification. IEEE Trans. Audio, Speech, Language Process., 20(5):1608–1616, July 2012.
  • [97] E. Zwicker. Dependence of post-masking on masker duration and its relation to temporal effects in loudness. J. Acoust. Soc. Am., 75(1):219–223, January 1984.
  • [98] E. Zwicker and E. Terhardt. Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am., 68(5):1523–1525, November 1980.