
BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech

1st Anton Ratnarajah University of Maryland, College Park, MD, USA    2nd Shi-Xiong Zhang Tencent AI Lab, Bellevue, WA, USA    3rd Dong Yu Tencent AI Lab, Bellevue, WA, USA
Abstract

We introduce BANC, a neural binaural audio codec designed for efficient speech compression in single and two-speaker scenarios while preserving the spatial location information of each speaker. Our key contributions are as follows: 1) The ability of our proposed model to compress and decode overlapping speech. 2) A novel architecture that compresses speech content and spatial cues separately, ensuring the preservation of each speaker’s spatial context after decoding. 3) BANC’s proficiency in reducing the bandwidth required for compressing binaural speech by 48% compared to compressing individual binaural channels. In our evaluation, we employed speech enhancement, room acoustics, and perceptual metrics to assess the accuracy of BANC’s clean speech and spatial cue estimates.

I Introduction

Neural audio codecs (NACs) are used to compress audio signals into codes to reduce the amount of data transmitted or stored. Existing NACs can be categorized into hybrid approaches and end-to-end NACs. Hybrid NACs combine traditional signal processing-based audio coding methods with neural speech synthesis architectures [1, 2, 3]. Recently, data-driven end-to-end NAC architectures have been proposed [4, 5, 6, 7], substantially improving the quality of reconstructed audio. The end-to-end approach makes no assumptions about the nature of the input audio, allowing it to generalize to arbitrary audio content. Most NACs are composed of encoder, quantizer, and decoder modules. Current NACs are designed to compress single-channel or stereo audio signals, and their decoder modules are typically optimized for single-speaker scenarios [7]. Two major drawbacks are evident in prior NAC designs: 1) They predominantly focus on single-channel audio processing. 2) They assume the presence of only one primary speaker within the audio. To address these limitations, our paper introduces a novel neural spatial audio codec model, aptly named BANC. This model efficiently compresses binaural speech, accommodating both single-speaker and two-speaker scenarios with different spatial locations.

A key distinction between single-channel and binaural audio is the latter's encapsulation of spatial localization in addition to clean speech content (S_{C}[t]) [8]. This spatial context manifests in various acoustic facets, such as early reflections, late reverberation, and diffraction, as well as interaural discrepancies like the interaural time difference (ITD) and interaural level difference (ILD) between microphones. Mathematically, these intricacies can be captured by the impulse response (IR) function. Consequently, for a mono-speaker scenario, one can decompose the clean speech content S_{C}[t] and the binaural acoustic effect BIR[t] from the binaural speech S_{B}[t] as:

S_{B}[t] = S_{C}[t] \circledast BIR[t].   (1)

For overlapped binaural speech (S_{OB}[t]) with two speakers in two different spatial locations, we can separately decompose their clean speech contents (S_{C1}[t], S_{C2}[t]) and their binaural IRs (BIRs) (BIR_{1}[t], BIR_{2}[t]) as follows:

S_{OB}[t] = (S_{C1}[t] \circledast BIR_{1}[t]) + (S_{C2}[t] \circledast BIR_{2}[t]).   (2)
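For illustration, the following minimal sketch synthesizes single-speaker and overlapped binaural speech following Eqs. 1 and 2. All signals here are random placeholders standing in for clean speech and simulated BIRs, not the actual data used in the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48000                               # sampling rate used throughout the paper
s_c1 = np.random.randn(2 * fs)           # placeholder clean speech of speaker 1 (2 s)
s_c2 = np.random.randn(2 * fs)           # placeholder clean speech of speaker 2 (2 s)
bir_1 = 0.01 * np.random.randn(fs, 2)    # placeholder BIR of speaker 1 (1 s, left/right)
bir_2 = 0.01 * np.random.randn(fs, 2)    # placeholder BIR of speaker 2 (1 s, left/right)

def binauralize(s_c, bir):
    """Eq. 1: convolve mono clean speech with each channel of the BIR."""
    return np.stack([fftconvolve(s_c, bir[:, ch]) for ch in range(2)], axis=-1)

s_b = binauralize(s_c1, bir_1)                                 # Eq. 1 (single speaker)
s_ob = binauralize(s_c1, bir_1) + binauralize(s_c2, bir_2)     # Eq. 2 (two speakers)
print(s_b.shape, s_ob.shape)             # both (2*fs + fs - 1, 2)
```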

Main Contribution: We present a pioneering NAC architecture optimized for binaural overlapped speech from two speakers that crucially retains each speaker's spatial context. This architecture is illustrated in Fig. 1. In contrast to the existing AudioDec model [7], our key contributions are: 1) Expansion of previous neural codecs from single-channel to binaural audio. 2) Enhanced capability for compressing and decoding overlapping speech. 3) A novel strategy of separately compressing speech content and spatial cues, ensuring post-decoding preservation of spatial context for each speaker. 4) A high compression rate for binaural audio. Specifically, our model can reconstruct a 48 kHz binaural speech signal from two distinct speakers using only 13.44 kbps of bandwidth, whereas AudioDec requires 25.6 kbps for binaural speech, while also reducing spatial errors by up to 52%. Moreover, BANC demonstrates superiority over both Opus [9] and Encodec [5] in quantitative and qualitative analyses at comparable bandwidths. We provide reconstructed speech samples, spectrograms, and source code for future research at https://anton-jeran.github.io/MAD/.

II Related Work

Traditional audio codecs: Linear predictive coding-based audio codecs [10, 11] and model-based audio codecs [12] have been proposed in the past for speech coding, but their quality is limited. Among traditional methods, Opus [9] and EVS [13] are the state of the art; they support a range of bitrates and sampling rates with high coding efficiency in real time.

Neural audio codecs (NAC): End-to-end data-driven architectures have been proposed to encode mono and stereo audio with impressive performance [4, 5, 6]. Encodec [5] compresses stereo audio by processing the left and right channels separately, which leads to inefficient compression as the same speech content is encoded twice in both channels. Our BANC model reduces bandwidth by encoding speech content only once and is optimized to efficiently compress overlapping speech while preserving each speaker’s speech content and spatial acoustic features.

Speech dereverberation and RIR estimation: Recently, NAC-based architectures have been proposed for audio-visual speech enhancement [14]. Similarly, in our work, we decode clean speech from binaural speech. Generative architectures have also been proposed to estimate IRs from given spatial information [15, 16]. Encoder-decoder architectures have shown promising results in estimating IRs from reverberant speech [17, 18]. We propose a neural codec that estimates a one-second IR from binaural speech.

III Binaural Audio Neural Codec

We propose BANC to compress binaural speech S_{B}(x) with a sampling rate of 48 kHz. Similar to typical NACs [4, 7], our model consists of encoder, projector, quantizer, and decoder modules. We propose a simple decoder architecture for the single-speaker scenario and a more complex one for the two-speaker scenario. Our proposed encoder architecture is the same for the single-speaker and two-speaker cases. We adapt the projector and quantizer from AudioDec [7].

Figure 1: Our proposed NAC configured for single-speaker binaural speech and two-speaker spatially overlapped binaural speech. In § III, we describe the details of our architecture and training paradigm. We train end-to-end networks with the metric loss (Eq. 6) for 200k iterations. Then, we freeze the blocks shown in blue and train the rest of the network with the metric and adversarial losses (Eq. 8) for an additional 500k iterations (single-speaker) and 160k iterations (two-speaker).

III-A Encoder Architecture

We pass the binaural speech through a common encoder, consisting of a 1D convolutional layer (CONV) with a kernel size (K) of 3, stride (S) of 1, input channels (IC) of 2, and output channels (OC) of 2. The output from the common encoder is then passed to both the speech encoder and the binaural impulse response (BIR) encoder. The speech encoder follows the same architecture as AudioDec [7] and SoundStream [4]. The speech encoder starts with a CONV layer (K = 7, S = 1, IC = 2, OC = 16), followed by convolution blocks C_{B1}. Each C_{B1} contains three residual units (RU) with dilated CONV layers (with dilation rates of 1, 3, and 9), followed by a CONV layer (K = 2 * S, S = S, IC = IC, OC = 2 * IC). We have 5 C_{B1} blocks with strides (S) of (2, 2, 3, 5, 5), resulting in a downsampling factor of 300 for the speech content.
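A simplified PyTorch sketch of this speech-encoder path is given below. The residual-unit internals (ELU activations, kernel sizes of 7 and 1) follow the SoundStream design, and symmetric padding is used for brevity, whereas the paper uses causal convolutions; treat it as an approximation of the described topology rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Dilated residual unit used inside each C_B1 block."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.ELU(),
            nn.Conv1d(ch, ch, kernel_size=7, dilation=dilation, padding=3 * dilation),
            nn.ELU(),
            nn.Conv1d(ch, ch, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)

class CB1(nn.Module):
    """Three dilated residual units, then a strided conv that downsamples by
    `stride` and doubles the channel count (K = 2*S, OC = 2*IC)."""
    def __init__(self, in_ch, stride):
        super().__init__()
        self.block = nn.Sequential(
            ResidualUnit(in_ch, 1), ResidualUnit(in_ch, 3), ResidualUnit(in_ch, 9),
            nn.Conv1d(in_ch, 2 * in_ch, kernel_size=2 * stride, stride=stride,
                      padding=stride // 2),
        )

    def forward(self, x):
        return self.block(x)

common = nn.Conv1d(2, 2, kernel_size=3, padding=1)       # common encoder (K=3, S=1)
stem = nn.Conv1d(2, 16, kernel_size=7, padding=3)        # first speech-encoder CONV
blocks = nn.Sequential(*[CB1(16 * 2 ** i, s) for i, s in enumerate([2, 2, 3, 5, 5])])

x = torch.randn(1, 2, 2 * 48000)                         # 2 s of 48 kHz binaural speech
z = blocks(stem(common(x)))
print(z.shape)   # ~ (1, 512, 96000 / 300); off by a few frames due to simplified padding
```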

Our IR encoder is inspired by the IR estimator network [17]. The IR encoder contains three CONV blocks, C_{B2}. Each C_{B2} consists of a CONV layer followed by batch normalization (BN) and leaky ReLU. The first C_{B2} does not include BN. The three C_{B2} blocks have OC = (128, 256, 512), K = (96001, 41, 41), S = (1500, 2, 2), and padding (P) = (48000, 20, 20). This architecture significantly downsamples the IR content by a factor of 6000. All the CONV layers are causal to enable real-time operation. The outputs of both the speech encoder and IR encoder are projected into multi-dimensional space separately and quantized into codes using projector and quantizer modules, as proposed in AudioDec.
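The BIR-encoder path can likewise be sketched directly from the stated kernel, stride, and padding values; again, the padding here is symmetric rather than causal, and the LeakyReLU slope is an assumption.

```python
import torch
import torch.nn as nn

def cb2(in_ch, out_ch, k, s, p, use_bn=True):
    """C_B2 block: CONV + (optional) BatchNorm + LeakyReLU."""
    layers = [nn.Conv1d(in_ch, out_ch, kernel_size=k, stride=s, padding=p)]
    if use_bn:
        layers.append(nn.BatchNorm1d(out_ch))
    layers.append(nn.LeakyReLU(0.2))     # negative slope is an assumption
    return nn.Sequential(*layers)

ir_encoder = nn.Sequential(
    cb2(2,   128, k=96001, s=1500, p=48000, use_bn=False),   # first block: no BN
    cb2(128, 256, k=41,    s=2,    p=20),
    cb2(256, 512, k=41,    s=2,    p=20),
)

x = torch.randn(1, 2, 2 * 48000)          # output of the common encoder (2 s)
z_ir = ir_encoder(x)
print(z_ir.shape)                         # (1, 512, 16) = 96000 / 6000 frames
```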

III-B Decoder Architecture

We propose two different architectures for the single-speaker and two-speaker scenarios as follows:

Single speaker: We propose a speech decoder architecture to decode clean speech and an IR decoder to decode BIR. We reconstruct the binaural speech from the estimated clean speech and IR using Eq. 1. Both the speech and IR decoders are adaptations of the SoundStream decoder. Before feeding inputs into the decoder modules, we pass the code through a CONV layer (IC = 64, OC = 512, K = 7, S = 1).

The speech decoder consists of 5 CONV blocks C_{B3} with S = (5, 5, 3, 2, 2), followed by a CONV layer with OC = 1, K = 7, and S = 1. Each C_{B3} contains transposed convolutional layers (IC = IC, OC = 0.5 * IC, K = 2 * S, S), followed by three residual units (RU) layers similar to those in the encoder. The IR decoder has a similar network structure as the speech decoder, except for the number of C_{B3} blocks. The IR decoder contains 6 C_{B3} blocks with S = (5, 5, 5, 4, 3, 2), and the final CONV layer has two output channels. We reconstruct two-second clean speech and one-second BIR at a sampling rate of 48 kHz.
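The upsampling path of the single-speaker speech decoder can be sketched as below; the three residual units inside each C_{B3} block (same structure as in the encoder sketch) are omitted for brevity, and the padding choices are ours, picked so that each block upsamples by exactly its stride.

```python
import torch
import torch.nn as nn

def cb3_upsample(in_ch, s):
    # Transposed conv with K = 2*S; padding/output_padding chosen so the output
    # is exactly S times longer and the channel count is halved.
    return nn.ConvTranspose1d(in_ch, in_ch // 2, kernel_size=2 * s, stride=s,
                              padding=(s + 1) // 2, output_padding=s % 2)

speech_decoder = nn.Sequential(
    nn.Conv1d(64, 512, kernel_size=7, padding=3),                    # code -> 512 channels
    *[cb3_upsample(512 // 2 ** i, s) for i, s in enumerate([5, 5, 3, 2, 2])],
    nn.Conv1d(16, 1, kernel_size=7, padding=3),                      # mono clean speech
)

code = torch.randn(1, 64, 320)                # 2 s of speech code: 96000 / 300 frames
print(speech_decoder(code).shape)             # (1, 1, 96000), i.e. upsampling factor 300
```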

Two speakers: For the two-speaker scenario, we replicate the speech decoder from the single-speaker scenario twice to decode the clean speech of the two speakers separately. We perform speech separation within the decoder to ensure the network preserves the speech content of each individual speaker in the code. Instead of directly passing the output C from the CONV layer, we learn the representation of each speaker, S_{i}, by learning a mask vector M_{i} \in [0,1]. Similar to Conv-TasNet [19], S_{i} is computed as the element-wise multiplication of C and M_{i}. We then pass S_{i} to the speech decoder modules to estimate the clean speech of each speaker. The same IR decoder from the single-speaker network is used, but with double the number of channels in each layer. Fig. 1 illustrates our model for two-speaker binaural speech.
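A minimal sketch of this mask-based split is shown below. The 1x1-convolution mask estimator and the sigmoid used to constrain the masks to [0, 1] are assumptions about how M_{i} is produced; the paper only specifies that S_{i} is the element-wise product of C and a learned mask M_{i} \in [0, 1].

```python
import torch
import torch.nn as nn

class SpeakerMasks(nn.Module):
    """Estimate one mask per speaker from the shared representation C and apply it."""
    def __init__(self, channels, num_speakers=2):
        super().__init__()
        self.num_speakers = num_speakers
        self.mask_net = nn.Conv1d(channels, channels * num_speakers, kernel_size=1)

    def forward(self, c):                        # c: (batch, channels, frames)
        masks = torch.sigmoid(self.mask_net(c))  # constrain masks to [0, 1]
        masks = masks.chunk(self.num_speakers, dim=1)
        return [c * m for m in masks]            # S_i = C (element-wise *) M_i

c = torch.randn(1, 512, 320)                     # shared representation from the code CONV
s1, s2 = SpeakerMasks(512)(c)                    # inputs to the two speech decoders
```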

Bandwidth: AudioDec allocates 80 bits per frame at a sampling rate of 48,000 Hz and a stride factor of 300, resulting in a bandwidth of 25.6 kbps for binaural speech (2 channels) calculated as 2 × 80 × 48,000 / 300. In contrast, our proposed BANC method compresses the clean speech signal by a factor of 300 and the BIR by 6000, leading to a significantly reduced bit rate of 13.44 kbps, computed as (80 × 48,000 / 300) + (80 × 48,000 / 6000).
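The bandwidth figures above follow directly from the frame rates and bits per frame:

```python
# Bandwidth bookkeeping from the paragraph above.
fs, bits_per_frame = 48_000, 80
audiodec_binaural = 2 * bits_per_frame * fs / 300                 # both channels coded separately
banc = bits_per_frame * fs / 300 + bits_per_frame * fs / 6000     # speech code + BIR code
print(audiodec_binaural / 1000, banc / 1000)                      # 25.6 kbps vs. 13.44 kbps
```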

III-C Training Objective

We adapt the training paradigm proposed in AudioDec. First, we train the end-to-end network with the metric loss for 200k iterations. Then, we replace our speech decoders with HiFi-GAN [20] vocoders and continue training with both the metric and adversarial losses for an additional 500k iterations, using HiFi-GAN-based multi-period and multi-scale discriminators [7]. In the two-speaker scenario, after 200k iterations, we further train our end-to-end network with the adversarial and metric losses for an additional 160k iterations. Let B(x) and \hat{B}(x) denote the input and reconstructed binaural speech, respectively. We denote the ground truth and reconstructed clean speech of speaker i as S_{i}(x) and \hat{S}_{i}(x), respectively. BIR_{i}(x) and \hat{BIR}_{i}(x) represent the corresponding ground truth and reconstructed binaural impulse responses.

Metric Loss: We use the mel spectral loss (Eq. 3) and spectrogram loss as our metric loss for clean and binaural speech. In Eq. 3, MEL denotes the extraction of the mel spectrogram, \mathbb{E} denotes the expectation, and the L1-norm and L2-norm are denoted by ||\cdot||_{1} and ||\cdot||_{2}, respectively.

\mathcal{L}_{MEL}(x,\hat{x}) = \mathbb{E}\big[||MEL(x) - MEL(\hat{x})||_{1}\big].   (3)

For the spectrogram loss, we calculate the mean squared difference between the log-magnitude spectrograms of the ground truth speech (M_{spec}(x)) and the estimated speech (M_{spec}(\hat{x})) (Eq. 4).

\mathcal{L}_{MAG}(x,\hat{x}) = \mathbb{E}\big[||M_{spec}(x) - M_{spec}(\hat{x})||_{2}^{2}\big].   (4)

We use the time-domain mean squared error (MSE) between the ground truth and estimated BIRs as the metric loss for the estimated BIRs (Eq. 5):

\mathcal{L}_{IR}(b,\hat{b}) = \mathbb{E}\big[||b - \hat{b}||_{2}^{2}\big].   (5)
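The three metric losses (Eqs. 3-5) can be written compactly as below, assuming standard torchaudio mel-spectrogram and STFT front-ends; the n_fft, hop length, and number of mel bins shown are placeholder values, not the paper's settings.

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=48000, n_fft=2048,
                                           hop_length=512, n_mels=80)
spec = torchaudio.transforms.Spectrogram(n_fft=2048, hop_length=512, power=1.0)

def mel_loss(x, x_hat):                      # Eq. 3: L1 distance between mel spectrograms
    return (mel(x) - mel(x_hat)).abs().mean()

def mag_loss(x, x_hat, eps=1e-5):            # Eq. 4: MSE between log-magnitude spectrograms
    return ((spec(x) + eps).log() - (spec(x_hat) + eps).log()).pow(2).mean()

def ir_loss(b, b_hat):                       # Eq. 5: time-domain MSE between BIRs
    return (b - b_hat).pow(2).mean()

x, x_hat = torch.randn(1, 2, 96000), torch.randn(1, 2, 96000)
print(mel_loss(x, x_hat), mag_loss(x, x_hat), ir_loss(x, x_hat))
```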

Our total metric loss is defined as follows:

\mathcal{L}_{MET}(x) = \mathcal{L}_{MEL}(B(x),\hat{B}(x)) + \mathcal{L}_{MAG}(B(x),\hat{B}(x))
+ \sum_{i=1}^{M_{S}} \big( \mathcal{L}_{MEL}(S_{i}(x),\hat{S}_{i}(x)) + \mathcal{L}_{MAG}(S_{i}(x),\hat{S}_{i}(x)) + \mathcal{L}_{IR}(BIR_{i}(x),\hat{BIR}_{i}(x)) \big),   (6)

where M_{S} is the total number of speakers in the speech.

Adversarial Loss: We train two HiFi-GAN discriminators for binaural and clean speech by optimizing the following objective function:

\mathcal{L}_{D}(x) = \mathbb{E}\big[\max(0, 1 - D_{B}(B(x))) + \max(0, 1 + D_{B}(\hat{B}(x)))
+ \sum_{i=1}^{M_{S}} \big( \max(0, 1 - D_{S}(S_{i}(x))) + \max(0, 1 + D_{S}(\hat{S}_{i}(x))) \big) \big],   (7)

where D_{B} and D_{S} are the discriminators of binaural speech and clean speech, respectively. We train our BANC with the following adversarial loss:

\mathcal{L}_{ADV} = \mathbb{E}\big[\max(0, 1 - D_{B}(\hat{B}(x))) + \sum_{i=1}^{M_{S}} \max(0, 1 - D_{S}(\hat{S}_{i}(x))) \big].   (8)
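Both objectives are hinge losses. The sketch below shows the per-discriminator terms; in Eqs. 7 and 8 these are summed over the binaural-speech discriminator D_{B}, the clean-speech discriminator D_{S}, and the speakers.

```python
import torch

def d_loss(d_real, d_fake):
    # Eq. 7 per discriminator: push real outputs above +1 and fake outputs below -1.
    return (torch.clamp(1 - d_real, min=0) + torch.clamp(1 + d_fake, min=0)).mean()

def g_loss(d_fake):
    # Eq. 8 per discriminator: push outputs on generated audio above +1.
    return torch.clamp(1 - d_fake, min=0).mean()

d_real, d_fake = torch.randn(4), torch.randn(4)      # placeholder discriminator outputs
print(d_loss(d_real, d_fake), g_loss(d_fake))
```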

In addition to \mathcal{L}_{MET}(x) and \mathcal{L}_{ADV}, we train our network with \mathcal{L}_{VQ} [21] applied to the VQ codebook. Our overall generator loss \mathcal{L}_{GEN}(x) is as follows:

\mathcal{L}_{GEN}(x) = \mathcal{L}_{MET}(x) + \lambda_{ADV}\mathcal{L}_{ADV} + \lambda_{VQ}\mathcal{L}_{VQ},   (9)

where \lambda_{ADV} and \lambda_{VQ} are the weights.

IV Experiments

Dataset: We generate binaural speech datasets for both single-speaker and two-speaker scenarios using clean speech from the Valentini dataset [22, 23] and simulated BIRs from pygsound [24]. We simulated 50k BIRs using pygsound, which were then randomly convolved with the clean speech corpus using Eq. 1 for the single-speaker scenario and Eq. 2 for the two-speaker scenario. We split the simulated dataset into 33,975 training samples, 750 validation samples, and 752 test samples.

Baselines: Opus is a widely used audio codec in Zoom, Microsoft Teams, Google Meet, and YouTube, and it was standardized by the IETF in 2012. We use Opus as our baseline traditional audio codec. We also compare our approach with the state-of-the-art NAC for two-channel audio, Encodec [5]. HiFi-Codec [6] and AudioDec [7] only support single-channel speech compression; therefore, we separately compressed the left and right channels using HiFi-Codec and AudioDec. The pre-trained AudioDec model available in their official GitHub repository was trained only on clean speech. For a fair comparison, we trained AudioDec using our dataset. AudioDec is an improved version of SoundStream [4] for speech coding. More details about our baselines can be found in Table I.

Ablation: We evaluated three variations of BANC to select the best model for the single-speaker case. We assess the benefit of the HiFi-GAN vocoder by training our network for 700k iterations using the simple speech decoder described in § III-B (BANC-V1). AudioDec uses only the mel spectral loss as its metric loss; therefore, we trained the network without Eq. 4 to evaluate the benefit of the spectrogram loss (BANC-V2). BANC is trained using our proposed approach described in § III. Due to computational complexity, we do not use the HiFi-GAN vocoder for our two-speaker model.

Evaluation Metrics: We evaluate our model by measuring the clean speech estimation quality using the widely used speech enhancement metric STOI [25] and the BIR estimation quality using a set of BIR acoustic parameters. Reverberation time (T_{60}), direct-to-reverberant ratio (DRR), early decay time (EDT), and early-to-late index (CTE) are commonly used acoustic parameters to measure IRs [26, 27]. We calculate the mean absolute difference between the estimated and ground truth BIR acoustic parameters.
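As a reference point for these metrics, the sketch below estimates T_{60} from an impulse response via Schroeder backward integration, fitting the -5 dB to -25 dB portion of the energy decay curve and extrapolating to -60 dB. The exact estimation conventions used in the paper may differ, and DRR, EDT, and CTE follow analogous energy-based definitions.

```python
import numpy as np

def t60(ir, fs=48000):
    """Estimate T60 from an IR using the Schroeder energy decay curve (EDC)."""
    edc = np.cumsum(ir[::-1] ** 2)[::-1]              # backward-integrated energy
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)
    t = np.arange(len(ir)) / fs
    fit = (edc_db <= -5) & (edc_db >= -25)            # fit the -5 dB to -25 dB decay
    slope, _ = np.polyfit(t[fit], edc_db[fit], 1)     # decay rate in dB per second
    return -60.0 / slope                              # time to decay by 60 dB

# Synthetic exponentially decaying IR whose energy drops ~60 dB over one second.
ir = np.random.randn(48000) * np.exp(-6.9 * np.arange(48000) / 48000)
print(t60(ir))                                        # ~ 1.0 s
```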

We measure our model’s ability to preserve interaural time difference (ITD) and interaural level difference (ILD) in reconstructed binaural speech. As proposed in prior work [28], we use the generalized cross-correlation phase transform (GCC-PHAT) algorithm [29] to calculate the ITD error (Eq. 10) between the left and right channels of the ground truth speech (B^{L}, B^{R}) and the reconstructed speech (\hat{B}^{L}, \hat{B}^{R}).

\mathbf{E_{ITD}} = \mathbb{E}\big[|ITD(B^{L}, B^{R}) - ITD(\hat{B}^{L}, \hat{B}^{R})|\big].   (10)

We define the ILD error for the left channel (\mathbf{E_{ILDL}}) and right channel (\mathbf{E_{ILDR}}) as follows:

\mathbf{E_{ILDL}} = \mathbb{E}\Big[\Big|20\log_{10}\frac{||\hat{B}^{L}||_{2}^{2}}{||B^{L}||_{2}^{2}}\Big|\Big], \quad \mathbf{E_{ILDR}} = \mathbb{E}\Big[\Big|20\log_{10}\frac{||\hat{B}^{R}||_{2}^{2}}{||B^{R}||_{2}^{2}}\Big|\Big].   (11)
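A sketch of these two spatial-error measurements is given below: the ITD is taken from the peak of the GCC-PHAT cross-correlation between the left and right channels, and the ILD error compares per-channel energies of the reconstructed and ground-truth speech following Eq. 11. The 1 ms maximum lag is an assumption.

```python
import numpy as np

def gcc_phat_itd(left, right, fs=48000, max_tau=1e-3):
    """Estimate the inter-channel delay (seconds) from the GCC-PHAT peak."""
    n = len(left) + len(right)
    X = np.fft.rfft(left, n) * np.conj(np.fft.rfft(right, n))
    cc = np.fft.irfft(X / (np.abs(X) + 1e-12), n)        # PHAT weighting
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def ild_error(ref, est):                                  # per channel, following Eq. 11
    return abs(20 * np.log10(np.sum(est ** 2) / np.sum(ref ** 2)))

b_l, b_r = np.random.randn(96000), np.random.randn(96000)          # placeholder ground truth
bh_l, bh_r = b_l + 0.1 * np.random.randn(96000), b_r + 0.1 * np.random.randn(96000)
e_itd = abs(gcc_phat_itd(b_l, b_r) - gcc_phat_itd(bh_l, bh_r))      # Eq. 10 for one example
print(e_itd, ild_error(b_l, bh_l), ild_error(b_r, bh_r))
```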
TABLE I: Baselines used for comparison. BANC significantly compresses binaural speech, reducing the bandwidth of AudioDec by up to 47.5%. Opus [9] and HiFi-Codec [6] do not report their compression ratio and bandwidth, respectively.
Method Compression Bandwidth Sampling Rate
Opus-12 [9] - 12 kbps 48 kHz
Opus-24 [9] - 24 kbps 48 kHz
HiFi-Codec-320 [6] 320x - 24 kHz
HiFi-Codec-240 [6] 240x - 24 kHz
Encodec-12 [5] 256x 12 kbps 48 kHz
Encodec-48 [5] 64x 48 kbps 48 kHz
AudioDec [7] 300x 25.6 kbps 48 kHz
BANC (Ours) 3150x 13.44 kbps 48 kHz
TABLE II: Interaural time difference error (\mathbf{E_{ITD}}) and interaural level difference errors (\mathbf{E_{ILDL}}, \mathbf{E_{ILDR}}) of the final reconstructed binaural speech are reported for the baselines (Table I) and different variations of our model (§ IV). We also report the STOI of the intermediate estimated clean speech from our approach. Additionally, we compare our single-speaker and two-speaker models (Fig. 1).
Speakers Method \mathbf{E_{ITD}}\downarrow \mathbf{E_{ILDL}}\downarrow \mathbf{E_{ILDR}}\downarrow STOI\uparrow
Single Opus-12 30.7 ms 1.28 1.28 -
Single Opus-24 25.4 ms 1.04 1.07 -
Single HiFi-Codec-320 36.0 ms 1.21 1.24 -
Single HiFi-Codec-240 37.8 ms 1.07 1.08 -
Single Encodec-12 33.5 ms 0.88 0.91 -
Single Encodec-48 34.0 ms 0.56 0.57 -
Single AudioDec 33.7 ms 0.87 0.88 -
Single BANC-V1 29.0 ms 1.20 1.36 0.71
Single BANC-V2 20.7 ms 1.33 1.41 0.71
Single BANC 16.0 ms 0.75 0.72 0.84
Two Opus-12 27.3 ms 1.01 1.00 -
Two Opus-24 31.2 ms 0.71 0.79 -
Two Encodec-12 33.5 ms 0.83 0.81 -
Two Encodec-48 36.4 ms 0.50 0.50 -
Two BANC 21.9 ms 0.82 0.77 0.72
TABLE III: BIR estimation error for our approach in single-speaker and two-speaker scenarios. Training binaural speech with spectrogram loss significantly improves BIR estimation indirectly. The BIR estimation of our network in the two-speaker scenario is comparable to that in the single-speaker scenario when using a simple speech decoder (BANC-V1). For the two-speaker case, we report the average error across both speakers.
Speakers Channel Method \mathbf{T_{60}}\downarrow DRR\downarrow EDT\downarrow CTE\downarrow
(ms) (dB) (dB) (ms)
Single Left BANC-V1 25.3 2.79 86.7 2.23
Single Left BANC-V2 20.9 2.21 67.0 1.44
Single Left BANC (ours) 22.7 1.08 39.4 0.79
Two Left BANC (ours) 25.2 3.41 80.1 2.52
Single Right BANC-V1 23.8 2.84 84.7 2.09
Single Right BANC-V2 21.4 2.35 64.9 1.33
Single Right BANC (ours) 23.0 1.05 35.0 0.77
Two Right BANC (ours) 25.6 3.30 83.3 2.09

Results: Table II presents the ITD and ILD errors of the reconstructed binaural speech for the baselines and our approach. Our approach achieves the lowest ITD error for both the single-speaker and two-speaker cases. Additionally, it outperforms all baselines except Encodec-48 in terms of ILD errors (\mathbf{E_{ILDL}}, \mathbf{E_{ILDR}}). However, Encodec-48 requires four times more bandwidth, has a compression rate approximately 50 times lower than our approach, and does not support streaming. For a fair comparison, we compared our model with Encodec-12 and found that our approach outperforms it by 18% and 3% for the single-speaker and two-speaker cases, respectively. We also compared the three variations of our single-speaker model and observed that replacing the simple speech decoder with a HiFi-GAN vocoder improves the ITD error by 29%, while adding the spectrogram loss enhances the clean speech estimation quality (STOI) by 15%.

Table III shows the BIR estimation error for our approach. We observed that improving the binaural speech estimation quality using the HiFi-GAN vocoder and spectrogram loss indirectly contributed to better BIR estimation, reducing the overall error for T_{60}, DRR, EDT, and CTE by 6.9%, 62.1%, 56.6%, and 64%, respectively, in the single-speaker scenario. The performance of our two-speaker model is comparable to that of our single-speaker model BANC-V1.

Perceptual Evaluation: We randomly selected two different single-speaker reverberant speech signals, compressed and decoded them using different audio codecs, and asked 26 participants from Amazon Mechanical Turk to rate the quality compared to the ground truth speech on a scale of 1 to 100. The average completion time of the survey was around 10 minutes. The average scores were 69.06, 64.83, 69.81, 70.71, and 72.87 for Opus-24, HiFi-Codec-24, Encodec-24, AudioDec, and our BANC model, respectively. These results indicate that listeners prefer the audio reconstructed by BANC over that of prior codecs.

V Conclusion and Future Work

We propose BANC, a novel binaural NAC for single-speaker speech and two-speaker spatially overlapped speech. Our approach outperforms traditional codecs and NACs with similar bandwidth, reducing spatial cue errors by up to 52%. We introduce a novel technique to compress speech content and acoustic effects separately, demonstrating that our method can reduce the bandwidth for compressing binaural speech by 48% compared to compressing each channel individually using AudioDec. Given the complexity of the scenario and the need for headphone-based evaluation, we have so far tested our approach on binaural speech with at most two spatially overlapped speakers. In the future, we aim to extend it to compress and decode overlapped speech from multiple speakers in arbitrary locations.

References

  • [1] J. Skoglund and J.-M. Valin, “Improving opus low bit rate quality with neural speech synthesis,” in INTERSPEECH.   ISCA, 2020.
  • [2] J. Klejsa, P. Hedelin, C. Zhou, R. Fejgin, and L. Villemoes, “High-quality speech coding with sample RNN,” in ICASSP.   IEEE, 2019.
  • [3] R. Fejgin, J. Klejsa, L. Villemoes, and C. Zhou, “Source coding of audio signals with a generative model,” in ICASSP.   IEEE, 2020.
  • [4] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 30, 2022.
  • [5] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” CoRR, vol. abs/2210.13438, 2022.
  • [6] D. Yang, S. Liu, R. Huang, J. Tian, C. Weng, and Y. Zou, “Hifi-codec: Group-residual vector quantization for high fidelity audio codec,” CoRR, vol. abs/2305.02765, 2023.
  • [7] Y.-C. Wu, I. D. Gebru, D. Marković, and A. Richard, “Audiodec: An open-source streaming high-fidelity neural audio codec,” in ICASSP 2023, 2023.
  • [8] F. L. Wightman and D. J. Kistler, “The dominant role of low-frequency interaural time differences in sound localization.” The Journal of the Acoustical Society of America, vol. 91 3, 1992. [Online]. Available: https://api.semanticscholar.org/CorpusID:24856270
  • [9] J.-M. Valin, K. Vos, and T. B. Terriberry, “Definition of the opus audio codec,” RFC, vol. 6716, 2012. [Online]. Available: https://api.semanticscholar.org/CorpusID:30715761
  • [10] B. S. Atal, “Speech analysis and synthesis by linear prediction of the speech wave,” The Journal of the Acoustical Society of America, vol. 50, no. 2B, 1971.
  • [11] A. McCree, K. Truong, E. George, T. Barnwell, and V. Viswanathan, “A 2.4 kbit/s MELP coder candidate for the new U.S. federal standard,” in ICASSP.   IEEE Computer Society, 1996.
  • [12] D. Griffin and J. Lim, “A new model-based speech analysis/synthesis system,” in ICASSP ’85., vol. 10, 1985.
  • [13] M. Dietz et al., “Overview of the EVS codec architecture,” in ICASSP, 2015.
  • [14] K. Yang, D. Marković, S. Krenn, V. Agrawal, and A. Richard, “Audio-visual speech codecs: Rethinking audio-visual speech enhancement by re-synthesis,” in CVPR 2022, 2022.
  • [15] A. Ratnarajah, S.-X. Zhang, M. Yu, Z. Tang, D. Manocha, and D. Yu, “Fast-rir: Fast neural diffuse room impulse response generator,” in ICASSP 2022, 2022.
  • [16] A. Ratnarajah, Z. Tang, R. Aralikatti, and D. Manocha, “Mesh2ir: Neural acoustic impulse response generator for complex 3d scenes,” in Proceedings of the 30th ACM International Conference on Multimedia, ser. MM ’22.   New York, NY, USA: Association for Computing Machinery, 2022. [Online]. Available: https://doi.org/10.1145/3503161.3548253
  • [17] A. Ratnarajah, I. Ananthabhotla, V. K. Ithapu, P. Hoffmann, D. Manocha, and P. Calamia, “Towards improved room impulse response estimation for speech recognition,” in ICASSP 2023, 2023.
  • [18] C. J. Steinmetz, V. K. Ithapu, and P. Calamia, “Filtered noise shaping for time domain room impulse response estimation from reverberant speech,” in WASPAA 2021, 2021.
  • [19] Y. Luo and N. Mesgarani, “Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, 2019.
  • [20] J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33.   Curran Associates, Inc., 2020. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf
  • [21] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30.   Curran Associates, Inc., 2017. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2017/file/7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf
  • [22] C. Valentini-Botinhao, “Noisy speech database for training speech enhancement algorithms and TTS models,” 2017. [Online]. Available: https://doi.org/10.7488/ds/2117
  • [23] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92),” 2019. [Online]. Available: https://doi.org/10.7488/ds/2645
  • [24] Z. Tang, L. Chen, B. Wu, D. Yu, and D. Manocha, “Improving reverberant speech training using diffuse acoustic simulation,” in ICASSP 2020, 2020.
  • [25] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010.
  • [26] A. Ratnarajah, Z. Tang, and D. Manocha, “IR-GAN: Room Impulse Response Generator for Far-Field Speech Recognition,” in Proc. Interspeech 2021, 2021.
  • [27] ——, “Ts-rir: Translated synthetic room impulse responses for speech augmentation,” in 2021 IEEE automatic speech recognition and understanding workshop (ASRU).   IEEE, 2021, pp. 259–266.
  • [28] C. Han, Y. Luo, and N. Mesgarani, “Real-time binaural speech separation with preserved spatial cues,” in ICASSP 2020, 2020.
  • [29] C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, 1976.