Prosodic Alignment for off-screen automatic dubbing

Abstract

The goal of automatic dubbing is to perform speech-to-speech translation while achieving audiovisual coherence. This entails isochrony, i.e., translating the original speech by also matching its prosodic structure into phrases and pauses, especially when the speaker’s mouth is visible. In previous work, we introduced a prosodic alignment model to address isochrone or on-screen dubbing. In this work, we extend the prosodic alignment model to also address off-screen dubbing that requires less stringent synchronization constraints. We conduct experiments on four dubbing directions – English to French, Italian, German and Spanish – on a publicly available collection of TED Talks and on publicly available YouTube videos. Empirical results show that compared to our previous work the extended prosodic alignment model provides significantly better subjective viewing experience on videos in which on-screen and off-screen automatic dubbing is applied for sentences with speakers mouth visible and not visible, respectively.

Index Terms— speech translation, text-to-speech, automatic dubbing, off-screen dubbing

1 Introduction

Automatic Dubbing (AD) is an extension of speech-to-speech translation that replaces speech in a video with speech in a different language while preserving as much as possible the user experience. Speech translation [1, 2, 3, 4] consists of recognizing a speech utterance in the source language, performing translation, and optionally resynthesizing speech in the target language. Use cases for speech translation include human-to-human interaction, live lectures, etc. in which close to real-time response is needed. In contrast, AD is used to automate the localization of audiovisual content, a highly complex workflow [5] usually managed during post-production by dubbing studios. High quality video dubbing usually involves speech synchronization at the utterance level (isochrony), lip movement level (phonetic synchrony) and body movement level (kinetic synchrony). In the past, most work on AD [6, 7, 8, 9] addressed isochrony, i.e., translating original speech by optimally matching its sequence of phrases and pauses. The idea is to first machine translate the source transcript by generating output with roughly the same duration [10, 11] –i.e. in terms of number of characters or syllables – of the input. Next, the translation is segmented into phrases and pauses of the same duration as that of the original phrases. This step is called prosodic alignment (PA).

Past work on PA [6, 7, 8, 9] focused on isochrony in the context of on-screen dubbing, i.e., dubbing of videos in which the speaker’s mouth is visible for all utterances. However, in practical settings, it is quite common that videos contain scenes in which the speaker is not visible (off-screen) and for which the synchronization constraints of isochrone dubbing can be relaxed. To address this case, we extend PA with a mechanism to address on/off-screen dubbing in which all or some of the sentences in a video are off-screen. We perform automatic and human evaluations that compare our original PA model for isochrone dubbing [9] with the augmented PA model for on/off dubbing ¹¹1We will provide a link to examples of dubbed videos with the camera ready version of the paper.. We report results on a test set of TED talks extracted from the MUST-C corpus [12] and on 3 publicly available YouTube videos, on four dubbing directions, English (en) to French (fr), Italian (it), German (de) and Spanish (es). To summarize, our contributions in this work are:

•

We extend the PA model [9] to address off-screen dubbing.
•

We introduce an automatic metric to compute intelligibility of dubbed videos.
•

We run extensive automatic and subjective human evaluations comparing previous work with the new PA model on TED Talks and YouTube clips.
•

Finally, we provide extensive analysis using the linear mixed-effects models to demonstrate the utility of automatic metrics in predicting human score.

Our paper is organized as follows: First, we describe our dubbing architecture, then, we focus on existing and new PA methods and finally present and discuss experimental results comparing past and current work.

2 Dubbing Architecture

Refer to caption — Fig. 1: Speech translation pipeline (dotted box) with enhancements introduced to perform automatic dubbing (bold).

We build on the automatic dubbing architecture presented in [8, 7]. Figure 1 shows (in bold) how we extend a speech-to-speech translation [1, 2, 3] pipeline with: neural machine translation (MT) robust to ASR errors and able to control verbosity of the output [11, 13, 14]; prosodic alignment (PA) [6, 8, 9] which addresses phrase-level synchronization of the MT output by leveraging the force-aligned source transcript; neural text-to-speech (TTS) [15, 16, 17] with precise duration control; and, finally, audio rendering that enriches TTS output with the original background noise (extracted via audio source separation with deep U-Nets [18, 19]) and reverberation, estimated from the original audio [20, 21].

3 Related Work

In the past there has been little work to address isochrony in dubbing [6, 7, 8, 9]. The approach of [6] involved generating and rescoring segmentation hypotheses by utilizing the attention weights of neural machine translation. While they focused only on the linguistic content matching between source-target phrases, Federico et al. [7] focused on fluency. In particular, the model of [7] utilized source-target duration matches and dynamic programming search for faster implementation. In their subsequent works [8, 9] they further enhanced the prosodic alignment model by addition of features controlling for speaking rate variation and linguistic content matching. Additionally, they introduced time-boundary relaxation to further improve speaking rate control. However, none of these works focused on relaxing isochrony constraints by considering if the speaker is on-screen or off-screen. Recently, [22] leveraged on/off screen information to improve MT of dubbing scripts. Their rationale is that as human translations of scripts used in training reflect the different sync requirements posed by on-screen and off-screen speech, it is worth introducing the same bias in the neural MT model. Our work complements [22], by showing how to leverage the same information in order to improve prosodic alignment, too.

4 Prosodic Alignment

4.1 Isochrone Dubbing

PA [6, 7, 8, 9] aims to segment a translation to optimally match the sequence of phrases and pauses of the corresponding source utterance. Let $\mathbf{e}$ denote the source sentence with $n$ words and $k$ breakpoints denoted by $\mathbf{i}=i_{1},\dots,i_{k}$ such that $1\leq i_{1}\leq i_{2}\leq\dots i_{k}=n$ . Let $T$ denote the temporal duration of $\mathbf{e}$ and let $\mathbf{s}$ denote a temporal segmentation into $k$ segments where $\Delta\epsilon$ is the minimum silence after and before each break point.²²2In this work, the minimum silence interval $\Delta\epsilon$ is set to $300ms$ . Given the target sentence $\mathbf{f}$ of $m$ words, the goal of PA is to find $k$ breakpoints $\mathbf{j}=1\leq j_{1}<j_{2}<\dots j_{k}=m$ within $\mathbf{f}$ that maximize the probability:

\displaystyle\max_{\mathbf{j}}\log\Pr\left(\mathbf{j}|\mathbf{i},\mathbf{e},\mathbf{f},\mathbf{s}\right).

(1)

Assuming a Markovian model of $\mathbf{j}$ , we get:

\displaystyle\Pr\left(\mathbf{j}|\mathbf{i},\mathbf{e},\mathbf{f},\mathbf{s}\right)=\displaystyle\sum_{t=1}^{k}\log\Pr\left(j_{t}|j_{t-1};t,\mathbf{i},\mathbf{e},\mathbf{f},\mathbf{s}\right)\enspace.

(2)

In [8] we derive a recurrent formular which permits to efficiently solve (2) with dynamic programming. Moreover, we allow target segments to possibly extend or contract the duration of the corresponding source interval by some fraction of $\Delta\epsilon$ to the left and to the right, denoted by $\delta_{l}$ and $\delta_{r}$ respectively, s.t., $\delta_{l},\delta_{r}\in\left\{0,\pm\frac{1}{4},\pm\frac{2}{4},\pm\frac{3}{4},\pm\frac{4}{4}\right\}$ . In this way, we trade off strict isochrony for small adjustments to the speaking rate, in order to improve the viewing experience. In the past work [9] it was observed that relaxations on isochrony do not help improve the accuracy of finding optimal segmentation but only help improve speech fluency and smoothness.

Two-step optimization procedure: For the above reasons, in [9], we introduce a two-step optimization procedure. In Step 1, we optimize the weights of the following log-linear model by maximizing segmentation accuracy over a manually annotated data set:

\displaystyle\log\Pr\left(j|j^{{}^{\prime}};t\right)\propto\displaystyle\sum_{a=1}^{4}w_{a}\log s_{a}\left(j,j^{{}^{\prime}};t\right)

(3)

The feature functions $s_{a}$ of the model (notice that we dropped some of the dependencies in eq. (2) for readability) denote – (1) the language model score of target break point, (2) the cross-lingual semantic match score across source and target segments, (3) the speaking rate variation across target segments and (4) the speaking rate match across source and target segments respectively.

In Step 2, starting from given breakpoints $\mathbf{\hat{j}}$ , we optimize the relaxations $\delta_{l}$ and $\delta_{r}$ for the $t$ -th segment using another recurrent equation, that can also be solved via dynamic programming, derived from the log-linear model:

\displaystyle\log\Pr\left(\delta_{l},\delta_{r}|\dots;t\right)\propto\displaystyle\sum_{a=1}^{5}w_{a}\log s_{a}\left(\delta_{l},\delta_{r},\hat{j}_{t},\hat{j}_{t-1},\delta^{{}^{\prime}}_{l},\delta^{{}^{\prime}}_{r};t\right)

(4)

which includes as additional feature $s_{5}$ the isochrony score [9]. Weight $w_{5}$ is optimized by maximizing speech smoothness [9] over the training set, assuming the reference breakpoints $\mathbf{\hat{j}}$ are given. Speech smoothness measures speaking rate variations across contiguous segments. Speaking rate computations rely on the strings $\tilde{f}_{t}$ and $\tilde{e}_{t}$ , denoting the $t$ -th source and target segments, as well as the original interval $s_{t}$ and the relaxed interval $s^{*}_{t}$ . Hence, the speaking rate of a source (target) segment is computed by taking the ratio between the duration of the utterance by source (target) TTS run at normal speed and the source (target) interval length, ³³3We run TTS on the entire sentence, force-align audio with text [23, 24] and compute segment duration from the time-stamps of the words., i.e:

	$\displaystyle r_{e}(t)$	$\displaystyle=\frac{\text{duration}(\text{TTS}_{e}(\tilde{e}_{t}))}{\|s_{t}\|}$		(5)
	$\displaystyle r_{f}(t)$	$\displaystyle=\frac{\text{duration}(\text{TTS}_{f}(\tilde{f}_{t}))}{\|s^{*}_{t}\|}$		(6)

4.2 On/Off Dubbing

Figure 2(a) shows that in our past work [9], during inference we apply the two steps of the PA component at the level of a single sentence, i.e., for each target sentence we first segment and then find the optimal relaxation using trained model defined by Eqs. (3), (4). In this work we focus on the more general use case of dubbing videos in which some of the sentences are on-screen – i.e. the speaker’s mouth is visible – and some are off-screen – i.e., the speaker’s mouth is not visible. We name this general case on/off dubbing. For off-screen sentences, we do not need the stringent requirement of isochrony and we can perform more relaxation to further improve the speaking rate control. Figure 2(b) gives an overview of the algorithm for on/off dubbing which extends the two-step isochrone dubbing algorithm as follows:

Step 1

We apply the segmentation step (Eq. (3)) for all sentences, i.e., both on and off screen sentences.

Step 2

We apply the relaxation step locally for all on-screen sentences using Eq. (4) and globally across all off-screen sentences by replacing Eq. (4) with the following:

\displaystyle\Pr\left(\delta_{l},\delta_{r}|\dots;t\right)\propto\begin{cases}1&\text{if }r_{f}(t)\leq 1\enspace,\\ 2-r_{f}(t)&\text{if }1<r_{f}(t)\leq 2\enspace,\\ 0&\text{if }r_{f}(t)>2\end{cases}

(7)

In Step 2 we also apply a more relaxed policy in allocating relaxations $\delta_{l}$ and $\delta_{r}$ inside off-screen sentences. In particular, we allow using the entire available inter-phrase and inter-sentence intervals rather than limiting them to maximum $\Delta\epsilon$ . Isochrone dubbing utilizes relaxation mechanism locally inside each sentence since for on-screen sentences, we tradeoff isochrony for improved speaking rate control. In contrast, for off-screen sentences we do not need isochrony and hence we utilize a global relaxation mechanism by computing optimal relaxations across all off-screen sentences using dynamic programming.

Regarding the scoring function (7), when the target speaking rate $r_{f}(t)$ is below $1$ , it returns a maximum score since for off-screen phrases we can contract time boundaries by setting the speaking rate $r_{f}(t)$ to $1$ without consequences. When the speaking rate of a target phrase $r_{f}(t)$ is larger than $2$ , it returns the lowest score since too high speaking rates will result in unintelligible TTS speech. For $1\leq r_{f}(t)\leq 2$ , it returns scores that increasingly penalize values larger than 1, as they will correspond to less and less intelligible TTS speech.

5 Evaluation Data and Metrics

For training and evaluation, we re-translated and annotated video clips from 20 TED talks of the MUST-C corpus [25] and 3 YouTube videos by vloggers. Each video clip contains 4 sentences manually annotated for on or off screen ⁴⁴4We consider the mixed case, in which the speaker’s mouth is visible only for some part of the sentence, to be on screen so as to preserve isochrony.. A single sentence contains one or more pauses of at least 300ms that are detected by force-aligning original English audio with text [23]. We manually collected and segmented translations in 4 languages - French, Italian, German and Spanish - using external vendors to fit duration and segmentation of corresponding English utterances.

Overall, we created two test sets to test on/off dubbing PA (ON/OFF) against Isochrone dubbing PA (ISO), (i) $D_{1}$ : 15 4-sentence clips in which all clips have all sentences being off-screen, (ii) $D_{2}$ : 15 4-sentence clips in which all clips have at least one sentence being on-screen. Compared to our previous work [9], we increase the size of extracted clips from 1 sentence to 4 sentences to test if the global relaxation mechanism provides better subjective viewing experience. To automatically estimate quality of dubbing, similar to [8, 9] we define Fluency (F) and Smoothness (Sm). For Smoothness we consider contiguous segments that span an entire 4-sentence video clip as opposed to a single sentence. We additionally introduce the following metric:

Intelligibility (In) of audio by dubbing target sentences $\mathbf{F}$ using prosodic alignment is defined by the ratio:

\displaystyle I(\mathbf{F})=\displaystyle\frac{1-\mathrm{WER}(\mathrm{TTS_{f}}(\mathrm{PA}(\mathbf{F})))}{1-\mathrm{WER}(\mathrm{TTS_{f}}(\mathbf{F}))}

(8)

where WER is the word error rate by an automatic speech recognition system ⁵⁵5We use the off-the-shelf service Amazon Transcribe
(https://aws.amazon.com/transcribe). run on TTS audio, either with prosody-alignment (numerator) or without prosody alignment (denominator).

6 Experiments

6.1 Automatic Evaluation

			$D_{1}$		$D_{2}$
			ISO ON/OFF		ISO ON/OFF
MuST-C	fr	Sm	68.5	75.3^∘	60.7	69.3^∗
		F	76.7	83.3	61.3	72.6
		In	93.6	93.5	93.5	93.2
	it	Sm	58.7	75.3^∗	52.0	68.3^∗
		F	68.3	80.0^∘	54.0	68.3^∗
		In	117.2	121.1^∗	98.8	99.0
	de	Sm	66.4	79.7^∗	57.6	70.4^∗
		F	81.7	86.7	58.6	74.1^∗
		In	94.3	94.3	91.7	93.0
	es	Sm	71.6	82.0^∗	61.9	76.0^∗
		F	80.0	85.0	61.9	79.4^∗
		In	124.9	125.7	97.0	98.0
YouTube	fr	Sm	70.6	80.9^∗	70.7	73.2
		F	66.7	80.0	60.0	60.0
		In	102.7	103.9	99.8	102.8
	it	Sm	73.6	81.3^∗	66.1	64.7
		F	40.0	73.3	46.7	43.8
		In	109.5	111.3	101.4	101.5
	de	Sm	69.9	82.3^∗	61.9	67.1^∘
		F	53.3	66.7	46.7	53.3
		In	102.7	105.6^∗	93.8	100.3^∘
	es	Sm	70.1	78.1^∗	67.0	72.0^∗
		F	33.3	60.0	60.0	66.7
		In	105.6	108.1	106.1	107.7^∘

Table 1: Automatic evaluation of PA variants in terms of Smoothness (Sm), Fluency (F), Intelligibility (In) of: isochrone dubbing PA[9] (ISO), mixed dubbing PA (ON/OFF) applied on off-screen clips (

D_{1}

) or on mixed off-screen and on-screen clips (

D_{2}

). All test sets consist of 15 4-sentence video clips for each domain (MuST-C, YouTube). Significance testing is done with levels

p<0.05

(^∘) and

p<0.01

(^∗).

Table 1 shows the results for automatic evaluation. We observe that ON/OFF outperforms ISO on MUST-C $D_{1}$ , with respect to Sm(oothness) and F(luency) with relative improvements ranging from 9.9%-28.3% and 6.1%-17.1% respectively, while for In(telligibility) ON/OFF provides 0.6%-14.5% improvements for it, de, es. Similar improvements are obtained for ON/OFF against ISO for MuST-C on $D_{2}$ and YouTube on $D_{1}$ and $D_{2}$ . We note that improvements for $D_{1}$ are higher compared to $D_{2}$ , primarily because $D_{1}$ considers videos in which all sentences are off-screen. This provides more opportunities for ON/OFF to exploit the global relaxation mechanism.

Compared to MuST-C, we find that YouTube data obtains higher In scores across all target languages and datasets. To investigate this further, we computed the length compliance (LC) metric of [26] at the phrasal level that measures the percentage of translations whose length in character is within $\pm 10\%$ of the length of the source and hence more suitable for auto dubbing. We found that on average, LC for YouTube was 19.8% higher than that for MuST-C. Higher value of LC implies that we are able to better fit the translations in the available phrase intervals thereby having TTS speech more intelligible.

6.2 Human Evaluation

			$D_{1}$		$D_{2}$
			ISO ON/OFF		ISO ON/OFF
MuST-C	fr	W	18.3	38^∗	21	43^∗
		S	4.43	4.79^∗	4.35	4.71^∗
	it	W	23	53.7^∗	15.3	54.3^∗
		S	4.61	5.36^∗	4.87	5.64^∗
	de	W	23.3	55.7^∗	19.7	64.7^∗
		S	4.45	5.11^∗	3.92	5.04^∗
	es	W	16.7	36.7^∗	28.7	37.3^∗
		S	5.03	5.35^∗	5.21	5.3
YouTube	fr	W	21.5	58.2 ^∗	20.0	60.0^∗
		S	5.06	5.77^∗	4.68	5.35^∗
	it	W	17.3	68.7^∗	21.7	55.0^∗
		S	5.03	6.16^∗	5.08	5.87^∗
	de	W	20.3	56.7^∗	25.7	50.0^∗
		S	5.35	6.17^∗	5.00	5.53^∗
	es	W	27.0	56.7^∗	36.3	46.3^∗
		S	4.70	5.36^∗	4.95	5.09

Table 2: Manual evaluations using Wins (W) and Score (S) with prosodic alignments: (ISO) previous work on Isochrone dubbing [9], (ON/OFF) new PA model for dubbing applied on off-screen clips (

D_{1}

) or on mixed off-screen and on-screen clips (

D_{2}

). All test sets consist of 15 4-sentence video clips for each domain (MuST-C, YouTube). Significance testing is done with levels

p<0.05

(^∘) and

p<0.01

(^∗).

In this section, we present results of human evaluation on the test set. For each dubbing direction and dataset we report results on 15 video clips extracted separately for each evaluation using the criterion noted in Sec. 5. We asked 20 native speakers to rate the subjective experience for viewing each dubbed video from each dubbing condition on a scale of 0-10. To reduce cognitive load, we compare two dubbing conditions for each evaluation and collect a total of 600 scores. For all dubbing conditions we utilize post-edited translations to focus the subjects on the synchronization aspect of dubbing.

Finally, for each evaluation we compare two conditions in a head-to-head manner and report Wins (percentage of times one condition is preferred over the other) and Score (average subjective score of dubbed videos) metrics. To measure the impact of PA model on human score, we use a linear-mixed-effects model (LMEM) ⁶⁶6We used the lme4 package for R [27] by defining subjects and clips as random effects [28].

The results are summarized in Table 2. For the dubbing evaluations, we compare ON/OFF vs ISO on test sets $D_{1}$ and $D_{2}$ in two separate evaluations. ON/OFF clearly outperforms ISO on $D_{1}$ providing relative improvements in Wins on both MuST-C ranging from 107.7%-139.1% and YouTube ranging from 110%-297.1% with all results being statistically significant ( $p<0.01$ ). Similarly, ON/OFF outperforms ISO on $D_{2}$ . Finally for Score, we obtain similar relative improvements on both MuST-C (1.7%-28.6%) and YouTube (2.8%-22.5%) with all improvements except the ones for es on $D_{2}$ being statistically significant ( $p<0.01$ ).

		MuST-C		YouTube
		$D_{1}$	$D_{2}$	$D_{1}$	$D_{2}$
MuST-C	$D_{1}$	0.51^∗	0.43^∗	0.73^∗	0.70^∗
	$D_{2}$	0.07	0.26^∘	0.50^∗	0.50^∗
YouTube	$D_{1}$	0.14	0.33^∘	0.68^∗	0.65^∗
	$D_{2}$	0.43^∗	0.47^∗	0.68^∗	0.7^∗

Table 3: Pearson correlation coefficient between predicted score from a LMEM model with fixed effects Sm, F, In and the averaged human score. We train a LMEM model on dataset in each row and predict score on dataset in each column. Significance testing is done with levels

p<0.05

(^∘) and

p<0.01

(^∗).

Relation between automatic and human scores: To explain the observed score variations using automatic metrics, we utilize LMEMs by aggregating evaluation data across all four languages. We define automatic metrics as fixed effects and subjects, clips, PA models and target languages as random effects. Our analysis reveals that Sm is the most impactful metric which is always statistically significant.

To further test if the learned LMEM can help predict human score, for every example we compute the average human score and compare it with the predicted score. We average out the random effect of subjects since different subjects use different range to grade their viewing experience. We train LMEM models, one on each dataset and domain, for a total of 4 models and predict scores on all 4 datasets. Table 3 shows that in most cases we obtain a high pearson correlation coefficient between the predicted and average human score. Additionally, each model is able to correctly predict on average the winning system (ISO or On/Off) for all datasets. In some cases, due to the low correlation, the magnitude of difference in predicted scores between the systems is not consistent with actual score differences. However, the sign of the difference is consistent and hence we can predict the correct winning system even in these cases. Thus we conclude that automatic metrics can also help compare dubbing quality of two systems.

7 Conclusions

We extended prosodic alignment to off-screen dubbing that requires less stringent synchronization constraints. We address off-screen dubbing by introducing a global relaxation algorithm in which we relax timing constraints across all off-screen sentences and compute optimal relaxations using dynamic programming. Both automatic and human evaluations show that compared to applying isochrone dubbing for all sentences, relaxing the synchronization constraints for off-screen sentences significantly improves model performance on both automatic and subjective metrics. Finally, the analysis using linear mixed-effects models shows that a linear combination of all automatic metrics correlates well with the average human score and can be useful to compare dubbing quality of two systems.

References

[1] F. Casacuberta, M. Federico, H. Ney, and E. Vidal, “Recent efforts in spoken language translation,” IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 80–88, 2008.
[2] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, “Sequence-to-Sequence Models Can Directly Translate Foreign Speech,” in Proc. Interspeech, 2017, pp. 2625–2629.
[3] L. C. Vila, C. Escolano, J. A. R. Fonollosa, and M. R. Costa-Jussà, “End-to-End Speech Translation with the Transformer,” in IberSPEECH, 2018, pp. 60–63.
[4] M. Sperber and M. Paulik, “Speech Translation and the End-to-End Promise: Taking Stock of Where We Are,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7409–7421.
[5] F. Chaume, “Synchronization in dubbing: A translation approach,” in Topics in Audiovisual Translation, pp. 35–52. 2004.
[6] A. Öktem, M. Farrús, and A. Bonafonte, “Prosodic Phrase Alignment for Machine Dubbing,” in Proceedings of Interspeech, Graz, Austria, 2019, arXiv: 1908.07226.
[7] M. Federico, R. Enyedi, R. Barra-Chicote, R. Giri, U. Isik, A. Krishnaswamy, and H. Sawaf, “From Speech-to-Speech Translation to Automatic Dubbing,” in Proceedings of the 17th International Conference on Spoken Language Translation, Online, July 2020, pp. 257–264, Association for Computational Linguistics.
[8] M. Federico, Y. Virkar, R. Enyedi, and R. Barra-Chicote, “Evaluating and optimizing prosodic alignment for automatic dubbing,” in Proceedings of Interspeech, 2020, p. 5.
[9] Y. Virkar, M. Federico, R. Enyedi, and R. Barra-Chicote, “Improvements to Prosodic Alignment for Automatic Dubbing,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), June 2021, pp. 7543–7574, ISSN: 2379-190X.
[10] A. Saboo and T. Baumann, “Integration of Dubbing Constraints into Machine Translation,” in Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), Florence, Italy, Aug. 2019, pp. 94–101, Association for Computational Linguistics.
[11] S. M. Lakew, M. Di Gangi, and M. Federico, “Controlling the Output Length of Neural Machine Translation,” in Proceedings of IWSLT, Hong Kong, China, Oct. 2019, arXiv: 1910.10408.
[12] Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi, “MuST-C: a Multilingual Speech Translation Corpus,” in Proc. NAACL, 2019, pp. 2012–2017.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” arXiv:1706.03762 [cs], 2017, arXiv: 1706.03762.
[14] M. A. Di Gangi, R. Enyedi, A. Brusadin, and M. Federico, “Robust Neural Machine Translation for Clean and Noisy Speech Transcripts,” in Proc. IWSLT, 2019.
[15] N. Prateek, M. Lajszczak, R. Barra-Chicote, T. Drugman, J. Lorenzo-Trueba, T. Merritt, S. Ronanki, and T. Wood, “In Other News: a Bi-style Text-to-speech Model for Synthesizing Newscaster Voice with Limited Data,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), Minneapolis, Minnesota, 2019, pp. 205–213, Association for Computational Linguistics.
[16] J. Latorre, J. Lachowicz, J. Lorenzo-Trueba, T. Merritt, T. Drugman, S. Ronanki, and K. Viacheslav, “Effect of data reduction on sequence-to-sequence neural TTS,” arXiv:1811.06315 [cs, eess], 2018, arXiv: 1811.06315.
[17] J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet, and V. Aggarwal, “Towards Achieving Robust Universal Neural Vocoding,” in Proc. Interspeech, 2019.
[18] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. ICMAI. Springer, 2015, pp. 234–241.
[19] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-net convolutional networks,” in Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017, p. 8.
[20] H. Löllmann, E. Yilmaz, M. Jeub, and P. Vary, “An improved algorithm for blind reverberation time estimation,” in Proc. IWAENC, 2010, pp. 1–4.
[21] E. A. P. Habets, “Room impulse response generator,” Tech. Rep. 2.4, Technische Universiteit Eindhoven, 2006.
[22] A. Karakanta, S. Bhattacharya, S. Nayak, T. Baumann, M. Negri, and M. Turchi, “The Two Shades of Dubbing in Neural Machine Translation,” in Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 4327–4333, International Committee on Computational Linguistics.
[23] R. M. Ochshorn and M. Hawkins, “Gentle Forced Aligner,” 2017.
[24] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi,” in Interspeech, 2017, pp. 498–502.
[25] M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi, “MuST-C: a Multilingual Speech Translation Corpus,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 2012–2017, Association for Computational Linguistics.
[26] Surafel M Lakew, Yogesh Virkar, Prashant Mathur, and Marcello Federico, “Isometric mt: Neural machine translation for automatic dubbing,” arXiv preprint arXiv:2112.08682, 2021.
[27] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker, “Fitting Linear Mixed-Effects Models Using lme4,” Journal of Statistical Software, vol. 67, no. 1, pp. 1–48, Oct. 2015.
[28] Douglas Bates, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen, “Parsimonious Mixed Models,” arXiv:1506.04967 [stat], June 2015, arXiv: 1506.04967.