Improving CTC-based ASR Models with Gated Interlayer Collaboration

Abstract

The CTC-based automatic speech recognition (ASR) models without the external language model usually lack the capacity to model conditional dependencies and textual interactions. In this paper, we present a Gated Interlayer Collaboration (GIC) mechanism to improve the performance of CTC-based models, which introduces textual information into the model and thus relaxes the conditional independence assumption of CTC-based models. Specifically, we consider the weighted sum of token embeddings as the textual representation for each position, where the position-specific weights are the softmax probability distribution constructed via inter-layer auxiliary CTC losses. The textual representations are then fused with acoustic features by developing a gate unit. Experiments on AISHELL-1 [1], TEDLIUM2 [2], and AIDATATANG [3] corpora show that the proposed method outperforms several strong baselines.

Index Terms— automatic speech recognition, gated interlayer collaboration, auxiliary loss

1 Introduction

The end-to-end automatic speech recognition (E2E ASR) system has become increasingly popular due to its simpler architecture and outstanding performance. A variety of E2E ASR models have been explored in the literature, which can be categorized into three main approaches: attention-based encoder-decoder systems [4, 5, 6, 7], transducer models [8, 9, 10], and Connectionist Temporal Classification (CTC) models [11, 12, 13, 14, 15, 16]. The attention-based systems solve the ASR problem as a sequence mapping from speech feature sequences to texts using an encoder-decoder architecture. The transducer network extends CTC by additionally modeling the dependencies between outputs at different timesteps. The CTC method introduces the blank label to identify the no-output positions and utilizes frame-wise label sequences to obtain the alignments between the input and output sequences. In contrast to the attention-based and the transducer-based models, the CTC models only involve with the encoder module and naturally provide a non-autoregressive manner to generate sentences in a fast and straightforward way.

However, both the attention-based and transducer-based models have often shown better performance than the CTC-based model equipped without the external language model (LM) [9]. The inferior performance of a CTC system stems from two aspects. First, CTC imposes the conditional independence constraint that the output prediction at each position is independent of each other, which is illogical for ASR tasks. Second, CTC only relies on the acoustic feature to generate sentences, while attention-based models utilize both the acoustic and textual features.

In this study, we focus on improving the performance of CTC-based models by addressing the issues mentioned above. To achieve this, we need to consider two problems: (1) how to introduce the textual information into the models during the training process, and (2) how to fuse both the textual and the acoustic features. We present a Gated Interlayer Collaboration (GIC) mechanism, which consists of (1) an $embedding$ layer to introduce the textual information into the CTC-based model by utilizing the sequence of soft labels of intermediate layer predictions, and (2) a gate unit to fuse the textual and acoustic features. Specifically, we train the model with the intermediate auxiliary CTC loss calculated by the interlayer output of the model. The probability distributions of the intermediate layers naturally serve as soft labels, which are also used to obtain the textual embedding at each position. We conduct extensive experiments to compare our methods with previous works on various ASR datasets, including AISHELL-1, AIDATATANG, and TEDLIUM2. GIC sets the new state-of-the-art among several strong baselines. Specifically, our method achieves 4.3% on the AISHELL-1 test set, 4.1% on the AIDATATANG test set, and 7.1% on the TEDLIUM2 test set, respectively. These improvements demonstrate the effectiveness of introducing the GIC mechanism into CTC-based models.

2 Related work

Refer to caption — Fig. 1: Illustration of our method. We compute intermediate CTC losses for some particular encoder layers. Then, GIC blocks introduce the textual information into the model generating by intermediate predictions.

Many works have explored the improvements of auxiliary tasks for CTC models [17, 18, 19, 20, 21, 22, 23, 24]. Intermediate CTC [21] utilize intermediate auxiliary CTC losses of the encoder module to improve the performance of CTC-based models. Self-conditioning CTC [22] and hierarchical conditional CTC [23] improve the performance of CTC-based models by conditioning the final prediction on the intermediate predictions. In addition, integrating the external LM for shallow fusion brings performance improvements for CTC-based models [12, 13]. Our method differs from the previous literature [20, 17, 18, 19, 21, 22, 23, 24] in a number of ways, including the method of introducing textual features into the CTC model and the method of gate-based feature fusion.

3 Proposed Method

In the training process, we use the feature sequence $X=\left\{x_{1},x_{2},...,x_{T}\right\}$ and the text sequence $Y=\left\{y_{1},y_{2},...,y_{n}\right\}$ to train the network parameters. Figure 1 shows the framework of the proposed method, which takes Transformer as the backbone for example. In experiments, we adopt both the Transformer [6] and Conformer [25] as backbones for comparison with strong baselines.

3.1 Baseline Model Architecture

Transformer encoder. An encoder layer of the Transformer [6] comprises two main sub-layers: a self-attention layer followed by a feed-forward network. Formally,

\begin{split}{\rm h}_{l}^{\prime}&={\rm h}_{l}^{in}+{\rm SAN}({\rm Q}_{l},{\rm K}_{l},{\rm V}_{l}),\\ {\rm h}_{l+1}^{in}&={\rm h}_{l}={\rm h}_{l}^{\prime}+{\rm FFN}({\rm LN}({\rm h}_{l}^{\prime})),\end{split}

(1)

where ${\rm h}_{l}^{in}$ and ${\rm h}_{l}$ are the input and output of $l$ -th layer. $\rm LN(\cdot)$ , $\rm SAN(\cdot)$ and $\rm FFN(\cdot)$ are layer normalization [26], attention mechanism, and feed-forward networks respectively. ( ${\rm Q}_{l}$ , ${\rm K}_{l}$ , ${\rm V}_{l}$ ) are query, key and value vectors that are transformed from the normalized $l$ -th encoder layer input, that is ${\rm LN}({\rm h}_{l}^{in})$ .

Conformer encoder. Conformer [25] combines Transformer and convolutional neural network layers to efficiently learn both global and local representations. An encoder block of Conformer is composed of four modules, i.e., a feed-forward module, a self-attention module, a convolution module, and a second feed-forward module. Formally,

\begin{split}{\rm h}_{l}^{\prime}&={\rm h}_{l}^{in}+\frac{1}{2}{\rm FFN}({\rm h}_{l}^{in}),\\ {\rm s}_{l}&={\rm h}_{l}^{\prime}+{\rm SAN}({\rm Q}_{l},{\rm K}_{l},{\rm V}_{l}),\\ {\rm o}_{l}&={\rm s}_{l}+{\rm Convolution}({\rm s}_{l}),\\ {\rm h}_{l+1}^{in}&={\rm h}_{l}={\rm LN}({\rm o}_{l}+\frac{1}{2}{\rm FFN}({\rm o}_{l})).\end{split}

(2)

Connectionist Temporal Classification. CTC [11] optimizes the model by maximizing the likelihood of all valid frame-wise alignments $\beta^{-1}(Y)$ . Formally,

\begin{split}{\rm q}_{L}&={\rm Softmax}({\rm Linear}_{d_{m}\rightarrow{|V|}}({\rm LN}({\rm h}_{L}))),\\ P_{ctc}(Y|{\rm h}_{L})&=\sum_{\alpha\in{\beta^{-1}(Y)}}\prod_{t=1}^{T}{\rm q}_{L}^{t}[\alpha_{t}],\end{split}

(3)

where ${\rm q}_{L}$ is the probability distribution of each frame over the vocabulary $V$ , and $d_{m}$ is the dimension of the attention layer. ${\rm q}_{L}^{t}$ denotes the $t$ -th column of ${\rm q}_{L}$ , and ${\rm q}_{L}^{t}[\alpha_{t}]$ denotes the $\alpha_{t}$ -th symbol of ${\rm q}_{L}^{t}$ . ${\rm h}_{L}$ is the output of the encoder module, and $L$ indicates the number of encoder layers. The CTC loss is defined as the negative log-likelihood as:

\pounds_{ctc}=-\log\ P_{ctc}(Y|{\rm h}_{L}).

(4)

Intermediate CTC. Intermediate CTC [21] utilizes the representations of intermediate encoder layers to calculate the auxiliary CTC losses to further improve the performance of the models. We train the model with a total of $K$ intermediate layers to calculate the inter-layer CTC losses. The total loss function is a linear combination of the final layer CTC loss $\pounds_{ctc}$ (Eq.4) and the intermediate CTC losses. Formally,

\pounds=(1-{\lambda})\pounds_{ctc}+{\lambda}\frac{1}{K}\sum_{i=1}^{K}\pounds_{\frac{iL}{K+1}}.

(5)

The intermediate CTC loss $\pounds_{l}$ is calculated with the output of the $l$ -th encoder layer ( ${\rm h}_{l}$ ) as

\pounds_{l}=-\log\ P_{ctc}(Y|{\rm h}_{l}).

(6)

3.2 Gated Interlayer Collaboration Mechanism

Figure 1 shows the framework of the proposed method. We propose a gated interlayer collaboration (GIC) mechanism to make CTC-based models have access to textual information generating by intermediate predictions of the encoder module. The GIC block first introduces an $embedding$ layer that is used to get the position-specific textual representation by a weighted-combination of all token embeddings. Formally,

{\rm e}_{l}^{j}=\sum_{i}^{|V|}{\rm q}_{l}^{(i,j)}\odot emb_{i},

(7)

where $emb_{i}$ represents the $i$ -th column of the embedding matrix. ${\rm q}_{l}^{(i,j)}$ indicates the $i$ -th probability mass in the soft labels at the $j$ -th frame, where the soft label is calculated by a softmax activation given the hidden state ${\rm h}_{l}^{j}$ of the $j$ -th frame

{\rm q}_{l}={\rm Softmax}({\rm Linear}_{d_{m}\rightarrow{|V|}}({\rm LN}({\rm h}_{l}))).

(8)

In Eq.8, $l$ is one of the particular intermediate layers and the output of the $l$ -th layer ${\rm h}_{l}$ is used to calculate the intermediate CTC loss in Eq.6. Then, a gate unit combines the acoustic features ${\rm h}_{l}$ and the textual features ${\rm e}_{l}$ by a $sigmoid$ activation:

\begin{split}{\rm g}_{l}&={\rm Sigmoid}(W_{1}{\rm h}_{l}+W_{2}{\rm e}_{l}+b),\\ {\rm h}_{l+1}^{in}&={\rm g}_{l}\odot{\rm h}_{l}+(1-{\rm g}_{l})\odot{\rm e}_{l}.\end{split}

(9)

The gated combination ${\rm h}_{l+1}^{in}$ is then fed into the next encoder layer, and thus our approach benefits from both the textual and the acoustic interactions.

4 Experiments

We conduct experiments on three ASR corpora: AISHELL-1 (178 hours, Chinese) [1], AIDATATANG (200 hours, Chinese) [3], and TEDLIUM2 (207 hours, English) [2]. We apply speed perturbation [27] and SpecAugment [28] to the training data. For all experiments, the input speech features are 80-dimensional filterbank (FBank) features with 3-dimensional pitch features computed on 25ms windows with 10ms shifts. The vocabulary includes 4231 characters for AISHELL-1, 3943 characters for AIDATATANG, and 500 subwords for TEDLIUM2, respectively.

Table 1: Character error rate (CER) on AISHELL-1 with the Conformer backbone. ^† denotes the results from previous literature [24].

Methods	#Param (M)	w/o LM		w/ LM
Methods	#Param (M)	dev	test	dev	test
Hybrid [29]	46	4.4	4.7	-	-
Transducer[30]	89	4.3	4.6	4.1	4.4
SC-CTC	-	4.3^†	4.6^†	-	-
Alternate-CTC[24]	-	4.2	4.5	-	-
CTC	50	4.8	5.2	4.6	4.9
Intermediate-CTC	50	4.3	4.7	4.2	4.6
SC-CTC	51	4.2	4.6	4.1	4.5
GIC (ours)	52	4.0	4.4	4.0	4.3

Table 2: Word error rate (WER) on TEDLIUM2 with the Conformer backbone.

{}^{\$}/^{\ddagger}

denote the results from previous literature [31]/[23].

Methods	Dev	Test
Hybrid	10.4^$	8.4^$
Transducer	8.6^$	8.2^$
Improved Mask-CTC [32]	-	8.6
CTC	8.9^$	8.6^$
Intermediate-CTC	8.5^$	8.3^$
SC-CTC	8.5^‡	7.8^‡
HC-CTC [23]	8.0	7.6
GIC (ours)	7.9	7.3
+ beam search with a 4-gram LM	7.7	7.1

4.1 Experimental Setup

We conducted experiments using ESPnet [33]. We adopt two CNN-based downsampling layers followed by an 18-layer encoder for our experiments. For both Transformer and Conformer backbones, the dimension of the attention layer is 256 with 4 split heads, and the dimension of the feed-forward layer is 2048. Specifically, the kernel size is 15 for Conformer. We use a single-LSTM-layer decoder for the transducer-based models with the same 18-layer encoder. The Adam optimizer [34] with 25000 warmup steps are used for training. We train the models for 100 epochs on both AISHELL-1 and TEDLIUM2, and 50 epochs on AIDATATANG. We conduct experiments to compare our method with previous CTC-based models under the same model scale, including CTC [11], intermediate-CTC training with the intermediate CTC loss [21], and SC-CTC training with the intermediate CTC loss and the self-conditioning mechanism [22]. The results are obtained by either CTC greedy decoding without the external LM or beam search (beam size = 10) with a 4-gram LM.

Figure 2 shows the effect of different numbers of $K$ and different values of $\lambda$ in Eq.5. We found that $K=5$ and $\lambda=0.5$ achieve better results. Therefore, we set $K=5$ (interlayer set of $\left\{3,6,9,12,15\right\}$ ) and $\lambda=0.5$ for our experiments.

Table 3: Character error rate (CER) on AIDATATANG with the Conformer backbone.

Methods	w/o LM		w/ LM
Methods	dev	test	dev	test
Hybrid[29]	4.3	5.0	-	-
Transducer	4.6	5.3	-	-
CTC	4.9	5.5	4.0	4.6
GIC (ours)	3.8	4.4	3.5	4.1

4.2 Results

Table 1 shows the results of different methods on AISHELL-1 based on the Conformer backbone. In addition to recently published CTC models [24], we compare our method with a few state-of-the-art models, including the hybrid CTC/Attention model [29] and the transducer-based method [30]. Our model sets the start-of-the-art results among all Conformer-based models. Specifically, our method achieves CER of 4.0%/4.4% on AISHELL-1 dev/test sets using CTC greedy decoding without the external LM, outperforming the hybrid systems and other CTC models with slightly overhead parameters, and outperforming the transducer-based models while having the benefit of non-autoregressive decoding manner. Furthermore, our method performs better to alleviate the conditional independence hypothesis of CTC-based models compared with the LM shallow fusion and the transducer network. Our method achieves CER of 4.0%/4.3% on AISHELL-1 dev/test sets with a 4-gram LM for shallow fusion. The external LM brings less boost since the proposed method introduced text interactions in the training process.

Table 2 and Table 3 show the results of different methods with the Conformer backbone on TEDLIUM2 and AIDATATANG, respectively. GIC achieves WER of 7.9%/7.3% with the CTC greedy search method and WER of 7.7%/7.1% with a 4-gram LM for shallow fusion on TEDLIUM2 dev/test sets, outperforming the results in recent literature [31, 23, 32]. The results show that GIC is also effective on the English dataset. Moreover, GIC achieves CER of 3.8%/4.4% on AIDATATANG dev/test sets without the external LM.

Table 4 shows the results on AISHELL-1 and AIDATA-TANG datasets, using the Transformer backbone. Our method is also effective for the Transformer network and achieves the lowest CER (i.e., 5.1% on the AISHELL-1 test set without LM) among all other Transformer-based models.

Table 4: Character error rate (CER) of Transformer backbone on AISHELL-1 and AIDATATANG without LM.

{}^{\S}/^{\star}

denote the results from the previous literature [22]/[29].

Methods	#Param (M)	AISHELL-1		AIDATATANG
Methods	#Param (M)	dev	test	dev	test
Hybrid	30	4.9^§	5.4^§	5.9^⋆	6.7^⋆
Transducer	31	5.9	6.3	5.8	6.7
CTC	26	5.7^§	6.2^§	5.5	6.3
Intermediate-CTC	26	5.3^§	5.7^§	5.1	5.9
SC-CTC [22]	27	4.9	5.3	5.1	5.9
GIC (ours)	28	4.7	5.1	4.8	5.5

Table 5: Disentangling the proposed method on AIDATA-TANG, using Transformer backbone. We remove the sub-layers of GIC and move towards a CTC model: (1) replacing gate units with element-wise sum operation; (2) removing the embedding layer, which is equivalent to removing the entire GIC module; (3) removing the intermediate CTC loss.

Model Architectures	Dev set	Test set
Proposed method	4.8	5.5
- Gate units	5.0	5.8
- Embedding	5.1	5.9
- Intermediate CTC loss	5.5	6.3

4.3 Ablation Study

The proposed approach differs from the standard CTC [11] model in a number of ways, including the GIC block and intermediate CTC losses. We study the effect of these differences by mutating the proposed approach toward the CTC model. Table 5 shows the impact of each change on the proposed method. The results demonstrate that introducing the GIC mechanism brings performance improvements from 5.1%/5.9% to 4.8%/5.5% on AIDATATANG dev/test sets.

Figure 3 gives the CER comparison of all interlayer predictions of the SC-CTC and GIC methods. GIC achieves better performance in all intermediate layers as well as the final layer. Compared with the self-conditioning mechanism [22], the GIC mechanism performs better to improve the performance of the CTC-based model.

5 Conclusions

We present an effective and novel gated interlayer collaboration (GIC) mechanism to improve the performance of the CTC-based models, which introduces the textual information into the models and thus eases the conditional independence assumption of the models. The GIC block consists of an $embedding$ layer to summarize position-specific textual representation and a gate unit to fuse the textual and acoustic features. Experiments show that our model achieves promising results both on Transformer and Conformer architectures.

References

[1] H. Bu, J Du, X Na, B Wu, and H Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” in 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA), 2017.
[2] A. Rousseau, P. Deléglise, Y. Esteve, et al., “Enhancing the ted-lium corpus with selected data for language modeling and more ted talks.,” in LREC, 2014.
[3] Beijing DataTang Technology Co. Lid, “aidatatang_200zh, a free chinese mandarin speech corpus,” .
[4] W. Chan, N. Jaitly, Q. Le V, and Q. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
[5] D. Bahdanau, J. Chorowski, D. Serdyuk, et al., “End-to-end attention-based large vocabulary speech recognition,” in ICASSP. IEEE, 2016, pp. 4945–4949.
[6] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” in NIPS, 2017.
[7] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in ICASSP, 2018.
[8] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
[9] E. Battenbergand J. Chen, R. Child, et al., “Exploring neural transducers for end-to-end speech recognition,” in ASRU, 2017.
[10] Q. Zhang, H. Lu, H. Sak, et al., “Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss,” in ICASSP, 2020.
[11] A. Graves, “Connectionist temporal classification : labelling unsegmented sequence data with recurrent neural networks,” Proc. Int. Conf. on Machine Learning, 2006.
[12] A. Hannun, C. Case, J. Casper, B. Catanzaro, et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
[13] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in ICML, 2014.
[14] Y. Miao, M. Gowayyed, and F. Metze, “Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding,” in ASRU, 2015.
[15] S. Kriman, S. Beliaev, B. Ginsburg, et al., “Quartznet: Deep automatic speech recognition with 1d time-channel separable convolutions,” in ICASSP, 2020.
[16] S. Majumdar, J. Balam, O. Hrinchuk, et al., “Citrinet: Closing the gap between non-autoregressive and autoregressive end-to-end models for automatic speech recognition,” arXiv preprint arXiv:2104.01721, 2021.
[17] R. Sanabria and Metze F, “Hierarchical multi task learning with ctc,” in SLT, 2018.
[18] K. Krishna, S. Toshniwal, and K. Livescu, “Hierarchical multitask learning for ctc-based speech recognition,” arXiv preprint arXiv:1807.06234, 2018.
[19] K. Rao and H. Sak, “Multi-accent speech recognition with hierarchical grapheme based models,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 4815–4819.
[20] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer,” in ASRU. IEEE, 2017.
[21] J. Lee and S. Watanabe, “Intermediate loss regularization for ctc-based speech recognition,” in ICASSP, 2021.
[22] J. Nozaki and T. Komatsu, “Relaxing the conditional independence assumption of ctc-based asr by conditioning on intermediate predictions,” arXiv preprint arXiv:2104.02724, 2021.
[23] Y. Higuchi, K. Karube, T. Ogawa, et al., “Hierarchical conditional end-to-end asr with ctc and multi-granular subword units,” in ICASSP. IEEE, 2022.
[24] Y. Fujita, T. Komatsu, and Y. Kida, “Multi-sequence intermediate conditioning for ctc-based asr,” arXiv preprint arXiv:2204.00175, 2022.
[25] A. Gulati, J. Qin, C. Chiu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
[26] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[27] T. Ko, V. Peddinti, D. Povey, et al., “Audio augmentation for speech recognition,” in ISCA, 2015.
[28] D. S. Park, W. Chan, Y. Zhang, et al., “Specaugment: A simple data augmentation method for automatic speech recognition,” 2019.
[29] P. Guo, F. Boyer, X. Chang, et al., “Recent developments on espnet toolkit boosted by conformer,” in ICASSP. IEEE, 2021.
[30] J. Tian, J. Yu, C. Weng, et al., “Consistent training and decoding for end-to-end speech recognition using lattice-free mmi,” arXiv preprint arXiv:2112.02498, 2021.
[31] Y. Higuchi, N. Chen, Y. Fujita, et al., “A comparative study on non-autoregressive modelings for speech-to-text generation,” in ASRU, 2021.
[32] Y. Higuchi, H. Inaguma, S. Watanabe, et al., “Improved mask-ctc for non-autoregressive end-to-end asr,” in ICASSP. IEEE, 2021.
[33] S. Watanabe, T. Hori, S. Karita, et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
[34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.