
An Investigation on Deep Learning with Beta Stabilizer

Qi Liu, Tian Tan, Kai Yu
Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering
SpeechLab, Department of Computer Science and Engineering
Brain Science and Technology Research Center
Shanghai Jiao Tong University, Shanghai, China
Emails: {liuq901, tantian, kai.yu}@sjtu.edu.cn
Abstract

Artificial neural networks (ANNs) have been used in many applications such as handwriting recognition and speech recognition. It is well known that the learning rate is a crucial value in the training procedure of artificial neural networks. It has been shown that the initial value of the learning rate can significantly affect the final result, and in practice this value is almost always set manually. A parameter called the beta stabilizer has been introduced to reduce the sensitivity to the initial learning rate, but this method has only been proposed for deep neural networks (DNNs) with the sigmoid activation function. In this paper we extend the beta stabilizer to the long short-term memory (LSTM) network and investigate the effects of beta stabilizer parameters on different models, including LSTM and DNNs with the relu activation function. We conclude that beta stabilizer parameters can reduce the sensitivity to the learning rate with almost the same performance on DNNs with the relu activation function and on LSTM. However, the effects of the beta stabilizer on DNNs with the relu activation function and on LSTM are smaller than its effects on DNNs with the sigmoid activation function.

I Introduction

The hidden Markov model (HMM) [1] and the Gaussian mixture model (GMM) [2] have long been used to solve handwriting recognition and speech recognition problems [3]. Due to the limitations of the GMM, ANNs, especially DNNs [4] and recurrent neural networks (RNNs) [5], have been combined with HMMs and provide large improvements in performance [6].

The state-of-the-art training method for ANNs is mini-batch based stochastic gradient descent (SGD) with momentum [7]. For the SGD algorithm, the learning rate is a crucial and sensitive value. Its initial value has a large effect on the final performance and convergence speed of ANN training. However, this value is usually an empirical parameter, i.e. it is set manually. Another problem is that the best initial learning rate varies with different tasks, different neural network structures and different toolkits. How to set the initial learning rate is therefore a tricky part of the ANN training procedure.

Some researchers use grid search over the learning rate to choose the best initial value [8]. Others provide self-adjustment techniques in pre-training to select the initial value automatically [9]. There are also many training algorithms that are less sensitive to the learning rate and were proposed to solve this problem, including AdaDelta [10], AdaGrad [11] and natural gradient [12].

[13] provides a quite different solution. For every linear transform parameter matrix, a learnable scalar parameter is added. Together with the learning rate, this parameter affects the update procedure in SGD. By combining the original learning rate and this parameter, the learning rate effectively becomes learnable. This reduces the sensitivity to the initial learning rate and accelerates convergence.

In [13], only a DNN with the sigmoid activation function was used. However, DNNs with the relu activation function converge more quickly than DNNs with the sigmoid function and have been successfully applied in many applications [14] [15]. LSTM has become the state-of-the-art solution for speech recognition and handwriting recognition [16] [17] [18] because of its ability to model sequential data. End-to-end models, including connectionist temporal classification (CTC) [19] [20] and attention models [21] [22], also use LSTM widely. It is therefore important to evaluate the beta stabilizer on these newer neural network models. In this paper, we extend beta stabilizer parameters to LSTM and evaluate their effects on different ANN architectures, including LSTM and DNNs with the relu activation function.

Multiple speech recognition experiments have been carried out to verify the results. Two data corpora are prepared: one is a local 15-hour Chinese dataset, the other is a 50-hour subset of the Switchboard English dataset [23]. The neural network structures include a DNN with the sigmoid function, a DNN with the relu function and a deep LSTM. All experiments use the same SGD algorithm on a single CUDA-based GPU.

The experimental results show that beta stabilizer parameters achieve good results on the DNN with the sigmoid function. In some cases the performance is reduced on the DNN with the relu function and on LSTM. However, the sensitivity to the initial learning rate is always reduced with beta stabilizer parameters, regardless of the neural network architecture.

The rest of the paper is organized as follows: Section II gives the details of the beta stabilizer for DNN. Section III shows how to extend the beta stabilizer to LSTM. Section IV describes the setup and results of our experiments. Finally, the conclusion is given in Section V and a discussion in Section VI.

II Overview of Beta Stabilizer

II-A SGD Background

SGD is a first-order optimization algorithm. It is based on the fact that a differentiable function decreases fastest along the negative direction of its gradient. The algorithm is widely used in the machine learning field to optimize models with many variables.

To use SGD in the training procedure of an ANN, the parameters θ are updated to minimize the objective loss function \mathcal{L}. For every scalar value θ_i in θ, the gradient

\Delta_{i}=\frac{\partial\mathcal{L}}{\partial\theta_{i}}

is calculated. After all the gradients have been calculated, the update

\theta_{i}=\theta_{i}-\eta\Delta_{i}

is applied. Here, η is the learning rate.
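As a minimal illustration, the sketch below (plain NumPy, with a hypothetical toy quadratic loss rather than a neural network objective) applies this update rule repeatedly:

    import numpy as np

    def sgd_update(theta, grad, lr):
        # One plain SGD step: theta <- theta - lr * dL/dtheta.
        return theta - lr * grad

    # Hypothetical toy loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
    theta = np.array([1.0, -2.0, 0.5])
    for _ in range(100):
        grad = theta                              # dL/dtheta for this toy loss
        theta = sgd_update(theta, grad, lr=0.1)   # eta = 0.1
    print(theta)                                  # approaches the minimum at zero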

II-B Learning Rate Scheduling Method

In practice, the learning rate may be changed during the training procedure. The process of adjusting the learning rate is called learning rate scheduling. There are several methods for learning rate scheduling; two widely used ones are early stopping [24] and learning rate halving [25]. Early stopping terminates the training procedure when the performance on the cross-validation set becomes worse for several consecutive iterations. Learning rate halving reduces the learning rate by half when the performance on the cross-validation set becomes worse.

There are also some newer techniques for learning rate scheduling, such as exponential scheduling [26], the learning rate monitor [27] and learning rate auto-adjustment methods [28].

The sensitivity to the initial learning rate also depends on the learning rate scheduling method. To control the experimental variables, we use learning rate halving as the learning rate scheduling method in all experiments.
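A minimal sketch of learning rate halving, assuming hypothetical train_one_epoch and eval_cv_loss callables that run one training pass and return the loss on the cross-validation set:

    def train_with_halving(model, train_one_epoch, eval_cv_loss, init_lr, max_epochs=20):
        # Halve the learning rate whenever the cross-validation loss becomes worse.
        lr = init_lr
        best_cv = float("inf")
        for _ in range(max_epochs):
            train_one_epoch(model, lr)        # one SGD pass at the current learning rate
            cv_loss = eval_cv_loss(model)     # monitored quantity for scheduling
            if cv_loss > best_cv:
                lr *= 0.5                     # performance got worse: halve the rate
            else:
                best_cv = cv_loss
        return model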

II-C Beta Stabilizer for DNN

The beta stabilizer is a scalar parameter attached to each layer of a DNN. For a normal DNN hidden layer, the formula is

\mathbf{y}=\mathbf{Wx}+\mathbf{b}

here 𝐱 is the input vector and 𝐲 is the output vector. 𝐖 is the linear transform parameter matrix and 𝐛 is the bias parameter vector.

With a scalar beta stabilizer parameter, the formula changes to

\mathbf{y}=e^{\beta}\mathbf{Wx}+\mathbf{b}

where e is the base of the natural logarithm and β is the stabilizer parameter.
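As a small illustration, the following NumPy sketch (with hypothetical toy dimensions and random weights) computes this forward pass:

    import numpy as np

    def beta_stabilized_forward(x, W, b, beta):
        # Beta-stabilized affine layer: y = e^beta * W x + b.
        return np.exp(beta) * (W @ x) + b

    # Hypothetical toy layer with 4 inputs and 3 outputs; beta starts at 0, so e^beta = 1.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((3, 4))
    b = np.zeros(3)
    x = rng.standard_normal(4)
    y = beta_stabilized_forward(x, W, b, beta=0.0)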

In the training procedure, the forward propagation phase can be done by directly following the above formula. The back-propagation phase needs to calculate the gradients of the objective function with respect to 𝐱, 𝐖, 𝐛 and β.

The gradients with respect to 𝐱 and 𝐖 have minor changes,

\frac{\partial\mathcal{L}}{\partial\mathbf{x}}=e^{\beta}\mathbf{W}^{T}\frac{\partial\mathcal{L}}{\partial\mathbf{y}}

and

\frac{\partial\mathcal{L}}{\partial\mathbf{W}}=e^{\beta}\frac{\partial\mathcal{L}}{\partial\mathbf{y}}\mathbf{x}^{T}.

The gradient with respect to 𝐛 remains unchanged,

\frac{\partial\mathcal{L}}{\partial\mathbf{b}}=\frac{\partial\mathcal{L}}{\partial\mathbf{y}}.

The remaining question is how to update the stabilizer parameter β. By the chain rule,

\frac{\partial\mathcal{L}}{\partial\beta}=\frac{\partial\mathcal{L}}{\partial\mathbf{y}}^{T}\frac{\partial\mathbf{y}}{\partial\beta}=e^{\beta}\frac{\partial\mathcal{L}}{\partial\mathbf{y}}^{T}\mathbf{Wx}.

Since

\frac{\partial\mathcal{L}}{\partial\mathbf{x}}^{T}=e^{\beta}\frac{\partial\mathcal{L}}{\partial\mathbf{y}}^{T}\mathbf{W},

we have

\frac{\partial\mathcal{L}}{\partial\beta}=\frac{\partial\mathcal{L}}{\partial\mathbf{x}}^{T}\mathbf{x},

i.e. the inner product of \frac{\partial\mathcal{L}}{\partial\mathbf{x}} and 𝐱. The update rule is

\beta=\beta-\eta\frac{\partial\mathcal{L}}{\partial\mathbf{x}}^{T}\mathbf{x}.

This means that the update of β relies on the relation between the layer input and its gradient: β increases if scaling 𝐱 up can improve the performance, and vice versa.

Thus β relies on the value and gradient of the input vector 𝐱. For a DNN with multiple hidden layers, these values depend on the activation function. This is the reason why we investigate the performance of the beta stabilizer in a DNN with the relu activation function.

At the beginning of the training procedure, all β values are set to 0, so e^β = 1 and the initial model is identical to the one without stabilizer parameters.
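To make the back-propagation concrete, the following NumPy sketch (with hypothetical toy shapes and a given upstream gradient with respect to 𝐲) computes all four gradients and checks the inner-product shortcut for β against the chain-rule form:

    import numpy as np

    def beta_stabilized_backward(x, W, beta, dL_dy):
        # Gradients of a beta-stabilized affine layer y = e^beta * W x + b.
        scale = np.exp(beta)
        dL_dx = scale * W.T @ dL_dy            # dL/dx = e^beta W^T dL/dy
        dL_dW = scale * np.outer(dL_dy, x)     # dL/dW = e^beta dL/dy x^T
        dL_db = dL_dy                          # dL/db = dL/dy (unchanged)
        dL_dbeta = dL_dx @ x                   # inner product of dL/dx and x
        return dL_dx, dL_dW, dL_db, dL_dbeta

    # Hypothetical toy check that the shortcut equals e^beta * dL/dy^T W x.
    rng = np.random.default_rng(1)
    W, x, dL_dy = rng.standard_normal((3, 4)), rng.standard_normal(4), rng.standard_normal(3)
    beta = 0.3
    _, _, _, dL_dbeta = beta_stabilized_backward(x, W, beta, dL_dy)
    assert np.isclose(dL_dbeta, np.exp(beta) * dL_dy @ (W @ x))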

III Beta Stabilizer for LSTM

LSTM is an architecture that uses memory cells to keep information [29] and has become the state-of-the-art solution for speech recognition and handwriting recognition. It can be implemented by the following formulas:

\mathbf{i_{t}}=\sigma(\mathbf{W_{xi}x_{t}}+\mathbf{W_{hi}h_{t-1}}+\mathbf{W_{ci}c_{t-1}}+\mathbf{b_{i}})
\mathbf{f_{t}}=\sigma(\mathbf{W_{xf}x_{t}}+\mathbf{W_{hf}h_{t-1}}+\mathbf{W_{cf}c_{t-1}}+\mathbf{b_{f}})
\mathbf{c_{t}}=\mathbf{f_{t}}\cdot\mathbf{c_{t-1}}+\mathbf{i_{t}}\cdot\tanh(\mathbf{W_{xc}x_{t}}+\mathbf{W_{hc}h_{t-1}}+\mathbf{b_{c}})
\mathbf{o_{t}}=\sigma(\mathbf{W_{xo}x_{t}}+\mathbf{W_{ho}h_{t-1}}+\mathbf{W_{co}c_{t}}+\mathbf{b_{o}})
\mathbf{h_{t}}=\mathbf{o_{t}}\cdot\tanh(\mathbf{c_{t}})

here σ is the sigmoid function.

In a DNN, the beta stabilizer is applied to the linear transform matrix. In one LSTM layer, however, there are three gates and one main affine operation. Three ways have been considered to extend the beta stabilizer to LSTM: the layer-shared beta stabilizer, the gate-shared beta stabilizer and the independent beta stabilizer.

The layer-shared beta stabilizer means that a single e^β is added for all the linear transform operations in one LSTM layer, and the gate-shared beta stabilizer means that every individual gate in the LSTM has its own beta stabilizer. Because we believe the beta stabilizer acts as a kind of normalization of the linear transform matrices, we calculated the l2-norm of every linear transform matrix of a trained LSTM model. We found that the l2-norm of the matrices applied to the cell values (i.e. 𝐖𝐜𝐢, 𝐖𝐜𝐟, 𝐖𝐜𝐨) is one order of magnitude smaller than that of the other matrices, and that the l2-norm of the matrices applied to the input vector (i.e. 𝐖𝐱𝐢, 𝐖𝐱𝐟, 𝐖𝐱𝐜, 𝐖𝐱𝐨) in the first LSTM layer is half the l2-norm of the matrices applied to the hidden activations (i.e. 𝐖𝐡𝐢, 𝐖𝐡𝐟, 𝐖𝐡𝐜, 𝐖𝐡𝐨). This suggests that a shared beta stabilizer may not be suitable for LSTM.
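A minimal sketch of this diagnostic, assuming the trained weights are available as a dictionary of NumPy arrays keyed by hypothetical names such as "W_ci" or "W_hi" (random stand-ins are used here in place of a trained model):

    import numpy as np

    def l2_norms(weights):
        # l2 (Frobenius) norm of every linear transform matrix.
        return {name: float(np.linalg.norm(W)) for name, W in weights.items()}

    # Hypothetical stand-ins for the matrices of one trained LSTM layer.
    rng = np.random.default_rng(2)
    names = ["W_xi", "W_hi", "W_ci", "W_xf", "W_hf", "W_cf",
             "W_xc", "W_hc", "W_xo", "W_ho", "W_co"]
    weights = {name: rng.standard_normal((600, 600)) for name in names}
    for name, norm in sorted(l2_norms(weights).items()):
        print(f"{name}: {norm:.2f}")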

Therefore, the independent beta stabilizer has been selected as our solution: for every linear transform operation, a separate beta stabilizer is added. We believe that independent beta stabilizers can adjust the scale of every matrix separately and appropriately. The changed formulas are shown below:

\mathbf{i_{t}}=\sigma(e^{\beta_{xi}}\mathbf{W_{xi}x_{t}}+e^{\beta_{hi}}\mathbf{W_{hi}h_{t-1}}+e^{\beta_{ci}}\mathbf{W_{ci}c_{t-1}}+\mathbf{b_{i}})
\mathbf{f_{t}}=\sigma(e^{\beta_{xf}}\mathbf{W_{xf}x_{t}}+e^{\beta_{hf}}\mathbf{W_{hf}h_{t-1}}+e^{\beta_{cf}}\mathbf{W_{cf}c_{t-1}}+\mathbf{b_{f}})
\mathbf{c_{t}}=\mathbf{f_{t}}\cdot\mathbf{c_{t-1}}+\mathbf{i_{t}}\cdot\tanh(e^{\beta_{xc}}\mathbf{W_{xc}x_{t}}+e^{\beta_{hc}}\mathbf{W_{hc}h_{t-1}}+\mathbf{b_{c}})
\mathbf{o_{t}}=\sigma(e^{\beta_{xo}}\mathbf{W_{xo}x_{t}}+e^{\beta_{ho}}\mathbf{W_{ho}h_{t-1}}+e^{\beta_{co}}\mathbf{W_{co}c_{t}}+\mathbf{b_{o}})
\mathbf{h_{t}}=\mathbf{o_{t}}\cdot\tanh(\mathbf{c_{t}})

The back-propagation and update rules can be derived using the same methods as in Section II.
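For illustration, the following NumPy sketch performs one LSTM step with independent beta stabilizers, assuming hypothetical parameter dictionaries W, b and beta keyed by the subscripts used in the formulas above; gradients are omitted, since an automatic differentiation toolkit would normally handle them:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step_with_beta(x_t, h_prev, c_prev, W, b, beta):
        # One LSTM step in which every linear transform W_* is scaled by e^{beta_*}.
        s = lambda name: np.exp(beta[name]) * W[name]
        i_t = sigmoid(s("xi") @ x_t + s("hi") @ h_prev + s("ci") @ c_prev + b["i"])
        f_t = sigmoid(s("xf") @ x_t + s("hf") @ h_prev + s("cf") @ c_prev + b["f"])
        c_t = f_t * c_prev + i_t * np.tanh(s("xc") @ x_t + s("hc") @ h_prev + b["c"])
        o_t = sigmoid(s("xo") @ x_t + s("ho") @ h_prev + s("co") @ c_t + b["o"])
        h_t = o_t * np.tanh(c_t)
        return h_t, c_t

With all β values initialized to 0, every e^β equals 1 and the step reduces to the standard LSTM formulas.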

IV Experiments

TABLE I: Performance of local Chinese dataset with sigmoid DNN.
Init LR With stabilizer CE on CV Frame ACC on CV WER
0.8 False 2.842 40.6% 31.14%
0.1 False 3.394 33.7% 45.98%
0.8 True 2.835 42.4% 29.49%
0.1 True 2.830 41.9% 30.54%
0.01 True 2.772 41.6% 30.80%
TABLE II: Performance of local Chinese dataset with relu DNN.
Init LR With stabilizer CE on CV Frame ACC on CV WER
0.8 False 3.149 40.0% 32.68%
0.1 False 2.796 43.4% 28.97%
0.0125 False 2.622 43.4% 29.19%
0.0016 False 3.022 38.0% 39.43%
0.1 True 2.962 42.1% 30.74%
0.0125 True 2.802 41.7% 29.87%
0.0016 True 2.907 39.9% 32.53%
TABLE III: Performance of local Chinese dataset with LSTM.
Init LR With stabilizer CE on CV Frame ACC on CV WER
0.1 False 2.099 51.3% 26.04%
0.04 False 2.057 51.4% 26.25%
0.005 False 2.113 50.2% 26.88%
0.0006 False 3.236 36.6% 47.01%
0.1 True 2.133 50.0% 25.86%
0.04 True 2.171 49.8% 26.68%
0.005 True 2.228 49.7% 27.04%
0.0006 True 2.366 47.0% 29.35%
TABLE IV: Performance of Switchboard English dataset with sigmoid DNN.
Init LR With stabilizer CE on CV Frame ACC on CV WER
0.8 False 1.991 50.6% 21.3%
0.1 False 2.270 44.8% 27.1%
0.8 True 2.235 48.6% 21.4%
0.1 True 2.197 48.1% 22.4%
0.01 True 2.133 48.1% 23.5%
TABLE V: Performance of Switchboard English dataset with relu DNN.
Init LR With stabilizer CE on CV Frame ACC on CV WER
0.8 False 2.378 46.9% 23.4%
0.1 False 2.371 47.3% 22.4%
0.0125 False 2.099 48.5% 23.3%
0.0016 False 2.267 45.4% 27.0%
0.1 True 2.263 48.0% 22.4%
0.0125 True 2.164 47.7% 23.2%
0.0016 True 2.245 45.8% 26.2%
TABLE VI: Performance of Switchboard English dataset with LSTM.
Init LR With stabilizer CE on CV Frame ACC on CV WER
0.1 False 1.582 59.2% 20.5%
0.04 False 1.561 59.2% 20.5%
0.005 False 1.616 57.8% 21.7%
0.0006 False 1.807 54.0% 25.7%
0.1 True 1.649 57.8% 21.2%
0.04 True 1.628 58.3% 20.8%
0.005 True 1.665 57.3% 22.2%
0.0006 True 1.774 55.1% 23.2%

IV-A Experimental Setup

Two speech recognition corpora are used in our experiments. The first is a local 15-hour Chinese dataset. The second is a 50-hour subset of the Switchboard English dataset.

For every dataset, three network structures are prepared: a DNN with the sigmoid function, a DNN with the relu function and an LSTM.

For experiments with the same corpus and structure, only the initial learning rate is varied; all other parameters are the same. Learning rate halving is used as the learning rate scheduling method. All experiments are run on a single CUDA-based GPU card.

Figure 1: Experimental results of different ANN architectures, shown in three panels: (a) Sigmoid DNN, (b) Relu DNN, (c) LSTM. Chn stands for the local Chinese dataset and Swbd stands for the Switchboard English dataset in the legends.

IV-B Experimental Results for Local Chinese Dataset

The local Chinese dataset contains 15 hours of data in the training set. For this dataset, we use a DNN model with 6 hidden layers of 1024 nodes each and an LSTM model with 3 hidden layers of 600 nodes each.

Table I shows the results of the DNN with the sigmoid activation function. Without the beta stabilizer, changing the initial learning rate from 0.8 to 0.1 has a huge impact on the final performance, whereas with the beta stabilizer the initial learning rate has almost no effect on the performance.

Table II summarizes the performance of the DNN with the relu activation function. It shows that the beta stabilizer does not work as well as on the sigmoid DNN: the best WER with the beta stabilizer is slightly worse than without it. However, it also shows that the results with the beta stabilizer are less sensitive to the initial learning rate than those without it.

Table III shows the performance of LSTM. The results are almost the same when the initial learning rate is suitable, but when the initial value becomes relatively small, the network with the beta stabilizer achieves better performance.

IV-C Experimental Results for Switchboard English Dataset

For the 50-hour Switchboard dataset, we use a DNN with 6 hidden layers of 2048 nodes each and an LSTM with 3 hidden layers of 1024 nodes each.

Tables IV, V and VI show the results of the sigmoid DNN, the relu DNN and LSTM respectively. The performance of the sigmoid DNN is improved. Compared with the results on the local Chinese dataset, the performance of the relu DNN becomes better while the performance of LSTM becomes slightly worse. However, these tables again show that the networks with the beta stabilizer are less sensitive to the initial learning rate than the networks without it.

V Conclusion

From the above results, we conclude that beta stabilizer parameters can reduce the sensitivity of the results to the initial learning rate in both DNN and LSTM. Figure 1 clearly shows this. From Figure 1a, it is clear that beta stabilizer parameters achieve the best performance on the DNN with the sigmoid function. From Figures 1b and 1c, the DNN with the relu function and LSTM are less sensitive to the initial learning rate than the DNN with the sigmoid function. However, when the initial learning rate becomes relatively small, i.e. 0.0016 for the relu DNN and 0.0006 for LSTM, the networks with beta stabilizer parameters still give acceptable results. Even with an extremely small initial learning rate (0.0001), LSTM with the beta stabilizer can still give reasonable results, while LSTM without it cannot converge at all.

It is observed that the effects of the beta stabilizer on the DNN with the relu function and on LSTM are smaller than its effects on the DNN with the sigmoid function. The performance may be slightly worse with a suitable initial learning rate when the beta stabilizer is used for the DNN with the relu function and for LSTM. However, the beta stabilizer performs well when the initial value is relatively small. We conclude that the beta stabilizer can reduce the sensitivity to the initial learning rate across multiple ANN architectures.

VI Discussion

In more complicated networks such as the convolutional LSTM deep neural network (CLDNN) [30] [31] and multi-task networks [32], different parts of the network can have different beta stabilizers. [13] also mentioned that the beta stabilizer can be used not only with SGD but also with other training algorithms such as AdaGrad and AdaDelta.

Therefore our ongoing and future work includes 1) observing the results of beta stabilizer parameters on large-scale data, 2) trying beta stabilizer parameters with other training algorithms, and 3) adding beta stabilizer parameters to complicated networks.

Acknowledgement

This work was supported by the Shanghai Sailing Program No. 16YF1405300, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning, the China NSFC projects (No. 61573241 and No. 61603252) and the Interdisciplinary Program (14JCZ03) of Shanghai Jiao Tong University in China.

References

  • [1] L. E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite state markov chains,” The Annals of Mathematical Statistics, vol. 37, no. 6, pp. 1554–1563, 1966.
  • [2] J. C. Spall and J. L. Maryak, “A feasible bayesian estimator of quantiles for projectile accuracy from non-i.i.d. data,” Journal of the American Statistical Association, vol. 87, no. 419, pp. 676–681, 1992.
  • [3] M. Gales and S. Young, “The application of hidden markov models in speech recognition,” Foundations and trends in signal processing, vol. 1, no. 3, pp. 195–304, 2007.
  • [4] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, p. 1527–1554, 2006.
  • [5] C. Goller and A. Küchler, “Learning task-dependent distributed representations by backpropagation through structure,” IEEE Transactions on Neural Networks, vol. 1, pp. 347–352, 1996.
  • [6] G. E. Hinton, L. Deng, D. Yu, G. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. V., P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
  • [7] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade, pp. 9–50, Springer Berlin Heidelberg, 1998.
  • [8] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 281–305, 2012.
  • [9] A. P. George and W. B. Powell, “Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming,” Machine Learning, vol. 65, no. 1, pp. 167–198, 2011.
  • [10] M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” 2012. arXiv:1212.5701.
  • [11] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” The Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2010.
  • [12] S.-I. Amari, “Natural gradient works efficiently in learning,” Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
  • [13] P. Ghahremani and J. Droppo, “Self-stabilized deep neural network,” in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5450–5454, 2016.
  • [14] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the International Conference on Machine Learning, pp. 807–814, 2010.
  • [15] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proceedings of the International Conference on Machine Learning, vol. 30, 2013.
  • [16] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with LSTM recurrent networks,” The Journal of Machine Learning Research, vol. 3, pp. 115–143, 2003.
  • [17] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649, 2013.
  • [18] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the International Conference on Machine Learning, pp. 1764–1772, 2014.
  • [19] A. Graves, S. Fernández, and F. Gomez, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the International Conference on Machine Learning, pp. 359–376, 2006.
  • [20] Q. Liu, L. Wang, and Q. Huo, “A study on effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition,” in IEEE International Conference on Document Analysis and Recognition, pp. 461–465, 2015.
  • [21] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4960–4964, 2016.
  • [22] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” 2015. arXiv:1506.07503.
  • [23] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCHBOARD: telephone speech corpus for research and development,” in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 517–520, 1992.
  • [24] Y. Yao, L. Rosasco, and A. Caponnetto, “On early stopping in gradient descent learning,” Constructive Approximation, vol. 26, no. 2, pp. 289–315, 2007.
  • [25] I. Goodfellow, Y. Bengio, and A. Courville, “Deep learning.” Book in preparation for MIT Press, 2016.
  • [26] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in International Conference on Computational Statistics, pp. 177–187, 2010.
  • [27] T. Schaul, S. Zhang, and Y. LeCun, “No more pesky learning rates,” 2012. arXiv:1206.1106.
  • [28] D. Yu, A. Eversole, M. L. Seltzer, K. Yao, Z. Huang, B. Guenter, O. Kuchaiev, Y. Zhang, F. Seide, H. Wang, J. Droppo, G. Zweig, C. Rossbach, J. Currey, J. Gao, A. May, B. Peng, A. Stolcke, and M. Slaney, “An introduction to computational networks and the computational network toolkit,” 2014. MSR-TR-2014-112.
  • [29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, p. 1735–1780, 1997.
  • [30] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
  • [31] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in IEEE International Conference on Document Analysis and Recognition, vol. 3, pp. 958–962, 2003.
  • [32] M. Yin, S. Sivadas, K. Yu, and B. Ma, “Discriminatively trained joint speaker and environment representations for adaptation of deep neural network acoustic models,” in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5065–5069, 2016.