JAMC

Article http://dx.doi.org/10.26855/jamc.2022.06.010

Training RNN-T with CTC Loss in Automatic Speech Recognition

Guangdong Huang1, Dan Zhou1,*, Sen Dan2, Fen Chi1

1School of Science, China University of Geosciences, Beijing, China.

2DIDI Chuxing, Beijing, China.

*Corresponding author: Dan Zhou

Published: July 1, 2022

Abstract

The end-to-end model for automatic speech recognition (ASR) has recently gained significant interest in the research community. The standard RNN-T loss permits unreasonable alignments between the output and the labels: the model may emit a whole sequence of labels at a single time frame, while a single label can only be emitted by a single frame. We apply the CTC loss function to the RNN-T model to address these problems. Under the assumption that the output sequence is no longer than the input sequence, we propose an improved forward-backward algorithm to calculate the loss. Experimental results for speech recognition are reported on the AISHELL-1 speech corpus. The results show that the proposed model (denoted TCL) converges faster than RNN-T without causing gradient explosion. TCL achieves a 9.5% character error rate on the AISHELL-1 test set, outperforming both the baseline RNN-T reported by Z. Tian et al. [1] and standard CTC, which achieve 11.82% and 10.8% respectively. On the test set, RNN-T with CTC loss achieves a 19.63% relative reduction in character error rate compared with standard RNN-T and a 12.04% relative reduction compared with standard CTC.
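As background for the loss discussed above, the sketch below shows the standard CTC forward (alpha) recursion [2] in log space, implemented in NumPy. It is illustrative only: it does not reproduce the paper's improved forward-backward algorithm or the RNN-T lattice, and the function name ctc_neg_log_likelihood and the toy inputs are assumptions made for demonstration.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """Standard CTC forward (alpha) recursion in log space.

    log_probs: (T, V) per-frame log-probabilities from the network.
    labels:    target label sequence without blanks (assumed non-empty).
    Returns the negative log-likelihood of the label sequence.
    """
    T, V = log_probs.shape
    # Extended label sequence with blanks interleaved: ^ l1 ^ l2 ^ ... ^
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)

    neg_inf = -np.inf

    def logsumexp(*xs):
        xs = [x for x in xs if x > neg_inf]
        if not xs:
            return neg_inf
        m = max(xs)
        return m + np.log(sum(np.exp(x - m) for x in xs))

    # Initialization: a valid path starts with a blank or the first label.
    alpha = np.full((T, S), neg_inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a = logsumexp(a, alpha[t - 1, s - 1])
            # The skip transition is allowed only when the current symbol is
            # not blank and differs from the label two positions back.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logsumexp(a, alpha[t - 1, s - 2])
            alpha[t, s] = a + log_probs[t, ext[s]]

    # A valid path ends on the last label or the trailing blank.
    return -logsumexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, V = 8, 5  # 8 frames, vocabulary of 5 symbols with index 0 as blank
    logits = rng.normal(size=(T, V))
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    print(ctc_neg_log_likelihood(log_probs, labels=[2, 3, 3, 1]))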

References

[1] Z. Tian, J. Yi, J. Tao, Y. Bai, and Z. Wen. (2019). “Self-attention transducers for end-to-end speech recognition,” arXiv preprint arXiv:1909.13037, 2019.

[2] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. (2006). “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369-376.

[3] A. Graves. (2008). “Supervised sequence labelling with recurrent neural networks,” Ph.D. dissertation, Technical University of Munich, Germany, 2008.

[4] A. Graves and N. Jaitly. (2014). “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning, 2014, pp. 1764-1772.

[5] H. Soltau, H. Liao, and H. Sak. (2016). “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition,” arXiv preprint arXiv:1610.09975, 2016.

[6] A. Graves. (2012). “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.

[7] A. Graves, A.-r. Mohamed, and G. Hinton. (2013). “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645-6649.

[8] H. Sak, M. Shannon, K. Rao, and F. Beaufays. (2017). “Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping,” in Interspeech 2017, 2017.

[9] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. (2016). “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960-4964.

[10] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio. (2016). “End-to-end attention-based large vocabulary speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4945-4949.

[11] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al. (2018). “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774-4778.

[12] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals. (2015). “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.

[13] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur. (2016). “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Interspeech, 2016, pp. 2751-2755.

[14] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur. (2018). “End-to-end speech recognition using lattice-free MMI,” in Interspeech, 2018, pp. 12-16.

[15] A. Tripathi, H. Lu, H. Sak, and H. Soltau. (2019). “Monotonic recurrent neural network transducer and decoding strategies,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.

[16] D. P. Kingma and J. Ba. (2014). “Adam: A method for stochastic optimization,” arXiv, 2014.

[17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. (2017). “Automatic differentiation in PyTorch,” 2017.

[18] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie. (2019). “Component fusion: Learning replaceable language model component for end-to-end speech recognition system,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5361-5365.

How to cite this paper

How to cite this paper: Guangdong Huang, Dan Zhou, Sen Dan, Fen Chi. (2022) Training RNN-T with CTC Loss in Automatic Speech Recognition. Journal of Applied Mathematics and Computation, 6(2), 256-262.

DOI: http://dx.doi.org/10.26855/jamc.2022.06.010