Advances in Computer and Communication (ACC)

Article http://dx.doi.org/10.26855/acc.2024.02.011

Mongolian Topic Extraction Using Pre-trained Models

Ailiya, Qintu Si*, Siriguleng Wang

College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot, Inner Mongolia, China.

*Corresponding author: Qintu Si

Published: April 10, 2024

Abstract

This paper introduces a Mongolian topic extraction method that uses a pre-trained language model to improve the quality of Mongolian intelligent question answering. First, the Mongolian text data undergo preprocessing, including text correction, data cleaning, and word segmentation, to ensure the data are accurate and readable. Stop words are then removed to reduce noise, and high- and low-frequency words are filtered out to emphasize key terms when constructing a Mongolian thesaurus. After preprocessing, the pre-trained model represents Mongolian words as vectors that capture their semantics. Based on these representations, an unsupervised topic extraction method built on a topic model identifies and clusters similar topics within the text, providing a structured representation of the data. Experimental results show that the proposed method outperforms two traditional topic extraction methods, latent Dirichlet allocation and the embedded topic model, improving topic extraction quality by 0.3406 and 0.0675, respectively, which demonstrates its effectiveness and efficiency in extracting relevant topics from Mongolian text and in making the text easier to understand.
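To make the pipeline concrete, the following is a minimal Python sketch of the approach the abstract describes: stop-word removal and frequency filtering, document embeddings from a pre-trained model, and clustering of the embedded documents into topics. The model name, stop-word list, tokenizer, and frequency thresholds are illustrative assumptions rather than the authors' configuration; in practice a Mongolian-specific tokenizer, stop-word list (cf. [13]), and pre-trained model would be substituted.

```python
# Minimal sketch of the preprocessing + embedding + clustering pipeline.
# Assumptions (not from the paper): a multilingual sentence-transformer model,
# whitespace tokenization, and toy stop-word / frequency-filtering settings.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

STOP_WORDS = {"a", "and", "the", "of"}   # placeholder; use a Mongolian stop-word list
MIN_DF, MAX_DF_RATIO = 2, 0.5            # illustrative frequency-filtering thresholds


def preprocess(docs):
    """Tokenize, remove stop words, then drop very rare and very common terms."""
    tokenized = [[w for w in d.lower().split() if w not in STOP_WORDS] for d in docs]
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    vocab = {w for w, c in doc_freq.items() if MIN_DF <= c <= MAX_DF_RATIO * len(docs)}
    return [[w for w in doc if w in vocab] for doc in tokenized]


def extract_topics(docs, n_topics=5, top_n=10):
    """Embed documents with a pre-trained model and cluster them into topics."""
    tokens = preprocess(docs)
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model
    embeddings = model.encode([" ".join(doc) for doc in tokens])
    labels = KMeans(n_clusters=n_topics, n_init=10).fit_predict(embeddings)
    # Represent each topic by its most frequent terms (a simple stand-in
    # for weighting schemes such as class-based TF-IDF).
    topics = []
    for k in range(n_topics):
        counts = Counter(w for doc, lab in zip(tokens, labels) if lab == k for w in doc)
        topics.append([w for w, _ in counts.most_common(top_n)])
    return topics, tokens
```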
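The reported gains over LDA and the embedded topic model are differences in a topic-quality score; topic coherence (in the spirit of [16]) is the usual measure for such comparisons. The sketch below scores the extracted topic words with gensim's coherence implementation; the choice of the c_v measure is an assumption, not necessarily the metric used in the paper.

```python
# Sketch of evaluating topic quality with a coherence measure (cf. [16]).
# The c_v measure is an assumed choice for illustration.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel


def topic_coherence(topics, tokenized_docs):
    """topics: list of top-word lists; tokenized_docs: list of token lists."""
    cm = CoherenceModel(
        topics=topics,
        texts=tokenized_docs,
        dictionary=Dictionary(tokenized_docs),
        coherence="c_v",
    )
    return cm.get_coherence()


# Usage with the pipeline sketched above:
# topics, tokens = extract_topics(corpus)
# print(topic_coherence(topics, tokens))
```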

References

[1] Blei D M, Ng A Y, Jordan M I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

[2] Blei D M. (2011). Probabilistic topic models. Proceedings of the 17th ACM SIGKDD International Conference Tutorials. DOI:10.1145/2107736.2107741.

[3] Das R, Zaheer M, Dyer C. (2015). Gaussian LDA for topic models with word embeddings. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 795-804.

[4] Yang M, Cui T, Tu W. (2015). Ordering-sensitive and semantic-aware topic modeling. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1).

[5] Miao Y, Grefenstette E, Blunsom P. (2017). Discovering discrete latent topics with neural variational inference. Proceedings of the 34th International Conference on Machine Learning, PMLR, 2410-2419.

[6] Dieng A B, Ruiz F J R, Blei D M. (2020). Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8:439-453.

[7] Grootendorst M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. DOI:10.48550/arXiv.2203.05794.

[8] Yuhan W, Min L, Yanling L, et al. (2023). Research on embedded text topic model based on BERT. Computer Engineering and Applications, 59(1).

[9] Yupeng J. (2016). Research on the topic model of Chinese text based on LDA. Inner Mongolia University.

[10] Riguleng S. (2016). Research and system implementation of Mongolian information retrieval based on LDA. Inner Mongolia Normal University.

[11] Fu Y J. (2018). Mongolian short text semantic similarity calculation based on deep VAE algorithm with topic information. Inner Mongolia University.

[12] Shuguang B. (2021). Mongolian text keyword extraction and word analysis. Inner Mongolia Normal University. DOI:10.27230/d.cnki.gnmsu.2021.000910.

[13] Zheng G, Gaowa G. (2011). Comparative study on Mongolian stop words and English stop words. Journal of Chinese Information Processing, 25:35-38.

[14] GB/T 26235-2010. Information technology: Mongolian word tagging for information processing.

[15] Zhang Z, Han X, Liu Z, et al. (2019). ERNIE: Enhanced language representation with informative entities. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. DOI:10.18653/v1/P19-1139.

[16] Mimno D M, Wallach H M, Talley E M, et al. (2011). Optimizing semantic coherence in topic models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), Association for Computational Linguistics.

[17] Xiaobao W, Thong N, Tuan A L. (2024). A survey on neural topic models: methods, applications, and challenges. Artificial Intelligence Review, 57(2).

How to cite this paper

How to cite this paper: Ailiya, Qintu Si, Siriguleng Wang. (2024). Mongolian Topic Extraction Using Pre-trained Models. Advances in Computer and Communication, 5(1), 64-71.

DOI: http://dx.doi.org/10.26855/acc.2024.02.011