Language Preparation, Arabic Language Teaching Institute, Imam Muhammad bin Saud Islamic University, Saudi Arabia
Issue: 29 | Pages: 32-44 | June 2022 | https://doi.org/10.54940/ll16145890
Received: 21 March 2021 | Accepted: 18 April 2021 | Online: 27 March 2022
Abstract
This paper falls within the field of natural language processing. It applies an unsupervised machine learning approach to identify the latent topics in Saudi newspapers, using Latent Dirichlet Allocation (LDA), one of the most important unsupervised topic modeling algorithms. I built a corpus from Saudi newspapers; after the preprocessing stage, it contained 4,781 texts and 649,734 tokens. The results of training 20 models, each with ten words per topic, showed that the optimal number of topics for these texts is 7. The 7-topic model achieved a good coherence score of 0.6723. Each topic was inferred from the ten words with the highest probabilities in that topic. I labeled the topics, in order, as follows: surveillance and awareness, development and improvement, sports, health, economics, domestic affairs, and international politics. The 7-topic model was evaluated qualitatively by manually reviewing the coherence of the words in each topic. I also reviewed the first fifty texts in each topic to verify that each text belongs to the topic to which LDA assigned it. The qualitative evaluation was supported by running the algorithm again on the texts of each of the seven topics separately, to obtain more detail on each topic. Although the topic modeling results have some shortcomings, they can be improved and then used in discourse analysis in place of traditional approaches.
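The selection procedure described in the abstract (train several candidate models, score each by coherence, keep the best, then read each topic through its ten most probable words) can be sketched roughly as follows. This is only an illustration, not the author's code: the function names `select_num_topics` and `top_words`, the candidate range of 1 to 20 topics, and the toy scoring function are all assumptions. In a real run, the scorer would train an LDA model with the given number of topics and return its coherence.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def top_words(word_probs: Dict[str, float], n: int = 10) -> List[str]:
    """Return the n words with the highest probability in one topic."""
    # Sort by descending probability; break ties alphabetically for determinism.
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def select_num_topics(
    candidates: Iterable[int],
    score: Callable[[int], float],
) -> Tuple[int, Dict[int, float]]:
    """Score every candidate topic count; return the best one and all scores."""
    scores = {k: score(k) for k in candidates}
    # Highest score wins; on a tie, prefer the smaller number of topics.
    best_k = max(scores, key=lambda k: (scores[k], -k))
    return best_k, scores


# Hypothetical stand-in for "train LDA with k topics and measure coherence".
# It is a synthetic curve that peaks at k = 5, NOT a result from the paper.
def toy_score(k: int) -> float:
    return 1.0 - abs(k - 5) / 10


# Assumed candidate range: 20 models, one per topic count from 1 to 20.
best_k, all_scores = select_num_topics(range(1, 21), toy_score)

# Toy topic-word distribution (made-up words and probabilities).
toy_topic = {"match": 0.30, "team": 0.25, "league": 0.20, "goal": 0.15, "coach": 0.10}
leading_words = top_words(toy_topic, n=3)
```

Keeping the scorer as a plain callable separates the selection logic from whichever topic modeling library computes the coherence, so the same loop works with any coherence measure.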
Keywords
Topic model, CV coherence, LDA algorithm, optimal model, machine learning
How to Cite
Altamimi, A. (June 2022). Evaluating Topic Modeling for Saudi Newspapers Texts Using LDA: A Computational Linguistics Study. Journal of Umm Al-Qura University for Language Sciences and Literature, 29, 32–44. https://doi.org/10.54940/ll16145890