Evaluating Topic Modeling for Saudi Newspapers Texts Using LDA: A Computational Linguistics Study

A. Altamimi
Language Preparation, Arabic Language Teaching Institute, Imam Muhammad bin Saud Islamic University, Saudi Arabia
Issue: 29 | Pages: 32-44 | June 2022 | https://doi.org/10.54940/ll16145890

Received: 21 March 2021 | Accepted: 18 April 2021 | Online: 27 March 2022

Abstract

This paper, in the field of natural language processing, applies an unsupervised machine learning approach to identify the latent topics in Saudi newspapers using Latent Dirichlet Allocation (LDA), one of the most important unsupervised topic modeling algorithms. I built a corpus from Saudi newspapers; after preprocessing, it contained 4,781 texts and 649,734 tokens. Training 20 models, with ten words per topic, showed that the optimal number of topics for these texts is 7. The 7-topic model achieved a good coherence score of 0.6723. Each topic was inferred from the ten words with the highest probabilities for that topic. I interpreted the seven topics, in order, as: surveillance and awareness, development and improvement, sports, health, economics, domestic affairs, and international politics. The 7-topic model was evaluated qualitatively by manually reviewing the coherence of the words in each topic. I also reviewed the first fifty texts in each topic to confirm that each text belongs to the topic LDA assigned it to. The qualitative evaluation was supported by running the algorithm again on the texts of each of the seven topics to obtain more detail on each topic separately. Although the topic modeling results have some shortcomings, they can be optimized and then used in discourse analysis in place of traditional approaches.
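
The selection procedure summarized above (train several LDA models, score each by coherence, keep the best one, and read each topic off its highest-probability words) can be sketched roughly as follows. This is a minimal standard-library illustration, not the study's implementation: the toy English corpus, the function names, the hyperparameters, and the candidate range are all assumptions made for the example, and UMass coherence is swapped in for the CV coherence named in the keywords because it can be computed directly from document co-occurrence counts.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus (already tokenized); the study's corpus is not reproduced here.
TOY_DOCS = [
    ["match", "team", "goal", "league", "player"],
    ["team", "player", "coach", "goal", "season"],
    ["hospital", "doctor", "patient", "vaccine", "health"],
    ["health", "patient", "clinic", "doctor", "treatment"],
    ["market", "oil", "price", "bank", "investment"],
    ["bank", "price", "investment", "market", "stock"],
]

def train_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns (topic-word counts, doc-topic counts)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = [rng.randrange(num_topics) for _ in doc]
        assignments.append(z_d)
        for w, z in zip(doc, z_d):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, doc_topic

def top_words(topic_word, n=10):
    """Top-n words per topic; with a symmetric beta, count order equals probability order."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]

def umass_coherence(topics, docs):
    """Mean UMass coherence over all ranked word pairs of all topics (higher is better)."""
    doc_sets = [set(doc) for doc in docs]
    total, pairs = 0.0, 0
    for words in topics:
        for m in range(1, len(words)):
            for l in range(m):
                d_l = sum(1 for s in doc_sets if words[l] in s)
                d_ml = sum(1 for s in doc_sets if words[l] in s and words[m] in s)
                total += math.log((d_ml + 1) / d_l)
                pairs += 1
    return total / pairs if pairs else float("-inf")

def select_num_topics(docs, candidates, n_words=5):
    """Train one model per candidate topic count and keep the most coherent one."""
    scores = {}
    for k in candidates:
        topic_word, _ = train_lda(docs, k)
        scores[k] = umass_coherence(top_words(topic_word, n_words), docs)
    return max(scores, key=scores.get), scores

best_k, scores = select_num_topics(TOY_DOCS, candidates=range(2, 5))
print("coherence by number of topics:", scores)
print("selected number of topics:", best_k)
```

On a real corpus one would normally rely on an established topic modeling library rather than a hand-written sampler; the sketch is only meant to make the train, score, and select loop concrete.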

Keywords

Topic model, CV coherence, LDA algorithm, optimal model, machine learning

How to Cite

Altamimi, A. (June 2022). Evaluating Topic Modeling for Saudi Newspapers Texts Using LDA: A Computational Linguistics Study. Journal of Umm Al-Qura University for Language Sciences and Literature, 29, 32–44. https://doi.org/10.54940/ll16145890

License

Creative Commons License
