International Journal of

ADVANCED AND APPLIED SCIENCES

EISSN: 2313-3724, Print ISSN: 2313-626X

Frequency: 12


 Volume 12, Issue 9 (September 2025), Pages: 140-151

----------------------------------------------

 Original Research Paper

Optimization of Arabic text classification using SVM integrated with word embedding models on a novel dataset

 Author(s): 

 Abdulaziz M. Alayba *, Mohammed Altamimi

 Affiliation(s):

 Department of Information and Computer Science, College of Computer Science and Engineering, University of Ha'il, Ha'il 81481, Saudi Arabia


 * Corresponding Author. 

   Corresponding author's ORCID profile:  https://orcid.org/0000-0003-1075-1214

 Digital Object Identifier (DOI)

  https://doi.org/10.21833/ijaas.2025.09.013

 Abstract

Arabic linguistics covers various areas such as morphology, syntax, semantics, historical linguistics, applied linguistics, pragmatics, and computational linguistics. The Arabic language presents major challenges for natural language processing (NLP) due to its complex morphological and semantic structure. In text classification tasks, effective feature selection is essential, and word embedding techniques have recently proven successful in representing textual data in a continuous vector space, capturing both semantic and morphological relationships. This study introduces a new, balanced Arabic text dataset for classification and examines the performance of combining word embedding models (Word2Vec, GloVe, and fastText) with a Support Vector Machine (SVM) classifier. The approach converts dense vector representations of Arabic text into single-value features for SVM input. Experimental results show that this method significantly outperforms the benchmark Term Frequency–Inverse Document Frequency (TF-IDF) approach, offering more accurate and reliable classification by effectively capturing Arabic contextual information.
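The pipeline described above (pretrained word embeddings pooled into document features and fed to an SVM) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dictionary, the tiny two-class Arabic corpus, and mean-pooling as the vector-reduction step are all stand-in assumptions; the paper itself trains Word2Vec, GloVe, and fastText models and reduces their dense vectors to single-value features for the SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in embeddings; the study uses pretrained Word2Vec,
# GloVe, and fastText vectors learned from Arabic text.
rng = np.random.default_rng(42)
DIM = 50
vocab = ["اقتصاد", "سوق", "أسهم", "مباراة", "فريق", "هدف"]
embeddings = {w: rng.normal(size=DIM) for w in vocab}

def doc_vector(tokens):
    """Mean-pool the word vectors of a document into one dense feature vector.

    The paper reduces dense representations to single-value features;
    mean-pooling is used here only as one simple, illustrative reduction.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

# Tiny illustrative corpus: economy documents (label 0) vs. sport (label 1).
docs = [["اقتصاد", "سوق"], ["سوق", "أسهم"], ["مباراة", "فريق"], ["فريق", "هدف"]]
labels = [0, 0, 1, 1]

X = np.array([doc_vector(d) for d in docs])
clf = SVC(kernel="linear", C=10.0).fit(X, labels)
print(clf.predict(X))
```

In practice the toy dictionary would be replaced by vectors loaded from a trained embedding model, and the classifier evaluated on held-out documents rather than the training set.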

 © 2025 The Authors. Published by IASE.

 This is an open access article under the CC BY-NC-ND license ( http://creativecommons.org/licenses/by-nc-nd/4.0/).

 Keywords

 Arabic linguistics, Natural language processing, Text classification, Word embedding, Support vector machine

 Article history

 Received 10 March 2025, Received in revised form 14 July 2025, Accepted 14 August 2025

 Data availability

The datasets generated during the current study are available in the Mendeley Data repository at: https://data.mendeley.com/datasets/w8njshybth/1

 Acknowledgment

This research has been funded by the Scientific Research Deanship at the University of Ha’il – Saudi Arabia through project number BA-2207. 

 Compliance with ethical standards

 Conflict of interest: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

 Citation:

 Alayba AM and Altamimi M (2025). Optimization of Arabic text classification using SVM integrated with word embedding models on a novel dataset. International Journal of Advanced and Applied Sciences, 12(9): 140-151


 Figures

  Fig. 1  Fig. 2  Fig. 3

 Tables

  Table 1  Table 2  Table 3  Table 4  Table 5

----------------------------------------------   
