SVM significant role selection method for improving semantic text plagiarism detection

Osman, Ahmed Hamza; Barukab, Omar M.

International Journal of Advanced and Applied Sciences

Int. j. adv. appl. sci.

EISSN: 2313-3724

Print ISSN: 2313-626X

Volume 4, Issue 8 (August 2017), Pages: 112-122

Author(s): Ahmed Hamza Osman *, Omar M. Barukab

Affiliation(s):

Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21911, Saudi Arabia

https://doi.org/10.21833/ijaas.2017.08.016

Abstract:

This research introduces an approach for predicting and detecting plagiarized text based on Semantic Role Labeling (SRL) and a Support Vector Machine (SVM). The method analyses text according to the semantic position of each term within it, and uses SRL to capture the semantic sense of the source by considering the relations between its terms. SRL offers a significant advantage in that it generates roles from the text semantically. A key part of the proposed system is applying the SVM to every generated role in order to predict which roles are significant. Only the important roles selected by the SVM are used in the similarity calculation. The proposed method was evaluated on the PAN-PC-10 dataset. The results show that it improved performance on the evaluation measures compared with other plagiarism detection methods.
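
The idea of restricting the similarity calculation to significant roles can be illustrated with a minimal sketch. It is not the authors' implementation: the role labels (ARG0, V, ARG1, ARGM-TMP), the `RoleFrame` representation, the function names, and the use of Jaccard overlap as the similarity measure are all assumptions made for illustration, and the set of significant roles is supplied by hand here rather than predicted by a trained SVM.

```python
from typing import Dict, Set

# Hypothetical representation: a sentence already labelled by some SRL tool,
# mapping a role label (e.g. "ARG0") to the set of terms filling that role.
RoleFrame = Dict[str, Set[str]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two term sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def role_similarity(suspect: RoleFrame, source: RoleFrame,
                    significant_roles: Set[str]) -> float:
    """Average per-role Jaccard overlap, restricted to the given roles.

    Roles outside `significant_roles` are ignored, so only the roles judged
    important contribute to the score. Jaccard is an assumed measure.
    """
    roles = [r for r in significant_roles if r in suspect or r in source]
    if not roles:
        return 0.0
    total = sum(jaccard(suspect.get(r, set()), source.get(r, set()))
                for r in roles)
    return total / len(roles)


# Toy, hand-labelled sentences (illustrative data only).
suspect = {"ARG0": {"researcher"}, "V": {"propose"},
           "ARG1": {"method"}, "ARGM-TMP": {"today"}}
source = {"ARG0": {"researcher"}, "V": {"propose"},
          "ARG1": {"method", "new"}, "ARGM-TMP": {"yesterday"}}

all_roles = {"ARG0", "V", "ARG1", "ARGM-TMP"}
# Assumed stand-in for the roles a trained classifier would mark significant.
significant = {"ARG0", "V", "ARG1"}

score_all = role_similarity(suspect, source, all_roles)
score_significant = role_similarity(suspect, source, significant)
print(score_all, score_significant)
```

In this toy case, dropping the temporal role raises the score because the two sentences differ only in a role treated as insignificant. Which roles actually matter would have to be learned from labelled data.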

This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords: Plagiarism detection, Semantic similarity, Semantic role, SVM classifier, NLP

Article History: Received 21 April 2017, Received in revised form 13 July 2017, Accepted 14 July 2017

Citation:

Osman AH and Barukab OM (2017). SVM significant role selection method for improving semantic text plagiarism detection. International Journal of Advanced and Applied Sciences, 4(8): 112-122

http://www.science-gate.com/IJAAS/V4I8/Osman.html
