AN EFFICIENT SENTENCE-BASED PLAGIARISM DETECTION ALGORITHM
The Open University of Hong Kong (HONG KONG)
About this paper:
Appears in: INTED2009 Proceedings
Publication year: 2009
Pages: 3751-3757
ISBN: 978-84-612-7578-6
ISSN: 2340-1079
Conference name: 3rd International Technology, Education and Development Conference
Dates: 9-11 March, 2009
Location: Valencia, Spain
Abstract:
The Internet and digital libraries have rapidly developed into a convenient information source for research. Such convenience entices students to copy the work of others and worsens the plagiarism problems that have long existed in universities. This greatly increases the workload of academics, who must identify and prove plagiarized work in order to properly educate students about intellectual property. To alleviate this burden, plagiarism detection systems that can automatically detect suspected plagiarized work are needed. Except for verbatim copying, detecting plagiarism is not a simple endeavor. Designing an efficient and effective plagiarism detection algorithm that minimizes both false-positive and false-negative results is a challenge.

In this paper, we propose a new sentence-based plagiarism detection algorithm that not only quickly identifies suspected plagiarized work but also gives academics an easy-to-interpret quantitative measure for evaluating the severity of the offence. The algorithm allows parametric control to reduce false-positive and false-negative results. It uses information retrieval and sequence matching techniques to identify suspected plagiarized sentences in two stages. In the first stage, Information Retrieval (IR) techniques are used to extract and index the keywords in sentences. An asymmetric similarity measure is designed to evaluate the similarity of sentences based on their common keywords. Sentences with a high asymmetric similarity score are suspected of being plagiarized. IR techniques, with appropriate use of stop words and word stemming, allow quick identification of word-by-word plagiarism and of changes of syntax in plagiarized sentences. Since sentences can share keywords by chance, using common keywords alone as an indicator of suspected plagiarism could produce a large number of misleading results. To minimize false-positive results, a Keyword Sequence Matching (KSM) algorithm is used in the second stage. The KSM algorithm computes the sum of the longest common keyword sequence for similar sentences identified in the first stage in O(n^2) time. In addition to keyword overlap, a significant match in the common keyword sequence provides much stronger evidence of plagiarism.
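
The two stages described above can be illustrated with a minimal Java sketch. It is not the paper's implementation: the abstract does not define the asymmetric similarity measure or the KSM algorithm, so the sketch assumes the similarity is the fraction of the suspect sentence's distinct keywords that also occur in the source sentence, and it substitutes a standard longest-common-subsequence dynamic programme over keyword lists for KSM. The class name, method names, stop-word list and thresholds are all hypothetical, and stemming is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SentenceMatchSketch {

    // Tiny illustrative stop-word list (an assumption, not the paper's list).
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "to", "in", "on", "and", "or", "is", "are"));

    /** Lower-cases the sentence, splits on non-letters and drops stop words. No stemming. */
    static List<String> keywords(String sentence) {
        List<String> result = new ArrayList<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    /**
     * Stage 1 (assumed form): share of the suspect's distinct keywords that also
     * appear in the source. Swapping the arguments generally changes the score,
     * which is what makes the measure asymmetric.
     */
    static double asymmetricSimilarity(List<String> suspect, List<String> source) {
        Set<String> suspectSet = new HashSet<>(suspect);
        if (suspectSet.isEmpty()) {
            return 0.0;
        }
        Set<String> sourceSet = new HashSet<>(source);
        int common = 0;
        for (String keyword : suspectSet) {
            if (sourceSet.contains(keyword)) {
                common++;
            }
        }
        return (double) common / suspectSet.size();
    }

    /** Stage 2 stand-in: length of the longest common subsequence of keywords, O(n*m) time. */
    static int longestCommonKeywordSequence(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.size()][b.size()];
    }

    /** Flags a pair only if it passes both the keyword-overlap and the sequence thresholds. */
    static boolean isSuspected(String suspect, String source, double minSimilarity, int minSequence) {
        List<String> suspectKeywords = keywords(suspect);
        List<String> sourceKeywords = keywords(source);
        return asymmetricSimilarity(suspectKeywords, sourceKeywords) >= minSimilarity
                && longestCommonKeywordSequence(suspectKeywords, sourceKeywords) >= minSequence;
    }
}
```

Under these assumptions, "The dog chased the fox" and "The fox chased the dog" share all three keywords, so the stage-one score is 1.0, yet their longest common keyword sequence has length 1. A call such as `isSuspected(suspect, source, 0.8, 3)` would therefore reject the pair, which is the kind of chance overlap the second stage is meant to filter out.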

The algorithm has been implemented in Java using Apache Lucene (an open-source IR system) and Paoding Analyzer (a Chinese text analyzer). The system can detect plagiarism in English, Chinese and Simplified Chinese text. Applying the algorithm to typical cases found in the literature shows that it is very effective. Using the system in a course of over 400 students produced satisfactory results.