About this paper

Appears in: INTED2009 Proceedings
Pages: 3751-3757
Publication year: 2009
ISBN: 978-84-612-7578-6
ISSN: 2340-1079

Conference name: 3rd International Technology, Education and Development Conference
Dates: 9-11 March, 2009
Location: Valencia, Spain

AN EFFICIENT SENTENCE-BASED PLAGIARISM DETECTION ALGORITHM

S.S. Lam, P.M. Choi

The Open University of Hong Kong (HONG KONG)
The Internet and digital libraries have rapidly developed into a convenient information source for research. Such convenience entices students to copy the work of others and worsens the plagiarism problems that have long existed in universities. This greatly increases the workload of academics, who must identify and prove plagiarized work in order to properly educate students about intellectual property. To alleviate this burden, plagiarism detection systems that can automatically detect suspected plagiarized work are needed. Except for verbatim copying, detecting plagiarism is not a simple endeavor. Designing an efficient and effective plagiarism detection algorithm that minimizes both false-positive and false-negative results is a challenge.

In this paper, we propose a new sentence-based plagiarism detection algorithm that not only quickly identifies suspected plagiarized work but also gives academics an easy-to-interpret quantified measure for evaluating the severity of the offence. The algorithm allows parametric control to reduce the generation of false-positive and false-negative results. It uses information retrieval and sequence matching techniques to identify suspected plagiarized sentences in two stages. In the first stage, Information Retrieval (IR) techniques are used to extract and index the keywords in sentences. An asymmetric similarity measure is designed to evaluate the similarity of sentences based on their common keywords. Sentences with a high asymmetric similarity score are suspected to be plagiarized. Information retrieval techniques, with appropriate use of stop words and word stemming, allow quick identification of word-by-word plagiarism and of changes of syntax in plagiarized sentences. Since sentences can share keywords by chance, using common keywords alone as an indicator of suspected plagiarism could produce a large number of misleading results. To minimize false-positive results, a Keyword Sequence Matching (KSM) algorithm is used in the second stage. The KSM algorithm computes the sum of the longest common keyword sequence for the similar sentences identified in the first stage in O(n^2) time. In addition to keyword overlap, a significant match in the common keyword sequence provides much stronger evidence to prove a plagiarism case.
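The two stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the asymmetric similarity measure or the KSM algorithm, so the sketch assumes the similarity is the fraction of the suspect sentence's distinct keywords that also occur in the source sentence, and it uses the classic longest-common-subsequence dynamic program as a stand-in for KSM. The class name, method names, and the thresholds `minSimilarity` and `minSequence` are hypothetical. Keywords are assumed to be already stemmed and stripped of stop words.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SentenceMatchSketch {

    /**
     * Stage 1 (assumed form): fraction of the suspect sentence's distinct
     * keywords that also appear in the source sentence. The measure is
     * asymmetric because it is normalized by the suspect sentence only.
     */
    public static double asymmetricSimilarity(List<String> suspect, List<String> source) {
        Set<String> suspectKeys = new HashSet<>(suspect);
        if (suspectKeys.isEmpty()) {
            return 0.0;
        }
        Set<String> sourceKeys = new HashSet<>(source);
        int common = 0;
        for (String keyword : suspectKeys) {
            if (sourceKeys.contains(keyword)) {
                common++;
            }
        }
        return (double) common / suspectKeys.size();
    }

    /**
     * Stage 2 (stand-in for KSM): length of the longest common subsequence
     * of keywords, computed with the standard O(n * m) dynamic program.
     */
    public static int longestCommonKeywordSequence(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    /**
     * Two-stage check: the cheap keyword-overlap filter runs first, and the
     * sequence match runs only on pairs that pass it. Both thresholds are
     * hypothetical tuning parameters.
     */
    public static boolean isSuspected(List<String> suspect, List<String> source,
                                      double minSimilarity, int minSequence) {
        if (asymmetricSimilarity(suspect, source) < minSimilarity) {
            return false;
        }
        return longestCommonKeywordSequence(suspect, source) >= minSequence;
    }
}
```

Under these assumptions, a short suspect sentence whose keywords are all contained in a longer source sentence scores 1.0 in one direction but less in the other, which is what makes the measure asymmetric. Raising `minSequence` trades false positives for false negatives, in the spirit of the parametric control the abstract mentions.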

The algorithm has been implemented in Java using Apache Lucene (an open-source IR system) and Paoding Analyzer (a Chinese text analyzer). The system can detect plagiarism in English, Chinese and Simplified Chinese text. Applying the algorithm to typical cases found in the literature shows that it is very effective. Using the system in a course of over 400 students produced satisfactory results.
@InProceedings{LAM2009ANE,
author = {Lam, S.S. and Choi, P.M.},
title = {AN EFFICIENT SENTENCE-BASED PLAGIARISM DETECTION ALGORITHM},
series = {3rd International Technology, Education and Development Conference},
booktitle = {INTED2009 Proceedings},
isbn = {978-84-612-7578-6},
issn = {2340-1079},
publisher = {IATED},
location = {Valencia, Spain},
month = {9-11 March, 2009},
year = {2009},
pages = {3751-3757}}
TY - CONF
AU - S.S. Lam
AU - P.M. Choi
TI - AN EFFICIENT SENTENCE-BASED PLAGIARISM DETECTION ALGORITHM
SN - 978-84-612-7578-6/2340-1079
PY - 2009
Y1 - 9-11 March, 2009
CI - Valencia, Spain
JO - 3rd International Technology, Education and Development Conference
JA - INTED2009 Proceedings
SP - 3751
EP - 3757
ER -
S.S. Lam, P.M. Choi (2009) AN EFFICIENT SENTENCE-BASED PLAGIARISM DETECTION ALGORITHM, INTED2009 Proceedings, pp. 3751-3757.