Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (9): 2030-2040. DOI: 10.3778/j.issn.1673-9418.2103011

• Database Technology •

SQL-Detector: SQL Plagiarism Detection Technique Based on Coding Features

XU Jia1,2,3, MO Xiaokun1, YU Ge4, LYU Pin1,2,3,+, WEI Tingting1

  1. School of Computer Electronics and Information, Guangxi University, Nanning 530004, China
  2. Guangxi Key Laboratory of Multimedia Communications and Network Technology, Guangxi University, Nanning 530004, China
  3. Guangxi Colleges and Universities Key Laboratory of Parallel and Distributed Computing, Guangxi University, Nanning 530004, China
  4. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  • Received: 2021-03-03 Revised: 2021-07-28 Published: 2022-09-01 Online: 2021-08-06
  • Corresponding author: + E-mail: lvpin@gxu.edu.cn
  • About the authors:
    XU Jia, born in 1984 in Rongcheng, Shandong, Ph.D., associate professor, M.S. supervisor, senior member of CCF, member of the CCF Database Committee. Her research interests include database theory and technology, and educational data analysis and mining.
    MO Xiaokun, born in 1996 in Qingyuan, Guangdong, M.S., student member of CCF. His research interest is automatic grading technology for SQL exercises.
    YU Ge, born in 1962 in Dalian, Liaoning, Ph.D., professor, Ph.D. supervisor, fellow of CCF. His research interests include database theory and technology, and parallel and distributed computing.
    LYU Pin, born in 1983 in Binzhou, Shandong, Ph.D., associate professor, M.S. supervisor, senior member of CCF, member of the CCF Cooperative Computing Committee. His research interests include the Internet of things and educational big data.
    WEI Tingting, born in 1996 in Guiping, Guangxi, M.S. candidate. Her research interest is SQL exercise recommendation.
  • Supported by:
    National Natural Science Foundation of China (62067001); National Natural Science Foundation of China (U1811261); Special Funds for Guangxi BaGui Scholars; Project of Higher Education Undergraduate Teaching Reform in Guangxi (2020JGA116); Project of Higher Education Undergraduate Teaching Reform in Guangxi (2017JGZ103); Innovation Project of Guangxi Graduate Education (JGY2021003); Natural Science Foundation of Guangxi (2019JJA170045)

Abstract:

Mastering structured query language (SQL) is key to learning database technology. However, extensive teaching practice shows that some students plagiarize when doing SQL exercises. Existing SQL plagiarism detection techniques either simply match the similarity of students' SQL submissions, or rely on simple differences in students' SQL coding habits. Neither makes good use of the rich coding features that students exhibit when writing SQL code, so neither achieves high-accuracy plagiarism detection. In view of this, this paper proposes an SQL plagiarism detection technique based on students' coding features, named SQL-Detector. First, starting from the characteristics of SQL, SQL-Detector extracts two kinds of features to profile each student: exercise coding features, which are specific to a given SQL exercise, and generalization coding features, which reflect the student's coding habits. Second, SQL-Detector identifies plagiarism groups by clustering the exercise coding features of all students. Finally, SQL-Detector distinguishes copiers from givers by checking the consistency between each student's generalization coding features on the exercise and his/her historical generalization coding features. An experimental evaluation on SQL exercise answer data collected from real classroom practice shows that the plagiarism detection accuracy of SQL-Detector is on average 14.0% higher than that of the state-of-the-art technique.

Key words: SQL exercises, plagiarism detection, coding habit, coding features, hierarchical clustering
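
The two detection steps named in the abstract (clustering exercise coding features to find plagiarism groups, then checking each student's habits against their own history) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the feature vectors, the single-linkage rule, the Euclidean distance, the two thresholds, and the names `cluster_submissions` and `split_roles` are all assumptions made for illustration, since the abstract does not specify them.

```python
from itertools import combinations
from math import dist


def cluster_submissions(features, threshold):
    """Single-linkage agglomerative clustering with a distance cut-off.

    features: dict mapping student id -> exercise coding feature vector
    (hypothetical numeric features for one SQL exercise).
    Returns a list of sets of student ids.
    """
    clusters = [{sid} for sid in features]
    while True:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(dist(features[x], features[y])
                    for x in clusters[a] for y in clusters[b])
            if d <= threshold and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            return clusters
        _, a, b = best  # a < b, so deleting b keeps index a valid
        clusters[a] |= clusters[b]
        del clusters[b]


def split_roles(group, current_habits, historical_habits, tolerance):
    """Label each student in a suspicious group as 'giver' or 'copier'.

    Assumed rule: a student whose habit features on this exercise drift
    from their own history by more than `tolerance` is a copier.
    """
    roles = {}
    for sid in sorted(group):
        drift = dist(current_habits[sid], historical_habits[sid])
        roles[sid] = "copier" if drift > tolerance else "giver"
    return roles


# Hypothetical exercise coding features for four submissions.
exercise_features = {
    "s1": (2.0, 1.0, 0.0),
    "s2": (2.0, 1.0, 0.1),
    "s3": (0.0, 3.0, 1.0),
    "s4": (1.0, 0.0, 2.0),
}
# Keep only clusters with more than one member as suspicious groups.
groups = [c for c in cluster_submissions(exercise_features, 0.5) if len(c) > 1]

# Hypothetical habit (generalization) features: this exercise vs. history.
current_habits = {"s1": (1.0, 0.0), "s2": (1.0, 0.0)}
historical_habits = {"s1": (0.9, 0.1), "s2": (0.1, 1.0)}
roles = split_roles(groups[0], current_habits, historical_habits, 0.5)
print(groups, roles)
```

In this toy data, `s1` and `s2` submit nearly identical answers and form one group. `s2`'s habits on this exercise differ sharply from `s2`'s own history, so `s2` is labeled the copier and `s1` the giver.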

CLC Number: