Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (9): 2030-2040.DOI: 10.3778/j.issn.1673-9418.2103011
• Database Technology • Previous Articles Next Articles
XU Jia1,2,3, MO Xiaokun1, YU Ge4, LYU Pin1,2,3,+(), WEI Tingting1
Received:
2021-03-03
Revised:
2021-07-28
Online:
2022-09-01
Published:
2021-08-06
About author:
XU Jia, born in 1984, Ph.D., associate professor, M.S. supervisor, senior member of CCF, member of CCF Database Committee. Her research interests include database theory and technology, educational data analysis and mining, etc.Supported by:
许嘉1,2,3, 莫晓琨1, 于戈4, 吕品1,2,3,+(), 韦婷婷1
通讯作者:
+ E-mail: lvpin@gxu.edu.cn作者简介:
许嘉(1984—),女,山东荣成人,博士,副教授,硕士生导师,CCF高级会员,CCF数据库专委会委员,主要研究方向为数据库理论与技术、教育数据分析挖掘等。基金资助:
CLC Number:
XU Jia, MO Xiaokun, YU Ge, LYU Pin, WEI Tingting. SQL-Detector: SQL Plagiarism Detection Technique Based on Coding Features[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(9): 2030-2040.
许嘉, 莫晓琨, 于戈, 吕品, 韦婷婷. SQL-Detector:基于编码特征的SQL习题抄袭检测技术[J]. 计算机科学与探索, 2022, 16(9): 2030-2040.
学生 | SQL代码 |
---|---|
学生A | select * from student where dept_name = 'Geology' and tot_cred > 50 |
学生B | Select *From student Where dept_name = 'Geology' And tot_cred > 50 |
学生C | SELECT * FROM student WHERE(dept_name = 'Geology' AND tot_cred > 50) |
Table 1 Students show different SQL coding features
学生 | SQL代码 |
---|---|
学生A | select * from student where dept_name = 'Geology' and tot_cred > 50 |
学生B | Select *From student Where dept_name = 'Geology' And tot_cred > 50 |
学生C | SELECT * FROM student WHERE(dept_name = 'Geology' AND tot_cred > 50) |
子句类型 | 关键字 | SCHEMA | 函数 | 关系 运算符 | 括号 | 空格 | 缩进 | 换行 |
---|---|---|---|---|---|---|---|---|
SELECT | √ | √ | √ | × | √ | √ | √ | √ |
FROM | √ | √ | × | × | √ | √ | √ | √ |
WHERE | √ | √ | × | √ | √ | √ | √ | √ |
JOIN | √ | √ | × | √ | √ | √ | √ | √ |
GROUP BY | √ | √ | × | √ | √ | √ | √ | √ |
HAVING | √ | √ | √ | √ | √ | √ | √ | √ |
ORDER BY | √ | √ | × | × | √ | √ | √ | √ |
Table 2 Relationships between clause types and exercise coding features
子句类型 | 关键字 | SCHEMA | 函数 | 关系 运算符 | 括号 | 空格 | 缩进 | 换行 |
---|---|---|---|---|---|---|---|---|
SELECT | √ | √ | √ | × | √ | √ | √ | √ |
FROM | √ | √ | × | × | √ | √ | √ | √ |
WHERE | √ | √ | × | √ | √ | √ | √ | √ |
JOIN | √ | √ | × | √ | √ | √ | √ | √ |
GROUP BY | √ | √ | × | √ | √ | √ | √ | √ |
HAVING | √ | √ | √ | √ | √ | √ | √ | √ |
ORDER BY | √ | √ | × | × | √ | √ | √ | √ |
题号 | 答案数量 | 题号 | 答案数量 |
---|---|---|---|
1 | 269 | 7 | 266 |
2 | 259 | 8 | 262 |
3 | 268 | 9 | 266 |
4 | 264 | 10 | 264 |
5 | 261 | 11 | 263 |
6 | 266 | 12 | 259 |
Table 3 Details of experimental dataset
题号 | 答案数量 | 题号 | 答案数量 |
---|---|---|---|
1 | 269 | 7 | 266 |
2 | 259 | 8 | 262 |
3 | 268 | 9 | 266 |
4 | 264 | 10 | 264 |
5 | 261 | 11 | 263 |
6 | 266 | 12 | 259 |
题号 | 不同相似度阈值下的F1值/% | |||||
---|---|---|---|---|---|---|
0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 | |
1 | 37.5 | 55.0 | 68.9 | 80.4 | 79.9 | 76.5 |
2 | 30.3 | 50.6 | 67.2 | 81.3 | 79.2 | 76.2 |
3 | 36.7 | 50.9 | 65.2 | 84.5 | 82.5 | 79.8 |
4 | 39.8 | 50.5 | 69.4 | 82.1 | 79.9 | 80.4 |
5 | 37.2 | 50.4 | 67.6 | 85.3 | 82.8 | 82.5 |
6 | 29.7 | 51.1 | 69.5 | 83.9 | 83.1 | 78.4 |
7 | 31.4 | 51.6 | 66.6 | 80.8 | 79.7 | 79.0 |
8 | 37.6 | 53.9 | 68.0 | 85.2 | 84.3 | 82.2 |
9 | 35.5 | 53.2 | 67.8 | 86.6 | 83.9 | 70.0 |
10 | 39.7 | 50.7 | 67.4 | 83.7 | 81.8 | 77.7 |
11 | 32.3 | 52.8 | 67.1 | 82.1 | 82.6 | 79.3 |
12 | 34.2 | 50.7 | 68.6 | 82.8 | 83.2 | 80.2 |
Table 4 Impact of similarity threshold in hierarchical clustering on accuracy of plagiarism detection
题号 | 不同相似度阈值下的F1值/% | |||||
---|---|---|---|---|---|---|
0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 | |
1 | 37.5 | 55.0 | 68.9 | 80.4 | 79.9 | 76.5 |
2 | 30.3 | 50.6 | 67.2 | 81.3 | 79.2 | 76.2 |
3 | 36.7 | 50.9 | 65.2 | 84.5 | 82.5 | 79.8 |
4 | 39.8 | 50.5 | 69.4 | 82.1 | 79.9 | 80.4 |
5 | 37.2 | 50.4 | 67.6 | 85.3 | 82.8 | 82.5 |
6 | 29.7 | 51.1 | 69.5 | 83.9 | 83.1 | 78.4 |
7 | 31.4 | 51.6 | 66.6 | 80.8 | 79.7 | 79.0 |
8 | 37.6 | 53.9 | 68.0 | 85.2 | 84.3 | 82.2 |
9 | 35.5 | 53.2 | 67.8 | 86.6 | 83.9 | 70.0 |
10 | 39.7 | 50.7 | 67.4 | 83.7 | 81.8 | 77.7 |
11 | 32.3 | 52.8 | 67.1 | 82.1 | 82.6 | 79.3 |
12 | 34.2 | 50.7 | 68.6 | 82.8 | 83.2 | 80.2 |
题号 | SQL-Detector | 文献[4] | JPlag[ | ||||||
---|---|---|---|---|---|---|---|---|---|
查准率/% | 查全率/% | F1值/% | 查准率/% | 查全率/% | F1值/% | 查准率/% | 查全率/% | F1值/% | |
1 | 78.61 | 82.31 | 80.42 | 67.65 | 73.34 | 70.38 | 57.65 | 43.34 | 49.48 |
2 | 81.04 | 81.58 | 81.31 | 74.48 | 71.44 | 72.93 | 44.48 | 41.44 | 42.91 |
3 | 80.92 | 88.46 | 84.52 | 67.65 | 72.67 | 70.07 | 47.65 | 48.67 | 48.15 |
4 | 77.91 | 86.79 | 82.11 | 66.00 | 73.33 | 69.47 | 40.00 | 43.33 | 41.60 |
5 | 83.21 | 87.42 | 85.26 | 69.50 | 76.67 | 72.91 | 39.50 | 36.67 | 38.03 |
6 | 82.19 | 85.68 | 83.90 | 72.50 | 75.26 | 73.86 | 42.50 | 35.26 | 38.54 |
7 | 79.92 | 81.60 | 80.75 | 66.67 | 78.57 | 72.13 | 46.67 | 38.57 | 42.23 |
8 | 81.62 | 89.11 | 85.20 | 72.50 | 69.69 | 71.07 | 52.50 | 29.69 | 37.93 |
9 | 86.70 | 86.52 | 86.61 | 70.00 | 73.33 | 71.63 | 50.00 | 43.33 | 46.43 |
10 | 82.95 | 84.43 | 83.68 | 77.50 | 78.57 | 78.03 | 47.50 | 38.57 | 42.57 |
11 | 81.51 | 82.68 | 82.09 | 70.00 | 78.00 | 73.78 | 40.00 | 30.00 | 34.29 |
12 | 82.03 | 83.67 | 82.84 | 78.81 | 80.67 | 79.73 | 44.81 | 36.67 | 40.33 |
Table 5 Accuracy analysis of SQL plagiarism detection
题号 | SQL-Detector | 文献[4] | JPlag[ | ||||||
---|---|---|---|---|---|---|---|---|---|
查准率/% | 查全率/% | F1值/% | 查准率/% | 查全率/% | F1值/% | 查准率/% | 查全率/% | F1值/% | |
1 | 78.61 | 82.31 | 80.42 | 67.65 | 73.34 | 70.38 | 57.65 | 43.34 | 49.48 |
2 | 81.04 | 81.58 | 81.31 | 74.48 | 71.44 | 72.93 | 44.48 | 41.44 | 42.91 |
3 | 80.92 | 88.46 | 84.52 | 67.65 | 72.67 | 70.07 | 47.65 | 48.67 | 48.15 |
4 | 77.91 | 86.79 | 82.11 | 66.00 | 73.33 | 69.47 | 40.00 | 43.33 | 41.60 |
5 | 83.21 | 87.42 | 85.26 | 69.50 | 76.67 | 72.91 | 39.50 | 36.67 | 38.03 |
6 | 82.19 | 85.68 | 83.90 | 72.50 | 75.26 | 73.86 | 42.50 | 35.26 | 38.54 |
7 | 79.92 | 81.60 | 80.75 | 66.67 | 78.57 | 72.13 | 46.67 | 38.57 | 42.23 |
8 | 81.62 | 89.11 | 85.20 | 72.50 | 69.69 | 71.07 | 52.50 | 29.69 | 37.93 |
9 | 86.70 | 86.52 | 86.61 | 70.00 | 73.33 | 71.63 | 50.00 | 43.33 | 46.43 |
10 | 82.95 | 84.43 | 83.68 | 77.50 | 78.57 | 78.03 | 47.50 | 38.57 | 42.57 |
11 | 81.51 | 82.68 | 82.09 | 70.00 | 78.00 | 73.78 | 40.00 | 30.00 | 34.29 |
12 | 82.03 | 83.67 | 82.84 | 78.81 | 80.67 | 79.73 | 44.81 | 36.67 | 40.33 |
[1] | 窦亚玲. 多元化模式下数据库SQL教学研究与实践[J]. 软件导刊(教育技术), 2019, 18(2): 46-48. |
DOU Y L. Research and practice of database SQL teaching under pluralistic mode[J]. Software Guide (Educational Technology), 2019, 18(2): 46-48. | |
[2] | SCERBAKOV N, SCHUKIN A, SABININ O. Plagiarism detection in SQL student assignments[C]// Proceedings of the 2017 International Conference on Interactive Collaborative Learning, Budapest, Sep 27-29, 2017. Cham: Springer, 2017: 110-115. |
[3] | KLEEREKOPER A, SCHOFIELD A. The false-positive rate of automated plagiarism detection for SQL assessments[C]// Proceedings of the 1st UK & Ireland Computing Education Research Conference, Canterbury, Sep 5-6, 2019. New York: ACM, 2019: 1-6. |
[4] | 葛文馨, 魏永山. 一种基于编码习惯的SQL语句抄袭检测算法[J]. 软件, 2019, 40(11): 4-8. |
GE W X, WEI Y S. A coding behavior-based plagiarism detection for SQL statement[J]. Computer Engineering & Software, 2019, 40(11): 4-8. | |
[5] | HALSTEAD M H. Elements of software science[M]. New York: Elsevier Science Inc., 1977. |
[6] | DONALDSON J L, LANCASTER A M, SPOSATO P H. A plagiarism detection system[C]// Proceedings of the 12th SIGCSE Technical Symposium on Computer Science Education, St. Louis, Feb 26-27, 1981. New York: ACM, 1981: 21-25. |
[7] | GRIER S L. A tool that detects plagiarism in pascal programs[J]. ACM SIGCSE Bulletin, 1981, 13(1): 15-20. |
[8] | PRECHELT L, MALPOHL G, PHILIPPSEN M. Finding plagiarisms among a set of programs with JPlag[J]. Journal of Universal Computer Science, 2000, 8(11): 1016-1038. |
[9] | SCHLEIMER S, WILKERSON D S, AILEN A. Winnowing: local algorithms for document fingerprinting[C]// Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, Jun 9-12, 2003. New York: ACM, 2003: 76-85. |
[10] | 段旭良, 杨洋, 王曼韬, 等. 一种扩展Winnowing剽窃检测算法[J]. 计算机工程与科学, 2017, 39(12): 2245-2251. |
DUAN X L, YANG Y, WANG M T, et al. An extended Winnowing plagiarism detection algorithm[J]. Computer Engineering and Science, 2017, 39(12): 2245-2251. | |
[11] | WISE M J. YAP3: improved detection of similarities in computer program and other texts[J]. ACM SIGCSE Bulletin, 1996, 28(1): 130-134. |
[12] | 殷丹平. 基于CNN的代码相似度检测研究与代码查重系统[D]. 北京: 北京邮电大学, 2018. |
YIN D P. Research on code similarity detection and code duplicate checking system based on CNN[D]. Beijing: Beijing University of Posts and Telecommunications, 2018. | |
[13] | 王颖. 面向相似代码分析的多元化特征提取技术[D]. 北京: 北京邮电大学, 2018. |
WANG Y. Multi feature extraction technology for similar code analysis[D]. Beijing: Beijing University of Posts and Telecommunications, 2018. | |
[14] | KIM Y C, CHO Y Y, MOON J B. A plagiarism detection system using a syntax-tree[C]// Proceedings of the 2004 International Conference on Computational Intelligence, Istanbul, Dec 17-19, 2004: 23-26. |
[15] | JIANG L X, MISHERGHI G, SU Z D, et al. DECKARD: scalable and accurate tree-based detection of code clones[C]// Proceedings of the 29th International Conference on Software Engineering, Minneapolis, May 20-26, 2007. Washington: IEEE Computer Society, 2007: 96-105. |
[16] | ARABYARMOHAMADY S, MORADI H, ASADPOUR M. A coding style-based plagiarism detection[C]// Proceedings of the 2012 International Conference on Interactive Mobile and Computer Aided Learning, Amman, Nov 6-8, 2012. Piscataway: IEEE, 2013: 180-186. |
[17] |
LJUBOVIC V, PAJI E. Plagiarism detection in computer programming using feature extraction from ultra-fine-grained repositories[J]. IEEE Access, 2020, 8: 96505-96514.
DOI URL |
[18] | 赵彦博. 基于抽象语法树的程序代码抄袭检测技术研究[D]. 呼和浩特: 内蒙古师范大学, 2010. |
ZHAO Y B. Research on program code plagiarism detection technology based on abstract syntax tree[D]. Hohhot: Inner Mongolia Normal University, 2010. | |
[19] | 卫军超. 在线评测系统中代码剽窃检测技术的研究与实现[D]. 咸阳: 西北农林科技大学, 2017. |
WEI J C. Research and implementation of code plagiarism detection technology in online evaluation system[D]. Xianyang: Northwest A&F University, 2017. | |
[20] | 石野, 黄龙和, 车天阳, 等. 基于语法树的程序相似度判定方法[J]. 吉林大学学报, 2014, 32(1): 95-100. |
SHI Y, HUANG L H, CHE T Y, et al. Program similarity determination method based on syntax tree[J]. Journal of Jilin University, 2014, 32(1): 95-100. | |
[21] | RUSSELL G, CUMMING A. Online assessment and checking of SQL: detecting and preventing plagiarism[C]// Proceedings of the 22nd British National Conference on Databases, Sunderland, Jul 2005. Berlin, Heidelberg: Springer, 2005: 1-5. |
[22] |
MURTAGH F. A survey of recent advances in hierarchical clustering algorithms[J]. The Computer Journal, 1983, 26(4): 354-359.
DOI URL |
[23] |
WARD J, JOE H. Hierarchical grouping to optimize an objective function[J]. Journal of the American Statistical Association, 1963, 58(301): 236-244.
DOI URL |
[24] | 张文彤. SPSS 统计分析高级教程[M]. 北京: 高等教育出版社, 2011. |
ZHANG W T. SPSS advanced course of statistical analysis[M]. Beijing: Higher Education Press, 2011. |
[1] | XU Xudong, ZHANG Zhixiang, ZHANG Xian. Message Clustering Method for Private Binary Protocol [J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(6): 958-965. |
[2] | WANG Xiaodong, ZHAO Yining, XIAO Haili, WANG Xiaoning, CHI Xuebin. Research on Anomaly Detection System of Online Multi-node Log Flow [J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(11): 1828-1837. |
[3] | LIU Chao, XU Yabin, WU Zhuang. Method for Rapid Detecting Micro-Blog Communities [J]. Journal of Frontiers of Computer Science and Technology, 2015, 9(9): 1100-1107. |
[4] | ZHANG Yonghui, LI Chuan, TANG Changjie, LI Yanmei. Information Networks Community Trend Prediction Based on Structure Analysis [J]. Journal of Frontiers of Computer Science and Technology, 2015, 9(4): 403-409. |
[5] | JING Liping, ZHU Yan, YU Jian. Hierarchical Non-Negative Matrix Factorization for Text Clustering [J]. Journal of Frontiers of Computer Science and Technology, 2011, 5(10): 904-913. |
[6] | QIAO Shaojie1,2,TANG Changjie1+,CHEN Yu1,PENG Jing3,WEN Fenlian1. A new hierarchical clustering algorithm based on tree edit distance [J]. Journal of Frontiers of Computer Science and Technology, 2007, 1(3): 282-292. |
Viewed | ||||||||||||||||||||||||||||||||||||||||||||||||||
Full text 148
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||
Abstract 301
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||
/D:/magtech/JO/Jwk3_kxyts/WEB-INF/classes/