Journal of Frontiers of Computer Science and Technology ›› 2019, Vol. 13 ›› Issue (7): 1165-1173. DOI: 10.3778/j.issn.1673-9418.1806021

• Artificial Intelligence •

Categorical Data Classification Approach Based on Space Correlation Analysis

FU Kang'an (付康安)1, WANG Wenjian (王文剑)2+, GUO Husheng (郭虎升)1

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
  • Online: 2019-07-01 Published: 2019-07-08

Abstract: To address the low classification accuracy currently obtained on categorical data, this paper proposes a classification approach based on space correlation analysis, which mines the spatial structure that may exist between attribute values and labels. The approach first applies one-hot encoding to expand the feature dimensions of the categorical data. It then defines a spatial representation of categorical data based on two information measures, mutual information and conditional entropy. On this basis, the representation is combined with support vector machine (SVM) and K-nearest neighbor (KNN) classifiers, yielding two algorithms: SCA_SVM (SVM classification algorithm based on space correlation analysis) and SCA_KNN (KNN classification algorithm based on space correlation analysis). The approach both reflects the correlation between attribute values and labels and effectively measures the distance, or dissimilarity, between different attribute values. Experimental results on standard UCI datasets show that the proposed approach achieves better classification performance.

Key words: categorical data, classification, space correlation analysis, support vector machine (SVM), K-nearest neighbor (KNN)
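
The pipeline outlined in the abstract (one-hot encoding, an information-based weighting of each attribute value, then a distance-based classifier) can be illustrated with a minimal sketch. The abstract does not give the paper's exact spatial representation, so the weighting below is an assumption: each one-hot column is scaled by its mutual information with the label, and the conditional entropy term is omitted. The function names (`one_hot`, `mutual_information`, `encode`, `knn_predict`) and the toy data are hypothetical, and the plain KNN stands in for SCA_KNN only as an illustration.

```python
import math
from collections import Counter


def one_hot(rows):
    """Assign one binary column to every (attribute index, value) pair."""
    columns = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    index = {col: k for k, col in enumerate(columns)}
    encoded = []
    for row in rows:
        vec = [0.0] * len(columns)
        for j, v in enumerate(row):
            vec[index[(j, v)]] = 1.0
        encoded.append(vec)
    return encoded, index


def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from empirical frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def encode(row, index, weights):
    """Weighted one-hot vector; attribute values unseen in training are ignored."""
    vec = [0.0] * len(weights)
    for j, v in enumerate(row):
        k = index.get((j, v))
        if k is not None:
            vec[k] = weights[k]
    return vec


def knn_predict(train_vecs, train_labels, query, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    nearest = sorted(zip(train_vecs, train_labels),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data (hypothetical): attribute 0 determines the label, attribute 1 is mostly noise.
rows = [("red", "s"), ("red", "l"), ("blue", "s"),
        ("blue", "l"), ("red", "s"), ("blue", "l")]
labels = ["A", "A", "B", "B", "A", "B"]

encoded, index = one_hot(rows)
# Assumed weighting: mutual information between each one-hot column and the label.
weights = [mutual_information([vec[k] for vec in encoded], labels)
           for k in range(len(index))]
train_vecs = [encode(row, index, weights) for row in rows]

prediction = knn_predict(train_vecs, labels, encode(("red", "l"), index, weights))
print(weights, prediction)
```

Under this assumed weighting, columns whose values are strongly associated with the label dominate the distance, while weakly associated columns contribute little, which is one simple way to make distances between attribute values reflect their relationship to the labels.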