Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (1): 108-115. DOI: 10.3778/j.issn.1673-9418.2104114

• Theory · Algorithm •


Improved Fuzzy Rank Mutual Information for Monotonic Feature Selection

PI Hong, LUO Chuan, LI Tianrui, CHEN Hongmei   

  1. College of Computer Science, Sichuan University, Chengdu 610065, China
    2. School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China
  • Online: 2023-01-01  Published: 2023-01-01


Abstract: The prevalence of low-quality information, such as missing values and outliers, means that monotonic classification tasks in real applications usually fail to satisfy the consistency monotonicity constraint. However, in existing feature selection algorithms for monotonic classification, the mutual information measure used to evaluate feature relevance does not satisfy the monotonicity restriction between feature sets and the measure when the monotonic classification task is inconsistent. To address this problem, this paper first improves the classical fuzzy rank information entropy so that it satisfies the monotonicity restriction. A novel fuzzy rank mutual information (nFRMI) is then defined to evaluate the monotonic consistency between features, and the measure is proven to satisfy the monotonicity constraint in both consistent and inconsistent monotonic classification tasks. In addition, a two-stage feature selection algorithm is proposed by combining nFRMI with the minimal-redundancy-maximal-relevance (mRMR) criterion. In the first stage, the algorithm computes the significance of each feature with nFRMI and ranks the features accordingly; in the second stage, a wrapper method selects the feature subset with the best classification performance. Simulation experiments on four monotonic classification datasets from the UCI repository verify the effectiveness of the proposed algorithm in terms of classification accuracy.
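The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names `mrmr_rank` and `wrapper_select` are assumptions, and the `relevance`/`redundancy` scores used here are placeholders standing in for the nFRMI-based feature significance defined in the paper.

```python
def mrmr_rank(features, relevance, redundancy):
    """Stage 1: greedily rank features by max-relevance, min-redundancy.

    `relevance[f]` plays the role of the nFRMI significance of feature f;
    `redundancy[(f, s)]` the nFRMI between a candidate f and a selected s.
    """
    ranked, remaining = [], list(features)
    while remaining:
        best = max(
            remaining,
            key=lambda f: relevance[f]
            - (sum(redundancy[(f, s)] for s in ranked) / len(ranked)
               if ranked else 0.0),
        )
        ranked.append(best)
        remaining.remove(best)
    return ranked


def wrapper_select(ranked, evaluate):
    """Stage 2: grow a subset along the ranking; keep the best-scoring prefix.

    `evaluate` is any classifier-accuracy callback (the wrapper criterion).
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

A toy run: with hypothetical scores where feature "b" is relevant but highly redundant with "a", stage 1 ranks it last, and stage 2 stops at the prefix whose (simulated) accuracy peaks.

```python
features = ["a", "b", "c"]
relevance = {"a": 0.9, "b": 0.8, "c": 0.5}
redundancy = {("b", "a"): 0.7, ("a", "b"): 0.7,
              ("c", "a"): 0.1, ("a", "c"): 0.1,
              ("c", "b"): 0.1, ("b", "c"): 0.1}

ranked = mrmr_rank(features, relevance, redundancy)
subset, score = wrapper_select(ranked, lambda s: {1: 0.6, 2: 0.8, 3: 0.75}[len(s)])
```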

Key words: feature selection, monotonic classification, inconsistent, fuzzy information entropy, rank mutual information