Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (7): 1852-1864. DOI: 10.3778/j.issn.1673-9418.2305025

• Artificial Intelligence · Pattern Recognition •

Deep Unsupervised Fusion Feature Learning Method for Mixed Attribute Data

HE Huixia, WU Sen, WEI Guiying, XIE Jiayao, GAO Xiaonan   

  1. School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China
  2. State Grid Energy Research Institute Co., Ltd., Beijing 102209, China
  • Online: 2024-07-01 Published: 2024-06-28

Abstract: High-quality feature representation is key to accurate data mining. Existing feature learning methods struggle to capture the associations between different attributes and the true underlying information in mixed-attribute data. To address this problem, a deep unsupervised fusion feature learning model for mixed-attribute data (DUFERM) is proposed. The model establishes a bimodal autoencoder framework that models categorical and numerical attributes along separate paths and uses a deep multimodal fusion strategy to strengthen the connection between the two attribute types. For categorical attributes, a discrete-feature autoencoder based on a weighted heterogeneous network is constructed to fully exploit their internal structural and semantic information; for numerical attributes, a continuous-feature autoencoder is constructed. The two independent autoencoders are combined as a joint representation in a common latent representation layer. Finally, the fused feature representation of the mixed-attribute data is obtained through unsupervised training that combines pre-training and joint training. Extensive experiments on 10 public datasets show that the overall performance of DUFERM across the evaluation metrics is better than that of existing classical and recent feature learning methods for mixed-attribute data. The model can fully extract the latent features within mixed-attribute data, yield high-quality fused feature representations, and improve the accuracy of downstream data mining tasks.

Key words: mixed attribute data, fusion feature learning, unsupervised, data mining
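
The data flow outlined in the abstract (separate encoding paths for categorical and numerical attributes, joined in a common latent layer and trained on reconstruction error) can be illustrated with a rough sketch. This is not the authors' DUFERM implementation: the class names `Linear` and `BimodalAutoencoder`, the helper functions, the toy records, the layer sizes, the one-hot and min-max preprocessing, and the squared-error loss are all assumptions made for illustration. The weighted heterogeneous network and the pre-training stage are omitted; only a joint training loop is shown.

```python
import math
import random


class Linear:
    """Dense layer y = W x + b trained with plain SGD (pure Python)."""

    def __init__(self, n_in, n_out, rng):
        s = 1.0 / math.sqrt(n_in)
        self.W = [[rng.uniform(-s, s) for _ in range(n_in)] for _ in range(n_out)]
        self.b = [0.0] * n_out

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.W, self.b)]

    def backward(self, grad_out, lr):
        # Gradient w.r.t. the input, computed before the weights are updated.
        grad_in = [sum(self.W[j][i] * grad_out[j] for j in range(len(grad_out)))
                   for i in range(len(self.x))]
        for j, g in enumerate(grad_out):
            for i, v in enumerate(self.x):
                self.W[j][i] -= lr * g * v
            self.b[j] -= lr * g
        return grad_in


class BimodalAutoencoder:
    """Two encoder paths fused into one latent layer, with two decoders."""

    def __init__(self, d_cat, d_num, hidden=4, latent=3, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.enc_cat = Linear(d_cat, hidden, rng)      # categorical path
        self.enc_num = Linear(d_num, hidden, rng)      # numerical path
        self.fuse = Linear(2 * hidden, latent, rng)    # common latent layer
        self.dec_cat = Linear(latent, d_cat, rng)
        self.dec_num = Linear(latent, d_num, rng)

    def encode(self, x_cat, x_num):
        # Concatenate both hidden codes, then project to the joint representation.
        pre = self.enc_cat.forward(x_cat) + self.enc_num.forward(x_num)
        self.h = [math.tanh(v) for v in pre]
        return self.fuse.forward(self.h)

    def train_step(self, x_cat, x_num, lr=0.05):
        z = self.encode(x_cat, x_num)
        e_cat = [r - x for r, x in zip(self.dec_cat.forward(z), x_cat)]
        e_num = [r - x for r, x in zip(self.dec_num.forward(z), x_num)]
        loss = sum(e * e for e in e_cat + e_num)  # assumed squared-error loss
        g_z = [a + b for a, b in zip(
            self.dec_cat.backward([2 * e for e in e_cat], lr),
            self.dec_num.backward([2 * e for e in e_num], lr))]
        g_h = self.fuse.backward(g_z, lr)
        g_pre = [g * (1 - h * h) for g, h in zip(g_h, self.h)]  # tanh derivative
        self.enc_cat.backward(g_pre[:self.hidden], lr)
        self.enc_num.backward(g_pre[self.hidden:], lr)
        return loss


def one_hot(records, columns):
    """One-hot encode the given categorical columns."""
    vocab = [(c, v) for c in columns for v in sorted({r[c] for r in records})]
    return [[1.0 if r[c] == v else 0.0 for c, v in vocab] for r in records]


def min_max(records, columns):
    """Scale the given numerical columns to [0, 1]."""
    lo = {c: min(r[c] for r in records) for c in columns}
    hi = {c: max(r[c] for r in records) for c in columns}
    return [[(r[c] - lo[c]) / ((hi[c] - lo[c]) or 1.0) for c in columns]
            for r in records]


# Hypothetical toy records with two categorical and two numerical attributes.
records = [
    {"color": "red", "shape": "round", "weight": 1.0, "size": 2.0},
    {"color": "red", "shape": "square", "weight": 1.2, "size": 2.1},
    {"color": "blue", "shape": "round", "weight": 3.0, "size": 0.5},
    {"color": "blue", "shape": "square", "weight": 3.1, "size": 0.4},
]
X_cat = one_hot(records, ["color", "shape"])
X_num = min_max(records, ["weight", "size"])

model = BimodalAutoencoder(d_cat=len(X_cat[0]), d_num=len(X_num[0]))
losses = [sum(model.train_step(c, n) for c, n in zip(X_cat, X_num))
          for _ in range(200)]
features = [model.encode(c, n) for c, n in zip(X_cat, X_num)]
```

After training, `features` holds one fused vector per record, which could be passed to a downstream task such as clustering. A realistic model would likely use a categorical-specific loss and a deep learning framework rather than hand-written gradients.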