Journal of Frontiers of Computer Science and Technology ›› 2010, Vol. 4 ›› Issue (9): 769-779. DOI: 10.3778/j.issn.1673-9418.2010.09.001

• Academic Research •

Nearest Neighbor Algorithm for Positive and Unlabeled Learning with Uncertainty

PAN Shirui1, ZHANG Yang1,2+, LI Xue3, WANG Yong4   

  1. College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China
  2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
  3. School of Information Technology and Electrical Engineering, University of Queensland, Brisbane 4072, Australia
  4. School of Computer, Northwestern Polytechnical University, Xi’an 710072, China
  • Received:1900-01-01 Revised:1900-01-01 Online:2010-09-09 Published:2010-09-09
  • Contact: ZHANG Yang

Abstract: This paper studies the classification of uncertain data in the positive and unlabeled (PU) learning
scenario. It proposes a novel algorithm, NNPU (nearest neighbor algorithm for positive and unlabeled learning), which
comes in two variants, NNPUa and NNPUu. Experimental results on benchmark UCI datasets show
that NNPUu, which uses the full uncertainty information in the data, classifies unseen
examples better than NNPUa, which uses only the average value of the uncertain information. Furthermore, when
classifying precise data, NNPU outperforms existing algorithms such as NN-d, OCC (one-class classifier), and aPUNB.

Key words: uncertain data, positive and unlabeled learning, nearest neighbor algorithm
