计算机科学与探索 ›› 2012, Vol. 6 ›› Issue (11): 1007-1018.DOI: 10.3778/j.issn.1673-9418.2012.11.005

• 学术研究 • 上一篇    下一篇

基于联合聚类的超立方体高维索引

刘英帆+,崔江涛   

  1. 西安电子科技大学 计算机学院,西安 710071
  • 出版日期:2012-11-01 发布日期:2012-11-02

Hypercube-Based High-Dimensional Index Using Co-clustering

LIU Yingfan+, CUI Jiangtao   

  1. School of Computer, Xidian University, Xi’an 710071, China
  • Online:2012-11-01 Published:2012-11-02

摘要: 高维数据集合的最近邻查询性能会受到“维数灾难”(curse of dimensionality)现象的影响。提出了一种基于联合聚类的HC2(hypercube on co-clustering)高维索引结构。首先通过联合聚类算法同时降低数据尺寸和维数,将高维数据集合聚成若干较低维数的类,然后采用超立方体结构对每个类进行空间区域描述。在基于“过滤-精炼”(filter and refine)的查询过程中,计算查询点与各个类之间的距离下界,实现对聚类的有效过滤。为了提高距离下界对真实距离的逼近能力,采用了一种基于统计优化的超立方体区域描述方法SOHC2(statistically optimized hypercube on co-clustering),能够更加有效地缩小搜索空间,提高查询性能。理论分析和实验结果都表明,SOHC2的查询性能明显优于其他索引方法,适合大规模高维数据的查询;与同类索引结构相比,查询速度能够提高3倍以上。

关键词: 高维索引, 过滤-精炼, 联合聚类, 超立方体

Abstract: The performance of nearest neighbor search in high-dimensional dataset will succumb to the well-known “curse of dimensionality”. This paper proposes a novel hypercube on co-clustering (HC2) index for high-dimensional query. By using the co-clustering methods, both size and dimensionality of the original dataset can be reduced simultaneously, and some low-dimensional clusters can be obtained. Each cluster is described by a bounded hypercube, and lower bounds of the actual distances between the query point and clusters can be efficiently established to achieve fast and lossless similarity search with the filter-and-refine approach. To achieve a tighter lower bounds, the paper investigates a statistically optimal description of hypercube, SOHC2 (statistically optimized hypercube on co-clustering), which generates the least number of candidates for actual distance computations in the sense of statistics. Experimental results show that SOHC2 is up to 3 times faster than the other index structures based on co-clustering, and it also offers significant performance advantages over other existing methods.

Key words: high-dimensional index, filter and refine, co-clustering, hypercube