Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2011, Vol. 5, Issue (10): 904-913.

• Academic Research •

Hierarchical Non-Negative Matrix Factorization for Text Clustering

JING Liping, ZHU Yan, YU Jian   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Online: 2011-10-01 Published: 2011-10-01

Abstract: The goal of text clustering is to group documents with similar content into the same cluster while separating documents with different content. Many clustering algorithms have been proposed to meet the requirements of different domains. However, text clustering remains an open problem because of the inherent characteristics of text data: large volume, high dimensionality, and sparsity. This paper proposes a clustering method based on hierarchical non-negative matrix factorization. The method retains the merits of the original non-negative matrix factorization, namely simultaneously clustering documents and identifying the key features of each cluster, and it also reveals the hierarchical structure among clusters. Such a structure is useful in real applications such as news browsing. Experimental results on two real data sets, 20Newsgroups and Reuters-RCV1, show that the proposed method performs better than existing popular methods.

Key words: text clustering, non-negative matrix factorization, hierarchical clustering
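
The abstract describes non-negative matrix factorization as clustering documents and identifying the key features of each cluster at the same time. The sketch below illustrates that idea on a toy term-document matrix. It is not the authors' implementation: the matrix `V`, the vocabulary, and all function names are hypothetical, and the factorization uses the standard multiplicative updates for the Frobenius loss. The abstract does not say how the hierarchy is built, so the `bisect` helper (recursive splitting with rank-2 factorizations) is only one simple, assumed way to obtain a cluster tree.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Factor V (terms x docs) into non-negative W (terms x k) and H (k x docs)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(len(V))]
    H = [[rng.random() + 0.1 for _ in range(len(V[0]))] for _ in range(k)]
    for _ in range(iters):
        # Multiplicative update for H: H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # Multiplicative update for W: W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

def cluster_docs(H):
    """Assign each document to the component with the largest weight in its column of H."""
    return [max(range(len(H)), key=lambda c: H[c][j]) for j in range(len(H[0]))]

def top_terms(W, vocab, n=2):
    """For each component, list the n terms with the largest weight in its column of W."""
    return [[vocab[i] for i in sorted(range(len(W)), key=lambda i: -W[i][c])[:n]]
            for c in range(len(W[0]))]

def bisect(V, doc_ids, depth=0, max_depth=2, min_size=3):
    """Hypothetical hierarchy builder: split documents with rank-2 NMF, then recurse."""
    node = {"docs": doc_ids, "children": []}
    if depth >= max_depth or len(doc_ids) < min_size:
        return node
    sub = [[row[j] for j in doc_ids] for row in V]  # keep only this node's documents
    _, H = nmf(sub, 2)
    labels = cluster_docs(H)
    groups = [[d for d, lab in zip(doc_ids, labels) if lab == c] for c in (0, 1)]
    if all(groups):  # recurse only when the split is non-trivial
        node["children"] = [bisect(V, g, depth + 1, max_depth, min_size) for g in groups]
    return node

# Toy data (assumed): rows are terms, columns are four documents on two topics.
vocab = ["goal", "match", "team", "stock", "market", "bank"]
V = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 3, 2],
     [0, 0, 2, 3],
     [0, 0, 1, 1]]

W, H = nmf(V, 2)
labels = cluster_docs(H)
print("cluster label per document:", labels)
print("top terms per cluster:", top_terms(W, vocab))
print("cluster tree:", bisect(V, [0, 1, 2, 3]))
```

The code is written in plain Python so that it runs without dependencies. On real collections such as 20Newsgroups or Reuters-RCV1, a sparse-matrix library and a tested NMF implementation would be the more practical choice.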