Hierarchical Non-Negative Matrix Factorization for Text Clustering

Journal of Frontiers of Computer Science and Technology, 2011, Vol. 5, Issue 10: 904-913.

JING Liping, ZHU Yan, YU Jian

School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China

Received: 1900-01-01  Revised: 1900-01-01  Online: 2011-10-01  Published: 2011-10-01

Abstract

Text clustering aims to group documents with similar content into the same cluster while separating documents with different content. Many clustering algorithms have been proposed to meet the needs of different domains. However, clustering large text collections remains difficult because of the inherent characteristics of text data, such as large volume, high dimensionality, and sparsity. This paper proposes a clustering method based on hierarchical non-negative matrix factorization. The method retains the merits of the original non-negative matrix factorization, namely clustering documents and identifying the key features of each cluster simultaneously, and it also reveals the hierarchical structure among clusters. Such a structure is useful in real applications such as news browsing. Experimental results on two real data sets, 20Newsgroups and Reuters-RCV1, show that the proposed method performs better than existing popular methods.

Key words: text clustering, non-negative matrix factorization, hierarchical clustering
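
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a term-by-document count matrix, uses the standard Lee-Seung multiplicative updates for the factorization, and builds the hierarchy by recursively splitting each cluster in two with a rank-2 factorization. The function names `nmf` and `build_tree`, the toy matrix, and the bisecting strategy are all assumptions made for illustration.

```python
import numpy as np

def nmf(X, k, iters=500, seed=0, eps=1e-9):
    """Approximate non-negative X (terms x docs) as W @ H.

    Uses Lee-Seung multiplicative updates for the Frobenius objective.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def build_tree(X, doc_ids, depth, n_top=2):
    """Recursively bisect a set of documents (hypothetical helper).

    Each document goes to the factor with the largest weight in H;
    the largest entries of the matching column of W give the cluster's
    most characteristic terms.
    """
    node = {"docs": list(doc_ids), "children": []}
    if depth == 0 or len(doc_ids) < 2:
        return node
    W, H = nmf(X[:, doc_ids], k=2)
    labels = H.argmax(axis=0)
    groups = [[d for d, lab in zip(doc_ids, labels) if lab == j]
              for j in range(2)]
    if not all(groups):  # degenerate split: keep this node as a leaf
        return node
    for j, group in enumerate(groups):
        child = build_tree(X, group, depth - 1, n_top)
        child["top_terms"] = np.argsort(-W[:, j])[:n_top].tolist()
        node["children"].append(child)
    return node

# Toy term-by-document matrix (8 terms x 8 documents), invented for the
# example: documents 0-3 share terms 0-1, documents 4-7 share terms 4-5,
# and each pair of documents has one extra term of its own.
X = np.array([
    [5, 5, 5, 5, 0, 0, 0, 0],
    [5, 5, 5, 5, 0, 0, 0, 0],
    [2, 2, 0, 0, 0, 0, 0, 0],
    [0, 0, 2, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 5, 5, 5, 5],
    [0, 0, 0, 0, 5, 5, 5, 5],
    [0, 0, 0, 0, 2, 2, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 2],
], dtype=float)

tree = build_tree(X, list(range(8)), depth=2)
top_split = sorted(sorted(child["docs"]) for child in tree["children"])
```

Here `top_split` holds the two document groups found at the first level of the tree, and each child node carries the indices of its highest-weighted terms in `top_terms`. This mirrors the two properties the abstract highlights: documents are clustered and characteristic features are identified in the same factorization, while the recursion exposes a hierarchy among clusters.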

JING Liping, ZHU Yan, YU Jian. Hierarchical Non-Negative Matrix Factorization for Text Clustering[J]. Journal of Frontiers of Computer Science and Technology, 2011, 5(10): 904-913.
