Journal of Frontiers of Computer Science and Technology ›› 2019, Vol. 13 ›› Issue (5): 742-752. DOI: 10.3778/j.issn.1673-9418.1710025

Extensible Topic Modeling and Analysis Framework for Multisource Data

TANG Shuang1,2, ZHANG Lingxiao1,2, ZHAO Junfeng1,2,3+, XIE Bing1,2,3, ZOU Yanzhen1,2,3   

  1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
  2. Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing 100871, China
  3. Peking University Information Technology Institute (Tianjin Binhai), Tianjin 300450, China
  • Online: 2019-05-01  Published: 2019-05-08

面向多源数据的可扩展主题建模分析框架

唐  爽1,2,张灵箫1,2,赵俊峰1,2,3+,谢  冰1,2,3,邹艳珍1,2,3   

  1. 北京大学 信息科学技术学院,北京 100871
  2. 高可信软件技术教育部重点实验室,北京 100871
  3. 北京大学(天津滨海)新一代信息技术研究院,天津 300450

Abstract: With the continuous development and application of information technology, many information systems have accumulated large amounts of multi-source heterogeneous data. A large part of this data is structured data that is high-dimensional, low-quality, and unlabeled, which makes feature extraction and knowledge refinement difficult. Topic modeling is an important method in text processing and data mining. It is an unsupervised learning technique, originally used to model unstructured natural-language text, that can effectively extract topic information from text semantics for feature extraction and dimensionality reduction. However, topic modeling is still not well applied to multi-source data with complex relationships, especially structured data. This paper presents a framework, based on extensible topic modeling technology, for analyzing structured and unstructured multi-source data. The framework analyzes multi-source data in three steps: data import, data analysis, and data visualization. On this basis, a multi-source data analysis tool is implemented. Finally, experiments on two data sets demonstrate the effectiveness of the proposed framework.

Key words: topic modeling technology, latent Dirichlet allocation (LDA), structured data analysis, visualization
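
As a rough illustration of the three steps named in the abstract (import, analysis, visualization), the following minimal sketch applies LDA to structured records. It is not the authors' framework or tool. The `field=value` token encoding, the function names `rows_to_documents`, `fit_lda`, and `top_words`, the hyperparameter defaults, and the toy records are all assumptions made for illustration; the analysis step uses a plain collapsed Gibbs sampler written with the Python standard library only.

```python
import random
from collections import Counter


def rows_to_documents(rows):
    """Import step (assumed encoding): one record -> a bag of 'field=value' tokens."""
    return [[f"{field}={value}" for field, value in sorted(row.items())]
            for row in rows]


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Analysis step: collapsed Gibbs sampling for LDA (hypothetical defaults)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        assignments.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-record topic proportions (a low-dimensional feature vector).
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return theta, topic_word


def top_words(topic_word, n=3):
    """Visualization step (text only): most frequent tokens in each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


if __name__ == "__main__":
    # Made-up toy records, not data from the paper.
    toy_rows = [
        {"dept": "cardiology", "drug": "aspirin"},
        {"dept": "cardiology", "drug": "statin"},
        {"dept": "oncology", "drug": "cisplatin"},
        {"dept": "oncology", "drug": "taxol"},
    ]
    theta, topic_word = fit_lda(rows_to_documents(toy_rows), num_topics=2)
    for row, proportions in zip(toy_rows, theta):
        print(row, [round(p, 2) for p in proportions])
    print(top_words(topic_word))
```

Each record ends up with a short vector of topic proportions, which is the kind of reduced-dimension feature the abstract refers to. In practice a library implementation of LDA would likely replace the hand-written sampler.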
