计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2008, Vol. 2, Issue 1: 60-76.

• Academic Research •

An effective algorithm for mining compressed sequential patterns

CHANG Lei (常雷)1,2+, YANG Dongqing (杨冬青)1,2, WANG Tengjiao (王腾蛟)1,2, TANG Shiwei (唐世渭)1,2

  1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
  2. Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-02-20 Published: 2008-02-20
  • Contact: CHANG Lei

Abstract: This paper examines how to compress sequential patterns using SP-Features (Sequential Pattern Features). An SP-Feature is a novel structure that succinctly represents a set of sequential patterns. A new similarity measure is proposed for clustering SP-Features, and a method for combining SP-Features is designed. Based on a hierarchical clustering framework, an effective algorithm, CSP, is developed to mine compressed sequential patterns. Extensive experiments on both real and synthetic datasets show that CSP compresses sequential patterns efficiently and effectively, with low restoration error (less than 4% on dense datasets).

Key words: data mining, sequential pattern compression, SP-Feature