Journal of Frontiers of Computer Science and Technology ›› 2010, Vol. 4 ›› Issue (8): 711-711.DOI: 10.3778/j.issn.1673-9418.2010.08.004

• Academic Research •

A Lossless Compression Technique for Similar Data*

ZHAO Guoyi; YANG Xiaochun+; WANG Bin

  

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
  • Online: 2010-08-10 Published: 2010-08-10
  • Contact: YANG Xiaochun

Abstract: Network information, observation data, and biological information contain large amounts of similar data, on which traditional compression methods do not achieve good results. For highly similar data, a new lossless compression method is proposed that represents the whole dataset as a base sequence plus a set of edit-distance-based variants, so that an originally large data item can be expressed with only a few differences. Because real-world data are not similar as a whole, a cluster-then-compress approach is presented: within each cluster, a cluster center is constructed as a virtual base sequence to maximize the compression ratio. Extensive experiments and analyses on real datasets show that the proposed lossless compression technique achieves a good compression ratio on similar sequence data.

Keywords: lossless compression, variant representation, edit distance, clustering, base sequence
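
The base-sequence-plus-variants idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm. It assumes that difflib's opcodes are an acceptable stand-in for an edit-distance-based edit script (they are not guaranteed to be minimal), that the base is simply the dataset member with the fewest total edits rather than a constructed virtual base sequence, and that the clustering step is omitted. The function names encode, decode, compress, and decompress are hypothetical.

```python
from difflib import SequenceMatcher


def encode(base, seq):
    """Return the edits that turn base into seq as (start, end, replacement) triples.

    Assumption: difflib opcodes stand in for a true minimal edit script.
    """
    matcher = SequenceMatcher(None, base, seq, autojunk=False)
    return [(i1, i2, seq[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def decode(base, edits):
    """Rebuild the original sequence from the base and its edit list."""
    out, pos = [], 0
    for start, end, replacement in edits:
        out.append(base[pos:start])  # unchanged run copied from the base
        out.append(replacement)      # inserted or substituted characters
        pos = end                    # skip the deleted or replaced run
    out.append(base[pos:])
    return "".join(out)


def compress(seqs):
    """Choose a base among the members and encode every sequence against it.

    Assumption: the base is the member with the fewest total edits; the
    paper instead constructs a virtual base sequence per cluster.
    """
    def cost(candidate):
        return sum(len(encode(candidate, s)) for s in seqs)

    base = min(seqs, key=cost)
    return base, [encode(base, s) for s in seqs]


def decompress(base, variants):
    """Recover all sequences; the round trip is lossless."""
    return [decode(base, edits) for edits in variants]


# Hypothetical toy data: four near-identical DNA-like strings.
seqs = ["ACGTACGTAC", "ACGTTCGTAC", "ACGTACGAAC", "ACGACGTAC"]
base, variants = compress(seqs)
restored = decompress(base, variants)
```

In this sketch the sequence chosen as the base has an empty edit list, and each of the others is stored as a handful of short triples, which is where the saving over storing every sequence in full would come from when the data are highly similar.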
