Journal of Frontiers of Computer Science and Technology ›› 2016, Vol. 10 ›› Issue (2): 210-219. DOI: 10.3778/j.issn.1673-9418.1505045


Novel Method to Estimate Expression Level Based on Multi-Sample RNA-Seq Data

ZHANG Li+, LIU Xuejun, CHEN Songcan   

  1. College of Computer Science & Technology, Nanjing University of Aeronautics & Astronautics, Nanjing 210016, China
  • Online: 2016-02-01 Published: 2016-02-03

Abstract: With the rapid development of next-generation high-throughput sequencing, RNA-Seq has become a standard technique for transcriptome analysis. When applied to multi-sample RNA-Seq data, existing expression estimation methods usually process each sample separately and ignore the fact that the read distribution of a gene is highly consistent across samples. This paper proposes a novel method, MRSeq, to estimate expression levels from multi-sample RNA-Seq data. MRSeq introduces a bias curve estimation model to capture the features of the read distribution that are shared among samples. These shared features are embedded into the model through bias weights, which are used to correct the read data. In addition, a sparsity constraint is added to reflect the sparsity between the expression of a gene and that of its isoforms. The method is validated on three real datasets for both gene and isoform expression estimation. Compared with popular existing methods, MRSeq yields more accurate gene and isoform expression estimates and more biologically meaningful results.
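
The two ideas summarized in the abstract, a bias curve shared across samples and a sparsity constraint on isoform expression, can be illustrated with a small sketch. This is not the MRSeq model itself, whose formulation is not given here. It assumes a simplified linear model in which the read count at each position is the shared bias curve multiplied by the summed expression of the isoforms covering that position, and it swaps in an L1-penalized non-negative least-squares fit. The function names `shared_bias_curve` and `estimate_isoforms`, the 0/1 `structure` matrix, and the penalty `lam` are all hypothetical.

```python
import numpy as np


def shared_bias_curve(counts):
    """Estimate a per-position bias curve shared by all samples.

    counts: array of shape (n_samples, n_positions) with the read counts
    of one gene. Each sample is divided by its mean depth so that samples
    become comparable, then the relative profiles are averaged.
    The returned curve has mean 1. (Illustrative assumption, not MRSeq.)
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.mean(axis=1, keepdims=True)
    relative = counts / np.where(depth > 0, depth, 1.0)
    curve = relative.mean(axis=0)
    return curve / curve.mean()


def estimate_isoforms(counts, structure, curve, lam=0.1, n_iter=500):
    """Sparse, non-negative isoform expression for one sample.

    Assumed model: counts[p] ~ curve[p] * sum_k theta[k] * structure[k, p],
    where structure[k, p] is 1 if isoform k covers position p, else 0.
    Minimizes 0.5 * ||A @ theta - counts||^2 + lam * sum(theta) subject to
    theta >= 0, using proximal (projected) gradient descent.
    """
    counts = np.asarray(counts, dtype=float)
    # Bias-weighted design matrix, shape (n_positions, n_isoforms).
    A = (np.asarray(structure, dtype=float) * curve).T
    # Step size 1/L, where L is the largest eigenvalue of A.T @ A.
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ theta - counts)
        # Gradient step plus L1 shrinkage, then projection onto theta >= 0.
        theta = np.maximum(theta - step * (grad + lam), 0.0)
    return theta
```

In this sketch the curve would be estimated once per gene from all samples and then reused when fitting each sample, which is where the multi-sample information enters. The L1 penalty pushes the expression of weakly supported isoforms to exactly zero.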

Key words: RNA-Seq, multi-sample, bias curve, sparsity, gene and isoform expression
