Journal of Frontiers of Computer Science and Technology, 2016, Vol. 10, Issue (3): 381-388. DOI: 10.3778/j.issn.1673-9418.1505048

RNA-Seq Data Expression Analysis Based on Smoothed LDA

OU Shuhua+, LIU Xuejun, ZHANG Li   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  • Online: 2016-03-01 Published: 2016-03-11

基于平滑LDA的RNA-Seq数据表达分析研究

欧书华+,刘学军,张  礼   

  1. 南京航空航天大学 计算机科学与技术学院,南京 210016

Abstract: RNA-Seq is an important technique for transcriptome research. Expression analysis of RNA-Seq data must contend with multi-mapping of reads to isoforms, the non-uniform distribution of reads along the reference sequence, reads that span exon junctions, and the sparsity caused by the large exon size. This paper proposes a new method, sLDASeq, to calculate gene and transcript expression. To handle multi-mapping, non-uniform read distribution and exon-junction reads, the model uses the known gene-isoform annotation to constrain the hyper-parameters and allocates read counts according to exon length. An additional hyper-parameter addresses the sparsity in the exons. sLDASeq is validated on gene and transcript expression calculation using three real datasets and compared with LDASeq and other popular methods. The results show that sLDASeq obtains more accurate transcript and gene expression measurements than the other methods.

Key words: RNA-Seq, gene and transcript expression, smoothed LDA, exon-junction, multi-mapping, non-uniformity
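
The abstract mentions two mechanisms that a small numerical sketch may make more concrete: allocating read counts according to exon length, and adding a hyper-parameter to counter sparsity. The Python below is only an illustration under stated assumptions, not the sLDASeq model itself, whose details are not given on this page. The function names `allocate_junction_reads` and `smoothed_proportions`, the exon labels, and the value `beta=0.5` are all hypothetical. The smoothing step assumes the general smoothed-LDA idea that a Dirichlet prior behaves like a pseudo-count added to every category.

```python
def allocate_junction_reads(count, exon_lengths):
    """Split a junction-spanning read count across exons in proportion to exon length.

    Hypothetical helper: the paper's exact allocation rule is not stated in the abstract.
    """
    total = sum(exon_lengths.values())
    if total <= 0:
        raise ValueError("exon lengths must sum to a positive value")
    return {exon: count * length / total for exon, length in exon_lengths.items()}


def smoothed_proportions(counts, beta):
    """Additive smoothing: each exon receives pseudo-count beta before normalising.

    Assumption: beta stands in for the extra hyper-parameter mentioned in the abstract.
    """
    total = sum(counts.values()) + beta * len(counts)
    return {exon: (c + beta) / total for exon, c in counts.items()}


# 12 reads span the junction between a 300 bp exon and a 100 bp exon.
shares = allocate_junction_reads(12, {"exon1": 300, "exon2": 100})

# exon3 has no observed reads; smoothing keeps its proportion above zero.
props = smoothed_proportions({**shares, "exon3": 0.0}, beta=0.5)
```

With these made-up numbers the longer exon receives three quarters of the junction reads, and the unobserved exon keeps a small non-zero proportion rather than collapsing to zero, which is the practical effect smoothing is meant to have on sparse counts.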

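Abstract: RNA-Seq is currently an important technique for transcriptome research. To address multi-mapping of reads, the non-uniform distribution of reads along the reference sequence, the sparse distribution of exons in some transcripts, and the handling of reads that span exon junctions in RNA-Seq data analysis, this paper proposes sLDASeq, a new model for transcriptome expression analysis. The model constrains its parameters using the transcript annotation of each gene and allocates junction-spanning reads by length, which addresses the non-uniform read distribution and the junction problem. A hyper-parameter is added to the model to address exon sparsity. The model is applied to three real datasets and compared with other mainstream methods; the results show that it yields relatively accurate gene and transcript expression estimates.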

Key words: RNA-Seq, gene and transcript expression level, smoothed LDA, junction region, multi-mapping, non-uniformity