Journal of Frontiers of Computer Science and Technology ›› 2012, Vol. 6 ›› Issue (12): 1136-1143. DOI: 10.3778/j.issn.1673-9418.2012.12.008

• Academic Research •

K-split Lasso: An Effective Feature Selection Method for Tumor Gene Expression Data

ZHANG Jing+, HU Xuegang, ZHANG Yuhong, SHI Wanfeng

  1. School of Computer and Information, Hefei University of Technology, Hefei 230009, China
  • Online: 2012-12-01 Published: 2012-12-03

Abstract: With the advent of DNA microarray technology, a large number of gene expression datasets for different tumors have been made publicly available online, making informative gene selection and tumor subtype classification an active research area in bioinformatics. This paper proposes K-split Lasso, a feature selection method based on the Lasso (least absolute shrinkage and selection operator). Its main idea is to divide the feature set evenly into K parts, apply Lasso to select genes within each part, merge the selected subsets, and then run feature selection again on the merged subset to obtain the final informative genes. With a support vector machine as the classifier, the experimental results indicate that K-split Lasso reduces redundant features, improves classification accuracy, and has good stability. Because each Lasso run operates on fewer dimensions, K-split Lasso also reduces the computational cost and, to some extent, alleviates overfitting. K-split Lasso is therefore an effective method for selecting informative tumor genes.

Key words: tumor gene expression profiles, Lasso, feature selection, support vector machine
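
The abstract gives only an outline of the procedure, so the following is a minimal, dependency-free Python sketch of the two-stage idea under stated assumptions. It is not the authors' implementation. The assumptions are: the split is over features (genes) into K strided, near-even parts; class labels are encoded as numeric targets; the same regularization weight `alpha` is used in both stages; and Lasso is solved by plain coordinate descent without an intercept. The names `soft_threshold`, `lasso_cd`, `select` and `k_split_lasso` are hypothetical and chosen for illustration only.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, alpha, n_iter=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent.

    X is a list of n rows with p values each; no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


def select(X, y, features, alpha):
    """Run Lasso on the given columns and return those with nonzero weight."""
    sub = [[row[j] for j in features] for row in X]
    w = lasso_cd(sub, y, alpha)
    return [f for f, wj in zip(features, w) if wj != 0.0]


def k_split_lasso(X, y, k, alpha):
    """Stage 1: Lasso inside each of k feature parts. Stage 2: Lasso on the union."""
    p = len(X[0])
    # Assumption: a strided split, which gives parts of near-equal size.
    parts = [list(range(i, p, k)) for i in range(k)]
    merged = sorted(f for part in parts for f in select(X, y, part, alpha))
    return select(X, y, merged, alpha) if merged else []


if __name__ == "__main__":
    # Synthetic data (an assumption for the demo): only features 2 and 16 matter.
    rng = random.Random(0)
    X = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(100)]
    y = [3 * row[2] - 3 * row[16] + rng.gauss(0, 0.1) for row in X]
    print(k_split_lasso(X, y, k=3, alpha=1.0))
```

Each stage-one call sees only about p/K columns, which is where the reduced per-run dimensionality mentioned in the abstract comes from. In practice a library solver (for example scikit-learn's `Lasso`, which minimizes the same objective) would likely replace `lasso_cd`, and the selected genes would then be passed to a support vector machine for evaluation.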