Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2018, Vol. 12, Issue 7: 1047-1054. DOI: 10.3778/j.issn.1673-9418.1705044

• Academic Research •


Distributed Stochastic Variance Reduction Gradient Descent Algorithm topkSVRG

WANG Jianfei, KANG Liangyi, LIU Jie, YE Dan   

  1. Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
    2. University of Chinese Academy of Sciences, Beijing 100190, China
  • Online:2018-07-01 Published:2018-07-06

Abstract:

Machine learning problems are usually cast as the minimization of an objective function, and optimization algorithms are the key tool for solving for the parameters of that objective. Stochastic gradient descent (SGD) is currently the most widely used optimizer, but its noisy gradient estimates limit it to a sub-linear convergence rate, whereas the improved stochastic variance reduction gradient (SVRG) method achieves a linear rate. SVRG, however, is a serial single-machine algorithm. To handle distributed training on large-scale data sets, this paper designs topkSVRG, a distributed algorithm built on the SVRG idea. The key change is that a master node maintains a global model while worker nodes update local models on their own data shards; in each round, the k local models closest to the current global model are averaged to update the global model. Increasing k speeds up convergence, while decreasing k safeguards convergence. The paper gives a theoretical analysis of the algorithm's linear convergence and implements it on Spark; experiments against Mini-Batch SGD, CoCoA, Splash and related algorithms show that topkSVRG converges faster when high accuracy is required.

Key words: machine learning, optimization, stochastic gradient descent (SGD), stochastic variance reduction gradient (SVRG), distributed computing
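
To make the aggregation rule concrete, below is a minimal single-process sketch of the scheme the abstract describes: each simulated worker runs one local SVRG epoch starting from the current global model, and the master averages the k local models closest to that global model. This is only an illustration under assumed settings (a least-squares objective, NumPy, and made-up hyper-parameters such as step and inner_iters), not the authors' Spark implementation.

# Minimal single-process sketch of the topkSVRG idea described in the abstract,
# not the authors' Spark implementation. The least-squares objective, function
# names and all hyper-parameters below are illustrative assumptions.
import numpy as np


def svrg_epoch(w_global, X, y, step=0.02, inner_iters=400, rng=None):
    """One local SVRG epoch on a worker shard (X, y) for least squares.

    Uses the standard variance-reduced gradient
        g = grad_i(w) - grad_i(w_snapshot) + full_grad(w_snapshot),
    with the current global model serving as the snapshot.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = X.shape[0]
    full_grad = X.T @ (X @ w_global - y) / n   # full gradient at the snapshot
    w = w_global.copy()
    for _ in range(inner_iters):
        i = rng.integers(n)
        xi, yi = X[i], y[i]
        g = xi * (xi @ w - yi) - xi * (xi @ w_global - yi) + full_grad
        w -= step * g
    return w


def topk_average(w_global, local_models, k):
    """Master step: average the k local models closest to the global model."""
    dists = [np.linalg.norm(w - w_global) for w in local_models]
    closest = np.argsort(dists)[:k]
    return np.mean([local_models[i] for i in closest], axis=0)


# Toy driver: 4 simulated workers holding shards of a linear-regression problem.
rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)
shards = []
for _ in range(4):
    X = rng.normal(size=(200, d))
    shards.append((X, X @ w_true + 0.01 * rng.normal(size=200)))

w_global = np.zeros(d)
for epoch in range(5):
    local_models = [svrg_epoch(w_global, X, y, rng=rng) for X, y in shards]
    w_global = topk_average(w_global, local_models, k=3)
    print(epoch, np.linalg.norm(w_global - w_true))

In this toy driver, a larger k averages more local models per round and speeds up progress, while a smaller k keeps only the local models that stayed near the global model, which is the conservative choice the abstract associates with guaranteed convergence.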