[1] 刘铁岩, 陈薇, 王太峰, 等. 分布式机器学习: 算法、理论与实践[M]. 北京: 机械工业出版社, 2018: 42-45.
LIU T Y, CHEN W, WANG T F, et al. Distributed machine learning: algorithms, theory, and practice[M]. Beijing: China Machine Press, 2018: 42-45.
[2] LIU J, ZHANG C. Distributed learning systems with first-order methods[J]. Foundations and Trends in Databases, 2020, 9(1): 1-100.
[3] ROBBINS H, MONRO S. A stochastic approximation method[J]. The Annals of Mathematical Statistics, 1951, 22(3): 400-407.
[4] JOHNSON R, ZHANG T. Accelerating stochastic gradient descent using predictive variance reduction[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems, 2013: 315-323.
[5] NGUYEN L M, LIU J, SCHEINBERG K, et al. SARAH: a novel method for machine learning problems using stochastic recursive gradient[C]//Proceedings of the 34th International Conference on Machine Learning, 2017: 2613-2621.
[6] LIU R, MOZAFARI B. Communication-efficient distributed learning for large batch optimization[C]//Proceedings of the 39th International Conference on Machine Learning, 2022: 13925-13946.
[7] NADO Z, GILMER J M, SHALLUE C J, et al. A large batch optimizer reality check: traditional, generic optimizers suffice across batch sizes[EB/OL]. [2024-01-23]. https://arxiv.org/abs/2102.06356.
[8] YOU Y, GITMAN I, GINSBURG B. Scaling SGD batch size to 32K for ImageNet training[EB/OL]. [2024-01-23]. https://arxiv.org/abs/1708.03888.
[9] YOU Y, LI J, REDDI S, et al. Large batch optimization for deep learning: training BERT in 76 minutes[EB/OL]. [2024-01-23]. https://arxiv.org/abs/1904.00962.
[10] LIN T, KONG L J, STICH S U, et al. Extrapolation for large-batch training in deep learning[C]//Proceedings of the 37th International Conference on Machine Learning, 2020: 6094-6104.
[11] XUE Z, LIANG J, SONG G, et al. Large-batch optimization for dense visual predictions: training Faster R-CNN in 4.2 minutes[C]//Advances in Neural Information Processing Systems 35, 2022: 18694-18706.
[12] ZHENG Z W, XU P T, ZOU X, et al. CowClip: reducing CTR prediction model training time from 12 hours to 10 minutes on 1 GPU[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(9): 11390-11398.
[13] JIANG S R, CHEN Q C, PAN Y C, et al. ZO-AdaMU optimizer: adapting perturbation by the momentum and uncertainty in zeroth-order optimization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(16): 18363-18371.
[14] WANG X Y, JOHANSSON M, ZHANG T, et al. Generalized Polyak step size for first order optimization with momentum[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 35836-35863.
[15] CHEN Y N, LI Z C, ZHANG L F, et al. Bidirectional looking with a novel double exponential moving average to adaptive and non-adaptive momentum optimizers[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 4764-4803.
[16] LUO Y, REN X Z, ZHENG Z W, et al. CAME: confidence-guided adaptive memory efficient optimization[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023: 4442-4453.
[17] NGUYEN L M, LIU J, SCHEINBERG K, et al. Stochastic recursive gradient algorithm for non-convex optimization[EB/OL]. [2024-01-23]. https://arxiv.org/abs/1705.07261.
[18] JIANG X, STICH S U. Adaptive SGD with Polyak stepsize and line-search: robust convergence and variance reduction[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023: 1-29.
[19] CAI X F, SONG C B, WRIGHT S J, et al. Cyclic block coordinate descent with variance reduction for composite nonconvex optimization[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 3469-3494.
[20] LIU J, PAN X K, DUAN J W, et al. Faster stochastic variance reduction methods for compositional MiniMax optimization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(12): 13927-13935.
[21] BARAKAT A, FATKHULLIN I, HE N. Reinforcement learning with general utilities: simpler variance reduction and large state-action space[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 1753-1800.
[22] LEI L, JORDAN M. Less than a single pass: stochastically controlled stochastic gradient[C]//Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017: 148-156.
[23] HORVÁTH S, LEI L H, RICHTÁRIK P, et al. Adaptivity of stochastic gradient methods for nonconvex optimization[J]. SIAM Journal on Mathematics of Data Science, 2022, 4(2): 634-648.