[1] 刘铁岩, 陈薇, 王太峰, 等. 分布式机器学习: 算法、理论与实践[M]. 北京: 机械工业出版社, 2018: 42-45.
LIU T Y, CHEN W, WANG T F, et al. Distributed machine learning: algorithms, theory, and practice[M]. Beijing: China Machine Press, 2018: 42-45.
[2] LIU J, ZHANG C. Distributed learning systems with first-order methods[J]. Foundations and Trends in Databases, 2020, 9(1): 1-100.
[3] ROBBINS H, MONRO S. A stochastic approximation method[J]. The Annals of Mathematical Statistics, 1951, 22(3): 400-407.
[4] JOHNSON R, ZHANG T. Accelerating stochastic gradient descent using predictive variance reduction[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems, 2013: 315-323.
[5] NGUYEN L M, LIU J, SCHEINBERG K, et al. SARAH: a novel method for machine learning problems using stochastic recursive gradient[C]//Proceedings of the 34th International Conference on Machine Learning, 2017: 2613-2621.
[6] LIU R, MOZAFARI B. Communication-efficient distributed learning for large batch optimization[C]//Proceedings of the 39th International Conference on Machine Learning, 2022: 13925-13946.
[7] NADO Z, GILMER J M, SHALLUE C J, et al. A large batch optimizer reality check: traditional, generic optimizers suffice across batch sizes[EB/OL]. [2024-01-23]. https://arxiv.org/abs/2102.06356.
[8] YOU Y, GITMAN I, GINSBURG B. Scaling SGD batch size to 32K for ImageNet training[EB/OL]. [2024-01-23]. https://arxiv.org/abs/1708.03888.
[9] YOU Y, LI J, REDDI S, et al. Large batch optimization for deep learning: training BERT in 76 minutes[EB/OL]. [2024-01-23]. https://arxiv.org/abs/1904.00962.
[10] LIN T, KONG L J, STICH S U, et al. Extrapolation for large-batch training in deep learning[C]//Proceedings of the 37th International Conference on Machine Learning, 2020: 6094-6104.
[11] XUE Z, LIANG J, SONG G, et al. Large-batch optimization for dense visual predictions: training Faster R-CNN in 4.2 minutes[C]//Advances in Neural Information Processing Systems 35, 2022: 18694-18706.
[12] ZHENG Z W, XU P T, ZOU X, et al. CowClip: reducing CTR prediction model training time from 12 hours to 10 minutes on 1 GPU[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(9): 11390-11398.
[13] JIANG S R, CHEN Q C, PAN Y C, et al. ZO-AdaMU optimizer: adapting perturbation by the momentum and uncertainty in zeroth-order optimization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(16): 18363-18371.
[14] WANG X Y, JOHANSSON M, ZHANG T, et al. Generalized Polyak step size for first order optimization with momentum[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 35836-35863.
[15] CHEN Y N, LI Z C, ZHANG L F, et al. Bidirectional looking with a novel double exponential moving average to adaptive and non-adaptive momentum optimizers[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 4764-4803.
[16] LUO Y, REN X Z, ZHENG Z W, et al. CAME: confidence-guided adaptive memory efficient optimization[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023: 4442-4453.
[17] NGUYEN L M, LIU J, SCHEINBERG K, et al. Stochastic recursive gradient algorithm for non-convex optimization[EB/OL]. [2024-01-23]. https://arxiv.org/abs/1705.07261.
[18] JIANG X, STICH S U. Adaptive SGD with Polyak stepsize and line-search: robust convergence and variance reduction[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023: 1-29.
[19] CAI X F, SONG C B, WRIGHT S J, et al. Cyclic block coordinate descent with variance reduction for composite nonconvex optimization[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 3469-3494.
[20] LIU J, PAN X K, DUAN J W, et al. Faster stochastic variance reduction methods for compositional MiniMax optimization[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2024, 38(12): 13927-13935.
[21] BARAKAT A, FATKHULLIN I, HE N. Reinforcement learning with general utilities: simpler variance reduction and large state-action space[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 1753-1800.
[22] LEI L, JORDAN M. Less than a single pass: stochastically controlled stochastic gradient[C]//Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017: 148-156.
[23] HORVÁTH S, LEI L H, RICHTÁRIK P, et al. Adaptivity of stochastic gradient methods for nonconvex optimization[J]. SIAM Journal on Mathematics of Data Science, 2022, 4(2): 634-648.