Review of Human Action Recognition Based on Deep Learning

doi:10.3778/j.issn.1673-9418.2009095

Abstract

Human action recognition is an important topic in video understanding, with wide application in video surveillance, human-computer interaction, motion analysis, and video information retrieval. Based on the characteristics of the backbone network, this paper reviews recent research in action recognition from three perspectives: 2D convolutional neural networks, 3D convolutional neural networks, and spatiotemporal decomposition networks, and qualitatively analyzes and compares the advantages and disadvantages of the three classes of methods. It then summarizes the commonly used action video datasets, grouped into scene-related and temporal-related datasets, with emphasis on the characteristics and usage of each. Next, it introduces the pre-training strategies common in action recognition and analyzes how pre-training affects model performance. Finally, drawing on the latest research trends, it discusses future directions for action recognition from six perspectives: fine-grained action recognition, more streamlined models, few-shot learning, unsupervised learning, adaptive networks, and video super-resolution action recognition.

Key words: human action recognition, 2D convolutional neural network (2D CNN), 3D convolutional neural network (3D CNN), spatiotemporal decomposition network, pre-training

QIAN Huifang, YI Jianping, FU Yunhu. Review of Human Action Recognition Based on Deep Learning[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(3): 438-455.
