[1] KRIZHEVSKY A, SUTSKEVER I, HINTON G. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, Dec 3-6, 2012. Cambridge: MIT Press, 2012: 1097-1105.
[2] FUJIYOSHI H, LIPTON A J, KANADE T, et al. Real-time human motion analysis by image skeletonization[J]. IEICE Transactions on Information and Systems, 2004, 87(1): 113-120.
[3] YILMAZ A, SHAH M. Recognizing human actions in videos acquired by uncalibrated moving cameras[C]//Proceedings of the 10th IEEE International Conference on Computer Vision, Beijing, Oct 17-21, 2005. Piscataway: IEEE, 2005: 150-157.
[4] JHUANG H, GALL J, ZUFFI S, et al. Towards understanding action recognition[C]//Proceedings of the 14th IEEE International Conference on Computer Vision, Sydney, Dec 1-8, 2013. Piscataway: IEEE, 2013: 3192-3199.
[5] YANG X, TIAN Y. Effective 3D action recognition using EigenJoints[J]. Journal of Visual Communication and Image Representation, 2014, 25(1): 2-11.
[6] LAPTEV I. On space-time interest points[J]. International Journal of Computer Vision, 2005, 64(2/3): 107-123.
[7] DOLLAR P, RABAUD V, COTTRELL G W, et al. Behavior recognition via sparse spatio-temporal features[C]//Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, Oct 15-16, 2005. Piscataway: IEEE, 2005: 65-72.
[8] WILLEMS G, TUYTELAARS T, VAN GOOL L. An efficient dense and scale-invariant spatio-temporal interest point detector[C]//LNCS 5303: Proceedings of the 10th European Conference on Computer Vision, Marseille, Oct 12-18, 2008. Berlin, Heidelberg: Springer, 2008: 650-663.
[9] WANG H, KLASER A, SCHMID C, et al. Action recognition by dense trajectories[C]//Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, Jun 21-23, 2011. Washington: IEEE Computer Society, 2011: 3169-3176.
[10] WANG H, SCHMID C. Action recognition with improved trajectories[C]//Proceedings of the 14th IEEE International Conference on Computer Vision, Sydney, Dec 1-8, 2013. Piscataway: IEEE, 2013: 3551-3558.
[11] SIMONYAN K, ZISSERMAN A. Two-stream convolutional networks for action recognition in videos[C]//Proceedings of the Advances in Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Cambridge: MIT Press, 2014: 568-576.
[12] NG J Y, HAUSKNECHT M, VIJAYANARASIMHAN S, et al. Beyond short snippets: deep networks for video classification[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Piscataway: IEEE, 2015: 4694-4702.
[13] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[14] WANG L, XIONG Y, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//LNCS 9912: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 20-36.
[15] LAN Z, ZHU Y, HAUPTMANN A G, et al. Deep local video feature for action recognition[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Piscataway: IEEE, 2017: 1219-1225.
[16] FEICHTENHOFER C, PINZ A, ZISSERMAN A. Convolutional two-stream network fusion for video action recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 26-Jul 1, 2016. Piscataway: IEEE, 2016: 1933-1941.
[17] WANG Y, LONG M, WANG J, et al. Spatiotemporal pyramid network for video action recognition[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Piscataway: IEEE, 2017: 2097-2106.
[18] MNIH V, HEESS N, GRAVES A, et al. Recurrent models of visual attention[C]//Proceedings of the Advances in Neural Information Processing Systems, Montreal, Dec 8-13, 2014. Cambridge: MIT Press, 2014: 2204-2212.
[19] DIBA A, SHARMA V, VAN GOOL L. Deep temporal linear encoding networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Piscataway: IEEE, 2017: 1541-1550.
[20] ZHU J, ZHU Z, ZOU W. End-to-end video-level representation learning for action recognition[C]//Proceedings of the 24th International Conference on Pattern Recognition, Beijing, Aug 20-24, 2018. Piscataway: IEEE, 2018: 645-650.
[21] ZHU Y, LAN Z, NEWSAM S, et al. Hidden two-stream convolutional networks for action recognition[C]//LNCS 11363: Proceedings of the 14th Asian Conference on Computer Vision, Perth, Dec 2-6, 2018. Berlin, Heidelberg: Springer, 2018: 363-378.
[22] SUN S, KUANG Z, SHENG L, et al. Optical flow guided feature: a fast and robust motion representation for video action recognition[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Piscataway: IEEE, 2018: 1390-1399.
[23] LEE M, LEE S, SON S J, et al. Motion feature network: fixed motion filter for action recognition[C]//LNCS 11214: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 392-408.
[24] LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 7083-7093.
[25] HUSSEIN N, GAVVES E, SMEULDERS A W M. Timeception for complex action recognition[C]//Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 254-263.
[26] SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Piscataway: IEEE, 2015: 1-9.
[27] JI S, XU W, YANG M, et al. 3D convolutional neural networks for human action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 221-231.
[28] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]//Proceedings of the 15th IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Piscataway: IEEE, 2015: 4489-4497.
[29] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 26-Jul 1, 2016. Piscataway: IEEE, 2016: 770-778.
[30] TRAN D, RAY J, SHOU Z, et al. ConvNet architecture search for spatiotemporal feature learning[J]. arXiv:1708.05038, 2017.
[31] KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014. Piscataway: IEEE, 2014: 1725-1732.
[32] LIU K, LIU W, GAN C, et al. T-C3D: temporal convolutional 3D network for real-time action recognition[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, the 30th Innovative Applications of Artificial Intelligence, and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, Feb 2-7, 2018. Menlo Park: AAAI, 2018: 7138-7145.
[33] WANG X W, XIE L B, PENG L. Double residual network recognition method for falling abnormal behavior[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(9): 1580-1589.
[34] QIAN H F, ZHOU X, ZHENG M M. Abnormal behavior detection and recognition method based on improved Resnet model[J]. Computers, Materials and Continua, 2020, 65(3): 2153-2167.
[35] DIBA A, FAYYAZ M, SHARMA V, et al. Temporal 3D ConvNets: new architecture and transfer learning for video classification[J]. arXiv:1711.08200, 2017.
[36] HUANG G, LIU Z, VAN DER MAATEN L, et al. Densely connected convolutional networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Piscataway: IEEE, 2017: 2261-2269.
[37] CARREIRA J, ZISSERMAN A. Quo Vadis, action recognition? A new model and the kinetics dataset[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Piscataway: IEEE, 2017: 6299-6308.
[38] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, Jun 20-25, 2009. Piscataway: IEEE, 2009: 248-255.
[39] KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset[J]. arXiv:1705.06950, 2017.
[40] SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[J]. arXiv:1212.0402, 2012.
[41] KUEHNE H, JHUANG H, GARROTE E, et al. HMDB: a large video database for human motion recognition[C]//Proceedings of the 13th IEEE International Conference on Computer Vision, Barcelona, Nov 6-13, 2011. Piscataway: IEEE, 2011: 2556-2563.
[42] VAROL G, LAPTEV I, SCHMID C, et al. Long-term temporal convolutions for action recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1510-1517.
[43] GAO D P, ZHU J G. Atomic action recognition by multi-dimensional adaptive 3D convolutional neural networks[J]. Computer Engineering and Applications, 2018, 54(4): 174-178.
[44] WANG L, LI W, LI W, et al. Appearance-and-relation networks for video classification[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Piscataway: IEEE, 2018: 1430-1439.
[45] MEMISEVIC R. Learning to relate images[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1829-1846.
[46] ZHOU Y, SUN X, ZHA Z J, et al. MiCT: mixed 3D/2D convolutional tube for human action recognition[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Piscataway: IEEE, 2018: 449-458.
[47] ZOLFAGHARI M R, SINGH K, BROX T, et al. ECO: efficient convolutional network for online video understanding[C]//Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 713-730.
[48] FEICHTENHOFER C, FAN H, MALIK J, et al. SlowFast networks for video recognition[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 6202-6211.
[49] SUN L, JIA K, YEUNG D, et al. Human action recognition using factorized spatio-temporal convolutional networks[C]//Proceedings of the 15th IEEE International Conference on Computer Vision, Santiago, Dec 7-13, 2015. Piscataway: IEEE, 2015: 4597-4605.
[50] QIU Z, YAO T, MEI T. Learning spatio-temporal representation with pseudo-3D residual networks[C]//Proceedings of the 16th IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Piscataway: IEEE, 2017: 5534-5542.
[51] TRAN D, WANG H, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Piscataway: IEEE, 2018: 6450-6459.
[52] XIE S N, SUN C, HUANG J, et al. Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification[C]//LNCS 11219: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 318-335.
[53] GOYAL R, KAHOU S E, MICHALSKI V, et al. The "something something" video database for learning and evaluating visual common sense[C]//Proceedings of the 16th IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Piscataway: IEEE, 2017: 5842-5850.
[54] LI C, ZHONG Q, XIE D, et al. Collaborative spatiotemporal feature learning for video action recognition[C]//Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 7872-7881.
[55] LUO C, YUILLE A L. Grouped spatial-temporal aggregation for efficient action recognition[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 5512-5521.
[56] JIANG B, WANG M, GAN W, et al. STM: spatiotemporal and motion encoding for action recognition[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 2000-2009.
[57] ZHU W, HU J, SUN G, et al. A key volume mining deep framework for action recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 26-Jul 1, 2016. Piscataway: IEEE, 2016: 1991-1999.
[58] KAR A, RAI N, SIKKA K, et al. AdaScan: adaptive scan pooling in deep convolutional neural networks for human action recognition in videos[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Piscataway: IEEE, 2017: 5699-5708.
[59] KORBAR B, TRAN D, TORRESANI L. SCSampler: sampling salient clips from video for efficient action recognition[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 6232-6242.
[60] LI Y X, XIE L B. Human action recognition based on depth motion map and dense trajectory[J]. Computer Engineering and Applications, 2020, 56(3): 194-200.
[61] GE Y, JING G D. Human action recognition based on convolution neural network combined with multi-scale method[J]. Computer Engineering and Applications, 2019, 55(2): 100-103.
[62] NAGRANI A, SUN C, ROSS D, et al. Speech2Action: cross-modal supervision for action recognition[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Piscataway: IEEE, 2020: 10317-10326.
[63] GAO R, OH T, GRAUMAN K, et al. Listen to look: action recognition by previewing audio[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Piscataway: IEEE, 2020: 10457-10467.
[64] SCHULDT C, LAPTEV I, CAPUTO B. Recognizing human actions: a local SVM approach[C]//Proceedings of the 17th IEEE International Conference on Pattern Recognition, Cambridge, Aug 23-26, 2004. Piscataway: IEEE, 2004: 32-36.
[65] BLANK M, GORELICK L, SHECHTMAN E, et al. Actions as space-time shapes[C]//Proceedings of the 10th IEEE International Conference on Computer Vision, Beijing, Oct 17-21, 2005. Piscataway: IEEE, 2005: 1395-1402.
[66] WEINLAND D, RONFARD R, BOYER E. Free viewpoint action recognition using motion history volumes[J]. Computer Vision and Image Understanding, 2006, 104(2/3): 249-257.
[67] MARSZALEK M, LAPTEV I, SCHMID C. Actions in context[C]//Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, Jun 20-25, 2009. Piscataway: IEEE, 2009: 2929-2936.
[68] NIEBLES J C, CHEN C, LI F F. Modeling temporal structure of decomposable motion segments for activity classification[C]//LNCS 6312: Proceedings of the 11th European Conference on Computer Vision, Heraklion, Sep 5-11, 2010. Berlin, Heidelberg: Springer, 2010: 392-405.
[69] ZHAO H, TORRALBA A, TORRESANI L, et al. HACS: human action clips and segments dataset for recognition and temporal localization[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, Seoul, Oct 7-Nov 2, 2019. Piscataway: IEEE, 2019: 8668-8678.
[70] SIGURDSSON G A, VAROL G, WANG X, et al. Hollywood in Homes: crowdsourcing data collection for activity understanding[C]//LNCS 9905: Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016. Cham: Springer, 2016: 510-526.
[71] DAMEN D, DOUGHTY H, FARINELLA G M, et al. Scaling egocentric vision: the EPIC-Kitchens dataset[C]//Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 753-771.
[72] WANG X L, GIRSHICK R A, GUPTA A, et al. Non-local neural networks[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Piscataway: IEEE, 2018: 7794-7803.
[73] ZHOU B L, ANDONIAN A, OLIVA A, et al. Temporal relational reasoning in videos[C]//LNCS 11205: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 831-846.
[74] LI Y W, LI Y, VASCONCELOS N. RESOUND: towards action recognition without representation bias[C]//LNCS 11210: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 520-535.
[75] GU C, SUN C, ROSS D A, et al. AVA: a video dataset of spatio-temporally localized atomic visual actions[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Piscataway: IEEE, 2018: 6047-6056.
[76] MONFORT M, ANDONIAN A, ZHOU B, et al. Moments in time dataset: one million videos for event understanding[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 502-508.
[77] MATERZYNSKA J, BERGER G, BAX I, et al. The Jester dataset: a large-scale video dataset of human gestures[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 2874-2882.
[78] SHAO D, ZHAO Y, DAI B, et al. FineGym: a hierarchical video dataset for fine-grained action understanding[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Piscataway: IEEE, 2020: 2613-2622.
[79] WANG L, XIONG Y, WANG Z, et al. Towards good practices for very deep two-stream ConvNets[J]. arXiv:1507.02159, 2015.
[80] JI J, KRISHNA R, LI F F, et al. Action Genome: actions as compositions of spatio-temporal scene graphs[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Piscataway: IEEE, 2020: 10236-10247.
[81] CAO K, JI J, CAO Z, et al. Few-shot video classification via temporal alignment[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Piscataway: IEEE, 2020.
[82] XIE S N, GIRSHICK R, DOLLAR P, et al. Aggregated residual transformations for deep neural networks[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Piscataway: IEEE, 2017: 1492-1500.
[83] ZHANG X Y, ZHOU X Y, LIN M X, et al. ShuffleNet: an extremely efficient convolutional neural network for mobile devices[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Piscataway: IEEE, 2018: 6848-6856.
[84] LOTTER W, KREIMAN G, COX D. Deep predictive coding networks for video prediction and unsupervised learning[J]. arXiv:1605.08104, 2016.
[85] WANG Y B, LONG M S, WANG J M, et al. PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMs[C]//Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach, Dec 4-9, 2017. Cambridge: MIT Press, 2017: 879-888.
[86] WANG Y B, GAO Z F, LONG M S, et al. PredRNN++: towards a resolution of the deep-in-time dilemma in spatio-temporal predictive learning[J]. arXiv:1804.06300, 2018.
[87] ZHUANG C, SHE T, ANDONIAN A, et al. Unsupervised learning from video with deep neural embeddings[C]//Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 14-19, 2020. Piscataway: IEEE, 2020: 9563-9572.