[1] KARPATHY A, TODERICI G, SHETTY S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, Jun 23-28, 2014: 1725-1732.
[2] WANG L, XIONG Y, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//Proceedings of the 14th European Conference on Computer Vision, Amsterdam, Oct 11-14, 2016: 20-36.
[3] YANG Y, KROMPASS D, TRESP V. Tensor-train recurrent neural networks for video classification[C]//Proceedings of the 34th International Conference on Machine Learning, Sydney, Aug 6-11, 2017: 3891-3900.
[4] ARANDJELOVIC R, GRONAT P, TORII A, et al. NetVLAD: CNN architecture for weakly supervised place recognition[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016: 5297-5307.
[5] MIECH A, LAPTEV I, SIVIC J. Learnable pooling with context gating for video classification[J]. arXiv:1706.06905, 2017.
[6] 陈洁婷, 王维莹, 金琴. 弹幕信息协助下的视频多标签分类[J]. 计算机科学, 2021, 48(1): 167-174.
CHEN J T, WANG W Y, JIN Q. Multi-label video classification assisted by danmaku[J]. Computer Science, 2021, 48(1): 167-174.
[7] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017: 6299-6308.
[8] WANG X, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018: 7794-7803.
[9] LONG X, GAN C, DE MELO G, et al. Attention clusters: purely attention based local feature integration for video classification[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018: 7834-7843.
[10] FEICHTENHOFER C, FAN H, MALIK J, et al. SlowFast networks for video recognition[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019: 6202-6211.
[11] WU Z, JIANG Y G, WANG J, et al. Exploring inter-feature and inter-class relationships with deep neural networks for video classification[C]//Proceedings of the 2014 ACM International Conference on Multimedia, Orlando, Nov 3-7, 2014: 167-176.
[12] LI L H, YATSKAR M, YIN D, et al. VisualBERT: a simple and performant baseline for vision and language[J]. arXiv:1908.03557, 2019.
[13] LI G, DUAN N, FANG Y, et al. Unicoder-VL: a universal encoder for vision and language by cross-modal pre-training[C]//Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, Feb 7-12, 2020: 11336-11344.
[14] QI D, SU L, SONG J, et al. ImageBERT: cross-modal pre-training with large-scale weak-supervised image-text data[J]. arXiv:2001.07966, 2020.
[15] SU W, ZHU X, CAO Y, et al. VL-BERT: pre-training of generic visual-linguistic representations[J]. arXiv:1908.08530, 2019.
[16] KIM W, SON B, KIM I. ViLT: vision-and-language transformer without convolution or region supervision[C]//Proceedings of the 38th International Conference on Machine Learning, Jul 18-24, 2021: 5583-5594.
[17] LU J, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]//Proceedings of the 33rd Conference on Neural Information Processing Systems, Vancouver, Dec 8-14, 2019: 13-33.
[18] RADFORD A, KIM J W, HALLACY C, et al. Learning trans-ferable visual models from natural language supervision[C]//Proceedings of the 38th International Conference on Machine Learning, Jul 18-24, 2021: 8748-8763.
[19] WEHRMANN J, CERRI R, BARROS R C. Hierarchical multi-label classification networks[C]//Proceedings of the 35th International Conference on Machine Learning, Stockholm, Jul 10-15, 2018: 5075-5084.
[20] PARKHI O M, VEDALDI A, ZISSERMAN A. Deep face recognition[C]//Proceedings of the 2015 British Machine Vision Conference, Swansea, Sep 7-10, 2015: 6.
[21] 秦佳佳. 基于规则和基于相似性的类别在比较任务中的学习和迁移[D]. 金华: 浙江师范大学, 2015.
QIN J J. Learning and transfer of rule-based and similarity-based categories in comparison task[D]. Jinhua: Zhejiang Normal University, 2015.
[22] 王帅, 王维莹, 陈师哲, 等. 基于全局和局部信息的视频记忆度预测[J]. 软件学报, 2020, 31(7): 1969-1979.
WANG S, WANG W Y, CHEN S Z, et al. Video memorability prediction based on global and local information[J]. Journal of Software, 2020, 31(7): 1969-1979.
[23] 何相腾, 彭宇新. 跨域和跨模态适应学习的无监督细粒度视频分类[J]. 软件学报, 2021, 32(11): 3482-3495.
HE X T, PENG Y X. Unsupervised fine-grained video categorization via adaptation learning across domains and modalities[J]. Journal of Software, 2021, 32(11): 3482-3495.
[24] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[25] SCHROFF F, KALENICHENKO D, PHILBIN J. FaceNet: a unified embedding for face recognition and clustering[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015: 815-823.
[26] 官赛萍, 靳小龙, 贾岩涛, 等. 面向知识图谱的知识推理研究进展[J]. 软件学报, 2018, 29(10): 2966-2994.
GUAN S P, JIN X L, JIA Y T, et al. Knowledge reasoning over knowledge graph: a survey[J]. Journal of Software, 2018, 29(10): 2966-2994.
[27] LIU W, WEN Y, YU Z, et al. SphereFace: deep hypersphere embedding for face recognition[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017: 212-220.
[28] DENG J, GUO J, XUE N, et al. ArcFace: additive angular margin loss for deep face recognition[C]//Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019: 4690-4699.