Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (1): 161-174. DOI: 10.3778/j.issn.1673-9418.2209078

• Artificial Intelligence · Pattern Recognition •


Knowledge Graph-Based Video Classification Algorithm for Film and Television Drama

JIANG Hongxun, ZHANG Lin, SUN Caihong   

  1. School of Information, Renmin University of China, Beijing 100872, China
  • Online:2024-01-01 Published:2024-01-01


Abstract: Given the diversity of video perception modalities, hierarchical video tag classification algorithms typically combine the visual and textual modalities, training a joint model to infer video content. However, most existing studies are suited only to coarse-grained classification; identifying film and television titles requires finer-grained recognition. This study proposes a knowledge graph-based video classification algorithm for film and television drama. Firstly, the algorithm extracts visual and textual features using a multimodal model pre-trained on large-scale generic data, and further trains a multi-task video label prediction model that produces three levels of labels for each video: content labels, theme labels, and entity labels. Introducing a similarity task into the multi-task network raises the difficulty of training the classification model, pulling features of same-class samples closer together while better expressing inter-sample differences. Secondly, for the finest-grained entity labels, an entity correction model with local attention head extensions is proposed: co-occurrence information from an external knowledge graph is introduced to revise the predictions of the upstream model, yielding more accurate entity labels. A film and television knowledge graph is constructed from semi-structured data collected from Douban, and an empirical study of the video tag classification model is conducted on it. Experimental results show that, firstly, jointly constraining the multi-task classification model with a cross-entropy loss and a similarity loss optimizes the feature representation, improving Top-1 accuracy by 3.70%, 3.35%, and 16.57% for content labels, theme labels, and entity labels respectively.
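The joint training objective described above can be sketched as follows. This is a minimal, schematic illustration only: the function names, the contrastive form of the similarity loss, and the weighting factor `alpha` are assumptions, not the paper's actual formulation.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for a single sample.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def similarity_loss(emb_a, emb_b, same_class, margin=0.5):
    # Contrastive-style loss (an assumed form): pull same-class
    # embeddings together, push different-class embeddings apart
    # until their cosine similarity drops below `margin`.
    cos = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    if same_class:
        return 1.0 - cos
    return max(0.0, cos - margin)

def joint_loss(logits, label, emb_a, emb_b, same_class, alpha=0.5):
    # The two losses impose a common constraint on the shared encoder;
    # `alpha` (hypothetical) balances classification against similarity.
    return cross_entropy(logits, label) + alpha * similarity_loss(emb_a, emb_b, same_class)
```

In a multi-task setup, one such cross-entropy term would exist per label level (content, theme, entity), with the similarity term acting on the shared feature embeddings.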
Secondly, for hard samples from the upstream model, the proposed entity correction model with global and local attention heads improves the Top-1 accuracy of entity labels from 38.7% to 45.6% once knowledge graph information is introduced. This work is a new attempt at multimodal video classification using image-text pair data, and offers a new research direction for short video classification when only a small number of labeled samples is available.
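One way the knowledge-graph co-occurrence signal could revise upstream predictions is by re-ranking candidate entities against an already-trusted anchor entity. The sketch below is a hypothetical simplification, not the paper's attention-based model: the anchor mechanism, score blending, and weight `beta` are all assumptions.

```python
def rerank_entities(scores, cooccur, anchor, beta=0.3):
    """Re-rank candidate entity scores using knowledge-graph co-occurrence.

    scores  -- dict mapping candidate entity -> model confidence
    cooccur -- dict mapping (anchor, entity) pairs -> co-occurrence count
               mined from the knowledge graph (hypothetical structure)
    anchor  -- an entity prediction already trusted (e.g. the film title)
    beta    -- assumed weight of the co-occurrence evidence
    """
    # Normalize counts so the boost is bounded by beta.
    max_co = max(cooccur.get((anchor, e), 0) for e in scores) or 1
    return {
        e: s + beta * cooccur.get((anchor, e), 0) / max_co
        for e, s in scores.items()
    }
```

For example, a weakly scored actor who frequently co-occurs with the recognized film in the graph would be promoted over a spurious look-alike with no graph connection.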

Key words: knowledge graph, video label classification, multimodal content understanding, entity correction