Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (1): 161-174. DOI: 10.3778/j.issn.1673-9418.2209078

• Artificial Intelligence · Pattern Recognition •


Knowledge Graph-Based Video Classification Algorithm for Film and Television Drama

JIANG Hongxun, ZHANG Lin, SUN Caihong   

  1. School of Information, Renmin University of China, Beijing 100872, China
  • Online:2024-01-01 Published:2024-01-01


Abstract: Given the diversity of video perception modalities, hierarchical video tag classification algorithms typically combine the visual and textual modalities, training a joint model to infer video content. However, most existing studies are suited only to coarse-grained classification; identifying film and television titles requires finer-grained recognition. This study proposes a knowledge graph-based video classification algorithm for film and television drama. Firstly, the algorithm extracts visual and textual features using a multimodal model pre-trained on large-scale generic data, and further trains a multi-task video label prediction model that produces three levels of labels for each video: content labels, theme labels, and entity labels. Introducing a similarity task into the multi-task network raises the difficulty of training the classification model, pulling features of same-class samples closer together while better expressing inter-sample differences. Secondly, for the finest-grained entity labels, an entity correction model with local attention head extensions is proposed: co-occurrence information from an external knowledge graph is introduced to revise the predictions of the upstream model, yielding more accurate entity labels. A film and television knowledge graph is constructed from semi-structured data collected from Douban, and an empirical study of the video tag classification model is conducted on it. Experimental results show that, firstly, jointly constraining the multi-task classification model with a cross-entropy loss and a similarity loss optimizes the feature representation, improving Top-1 accuracy by 3.70%, 3.35%, and 16.57% for content labels, theme labels, and entity labels respectively.
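The joint training objective described above can be sketched as follows. This is a minimal, schematic illustration only: the function names, the contrastive form of the similarity loss, and the weighting factor `alpha` are assumptions, not the paper's actual formulation.

```python
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for a single sample.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def similarity_loss(emb_a, emb_b, same_class, margin=0.5):
    # Contrastive-style loss (an assumed form): pull same-class
    # embeddings together, push different-class embeddings apart
    # until their cosine similarity drops below `margin`.
    cos = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    if same_class:
        return 1.0 - cos
    return max(0.0, cos - margin)

def joint_loss(logits, label, emb_a, emb_b, same_class, alpha=0.5):
    # The two losses impose a common constraint on the shared encoder;
    # `alpha` (hypothetical) balances classification against similarity.
    return cross_entropy(logits, label) + alpha * similarity_loss(emb_a, emb_b, same_class)
```

In a multi-task setup, one such cross-entropy term would exist per label level (content, theme, entity), with the similarity term acting on the shared feature embeddings.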
Secondly, for hard samples from the upstream model, the proposed entity correction model with global and local attention heads improves the Top-1 accuracy of entity labels from 38.7% to 45.6% once knowledge graph information is introduced. This work is a new attempt at multimodal video classification using image-text pair data, and offers a new research direction for short video classification when only a small number of labeled samples is available.
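One way the knowledge-graph co-occurrence signal could revise upstream predictions is by re-ranking candidate entities against an already-trusted anchor entity. The sketch below is a hypothetical simplification, not the paper's attention-based model: the anchor mechanism, score blending, and weight `beta` are all assumptions.

```python
def rerank_entities(scores, cooccur, anchor, beta=0.3):
    """Re-rank candidate entity scores using knowledge-graph co-occurrence.

    scores  -- dict mapping candidate entity -> model confidence
    cooccur -- dict mapping (anchor, entity) pairs -> co-occurrence count
               mined from the knowledge graph (hypothetical structure)
    anchor  -- an entity prediction already trusted (e.g. the film title)
    beta    -- assumed weight of the co-occurrence evidence
    """
    # Normalize counts so the boost is bounded by beta.
    max_co = max(cooccur.get((anchor, e), 0) for e in scores) or 1
    return {
        e: s + beta * cooccur.get((anchor, e), 0) / max_co
        for e, s in scores.items()
    }
```

For example, a weakly scored actor who frequently co-occurs with the recognized film in the graph would be promoted over a spurious look-alike with no graph connection.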

Key words: knowledge graph, video label classification, multimodal content understanding, entity correction