Journal of Frontiers of Computer Science and Technology ›› 2018, Vol. 12 ›› Issue (6): 981-993. DOI: 10.3778/j.issn.1673-9418.1705058

• Artificial Intelligence and Pattern Recognition •

Dense Frame Rate Sampling Based Model for Video Caption Generation

TANG Pengjie1,2,3, TAN Yunlan2,4+, LI Jinzhong2,3,4, TAN Bin2,3,4

  1. School of Mathematical and Physical Science, Jinggangshan University, Ji'an, Jiangxi 343009, China
    2. Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation, Jinggangshan University, Ji'an, Jiangxi 343009, China
    3. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
    4. School of Electronics and Information Engineering, Jinggangshan University, Ji'an, Jiangxi 343009, China
  • Online: 2018-06-01 Published: 2018-06-06

Dense Frame Rate Sampling Based Model for Video Caption Generation

TANG Pengjie1,2,3, TAN Yunlan2,4+, LI Jinzhong2,3,4, TAN Bin2,3,4   

  1. School of Mathematical and Physical Science, Jinggangshan University, Ji'an, Jiangxi 343009, China
    2. Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation, Jinggangshan University, Ji'an, Jiangxi 343009, China
    3. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
    4. School of Electronics and Information Engineering, Jinggangshan University, Ji'an, Jiangxi 343009, China
  • Online:2018-06-01 Published:2018-06-06

Abstract: Sampling frames at fixed time intervals for video caption generation easily loses a variety of static and dynamic information, making it hard to further improve the quality of the generated sentences. To address this problem, a dense frame rate sampling based captioning model (DFS-CM) is proposed: the video is divided into multiple fragments of equal length, deep CNN (convolutional neural network) features are extracted for all frames in each fragment, and mean or max pooling is then applied to reduce the number of features and increase their sparsity. In addition, the training strategy of the model is improved, strengthening its stability and generalization ability. Finally, the proposed method is evaluated on top of the S2VT framework with two CNN models, GoogLeNet and ResNet-152. Experimental results on the Youtube2Text dataset show that, with either mean or max features, the model outperforms the baseline; in particular, with ResNet-152 and max pooling, B@4 and CIDEr reach 47.1% and 34.1%, respectively.

Key words: video, caption generation, GoogLeNet, ResNet, long short-term memory (LSTM), dense frame rate sampling

Abstract: Picking frames at fixed regular intervals for video captioning easily discards much static and dynamic information, so the quality of the generated sentences is hard to improve further. To address this challenge, this paper proposes a dense frame rate sampling based captioning model (DFS-CM). The video is first split into several fragments of fixed length, and all frames in each fragment are fed into a pre-trained CNN (convolutional neural network) to extract features. The features of each fragment are then reduced by taking the average or the maximum along each dimension. Additionally, the training strategy of the model is improved, which strengthens its stability and generalization. Finally, the model is evaluated on the S2VT framework with GoogLeNet and ResNet-152 features. Experimental results on the Youtube2Text dataset show that the performance improves considerably over the baseline model. In particular, B@4 and CIDEr reach 47.1% and 34.1% respectively with ResNet-152 and maximum pooling.
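
As a rough illustration of the fragment-wise pooling described in the abstract, the sketch below reduces per-frame CNN features to one feature vector per fragment by dimension-wise mean or max pooling. It is a minimal sketch under stated assumptions, not the authors' implementation: the fragment count, feature dimensionality, and the helper name segment_pool are chosen for the example only.

```python
import numpy as np

def segment_pool(frame_features, num_segments=16, mode="max"):
    """Pool per-frame CNN features into a fixed number of fragment-level features.

    frame_features: (T, D) array, one deep CNN feature vector per video frame.
    num_segments:   number of consecutive, roughly equal-length fragments.
    mode:           "mean" or "max" pooling applied dimension-wise inside a fragment.
    Returns a (num_segments, D) array that would feed an S2VT-style encoder.
    """
    T, _ = frame_features.shape
    # Boundaries of num_segments consecutive fragments covering all T frames.
    bounds = np.linspace(0, T, num_segments + 1, dtype=int)
    pooled = []
    for i in range(num_segments):
        start, end = bounds[i], max(bounds[i] + 1, bounds[i + 1])
        chunk = frame_features[start:end]  # all frames of one fragment
        pooled.append(chunk.max(axis=0) if mode == "max" else chunk.mean(axis=0))
    return np.stack(pooled)

# Example: 240 frames of 2048-d ResNet-152 features reduced to 16 fragment features.
feats = np.random.rand(240, 2048).astype(np.float32)
print(segment_pool(feats, num_segments=16, mode="max").shape)  # (16, 2048)
```

Compared with sampling one frame per fixed interval, pooling over every frame of a fragment lets information from all frames contribute to the encoder input while keeping the sequence length fed to the LSTM fixed.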

Key words: video, caption generation, GoogLeNet, ResNet, long short-term memory (LSTM), dense frame rate sampling