Dense Frame Rate Sampling Based Model for Video Caption Generation

doi:10.3778/j.issn.1673-9418.1705058

Journal of Frontiers of Computer Science and Technology, 2018, Vol. 12, Issue (6): 981-993. DOI: 10.3778/j.issn.1673-9418.1705058



TANG Pengjie1,2,3, TAN Yunlan2,4+, LI Jinzhong2,3,4, TAN Bin2,3,4

1. School of Mathematical and Physical Science, Jinggangshan University, Ji'an, Jiangxi 343009, China
2. Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation, Jinggangshan University, Ji'an, Jiangxi 343009, China
3. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
4. School of Electronics and Information Engineering, Jinggangshan University, Ji'an, Jiangxi 343009, China

Online: 2018-06-01 Published: 2018-06-06


Abstract

Video captioning methods that sample frames at fixed regular intervals tend to lose much static and dynamic information, which limits the quality of the generated sentences. To address this problem, this paper proposes a dense frame rate sampling based captioning model (DFS-CM). The video is first split into many fixed-length fragments, and all frames in each fragment are fed into a pre-trained CNN (convolutional neural network) model to extract features. The average or maximum of the CNN features in each fragment is then computed dimension by dimension. In addition, the training strategy of the pre-trained CNN model is improved, which strengthens the stability and generalization of the model. Finally, the model is built on the S2VT framework and evaluated with GoogLeNet and ResNet-152 features. Experimental results on the Youtube2Text dataset show that performance improves greatly over the baseline model. In particular, B@4 and CIDEr reach 47.1% and 34.1%, respectively, with ResNet-152 features and the maximum mode.

Key words: video, caption generation, GoogLeNet, ResNet, long short-term memory (LSTM), dense frame rate sampling
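
The fragment pooling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pool_fragments`, its parameters, and the toy feature values are hypothetical, and the per-frame CNN features are assumed to be already extracted. The abstract does not say how a trailing fragment shorter than the fixed length is handled, so keeping it as a shorter fragment is also an assumption.

```python
from statistics import fmean
from typing import List, Sequence


def pool_fragments(
    frame_features: Sequence[Sequence[float]],
    fragment_len: int,
    mode: str = "max",
) -> List[List[float]]:
    """Pool per-frame feature vectors into one vector per fixed-length fragment.

    frame_features: one feature vector per video frame (assumed precomputed
                    by a CNN such as GoogLeNet or ResNet-152).
    fragment_len:   number of consecutive frames per fragment (hypothetical
                    parameter; the abstract only says the length is fixed).
    mode:           "mean" or "max", applied independently to each dimension.
    """
    if fragment_len <= 0:
        raise ValueError("fragment_len must be positive")
    if mode not in ("mean", "max"):
        raise ValueError("mode must be 'mean' or 'max'")

    reduce = max if mode == "max" else fmean
    pooled: List[List[float]] = []
    for start in range(0, len(frame_features), fragment_len):
        # Assumption: a trailing fragment may contain fewer frames.
        fragment = frame_features[start:start + fragment_len]
        # zip(*fragment) regroups the values by feature dimension.
        pooled.append([reduce(values) for values in zip(*fragment)])
    return pooled


# Toy input: 5 frames with 2-dimensional features, 2 frames per fragment.
features = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0], [7.0, 8.0], [9.0, 6.0]]
max_pooled = pool_fragments(features, fragment_len=2, mode="max")
mean_pooled = pool_fragments(features, fragment_len=2, mode="mean")
```

With this toy input, five frame vectors are reduced to three fragment vectors, one per fragment, which would then form the input sequence to a sequence-to-sequence captioner such as S2VT.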

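Abstract (Chinese version): Sampling frames at a fixed time interval for video caption generation easily loses many kinds of static and dynamic information, which makes it hard to improve the quality of the generated sentences. To address this problem, a caption generation method using dense frame rate sampling (dense frame rate sampling based captioning model, DFS-CM) is proposed. The video is divided into multiple fragments of uniform length, deep CNN (convolutional neural network) features are extracted from all frames in each fragment, and a mean or maximum operation is then applied, which reduces the number of features and increases their sparsity. The training strategy of the model is also improved, strengthening its stability and generalization ability. Finally, on the basis of the S2VT framework, the proposed method is validated with two CNN models, GoogLeNet and ResNet-152. Experimental results on the Youtube2Text dataset show that model performance improves over the baseline model with both mean and maximum features. In particular, with ResNet-152 and the maximum mode, B@4 and CIDEr reach 47.1% and 34.1%, respectively.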


TANG Pengjie, TAN Yunlan, LI Jinzhong, TAN Bin. Dense Frame Rate Sampling Based Model for Video Caption Generation[J]. Journal of Frontiers of Computer Science and Technology, 2018, 12(6): 981-993.


