Scene Graph Generation Method Based on External Information Guidance and Residual Scrambling

doi:10.3778/j.issn.1673-9418.2007007

Abstract

Scene graphs represent the semantics and organizational structure of visual scene content, which supports visual understanding and interpretable reasoning and has made them a hot topic in computer vision research. However, annotations of objects and of the relationships between objects in existing visual scene datasets are imbalanced, so existing scene graph generation methods are affected by dataset bias. This paper studies the data imbalance problem in scene graphs and proposes a scene graph generation method that combines external information guidance with residual scrambling (EGRES) to mitigate the negative impact of dataset bias on scene graph generation. The method uses unbiased commonsense knowledge from an external knowledge base to regularize the semantic space of the scene graph, which alleviates the imbalanced distribution of relationship data in the dataset and improves the generalization ability of scene graph generation. It then fuses visual features with the extracted commonsense knowledge through residual scrambling to regularize the scene graph generation network. Comparison and ablation experiments on the VG dataset show that the proposed method effectively improves scene graph generation. Comparison experiments across the different labels in the dataset show that the method improves generation performance for the vast majority of relationship categories, especially medium- and low-frequency ones, which greatly alleviates the data annotation imbalance problem and yields better results than existing scene graph generation methods.

Key words: dataset bias, residual scrambling, external knowledge base, scene graph generation

TIAN Xin, JI Yi, GAO Haiyan, LIN Xin, LIU Chunping. Scene Graph Generation Method Based on External Information Guidance and Residual Scrambling[J]. Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2021, 15(10): 1958-1968.
