Journal of Frontiers of Computer Science and Technology ›› 2023, Vol. 17 ›› Issue (7): 1487-1505. DOI: 10.3778/j.issn.1673-9418.2303025
• Frontiers·Surveys •
Review of Visual Question Answering Technology

WANG Yu (王虞), SUN Haichun (孙海春)

Online: 2023-07-01
Published: 2023-07-01
WANG Yu, SUN Haichun. Review of Visual Question Answering Technology[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(7): 1487-1505.
王虞, 孙海春. 视觉问答技术研究综述[J]. 计算机科学与探索, 2023, 17(7): 1487-1505.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2303025
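Since the journal offers citation export in EndNote, RIS, and BibTeX formats, a BibTeX entry assembled from the metadata above may be convenient; it is a hand-built sketch from this page, and the entry key wang2023vqareview is an arbitrary choice rather than an official identifier.

% Hand-assembled from the page metadata; the entry key is arbitrary.
@article{wang2023vqareview,
  author  = {Wang, Yu and Sun, Haichun},
  title   = {Review of Visual Question Answering Technology},
  journal = {Journal of Frontiers of Computer Science and Technology},
  year    = {2023},
  volume  = {17},
  number  = {7},
  pages   = {1487--1505},
  doi     = {10.3778/j.issn.1673-9418.2303025},
  url     = {http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2303025}
}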
Related Articles

[1] CAO Yingli, DENG Zhaohong, HU Shudong, WANG Shitong. Classification of Alzheimer's Disease Integrating Individual Feature and Fusion Feature[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(7): 1658-1668.
[2] SHI Yucheng, WU Yun, LONG Huiyun. Cross-Modal Fusion of RGB-D Salient Detection for Advanced Semantic Repair Strategy[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(1): 140-153.
[3] HONG Huiqun, SHEN Guiping, HUANG Fenghua. Summary of Expression Recognition Technology[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(8): 1764-1778.
[4] LIU Jiming, ZHANG Peixiang, LIU Ying, ZHANG Weidong, FANG Jie. Summary of Multi-modal Sentiment Analysis Technology[J]. Journal of Frontiers of Computer Science and Technology, 2021, 15(7): 1165-1182.
[5] LI Haichao, LI Chenglong, TANG Jin, LUO Bin. Research on Fusion Algorithm for Thermal and Visible Images[J]. Journal of Frontiers of Computer Science and Technology, 2016, 10(3): 407-413.