[1] 李晓明, 李锋. 刑事案件涉案财物处置的困境与应对: 以J省N市检察机关评查的1299件刑事案件为样本[J]. 人民检察, 2024(22): 69-71.
LI X M, LI F. The disposal dilemma of property involved in criminal cases and its response: a study based on a sample of 1299 criminal cases reviewed by the procuratorate of N City, J Province[J]. People's Procuratorial Semimonthly, 2024(22): 69-71.
[2] ANTOL S, AGRAWAL A, LU J S, et al. VQA: visual question answering[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2015: 2425-2433.
[3] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: elevating the role of image understanding in visual question answering[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 6325-6334.
[4] BAO H B, WANG W H, DONG L, et al. VLMo: unified vision-language pre-training with mixture-of-modality-experts[EB/OL]. [2024-12-19]. https://arxiv.org/abs/2111.02358.
[5] WANG W H, BAO H B, DONG L, et al. Image as a foreign language: BEiT pretraining for vision and vision-language tasks[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 19175-19186.
[6] CHEN X, WANG X, CHANGPINYO S, et al. PaLI: a jointly-scaled multilingual language-image model[EB/OL]. [2024-12-19]. https://arxiv.org/abs/2209.06794.
[7] YANG Z Y, GAN Z, WANG J F, et al. An empirical study of GPT-3 for few-shot knowledge-based VQA[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(3): 3081-3089.
[8] HU Y S, HUA H, YANG Z Y, et al. PromptCap: prompt-guided image captioning for VQA with GPT-3[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 2951-2963.
[9] SHAO Z W, YU Z, WANG M, et al. Prompting large language models with answer heuristics for knowledge-based visual question answering[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 14974-14983.
[10] HU Z J, YANG P, JIANG Y S, et al. Prompting large language model with context and pre-answer for knowledge-based VQA[J]. Pattern Recognition, 2024, 151: 110399.
[11] ZHU Z H, YU J, WANG Y J, et al. Mucko: multi-layer cross-modal knowledge reasoning for fact-based visual question answering[EB/OL]. [2024-12-19]. https://arxiv.org/abs/2006.09073.
[12] MARINO K, CHEN X L, PARIKH D, et al. KRISP: integrating implicit and symbolic knowledge for open-domain knowledge-based VQA[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 14111-14121.
[13] LAURIOLA I, LAVELLI A, AIOLLI F. An introduction to deep learning in natural language processing: models, techniques, and tools[J]. Neurocomputing, 2022, 470: 443-456.
[14] ZHOU C, LI Q, LI C, et al. A comprehensive survey on pretrained foundation models: a history from BERT to ChatGPT[J/OL]. International Journal of Machine Learning and Cybernetics [2024-12-20]. https://doi.org/10.1007/s13042-024-02443-6.
[15] GUO M H, XU T X, LIU J J, et al. Attention mechanisms in computer vision: a survey[J]. Computational Visual Media, 2022, 8(3): 331-368.
[16] LIU Y, ZHANG Y, WANG Y X, et al. A survey of visual transformers[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(6): 7478-7498.
[17] JOHNSON J, HARIHARAN B, VAN DER MAATEN L, et al. CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1988-1997.
[18] HUDSON D A, MANNING C D. GQA: a new dataset for real-world visual reasoning and compositional question answering[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 6693-6702.
[19] SINGH A, NATARAJAN V, SHAH M, et al. Towards VQA models that can read[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 8317-8326.
[20] MISHRA A, SHEKHAR S, SINGH A K, et al. OCR-VQA: visual question answering by reading text in images[C]//Proceedings of the 2019 International Conference on Document Analysis and Recognition. Piscataway: IEEE, 2019: 947-952.
[21] MARINO K, RASTEGARI M, FARHADI A, et al. OK-VQA: a visual question answering benchmark requiring external knowledge[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 3195-3204.
[22] SCHWENK D, KHANDELWAL A, CLARK C, et al. A-OKVQA: a benchmark for visual question answering using world knowledge[C]//Proceedings of the 17th European Conference on Computer Vision. Cham: Springer, 2022: 146-162.
[23] WU J L, LU J S, SABHARWAL A, et al. Multi-modal answer validation for knowledge-based VQA[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(3): 2712-2721.
[24] VRANDEČIĆ D, KRÖTZSCH M. Wikidata: a free collaborative knowledgebase[J]. Communications of the ACM, 2014, 57(10): 78-85.
[25] SPEER R, CHIN J, HAVASI C. ConceptNet 5.5: an open multilingual graph of general knowledge[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1): 4444-4451.
[26] BROWN T B, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2020: 1877-1901.
[27] TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: open and efficient foundation language models[EB/OL]. [2024-12-20]. https://arxiv.org/abs/2302.13971.
[28] HONG W Y, WANG W H, DING M, et al. CogVLM2: visual language models for image and video understanding[EB/OL]. [2024-12-20]. https://arxiv.org/abs/2408.16500.
[29] ABDIN M, ANEJA J, AWADALLA H, et al. Phi-3 technical report: a highly capable language model locally on your phone[EB/OL]. [2024-12-20]. https://arxiv.org/abs/2404.14219.
[30] SAHOO P, SINGH A K, SAHA S, et al. A systematic survey of prompt engineering in large language models: techniques and applications[EB/OL]. [2024-12-20]. https://arxiv.org/abs/2402.07927.
[31] DONG Q X, LI L, DAI D M, et al. A survey on in-context learning[EB/OL]. [2024-12-20]. https://arxiv.org/abs/2301.00234.
[32] WEI J, WANG X Z, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]//Proceedings of the 36th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2022: 24824-24837.
[33] LEWIS P, PEREZ E, PIKTUS A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates, 2020: 9459-9474.
[34] LI S Y, TANG Y, CHEN S Z, et al. Conan-embedding: general text embedding with more and better negative samples[EB/OL]. [2024-12-20]. https://arxiv.org/abs/2408.15710.
[35] HE B L, CHEN N, HE X R, et al. Retrieving, rethinking and revising: the chain-of-verification can improve retrieval augmented generation[EB/OL]. [2024-12-20]. https://arxiv.org/abs/2410.05801.
[36] TEAM GLM, ZENG A H, XU B, et al. ChatGLM: a family of large language models from GLM-130B to GLM-4 all tools[EB/OL]. [2024-12-20]. https://arxiv.org/abs/2406.12793.
[37] JIANG H Q, WU Q H, LUO X F, et al. LongLLMLingua: accelerating and enhancing LLMs in long context scenarios via prompt compression[C]//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2024: 1658-1677.
[38] LIU H, LI C, LI Y, et al. LLaVA-NeXT: improved reasoning, OCR, and world knowledge[EB/OL]. [2024-12-21]. https://llava-vl.github.io/blog/2024-01-30-llava-next/.
[39] RAM O, LEVINE Y, DALMEDIGOS I, et al. In-context retrieval-augmented language models[J]. Transactions of the Association for Computational Linguistics, 2023, 11: 1316-1331.
[40] WANG Y, SUN Q, HE S. M3E: moka massive mixed embedding model[EB/OL]. [2024-12-21]. https://github.com/wangyingdong/m3e-base/blob/main/README.md.