[1] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. [2025-06-21]. https://arxiv.org/abs/2010.11929.
[2] ZHOU S J, ZHANG R Y, ZHU H S, et al. Multimodal LLMs as customized reward models for text-to-image generation[EB/OL]. [2025-08-01]. https://arxiv.org/abs/2507.21391.
[3] YAO H J, ZHANG R F, HUANG J X, et al. A survey on agentic multimodal large language models[EB/OL]. [2025-10-15]. https://arxiv.org/abs/2510.10991.
[4] YIN S K, FU C Y, ZHAO S R, et al. A survey on multimodal large language models[EB/OL]. [2025-06-27]. https://arxiv.org/abs/2306.13549.
[5] 秦小林, 古徐, 李弟诚, 等. 大语言模型综述与展望[J]. 计算机应用, 2025, 45(3): 685-696.
QIN X L, GU X, LI D C, et al. Survey and prospect of large language models[J]. Journal of Computer Applications, 2025, 45(3): 685-696.
[6] 吴信东, 黄满宗, 卜晨阳. BEKO: 大语言模型与知识图谱的双向增强[J]. 计算机学报, 2025, 48(7): 1572-1588.
WU X D, HUANG M Z, BU C Y. BEKO: bidirectional enhancement with a knowledge ocean for LLMs and KGs[J]. Chinese Journal of Computers, 2025, 48(7): 1572-1588.
[7] 任泽裕, 王振超, 柯尊旺, 等. 多模态数据融合综述[J]. 计算机工程与应用, 2021, 57(18): 49-64.
REN Z Y, WANG Z C, KE Z W, et al. Survey of multimodal data fusion[J]. Computer Engineering and Applications, 2021, 57(18): 49-64.
[8] 姜丽梅, 李秉龙. 面向图像文本的多模态处理方法综述[J]. 计算机应用研究, 2024, 41(5): 1281-1290.
JIANG L M, LI B L. Comprehensive review of multimodal processing methods for image-text[J]. Application Research of Computers, 2024, 41(5): 1281-1290.
[9] 杜佳俊, 兰红. 基于多模态特征融合的图像编辑模型[J]. 计算机科学与应用, 2024, 14(6): 164-176.
DU J J, LAN H. Image editing model based on multimodal feature fusion[J]. Computer Science and Application, 2024, 14(6): 164-176.
[10] BAI J Z, BAI S, YANG S S, et al. Qwen-VL: a versatile vision-language model for understanding, localization, text reading, and beyond[EB/OL]. [2025-06-27]. https://arxiv.org/abs/2308.12966.
[11] ABDIN M, ANEJA J, AWADALLA H, et al. Phi-3 technical report: a highly capable language model locally on your phone[EB/OL]. [2025-06-27]. https://arxiv.org/abs/2404.14219.
[12] 梁敏, 刘佳艺, 李杰. 融合迭代反馈与注意力机制的图像超分辨重建方法[J]. 计算机应用, 2023, 43(7): 2280-2287.
LIANG M, LIU J Y, LI J. Image super-resolution reconstruction method based on iterative feedback and attention mechanism[J]. Journal of Computer Applications, 2023, 43(7): 2280-2287.
[13] LIU H T, LI C Y, LI Y H, et al. LLaVA-NeXT: improved reasoning, OCR, and world knowledge[EB/OL]. [2025-07-22]. https://llava-vl.github.io/blog/2024-01-30-llava-next/.
[14] CHEN Z, WANG W Y, TIAN H, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites[EB/OL]. [2025-07-31]. https://arxiv.org/abs/2404.16821.
[15] LUO G, ZHOU Y Y, ZHANG Y X, et al. Feast your eyes: mixture-of-resolution adaptation for multimodal large language models[EB/OL]. [2025-07-31]. https://arxiv.org/abs/2403.03003.
[16] GE C J, CHENG S J, WANG Z M, et al. ConvLLaVA: hierarchical backbones as visual encoder for large multimodal models[EB/OL]. [2025-07-31]. https://arxiv.org/abs/2405.15738.
[17] KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 3992-4003.
[18] WANG W, DING L, ZENG M, et al. Divide, conquer and combine: a training-free framework for high-resolution image perception in multimodal large language models[C]//Proceedings of the 39th AAAI Conference on Artificial Intelligence and the 37th Conference on Innovative Applications of Artificial Intelligence and the 15th Symposium on Educational Advances in Artificial Intelligence. Palo Alto: AAAI, 2025.
[19] SHEN H Z, ZHAO K J, ZHAO T C, et al. ZoomEye: enhancing multimodal LLMs with human-like zooming capabilities through tree-based image exploration[C]//Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2025.
[20] WANG W B, JING Y C, DING L, et al. Retrieval-augmented perception: high-resolution image perception meets visual RAG[EB/OL]. [2025-07-22]. https://arxiv.org/abs/2503.01222.
[21] YU S, TANG C Y, XU B K, et al. VisRAG: vision-based retrieval-augmented generation on multi-modality documents[EB/OL]. [2025-07-28]. https://arxiv.org/abs/2410.10594.
[22] WU P H, XIE S N. V*: guided visual search as a core mechanism in multimodal LLMs[C]//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2024: 13084-13094.
[23] GU S H, LUGMAYR A, DANELLJAN M, et al. DIV8K: diverse 8K resolution image dataset[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops. Piscataway: IEEE, 2019: 3512-3516.
[24] LI B, ZHANG Y H, GUO D, et al. LLaVA-OneVision: easy visual task transfer[EB/OL]. [2025-08-12]. https://arxiv.org/abs/2408.03326.
[25] LIU H T, LI C Y, LI Y H, et al. Improved baselines with visual instruction tuning[C]//Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2024: 26286-26296.
[26] 陆春霞, 马少辉. 基于网格搜索的船体不规则分段动态堆放方法[J]. 计算机应用, 2013, 33(2): 333-337.
LU C X, MA S H. Approach on irregular block dynamic stacking in shipbuilding based on grid search[J]. Journal of Computer Applications, 2013, 33(2): 333-337.