计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2025, Vol. 19, Issue (12): 3257-3266. DOI: 10.3778/j.issn.1673-9418.2508035

• Special Topic: Theory and Technology of Multimodal Large Models •

自适应路由与双阈值剪枝的多模态大模型检索增强感知

徐国愚,张一丹,魏笑,毛洋敏   

  1. 河南财经政法大学 数据科学与电子商务学院,郑州 450016
  2. 河南财经政法大学 计算机与信息工程学院,郑州 450016
  3. 河南财经政法大学 管理科学与工程学院,郑州 450016
  • 出版日期:2025-12-01 发布日期:2025-12-01

Retrieval-Augmented Perception in Multimodal Large Language Models via Adaptive Routing and Dual-Threshold Pruning

XU Guoyu, ZHANG Yidan, WEI Xiao, MAO Yangmin   

  1. School of Data Science and E-Commerce, Henan University of Economics and Law, Zhengzhou 450016, China
  2. School of Computer Science and Information Engineering, Henan University of Economics and Law, Zhengzhou 450016, China
  3. School of Management Science and Engineering, Henan University of Economics and Law, Zhengzhou 450016, China
  • Online: 2025-12-01 Published: 2025-12-01

摘要: 检索增强感知算法能有效提升多模态大模型对高分辨率图像的感知能力,具有重要应用价值。但是现有算法存在检索时间过长问题,难以满足系统实时性需求。为此,提出一种融合自适应路由机制与双阈值剪枝搜索策略的多模态大模型检索增强感知算法,以优化处理效率。设置了自适应路由机制,通过计算整图任务可行性概率,并结合问题空间复杂度与模型规模自适应设定动态阈值,实现对简单样本的有效预筛选,使其无需分块处理即可直接获得答案,从而从源头规避无效计算。针对必须处理的复杂样本,在树搜索过程中采用双阈值剪枝的搜索策略:第一级剪枝基于语义质量评分的动态衰减约束,提前终止低质量分支的扩展;第二级剪枝则基于置信度评分差异,对通过第一级剪枝的节点,进一步合并那些决策稳定性高、置信度相近的冗余路径,从而有效抑制搜索空间的膨胀。实验结果表明,在V*Bench、HR-Bench等数据集上,该方案在保持感知精度(准确度仅损失2个百分点以内)的同时,实现了检索效率的显著提升,在LLaVA-ov-0.5B模型上检索速度最高提升达48.3%,尤其适用于低资源场景下的部署应用。

关键词: 多模态大模型, 检索增强感知, 自适应路由机制, 双阈值剪枝搜索策略

Abstract: Retrieval-augmented perception algorithms can effectively improve the ability of multimodal large language models to perceive high-resolution images and therefore have significant practical value. However, existing algorithms suffer from long retrieval times and struggle to meet real-time system requirements. To address this problem, this paper proposes a retrieval-augmented perception algorithm for multimodal large language models that combines an adaptive routing mechanism with a dual-threshold pruning search strategy to improve processing efficiency. First, the adaptive routing mechanism computes the probability that a task can be solved from the whole image and sets a dynamic threshold according to the problem space complexity and the model size. Simple samples are thereby pre-screened and answered directly without splitting the image into blocks, which avoids unnecessary computation at the source. Second, for complex samples that must be processed, a dual-threshold pruning strategy is applied during tree search. The first-level pruning imposes a dynamically decaying constraint on semantic quality scores to stop expanding low-quality branches early. The second-level pruning uses differences in confidence scores: among the nodes that pass the first level, it merges redundant paths with high decision stability and similar confidence, which curbs the growth of the search space. Experimental results on datasets such as V*Bench and HR-Bench show that the proposed method significantly improves retrieval efficiency while maintaining perception accuracy (an accuracy loss of no more than 2 percentage points). On the LLaVA-ov-0.5B model, retrieval speed increases by up to 48.3%, making the method particularly suitable for deployment in low-resource scenarios.

Key words: multimodal large language models, retrieval-augmented perception, adaptive routing mechanism, dual-threshold pruning search strategy
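
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the threshold form in `route_directly`, the geometric decay in the first pruning level, the `stable` confidence floor, and every constant and name below are assumptions chosen only to make the control flow concrete.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    depth: int          # depth of the image crop in the search tree
    quality: float      # semantic quality score in [0, 1] (assumed scale)
    confidence: float   # answer confidence in [0, 1] (assumed scale)


def route_directly(p_feasible, complexity, model_size_b, base=0.5):
    """Adaptive routing: answer from the whole image if feasibility is high enough.

    Assumed threshold form: stricter for complex questions and for small models.
    """
    threshold = min(0.95, base + 0.3 * complexity + 0.1 / model_size_b)
    return p_feasible >= threshold


def dual_threshold_prune(nodes, tau_q=0.6, decay=0.8, delta=0.05, stable=0.7):
    """Return the nodes that should still be expanded after both pruning levels."""
    # Level 1: drop branches whose quality falls below a depth-decayed threshold.
    survivors = [n for n in nodes if n.quality >= tau_q * decay ** n.depth]

    # Level 2: among stable (high-confidence) survivors, merge paths whose
    # confidence is within delta of an already kept node; keep the best one.
    kept = []
    for n in sorted(survivors, key=lambda n: n.confidence, reverse=True):
        redundant = any(
            k.confidence >= stable
            and n.confidence >= stable
            and abs(k.confidence - n.confidence) <= delta
            for k in kept
        )
        if not redundant:
            kept.append(n)
    return kept


# Hypothetical candidates, for illustration only.
candidates = [
    Node("a", depth=1, quality=0.90, confidence=0.92),
    Node("b", depth=1, quality=0.85, confidence=0.90),  # merged into "a"
    Node("c", depth=1, quality=0.30, confidence=0.95),  # pruned at level 1
    Node("d", depth=2, quality=0.40, confidence=0.50),  # kept: not stable
]

if not route_directly(p_feasible=0.9, complexity=0.9, model_size_b=0.5):
    print([n.name for n in dual_threshold_prune(candidates)])
```

In this sketch a sample is sent to tree search only when the routing check fails, and the second pruning level sees only nodes that already passed the first, which mirrors the ordering stated in the abstract.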