Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (11): 3006-3014. DOI: 10.3778/j.issn.1673-9418.2401071

• Graphics and Image •

3D Point Cloud Object Tracking Based on Multi-level Fusion of Transformer Features

LI Zhijie, LIANG Bowen, DING Xinmiao, GUO Wen   

  1. School of Information and Electronic Engineering, Shandong Technology and Business University, Yantai, Shandong 264000, China
  2. School of Computer Science and Technology, Shandong Technology and Business University, Yantai, Shandong 264000, China
  • Online: 2024-11-01  Published: 2024-10-31

Abstract: During 3D point cloud object tracking, issues such as occlusion, point sparsity, and random noise often arise. To address these challenges, this paper proposes a novel 3D point cloud object tracking method based on multi-level fusion of Transformer features. The method mainly consists of a point attention embedding module and a point attention enhancement module, used in the feature extraction and feature matching stages, respectively. Firstly, two attention mechanisms are embedded into each other to form the point attention embedding module, which is fused with the relation-aware sampling method proposed by PTTR (point relation transformer for tracking) so that features are extracted sufficiently. Subsequently, the extracted features are fed into the point attention enhancement module, where cross-attention matches features from different levels in sequence to achieve deep fusion of global and local features. Moreover, to obtain a discriminative feature fusion map, the fusion results from different layers are connected in a residual manner. Finally, the feature fusion map is fed into the target prediction module to produce a precise prediction of the final 3D target object. Experiments on the KITTI, nuScenes, and Waymo datasets demonstrate the effectiveness of the proposed method. Excluding the few-shot categories, it achieves an average improvement of 1.4 percentage points in tracking success and 1.4 percentage points in tracking precision.
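To make the pipeline described in the abstract concrete, below is a minimal PyTorch-style sketch of the two modules, written from the abstract alone rather than from the authors' released code: all class names, dimensions, and structural details (e.g. how the two attention passes are nested, and that per-level fusion results are summed residually) are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class PointAttentionEmbedding(nn.Module):
    """Feature extraction: two attention mechanisms embedded into each
    other, applied to sampled point features (structure assumed)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) point features, e.g. after relation-aware sampling.
        h, _ = self.attn_a(x, x, x)
        x = self.norm_a(x + h)          # first attention pass + residual
        h, _ = self.attn_b(x, x, x)
        return self.norm_b(x + h)       # second pass embedded on the first


class PointAttentionEnhancement(nn.Module):
    """Feature matching: cross-attention matches template and search
    features level by level; per-level results are combined through
    residual connections into one discriminative fusion map."""

    def __init__(self, dim: int = 256, num_levels: int = 3, heads: int = 4):
        super().__init__()
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_levels))
        self.norms = nn.ModuleList(
            nn.LayerNorm(dim) for _ in range(num_levels))

    def forward(self, search_feats, template_feats):
        # Both arguments: lists of (B, N, dim) tensors, one per level
        # (local -> global); equal point counts per level are assumed.
        fused = None
        for attn, norm, s, t in zip(self.cross, self.norms,
                                    search_feats, template_feats):
            h, _ = attn(query=s, key=t, value=t)   # search attends to template
            level = norm(s + h)
            fused = level if fused is None else fused + level  # residual link
        return fused  # fusion map passed on to the target prediction head


# Shape check with random features for three levels.
B, N, D = 2, 128, 256
search = [torch.randn(B, N, D) for _ in range(3)]
template = [torch.randn(B, N, D) for _ in range(3)]
fusion_map = PointAttentionEnhancement(D, 3)(search, template)  # (B, N, D)
```

In a full tracker, the fusion map would feed the target prediction head that regresses the 3D bounding box, and the embedding module would run on top of PTTR's relation-aware sampling; neither stage is shown here.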

Key words: 3D point cloud, Siamese network, object tracking, Transformer, feature fusion