3D Point Cloud Object Tracking Based on Multi-level Fusion of Transformer Features

doi:10.3778/j.issn.1673-9418.2401071

Abstract

3D point cloud object tracking often suffers from occlusion, sparsity, and random noise. To address these challenges, this paper proposes a 3D point cloud object tracking method based on multi-level fusion of Transformer features. The method consists mainly of a point attention embedding module, used for feature extraction, and a point attention enhancement module, used for feature matching. First, two attention mechanisms are embedded into each other to form the point attention embedding module, which is combined with the relation-aware sampling method proposed by PTTR (point relation transformer for tracking) to extract features fully. The extracted features are then fed into the point attention enhancement module, where cross-attention matches features from different levels sequentially, deeply fusing global and local features. To obtain a discriminative feature fusion map, the fusion results from different layers are connected in a residual manner. Finally, the feature fusion map is fed into the target prediction module to predict the final 3D target precisely. Experiments on the KITTI, nuScenes, and Waymo datasets demonstrate the effectiveness of the proposed method. Excluding small-sample data, the method achieves an average improvement of 1.4 percentage points in success and 1.4 percentage points in precision for object tracking.

Keywords: 3D point cloud, Siamese network, object tracking, Transformer, feature fusion
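
The level-by-level matching described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`cross_attention`, `multi_level_fusion`), the single-head attention without learned projections, and the assumption that every level has the same number of points and channels are all simplifications introduced here for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context):
    """Single-head scaled dot-product cross-attention.

    Hypothetical simplification: no learned query/key/value projections.
    query: (Nq, d) search-region features; context: (Nc, d) template features.
    """
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d), axis=-1)  # (Nq, Nc)
    return weights @ context  # (Nq, d)


def multi_level_fusion(search_levels, template_levels):
    """Match features level by level and link the results residually.

    Assumes (for illustration only) that all levels share one shape.
    """
    fused = np.zeros_like(search_levels[0])
    for search, template in zip(search_levels, template_levels):
        # The query carries the fusion result of the previous levels.
        matched = cross_attention(search + fused, template)
        # Residual connection across levels.
        fused = fused + matched
    return fused


rng = np.random.default_rng(0)
search_levels = [rng.normal(size=(128, 32)) for _ in range(3)]
template_levels = [rng.normal(size=(64, 32)) for _ in range(3)]
fusion_map = multi_level_fusion(search_levels, template_levels)
print(fusion_map.shape)  # (128, 32)
```

Here the search-region features act as queries and the template features as keys and values, so the fused map keeps one row per search point, which is the form a prediction head would consume.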

LI Zhijie, LIANG Bowen, DING Xinmiao, GUO Wen. 3D Point Cloud Object Tracking Based on Multi-level Fusion of Transformer Features[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(11): 3006-3014.
