Overview of Visual Inertial Odometry Technology Based on Deep Learning

doi:10.3778/j.issn.1673-9418.2209014

Abstract

Visual-inertial odometry (VIO) exploits the complementary strengths of visual and inertial sensors to deliver high-precision 6-DOF navigation and positioning, and it is therefore used in a very wide range of applications. However, the sensors' own errors, disturbances from abnormal visual environments, and spatiotemporal calibration errors between the sensors all interfere with the navigation result and reduce its accuracy. In recent years, deep learning has developed rapidly, and its powerful data processing and prediction capabilities offer a new direction for VIO. This paper reviews the main achievements of deep-learning-based VIO. First, the research methods are summarized according to two fusion strategies: methods that combine deep learning with traditional models, and end-to-end methods based on deep learning. Then, the methods are divided by learning type into supervised and unsupervised/self-supervised approaches, and the model structure of each is described. Next, the optimization and evaluation methods for these systems are summarized, and the performance of some representative methods is compared. Finally, the key difficulties that remain to be solved in this field are summarized, and future developments are discussed.

Keywords: visual-inertial odometry, feature fusion, deep learning, network model, pose

WANG Wensen, HUANG Fengrong, WANG Xu, LIU Qinglin, YI Boheng. Overview of Visual Inertial Odometry Technology Based on Deep Learning[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(3): 549-560.
