Semantic Segmentation Algorithm of Multi-level Feature Fusion Network

doi:10.3778/j.issn.1673-9418.2106110

Abstract

Abstract: Existing methods struggle to classify pixels accurately near object boundaries because objects appear at multiple scales and high-level semantic information is insufficient. To solve this problem, this paper proposes a semantic segmentation algorithm based on multi-level feature fusion. In the decoding stage, this paper designs three feature extraction branches: a spatial detail branch, a semantic supplement branch, and a context information branch. The spatial detail branch uses high-resolution shallow feature maps to directly generate the final segmentation map, which preserves abundant spatial detail. The semantic supplement branch captures high-level semantic abstraction information. The context information branch extracts multi-scale information. In the semantic supplement branch, this paper designs a feature fusion guidance module (FFGM) that models the correspondence between pixels on different feature maps, so that features from different levels can be fused effectively. In the spatial detail branch, this paper proposes a self-enhancement module (SEM) that refines low-level features to obtain clear boundary regions. In the context information branch, this paper uses a pyramid pooling module (PPM) to obtain multi-scale context information, which corrects the misclassification of boundary pixels caused by objects of multiple scales. Finally, an attention mechanism fuses the features from the three branches, enhancing important features and suppressing non-salient ones. Experimental results show that the proposed method achieves mIoU of 81.12% and 74.56% on the PASCAL VOC2012 and Cityscapes datasets, respectively, clearly outperforming the compared methods.

Key words: multi-level feature fusion, context information, semantic segmentation, atrous convolution, attention mechanism
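
The mIoU figures reported in the abstract are mean intersection-over-union scores. As a minimal sketch of how such a score can be computed, the following pure-Python function builds a confusion matrix from flattened label sequences and averages the per-class IoU. The function name `mean_iou`, the `ignore_index` value of 255, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for illustration; they are not taken from the paper, and benchmark toolkits usually accumulate the confusion matrix over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes present in pred or target.

    pred and target are flat sequences of integer class labels of equal length.
    ignore_index (assumed 255 here) marks ground-truth pixels to skip.
    """
    # conf[t][p] counts pixels whose true class is t and predicted class is p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        conf[t][p] += 1

    ious = []
    for c in range(num_classes):
        tp = conf[c][c]                                          # true positives
        fn = sum(conf[c]) - tp                                   # missed pixels of class c
        fp = sum(conf[r][c] for r in range(num_classes)) - tp    # wrongly labeled as c
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both pred and target
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.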

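Abstract: Owing to the multi-scale nature of objects and insufficient high-level semantic information, existing algorithms struggle to achieve highly accurate classification at object boundaries. To address this, a semantic segmentation algorithm based on multi-level feature fusion is proposed. In the decoding stage, three feature extraction branches are designed: a spatial detail branch, a semantic supplement branch, and a context information branch. The spatial detail branch uses shallow, higher-resolution feature maps to generate the final segmentation map, mainly to preserve abundant spatial detail. The semantic supplement branch adds more high-level semantic abstraction information. The context information branch is mainly responsible for extracting multi-scale global information. In the semantic supplement branch, a feature fusion guidance module (FFGM) is designed to model the correspondence between pixels on different feature maps, so that features from different levels are fused effectively. In the spatial detail branch, a self-enhancement module (SEM) is proposed to refine low-level features, with the aim of obtaining clear object boundaries. In the context information branch, a pyramid pooling module (PPM) is used to obtain multi-scale context information, addressing the pixel misclassification caused by the multi-scale nature of objects. Finally, an attention mechanism fuses the feature maps extracted by the three branches, strengthening important features and suppressing non-salient ones. On the mainstream semantic segmentation datasets PASCAL VOC2012 and Cityscapes, the network achieves mean intersection-over-union of 81.12% and 74.56%, respectively, clearly outperforming the algorithms compared in the experiments.

Keywords: multi-level feature fusion, context information, semantic segmentation, atrous convolution, attention mechanism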
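
The context information branch in the abstract relies on a pyramid pooling module to gather context at several scales. The sketch below illustrates only the pooling idea on a single-channel feature map stored as a list of lists: the map is average-pooled into grids of several sizes, and each pooled grid is upsampled back to the input resolution. The bin sizes `(1, 2, 3, 6)` follow the common PSPNet setting and are an assumption, since the abstract does not state them. The helper names are hypothetical, nearest-neighbor upsampling stands in for the bilinear interpolation typically used, and the 1x1 convolutions of a real PPM are omitted.

```python
import math


def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2D map (list of rows) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        # Bin boundaries: floor for the start, ceil for the end, so no bin is empty.
        r0, r1 = (i * h) // bins, math.ceil((i + 1) * h / bins)
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, math.ceil((j + 1) * w / bins)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(pooled, h, w):
    """Nearest-neighbor upsampling of a square grid back to h x w."""
    bins = len(pooled)
    return [[pooled[(r * bins) // h][(c * bins) // w] for c in range(w)]
            for r in range(h)]


def pyramid_pool(fmap, bin_sizes=(1, 2, 3, 6)):
    """Return the input map followed by one upsampled context map per bin size."""
    h, w = len(fmap), len(fmap[0])
    levels = [fmap]
    for b in bin_sizes:
        levels.append(upsample_nearest(adaptive_avg_pool(fmap, b), h, w))
    return levels
```

In a full network the returned maps would be concatenated along the channel dimension and passed to later layers; here they are simply returned as a list so that each scale can be inspected.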

QI Xin, YUAN Feiniu, SHI Jinting, WANG Guiqian. Semantic Segmentation Algorithm of Multi-level Feature Fusion Network[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(4): 922-932.

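祁欣, 袁非牛, 史劲亭, 王贵黔. Semantic segmentation algorithm of multi-level feature fusion network[J]. 计算机科学与探索 (Journal of Frontiers of Computer Science and Technology), 2023, 17(4): 922-932.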
