Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (2): 345-362. DOI: 10.3778/j.issn.1673-9418.2305057
• Frontiers·Surveys •
QI Xuanhao, ZHI Min
Online: 2024-02-01
Published: 2024-02-01
QI Xuanhao, ZHI Min. Review of Attention Mechanisms in Image Processing[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(2): 345-362.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2305057