计算机科学与探索 (Journal of Frontiers of Computer Science and Technology) ›› 2025, Vol. 19 ›› Issue (1): 30-44. DOI: 10.3778/j.issn.1673-9418.2403009
郭佳霖,智敏,殷雁君,葛湘巍
GUO Jialin, ZHI Min, YIN Yanjun, GE Xiangwei
Online: 2025-01-01
Published: 2024-12-31
Abstract: Convolutional neural networks (CNN) and the Vision Transformer are the two most important deep learning models in image processing today; after years of sustained research and refinement, both have achieved remarkable results in this field. In recent years, hybrid models combining CNN and Vision Transformer have been gaining ground: a growing body of work overcomes the weaknesses of each model while efficiently exploiting their respective strengths, delivering excellent performance on image processing tasks. This paper presents an in-depth review of CNN-Vision Transformer hybrid models. It first outlines the architectures, strengths, and weaknesses of the CNN and Vision Transformer models, and summarizes the concept and advantages of hybrid models. It then comprehensively reviews the research status and practical progress of hybrid models along four lines: serial fusion structures, parallel fusion structures, hierarchically interleaved (cross-layer) fusion structures, and other fusion approaches; the main representative models of each fusion approach are summarized and analyzed, and typical hybrid models are evaluated and compared from multiple perspectives. The paper further surveys applications of hybrid models in specific image processing domains such as image recognition, image classification, object detection, and image segmentation, demonstrating their applicability and efficiency in practice. Finally, future research directions for hybrid models are analyzed in depth, and prospects for their subsequent research and application in image processing are given.
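To make the serial and parallel fusion patterns mentioned in the abstract concrete, the following is a minimal PyTorch sketch, not drawn from the paper or from any specific surveyed model: the class names SerialHybrid and ParallelHybrid, the layer sizes, and the concatenation-based fusion are illustrative assumptions only. The serial variant feeds CNN feature maps into a Transformer encoder as a token sequence; the parallel variant runs a CNN branch and a Transformer branch side by side and fuses their pooled features before the classification head.

```python
# Minimal illustrative sketch (assumed layer sizes, not from the survey).
import torch
import torch.nn as nn


class SerialHybrid(nn.Module):
    """Serial fusion: a CNN stem extracts local features, then a Transformer
    encoder models global dependencies over the resulting token sequence."""

    def __init__(self, in_ch=3, dim=64, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(                    # CNN stage: downsample and embed
            nn.Conv2d(in_ch, dim, kernel_size=7, stride=4, padding=3),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        f = self.stem(x)                              # (B, dim, H/4, W/4)
        tokens = f.flatten(2).transpose(1, 2)         # (B, N, dim) token sequence
        tokens = self.encoder(tokens)                 # global self-attention
        return self.head(tokens.mean(dim=1))          # pooled classification


class ParallelHybrid(nn.Module):
    """Parallel fusion: a CNN branch and a Transformer branch process the same
    input side by side; their features are concatenated before the head."""

    def __init__(self, in_ch=3, dim=64, num_classes=10):
        super().__init__()
        self.cnn_branch = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=3, stride=4, padding=1),
            nn.GELU(),
            nn.AdaptiveAvgPool2d(1),                  # (B, dim, 1, 1)
        )
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=8, stride=8)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.vit_branch = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(2 * dim, num_classes)   # fuse by concatenation

    def forward(self, x):
        local_feat = self.cnn_branch(x).flatten(1)             # (B, dim)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        global_feat = self.vit_branch(tokens).mean(dim=1)      # (B, dim)
        return self.head(torch.cat([local_feat, global_feat], dim=1))


if __name__ == "__main__":
    x = torch.randn(2, 3, 224, 224)
    print(SerialHybrid()(x).shape)    # torch.Size([2, 10])
    print(ParallelHybrid()(x).shape)  # torch.Size([2, 10])
```

Hierarchically interleaved (cross-layer) fusion, the third family the survey covers, would instead exchange features between the two branches at several intermediate stages rather than only at the output; it is omitted here to keep the sketch short.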
郭佳霖, 智敏, 殷雁君, 葛湘巍. 图像处理中CNN与视觉Transformer混合模型研究综述[J]. 计算机科学与探索, 2025, 19(1): 30-44.
GUO Jialin, ZHI Min, YIN Yanjun, GE Xiangwei. Review of Research on CNN and Visual Transformer Hybrid Models in Image Processing[J]. Journal of Frontiers of Computer Science and Technology, 2025, 19(1): 30-44.