Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (2): 320-344. DOI: 10.3778/j.issn.1673-9418.2310092
PENG Bin, BAI Jing, LI Wenjing, ZHENG Hu, MA Xiangyu
Online: 2024-02-01
Published: 2024-02-01
Abstract: The Transformer is a deep learning model built on the self-attention mechanism, and it has shown great potential in computer vision. In image classification, the key challenge is to capture the local and global features of an input image both efficiently and accurately. Traditional approaches extract local features with the lower layers of a convolutional neural network and enlarge the receptive field by stacking convolutional layers to obtain global features. However, this strategy aggregates information over relatively short distances and struggles to establish long-range dependencies. In contrast, the Transformer's self-attention mechanism directly compares feature correlations across all spatial positions, capturing both local and global long-range dependencies and providing stronger global modeling capability. A thorough study of the Transformer in image classification tasks is therefore well warranted. This survey first takes Vision Transformer as an example and introduces the core principles and architecture of the Transformer in detail. Taking image classification as the entry point, it then summarizes the key problems and latest progress in visual Transformer research around three important aspects: performance improvement, computational cost, and training optimization. It further reviews applications of the Transformer in specific domains such as medical images, remote sensing images, and agricultural images, which demonstrate the Transformer's versatility and generality. Finally, drawing on a comprehensive analysis of research progress on visual Transformers for image classification, it offers an outlook on future directions for visual Transformers.
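To make the mechanism described in the abstract concrete, the sketch below shows ViT-style patch embedding followed by one self-attention layer, in which every patch token attends to every spatial position via softmax(QK^T/√d)V. This is a minimal illustration assuming PyTorch; the patch size, embedding dimension, and head count are illustrative choices, not values taken from the surveyed papers, and the class token and MLP sub-block of the full ViT are omitted.

```python
# Minimal ViT-style sketch: patch embedding + one self-attention layer.
# Illustrative dimensions only (patch 16, dim 64, 4 heads); not drawn
# from any specific paper in the survey. The CLS token and the MLP
# sub-block of a full ViT encoder are omitted for brevity.
import torch
import torch.nn as nn

class MiniViTBlock(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=64, heads=4):
        super().__init__()
        # Split the image into non-overlapping patches and linearly
        # embed each one (a strided convolution does both at once).
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = (img_size // patch) ** 2
        # Learnable position embedding, since attention itself is
        # permutation-invariant over tokens.
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                              # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = tokens + self.pos
        h = self.norm(tokens)
        # Self-attention: each token's output is a similarity-weighted
        # sum over ALL spatial positions, so local and global (long-range)
        # dependencies are available within a single layer.
        out, weights = self.attn(h, h, h, need_weights=True)
        return tokens + out, weights                   # residual connection

x = torch.randn(1, 3, 224, 224)
y, attn = MiniViTBlock()(x)
print(y.shape, attn.shape)  # (1, 196, 64) and (1, 196, 196)
```

The attention weight matrix has shape (N, N) over all N patch tokens, so a dependency between any two positions is established after one layer; a stack of 3×3 convolutions would instead need depth roughly proportional to the spatial distance between those positions, which is the short-range aggregation limitation the abstract contrasts against.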
PENG Bin, BAI Jing, LI Wenjing, ZHENG Hu, MA Xiangyu. Survey on Visual Transformer for Image Classification[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(2): 320-344.