Survey on Visual Transformer for Image Classification

doi:10.3778/j.issn.1673-9418.2310092

Abstract

The Transformer is a deep learning model built on the self-attention mechanism, and it has shown great potential in computer vision. In image classification, the key challenge is to capture both the local and the global features of an input image efficiently and accurately. Traditional approaches use the lower layers of a convolutional neural network to extract local features, then stack convolutional layers to enlarge the receptive field and obtain global features. However, this strategy aggregates information only over relatively short distances, which makes long-range dependencies difficult to model. In contrast, the self-attention mechanism of the Transformer directly compares features across all spatial positions, so it captures both local and global long-range dependencies and has stronger global modeling capability. A thorough examination of the problems that the Transformer faces in image classification is therefore necessary. Taking Vision Transformer as an example, this paper first describes the core principles and architecture of the Transformer in detail. It then focuses on image classification and summarizes the key issues and recent advances in visual Transformer research from three aspects: performance improvement, computational cost, and training optimization. It also summarizes applications of the Transformer in specific domains such as medical, remote sensing, and agricultural imagery, which demonstrate its versatility and generality. Finally, based on a comprehensive analysis of research progress on visual Transformers for image classification, the paper discusses future directions for their development.
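To make the contrast between stacked convolutions and self-attention concrete, the following is a minimal NumPy sketch, not the implementation of any surveyed model. It assumes a toy 32x32 RGB image, 8x8 patches, a single attention head, and random projection matrices; the helper names (patchify, self_attention, conv_receptive_field) are hypothetical and chosen only for illustration. It omits the class token, position embeddings, and multi-head structure that Vision Transformer uses.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])       # every token vs. every token
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ v, weights


def conv_receptive_field(num_layers, kernel=3):
    """Side length (pixels) seen by one output unit of stacked stride-1 convs."""
    return 1 + num_layers * (kernel - 1)


# Toy inputs: all sizes and the random weights are illustrative assumptions.
rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
tokens = patchify(image, patch=8)  # 16 tokens, each of dimension 8*8*3 = 192
d_model = 16
w_q, w_k, w_v = (
    rng.standard_normal((tokens.shape[1], d_model)) * 0.02 for _ in range(3)
)
out, weights = self_attention(tokens, w_q, w_k, w_v)
```

In this sketch each row of `weights` is a distribution over all 16 patch positions, so a single attention layer already relates every patch to every other patch. By comparison, `conv_receptive_field(3)` shows that three stacked 3x3 stride-1 convolutions cover only a 7x7 pixel neighbourhood.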

Keywords: deep learning, Vision Transformer, network architecture, image classification, self-attention mechanism

PENG Bin, BAI Jing, LI Wenjing, ZHENG Hu, MA Xiangyu. Survey on Visual Transformer for Image Classification[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(2): 320-344.

彭斌, 白静, 李文静, 郑虎, 马向宇. 面向图像分类的视觉Transformer研究进展[J]. 计算机科学与探索, 2024, 18(2): 320-344.
