Fine-Grained Visual Categorization: Deep Pairwise Feature Comparison Interaction Algorithm

doi:10.3778/j.issn.1673-9418.2207091

Abstract

Fine-grained visual categorization is an important but challenging task in computer vision because of high intra-class variance and low inter-class variance. Classical fine-grained image recognition methods take a single image as input and produce a single output, which limits the model's ability to learn by comparing paired images. Inspired by how humans compare images when discriminating fine-grained categories, this paper proposes a deep pairwise feature comparison interaction algorithm (PCI) for fine-grained classification, which identifies common and distinguishing features between image pairs and thereby improves recognition accuracy. First, PCI establishes a positive-negative pair input strategy to extract pairwise deep features from fine-grained images. Second, a deep pairwise feature interaction mechanism performs global information learning, deep comparison, and deep adaptive interaction on the paired deep features. Finally, a pairwise feature contrastive learning mechanism constrains the pairwise deep fine-grained features, increasing the similarity between positive pairs and reducing the similarity between negative pairs. Extensive experiments on the widely used fine-grained datasets CUB-200-2011, Stanford Dogs, Stanford Cars, and FGVC-Aircraft show that PCI outperforms current state-of-the-art methods.

Key words: fine-grained, image classification, deep neural network, contrastive learning, attention mechanism
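The pair input strategy and the contrastive constraint described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sample_pairs`, `pairwise_contrastive_loss`), the use of cosine similarity, the margin value, and the toy feature vectors are all assumptions made for illustration, and the feature interaction mechanism is omitted entirely.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_pairs(labels, rng):
    """Hypothetical pair sampler: for each sample, pick one same-class
    partner (positive pair) and one different-class partner (negative pair).

    Returns a list of (anchor_index, partner_index, is_positive) triples.
    """
    pairs = []
    for i, label in enumerate(labels):
        same = [j for j, other in enumerate(labels) if other == label and j != i]
        diff = [j for j, other in enumerate(labels) if other != label]
        if same:
            pairs.append((i, rng.choice(same), True))
        if diff:
            pairs.append((i, rng.choice(diff), False))
    return pairs


def pairwise_contrastive_loss(features, pairs, margin=0.5):
    """Assumed margin-based loss: a positive pair is penalized by (1 - sim),
    a negative pair by max(0, sim - margin). Returns the mean over all pairs.
    """
    total = 0.0
    for i, j, is_positive in pairs:
        sim = cosine_similarity(features[i], features[j])
        total += (1.0 - sim) if is_positive else max(0.0, sim - margin)
    return total / len(pairs)


# Toy data: two classes whose features are already perfectly separated.
labels = [0, 0, 1, 1]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pairs = sample_pairs(labels, random.Random(0))
loss = pairwise_contrastive_loss(features, pairs)
```

On this toy input every positive pair has similarity 1 and every negative pair has similarity 0, so the loss is zero. Features that place same-class samples far apart would raise the positive term, which is the behavior the contrastive constraint is meant to discourage.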

WANG Min, ZHAO Peng, GUO Xinping, MIN Fan. Fine-Grained Visual Categorization: Deep Pairwise Feature Comparison Interaction Algorithm[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(11): 2663-2675.
