Journal of Frontiers of Computer Science and Technology ›› 2025, Vol. 19 ›› Issue (8): 2123-2134. DOI: 10.3778/j.issn.1673-9418.2407042

• Graphics and Image •


Zero-Shot Image Classification Based on Feature Enhancement and Contrastive Embedding

LIU Ying, FENG Xiaodong, HE Jinglu   

  1. Center for Image and Information Processing, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
  • Online:2025-08-01 Published:2025-07-31



Abstract: Zero-shot image classification aims to predict unseen classes by exploiting the information about seen classes that is available during training. Feature-generation methods synthesize visual features of unseen classes with a generative model guided by semantic information, and then train a supervised learning model in the visual feature space to complete the prediction. However, the visual feature space lacks sufficient discriminative information, so the resulting classification is suboptimal. To obtain features with more discriminative information, this paper builds a contrastive embedding module based on contrastive learning that projects the generated features and the real features into a contrastive embedding space, where embedding is performed at both the instance level and the class level; contrastive learning is used to better capture the differences between instances as well as between classes. A supervised learning model is finally trained in the contrastive embedding space to complete the prediction. In addition, to fully exploit the data distribution of the visual features and to obtain generated features closer to the real features and their semantic information, this paper uses a Vision Transformer for visual feature extraction and adds a dual-prototype constraint strategy to the feature-generation process, using clustering prototypes and class prototypes to help the generative model learn the data distribution. This strategy constrains the generated features to be close to the clustering prototypes of the real features, and the class prototypes of the generated features to be close to the clustering prototypes of the real features. Experiments on three public datasets demonstrate the effectiveness of the proposed algorithm.
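The abstract does not give the exact contrastive objective used for instance-level embedding. The following is a minimal NumPy sketch of a standard InfoNCE-style loss of the kind commonly used for this purpose, assuming each generated feature is paired with its matching real feature and all other pairs in the batch serve as negatives; function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Instance-level InfoNCE loss: row i of `positives` is the positive
    pair of row i of `anchors`; all other rows act as negatives."""
    # L2-normalise so the dot product is cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                     # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the diagonal holds each anchor's similarity to its own positive
    return -np.mean(np.diag(log_prob))
```

A lower loss indicates that each anchor is closer to its own positive than to the other samples, which is the behaviour the contrastive embedding space is trained to produce.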

Key words: zero-shot image classification, generative model, contrastive learning, clustering prototype, class prototype
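The dual-prototype constraint described in the abstract can be sketched in the same spirit: clustering prototypes of the real features are obtained (here with plain k-means), and two L2 penalties pull (1) each generated feature toward its nearest real clustering prototype and (2) the class prototype (mean) of the generated features toward the real clustering prototypes. The paper's exact loss terms and clustering details may differ; this is a hedged illustration with assumed names.

```python
import numpy as np

def cluster_prototypes(features, k, iters=10, seed=0):
    """Plain k-means to compute k clustering prototypes of real features."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # squared distance of every feature to every center: (n, k)
        d = ((features[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if np.any(assign == j):          # keep old center if cluster empties
                centers[j] = features[assign == j].mean(0)
    return centers

def dual_prototype_loss(gen_feats, real_protos):
    """Sum of two L2 terms:
      1) each generated feature close to its nearest real clustering prototype;
      2) the class prototype (mean) of the generated features close to the
         nearest real clustering prototype."""
    d = ((gen_feats[:, None, :] - real_protos[None]) ** 2).sum(-1)
    inst_term = d.min(1).mean()
    class_proto = gen_feats.mean(0)
    class_term = ((real_protos - class_proto) ** 2).sum(-1).min()
    return inst_term + class_term
```

During training of the generative model, this loss would be added to the generator's objective so that synthesized features stay near the empirical distribution of the real features rather than drifting toward the semantic space alone.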