Multi-label Image Recognition Using Channel Pixel Attention

doi:10.3778/j.issn.1673-9418.2307087

Abstract

Multi-label image recognition is the task of classifying images whose annotations contain labels for multiple object categories. This paper targets two problems in multi-label image recognition: small objects are hard to recognize, and sample data are imbalanced. To address them, it proposes a simple and efficient channel pixel attention (CPA) module and a class-weighted cross-entropy loss, respectively. CPA computes channel attention and pixel attention scores to generate a corresponding pixel feature for each channel, which strengthens the network's attention to small objects. The pixel features are then pooled and gained, and the result is fed into a multi-layer perceptron (MLP) for the final classification. The class-weighted loss uses the distribution of positive-sample counts in the dataset as the weight of the classic cross-entropy (CE) loss, so that the model pays more attention to object categories with few samples. Comparative experiments are conducted on three public multi-label image datasets: VOC 2007 (PASCAL Visual Object Classes Challenge 2007), MS-COCO 2014 (Microsoft Common Objects in Context) and VAW (Visual Attributes in the Wild). Compared with other existing state-of-the-art methods, the proposed method improves the mean average precision (mAP) by 0.2, 0.7 and 0.9 percentage points on these three datasets, respectively. On MS-COCO 2014 and VAW, the class-weighted cross-entropy loss improves mAP by 0.6 and 1.6 percentage points, respectively, over the commonly used cross-entropy loss without adding any computational cost. These results verify that the proposed method is effective and competitive with the state of the art.

Key words: deep learning, image recognition, attention mechanism, multi-label classification, loss function
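The abstract describes CPA and the class-weighted loss only in words. The following is a minimal sketch of how such a pipeline could fit together. It is not the authors' implementation, and every formula in it is an assumption:

- Channel attention is taken as the sigmoid of each channel's mean activation.
- The pixel attention score is a softmax over that channel's pixels.
- "Gain" means scaling the pooled feature by the channel attention.
- Each class weight is the inverse of its positive-sample count, rescaled to mean 1.

The function names (`cpa_pool`, `mlp`, `class_weights`, `class_weighted_bce`), the mixing factor `lam`, and the sample counts are hypothetical. The code uses plain Python lists so that it runs without a deep-learning framework.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(xs):
    """Softmax over a list, shifted by the max for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cpa_pool(feature_map, lam=1.0):
    """Hypothetical CPA-style pooling (assumed formulas, not the paper's).

    feature_map: C channels, each a flat list of H*W pixel activations.
    Returns one pooled value per channel.
    """
    pooled = []
    for channel in feature_map:
        avg = sum(channel) / len(channel)
        # Assumption: channel attention = sigmoid of the channel's mean.
        channel_att = sigmoid(avg)
        # Assumption: pixel attention = softmax over this channel's pixels,
        # so a few strongly activated pixels (e.g. a small object) dominate.
        scores = softmax(channel)
        pixel_feat = sum(s * x for s, x in zip(scores, channel))
        # Assumption: "gain" = scaling the pooled feature by channel attention.
        pooled.append(channel_att * (avg + lam * pixel_feat))
    return pooled


def mlp(features, w1, b1, w2, b2):
    """One-hidden-layer perceptron with ReLU; returns one logit per class."""
    hidden = [max(0.0, sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def class_weights(positive_counts):
    """Assumption: weight ~ 1 / positive count, rescaled to mean 1."""
    inv = [1.0 / max(c, 1) for c in positive_counts]
    scale = len(inv) / sum(inv)
    return [scale * v for v in inv]


def class_weighted_bce(logits, targets, weights, eps=1e-7):
    """Per-class binary cross-entropy, each term scaled by its class weight."""
    total = 0.0
    for z, y, w in zip(logits, targets, weights):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


# Usage: 2 channels of a 2x2 feature map (flattened), 3 classes.
fmap = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 4.0]]
feats = cpa_pool(fmap)
logits = mlp(feats, w1=[[0.5, -0.2], [0.1, 0.3]], b1=[0.0, 0.0],
             w2=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], b2=[0.0, 0.0, -1.0])
weights = class_weights([5000, 800, 50])  # hypothetical positive counts
loss = class_weighted_bce(logits, targets=[1, 0, 1], weights=weights)
```

In this sketch both channels have the same mean activation. The second channel, whose activation is concentrated in one pixel (a stand-in for a small object), still yields a larger pooled feature. The class weights depend only on training-set label counts, so they can be computed once before training. That is consistent with the abstract's statement that the loss adds no computational cost, although the inverse-count rule itself is an assumption here.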

YE Qingwen, ZHANG Qiuju. Multi-label Image Recognition Using Channel Pixel Attention[J]. Journal of Frontiers of Computer Science and Technology, 2024, 18(8): 2109-2117.

