Journal of Frontiers of Computer Science and Technology ›› 2022, Vol. 16 ›› Issue (4): 938-949.DOI: 10.3778/j.issn.1673-9418.2010031
• Graphics and Image •
Received: 2020-10-12
Revised: 2021-01-07
Online: 2022-04-01
Published: 2021-02-04
About author: LI Kuankuan, born in 1995 in Shijiazhuang, Hebei, M.S. candidate. His research interests include image processing and computer vision.
Corresponding author: E-mail: liulib@163.com
CLC Number:
LI Kuankuan, LIU Libo. Fine-Grained Image Classification Model Based on Bilinear Aggregate Residual Attention[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(4): 938-949.
URL: http://fcst.ceaj.org/EN/10.3778/j.issn.1673-9418.2010031
Table 1 Dataset statistics of the training and test sets

| Dataset | Categories | Training images | Test images |
|---|---|---|---|
| CUB-200-2011 | 200 | 5 994 | 5 794 |
| FGVC-Aircraft | 100 | 6 667 | 3 333 |
| Stanford Cars | 196 | 8 144 | 8 041 |
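For reference, the three benchmarks in Table 1 are commonly consumed as per-class image folders split into train/test sets. Below is a minimal PyTorch loading sketch; the directory names (e.g., `data/cub200/train`) are placeholders for illustration, not the authors' released pipeline.

```python
# Minimal data-loading sketch (assumed folder layout, not the authors' pipeline).
import torch
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Each dataset is assumed to be arranged as one sub-folder per class.
train_set = datasets.ImageFolder("data/cub200/train", transform=train_tf)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=32, shuffle=True, num_workers=4)

print(len(train_set.classes))  # 200 categories for CUB-200-2011 (Table 1)
```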
Table 2 ξ value assignment using BARAN with 512 feature channels

| Dataset | cnums/cgroups |
|---|---|
| CUB-200-2011 | 2/88, 3/112 |
| FGVC-Aircraft | 5/100, 6/2 |
| Stanford Cars | 2/76, 3/120 |
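The cnums/cgroups pairs appear to partition the 512 feature channels into per-class groups in the style of the mutual-channel loss [14]: each pair contributes cnums × cgroups channels, and for every dataset in Table 2 the pairs cover all 512 channels. A small sanity-check sketch (variable names are ours, not the authors'):

```python
# Sanity check: each dataset's cnums/cgroups pairs cover all 512 feature channels.
assignments = {
    "CUB-200-2011":  [(2, 88), (3, 112)],
    "FGVC-Aircraft": [(5, 100), (6, 2)],
    "Stanford Cars": [(2, 76), (3, 120)],
}

for name, pairs in assignments.items():
    total = sum(cnum * cgroup for cnum, cgroup in pairs)
    print(f"{name}: {total} channels")  # prints 512 for every dataset (Table 2)
```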
Table 3 Experimental comparison of the SA module with ResNeXt under different cardinalities

| Method | Base model | Params/10^6 | Accuracy/% |
|---|---|---|---|
| B-CNN[M,D] | VGG16-M + VGG16-D | 13.8 | 84.1 |
| BARN[2×64d] | ResNeXt29×2 + SA | 34.8 | 84.8 |
| BARN[4×64d] | ResNeXt29×2 + SA | 34.6 | 85.2 |
| BARN[8×64d] | ResNeXt29×2 + SA | 34.4 | 85.5 |
| BARN[32×4d] | ResNeXt29×2 + SA | 18.2 | 85.9 |
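The C×d notation (e.g., 32×4d) follows ResNeXt [16]: the bottleneck's 3×3 convolution is split into C groups of width d, so a higher cardinality with a smaller group width needs fewer parameters, consistent with BARN[32×4d] being both the lightest and the most accurate variant in Table 3. A minimal grouped-bottleneck sketch in PyTorch (our own simplification, not the paper's exact block):

```python
# ResNeXt-style grouped bottleneck (simplified sketch, not the paper's exact block).
import torch
import torch.nn as nn

class GroupedBottleneck(nn.Module):
    def __init__(self, in_ch, out_ch, cardinality=32, group_width=4):
        super().__init__()
        width = cardinality * group_width          # e.g. 32 * 4 = 128
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),   # grouped 3x3 convolution
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.block(x)  # residual connection omitted for brevity

def n_params(model):
    return sum(p.numel() for p in model.parameters())

# Same in/out channels, different cardinality/width trade-offs:
print(n_params(GroupedBottleneck(256, 256, cardinality=32, group_width=4)))  # lighter
print(n_params(GroupedBottleneck(256, 256, cardinality=8,  group_width=64)))  # heavier
```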
Table 4 Ablation experiment on components of the MCA module (accuracy/%)

| Method | Base model | CUB-200-2011 | FGVC-Aircraft | Stanford Cars |
|---|---|---|---|---|
| BARN+MCA (CWA) | ResNeXt29×2 + SA | 63.85 | 88.79 | 89.87 |
| BARN+MCA | ResNeXt29×2 + SA | 27.35 | 79.88 | 70.23 |
| BARN+MCA | ResNeXt29×2 + SA | 65.07 | 88.28 | 90.04 |
| BARN+MCA | ResNeXt29×2 + SA | 66.47 | 89.90 | 91.34 |
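If the CWA variant in Table 4 denotes a channel-wise attention component inside MCA (the expansion is not spelled out on this page), a minimal squeeze-and-excitation style channel gate in the spirit of reference [11], which the paper cites, looks like the sketch below. It is offered for orientation only and is not the authors' MCA module.

```python
# SE-style channel attention sketch (cf. reference [11]); NOT the authors' MCA module.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global context per channel
        self.fc = nn.Sequential(                  # excitation: per-channel gate in [0, 1]
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                              # reweight feature channels

feat = torch.randn(2, 512, 14, 14)
print(ChannelAttention(512)(feat).shape)  # torch.Size([2, 512, 14, 14])
```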
Table 5 Experimental comparison of different weakly supervised fine-grained image classification methods (accuracy/%)

| Method | Base model | CUB-200-2011 | FGVC-Aircraft | Stanford Cars |
|---|---|---|---|---|
| B-CNN[15] | VGG16 | 84.1 | 84.1 | 91.3 |
| MaxEnt[22] | B-CNN | 85.3 | 86.1 | 92.8 |
| PC[23] | B-CNN | 85.6 | 85.8 | 92.5 |
| PC[23] | DenseNet161 | 86.9 | 89.2 | 92.9 |
| MA-CNN[24] | VGG19 | 86.5 | 89.9 | 92.8 |
| DFL-CNN[25] | ResNet50 | 87.4 | 91.7 | 93.9 |
| NTS-Net[5] | ResNet50 | 87.5 | 91.4 | 93.9 |
| TASN[26] | ResNet50 | 87.9 | — | 93.8 |
| DCL[27] | VGG16 | 86.9 | 91.2 | 94.1 |
| WPS-CPM[28] | GoogleNet + ResNet50 | 90.4 | — | — |
| Bi-Modal PMA[29] | ResNet50 | 87.5 | 90.8 | 93.1 |
| BARAN (proposed) | B-CNN + ResNeXt29 | 87.9 | 92.9 | 94.7 |
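BARAN builds on the bilinear pooling of B-CNN [15], which aggregates two feature maps through an outer product over spatial locations followed by signed square-root and L2 normalization. A minimal sketch of that pooling step (our own illustration of the cited technique, not the full BARAN model):

```python
# Bilinear pooling in the spirit of B-CNN [15]; not the full BARAN model.
import torch
import torch.nn.functional as F

def bilinear_pool(fa, fb, eps=1e-12):
    """fa: (B, C1, H, W), fb: (B, C2, H, W) feature maps from two streams."""
    b, c1, h, w = fa.shape
    c2 = fb.shape[1]
    fa = fa.reshape(b, c1, h * w)
    fb = fb.reshape(b, c2, h * w)
    x = torch.bmm(fa, fb.transpose(1, 2)) / (h * w)  # outer product, averaged over locations
    x = x.reshape(b, c1 * c2)
    x = torch.sign(x) * torch.sqrt(x.abs() + eps)    # signed square-root normalization
    return F.normalize(x, dim=1)                     # L2 normalization

fa = torch.randn(2, 512, 14, 14)
fb = torch.randn(2, 512, 14, 14)
print(bilinear_pool(fa, fb).shape)  # torch.Size([2, 262144])
```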
[1] ZHANG N, DONAHUE J, GIRSHICK R B, et al. Part-based R-CNNs for fine-grained category detection[C]// LNCS 8689: Proceedings of the 13th European Conference on Computer Vision, Zurich, Sep 6-12, 2014. Cham: Springer, 2014: 834-849.
[2] LUO J H, WU J X. A survey on fine-grained image categorization using deep convolutional features[J]. Acta Automatica Sinica, 2017, 43(8): 1306-1318.
[3] UIJLINGS J R R, VAN DE SANDE K E A, GEVERS T, et al. Selective search for object recognition[J]. International Journal of Computer Vision, 2013, 104(2): 154-171.
[4] LIN D, SHEN X Y, LU C W, et al. Deep LAC: deep localization, alignment and classification for fine-grained recognition[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 1666-1674.
[5] YANG Z, LUO T G, WANG D, et al. Learning to navigate for fine-grained classification[C]// LNCS 11218: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 420-435.
[6] BORJI A, ITTI L. State-of-the-art in visual attention modeling[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1): 185-207.
[7] PENG Y H, HE X T, ZHAO J J. Object-part attention model for fine-grained image classification[J]. IEEE Transactions on Image Processing, 2018, 27(3): 1487-1500.
[8] WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// LNCS 11211: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 3-19.
[9] HAN K, GUO J Y, ZHANG C, et al. Attribute-aware attention model for fine-grained representation learning[C]// Proceedings of the 2018 ACM Multimedia Conference, Seoul, Oct 22-26, 2018. New York: ACM, 2018: 2040-2048.
[10] GAO Y, HAN X T, WANG X, et al. Channel interaction networks for fine-grained image categorization[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence, the 32nd Innovative Applications of Artificial Intelligence Conference, the 10th AAAI Symposium on Educational Advances in Artificial Intelligence, New York, Feb 7-12, 2020. Menlo Park: AAAI, 2020: 10818-10825.
[11] HU J, SHEN L, ALBANIE S, et al. Squeeze-and-excitation networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(8): 2011-2023.
[12] LI X, WANG W H, HU X L, et al. Selective kernel networks[C]// Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 510-519.
[13] ZHANG H, WU C R, ZHANG Z Y, et al. ResNeSt: split-attention networks[J]. arXiv:2004.08955, 2020.
[14] CHANG D L, DING Y F, XIE J Y, et al. The devil is in the channels: mutual-channel loss for fine-grained image classification[J]. IEEE Transactions on Image Processing, 2020, 29: 4683-4695.
[15] LIN T Y, ROYCHOWDHURY A, MAJI S. Bilinear CNNs for fine-grained visual recognition[J]. arXiv:1504.07889, 2015.
[16] XIE S N, GIRSHICK R B, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 5987-5995.
[17] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 770-778.
[18] PASZKE A, GROSS S, CHINTALA S, et al. Automatic differentiation in PyTorch[C]// Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, Oct 28, 2017. Red Hook: Curran Associates, 2017: 1-4.
[19] WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset[R]. Pasadena: California Institute of Technology, 2011.
[20] MAJI S, RAHTU E, KANNALA J, et al. Fine-grained visual classification of aircraft[J]. arXiv:1306.5151, 2013.
[21] KRAUSE J, STARK M, DENG J, et al. 3D object representations for fine-grained categorization[C]// Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Dec 1-8, 2013. Washington: IEEE Computer Society, 2013: 554-561.
[22] DUBEY A, GUPTA O, RASKAR R, et al. Maximum-entropy fine grained classification[C]// Proceedings of the Annual Conference on Neural Information Processing Systems, Montréal, Dec 3-8, 2018: 635-645.
[23] DUBEY A, GUPTA O, GUO P, et al. Pairwise confusion for fine-grained visual classification[C]// LNCS 11216: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 71-88.
[24] ZHENG H L, FU J L, MEI T, et al. Learning multi-attention convolutional neural network for fine-grained image recognition[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 5219-5227.
[25] WANG Y M, MORARIU V I, DAVIS L S. Learning a discriminative filter bank within a CNN for fine-grained recognition[C]// Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 4148-4157.
[26] ZHENG H L, FU J L, ZHA Z J, et al. Looking for the devil in the details: learning trilinear attention sampling network for fine-grained image recognition[C]// Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 5012-5021.
[27] CHEN Y, BAI Y L, ZHANG W, et al. Destruction and construction learning for fine-grained image recognition[C]// Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 5157-5166.
[28] GE W F, LIN X R, YU Y Z. Weakly supervised complementary parts models for fine-grained image classification from the bottom up[C]// Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 3034-3043.
[29] SONG K T, WEI X S, SHU X B, et al. Bi-modal progressive mask attention for fine-grained recognition[J]. IEEE Transactions on Image Processing, 2020, 29: 7006-7018.
[30] SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 618-626.
[31] YANG M L, ZHANG W S. Image classification algorithm based on classification activation map enhancement[J]. Journal of Frontiers of Computer Science and Technology, 2020, 14(1): 149-158.