Encoder-Decoder Network Fusing Channel and Spatial Attention for Crowd Counting

Journal of Frontiers of Computer Science and Technology, 2022, Vol. 16, Issue (11): 2547-2556. DOI: 10.3778/j.issn.1673-9418.2104122

• Artificial Intelligence •

YU Ying⁺, PAN Cheng, ZHU Huilin, QIAN Jin, TANG Hong

College of Software, East China Jiaotong University, Nanchang 330013, China

Received: 2021-05-07 Revised: 2021-06-23 Online: 2022-11-01 Published: 2021-06-24
About author: YU Ying, born in 1979, Ph.D., associate professor, M.S. supervisor, member of CCF. Her research interests include machine learning, computer vision, and granular computing.
PAN Cheng, born in 1995, M.S. candidate. His research interests include machine learning and computer vision.
ZHU Huilin, born in 1996, M.S. candidate. Her research interest is computer vision.
QIAN Jin, born in 1975, Ph.D., professor, M.S. supervisor, member of CCF. His research interests include granular computing, big data mining and machine learning.
TANG Hong, born in 1998, M.S. candidate. His research interests include deep learning and computer vision.
Supported by:
National Natural Science Foundation of China (62163016); National Natural Science Foundation of China (62066014); Natural Science Foundation of Jiangxi Province (20212ACB202001); Natural Science Foundation of Jiangxi Province (20202BABL202018)

Corresponding author: ⁺ E-mail: yuyingjx@163.com

Abstract

Crowd counting aims to accurately predict the number, distribution, and density of people in real scenes. In practice, complex backgrounds, diverse target scales, and cluttered crowd distributions strongly degrade counting precision. To address these problems, a channel and spatial attention-based encoder-decoder network for crowd counting (CSANet) is proposed. It uses a multi-level encoder-decoder network to extract multi-scale semantic features and fully integrates spatial context information, which addresses pedestrian scale variation and cluttered distribution in complex scenes. To reduce the impact of complex backgrounds on counting performance, channel and spatial attention are introduced into the feature fusion process: increasing the feature weights of crowd regions highlights the regions of interest, and decreasing the feature weights of weakly correlated background regions suppresses background noise, which improves the quality of the crowd density map. To verify the effectiveness of the proposed algorithm, experiments are conducted on several classical crowd counting datasets. The results show that, compared with existing crowd counting algorithms, CSANet performs well in multi-scale feature extraction and background noise suppression, which considerably improves the accuracy and robustness of counting in dense scenes.

Key words: crowd counting, encoder-decoder network, attention, feature fusion, deep learning
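
The channel and spatial attention described in the abstract can be illustrated with a small sketch in the style of CBAM [23], which applies a channel attention map followed by a spatial attention map. This is not the authors' implementation: it is plain Python on nested lists, the weights `w1`, `w2`, and `kernel` are hypothetical values chosen for illustration, and the spatial convolution uses a 3×3 kernel on a tiny feature map, whereas CBAM [23] describes a 7×7 convolution.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat, w1, w2):
    """Return one weight in (0, 1) per channel of a C x H x W feature map."""
    def shared_mlp(vec):
        # Shared two-layer perceptron: C -> C/r with ReLU, then back to C.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    # Global average pooling and global max pooling over each channel.
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(map(max, ch)) for ch in feat]
    return [sigmoid(a + m) for a, m in zip(shared_mlp(avg), shared_mlp(mx))]


def spatial_attention(feat, kernel):
    """Return an H x W map of weights in (0, 1).

    kernel holds 2 x k x k weights applied to the stacked [average, max]
    maps (pooled across channels) with zero padding.
    """
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    k = len(kernel[0])
    pad = k // 2
    avg = [[sum(feat[ch][i][j] for ch in range(c)) / c for j in range(w)]
           for i in range(h)]
    mx = [[max(feat[ch][i][j] for ch in range(c)) for j in range(w)]
          for i in range(h)]
    maps = [avg, mx]
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            acc = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:
                            acc += kernel[m][di][dj] * maps[m][y][x]
            row.append(sigmoid(acc))
        out.append(row)
    return out


def cbam(feat, w1, w2, kernel):
    """Channel attention first, then spatial attention on the refined map."""
    mc = channel_attention(feat, w1, w2)
    refined = [[[mc[ch] * v for v in row] for row in feat[ch]]
               for ch in range(len(feat))]
    ms = spatial_attention(refined, kernel)
    return [[[ms[i][j] * refined[ch][i][j] for j in range(len(ms[0]))]
             for i in range(len(ms))] for ch in range(len(refined))]


# Hypothetical 2-channel, 2 x 2 feature map and illustrative weights.
feat = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, 1.0]]        # (C/r) x C with C = 2, r = 2
w2 = [[1.0], [-1.0]]     # C x (C/r)
kernel = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
          [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]]
refined = cbam(feat, w1, w2, kernel)
```

Because both attention maps pass through a sigmoid, every feature is scaled by a factor between 0 and 1. In a trained network the learned weights would decide which channels and locations keep most of their response (crowd regions) and which are attenuated (background).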

CLC Number: TP391

YU Ying, PAN Cheng, ZHU Huilin, QIAN Jin, TANG Hong. Encoder-Decoder Network Fusing Channel and Spatial Attention for Crowd Counting[J]. Journal of Frontiers of Computer Science and Technology, 2022, 16(11): 2547-2556.

References

[1]	SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[EB/OL]. (2016-08-22)[2021-01-06]. https://arxiv.org/abs/1608.06197.
[2]	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 770-778.
[3]	HAN K, WANG Y H, TIAN Q, et al. GhostNet: more features from cheap operations[C]// Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Jun 13-19, 2020. Piscataway: IEEE, 2020: 1580-1589.
[4]	ZHANG C, LI H S, WANG X G, et al. Cross-scene crowd counting via deep convolutional neural networks[C]// Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun 7-12, 2015. Washington: IEEE Computer Society, 2015: 833-841.
[5]	ZHANG Y Y, ZHOU D, CHEN S Q, et al. Single-image crowd counting via multi-column convolutional neural network[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun 27-30, 2016. Washington: IEEE Computer Society, 2016: 589-597.
[6]	LIU N, LONG Y C, ZOU C Q, et al. ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 3225-3234.
[7]	VIOLA P, JONES M J. Robust real-time face detection[J]. International Journal of Computer Vision, 2004, 57(2): 137-154.
[8]	DALAL N, TRIGGS B. Histograms of oriented gradients for human detection[C]// Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, Jun 20-26, 2005. Washington: IEEE Computer Society, 2005: 886-893.
[9]	DOLLAR P, WOJEK C, SCHIELE B, et al. Pedestrian detection: an evaluation of the state of the art[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 34(4): 743-761.
[10]	FELZENSZWALB P F, GIRSHICK R B, MCALLESTER D, et al. Object detection with discriminatively trained part-based models[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 32(9): 1627-1645.
[11]	CHEN K, GONG S G, XIANG T, et al. Cumulative attribute space for age and crowd density estimation[C]// Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, Jun 23-28, 2013. Washington: IEEE Computer Society, 2013: 2467-2474.
[12]	HOWARD A, SANDLER M, CHU G, et al. Searching for MobileNetV3[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 1314-1324.
[13]	REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[EB/OL]. (2016-01-06) [2021-01-06]. https://arxiv.org/abs/1506.01497.
[14]	BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YOLOv4: optimal speed and accuracy of object detection[EB/OL]. (2020-04-23)[2021-01-06]. https://arxiv.org/abs/2004.10934.
[15]	CHEN L C, ZHU Y, PAPANDREOU G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation[C]// LNCS 11211: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 833-851.
[16]	YU Y, ZHU H L, QIAN J, et al. Survey on deep learning based crowd counting[J]. Journal of Computer Research and Development, 2021, 58(12): 2724-2747.
[17]	GAO G S, GAO J Y, LIU Q J, et al. CNN-based density estimation and crowd counting: a survey[EB/OL]. (2020-03-28)[2021-01-06]. https://arxiv.org/abs/2003.12783.
[18]	YU Y, ZHU H L, WANG L W, et al. Dense crowd counting based on adaptive scene division[J]. International Journal of Machine Learning and Cybernetics, 2021, 12(4): 931-942.
[19]	SINDAGI V A, PATEL V M. Generating high-quality crowd density maps using contextual pyramid CNNs[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Oct 22-29, 2017. Washington: IEEE Computer Society, 2017: 1861-1870.
[20]	SAM D B, SURYA S, BABU R V. Switching convolutional neural network for crowd counting[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul 21-26, 2017. Washington: IEEE Computer Society, 2017: 4031-4039.
[21]	CAO X K, WANG Z P, ZHAO Y Y, et al. Scale aggregation network for accurate and efficient crowd counting[C]// LNCS 11209: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 734-750.
[22]	LI Y H, ZHANG X F, CHEN D M. CSRNet: dilated convolutional neural networks for understanding the highly congested scenes[C]// Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun 18-22, 2018. Washington: IEEE Computer Society, 2018: 1091-1100.
[23]	WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module[C]// LNCS 11211: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 3-19.
[24]	DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]// Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, Jun 20-25, 2009. Washington: IEEE Computer Society, 2009: 248-255.
[25]	WU X J, ZHENG Y B, YE H, et al. Adaptive scenario discovery for crowd counting[C]// Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, May 12-17, 2019. Piscataway: IEEE, 2019: 2382-2386.
[26]	SHI M J, YANG Z H, XU C, et al. Revisiting perspective information for efficient crowd counting[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, Jun 16-20, 2019. Piscataway: IEEE, 2019: 7279-7288.
[27]	ZHU L, ZHAO Z J, LU C, et al. Dual path multi-scale fusion networks with attention for crowd counting[EB/OL]. (2019-02-04)[2021-01-06]. https://arxiv.org/abs/1902.01115.
[28]	OH M, OLSEN P, RAMAMURTHY K N. Crowd counting with decomposed uncertainty[C]// Proceedings of the 2020 AAAI Conference on Artificial Intelligence, New York, Feb 7-12, 2020. Menlo Park: AAAI Press, 2020: 11799-11806.
[29]	IDREES H, TAYYAB M, ATHREY K, et al. Composition loss for counting, density map estimation and localization in dense crowds[C]// LNCS 11206: Proceedings of the 15th European Conference on Computer Vision, Munich, Sep 8-14, 2018. Cham: Springer, 2018: 544-559.
[30]	SINDAGI V A, PATEL V M. CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting[C]// Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, Lecce, Aug 29-Sep 1, 2017. Washington: IEEE Computer Society, 2017: 1-6.
[31]	SINDAGI V A, PATEL V M. Inverse attention guided deep crowd counting network[C]// Proceedings of the 16th IEEE International Conference on Advanced Video and Signal Based Surveillance, Taipei, China, Sep 18-21, 2019. Piscataway: IEEE, 2019: 1-8.
[32]	ZHANG A R, SHEN J Y, XIAO Z H, et al. Relational attention network for crowd counting[C]// Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Oct 27-Nov 2, 2019. Piscataway: IEEE, 2019: 6788-6797.
[33]	IDREES H, SALEEMI I, SEIBERT C, et al. Multi-source multi-scale counting in extremely dense crowd images[C]// Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, Jun 23-28, 2013. Washington: IEEE Computer Society, 2013: 2547-2554.

Encoder	Decoder
Conv1_1(3-64-1)	Upsampling
Conv1_2(3-64-1)	Concat
Max pooling	CBAM module
Conv2_1(3-128-1)	Conv6_1(1-256-1)
Conv2_2(3-128-1)	Conv6_2(3-256-1)
Max pooling	Upsampling
Conv3_1(3-256-1)	Concat
Conv3_2(3-256-1)	CBAM module
Conv3_3(3-256-1)	Conv7_1(1-128-1)
Max pooling	Conv7_2(3-128-1)
Conv4_1(3-512-1)	Upsampling
Conv4_2(3-512-1)	Concat
Conv4_3(3-512-1)	CBAM module
Max pooling	Conv8_1(1-64-1)
Conv5_1(3-512-1)	Conv8_2(3-64-1)
Conv5_2(3-512-1)	Upsampling
Conv5_3(3-512-1)	Conv9_1(3-32-1)
	Conv9_2(3-32-1)
	Conv10_1(1-1-1)
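
The layer listing above can be read as a resolution trace. The sketch below is only an illustration under stated assumptions: the page does not give strides or scale factors, so it assumes each max pooling halves the spatial size and each upsampling doubles it, and the input side length of 512 is a hypothetical value. The operation names (`conv`, `pool`, `up`, `concat`, `cbam`) are labels invented here to mirror the two columns.

```python
# Encoder column: conv blocks separated by four max-pooling layers.
encoder = (["conv"] * 2 + ["pool"] + ["conv"] * 2 + ["pool"]
           + ["conv"] * 3 + ["pool"] + ["conv"] * 3 + ["pool"]
           + ["conv"] * 3)

# Decoder column: three (upsample, concat, CBAM, conv, conv) stages,
# then a final upsampling followed by Conv9_1, Conv9_2, Conv10_1.
decoder = (["up", "concat", "cbam", "conv", "conv"] * 3
           + ["up", "conv", "conv", "conv"])


def trace(size, ops):
    """Return the spatial side length after each operation.

    Assumption: "pool" halves the size, "up" doubles it, and all other
    operations keep it unchanged.
    """
    sizes = []
    for op in ops:
        if op == "pool":
            size //= 2
        elif op == "up":
            size *= 2
        sizes.append(size)
    return sizes


input_size = 512                            # hypothetical input side length
encoder_sizes = trace(input_size, encoder)
decoder_sizes = trace(encoder_sizes[-1], decoder)
print(encoder_sizes[-1], decoder_sizes[-1])
```

Under these assumptions the four pooling layers reduce the input to 1/16 of its side length, and the four upsampling steps bring the decoder output back to the input resolution, so each Concat would join decoder features with encoder features of matching size.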

Method	Part_A MAE	Part_A RMSE	Part_B MAE	Part_B RMSE
MCNN^[5]	110.2	173.2	26.4	41.3
CP-CNN^[19]	73.6	106.4	20.1	30.1
CSRNet^[22]	68.2	115.0	10.6	16.0
ASD^[25]	65.6	98.0	8.5	13.7
PACNN^[26]	66.3	106.4	8.9	13.5
SFANet^[27]	59.8	99.3	6.9	10.9
DUBNet^[28]	64.6	106.8	7.8	12.2
CSANet	59.4	97.1	7.1	11.2
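
The MAE and RMSE columns are not defined on this page. The sketch below assumes the definitions commonly used in crowd counting, where the error is taken between predicted and ground-truth counts per test image; the counts in the example are made-up values, not results from the paper.

```python
import math


def mae(predicted, truth):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


def rmse(predicted, truth):
    """Root mean squared error between predicted and ground-truth counts."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth)
    )


# Hypothetical per-image counts for three test images.
predicted_counts = [100.0, 250.0, 40.0]
true_counts = [110.0, 240.0, 40.0]

example_mae = mae(predicted_counts, true_counts)
example_rmse = rmse(predicted_counts, true_counts)
```

MAE reflects average counting accuracy, while RMSE penalizes large per-image errors more heavily and is therefore often read as an indicator of robustness. For both metrics, lower is better.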

Method	MAE	RMSE
MCNN^[5]	277.0	426.0
CMTL^[30]	252.0	514.0
IA-DCCN^[31]	125.0	186.0
RANet^[32]	111.0	190.0
DUBNet^[28]	105.6	180.5
CSANet	104.8	175.4

Method	MAE	RMSE
MCNN^[5]	377.6	509.1
Switch-CNN^[20]	318.1	439.2
CP-CNN^[19]	295.8	320.9
CSRNet^[22]	266.1	397.5
ASD^[25]	196.2	270.9
PACNN^[26]	267.9	357.8
CSANet	191.3	262.8

Module	Part_A MAE	Part_A RMSE	Part_B MAE	Part_B RMSE
Backbone network	60.9	99.2	7.7	12.2
Backbone network + attention	58.5	96.1	7.1	11.2
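
To make the effect of the attention module easier to read, the relative error reduction between the two rows of the table above can be computed directly. This is a small arithmetic sketch, not part of the original page; the dictionary and function names are chosen here for illustration, and the numbers are copied from the table.

```python
# (MAE, RMSE) per dataset part, copied from the table above.
backbone = {"Part_A": (60.9, 99.2), "Part_B": (7.7, 12.2)}
with_attention = {"Part_A": (58.5, 96.1), "Part_B": (7.1, 11.2)}


def relative_reduction(before, after):
    """Percentage by which the error drops when going from before to after."""
    return 100.0 * (before - after) / before


reduction = {
    part: tuple(
        relative_reduction(b, a)
        for b, a in zip(backbone[part], with_attention[part])
    )
    for part in backbone
}

for part, (mae_drop, rmse_drop) in reduction.items():
    print(f"{part}: MAE -{mae_drop:.1f}%, RMSE -{rmse_drop:.1f}%")
```

With these figures, adding attention lowers MAE by roughly 4% on Part_A and roughly 8% on Part_B, with RMSE reductions of similar size.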
