Cross-Modal Recipe Retrieval with Self-Attention Mechanism

doi:10.3778/j.issn.1673-9418.1912016

Abstract

Dietary recording is a key part of diet management. To simplify it, researchers have proposed recipe retrieval based on food pictures: the recipe corresponding to a photographed dish is retrieved, and nutritional information is derived from it, which makes recording more convenient. Recipe retrieval is a typical cross-modal retrieval problem, but compared with the general case its main difficulty is that a recipe describes the sequence of changes that turn raw ingredients into the finished dish, rather than features directly visible in the picture, so the model needs a deep understanding of how the ingredients are processed. Current work on recipe retrieval processes the text sequentially and therefore captures long-distance dependencies in the cooking process poorly. To address this problem, this paper proposes a cross-modal recipe retrieval model based on the self-attention mechanism. The model uses the self-attention mechanism of the Transformer to capture long-distance dependencies in recipes, and it also improves the attention mechanism used in earlier methods, so that it can better mine the semantics of recipes. Experimental results show that the model improves recall on the recipe retrieval task by 22% over the baseline methods.

Key words: dietary recording, recipe retrieval, self-attention mechanism, cross-modal, deep neural network

LIN Yang, CHU Xu, WANG Yasha, MAO Weijia, ZHAO Junfeng. Cross-Modal Recipe Retrieval with Self-Attention Mechanism[J]. Journal of Frontiers of Computer Science and Technology (计算机科学与探索), 2020, 14(9): 1471-1481.
