Journal of Frontiers of Computer Science and Technology, 2020, Vol. 14, Issue (9): 1471-1481. DOI: 10.3778/j.issn.1673-9418.1912016

Cross-Modal Recipe Retrieval with Self-Attention Mechanism

LIN Yang, CHU Xu, WANG Yasha, MAO Weijia, ZHAO Junfeng   

  1. Key Lab of High Confidence Software Technologies, Ministry of Education, Beijing 100871, China
  2. Department of Computer Science and Technology, Peking University, Beijing 100871, China
  3. National Engineering Research Center for Software Engineering, Peking University, Beijing 100871, China
  • Online: 2020-09-01  Published: 2020-09-07

融合自注意力机制的跨模态食谱检索方法

林阳, 初旭, 王亚沙, 毛维嘉, 赵俊峰

  1. 高可信软件技术教育部重点实验室,北京 100871
  2. 北京大学 计算机科学技术系,北京 100871
  3. 北京大学 软件工程国家工程研究中心,北京 100871

Abstract:

Tracking food intake is a key part of diet management. To simplify the recording process, researchers have proposed recipe retrieval based on food pictures: the recipe corresponding to a photographed dish is retrieved, and nutritional information is then inferred from it, which makes dietary recording more convenient. Recipe retrieval is a typical cross-modal retrieval problem. Compared with general cross-modal retrieval, its main difficulty is that a recipe does not describe the visible features of a food picture but the procedure by which ingredients become the final dish, so the model must understand how the ingredients are processed during cooking. Existing work, however, processes recipe text sequentially and therefore fails to capture long-distance dependencies in the cooking process. To address this problem, this paper proposes a cross-modal recipe retrieval model based on the self-attention mechanism. The model uses the self-attention mechanism of the Transformer to capture long-distance dependencies in recipes and improves the attention mechanism used in previous work, enabling it to better capture the semantic information in recipes. Experimental results show that the model outperforms the baselines by 22% in recall on the recipe retrieval task.

Key words: dietary recording, recipe retrieval, self-attention mechanism, cross-modal, deep neural network
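
As a rough illustration of why self-attention can relate distant cooking steps, the following minimal sketch computes scaled dot-product self-attention over a short sequence of step vectors. It is not the model proposed in the paper: the learned query/key/value projections, multiple heads, and positional encodings of a full Transformer are omitted (queries, keys, and values are all taken to be the input itself), and the names `self_attention` and `steps`, as well as the random input, are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention over a sequence x (list of vectors).

    Simplifying assumption: identity projections, so queries, keys and
    values are all x. Returns the re-encoded sequence and the attention
    weights, where weights[i][j] is how much step i attends to step j.
    """
    d = len(x[0])
    weights = []
    for q in x:
        # Every step is scored against every other step in one hop,
        # regardless of how far apart they are in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    out = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return out, weights


# Hypothetical input: 6 recipe steps, each a 4-dimensional embedding.
random.seed(0)
steps = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
encoded, weights = self_attention(steps)
```

Each row of `weights` is a distribution over all steps, so the first step can draw directly on the last one; a model that reads the text strictly in order would have to carry that information through every intermediate step.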

摘要:

饮食记录是饮食管理的关键环节。为了简化记录过程,研究者提出了基于食物图片的食谱检索技术,通过拍摄的图片检索到对应食谱,并据此生成营养信息,从而提高了记录的便捷性。食谱检索是典型的跨模态检索问题,但与一般问题相比,其主要难点是食谱描述了从原材料到成品的一系列变化过程,而非直接可见的特征,因此模型需要深入理解原材料的处理过程。而当前食谱检索研究工作采用线性方式处理文本,导致其捕捉食谱处理过程中的远距离依赖现象的能力较差。针对这个问题,设计了一种基于自注意力机制的跨模态食谱检索模型。该模型借助Transformer模型中的自注意力机制,捕捉食谱中远距离的依赖关系,同时改进了传统方法中的注意力机制,可以更好地挖掘食谱中的语义。实验结果表明,该模型在食谱检索任务的召回率上比基线方法提高了22%。

关键词: 饮食记录, 食谱检索, 自注意力机制, 跨模态, 深度神经网络