计算机科学与探索 ›› 2021, Vol. 15 ›› Issue (8): 1405-1417. DOI: 10.3778/j.issn.1673-9418.2101042

• Survey · Exploration •

Research on BERT Cross-Lingual Word Embedding Learning

WANG Yurong, LIN Min, LI Yanling

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
  • Online: 2021-08-01 Published: 2021-08-02

Research of BERT Cross-Lingual Word Embedding Learning

WANG Yurong, LIN Min, LI Yanling   

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
  • Online: 2021-08-01 Published: 2021-08-02

Abstract:

With the growth of multilingual information on the Internet, effectively representing the information contained in different languages has become an important sub-task of natural language processing, and cross-lingual word embeddings have therefore become a current research hotspot. Cross-lingual word embeddings use transfer learning to map monolingual word embeddings into a shared low-dimensional space, transferring syntactic, semantic, and structural features across languages and thereby modeling cross-lingual semantic information. By pre-training on large corpora, the BERT model yields general-purpose word embeddings, which are further dynamically optimized for specific downstream tasks to produce context-sensitive dynamic word embeddings, resolving the polysemy problem of earlier models. Based on a review of the existing literature on BERT-based cross-lingual word embeddings, this paper gives a comprehensive account of the development of BERT-based cross-lingual word embedding learning methods, models, and techniques, as well as the training data they require. According to the training method, these approaches are divided into two categories, supervised learning and unsupervised learning, and representative studies of both categories are compared and summarized in detail. Finally, evaluation methods for cross-lingual word embeddings are outlined, and the construction of BERT-based Mongolian-Chinese cross-lingual word embeddings is discussed as a direction for future work.
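The mapping of monolingual embeddings into a shared space mentioned above is often learned from a seed bilingual dictionary. The following is a minimal sketch of the classic supervised orthogonal Procrustes mapping, given here only as an illustration of the general idea; the arrays src_emb and tgt_emb and the index pairs pairs are hypothetical placeholders, not data or code from the surveyed work.

```python
import numpy as np

def learn_orthogonal_mapping(src_emb: np.ndarray, tgt_emb: np.ndarray, pairs):
    """Learn an orthogonal matrix W mapping source-language vectors into the
    target-language space, given seed bilingual dictionary index pairs."""
    # Stack the aligned word vectors from the seed dictionary.
    X = np.stack([src_emb[i] for i, _ in pairs])   # source vectors
    Y = np.stack([tgt_emb[j] for _, j in pairs])   # target vectors
    # Orthogonal Procrustes solution: W = U V^T from the SVD of X^T Y.
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt

def nearest_target(word_vec: np.ndarray, W: np.ndarray, tgt_emb: np.ndarray) -> int:
    """Return the index of the target word closest to the mapped source vector."""
    mapped = word_vec @ W
    sims = tgt_emb @ mapped / (np.linalg.norm(tgt_emb, axis=1) * np.linalg.norm(mapped) + 1e-9)
    return int(np.argmax(sims))
```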

Keywords: cross-lingual word embedding, Mongolian-Chinese, BERT

Abstract:

With the development of multilingual information on the Internet, how to effectively represent the information contained in texts of different languages has become an important sub-task of natural language processing. Therefore, cross-lingual word embedding has become a hot research topic. Cross-lingual word embedding maps monolingual word embeddings into a shared low-dimensional space with the help of transfer learning, so that syntactic, semantic, and structural features can be transferred between different languages and cross-lingual semantic information can be modeled. By training on large corpora, the BERT (bidirectional encoder representations from transformers) model obtains a general word embedding, which is further dynamically optimized according to specific downstream tasks to generate context-sensitive word embeddings, thus solving the polysemy problem of previous models and yielding dynamic word embeddings. Based on a literature review of existing studies on BERT-based cross-lingual word embedding, this paper comprehensively describes the development of BERT-based cross-lingual word embedding learning methods, models, and techniques, as well as the required training data. According to the training method, these approaches are divided into two categories, supervised learning and unsupervised learning, and the representative research of the two types of methods is compared and summarized in detail. Finally, the evaluation methods of cross-lingual word embedding are summarized, and the construction of BERT-based Mongolian-Chinese cross-lingual word embeddings is discussed as future work.
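As a concrete illustration of the context-sensitive embeddings described above, the sketch below extracts token vectors from a multilingual BERT checkpoint with the Hugging Face transformers library. The checkpoint name bert-base-multilingual-cased and the example sentences are illustrative assumptions, not the specific configuration evaluated in the surveyed studies.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Multilingual BERT shares one vocabulary and one encoder across many languages,
# so sentences from different languages are encoded in the same vector space.
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

sentences = ["The bank raised interest rates.", "She sat on the river bank."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# last_hidden_state has shape (batch, seq_len, hidden); each occurrence of
# "bank" receives a different, context-dependent vector.
token_vectors = outputs.last_hidden_state
print(token_vectors.shape)
```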

Key words: cross-lingual word embedding, Mongolian-Chinese, bidirectional encoder representations from transformers (BERT)