计算机科学与探索 ›› 2021, Vol. 15 ›› Issue (8): 1405-1417. DOI: 10.3778/j.issn.1673-9418.2101042

• Survey · Exploration •

Research on BERT Cross-Lingual Word Embedding Learning

WANG Yurong, LIN Min, LI Yanling

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
  • Online: 2021-08-01 Published: 2021-08-02

Research of BERT Cross-Lingual Word Embedding Learning

WANG Yurong, LIN Min, LI Yanling   

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
  • Online: 2021-08-01 Published: 2021-08-02

Abstract:

With the growth of multilingual information on the Internet, effectively representing the information contained in different languages has become an important sub-task of natural language processing, and cross-lingual word embeddings have therefore become a current research hotspot. Cross-lingual word embeddings use transfer learning to map monolingual word embeddings into a shared low-dimensional space, transferring syntactic, semantic, and structural features across languages and thereby modeling cross-lingual semantic information. By pre-training on large corpora, the BERT model yields general-purpose word embeddings, which are further dynamically optimized for specific downstream tasks to produce context-sensitive dynamic word embeddings, resolving the polysemy problem of earlier models. Based on a review of the existing literature on BERT-based cross-lingual word embeddings, this paper gives a comprehensive account of the development of BERT-based cross-lingual word embedding learning methods, models, and techniques, as well as the training data they require. According to the training method, these approaches are divided into two categories, supervised learning and unsupervised learning, and representative studies of both categories are compared and summarized in detail. Finally, evaluation methods for cross-lingual word embeddings are outlined, and the construction of BERT-based Mongolian-Chinese cross-lingual word embeddings is discussed as a direction for future work.
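The mapping of monolingual embeddings into a shared space mentioned above is often learned from a seed bilingual dictionary. The following is a minimal sketch of the classic supervised orthogonal Procrustes mapping, given here only as an illustration of the general idea; the arrays src_emb and tgt_emb and the index pairs pairs are hypothetical placeholders, not data or code from the surveyed work.

```python
import numpy as np

def learn_orthogonal_mapping(src_emb: np.ndarray, tgt_emb: np.ndarray, pairs):
    """Learn an orthogonal matrix W mapping source-language vectors into the
    target-language space, given seed bilingual dictionary index pairs."""
    # Stack the aligned word vectors from the seed dictionary.
    X = np.stack([src_emb[i] for i, _ in pairs])   # source vectors
    Y = np.stack([tgt_emb[j] for _, j in pairs])   # target vectors
    # Orthogonal Procrustes solution: W = U V^T from the SVD of X^T Y.
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt

def nearest_target(word_vec: np.ndarray, W: np.ndarray, tgt_emb: np.ndarray) -> int:
    """Return the index of the target word closest to the mapped source vector."""
    mapped = word_vec @ W
    sims = tgt_emb @ mapped / (np.linalg.norm(tgt_emb, axis=1) * np.linalg.norm(mapped) + 1e-9)
    return int(np.argmax(sims))
```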

Keywords: cross-lingual word embedding, Mongolian-Chinese, BERT

Abstract:

With the development of multilingual information on the Internet, how to effectively represent the information contained in texts of different languages has become an important sub-task of natural language processing. Therefore, cross-lingual word embedding has become a hot research topic. Cross-lingual word embedding maps monolingual word embeddings into a shared low-dimensional space with the help of transfer learning, so that syntactic, semantic, and structural features can be transferred between different languages and cross-lingual semantic information can be modeled. By training on large corpora, the BERT (bidirectional encoder representations from transformers) model obtains a general word embedding, which is further dynamically optimized according to specific downstream tasks to generate context-sensitive word embeddings, thus solving the polysemy problem of previous models and yielding dynamic word embeddings. Based on a literature review of existing studies on BERT-based cross-lingual word embedding, this paper comprehensively describes the development of BERT-based cross-lingual word embedding learning methods, models, and techniques, as well as the required training data. According to the training method, these approaches are divided into two categories, supervised learning and unsupervised learning, and the representative research of the two types of methods is compared and summarized in detail. Finally, the evaluation methods of cross-lingual word embedding are summarized, and the construction of BERT-based Mongolian-Chinese cross-lingual word embeddings is discussed as future work.
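As a concrete illustration of the context-sensitive embeddings described above, the sketch below extracts token vectors from a multilingual BERT checkpoint with the Hugging Face transformers library. The checkpoint name bert-base-multilingual-cased and the example sentences are illustrative assumptions, not the specific configuration evaluated in the surveyed studies.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Multilingual BERT shares one vocabulary and one encoder across many languages,
# so sentences from different languages are encoded in the same vector space.
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

sentences = ["The bank raised interest rates.", "She sat on the river bank."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# last_hidden_state has shape (batch, seq_len, hidden); each occurrence of
# "bank" receives a different, context-dependent vector.
token_vectors = outputs.last_hidden_state
print(token_vectors.shape)
```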

Key words: cross-lingual word embedding, Mongolian-Chinese, bidirectional encoder representations from transformers (BERT)