Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (1): 24-43. DOI: 10.3778/j.issn.1673-9418.2303056

• Frontiers & Surveys •


Word Embedding Methods in Natural Language Processing: A Review

ZENG Jun, WANG Ziwei, YU Yang, WEN Junhao, GAO Min   

  1. School of Big Data & Software Engineering, Chongqing University, Chongqing 401331, China
  2. Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education, Chongqing 400044, China
  • Online: 2024-01-01  Published: 2024-01-01


Abstract: Word embedding, the first step in natural language processing (NLP) tasks, transforms input natural language text into numerical vectors that models can process; these vectors are known as word vectors or distributed representations. As the foundation of NLP, word vectors are a prerequisite for all downstream tasks. However, most existing surveys of word embedding methods focus on the technical routes of the individual methods, neglecting both the tokenization step that precedes embedding and the complete evolutionary trajectory of word embedding. Taking the introduction of the word2vec model and the Transformer model as pivotal points, this paper categorizes word embedding methods into static and dynamic approaches, according to whether the generated word vectors can dynamically adapt their implicit semantic information to the overall semantics of the input sentence, and discusses this classification in depth. It also compares and analyzes the tokenization methods used in word embedding, covering whole-word and subword segmentation; traces in detail the evolution of the language models used to train word vectors, from probabilistic language models through neural probabilistic language models to today's deep contextual language models; and summarizes the training strategies employed in pre-training language models. Finally, this paper reviews methods for evaluating word vector quality, analyzes the current state of word embedding methods, and offers an outlook on their future development.
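To make the static/dynamic distinction above concrete, the following minimal Python sketch contrasts the two families. It is an illustration only, not code from the paper; the use of gensim for a word2vec-style static model and of Hugging Face transformers with the bert-base-uncased checkpoint for a contextual model are assumptions of this sketch.

    # Minimal sketch (not from the paper): static vs. dynamic word vectors.
    # Assumptions: gensim provides the word2vec-style static model; Hugging Face
    # transformers with the "bert-base-uncased" checkpoint provides the contextual one.
    import torch
    from gensim.models import Word2Vec
    from transformers import AutoModel, AutoTokenizer

    sentences = [["the", "bank", "raised", "interest", "rates"],
                 ["we", "sat", "on", "the", "river", "bank"]]

    # Static embedding: one fixed vector per word type, whatever the context.
    w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)
    static_bank = w2v.wv["bank"]          # identical vector in both sentences

    # Dynamic embedding: each token's vector is recomputed from the whole sentence.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    def contextual_vector(sentence: str, word: str) -> torch.Tensor:
        """Return the BERT hidden state of `word` inside `sentence`."""
        enc = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]      # (seq_len, 768)
        idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
        return hidden[idx]

    v1 = contextual_vector("the bank raised interest rates", "bank")
    v2 = contextual_vector("we sat on the river bank", "bank")
    # v1 and v2 differ: the same surface form receives context-specific vectors.

In short, a static model answers "what does this word usually mean?", while a dynamic model answers "what does this word mean here?", which is precisely the axis along which the survey organizes embedding methods.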

Key words: word vector, word embedding, natural language processing, language model, tokenization, word vector evaluation