Journal of Frontiers of Computer Science and Technology ›› 2024, Vol. 18 ›› Issue (1): 24-43. DOI: 10.3778/j.issn.1673-9418.2303056

• Frontiers & Surveys •


Word Embedding Methods in Natural Language Processing: A Review

ZENG Jun, WANG Ziwei, YU Yang, WEN Junhao, GAO Min   

  1. School of Big Data & Software Engineering, Chongqing University, Chongqing 401331, China
  2. Key Laboratory of Dependable Service Computing in Cyber Physical Society (Chongqing University), Ministry of Education, Chongqing 400044, China
  • Online: 2024-01-01  Published: 2024-01-01


Abstract: Word embedding, the first step in natural language processing (NLP) tasks, transforms input natural language text into numerical vectors that models can process; these vectors are known as word vectors or distributed representations. As the foundation of NLP, word vectors are a prerequisite for all downstream tasks. However, most existing surveys of word embedding methods focus on the technical routes of the individual methods, neglecting both the tokenization step that precedes embedding and the complete evolutionary trajectory of word embedding. Taking the introduction of the word2vec model and the Transformer model as pivotal points, this paper categorizes word embedding methods into static and dynamic approaches, according to whether the generated word vectors can dynamically adapt their implicit semantic information to the overall semantics of the input sentence, and discusses this classification in depth. It also compares and analyzes the tokenization methods used in word embedding, covering whole-word and subword segmentation; traces in detail the evolution of the language models used to train word vectors, from probabilistic language models through neural probabilistic language models to today's deep contextual language models; and summarizes the training strategies employed in pre-training language models. Finally, this paper reviews methods for evaluating word vector quality, analyzes the current state of word embedding methods, and offers an outlook on their future development.
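To make the static/dynamic distinction above concrete, the following minimal Python sketch contrasts the two families. It is an illustration only, not code from the paper; the use of gensim for a word2vec-style static model and of Hugging Face transformers with the bert-base-uncased checkpoint for a contextual model are assumptions of this sketch.

    # Minimal sketch (not from the paper): static vs. dynamic word vectors.
    # Assumptions: gensim provides the word2vec-style static model; Hugging Face
    # transformers with the "bert-base-uncased" checkpoint provides the contextual one.
    import torch
    from gensim.models import Word2Vec
    from transformers import AutoModel, AutoTokenizer

    sentences = [["the", "bank", "raised", "interest", "rates"],
                 ["we", "sat", "on", "the", "river", "bank"]]

    # Static embedding: one fixed vector per word type, whatever the context.
    w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)
    static_bank = w2v.wv["bank"]          # identical vector in both sentences

    # Dynamic embedding: each token's vector is recomputed from the whole sentence.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    def contextual_vector(sentence: str, word: str) -> torch.Tensor:
        """Return the BERT hidden state of `word` inside `sentence`."""
        enc = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]      # (seq_len, 768)
        idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
        return hidden[idx]

    v1 = contextual_vector("the bank raised interest rates", "bank")
    v2 = contextual_vector("we sat on the river bank", "bank")
    # v1 and v2 differ: the same surface form receives context-specific vectors.

In short, a static model answers "what does this word usually mean?", while a dynamic model answers "what does this word mean here?", which is precisely the axis along which the survey organizes embedding methods.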

Key words: word vector, word embedding, natural language processing, language model, tokenization, word vector evaluation