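====== Word level ======

Vocabulary: apple, car, cat, ...

===== Advantages =====

Words carry semantic information.

===== Disadvantages =====

  * Words that share a root but differ in affixes are treated as distinct, unrelated tokens: surprise, surprisingly.
  * Because many word forms share the same root, the vocabulary is inflated and the resulting vector representation is sparse.
  * Dimensionality depends on the vocabulary size: each word is converted to a one-hot vector, which is multiplied by the embedding matrix to obtain the corresponding word vector.
  * Out-of-vocabulary (OOV) words cannot be handled. Every OOV word is replaced by one special token, so misspellings and new words hurt considerably.

===== Mitigation =====

Word-level output can be recast as a sequence labeling problem (BIO tagging), which lowers the output dimensionality and the computational cost.

====== Character level ======

Inventory: a-z, A-Z, and a few symbols.

===== Disadvantages =====

Characters carry no semantic information on their own; the neural network has to learn it.

===== Advantages =====

Low dimensionality (around 70 symbols) and low computational cost.

====== Combining word level and character level ======

====== References ======

  * [[https://www.lighttag.io/blog/character-level-NLP/|Character Level NLP: highly recommended]]
  * Character-level Convolutional Network for Text Classification Applied to Chinese Corpus
  * Combining Word-Level and Character-Level Representations for Relation Classification of Informal Text
  * Named Entity Recognition with Bidirectional LSTM-CNNs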
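
The word-level and character-level trade-offs noted on this page can be illustrated with a small sketch. It is only a toy under stated assumptions: the four-entry vocabulary, the ''<unk>'' token name, the symbol set, and the embedding values are all hypothetical and chosen for illustration. It shows a misspelled word collapsing to the unknown token at the word level while still being fully encodable at the character level, and a one-hot vector multiplied by an embedding matrix selecting a single row.

```python
import string

# Hypothetical toy vocabulary; index 0 is reserved for the unknown token.
WORD_VOCAB = ["<unk>", "apple", "car", "cat"]
WORD_TO_ID = {w: i for i, w in enumerate(WORD_VOCAB)}

# Character inventory: a-z, A-Z, plus a few symbols (an illustrative choice).
CHAR_VOCAB = list(string.ascii_letters + " .,!?'-")
CHAR_TO_ID = {c: i for i, c in enumerate(CHAR_VOCAB)}


def encode_words(text):
    """Map each whitespace-separated word to its id; unseen words become <unk>."""
    return [WORD_TO_ID.get(w, WORD_TO_ID["<unk>"]) for w in text.split()]


def encode_chars(text):
    """Map each character to its id; characters outside the inventory raise KeyError."""
    return [CHAR_TO_ID[c] for c in text]


def one_hot(index, size):
    """Return a list of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(index, matrix):
    """Multiply a one-hot row vector by the embedding matrix, selecting row `index`."""
    vec = one_hot(index, len(matrix))
    dim = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix))) for c in range(dim)]


# Toy embedding matrix with one row per vocabulary entry (values are arbitrary).
EMBEDDINGS = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

word_ids = encode_words("cat aple car")  # "aple" is a misspelling of "apple"
char_ids = encode_chars("aple")          # every letter still has an id
cat_vector = embed(WORD_TO_ID["cat"], EMBEDDINGS)

print(word_ids, char_ids, cat_vector)
```

In practice the one-hot multiplication is replaced by a direct row lookup, which gives the same result without building the sparse vector.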