Understanding word2vec in TensorFlow

悠悠 · 2022-06-10 11:07

Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics. Natural language generation systems turn computer data into natural language; natural language understanding systems turn natural language into a form that computer programs can process more easily.

In general, before a computer can process text it must first vectorize it, i.e. map the text into a vector space model, and only then apply deep learning methods to train on it.

Word2vec is a predictive model for learning word embeddings efficiently. It comes in two variants: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model.


Recall the example of classifying the MNIST dataset with a simple network:

TensorFlow MNIST机器学习入门 (MNIST For ML Beginners)


Each input image is a 784-dimensional vector.

The softmax regression model can be explained with the figure below: the inputs xs are weighted and summed, a bias is added to each sum, and the results are fed into the softmax function:

[Figure: the softmax regression computation graph]

If we write it out as an equation, we get:

[Figure: the equation y = softmax(Wx + b)]
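For reference, here is a minimal sketch of that softmax regression, written in the TF 1.x style used later in this post (the shapes follow the MNIST tutorial; it is only a sketch, not a full training script):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])   # a batch of flattened 28x28 images
W = tf.Variable(tf.zeros([784, 10]))          # one weight column per digit class
b = tf.Variable(tf.zeros([10]))               # one bias per class
y = tf.nn.softmax(tf.matmul(x, W) + b)        # y = softmax(Wx + b), the class probabilities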

The figure below shows the per-pixel weights a trained model has learned for each digit class. Red indicates negative weights and blue indicates positive weights.
(If all the weights W are arranged into a matrix, then multiplying row 0's parameters with an image of the digit 0 yields a higher probability for class 0 than for any other digit; you can think of row 0's parameters as a template for images of the digit 0.)

[Figure: learned per-pixel weights for each digit class (red = negative, blue = positive)]


The Skip-Gram model

Three good translated articles:
一文详解 Word2vec 之 Skip-Gram 模型(结构篇)
一文详解 Word2vec 之 Skip-Gram 模型(训练篇)
一文详解 Word2vec 之 Skip-Gram 模型(实现篇)

For text represented with word embeddings, you can think of a word vector as playing the same role as an image and get the same effect; the only difference is that the vector has many more dimensions.

How do we represent these words? First, we know that a neural network only accepts numerical input; we cannot feed it a word as a string, so we need some way to represent words. The most common approach is to build a vocabulary from the training documents and then one-hot encode each word.
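As a concrete toy illustration of that vocabulary-plus-one-hot step, here is a minimal numpy sketch (the corpus and helper names are invented for this example):

import numpy as np

corpus = "the dog barked at the mailman".split()
vocabulary = sorted(set(corpus))                  # ['at', 'barked', 'dog', 'mailman', 'the']
word_to_id = {w: i for i, w in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    vec[word_to_id[word]] = 1.0
    return vec

print(one_hot("dog"))  # [0. 0. 1. 0. 0.]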

See: 一文详解 Word2vec 之 Skip-Gram 模型(结构篇)

The Word2Vec framework has two main models, Skip-Gram and CBOW. Intuitively, Skip-Gram predicts the context given an input word, while CBOW predicts the input word given its context.

[Figure: the CBOW and Skip-Gram model architectures]

Suppose we have the sentence "The dog barked at the mailman" and we choose "dog" as the input word.

We then define a parameter called skip_window, the number of words to take from one side (left or right) of the current input word. With skip_window=2, the words in the window (including the input word) are ['The', 'dog', 'barked', 'at'], so the overall window size is span = 2×2 = 4.

Another parameter is num_skips, the number of different words we pick from the window as output words. With skip_window=2 and num_skips=2, we get two training pairs of the form (input word, output word): ('dog', 'barked') and ('dog', 'the').

Note: the paragraph above is copied directly from 一文详解 Word2vec 之 Skip-Gram 模型(结构篇). With skip_window=2 and num_skips=2, I expected three (input word, output word) pairs, namely ('dog', 'barked'), ('dog', 'the') and ('dog', 'at'), and I did not understand why the original article drops ('dog', 'at'). (The reason is that num_skips caps how many context words are sampled from the window per input word, so with num_skips=2 only two of the three candidate context words are actually used.)
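To make the sampling behaviour concrete, here is a rough sketch of how skip_window and num_skips turn one input word into training pairs. It mimics the idea only, not TensorLayer's exact generate_skip_gram_batch implementation:

import random

def skip_gram_pairs(words, index, skip_window, num_skips):
    """Sample num_skips (input, output) pairs for the word at `index`."""
    lo = max(0, index - skip_window)
    hi = min(len(words), index + skip_window + 1)
    context = [words[i] for i in range(lo, hi) if i != index]      # words inside the window
    targets = random.sample(context, min(num_skips, len(context)))  # num_skips caps the picks
    return [(words[index], t) for t in targets]

sentence = "the dog barked at the mailman".split()
print(skip_gram_pairs(sentence, index=1, skip_window=2, num_skips=2))
# e.g. [('dog', 'barked'), ('dog', 'the')]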

Once we feed the text in, we obtain a large number of (input word, output word) pairs; these are the real training data. You can think of the input word as playing the role of an MNIST image, with the output word as its label.

Take the sentence "The quick brown fox jumps over the lazy dog" and set the window size to 2 (window_size=2), i.e. we only combine the input word with the two words before and after it. In the figure below, blue marks the input word and the boxes mark the words inside the window.

[Figure: training pairs generated with window_size=2; blue marks the input word, boxes mark words inside the window]

The model learns its statistics from how often each word pair appears. For example, the network will probably see many more training pairs like ("Soviet", "Union") than ("Soviet", "Sasquatch"). So once training is complete, given the word "Soviet" as input, the model will assign a much higher probability to "Union" or "Russia" than to "Sasquatch".

If the model's input is a 10,000-dimensional vector, its output is also a 10,000-dimensional vector (the vocabulary size) containing 10,000 probabilities, each one the probability that the corresponding word is the output word for the given input.
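Here is a shape-only numpy sketch of the forward pass just described (one-hot input, 300-dimensional hidden layer, softmax over a 10,000-word vocabulary). The weights are random and the word index is invented, purely to show the dimensions:

import numpy as np

vocab_size, emb_size = 10000, 300
W_in = np.random.uniform(-1, 1, (vocab_size, emb_size))   # input-to-hidden weights
W_out = np.random.uniform(-1, 1, (emb_size, vocab_size))  # hidden-to-output weights

x = np.zeros(vocab_size, dtype=np.float32)
x[1234] = 1.0                          # one-hot vector for some input word
hidden = x @ W_in                      # shape (300,), the word's embedding
logits = hidden @ W_out                # shape (10000,), one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax: 10,000 probabilities summing to 1
print(probs.shape, probs.sum())        # (10000,) 1.0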

[Figure: the Skip-Gram network architecture with a 10,000-dimensional input and output]

Having covered word encoding and the selection of training samples, let's look at the hidden layer. Suppose we want to represent each word with 300 features (i.e. each word becomes a 300-dimensional vector). Then the hidden layer's weight matrix has 10,000 rows and 300 columns (the hidden layer has 300 units).

In the figure below, the left and right views show the input-to-hidden weight matrix from two angles. In the left view, each column is the weight vector connecting the 10,000-dimensional word vector to a single hidden neuron. In the right view, each row is in fact a word's embedding vector.

[Figure: two views of the input-to-hidden weight matrix]

So our final goal is simply to learn this hidden-layer weight matrix.

Now let's return to defining the model so that we can train it.

As mentioned above, both the input word and the output word are one-hot encoded. Think about it: after one-hot encoding, almost every dimension of the input is 0 (only a single position is 1), so the vector is extremely sparse. What does that imply? Multiplying a 1×10000 vector by a 10000×300 matrix wastes a lot of computation, so for efficiency the implementation simply selects the matrix row whose index matches the position of the 1 in the vector (this sounds convoluted; the figure makes it clear).

[Figure: multiplying a one-hot vector by the weight matrix selects a single row]

To compute efficiently, no matrix multiplication is actually performed in this sparse case: the result of the multiplication is just the matrix row indexed by the position of the 1 in the vector. In the example above, the 1 in the left vector sits at dimension 3 (0-indexed), so the result is row 3 of the matrix (0-indexed), [10, 12, 19]. The hidden-layer weight matrix therefore acts as a lookup table: instead of a matrix multiplication, we simply look up the weights in the row corresponding to the input vector's 1. The hidden layer's output is each input word's embedding vector.
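A small numpy check of that lookup-table behaviour: the row [10, 12, 19] follows the example above, the other matrix values are made up for illustration, and the row lookup is what tf.nn.embedding_lookup (used later in the TensorLayer code) performs:

import numpy as np

W = np.array([[17, 24,  1],
              [23,  5,  7],
              [ 4,  6, 13],
              [10, 12, 19],   # row 3
              [11, 18, 25]])

one_hot = np.array([0, 0, 0, 1, 0])   # the 1 sits at index 3
print(one_hot @ W)                    # [10 12 19] -- full matrix multiply
print(W[3])                           # [10 12 19] -- direct row lookup, same result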


The TensorLayer word2vec program

  1. """ 原文档:https://github.com/shorxp/tensorlayer-chinese/blob/master/example/tutorial_word2vec_basic.py Vector Representations of Words --------------------------------- This is the minimalistic reimplementation of tensorflow/examples/tutorials/word2vec/word2vec_basic.py This basic example contains the code needed to download some data, train on it a bit and visualize the result by using t-SNE. """
  2. import collections
  3. import math
  4. import os
  5. import random
  6. import numpy as np
  7. from six.moves import xrange # pylint: disable=redefined-builtin
  8. import tensorflow as tf
  9. import tensorlayer as tl
  10. import time
  11. flags = tf.flags
  12. flags.DEFINE_string("model", "one", "A type of model.")
  13. FLAGS = flags.FLAGS
  14. def main_word2vec_basic():
  15. """ Step 1: Download the data, read the context into a list of strings. Set hyperparameters. """
  16. words = tl.files.load_matt_mahoney_text8_dataset()
  17. word_list : a list
  18. # 单词列表
  19. # 例如: [.... 'the', 'cat', 'is', 'cute', ...]
  20. data_size = len(words)
  21. print('Data size', data_size)
  22. resume = False # 是否加载现有的模型、数据和字典
  23. _UNK = "_UNK"
  24. if FLAGS.model == "one":
  25. # toy setting
  26. vocabulary_size = 50000 # 字典的长度
  27. batch_size = 128
  28. embedding_size = 128 # 词向量的维度
  29. skip_window = 1 # How many words to consider left and right.
  30. num_skips = 2 # How many times to reuse an input to generate a label.
  31. # (should be double of 'skip_window' so as to
  32. # use both left and right words)
  33. num_sampled = 64 # Number of negative examples to sample.
  34. # more negative samples, higher loss
  35. learning_rate = 1.0
  36. n_epoch = 20 # 对整一个数据集循环训练的次数
  37. model_file_name = "model_word2vec_50k_128"
  38. # Eval 2084/15851 accuracy = 15.7%
  39. if FLAGS.model == "two":
  40. # (tensorflow/models/embedding/word2vec.py)
  41. vocabulary_size = 80000
  42. batch_size = 20 # Note: small batch_size need more steps for a Epoch
  43. embedding_size = 200
  44. skip_window = 5
  45. num_skips = 10
  46. num_sampled = 100
  47. learning_rate = 0.2
  48. n_epoch = 15
  49. model_file_name = "model_word2vec_80k_200"
  50. # 7.9%
  51. if FLAGS.model == "three":
  52. # (tensorflow/models/embedding/word2vec_optimized.py)
  53. vocabulary_size = 80000
  54. batch_size = 500
  55. embedding_size = 200
  56. skip_window = 5
  57. num_skips = 10
  58. num_sampled = 25
  59. learning_rate = 0.025
  60. n_epoch = 20
  61. model_file_name = "model_word2vec_80k_200_opt"
  62. # bad 0%
  63. if FLAGS.model == "four":
  64. # see: Learning word embeddings efficiently with noise-contrastive estimation
  65. vocabulary_size = 80000
  66. batch_size = 100
  67. embedding_size = 600
  68. skip_window = 5
  69. num_skips = 10
  70. num_sampled = 25
  71. learning_rate = 0.03
  72. n_epoch = 200 * 10
  73. model_file_name = "model_word2vec_80k_600"
  74. # bad
  75. # num_steps 整个一个训练过程中训练的批次,每批训练 batch_size 个数据
  76. num_steps = int((data_size/batch_size) * n_epoch) # total number of iteration,一个 iteration 即一批,调用一次 tl.nlp.generate_skip_gram_batch
  77. print('%d Steps a Epoch, total Epochs %d' % (int(data_size/batch_size), n_epoch))
  78. print(' learning_rate: %f' % learning_rate)
  79. print(' batch_size: %d' % batch_size)
  80. """ Step 2: 建立词典,并用 'UNK' 代替不常见的词 """
  81. print()
  82. if resume:
  83. # 加载已经训练好的模型,数据和字典
  84. print("Load existing data and dictionaries" + "!"*10)
  85. all_var = tl.files.load_npy_to_any(name=model_file_name+'.npy')
  86. data = all_var['data']; count = all_var['count']
  87. dictionary = all_var['dictionary']
  88. reverse_dictionary = all_var['reverse_dictionary']
  89. else:
  90. data, count, dictionary, reverse_dictionary = \
  91. tl.nlp.build_words_dataset(words, vocabulary_size, True, _UNK)
  92. print('Most 5 common words (+UNK)', count[:5])
  93. # 词频最高的5个词: [['UNK', 418391], (b'the', 1061396), (b'of', 593677), (b'and', 416629), (b'one', 411764)]
  94. print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
  95. # 输出前前10个数据,data中是词对应的id,reverse_dictionary中是词
  96. # [5243, 3081, 12, 6, 195, 2, 3135, 46, 59, 156]
  97. # [b'anarchism', b'originated', b'as', b'a', b'term', b'of', b'abuse', b'first', b'used', b'against']
  98. del words # 删除词列表 words 以减少内存占用
  99. """ Step 3: Function to generate a training batch for the Skip-Gram model. """
  100. print()
  101. data_index = 0
  102. batch, labels, data_index = tl.nlp.generate_skip_gram_batch(data=data,
  103. batch_size=20, num_skips=4, skip_window=2, data_index=0)
  104. for i in range(20):
  105. print(batch[i], reverse_dictionary[batch[i]],
  106. '->', labels[i, 0], reverse_dictionary[labels[i, 0]])
  107. """ Step 4: Build a Skip-Gram model. """
  108. print()
  109. valid_size = 16 # 随机获取一些词来验证相似性
  110. valid_window = 100 # Only pick dev samples in the head of the distribution.
  111. valid_examples = np.random.choice(valid_window, valid_size, replace=False)
  112. # print(valid_examples) # [90 85 20 33 35 62 37 63 88 38 82 58 83 59 48 64]
  113. print_freq = 2000
  114. # n_epoch = int(num_steps / batch_size) 训练数据集的轮数
  115. # train_inputs 是一个向量, 每个不同的词对应一个id
  116. # train_labels is a column vector, 即一个词所对应的向量
  117. # valid_dataset is a column vector, a valid set is an integer id of single word.
  118. train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
  119. train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  120. valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  121. # Look up embeddings for inputs.
  122. emb_net = tl.layers.Word2vecEmbeddingInputlayer(
  123. inputs = train_inputs,
  124. train_labels = train_labels,
  125. vocabulary_size = vocabulary_size,
  126. embedding_size = embedding_size,
  127. num_sampled = num_sampled,
  128. nce_loss_args = {},
  129. E_init = tf.random_uniform_initializer(minval=-1.0, maxval=1.0),
  130. E_init_args = {},
  131. nce_W_init = tf.truncated_normal_initializer(stddev=float(1.0/np.sqrt(embedding_size))),
  132. nce_W_init_args = {},
  133. nce_b_init = tf.constant_initializer(value=0.0),
  134. nce_b_init_args = {},
  135. name ='word2vec_layer',
  136. )
  137. # 优化函数
  138. cost = emb_net.nce_cost
  139. train_params = emb_net.all_params
  140. # train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost, var_list=train_params)
  141. train_op = tf.train.AdagradOptimizer(learning_rate, initial_accumulator_value=0.1,
  142. use_locking=False).minimize(cost, var_list=train_params)
  143. # Compute the cosine similarity between minibatch examples and all embeddings.
  144. # For simple visualization of validation set.
  145. normalized_embeddings = emb_net.normalized_embeddings
  146. # tf.nn.embedding_lookup 对批数据中的单词建立嵌套向量
  147. valid_embed = tf.nn.embedding_lookup(
  148. normalized_embeddings, valid_dataset)
  149. similarity = tf.matmul(
  150. valid_embed, normalized_embeddings, transpose_b=True)
  151. # multiply all valid word vector with all word vector.
  152. # transpose_b=True, normalized_embeddings is transposed before multiplication.
  153. """ Step 5: Begin training. """
  154. print()
  155. sess.run(tf.initialize_all_variables())
  156. if resume:
  157. print("Load existing model" + "!"*10)
  158. # Load from ckpt or npz file
  159. saver = tf.train.Saver()
  160. saver.restore(sess, model_file_name+'.ckpt')
  161. # load_params = tl.files.load_npz(name=model_file_name+'.npz')
  162. # tl.files.assign_params(sess, load_params, emb_net)
  163. emb_net.print_params()
  164. emb_net.print_layers()
  165. # save vocabulary to txt
  166. tl.nlp.save_vocab(count, name='vocab_text8.txt')
  167. average_loss = 0
  168. # for step in xrange(num_steps):
  169. step = 0
  170. while (step < num_steps):
  171. start_time = time.time()
  172. batch_inputs, batch_labels, data_index = tl.nlp.generate_skip_gram_batch(
  173. data=data, batch_size=batch_size, num_skips=num_skips,
  174. skip_window=skip_window, data_index=data_index)
  175. feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}
  176. # We perform one update step by evaluating the train_op (including it
  177. # in the list of returned values for sess.run()
  178. _, loss_val = sess.run([train_op, cost], feed_dict=feed_dict)
  179. average_loss += loss_val
  180. if step % print_freq == 0:
  181. if step > 0:
  182. average_loss /= 2000
  183. print("Average loss at step %d/%d. loss:%f took:%fs" %
  184. (step, num_steps, average_loss, time.time() - start_time))
  185. average_loss = 0
  186. # Prints out nearby words given a list of words.
  187. # Note that this is expensive (~20% slowdown if computed every 500 steps)
  188. if step % (print_freq * 5) == 0:
  189. sim = similarity.eval()
  190. for i in xrange(valid_size):
  191. valid_word = reverse_dictionary[valid_examples[i]]
  192. top_k = 8 # number of nearest neighbors to print
  193. nearest = (-sim[i, :]).argsort()[1:top_k+1]
  194. log_str = "Nearest to %s:" % valid_word
  195. for k in xrange(top_k):
  196. close_word = reverse_dictionary[nearest[k]]
  197. log_str = "%s %s," % (log_str, close_word)
  198. print(log_str)
  199. if (step % (print_freq * 20) == 0) and (step != 0):
  200. print("Save model, data and dictionaries" + "!"*10);
  201. # Save to ckpt or npz file
  202. saver = tf.train.Saver()
  203. save_path = saver.save(sess,"./" + model_file_name +'.ckpt')
  204. # tl.files.save_npz(emb_net.all_params, name=model_file_name+'.npz')
  205. tl.files.save_any_to_npy(save_dict={
  206. 'data': data, 'count': count,
  207. 'dictionary': dictionary, 'reverse_dictionary':
  208. reverse_dictionary}, name=model_file_name+'.npy')
  209. if step == num_steps-1:
  210. keeptrain = input("Training %d finished enter 1 to keep training: " % num_steps)
  211. if keeptrain == '1':
  212. step = 0
  213. learning_rate = float(input("Input new learning rate: "))
  214. train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
  215. step += 1
  216. """ Step 6: Visualize the normalized embedding matrix by t-SNE. """
  217. print()
  218. final_embeddings = normalized_embeddings.eval()
  219. tl.visualize.tsne_embedding(final_embeddings, reverse_dictionary,
  220. plot_only=500, second=5, saveable=False, name='word2vec_basic')
  221. """ Step 7: Evaluate by analogy questions. see tensorflow/models/embedding/word2vec_optimized.py """
  222. print()
  223. # from tensorflow/models/embedding/word2vec.py
  224. analogy_questions = tl.nlp.read_analogies_file( \
  225. eval_file='questions-words.txt', word2id=dictionary)
  226. # The eval feeds three vectors of word ids for a, b, c, each of
  227. # which is of size N, where N is the number of analogies we want to
  228. # evaluate in one batch.
  229. analogy_a = tf.placeholder(dtype=tf.int32) # [N]
  230. analogy_b = tf.placeholder(dtype=tf.int32) # [N]
  231. analogy_c = tf.placeholder(dtype=tf.int32) # [N]
  232. # Each row of a_emb, b_emb, c_emb is a word's embedding vector.
  233. # They all have the shape [N, emb_dim]
  234. a_emb = tf.gather(normalized_embeddings, analogy_a) # a's embs
  235. b_emb = tf.gather(normalized_embeddings, analogy_b) # b's embs
  236. c_emb = tf.gather(normalized_embeddings, analogy_c) # c's embs
  237. # We expect that d's embedding vectors on the unit hyper-sphere is
  238. # near: c_emb + (b_emb - a_emb), which has the shape [N, emb_dim].
  239. # Bangkok Thailand Tokyo Japan -> Thailand - Bangkok = Japan - Tokyo
  240. # Japan = Tokyo + (Thailand - Bangkok)
  241. # d = c + (b - a)
  242. target = c_emb + (b_emb - a_emb)
  243. # Compute cosine distance between each pair of target and vocab.
  244. # dist has shape [N, vocab_size].
  245. dist = tf.matmul(target, normalized_embeddings, transpose_b=True)
  246. # For each question (row in dist), find the top 'n_answer' words.
  247. n_answer = 4
  248. _, pred_idx = tf.nn.top_k(dist, n_answer)
  249. def predict(analogy):
  250. """Predict the top 4 answers for analogy questions."""
  251. idx, = sess.run([pred_idx], {
  252. analogy_a: analogy[:, 0],
  253. analogy_b: analogy[:, 1],
  254. analogy_c: analogy[:, 2]
  255. })
  256. return idx
  257. # Evaluate analogy questions and reports accuracy.
  258. # i.e. How many questions we get right at precision@1.
  259. correct = 0
  260. total = analogy_questions.shape[0]
  261. start = 0
  262. while start < total:
  263. limit = start + 2500
  264. sub = analogy_questions[start:limit, :] # question
  265. idx = predict(sub) # 4 answers for each question
  266. # print('question:', tl.nlp.word_ids_to_words(sub[0], reverse_dictionary))
  267. # print('answers:', tl.nlp.word_ids_to_words(idx[0], reverse_dictionary))
  268. start = limit
  269. for question in xrange(sub.shape[0]):
  270. for j in xrange(n_answer):
  271. # if one of the top 4 answers in correct, win !
  272. if idx[question, j] == sub[question, 3]:
  273. # Bingo! We predicted correctly. E.g., [italy, rome, france, paris].
  274. print(j+1, tl.nlp.word_ids_to_words([idx[question, j]], reverse_dictionary) \
  275. , ':', tl.nlp.word_ids_to_words(sub[question, :], reverse_dictionary))
  276. correct += 1
  277. break
  278. elif idx[question, j] in sub[question, :3]:
  279. # We need to skip words already in the question.
  280. continue
  281. else:
  282. # The correct label is not the precision@1
  283. break
  284. print("Eval %4d/%d accuracy = %4.1f%%" % (correct, total,
  285. correct * 100.0 / total))
  286. if __name__ == '__main__':
  287. sess = tf.InteractiveSession()
  288. main_word2vec_basic()

Output:

Nearest to by: through, including, in, under, during, via, from, featuring,
Nearest to years: decades, weeks, days, months, hours, seconds, minutes, year,
Nearest to used: employed, needed, referred, presented, designed, cited, applied, adopted,
Nearest to has: had, have, is, was, maintains, requires, produces, since,
Nearest to of: including, in, besides, from, for, although, includes, and,
Nearest to see: references, but, compare, include, etc, and, external, refers,
Nearest to war: wars, conflict, atrocities, veterans, turmoil, crisis, battle, wwii,
Nearest to that: which, however, nevertheless, additionally, what, moreover, but, furthermore,
Nearest to th: nd, ninth, rd, seventh, twentieth, tenth, nineteenth, st,
Nearest to states: kingdom, nations, state, countries, us, organizations, nation, netherlands,
Nearest to about: approximately, over, regarding, concerning, around, exactly, roughly, within,
Nearest to five: four, seven, six, three, eight, two, nine, zero,
Nearest to his: her, their, my, its, your, our, the, whose,
Nearest to zero: five, seven, eight, six, four, nine, three, two,
Nearest to system: systems, scheme, model, mechanisms, mechanism, theory, concept, process,
Nearest to be: been, become, get, have, is, seem, refer, prove,
Average loss at step 2652000/2657063. loss:3.958110 took:0.002000s
Average loss at step 2654000/2657063. loss:3.956920 took:0.002000s
Average loss at step 2656000/2657063. loss:3.973455 took:0.002000s

References:

  1. TensorFlow tutorial: Vector Representations of Words (字词的向量表示) http://www.tensorfly.cn/tfdoc/tutorials/word2vec.html
  2. Wikipedia: Vector space model https://en.wikipedia.org/wiki/Vector_space_model
  3. Wikipedia (Chinese): 自然语言处理 (Natural language processing)
     https://zh.wikipedia.org/wiki/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86
