Text Vectorization

深碍√TFBOYSˉ_ 2023-02-27 13:43

### Preface ###

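Text vectorization means converting text into vector form. This post builds text vectors in two ways, one based on TF and one based on TF-IDF. In both cases the length of the vector equals the size of the vocabulary (the dictionary).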

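TF stands for term frequency, and TF-IDF for term frequency-inverse document frequency. TF counts how often a term occurs in a document; TF-IDF additionally down-weights terms that occur in many documents. Introductions to both are easy to find, so if either concept is unfamiliar, a quick web search will cover it.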

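The implementation here is in Python: it takes two short texts as input, outputs their vectors, and uses cosine similarity to measure how related the two documents are.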
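
To make the phrase "the length of the vector equals the size of the vocabulary" concrete, here is a minimal pure-Python sketch of a TF vector. The helper names `tokenize` and `tf_vectors` are hypothetical, and the tokenization rule (lowercase, words of two or more characters) is an assumption chosen for illustration, not something the original post specifies.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed rule: lowercase the text and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def tf_vectors(docs):
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # One dimension per vocabulary word, so len(vector) == len(vocab).
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = tf_vectors([
    "the cat is walking in the bedroom.",
    "the dog was running across the kitchen.",
])
print(vocab)
print(vectors)
```

Each position in a vector holds the count of one vocabulary word, so words that never appear in a document contribute zeros.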

### Implementation ###

#### Computing the cosine similarity of two vectors ####

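    import math
    
    def count_cos_similarity(vec_1, vec_2):
        """Return the cosine similarity of two equal-length numeric vectors."""
        # Vectors of different lengths are not comparable; treat them as unrelated.
        if len(vec_1) != len(vec_2):
            return 0
    
        s = sum(a * b for a, b in zip(vec_1, vec_2))
        den1 = math.sqrt(sum(a * a for a in vec_1))
        den2 = math.sqrt(sum(b * b for b in vec_2))
        # Guard against division by zero when either vector is all zeros.
        if den1 == 0 or den2 == 0:
            return 0
        return s / (den1 * den2)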

#### TF text vectors and similarity ####

    from sklearn.feature_extraction.text import CountVectorizer
    
    sent1 = "the cat is walking in the bedroom."
    sent2 = "the dog was running across the kitchen."
    sentences = [sent1, sent2]
    
    count_vec = CountVectorizer()
    # Fit once and reuse the matrix instead of refitting for every lookup.
    vectors = count_vec.fit_transform(sentences).toarray()
    
    print(vectors)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    print(count_vec.get_feature_names_out().tolist())
    
    vec_1, vec_2 = vectors
    # count_cos_similarity is the function defined in the previous section.
    print(count_cos_similarity(vec_1, vec_2))

Output:

    [[0 1 1 0 1 1 0 0 2 1 0]
     [1 0 0 1 0 0 1 1 2 0 1]]
    ['across', 'bedroom', 'cat', 'dog', 'in', 'is', 'kitchen', 'running', 'the', 'walking', 'was']
    0.4444444444444444

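Note: the output shows, in order, the vector for each text, the word corresponding to each dimension, and the cosine similarity of the two texts.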
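
The similarity value can be checked by hand from the two count vectors. The sketch below hard-codes the vectors printed above and uses only the standard library; it is an illustration of the arithmetic, not part of the original code.

```python
import math

# The two count vectors from the output above.
vec_1 = [0, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0]
vec_2 = [1, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1]

# Only the "the" dimension (index 8) is non-zero in both vectors.
dot = sum(a * b for a, b in zip(vec_1, vec_2))   # 2 * 2 = 4
norm_1 = math.sqrt(sum(a * a for a in vec_1))    # sqrt(9) = 3
norm_2 = math.sqrt(sum(b * b for b in vec_2))    # sqrt(9) = 3

similarity = dot / (norm_1 * norm_2)             # 4 / 9
print(dot, norm_1, norm_2, similarity)
```

The dot product comes entirely from the shared word "the", which is why the score is 4/9 even though no content word is shared.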

#### TF-IDF text vectors and similarity ####

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    sent1 = "the cat is walking in the bedroom."
    sent2 = "the dog was running across the kitchen."
    sentences = [sent1, sent2]
    
    tfidf_vec = TfidfVectorizer()
    # Fit once and reuse the matrix instead of refitting for every lookup.
    vectors = tfidf_vec.fit_transform(sentences).toarray()
    
    print(vectors)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    print(tfidf_vec.get_feature_names_out().tolist())
    
    vec_1, vec_2 = vectors
    # count_cos_similarity is the function defined in the first section.
    print(count_cos_similarity(vec_1, vec_2))

Output:

    [[0.         0.37729199 0.37729199 0.         0.37729199 0.37729199
      0.         0.         0.53689271 0.37729199 0.        ]
     [0.37729199 0.         0.         0.37729199 0.         0.
      0.37729199 0.37729199 0.53689271 0.         0.37729199]]
    ['across', 'bedroom', 'cat', 'dog', 'in', 'is', 'kitchen', 'running', 'the', 'walking', 'was']
    0.28825378403927704

Note: the output has the same layout as in the TF example.
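
To see where these weights come from, the following standard-library sketch recomputes them from the raw counts. It assumes scikit-learn's default `TfidfVectorizer` settings, namely a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row; the helper name `tfidf_row` is hypothetical.

```python
import math

# Raw term counts for the two sentences, one column per vocabulary word.
counts = [
    [0, 1, 1, 0, 1, 1, 0, 0, 2, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF (assumed to match scikit-learn's default, smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    # Weight each count by its IDF, then scale the row to unit L2 length.
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]

tfidf = [tfidf_row(row) for row in counts]

# Rows have unit length, so their dot product is already the cosine similarity.
similarity = sum(a * b for a, b in zip(tfidf[0], tfidf[1]))
print(tfidf[0][8], tfidf[0][1], similarity)
```

The three printed numbers are the weight of "the", the weight of a word unique to the first sentence, and the similarity; they should agree with the output above up to rounding. Because "the" occurs in both documents its IDF is the minimum value of 1, which is why TF-IDF lowers the similarity relative to plain TF.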

### Summary ###

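The example above uses two sentences:  
  "the cat is walking in the bedroom."  
  "the dog was running across the kitchen."  
These two sentences are in fact very similar in meaning, yet the computed similarity is low. The underlying reason is that the vectors produced by both methods only capture how much wording the texts share (here, only the word "the"), and can hardly capture semantic similarity.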
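
A quick way to see this is to compare the two vocabularies directly. The sketch below uses only the standard library; the helper `tokens` and its tokenization rule (lowercase, words of two or more characters) are assumptions for illustration.

```python
import re

sent1 = "the cat is walking in the bedroom."
sent2 = "the dog was running across the kitchen."

def tokens(text):
    # Assumed rule: lowercase the text and keep words of two or more characters.
    return set(re.findall(r"\b\w\w+\b", text.lower()))

shared = tokens(sent1) & tokens(sent2)
only_1 = tokens(sent1) - shared
only_2 = tokens(sent2) - shared

print(shared)          # {'the'}
print(sorted(only_1))
print(sorted(only_2))
```

Related pairs such as "cat" and "dog" or "bedroom" and "kitchen" sit in separate dimensions, so they contribute nothing to the dot product of count-based vectors.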

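This post is very basic. Part of the reason for writing it is to provide a baseline for comparison with text similarity computed from word2vec, which is covered in the next post, [Word Vectorization: An Introduction to word2vec and Its Usage][_ word2vec].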

--------------------

#### Source: [宇毅][Link 1] ####

[_ word2vec]: https://blog.csdn.net/weixin_43283397/article/details/107419830
[Link 1]: https://blog.csdn.net/xsdxs/article/details/72951326