[Python] Text Word-Frequency Statistics

我不是女神ヾ 2023-07-25 09:19


Hamlet (English text)

https://python123.io/resources/pye/hamlet.txt

Romance of the Three Kingdoms (Chinese text)

https://python123.io/resources/pye/threekingdoms.txt


Word-frequency analysis of Hamlet (English)

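    def getText():
        # Open the text; replace "hamlet.txt" with the actual path to the file
        with open("hamlet.txt", "r") as f:
            txt = f.read()
        txt = txt.lower()  # convert all English letters to lowercase
        # Replace punctuation and other special characters with spaces
        for ch in '!"#$%&()*+,-./;:<=>?@[\\]^`_{|}~':
            txt = txt.replace(ch, " ")
        return txt

    hamletTxt = getText()  # load and normalize the text
    words = hamletTxt.split()  # split on whitespace into a list of words
    counts = {}  # dictionary mapping word -> number of occurrences
    for word in words:
        # If the word is already in the dictionary, add 1 to its count;
        # otherwise start from 0, which records it with a count of 1
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())  # convert to a list of (word, count) pairs
    items.sort(key=lambda x: x[1], reverse=True)  # sort by count, highest first
    for i in range(10):  # print the 10 most frequent words with their counts
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))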
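The same count can be sketched more compactly with the standard library's `collections.Counter`, whose `most_common` method replaces the manual sort. This is an alternative formulation, not the code used above: the function name `top_words` and the sample sentence are illustrative assumptions, and `string.punctuation` also strips the apostrophe, which the character list above leaves in place.

```python
import string
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    # Map every ASCII punctuation character to a space, then split on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)


if __name__ == "__main__":
    # Made-up sample input; in practice pass the contents of hamlet.txt.
    sample = "To be, or not to be: that is the question."
    for word, count in top_words(sample, 3):
        print("{0:<10}{1:>5}".format(word, count))
```

Passing the full text of hamlet.txt to `top_words` instead of the sample sentence should give a ranking comparable to the loop above, apart from words containing apostrophes.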


Character appearance counts in Romance of the Three Kingdoms

    import jieba  # jieba Chinese word-segmentation library

    # Open the text
    with open("threekingdoms.txt", "r", encoding="utf-8") as f:
        txt = f.read()
    words = jieba.lcut(txt)  # segment the text into a list of words
    counts = {}  # dictionary used to count each Chinese word in words
    for word in words:
        if len(word) == 1:  # skip single-character tokens
            continue
        else:
            counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())  # convert to a list and sort by count
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(15):  # print the 15 most frequent words
        word, count = items[i]
        print("{0:<10}{1:<5}".format(word, count))

Result:

[Figure: screenshot of the program's output]

The output above includes words that are not character names, so the program needs to be revised:

    import jieba

    with open("threekingdoms.txt", "r", encoding="utf-8") as f:
        txt = f.read()
    # Words that are not character names; add more to this set as they show up
    excludes = {
        "将军", "却说", "荆州", "二人", "不可", "不能", "如此", "主公",
        "军士", "商议", "如何", "左右", "军马", "引兵", "次日", "大喜",
        "天下", "东吴", "于是", "今日", "不敢", "魏兵", "陛下", "一人",
        "都督", "人马", "不知"}
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        # Merge different names for the same character so they are counted together
        if len(word) == 1:
            continue
        elif word == "诸葛亮" or word == "孔明曰":
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1
    for word in excludes:
        # pop with a default avoids a KeyError if an excluded word never occurs
        counts.pop(word, None)
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(10):  # print the 10 most frequent names
        word, count = items[i]
        print("{0:<10}{1:<5}".format(word, count))
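The `elif` chain grows with every new alias, so one possible restructuring is a lookup table. The sketch below is an assumption-laden variant rather than the code above: it expects the text to be segmented already (for example by `jieba.lcut`) and takes the token list as input. The names `ALIASES`, `EXCLUDES`, and `count_characters` are illustrative, `EXCLUDES` holds only a small subset of the exclusion set, and the sample tokens are made up for the demonstration.

```python
from collections import Counter

# Alias -> canonical name, mirroring the elif chain above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}

# A small subset of the exclusion set, for illustration only.
EXCLUDES = {"将军", "却说", "荆州"}


def count_characters(tokens, n=10):
    """Count canonical names in an already-segmented token list."""
    counts = Counter()
    for token in tokens:
        if len(token) == 1:  # skip single-character tokens
            continue
        name = ALIASES.get(token, token)  # fall back to the token itself
        if name in EXCLUDES:  # skip words known not to be names
            continue
        counts[name] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up tokens; in practice pass jieba.lcut(txt).
    tokens = ["玄德", "曰", "孔明", "诸葛亮", "将军", "玄德曰", "丞相", "孔明曰"]
    for name, count in count_characters(tokens):
        print("{0:<10}{1:<5}".format(name, count))
```

With this layout, adding an alias or an excluded word means editing a data structure rather than the control flow.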

Result:

[Figure: screenshot of the program's output]

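Keep refining in the same way: add any remaining non-name words to the exclusion set and run the program again.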

