Summarizing Content with N-grams

Jasmine

Published: 2019-04-25
*When reposting, please credit write-bug.com

Summarizing Data with an N-Gram Model (in Python)

  • Python 2.7

  • IDE PyCharm 5.0.3

What is an N-Gram model?

Natural language processing has a model called the n-gram: a sequence of n consecutive words in text or speech. When analyzing natural language, working with n-grams, or looking for frequently used phrases, makes it easy to break a sentence down into word-level fragments. — from Web Scraping with Python [Ryan Mitchell]

Put simply, the goal is to find the core topic words. How do we decide which words are "core"? As a rule of thumb, the words that repeat most often — that is, are mentioned the most — are the ones the text most needs to express, so frequency is our signal. The examples below build on this idea.
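As a warm-up, the sliding-window idea behind n-grams can be sketched in a few lines (Python 3 syntax here; the function name `ngrams` is mine, not from the book):

```python
def ngrams(words, n):
    # slide a window of width n across the word list
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the quick brown fox".split(), 2))
# → ['the quick', 'quick brown', 'brown fox']
```

Counting how often each of these windows repeats is all the "model" we need for summarization.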

A quick warm-up

Both of these appear in the example later, so let's try them out on their own first.

  • string.punctuation returns all the punctuation characters; pair it with strip()

```python
import string
words = ['a,', 'b!', 'cj!/n']
items = []
for i in words:
    i = i.strip(string.punctuation)  # strip leading/trailing punctuation
    items.append(i)
print items
```

Output:

```
['a', 'b', 'cj!/n']
```
  • operator.itemgetter()

The itemgetter function in the operator module fetches the items at given positions in an object; its arguments are the indices of the items to fetch.

Example:

```python
import operator
dict_ = {'name1': '2',
         'name2': '1'}
# dict_.items() gives the (key, value) pairs; itemgetter(0) sorts by key
print sorted(dict_.items(), key=operator.itemgetter(0), reverse=True)
```

Output:

```
[('name2', '1'), ('name1', '2')]
```

Of course, you can also pass a lambda directly as the sort key (this one sorts by value instead):

```python
dict_ = {'name1': '2',
         'name2': '1'}
print sorted(dict_.iteritems(), key=lambda x: x[1], reverse=True)
```
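For the record, operator.itemgetter(1) does the same job as that lambda, sorting by value. (Shown here with Python 3's items(), since iteritems() no longer exists in Python 3.)

```python
import operator

dict_ = {'name1': '2', 'name2': '1'}
# itemgetter(1) picks the value out of each (key, value) pair
print(sorted(dict_.items(), key=operator.itemgetter(1), reverse=True))
# → [('name1', '2'), ('name2', '1')]
```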

2-gram

Let's take n = 2, two-word sequences, and walk through an annotated example.

```python
import urllib2
import re
import string
import operator

def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces
    input = re.sub('\[[0-9]*\]', "", input)    # drop citation marks like [1]
    input = re.sub(' +', " ", input)           # collapse runs of spaces
    input = bytes(input)                       # keep a plain byte string (a no-op in Python 2)
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanedInput = []
    input = input.split(' ')                   # split on spaces into a word list
    for item in input:
        item = item.strip(string.punctuation)  # strip leading/trailing punctuation
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            # keep real words, including the one-letter words "a" and "i"
            cleanedInput.append(item)
    return cleanedInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}                                # dict mapping n-gram -> count
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:            # frequency count
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

# Option 1: read the page directly
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file -- handy for testing, since it needs no network
#content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: descending
print(sortedNGrams)
```

Output:

```
[('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ... and so on]
```

The example above ranks two-word sequences by how often they occur, but that is not quite what we want. An "of the" that shows up two hundred-plus times tells us nothing, so the next step is to strip out these connectives and prepositions.

Deeper

The complete code and test screenshots are in 2_gram.ipynb in the same directory. To try it, download the project and run it with Jupyter (install Jupyter first if you don't have it).

```python
# -*- coding: utf-8 -*-
import urllib2
import re
import string
import operator

# Filter out common words (stop words)
def isCommon(ngram):
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have",
        "it", "i", "that", "for", "you", "he", "with", "on", "do", "say",
        "this", "they", "is", "an", "at", "but", "we", "his", "from", "that",
        "not", "by", "she", "or", "as", "what", "go", "their", "can", "who",
        "get", "if", "would", "her", "all", "my", "make", "about", "know",
        "will", "as", "up", "one", "time", "has", "been", "there", "year", "so",
        "think", "when", "which", "them", "some", "me", "people", "take", "out",
        "into", "just", "see", "him", "your", "come", "could", "now", "than",
        "like", "other", "how", "then", "its", "our", "two", "more", "these",
        "want", "way", "look", "first", "also", "new", "because", "day", "more",
        "use", "no", "man", "find", "here", "thing", "give", "many", "well"]
    return ngram in commonWords

def cleanText(input):
    input = re.sub('\n+', " ", input).lower()  # replace newlines with spaces
    input = re.sub('\[[0-9]*\]', "", input)    # drop citation marks like [1]
    input = re.sub(' +', " ", input)           # collapse runs of spaces
    input = bytes(input)                       # keep a plain byte string (a no-op in Python 2)
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanedInput = []
    input = input.split(' ')                   # split on spaces into a word list
    for item in input:
        item = item.strip(string.punctuation)  # strip leading/trailing punctuation
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            # keep real words, including the one-letter words "a" and "i"
            cleanedInput.append(item)
    return cleanedInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}                                # dict mapping n-gram -> count
    for i in range(len(input) - n + 1):
        ngramTemp = " ".join(input[i:i+n])
        # skip the 2-gram when either of its words is a stop word
        if isCommon(ngramTemp.split()[0]) or isCommon(ngramTemp.split()[1]):
            continue
        if ngramTemp not in output:            # frequency count
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

# Find the first sentence that contains a key phrase
def getFirstSentenceContaining(ngram, content):
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""

# Option 1: read the page directly
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file -- handy for testing, since it needs no network
#content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True: descending
print(sortedNGrams)
for top3 in range(3):
    print "###" + getFirstSentenceContaining(sortedNGrams[top3][0], content.lower()) + "###"
```

Output:

```
[('united states', 10), ('general government', 4), ('executive department', 4), ('legisltive bojefferson', 3), ('same causes', 3), ('called upon', 3), ('chief magistrate', 3), ('whole country', 3), ('government should', 3), ... and so on]
### the constitution of the united states is the instrument containing this grant of power to the several departments composing the government###
### the general government has seized upon none of the reserved rights of the states###
### such a one was afforded by the executive department constituted by the constitution###
```

As this example shows, we filtered the word list, dropped the connectives, ranked the remaining key words by frequency, and then pulled out the sentences containing them. Here I only grabbed the top three sentences; for an article of two or three hundred sentences, summarizing it in three or four lines is, I think, rather neat.
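For readers on Python 3, the whole pipeline above can be restated compactly with collections.Counter. This is my own sketch of the same idea, not the article's code: the stop-word set is truncated for brevity, and the names (top_ngrams, first_sentence_with) are made up for illustration.

```python
import re
import string
from collections import Counter

# Truncated stop-word set, for illustration only
STOP = {"the", "of", "in", "to", "and", "a", "by", "is", "has"}

def clean_words(text):
    text = re.sub(r'\n+', ' ', text).lower()   # newlines -> spaces
    text = re.sub(r'\[[0-9]*\]', '', text)     # drop [1]-style citation marks
    text = re.sub(r' +', ' ', text)            # collapse runs of spaces
    words = [w.strip(string.punctuation) for w in text.split(' ')]
    return [w for w in words if len(w) > 1 or w in ('a', 'i')]

def top_ngrams(text, n=2, k=3):
    words = clean_words(text)
    grams = (' '.join(words[i:i + n]) for i in range(len(words) - n + 1))
    # count only the n-grams containing no stop words
    counted = Counter(g for g in grams if not set(g.split()) & STOP)
    return counted.most_common(k)

def first_sentence_with(phrase, text):
    # return the first sentence that mentions the key phrase
    for sentence in text.lower().split('.'):
        if phrase in sentence:
            return sentence.strip()
    return ''

text = ("The united states is large. "
        "The united states has many states. Good day.")
print(top_ngrams(text, n=2, k=1))              # → [('united states', 2)]
print(first_sentence_with('united states', text))
```

Counter.most_common(k) replaces the manual dict bookkeeping and the sorted(..., key=operator.itemgetter(1)) step in one call.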
