Automatic Summaries for Articles and Blog Posts in Python (with an Open-Source Java Project)

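When writing a blog post, it is customary to add a short summary to the article. That step can now be automated.
Background reading: "Applications of TF-IDF and Cosine Similarity (3): Automatic Summarization", from Ruan Yifeng's blog (阮一峰的网络日志)
http://www.ruanyifeng.com/blog/2013/03/automatic_summarization.html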

idf.txt comes from https://github.com/jannson/yaha/blob/master/yaha/analyse/idf.txt
It is part of yaha, a Python word-segmentation library: https://github.com/jannson/yaha
Using summarize3 requires the numpy library.
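
The general idea behind this kind of automatic summary can be sketched as follows: score each sentence by the weight of the words it contains, then keep the highest-scoring sentences. This is a simplified illustration for Python 3, not the algorithm used by summarize3 or described in the linked article. The tokenizer is a naive regex (Chinese text would need a real segmenter such as yaha), the stop-word list is made up, and the idf argument is a hypothetical in-memory mapping standing in for a table like idf.txt, whose file format is not assumed here.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOPWORDS = {'the', 'is', 'a', 'an', 'of', 'and', 'to', 'in'}


def split_sentences(text):
    """Split text into sentences, keeping the ending punctuation."""
    parts = re.findall(r'[^。！？.!?]+[。！？.!?]?', text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Naive tokenizer: lowercase runs of word characters, minus stop words."""
    words = re.findall(r'\w+', sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def summarize(text, max_sentences=2, idf=None):
    """Return the highest-scoring sentences, in their original order."""
    idf = idf or {}
    sentences = split_sentences(text)
    tokens = [tokenize(s) for s in sentences]
    tf = Counter(w for words in tokens for w in words)

    def score(words):
        # Average TF * IDF weight; words missing from `idf` get a neutral 1.0.
        if not words:
            return 0.0
        return sum(tf[w] * idf.get(w, 1.0) for w in words) / len(words)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokens[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]
```

Calling summarize(text, 2) returns at most two sentences; passing an idf mapping shifts the ranking toward sentences that contain rarer words.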

GitHub - jannson/yaha: yaha
https://github.com/jannson/yaha

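Basic features:
Exact mode: cuts a sentence into the most reasonable words.
Full mode: emits every possible word in the sentence, without resolving ambiguity.
Search-engine mode: starting from the exact-mode result, splits long words again to improve recall; suitable for building search-engine indexes.
Alternative paths: generates the several best segmentation paths, from which a more precise segmentation can be chosen using other information.
Available plugins: regular-expression plugin, person-name prefix plugin, place-name suffix plugin.
Customization: segmentation goes through 4 stages, and custom logic can be added at each stage.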
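
The difference between the three modes can be sketched with a toy dictionary. The code below is an illustration for Python 3 under stated assumptions, not yaha's implementation or API: the FREQ table, its counts, and the function names are all hypothetical, and "exact mode" is modelled here as the highest-probability path under a simple unigram model.

```python
import math

# Hypothetical toy dictionary: word -> frequency (not yaha's data).
FREQ = {'研究': 5, '研究生': 3, '生命': 4, '命': 1, '的': 10, '起源': 3, '生': 1}
TOTAL = sum(FREQ.values())
MAX_LEN = max(len(w) for w in FREQ)


def full_mode(sentence):
    """Every dictionary word occurring anywhere in the sentence."""
    n = len(sentence)
    return [sentence[i:j]
            for i in range(n)
            for j in range(i + 1, min(n, i + MAX_LEN) + 1)
            if sentence[i:j] in FREQ]


def exact_mode(sentence):
    """Most probable segmentation under a unigram model (DP from the right)."""
    n = len(sentence)
    # best[i] = (log-probability of sentence[i:], end index of its first word)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + MAX_LEN) + 1):
            word = sentence[i:j]
            if word in FREQ or j == i + 1:  # unknown single characters allowed
                logp = math.log(FREQ.get(word, 0.5) / TOTAL)
                candidates.append((logp + best[j][0], j))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


def search_mode(sentence):
    """Exact mode, plus the shorter dictionary words inside each long word."""
    result = []
    for word in exact_mode(sentence):
        if len(word) > 2:
            result.extend(w for w in full_mode(word) if w != word)
        result.append(word)
    return result
```

With this dictionary, exact_mode('研究生命的起源') picks 研究 / 生命 / 的 / 起源, while full_mode also reports the overlapping candidates such as 研究生 and 生.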

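Additional features:
New-word learning: given a large body of text, learns the new and existing words it contains. (A C++ maximum-entropy new-word discovery implementation, written by a friend of the library's author, has been added; it is roughly 10 times faster than the Python version.)
Keyword extraction from long text.
Summary extraction from long text.
Word correction (new; commonly used in search to correct mistaken user input).
User-defined dictionaries (TODO: not yet well implemented).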
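
Keyword extraction of the kind listed above is commonly done with TF-IDF. The following is a minimal sketch for Python 3 under stated assumptions, not yaha's code: the input is already segmented into tokens, extract_keywords is a hypothetical name, and the idf mapping is a made-up in-memory table rather than the contents of idf.txt.

```python
from collections import Counter


def extract_keywords(tokens, idf, top_k=3, default_idf=1.0):
    """Rank tokens by term frequency times inverse document frequency."""
    if not tokens:
        return []
    tf = Counter(tokens)
    total = len(tokens)
    scores = {word: (count / total) * idf.get(word, default_idf)
              for word, count in tf.items()}
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

A very frequent word with a low IDF (such as 的) ends up near the bottom even when it appears more often than anything else in the text.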

======================================

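Extracting an Article Summary in Python
1. Overview
In a blog system's article list, each entry usually shows both the title and a summary, so that readers can get a sense of the content and choose what to read.
An article may be plain text, but on today's web it is more often HTML. In either format, the summary is generally the opening part of the article, extracted up to a specified number of characters.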

2. Plain-Text Summaries
A plain-text document is just one long string, so extracting a summary from it is easy:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Get a summary of the TEXT-format document"""


def get_summary(text, count):
    """Get the first `count` characters from `text`

    >>> text = 'Welcome 这是一篇关于Python的文章'
    >>> get_summary(text, 12) == 'Welcome 这是一篇'
    True
    """
    assert isinstance(text, str)  # `str` is Unicode text in Python 3
    return text[0:count]


if __name__ == '__main__':
    import doctest
    doctest.testmod()
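
The get_summary function above always cuts at exactly count characters and gives no hint that text was dropped. A small variation, sketched here for Python 3 with a hypothetical function name, appends a suffix only when the text was actually shortened:

```python
def get_summary_with_suffix(text, count, suffix='...'):
    """Return the first `count` characters, marking the cut with `suffix`."""
    if count < 0:
        raise ValueError('count must be non-negative')
    if len(text) <= count:
        return text  # nothing was cut, so no suffix is needed
    # Drop trailing whitespace so the suffix sits right after the last word.
    return text[:count].rstrip() + suffix
```

Short articles come back unchanged, so the list view does not show a misleading ellipsis after a complete text.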

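3. HTML Summaries
An HTML document contains many markup tags (such as <h1>, <a>, and so on). These tags are markup instructions and usually come in pairs, so naive text truncation breaks the document structure and the summary then renders incorrectly in the browser.
Truncating the content while respecting the HTML structure requires parsing the document. In Python this can be done with the standard library's HTMLParser class (the HTMLParser module in Python 2, html.parser in Python 3).

The simplest form of summary extraction ignores the tags and extracts only the raw text inside them. A Python implementation along those lines: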
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Get a raw summary of the HTML-format document"""
from html.parser import HTMLParser


class SummaryHTMLParser(HTMLParser):
    """Parse HTML text to get a summary

    >>> text = '<p>Hi guys:</p><p>This is an example using SummaryHTMLParser.</p>'
    >>> parser = SummaryHTMLParser(10)
    >>> parser.feed(text)
    >>> parser.get_summary('...')
    '<p>Higuys:Thi...</p>'
    """

    def __init__(self, count):
        super().__init__()
        self.count = count
        self.summary = ''

    def feed(self, data):
        """Only accept `str` (Unicode text) as `data`"""
        assert isinstance(data, str)
        super().feed(data)

    def handle_data(self, data):
        # Called for each run of text between tags
        more = self.count - len(self.summary)
        if more > 0:
            # Remove all whitespace in `data` (this also joins English words)
            data_without_whitespace = ''.join(data.split())
            self.summary += data_without_whitespace[0:more]

    def get_summary(self, suffix='', wrapper='p'):
        return '<{0}>{1}{2}</{0}>'.format(wrapper, self.summary, suffix)


if __name__ == '__main__':
    import doctest
    doctest.testmod()

HTMLParser (or BeautifulSoup and the like) is better suited to more complex HTML summary extraction. For the simple extraction above, there is a more concise implementation than SummaryHTMLParser:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Get a raw summary of the HTML-format document"""
import re


def get_summary(text, count, suffix='', wrapper='p'):
    """A simpler implementation (vs `SummaryHTMLParser`).

    >>> text = '<p>Hi guys:</p><p>This is an example using SummaryHTMLParser.</p>'
    >>> get_summary(text, 10, '...')
    '<p>Higuys:Thi...</p>'
    """
    assert isinstance(text, str)
    summary = re.sub(r'<.*?>', '', text)  # key difference: strip tags with a regex
    summary = ''.join(summary.split())[0:count]  # drop whitespace, then truncate
    return '<{0}>{1}{2}</{0}>'.format(wrapper, summary, suffix)


if __name__ == '__main__':
    import doctest
    doctest.testmod()
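
As an illustration of the more complex case that HTMLParser is better suited to, the sketch below truncates the text while keeping the tags and closing whatever is still open at the cut point. It is written for Python 3 and is only a sketch under stated assumptions: the class and function names are hypothetical, VOID_TAGS is an incomplete hand-picked list, and comments, entities in attributes, and malformed markup are not handled with any care.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that never take a closing tag.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'meta', 'link'}


class TruncatingHTMLParser(HTMLParser):
    """Keep tags, cut the text after `count` characters, close open tags."""

    def __init__(self, count, suffix=''):
        super().__init__(convert_charrefs=True)
        self.remaining = count
        self.suffix = suffix
        self.out = []
        self.open_tags = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attr_text = ''.join(
            ' {}'.format(k) if v is None
            else ' {}="{}"'.format(k, escape(v, quote=True))
            for k, v in attrs)
        self.out.append('<{}{}>'.format(tag, attr_text))
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything up to and including the matching open tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append('</{}>'.format(top))
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[:self.remaining]
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if len(piece) < len(data):  # the text was cut: stop emitting
            self.out.append(self.suffix)
            self.done = True

    def get_summary(self):
        closing = ''.join('</{}>'.format(t) for t in reversed(self.open_tags))
        return ''.join(self.out) + closing


def get_html_summary(html, count, suffix=''):
    parser = TruncatingHTMLParser(count, suffix)
    parser.feed(html)
    parser.close()
    return parser.get_summary()
```

Unlike the two raw-text versions, this one keeps inline formatting such as <b> or <a> in the summary, at the cost of noticeably more code.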

======================================

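Another good open-source implementation, in Java:
https://github.com/hankcs/HanLP
HanLP is a Java toolkit made up of a collection of models and algorithms, with the goal of bringing natural language processing into widespread use in production environments. HanLP is designed to be feature-complete, efficient, cleanly architected, built on up-to-date corpora, and customizable.
Natural language processing, Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, keyword extraction, automatic summarization, phrase extraction, pinyin, Simplified/Traditional Chinese conversion
Automatically extracting tags from an article's content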

秒客网
