Extract text from an HTML file using Python

Date: 2022-11-12 10:54:17

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.

I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.

29 Answers

#1


96  

html2text is a Python program that does a pretty good job at this.

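A minimal usage sketch (assuming the package is installed via pip install html2text; ignore_links and ignore_images are html2text options, though output details vary between versions):

import html2text

h = html2text.HTML2Text()
h.ignore_links = True    # drop hyperlink targets
h.ignore_images = True   # drop image references
plain = h.handle("<p>Hello, <b>world</b>!</p>")
print(plain)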

#2


100  

NOTE: NLTK no longer supports the clean_html function.

Original answer below, and an alternative in the comments section.


Use NLTK

I wasted 4-5 hours fixing the issues with html2text. Luckily, I then came across NLTK.
It works like magic.

# Python 2 code; nltk.clean_html() was removed in NLTK 3.x (see the note above)
import nltk
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)

#3


96  

The best piece of code I found for extracting text without picking up JavaScript or other unwanted things:

from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4

#4


52  

I found myself facing just the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.

from HTMLParser import HTMLParser  # Python 2; in Python 3 use: from html.parser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

#5


13  

Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-html inverse converter.

"""
HTML <-> text conversions.
"""
# Python 2 imports; in Python 3 use html.parser and html.entities instead
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)

#6


8  

You can use html2text method in the stripogram library also.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram, run: sudo easy_install stripogram

#7


7  

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide what tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

#8


6  

PyParsing does a great job. Paul McGuire has several scripts that are easy to adapt for various uses on the pyparsing wiki (http://pyparsing.wikispaces.com/Examples). One reason for investing a little time in pyparsing is that he has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.

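As a rough sketch of that approach (assuming html_string holds your markup; the anyOpenTag/anyCloseTag helpers are spelled any_open_tag/any_close_tag in pyparsing 3):

import pyparsing as pp

# suppress anything that parses as an HTML tag and keep the remaining text
tag_stripper = (pp.anyOpenTag | pp.anyCloseTag).suppress()
text = tag_stripper.transformString(html_string)
print(text)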

Having said that, I use BeautifulSoup a lot, and it is not that hard to deal with the entity issues; you can convert them before you run BeautifulSoup.

Good luck.

#9


5  

I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

#10


4  

http://pypi.python.org/pypi/webstemmer/0.5.0

http://atropine.sourceforge.net/documentation.html


Alternatively, I think you can drive lynx from Python; search on that.

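A rough sketch of driving lynx from Python (assuming the lynx binary is installed and supports -stdin, and that html_string holds your HTML):

import subprocess

# feed the HTML to lynx on stdin and capture the plain-text dump
proc = subprocess.run(['lynx', '-stdin', '-dump', '-nolist'],
                      input=html_string.encode('utf-8'),
                      stdout=subprocess.PIPE)
text = proc.stdout.decode('utf-8')
print(text)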

#11


4  

This isn't exactly a Python solution, but it will convert the text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like:

import os
import subprocess

fname = os.tmpnam()          # Python 2 only; deprecated -- tempfile.mkstemp() is the safer choice
with open(fname, 'w') as f:  # the original wrote to the name string itself, which fails
    f.write(html_source)
proc = subprocess.Popen(['links', '-dump', fname],
                        stdout=subprocess.PIPE,
                        stderr=open('/dev/null', 'w'))
text = proc.stdout.read()

#12


4  

Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console:

# Python 2 only: the htmllib and formatter modules were removed in Python 3
from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close()  # calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':('  # the html is badly malformed (or you found a bug)

#13


3  

Beautiful Soup does convert HTML entities. It's probably your best bet considering HTML is often buggy and filled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text:

import re
import copy
import BeautifulSoup  # the old BeautifulSoup 3 API (markupMassage/convertEntities do not exist in bs4)

def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    massage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(massage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

#14


3  

I recommend a Python package called goose-extractor. Goose will try to extract the following information:

- Main text of an article
- Main image of the article
- Any YouTube/Vimeo movies embedded in the article
- Meta description
- Meta tags

More: https://pypi.python.org/pypi/goose-extractor/

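A minimal sketch (assuming goose-extractor is installed and the hypothetical URL below points at a real article; Goose and cleaned_text are part of that package's interface):

from goose import Goose

url = 'http://example.com/some-article.html'  # hypothetical URL
g = Goose()
article = g.extract(url=url)
print(article.title)
print(article.cleaned_text)  # main article text with boilerplate removed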

#15


3  

If you need more speed and less accuracy, then you could use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

#16


3  

Install html2text using:

pip install html2text

then,

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

#17


2  

Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a Python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.

#18


2  

Another non-Python solution: LibreOffice:

soffice --headless --invisible --convert-to txt input1.html

The reason I prefer this one over other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats...

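If you want to drive this from Python, a minimal sketch (assuming soffice is on your PATH; the converted input1.txt is written to the working directory unless you pass --outdir):

import subprocess

# run the same headless conversion and read the result back in
subprocess.run(['soffice', '--headless', '--invisible', '--convert-to', 'txt', 'input1.html'],
               check=True)
with open('input1.txt') as f:
    text = f.read()
print(text)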

#19


2  

I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web, and this library has done an excellent job of achieving this so far in my tests. It ignores the text found in menu items and sidebars as well as any JavaScript that appears on the page, as the OP requests.

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded, you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

#20


1  

@PeYoTIL's answer using BeautifulSoup and eliminating style and script content didn't work for me. I tried it using decompose instead of extract but it still didn't work. So I created my own which also formats the text using the <p> tags and replaces <a> tags with the href link. Also copes with links inside text. Available at this gist with a test doc embedded.

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

#21


1  

In Python 3.x you can do it in a very easy way by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers.

# assumes: import imaplib, email; self.imap is an authenticated imaplib.IMAP4 connection
status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1]) 
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():       
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

Now you can print the body variable and it will be in plain-text format :) If it is good enough for you, then it would be nice to select it as the accepted answer.

#22


0  

In a simple way:

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds all parts of html_text starting with '<' and ending with '>' and replaces everything found with an empty string.

#23


0  

Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.

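A minimal sketch of that call (assuming bleach is installed; with an empty tags list and strip=True, disallowed tags are removed rather than escaped):

import bleach

html = "<p>Hello, <a href='http://example.com'>world</a>!</p>"
text = bleach.clean(html, tags=[], strip=True)
print(text)  # tags are stripped, the text content is kept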

#24


0  

Here's the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())  # webpage is expected to be a regex match object; group() yields the URL
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

I hope that helps.

#25


0  

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

The results are really good.

#26


0  

You can extract only the text from HTML with BeautifulSoup:

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)

#27


0  

I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, hence the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run / deploy in a Docker container, and from there can be accessed via Python bindings.

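A minimal sketch using the tika-python bindings (assuming the tika package is installed and can reach or auto-start a Tika server; from_buffer returns a dict with 'content' and 'metadata' keys):

from tika import parser

with open('page.html', 'rb') as f:
    parsed = parser.from_buffer(f.read())

print(parsed['content'])   # extracted plain text
print(parsed['metadata'])  # document metadata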

#28


0  

The LibreOffice Writer comment has merit since the application can employ Python macros. It seems to offer multiple benefits both for answering this question and for furthering the macro base of LibreOffice. If this is a one-off conversion, rather than part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.

#29


-1  

I am achieving it with something like this:

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text
