Parsing HTML in Python: lxml or BeautifulSoup? Which is better for what kinds of purposes?

Time: 2022-08-26 12:02:41

From what I can make out, the two main HTML parsing libraries in Python are lxml and BeautifulSoup. I've chosen BeautifulSoup for a project I'm working on, but I chose it for no particular reason other than finding the syntax a bit easier to learn and understand. But I see a lot of people seem to favour lxml and I've heard that lxml is faster.

So I'm wondering what are the advantages of one over the other? When would I want to use lxml and when would I be better off using BeautifulSoup? Are there any other libraries worth considering?
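
For context, here is a minimal sketch of the same extraction in both libraries (using the modern bs4 package for BeautifulSoup; the sample HTML is made up for illustration):

html = '<div><a href="/a">one</a> <a href="/b">two</a></div>'

# BeautifulSoup: search the parse tree with Python method calls
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print([a['href'] for a in soup.find_all('a')])   # ['/a', '/b']

# lxml: query the tree with XPath
import lxml.html
root = lxml.html.fromstring(html)
print(root.xpath('//a/@href'))                   # ['/a', '/b']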

7 Answers

#1


22  

For starters, BeautifulSoup is no longer actively maintained, and the author even recommends alternatives such as lxml.

Quoting from the linked page:

Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does. The most common problems are handling tags incorrectly, "malformed start tag" errors, and "bad end tag" errors. This page explains what happened, how the problem will be addressed, and what you can do right now.

This page was originally written in March 2009. Since then, the 3.2 series has been released, replacing the 3.1 series, and development of the 4.x series has gotten underway. This page will remain up for historical purposes.

tl;dr

Use 3.2.0 instead.

#2


25  

Pyquery provides the jQuery selector interface to Python (using lxml under the hood).

http://pypi.python.org/pypi/pyquery

It's really awesome; I don't use anything else anymore.
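
A minimal sketch of what that looks like (the sample HTML is made up; the selector API shown is pyquery's real interface):

from pyquery import PyQuery as pq

d = pq('<div><p class="intro">Hello</p><p>world</p></div>')
print(d('p.intro').text())                  # jQuery/CSS-style selection -> Hello
print([p.text() for p in d('p').items()])   # iterate over matches -> ['Hello', 'world']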

#3


13  

In summary, lxml is positioned as a lightning-fast, production-quality HTML and XML parser that, incidentally, also includes a soupparser module to fall back on BeautifulSoup's functionality. BeautifulSoup is a one-person project, designed to save you time by quickly extracting data out of poorly-formed HTML or XML.

lxml documentation says that both parsers have advantages and disadvantages. For this reason, lxml provides a soupparser so you can switch back and forth. Quoting,

BeautifulSoup uses a different parsing approach. It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection. It very much depends on the input which parser works better.

In the end they are saying,

The downside of using this parser is that it is much slower than the HTML parser of lxml. So if performance matters, you might want to consider using soupparser only as a fallback for certain cases.

If I understand them correctly, it means that the soup parser is more robust --- it can deal with a "soup" of malformed tags by using regular expressions --- whereas lxml is more straightforward and just parses things and builds a tree as you would expect. I assume it also applies to BeautifulSoup itself, not just to the soupparser for lxml.
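
A sketch of that "soupparser only as a fallback" pattern (a hypothetical helper, not taken from the lxml docs; note that lxml's own parser recovers from most broken HTML by itself, so the slow soupparser path is only hit when lxml gives up entirely):

from lxml import etree
import lxml.html
import lxml.html.soupparser

def parse_html(raw):
    try:
        # Fast path: lxml's C parser.
        return lxml.html.fromstring(raw)
    except etree.ParserError:
        # Forgiving (but much slower) path: parse via BeautifulSoup.
        return lxml.html.soupparser.fromstring(raw)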

They also show how to benefit from BeautifulSoup's encoding detection, while still parsing quickly with lxml:

>>> import lxml.html
>>> from BeautifulSoup import UnicodeDammit

>>> def decode_html(html_string):
...     converted = UnicodeDammit(html_string, isHTML=True)
...     if not converted.unicode:
...         raise UnicodeDecodeError(
...             "Failed to detect encoding, tried [%s]",
...             ', '.join(converted.triedEncodings))
...     # print converted.originalEncoding
...     return converted.unicode

>>> root = lxml.html.fromstring(decode_html(tag_soup))

(Same source: http://lxml.de/elementsoup.html).

In the words of BeautifulSoup's creator,

That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.

 --Leonard

Quoted from the Beautiful Soup documentation.

I hope this is now clear. Beautiful Soup is a brilliant one-person project designed to save you time extracting data out of poorly-designed websites. The goal is to save you time right now, to get the job done; not necessarily to save you time in the long term, and definitely not to optimize the performance of your software.

Also, from the lxml website,

lxml has been downloaded from the Python Package Index more than two million times and is also available directly in many package distributions, e.g. for Linux or MacOS-X.

And, from Why lxml?,

The C libraries libxml2 and libxslt have huge benefits:... Standards-compliant... Full-featured... fast. fast! FAST! ... lxml is a new Python binding for libxml2 and libxslt...

#4


11  

Don't use BeautifulSoup directly; use lxml.soupparser. Then you're sitting on top of the power of lxml, and can still use the good bit of BeautifulSoup, which is dealing with really broken and crappy HTML.
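
As a minimal sketch (assuming both lxml and BeautifulSoup are installed), soupparser hands back an ordinary lxml tree, so XPath and the rest of lxml's API still apply:

from lxml.html import soupparser

# BeautifulSoup does the forgiving parse; lxml takes over from there.
root = soupparser.fromstring('<p>really <b>broken<p>HTML')
print([p.text_content() for p in root.xpath('//p')])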

#5


5  

I've used lxml with great success for parsing HTML. It seems to do a good job of handling "soupy" HTML, too. I'd highly recommend it.

Here's a quick test I had lying around that tries handling some ugly HTML:

import unittest
from io import StringIO
from lxml import etree

class TestLxmlStuff(unittest.TestCase):
    bad_html = """
        <html>
            <head><title>Test!</title></head>
            <body>
                <h1>Here's a heading
                <p>Here's some text
                <p>And some more text
                <b>Bold!</b></i>
                <table>
                   <tr>row
                   <tr><td>test1
                   <td>test2
                   </tr>
                   <tr>
                   <td colspan=2>spanning two
                </table>
            </body>
        </html>"""

    def test_soup(self):
        """Test lxml's parsing of really bad HTML"""
        parser = etree.HTMLParser()
        tree = etree.parse(StringIO(self.bad_html), parser)
        self.assertEqual(len(tree.xpath('//tr')), 3)
        self.assertEqual(len(tree.xpath('//td')), 3)
        self.assertEqual(len(tree.xpath('//i')), 0)
        #print(etree.tostring(tree.getroot(), pretty_print=False, method="html"))

if __name__ == '__main__':
    unittest.main()

#6


1  

I would definitely use EHP. It is faster than lxml, and much more elegant and simpler to use.

Check it out: https://github.com/iogf/ehp


from ehp import *

data = '''<html> <body> <em> Hello world. </em> </body> </html>'''

html = Html()
dom = html.feed(data)

for ind in dom.find('em'):
    print(ind.text())

Output:

Hello world. 

#7


0  

A somewhat outdated speed comparison can be found here, which clearly recommends lxml, as the speed differences seem drastic.
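
If you want current numbers, a rough harness is easy to write yourself (a sketch; 'page.html' is a placeholder for whatever document you benchmark against):

import timeit

setup = """
import lxml.html
from bs4 import BeautifulSoup
page = open('page.html').read()   # placeholder input file
"""
print('lxml:', timeit.timeit('lxml.html.fromstring(page)', setup=setup, number=50))
print('bs4: ', timeit.timeit("BeautifulSoup(page, 'html.parser')", setup=setup, number=50))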
