Need help with Python lxml syntax for parsing HTML

Time: 2021-09-01 04:00:38

I am brand new to Python, and I need some help with the syntax for finding and iterating through HTML tags using lxml. Here are the use cases I am dealing with:

The HTML file is fairly well formed (but not perfect). It has multiple tables on screen: one containing a set of search results, and one each for a header and footer. Each result row contains a link to the search result detail.

  1. I need to find the middle table with the search result rows (this one I was able to figure out):

        self.mySearchTables = self.mySearchTree.findall(".//table")
        self.myResultRows = self.mySearchTables[1].findall(".//tr")
    
  2. I need to find the links contained in this table (this is where I'm getting stuck):

        for searchRow in self.myResultRows:
            searchLink = searchRow.findall(".//a")
    

    It doesn't seem to actually locate the link elements.

  3. I need the plain text of the link. I imagine it would be something like searchLink.text if I actually got the link elements in the first place.

Finally, in the actual API reference for lxml, I wasn't able to find information on the find and findall calls. I gleaned these from bits of code I found on Google. Am I missing something about how to effectively find and iterate over HTML tags using lxml?

2 answers

#1


27  

Okay, first, with regard to parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the BeautifulSoup parser interface included with lxml (lxml.html.soupparser). That way you will also reap the benefit of a nice XPath or CSS selector interface.

However, I personally prefer Ian Bicking's HTML parser, which is included in lxml as lxml.html.

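As a small illustration of why that parser suits a page that is "fairly well formed (but not perfect)", the sketch below feeds lxml.html some deliberately sloppy markup with unclosed td and tr tags. The markup and variable names are made up for the example and are not taken from the page in the question.

```python
from lxml.html import fromstring

# Hypothetical, deliberately sloppy markup: the <td> and <tr> tags
# are never closed.
BROKEN_HTML = """
<html><body>
<table>
  <tr><td><a href="/detail/1">First result</a>
  <tr><td><a href="/detail/2">Second result</a>
</table>
</body></html>
"""

tree = fromstring(BROKEN_HTML)

# The links are still reachable even though the markup is incomplete.
hrefs = [a.get("href") for a in tree.iter("a")]
texts = [a.text for a in tree.iter("a")]

print(hrefs)
print(texts)
```

How the parser repairs a given page depends on the kind of damage, so it is worth printing the parsed tree for a real document before relying on its structure.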

Secondly, .find() and .findall() come from lxml trying to be compatible with ElementTree, and those two methods are described in XPath Support in ElementTree.

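As a rough sketch of how findall() could cover the use case in the question, assuming a made-up page with a header table, a results table, and a footer table: findall() returns a list of elements, so the link text has to be read from each element rather than from the list itself. All names and markup below are illustrative.

```python
from lxml.html import fromstring

# Hypothetical page: header table, results table, footer table.
SAMPLE_HTML = """
<html><body>
  <table><tr><td>Header</td></tr></table>
  <table>
    <tr><td><a href="/detail/1">First result</a></td></tr>
    <tr><td><a href="/detail/2">Second <b>result</b></a></td></tr>
  </table>
  <table><tr><td>Footer</td></tr></table>
</body></html>
"""

tree = fromstring(SAMPLE_HTML)

# findall() returns a list of matching elements in document order.
tables = tree.findall(".//table")
result_rows = tables[1].findall(".//tr")

link_texts = []
for row in result_rows:
    for link in row.findall(".//a"):
        # link.text stops at the first child tag; text_content() also
        # includes the text of nested elements such as <b>.
        link_texts.append(link.text_content())

print(link_texts)
```

The choice of text_content() over .text matters only when a link contains nested tags; for plain links both give the same string.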

Those two methods are fairly easy to use, but they support only a very limited subset of XPath. I recommend trying either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method.

Here are some examples, with an HTML string parsed like this:

    from lxml.html import fromstring
    mySearchTree = fromstring(your_input_string)

Using the CSS selector interface, your program would look roughly like this:

    # Find all 'a' elements inside 'tr' table rows with a CSS selector
    # (recent lxml releases need the separate cssselect package for this)
    for a in mySearchTree.cssselect('tr a'):
        print('found "%s" link to href "%s"' % (a.text, a.get('href')))

The equivalent using the xpath() method would be:

    # Find all 'a' elements inside 'tr' table rows with XPath
    # ('//' matches at any depth, like the descendant selector 'tr a')
    for a in mySearchTree.xpath('.//tr//a'):
        print('found "%s" link to href "%s"' % (a.text, a.get('href')))
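
To tie this back to the question, here is a minimal sketch of how xpath() might select only the links in the second table and collect their text. It uses a positional predicate and the contains() function, which go beyond the ElementTree-style path subset. The sample markup, the /detail/ link pattern, and the variable names are assumptions made for illustration.

```python
from lxml.html import fromstring

# Hypothetical page: header table, results table, footer table.
SAMPLE_HTML = """
<html><body>
  <table><tr><td><a href="/home">Home</a></td></tr></table>
  <table>
    <tr><td><a href="/detail/1">First result</a></td></tr>
    <tr><td><a href="/detail/2">Second <b>result</b></a></td></tr>
  </table>
  <table><tr><td>Footer</td></tr></table>
</body></html>
"""

tree = fromstring(SAMPLE_HTML)

# (//table)[2] picks the second table in the document
# (XPath positions start at 1, not 0).
links = tree.xpath("(//table)[2]//tr//a")
results = [(a.text_content().strip(), a.get("href")) for a in links]

# xpath() can also return attribute values directly as strings and
# supports functions such as contains().
detail_hrefs = tree.xpath("//a[contains(@href, '/detail/')]/@href")

print(results)
print(detail_hrefs)
```

Selecting the table by position is as fragile as indexing mySearchTables[1]; if the real page gives the results table an id or class, matching on that attribute is likely to be more robust.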

#2


5  

Is there a reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.
