How to get nested elements in Beautiful Soup

Time: 2023-01-06 17:03:30

I am struggling with the syntax required to grab some hrefs in a td. The table, tr and td elements don't have any classes or ids.

If I wanted to grab the anchor in this example, what would I need?

<tr><td><a>...

...

Thanks

2 solutions

#1


As per the docs, you first build a parse tree:

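from bs4 import BeautifulSoup  # BeautifulSoup 4 package; the original answer imported the older BeautifulSoup 3 module
html = "<html><body><tr><td><a href='foo'/></td></tr></body></html>"
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly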

and then you search in it, for example for <a> tags whose immediate parent is a <td>:

for ana in soup.findAll('a'):
    # keep only anchors that sit directly inside a <td>
    if ana.parent.name == 'td':
        print(ana["href"])
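The same "immediate parent is a <td>" test can also be written as a CSS selector. This is only a sketch, assuming BeautifulSoup 4 (the bs4 package) with its default selector support; the sample markup and the name hrefs are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one cell with a link, nothing else.
html = "<html><body><table><tr><td><a href='foo'>link</a></td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# "td > a[href]" matches <a> tags that are direct children of a <td>
# and that actually carry an href attribute.
hrefs = [a["href"] for a in soup.select("td > a[href]")]
print(hrefs)  # ['foo']
```

Dropping the ">" (selector "td a[href]") would match anchors nested at any depth inside a cell rather than only direct children.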

#2


Something like this?

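from bs4 import BeautifulSoup  # BeautifulSoup 4 package
# html is the markup string you want to parse
soup = BeautifulSoup(html, "html.parser")
anchors = [td.find('a') for td in soup.findAll('td')]  # first <a> in each <td>, or None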

That should find the first "a" inside each "td" in the HTML you provide. You can tweak td.find to be more specific, or use findAll if you have several links inside each td.
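For the several-links-per-cell case, a minimal sketch might look like the following. It assumes BeautifulSoup 4, where find_all is the current spelling of findAll; the sample table and the name links_per_cell are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical table: two links in the first cell, none in the second.
html = """
<table>
  <tr>
    <td><a href="a.html">A</a> <a href="b.html">B</a></td>
    <td>no link here</td>
    <td><a href="c.html">C</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# One inner list per <td>; href=True skips anchors without an href.
links_per_cell = [
    [a["href"] for a in td.find_all("a", href=True)]
    for td in soup.find_all("td")
]
print(links_per_cell)  # [['a.html', 'b.html'], [], ['c.html']]
```

Cells without links simply produce an empty list, so no None check is needed in this variant.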

UPDATE: re Daniele's comment, if you want to make sure the list contains no None values, you can modify the list comprehension like this:

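from bs4 import BeautifulSoup  # BeautifulSoup 4 package
# html is the markup string you want to parse
soup = BeautifulSoup(html, "html.parser")
# td.find('a') returns None for cells without a link; drop those entries
anchors = [a for a in (td.find('a') for td in soup.findAll('td')) if a is not None]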

This just adds a check that td.find('a') actually returned an element.
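To see why the check matters, here is a small sketch with one cell that has no link. It assumes BeautifulSoup 4; the markup and the names unfiltered and hrefs are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the second cell contains no <a> at all.
html = "<table><tr><td><a href='foo'>x</a></td><td>plain text</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Without the check, the link-less cell contributes None.
unfiltered = [td.find("a") for td in soup.find_all("td")]

# With the check, only real anchor elements remain.
anchors = [a for a in unfiltered if a is not None]
hrefs = [a.get("href") for a in anchors]
print(hrefs)  # ['foo']
```

Using a.get("href") instead of a["href"] returns None rather than raising KeyError if an anchor happens to lack the attribute.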
