How to find a tag with specific text using Beautiful Soup?

Time: 2022-11-29 12:00:56

I have the following html (line breaks marked with \n):

...
<tr>
  <td class="pos">\n
      "Some text:"\n
      <br>\n
      <strong>some value</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Fixed text:"\n
      <br>\n
      <strong>text I am looking for</strong>\n
  </td>
</tr>
<tr>
  <td class="pos">\n
      "Some other text:"\n
      <br>\n
      <strong>some other value</strong>\n
  </td>
</tr>
...

How can I find "text I am looking for"? The code below returns the first value found, so I need to filter by "Fixed text:" somehow.

result = soup.find('td', {'class' :'pos'}).find('strong').text

Update: if I use the following code:

title = soup.find('td', text = re.compile(ur'Fixed text:(.*)', re.DOTALL), attrs = {'class': 'pos'})
self.response.out.write(str(title.string).decode('utf8'))

then it returns just "Fixed text:".

3 answers

#1


27  

You can pass a regular expression to the text parameter of findAll, like so:

import BeautifulSoup
import re

columns = soup.findAll('td', text = re.compile('your regex here'), attrs = {'class' : 'pos'})
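
For readers on the newer bs4 package, a minimal sketch of the same idea might look like the following. It is an illustration rather than part of the original answer: it assumes bs4's `string=` keyword (the newer spelling of `text=`), an inline copy of the question's HTML, and the illustrative names `label` and `cell`. The regex locates the text node, and the tag is reached by walking up from it.

```python
import re
from bs4 import BeautifulSoup

# Inline copy of the question's markup (assumed to sit inside a table).
html = """
<table>
<tr>
  <td class="pos">
      "Some text:"
      <br>
      <strong>some value</strong>
  </td>
</tr>
<tr>
  <td class="pos">
      "Fixed text:"
      <br>
      <strong>text I am looking for</strong>
  </td>
</tr>
<tr>
  <td class="pos">
      "Some other text:"
      <br>
      <strong>some other value</strong>
  </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching by string alone returns the matching text node, not a tag.
label = soup.find(string=re.compile(r"Fixed text:"))

# Walk up to the enclosing cell, then down to the <strong> element.
cell = label.find_parent("td", class_="pos")
result = cell.find("strong").get_text(strip=True)
print(result)
```

If the label might be missing from the page, `label` would be `None`, so real code would want to check for that before calling `find_parent`.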

#2


23  

This post got me to my answer even though the answer is missing from this post. I felt I should give back.

The challenge here is in the inconsistent behavior of BeautifulSoup.find when searching with and without text.

Note: If you have BeautifulSoup, you can test this locally via:

curl https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.py | python

Code: https://gist.github.com/4060082

# Taken from https://gist.github.com/4060082
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint
import re

soup = BeautifulSoup(urlopen('https://gist.githubusercontent.com/RichardBronosky/4060082/raw/test.html').read())
# I'm going to assume that Peter knew that re.compile is meant to cache a computation result for a performance benefit. However, I'm going to do that explicitly here to be very clear.
pattern = re.compile('Fixed text')

# Peter's suggestion here returns a list of what appear to be strings
columns = soup.findAll('td', text=pattern, attrs={'class' : 'pos'})
# ...but it is actually a BeautifulSoup.NavigableString
print type(columns[0])
#>> <class 'BeautifulSoup.NavigableString'>

# you can reach the tag using one of the convenience attributes seen here
pprint(columns[0].__dict__)
#>> {'next': <br />,
#>>  'nextSibling': <br />,
#>>  'parent': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previous': <td class="pos">\n
#>>       "Fixed text:"\n
#>>       <br />\n
#>>       <strong>text I am looking for</strong>\n
#>>   </td>,
#>>  'previousSibling': None}

# I feel that 'parent' is safer to use than 'previous' based on http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
# So, if you want to find the 'text' in the 'strong' element...
pprint([t.parent.find('strong').text for t in soup.findAll('td', text=pattern, attrs={'class' : 'pos'})])
#>> [u'text I am looking for']

# Here is what we have learned:
print soup.find('strong')
#>> <strong>some value</strong>
print soup.find('strong', text='some value')
#>> u'some value'
print soup.find('strong', text='some value').parent
#>> <strong>some value</strong>
print soup.find('strong', text='some value') == soup.find('strong')
#>> False
print soup.find('strong', text='some value') == soup.find('strong').text
#>> True
print soup.find('strong', text='some value').parent == soup.find('strong')
#>> True
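
The listing above is for the old BeautifulSoup 3 package on Python 2. As a hedged comparison, here is a small sketch of how the same searches appear to behave under bs4, assuming its documented rule that a tag name combined with `string=` matches tags whose `.string` matches. The one-line HTML and the variable names are made up for illustration.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical one-cell sample, shaped like the question's markup.
html = '<td class="pos">"Fixed text:"<br><strong>some value</strong></td>'
soup = BeautifulSoup(html, "html.parser")

# Tag name plus string: bs4 returns the tag itself, not the text node.
with_name = soup.find("strong", string="some value")
print(type(with_name).__name__)

# String only: bs4 returns the text node, as BeautifulSoup 3 did.
bare = soup.find(string="some value")
print(type(bare).__name__)

# The <td> has several children, so its .string is None and a
# tag-plus-string search does not match it.
td_by_string = soup.find("td", string=re.compile("Fixed text"))
print(td_by_string)

# Searching for the text node and taking its parent still works.
td_by_parent = soup.find(string=re.compile("Fixed text")).parent
print(td_by_parent.name)
```

So under bs4 the `.parent` hop is still needed for a cell with mixed content, but not for a tag that contains nothing except the text being searched for.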

Though it is most certainly too late to help the OP, I hope they will mark this as the answer, since it resolves the quandaries around finding by text.

#3


0  

from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from urllib.parse import urljoin, urlparse

# `soup` is an already-parsed BeautifulSoup document and `keyword` is the
# text to look for; neither is defined in this snippet.
rawLinks = soup.findAll('a', href=True)
for link in rawLinks:
    innercontent = link.text
    # Case-insensitive substring match on the link's visible text
    if keyword.lower() in innercontent.lower():
        print(link)
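
The snippet above leaves `soup` and `keyword` undefined. A self-contained sketch of the same filter could look like this, where the helper name `links_containing` and the sample HTML are assumptions added for illustration:

```python
from bs4 import BeautifulSoup


def links_containing(soup, keyword):
    """Return <a href=...> tags whose visible text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        link
        for link in soup.find_all("a", href=True)  # skip anchors without href
        if needle in link.get_text().lower()
    ]


# Hypothetical sample markup.
html = (
    '<a href="/a">Fixed Text here</a>'
    '<a href="/b">other</a>'
    '<a>fixed text, no href</a>'
)
soup = BeautifulSoup(html, "html.parser")

matches = links_containing(soup, "fixed text")
print([link["href"] for link in matches])
```

This is a plain substring test on each link's text, so it finds links rather than the `td`/`strong` pair from the question.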
