Extracting data from an HTML page using BeautifulSoup

Date: 2022-12-18 20:23:11
<div class="name">
  &nbsp;&nbsp;
  <strong>
    <a target="_blank" href="/page3.html">
      SOME_Name_TEXT
    </a>
  </strong>
</div>

<div class="data">
  <img src="/page1/page2/Images/pic.png" height="13" width="13">
  &nbsp; SOME_Data_TEXT
</div>

I have an HTML page containing several classes. I am able to extract the divs with class "name" and class "data" using BeautifulSoup:

myName = soup.findAll("div", {"class" : "name"})
myData = soup.findAll("div", {"class" : "data"})

But this is the result I get when I run the script and print the myName and myData elements respectively:

Â Â SOME_Name_TEXT(as a link)
Â SOME_Data_TEXT

The problem is that I don't want the Â characters. They come from the two &nbsp; entities in the first div and the one in the second.
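
For context, the Â most likely appears because the non-breaking space (U+00A0) that &nbsp; stands for is written out as UTF-8 and then displayed as Latin-1. A minimal standard-library sketch of that effect follows; the variable names are illustrative, and the Latin-1 display is an assumption about the console, not something stated in the question:

```python
# &nbsp; becomes the character U+00A0 (non-breaking space) once the HTML is parsed.
nbsp = u"\xa0"

# Encoded as UTF-8, that single character is two bytes: 0xC2 0xA0.
utf8_bytes = nbsp.encode("utf-8")

# Assumption: the console reads those bytes as Latin-1, so byte 0xC2
# shows up as "Â" and byte 0xA0 as a non-breaking space.
misread = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes))
print(repr(misread))
```

If this is indeed the cause, removing u"\xa0" from the parsed text also removes the Â from the printed output.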

I just want the result to be:

SOME_Name_TEXT(as a link)
SOME_Data_TEXT

In the first part, the link with "SOME_Name_TEXT" is required. The image in the data part is not needed; in the second part I want just the raw text, i.e. "SOME_Data_TEXT". I tried doing it with str.split(). How can I get exactly these results?

3 solutions

#1


1  

Since you do not want the &nbsp;, you can do something like this:

myName = soup.findAll("div", {"class" : "name"})
myData = soup.findAll("div", {"class" : "data"})
# BeautifulSoup decodes &nbsp; to u"\xa0", so remove that character from each element's text
for div in myName:
    print(div.get_text().replace(u"\xa0", "").strip())

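Or, as a second approach, work on the string form of your myName (called text here, to avoid shadowing the built-in str):

text = "&nbsp; hey how are you doing"
# This removes the literal entity from raw markup; in parsed text, remove u"\xa0" instead
text = text.replace("&nbsp;", "").strip()
print(text)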
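
A self-contained sketch of the same idea applied after parsing, assuming bs4 with the built-in "html.parser" and the markup from the question. The html variable and the clean_text helper are illustrative names, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = u"""
<div class="name">
  &nbsp;&nbsp;
  <strong><a target="_blank" href="/page3.html">SOME_Name_TEXT</a></strong>
</div>
<div class="data">
  <img src="/page1/page2/Images/pic.png" height="13" width="13">
  &nbsp; SOME_Data_TEXT
</div>
"""


def clean_text(tag):
    """Hypothetical helper: return the tag's text without non-breaking spaces."""
    # Turn U+00A0 into a plain space, then trim the surrounding whitespace.
    return tag.get_text().replace(u"\xa0", u" ").strip()


soup = BeautifulSoup(html, "html.parser")
name_text = clean_text(soup.find("div", {"class": "name"}))
data_text = clean_text(soup.find("div", {"class": "data"}))

print(name_text)
print(data_text)
```

Replacing the character with a plain space rather than deleting it keeps words apart when a non-breaking space sits between them.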

#2


0  

You'll have to do a Unicode replace to remove the &nbsp;, because BeautifulSoup converts HTML entities to Unicode characters (&nbsp; becomes u'\xa0').

Edit:
soup.prettify(formatter=lambda x: x.replace(u'\xa0', ''))
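
As a rough sketch of how the formatter approach behaves, assuming bs4 and its built-in "html.parser": bs4 accepts a callable as formatter and passes each text string through it on output. The shortened html string and the pretty variable are illustrative only:

```python
from bs4 import BeautifulSoup

# Shortened version of the "data" div from the question.
html = u'<div class="data">&nbsp; SOME_Data_TEXT</div>'
soup = BeautifulSoup(html, "html.parser")

# The callable receives each string on output; here it drops U+00A0.
pretty = soup.prettify(formatter=lambda s: s.replace(u"\xa0", u""))

print(pretty)
```

Note that this only changes the serialized output; the strings stored in the soup object still contain the non-breaking space.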

Other options: for myData, to get just the text, do this:

# <img> is an empty element with no contents, so take the text of the enclosing div instead
myData = soup.findAll("div", {"class" : "data"})[0].get_text().strip()

and for myName:

import re
myName = str(soup.findAll("div", {"class" : "name"})[0].find('a'))
# Remove non-breaking spaces only; removing every plain space would break the tag's attributes
myName = re.sub(u'\xa0', '', myName)

Does that work for you?
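
To show what "keep the link, drop the image" could look like, here is a hedged sketch assuming bs4 with "html.parser". It keeps the <a> element for the name and uses stripped_strings for the data text; the variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = u"""
<div class="name">
  &nbsp;&nbsp;
  <strong>
    <a target="_blank" href="/page3.html">
      SOME_Name_TEXT
    </a>
  </strong>
</div>
<div class="data">
  <img src="/page1/page2/Images/pic.png" height="13" width="13">
  &nbsp; SOME_Data_TEXT
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Name part: keep the <a> element itself, so the href stays available.
link = soup.find("div", {"class": "name"}).find("a")
link_href = link["href"]
link_text = link.get_text(strip=True)
# Collapse the indentation inside the tag to get one-line markup.
link_markup = " ".join(str(link).split())

# Data part: stripped_strings yields only non-empty, trimmed text nodes,
# so the <img> (which has no text) contributes nothing.
data_strings = list(soup.find("div", {"class": "data"}).stripped_strings)

print(link_markup)
print(data_strings)
```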

#3


0  

I finally solved it with the help of other questions:

For the first part, i.e.

<div class="name">
  &nbsp;&nbsp;
  <strong>
    <a target="_blank" href="/page3.html">
      SOME_Name_TEXT
    </a>
  </strong>
</div>

With this block in x, I used print(x.findNext('strong')). And for the second part, i.e.

<div class="data">
  <img src="/page1/page2/Images/pic.png" height="13" width="13">
  &nbsp; SOME_Data_TEXT
</div>

I did this:

tmp = x.findNext('img')
# <img> has no text of its own; the wanted text is the node right after it
print(tmp.next_sibling.strip())
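
For anyone who wants to try this end to end, a minimal sketch of the approach, assuming bs4 with "html.parser" and the markup from the question collapsed onto two lines. find_next is the current spelling of findNext; the variable names other than x are illustrative:

```python
from bs4 import BeautifulSoup

html = (
    u'<div class="name">&nbsp;&nbsp;<strong>'
    u'<a target="_blank" href="/page3.html">SOME_Name_TEXT</a>'
    u'</strong></div>'
    u'<div class="data">'
    u'<img src="/page1/page2/Images/pic.png" height="13" width="13">'
    u'&nbsp; SOME_Data_TEXT</div>'
)
soup = BeautifulSoup(html, "html.parser")

# x is the "name" div, as in the answer above.
x = soup.find("div", {"class": "name"})

# find_next walks forward in document order, so it reaches the <strong>
# inside x and then the <img> in the following div.
strong = x.find_next("strong")
img = x.find_next("img")

# The text node directly after <img> holds the data; in Python 3,
# str.strip() also removes the leading non-breaking space.
data_text = img.next_sibling.strip()

print(strong)
print(data_text)
```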
