BeautifulSoup Parse Text after <b> and before <br>

Time: 2023-02-09 15:28:36

I have this code that tries to parse search results from a grant website (the URL is in the code; I can't post the link until my rep is higher). I want the "Year" and "Award Amount" values, which appear after the <b> tags and before the <br> tags.



Two questions:

1) Why is this only returning the 1st table?


2) Is there any way to get both the text inside the <b> tags (i.e. the "Year" and "Award Amount" labels) and the text after the closing </b> (i.e. the actual values, such as 2015 and $100000)?



<td valign="top">
    <b>Year: </b>2014<br>
    <b>Award Amount: </b>$84,907
</td>

Here is my script:


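import requests
from bs4 import BeautifulSoup
import pandas as pd

# The rest of the query string is cut off in the source and is left as-is.
url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
    'organizationName=&region=ASIA&projectCountry=China&amount=&fromDate=&toDate=&'

r = requests.get(url)

html_content = r.text

soup = BeautifulSoup(html_content, "html.parser")

tables = soup.find_all('table')

data = {
        'col_names': [],
        'info' : [],
}

index = 0

for table in tables:
    # Skip the first row of each table.
    rows = table.find_all('tr')[1:]
    for row in rows:
        cols = row.find_all('td')
        try:
            # Assumed: the body of this try block is missing from the source;
            # the dict keys suggest the first two cells were collected.
            name, info = cols[0].get_text(), cols[1].get_text()
        except IndexError:
            continue
        data['col_names'].append(name)
        data['info'].append(info)
    grant_df = pd.DataFrame(data)
    index += 1
    filename = 'grant ' + str(index) + '.csv'
    # Assumed: the write step is missing from the source.
    grant_df.to_csv(filename)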

1 Answer



I would suggest approaching the table parsing in a different manner. All of the information is available in the first row of each table. So you can parse the text of the row like:



text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
                  if x.strip()]).replace(':\n', ': ')
data_dict = {k.strip(): v.strip() for k, v in
             [x.split(':', 1) for x in text.split('\n')]}


This takes the text and:

  1. splits it on newlines
  2. removes any blank lines
  3. removes any leading/trailing whitespace
  4. joins the lines back together into a single text
  5. joins any line ending in : with the next line

Then it:

  1. splits the text again by newline
  2. splits each line at the first :
  3. strips any whitespace from both ends of the text on either side of the :
  4. uses the split text as the key and value of a dict
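The steps above can be tried offline on a small sample before hitting the real site. The sketch below is only an illustration: the `html` string is made up, modeled on the `<td>` snippet from the question, and the "Project Title" cell (with its label alone on a line) is an assumed layout chosen to show the `:\n` join, so the live page may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the <td> snippet in the question.
html = """
<table><tr>
<td valign="top">
    <b>Project Title:</b>
    Empowering the Chinese Legal Community<br>
</td>
<td valign="top">
    <b>Year: </b>2014<br>
    <b>Award Amount: </b>$84,907
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

# First pass: split on newlines, strip each piece, drop blanks, re-join.
lines = [x.strip() for x in row.get_text().split("\n") if x.strip()]
text = "\n".join(lines)

# A label that ends its line with ':' is merged with the line after it.
text = text.replace(":\n", ": ")

# Second pass: split into lines, split each at the first ':',
# and strip whitespace on both sides to get key/value pairs.
data_dict = {k.strip(): v.strip()
             for k, v in (x.split(":", 1) for x in text.split("\n"))}

print(data_dict)
```

Splitting with `split(":", 1)` keeps any later colons inside the value, which matters for fields such as descriptions that may contain a colon of their own.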

Test Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# The rest of the query string is cut off in the source and is left as-is.
url = 'http://www.ned.org/wp-content/themes/ned/search/grant-search.php?' \
      'organizationName=&region=ASIA&projectCountry=China&amount=&' \
      'fromDate=&toDate=&projectFocus%5B%5D=&search=&maxCount=25&'

r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

data = []
for table in soup.find_all('table'):
    rows = table.find_all('tr')
    # Collapse the first row's text into "key: value" lines.
    text = '\n'.join([x.strip() for x in rows[0].get_text().split('\n')
                      if x.strip()]).replace(':\n', ': ')
    data_dict = {k.strip(): v.strip() for k, v in
                 [x.split(':', 1) for x in text.split('\n')]}

    # Keep only tables that actually describe a grant.
    if data_dict.get('Award Amount'):
        data.append(data_dict)
grant_df = pd.DataFrame(data)
print(grant_df.head())


  Award Amount                                        Description  \
0      $84,907  To strengthen the capacity of China's rights d...   
1     $204,973  To provide an effective forum for free express...   
2      $48,000  To promote religious freedom in China. The org...   
3      $89,000  To educate and train civil society activists o...   
4      $65,000  To encourage greater public discussion, transp...   

            Organization Name Project Country                Project Focus  \
0                         NaN  Mainland China                  Rule of Law   
1  Princeton China Initiative  Mainland China       Freedom of Information   
2                         NaN  Mainland China                  Rule of Law   
3                         NaN  Mainland China  Democratic Ideas and Values   
4                         NaN  Mainland China                  Rule of Law   

  Project Region                                      Project Title  Year  
0           Asia             Empowering the Chinese Legal Community  2014  
1           Asia  Supporting Free Expression and Open Debate for...  2014  
2           Asia  Religious Freedom, Rights Defense and Rule of ...  2014  
3           Asia     Education on Civil Society and Democratization  2014  
4           Asia        Promoting Democratic Policy Change in China  2014  
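As for the second question (getting the text right after each `<b>` tag), a different technique from the text-splitting approach above is to walk the `<b>` tags and read each one's `next_sibling`. This is only a sketch under the assumption that every value is a plain text node sitting directly after its `</b>`, as in the snippet from the question; the `html` string and the `pairs` name are made up for the example.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample, copied from the <td> snippet in the question.
html = """
<td valign="top">
    <b>Year: </b>2014<br>
    <b>Award Amount: </b>$84,907
</td>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = {}
for b in soup.find_all("b"):
    # Label is the text inside <b>, without the trailing colon.
    label = b.get_text().strip().rstrip(":").strip()
    # Value is the node immediately after the closing </b>.
    value = b.next_sibling
    # Keep only non-empty text nodes; skip tags or missing siblings.
    if isinstance(value, NavigableString) and value.strip():
        pairs[label] = value.strip()

print(pairs)
```

This avoids re-parsing the text by hand, but it depends more on the exact markup: if a value were wrapped in another tag, `next_sibling` would return that tag and the pair would be skipped.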


