python简单爬虫 使用pandas解析表格,不规则表格

时间:2023-03-09 03:34:32
python简单爬虫 使用pandas解析表格,不规则表格

url = http://www.hnu.edu.cn/xyxk/xkzy/zylb.htm

部分表格如图:

python简单爬虫 使用pandas解析表格,不规则表格

部分html代码:

<table class="MsoNormalTable" style="width:353.0pt;margin-left:4.65pt;border-collapse:collapse;border:none;    mso-border-alt:solid windowtext .5pt;mso-padding-alt:0cm 5.4pt 0cm 5.4pt;    mso-border-insideh:.5pt solid windowtext;mso-border-insidev:.5pt solid windowtext" width="471" cellspacing="0" cellpadding="0" border="1">
<tbody>
<tr class="firstRow" style="mso-yfti-irow:0;mso-yfti-firstrow:yes;height:36.75pt">
<td style="width:170.0pt;border:solid windowtext 1.0pt;mso-border-alt: solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:36.75pt" width="227"><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm; margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right: 0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm; mso-pagination:widow-orphan"><strong><span style="font-size:9.0pt;font-family: 宋体;mso-bidi-font-family:宋体;mso-font-kerning:0pt">学院<span lang="EN-US">
<o:p></o:p></span></span></strong></p></td>
<td style="width:183.0pt;border:solid windowtext 1.0pt; border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt: solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:36.75pt" width="244" nowrap=""><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm; margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right: 0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm; mso-pagination:widow-orphan"><strong><span style="font-size:9.0pt;font-family: 宋体;mso-bidi-font-family:宋体;mso-font-kerning:0pt">专业名称<span lang="EN-US">
<o:p></o:p></span></span></strong></p></td>
</tr>
<tr style="mso-yfti-irow:1;height:16.5pt">
<td rowspan="4" style="width:170.0pt;border:solid windowtext 1.0pt; border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt; padding:0cm 5.4pt 0cm 5.4pt;height:16.5pt" width="227"><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm; margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right: 0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm; mso-pagination:widow-orphan"><span style="font-size:9.0pt;font-family:宋体; mso-bidi-font-family:宋体;mso-font-kerning:0pt">土木工程学院<span lang="EN-US">450
<o:p></o:p></span></span></p></td>
<td style="width:183.0pt;border-top:none;border-left:none; border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt; mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt; mso-border-alt:solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:16.5pt" width="244" nowrap=""><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm; margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right: 0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm; mso-pagination:widow-orphan"><span style="font-size:9.0pt;font-family:宋体; mso-bidi-font-family:宋体;mso-font-kerning:0pt">土木工程<span lang="EN-US">
<o:p></o:p></span></span></p></td>
</tr>
    ......
</tbody>
</table>

用pandas解析表格,代码如下:

import pandas as pd
url = 'http://www.hnu.edu.cn/xyxk/xkzy/zylb.htm' table = pd.read_html(url)
pd.set_option('display.max_rows', None) # 显示全部的行
with open("湖南大学学院与专业.txt", "wt", encoding='utf8') as out_file: # 保存为txt文件
for i in table:
out_file.write(str(i)+'\n')

运行结果如下(部分):

python简单爬虫 使用pandas解析表格,不规则表格

非常简洁高效!