Extracting data from an HTML table using Python

Time: 2022-10-29 13:38:27

I want to extract data from an HTML table with a Python script and save it as variables in a separate file, so that the same script can later load them if the file exists. The script should also ignore the first row of the table (Component, Status, Time / Error). I would prefer not to use external libraries.


The output written to the new file should look like this:


SAVE_DOCUMENT_STATUS = "OK"
SAVE_DOCUMENT_TIME = "0.408"
GET_DOCUMENT_STATUS = "OK"
GET_DOCUMENT_TIME = "0.361"
...

And here's the input to the script:


<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.408 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.361 s</td></tr>
<tr><td>DVK_SEND</td><td>OK</td><td>0.002 s</td></tr>
<tr><td>DVK_RECEIVE</td><td>OK</td><td>0.002 s</td></tr>
<tr><td>GET_USER_INFO</td><td>OK</td><td>0.135 s</td></tr>
<tr><td>NOTIFICATIONS</td><td>OK</td><td>0.002 s</td></tr>
<tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>SUMMARY_STATUS</td><td>OK</td><td>0.913 s</td></tr>
</table>

I tried to do this in bash, but it fails because I need to compare the *_TIME variables against a maximum time and they are floating-point numbers.


2 Answers

#1


4  

Using lxml:


import lxml.html as lh

content='''\
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.408 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.361 s</td></tr>
<tr><td>DVK_SEND</td><td>OK</td><td>0.002 s</td></tr>
<tr><td>DVK_RECEIVE</td><td>OK</td><td>0.002 s</td></tr>
<tr><td>GET_USER_INFO</td><td>OK</td><td>0.135 s</td></tr>
<tr><td>NOTIFICATIONS</td><td>OK</td><td>0.002 s</td></tr>
<tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>SUMMARY_STATUS</td><td>OK</td><td>0.913 s</td></tr>
</table>
'''
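tree = lh.fromstring(content)
# The header cells wrap their text in <b>, so '//td/text()' skips the first row.
# zip(*[iter(...)]*3) groups the remaining cell texts three at a time.
for key, status, t in zip(*[iter(tree.xpath('//td/text()'))]*3):
    print('''{k}_STATUS = "{s}"
{k}_TIME = "{t}"'''.format(k=key, s=status, t=t.rstrip(' s')))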

yields


SAVE_DOCUMENT_STATUS = "OK"
SAVE_DOCUMENT_TIME = "0.408"
GET_DOCUMENT_STATUS = "OK"
GET_DOCUMENT_TIME = "0.361"
DVK_SEND_STATUS = "OK"
DVK_SEND_TIME = "0.002"
DVK_RECEIVE_STATUS = "OK"
DVK_RECEIVE_TIME = "0.002"
GET_USER_INFO_STATUS = "OK"
GET_USER_INFO_TIME = "0.135"
NOTIFICATIONS_STATUS = "OK"
NOTIFICATIONS_TIME = "0.002"
ERROR_LOG_STATUS = "OK"
ERROR_LOG_TIME = "0.001"
SUMMARY_STATUS_STATUS = "OK"
SUMMARY_STATUS_TIME = "0.913"
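Since the question asks to avoid external libraries, here is a rough standard-library sketch of the same idea using html.parser. The class name TableRows and the function table_to_assignments are made up for this illustration, and the code assumes every data row has exactly three cells, as in the sample table:

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> is collected as well.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_assignments(html):
    """Return NAME = "value" lines for every row except the header."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    lines = []
    for key, status, t in parser.rows[1:]:  # rows[0] is the header row
        seconds = t[:-2] if t.endswith(" s") else t  # drop the " s" unit
        lines.append('{}_STATUS = "{}"'.format(key, status))
        lines.append('{}_TIME = "{}"'.format(key, seconds))
    return lines


sample = '''<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.408 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.361 s</td></tr>
</table>'''

print("\n".join(table_to_assignments(sample)))
```

Unlike the XPath version, this parser does see the header row, so it is dropped explicitly with rows[1:]. A row with more or fewer than three cells would raise a ValueError in the unpacking loop.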

#2


2  

Well, if your HTML document really has such a stable structure (which makes me scratch my head, because that is pretty rare), you can use a regex:


>>> import re
>>> r = re.compile('<tr><td>(.*)</td><td>(.*)</td><td>(.*) s</td></tr>')

The regex above captures the values you want to show in the result as groups. You then call the compiled pattern's sub() method. If the text is in a variable (such as content), just run it this way:


r.sub(r'\1_STATUS = "\2"\n\1_TIME = \3', content)

The result:


>>> print(r.sub(r'\1_STATUS = "\2"\n\1_TIME = \3', content))
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
SAVE_DOCUMENT_STATUS = "OK"
SAVE_DOCUMENT_TIME = 0.408
GET_DOCUMENT_STATUS = "OK"
GET_DOCUMENT_TIME = 0.361
DVK_SEND_STATUS = "OK"
DVK_SEND_TIME = 0.002
DVK_RECEIVE_STATUS = "OK"
DVK_RECEIVE_TIME = 0.002
GET_USER_INFO_STATUS = "OK"
GET_USER_INFO_TIME = 0.135
NOTIFICATIONS_STATUS = "OK"
NOTIFICATIONS_TIME = 0.002
ERROR_LOG_STATUS = "OK"
ERROR_LOG_TIME = 0.001
SUMMARY_STATUS_STATUS = "OK"
SUMMARY_STATUS_TIME = 0.913
</table>

Sure, there is still some leftover markup in the string, but it gives the idea :)

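One way to avoid the leftover markup is to collect only the matching rows with findall() instead of substituting in place. This is a minimal sketch under the same assumption that each data row sits on one line; the names ROW and rows_to_assignments are invented for the example, and the groups are made non-greedy as a precaution:

```python
import re

# Same shape as the pattern above, but with non-greedy groups.
ROW = re.compile(r'<tr><td>(.*?)</td><td>(.*?)</td><td>(.*?) s</td></tr>')


def rows_to_assignments(content):
    """Build the assignment lines from the rows that match ROW."""
    lines = []
    for key, status, seconds in ROW.findall(content):
        lines.append('{}_STATUS = "{}"'.format(key, status))
        lines.append('{}_TIME = {}'.format(key, seconds))
    return "\n".join(lines)


content = '''<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.408 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.361 s</td></tr>
</table>'''

print(rows_to_assignments(content))
```

The header row is skipped as a side effect: its <tr> and <td> tags are on separate lines, so the single-line pattern never matches it. Rows whose third cell does not end in " s" (for example an error message) are silently skipped as well.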

If your HTML documents are not that stable, however, you should really consider an XML parser or, better yet, BeautifulSoup, because processing unpredictably structured HTML by hand would be a heck of a job.

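The question also asks about saving the values to a separate file, loading them again if the file exists, and comparing the times against a maximum. A possible sketch of that part follows. The helper names, the file name status_vars.txt and the 0.4 second limit are all assumptions for illustration, and the values are loaded into a dict rather than real variables to avoid executing the file's contents:

```python
import os
import tempfile


def save_assignments(path, lines):
    """Write NAME = "value" lines to a file, one per line."""
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


def load_assignments(path):
    """Read the file back into a dict; return {} if it does not exist."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path) as f:
        for line in f:
            name, sep, value = line.partition("=")
            if sep:  # ignore lines without an "="
                values[name.strip()] = value.strip().strip('"')
    return values


def over_limit(values, max_time):
    """Names of *_TIME entries whose float value exceeds max_time."""
    return sorted(name for name, value in values.items()
                  if name.endswith("_TIME") and float(value) > max_time)


lines = [
    'SAVE_DOCUMENT_STATUS = "OK"',
    'SAVE_DOCUMENT_TIME = "0.408"',
    'GET_DOCUMENT_STATUS = "OK"',
    'GET_DOCUMENT_TIME = "0.361"',
]
path = os.path.join(tempfile.mkdtemp(), "status_vars.txt")  # hypothetical file name
save_assignments(path, lines)
values = load_assignments(path)
print(over_limit(values, 0.4))  # ['SAVE_DOCUMENT_TIME']
```

Converting with float() is what makes the comparison work where bash struggled. If the third column ever holds an error message instead of a time, float() will raise a ValueError, so a real script would need to handle that case.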
