Environment: Python 2.7.3
>>> html = gethtml('/')
>>> soup = BeautifulSoup(html)
>>> soup.find_all("a",href=True)
[]
>>> soup.find_all("a")
[]
>>> soup.find_all("link")
[<link href="/jianzhimao/web-res/icon/" rel="icon" type="image/x-icon"/>, <link href="/jianzhimao/web-res/icon/" resl="shortcut icon" type="image/x-icon"/>, <link href="/templets/default/style/" rel="stylesheet" type="text/css">
<!--[if lt IE9]-->
<script>
(function() {
 if (!
 /*@cc_on!@*/
 0) return;
 var e = "abbr, article, aside, audio, canvas, datalist, details, dialog, eventsource, figure, figcaption, footer, header, hgroup, main, mark, menu, meter, nav, output, progress, section, time, video".split(', ');
 var i= ;
 while (i--){
 (e[i])
 }
})()
</script>
</link>]
As shown above, soup.find_all() finds no `a` tags, even though viewing the site's source in Chrome confirms they exist. The gethtml function (source omitted here) works fine; printing soup by itself shows that only part of the HTML was actually parsed.
Cause: no third-party HTML parser was installed, so BeautifulSoup fell back to Python's built-in parser. The built-in parser that ships with Python 2.7.3 tolerates malformed markup poorly, so it silently drops large portions of this page's (imperfect) HTML, including all the `a` tags.
Solution: pip install html5lib (or lxml)
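After installing a more tolerant parser, it is also worth naming it explicitly in the BeautifulSoup constructor, so the library never silently falls back to the built-in one. A minimal sketch (the HTML snippet below is made-up sample markup reusing the paths from the session above, not the actual page):

```python
# -*- coding: utf-8 -*-
# Sketch: pass the parser name as the second argument so BeautifulSoup
# uses lxml (or html5lib) instead of falling back to the stdlib parser.
from bs4 import BeautifulSoup

html = """
<html><head>
<link rel="stylesheet" type="text/css" href="/templets/default/style/">
</head><body>
<a href="/jianzhimao/web-res/icon/">icon</a>
</body></html>
"""

soup = BeautifulSoup(html, "lxml")   # or BeautifulSoup(html, "html5lib")
links = soup.find_all("a", href=True)
print([a["href"] for a in links])
```

With a lenient parser, the same find_all("a", href=True) call that returned [] above should now return the anchor tags.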
Reference: /software/BeautifulSoup/bs4/doc/#id5