Environment: Python 2.7.3
>>> html = gethtml('/')
>>> soup = BeautifulSoup(html)
>>> soup.find_all("a",href=True)
[]
>>> soup.find_all("a")
[]
>>> soup.find_all("link")
[<link href="/jianzhimao/web-res/icon/" rel="icon" type="image/x-icon"/>, <link href="/jianzhimao/web-res/icon/" resl="shortcut icon" type="image/x-icon"/>, <link href="/templets/default/style/" rel="stylesheet" type="text/css">
<!--[if lt IE9]-->
<script>
(function() {
 if (!
 /*@cc_on!@*/
 0) return;
 var e = "abbr, article, aside, audio, canvas, datalist, details, dialog, eventsource, figure, figcaption, footer, header, hgroup, main, mark, menu, meter, nav, output, progress, section, time, video".split(', ');
 var i= ;
 while (i--){
 (e[i])
 }
})()
</script>
</link>]
As shown above, soup.find_all() finds no `a` tags, even though viewing the site's source in Chrome confirms they exist. The gethtml function (source omitted here) works fine; printing soup by itself shows that only part of the HTML was actually parsed.
Cause: no third-party HTML parser was installed, so BeautifulSoup fell back to Python's built-in parser. The built-in parser that ships with Python 2.7.3 tolerates malformed markup poorly, so it silently drops large portions of this page's (imperfect) HTML, including all the `a` tags.
Solution: pip install html5lib (or lxml)
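After installing a more tolerant parser, it is also worth naming it explicitly in the BeautifulSoup constructor, so the library never silently falls back to the built-in one. A minimal sketch (the HTML snippet below is made-up sample markup reusing the paths from the session above, not the actual page):

```python
# -*- coding: utf-8 -*-
# Sketch: pass the parser name as the second argument so BeautifulSoup
# uses lxml (or html5lib) instead of falling back to the stdlib parser.
from bs4 import BeautifulSoup

html = """
<html><head>
<link rel="stylesheet" type="text/css" href="/templets/default/style/">
</head><body>
<a href="/jianzhimao/web-res/icon/">icon</a>
</body></html>
"""

soup = BeautifulSoup(html, "lxml")   # or BeautifulSoup(html, "html5lib")
links = soup.find_all("a", href=True)
print([a["href"] for a in links])
```

With a lenient parser, the same find_all("a", href=True) call that returned [] above should now return the anchor tags.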
Reference: /software/BeautifulSoup/bs4/doc/#id5