Scraping Web Page Content with a PHP Crawler (simple_html_dom.php)

This example uses the simple_html_dom.php library.

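Since only a single page is scraped here, the task is fairly simple. Crawling a whole site is something to look into another time, and Python may well be a better fit for that kind of crawler.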

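 <meta http-equiv="content-type" content="text/html;charset=utf-8"/>
 <?php
 include_once 'simplehtmldom/simple_html_dom.php';

 // Fetch the page and parse it into a DOM object
 $html = file_get_html('http://paopaotv.com/tv-type-id-5-pg-1.html');
 if (!$html) {
     die('Failed to load the page');
 }

 // Each entry is an <a> inside an <li> under the element with class "txt-list";
 // find() returns every node matching the selector
 $arr = [];
 foreach ($html->find('.txt-list li a') as $element) {
     $arr[] = $element->innertext . '<br>';
 }

 $fileName = 'data.txt'; // created automatically if it does not exist
 $arrLen = count($arr);
 for ($i = 0; $i < $arrLen; $i++) {
     /* FILE_APPEND|LOCK_EX appends to the file. Without FILE_APPEND every call
        overwrites the file, so only the last entry would be kept.
        Note that re-running the script appends the already scraped data again. */
     file_put_contents($fileName, $arr[$i], FILE_APPEND | LOCK_EX);
 }
 // The scraped data is now stored in data.txt

 $content = file_get_contents($fileName);
 $cont = explode("<br>", $content);
 $contLen = count($cont);
 for ($i = 0; $i < $contLen; $i++) {
     unset($cont[2 * $i + 1]); // drop the odd-indexed entries
 }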
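The comment in the code points out that re-running the script appends the same data again. One possible way around this, sketched below, is to join the entries and write the file once per run without FILE_APPEND, so that each run replaces the previous contents. The `$arr` values are made-up placeholders standing in for scraped entries, and the temporary file path is used only for illustration.

```php
<?php
// Placeholder entries standing in for the scraped values (hypothetical data).
$arr = ['Title A<br>', 'Extra A<br>', 'Title B<br>', 'Extra B<br>'];

// Illustration only: write to a temporary file rather than data.txt.
$fileName = sys_get_temp_dir() . '/data_sketch.txt';

// Simulate two runs: without FILE_APPEND each call replaces the file contents,
// so the second run does not duplicate the first.
for ($run = 0; $run < 2; $run++) {
    file_put_contents($fileName, implode('', $arr), LOCK_EX);
}

$content = file_get_contents($fileName);
echo $content, PHP_EOL;
```

The trade-off is that earlier results are lost on every run, which is fine for a single page but would need rethinking if several pages were written to the same file.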

First, locate the nodes on http://www.paopaotv.com/tv-type-id-5-pg-1.html:

 foreach ($html->find('.txt-list li a') as $element) {
     $arr[] = $element->innertext . '<br>';
 }

This retrieves the text inside each matched node.

The data obtained:

[Screenshot: the scraped output]

As you can see, every item is followed by <br>***<br>. This is because each .txt-list li contains two a elements, so two values are collected per list item.
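An alternative to cleaning up afterwards is to select only the first link in each list item. The sketch below shows the idea with PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom, so that it runs on its own. The HTML fragment is hypothetical and only mimics the "two links per li" structure described above. simple_html_dom's find() also accepts an index as its second argument, so something like `$li->find('a', 0)` should give the same result there.

```php
<?php
// Hypothetical markup mimicking a list whose items each contain two links.
$sample = <<<HTML
<ul class="txt-list">
  <li><a href="/a">Title A</a><a href="/a/latest">Episode 10</a></li>
  <li><a href="/b">Title B</a><a href="/b/latest">Episode 3</a></li>
</ul>
HTML;

$doc = new DOMDocument();
// Suppress warnings about the fragment not being a complete HTML document.
@$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// Select only the first <a> child of each <li> under the "txt-list" element.
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " txt-list ")]//li/a[1]';

$titles = [];
foreach ($xpath->query($query) as $a) {
    $titles[] = trim($a->textContent);
}

print_r($titles);
```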

 $content = file_get_contents($fileName);
 $cont = explode("<br>", $content);
 $contLen = count($cont);
 for ($i = 0; $i < $contLen; $i++) {
     unset($cont[2 * $i + 1]);
 }

This reads the data back from data.txt and uses explode("<br>", $content) to split it at every <br>. Printing $cont with print_r() gives:

[Screenshot: print_r() output of $cont]

All the unwanted entries sit at odd indices, so they are removed with unset($cont[2*$i+1]);. The result then looks like this:

[Screenshot: $cont after removing the odd-indexed entries]

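What I have not figured out yet is how to renumber the keys of the resulting array. I tried array_splice, but it cannot be told to delete only the odd-indexed entries.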
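One possible way to renumber the keys is PHP's array_values(), which returns the values of an array re-indexed from 0. The sketch below combines it with array_filter() to keep only the even-indexed entries. The `$content` string is made-up sample data in the same shape as the file contents described above. The filter also drops the empty string that explode() leaves after the trailing <br>.

```php
<?php
// Hypothetical file contents: each wanted title is followed by an unwanted entry.
$content = 'Title A<br>Extra A<br>Title B<br>Extra B<br>';

$cont = explode('<br>', $content);
// $cont now has keys 0..4; key 4 holds an empty string left by the trailing <br>.

// Keep only the even-indexed, non-empty entries (original keys are preserved).
$kept = array_filter(
    $cont,
    function ($value, $key) {
        return $key % 2 === 0 && $value !== '';
    },
    ARRAY_FILTER_USE_BOTH
);

// array_values() discards the old keys and renumbers from 0.
$titles = array_values($kept);

print_r($titles);
```

Calling array_values($cont) directly after the unset() loop should renumber the keys in the same way, although the trailing empty entry would still be present.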
