java从零到变身爬虫大神（一）

学习java3天有余，知道一些基本语法后

学习java爬虫，1天后开始出现明显效果

刚开始先从最简单的爬虫逻辑入手

爬虫最简单的解析面真的是这样

 import org.jsoup.Jsoup;

 import org.jsoup.nodes.Document;

 import java.io.IOException;

 public class Test {

     public static void Get_Url(String url) {

         try {

          Document doc = Jsoup.connect(url)

           //.data("query", "Java")

           //.userAgent("头部")

           //.cookie("auth", "token")

           //.timeout(3000)

           //.post()

           .get();

         } catch (IOException e) {

               e.printStackTrace();

         }

     }

 }

这只是一个函数而已

那么在下面加上：

 //main函数

     public static void main(String[] args) {

         String url = "...";

         Get_Url(url);

     }

哈哈，搞定

就是这么一个爬虫了

太神奇

但是得到的只是网页的html页面的东西

而且还没筛选

那么就筛选吧

 public static void Get_Url(String url) {

         try {

          Document doc = Jsoup.connect(url)

           //.data("query", "Java")

           //.userAgent("头部")

           //.cookie("auth", "token")

           //.timeout(3000)

           //.post()

           .get();

         //得到html的所有东西

         Element content = doc.getElementById("content");

         //分离出html下<a>...</a>之间的所有东西

         Elements links = content.getElementsByTag("a");

         //Elements links = doc.select("a[href]");

         // 扩展名为.png的图片

         Elements pngs = doc.select("img[src$=.png]");

         // class等于masthead的div标签

         Element masthead = doc.select("div.masthead").first();

         for (Element link : links) {

               //得到<a>...</a>里面的网址

               String linkHref = link.attr("href");

               //得到<a>...</a>里面的汉字

               String linkText = link.text();

               System.out.println(linkText);

             }

         } catch (IOException e) {

               e.printStackTrace();

         }

     }

那就用上面的来解析一下我的博客园

解析的是<a>...</a>之间的东西

java从零到变身爬虫大神（一）

看起来很不错，就是不错

-------------------------------我是快乐的分割线-------------------------------

其实还有另外一种爬虫的方法更加好

他能批量爬取网页保存到本地

先保存在本地再去正则什么的筛选自己想要的东西

这样效率比上面的那个高了很多

很多

看代码！

 　　//将抓取的网页变成html文件，保存在本地

     public static void Save_Html(String url) {

         try {

             File dest = new File("src/temp_html/" + "保存的html的名字.html");

             //接收字节输入流

             InputStream is;

             //字节输出流

             FileOutputStream fos = new FileOutputStream(dest);

             URL temp = new URL(url);

             is = temp.openStream();

             //为字节输入流加缓冲

             BufferedInputStream bis = new BufferedInputStream(is);

             //为字节输出流加缓冲

             BufferedOutputStream bos = new BufferedOutputStream(fos);

             int length;

             byte[] bytes = new byte[1024*20];

             while((length = bis.read(bytes, 0, bytes.length)) != -1){

                 fos.write(bytes, 0, length);

             }

             bos.close();

             fos.close();

             bis.close();

             is.close();

         } catch (IOException e) {

             e.printStackTrace();

         }

     }

这个方法直接将html保存在了文件夹src/temp_html/里面

在批量抓取网页的时候

都是先抓下来，保存为html或者json

然后在正则什么的进数据库

东西在本地了，自己想怎么搞就怎么搞

反爬虫关我什么事

上面两个方法都会造成一个问题

java从零到变身爬虫大神（一）

这个错误代表

这种爬虫方法太low逼

大部分网页都禁止了

所以，要加个头

就是UA

方法一那里的头部那里直接

.userAgent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)")

方法二间接加：

  URL temp = new URL(url);

  URLConnection uc = temp.openConnection();

  uc.addRequestProperty("User-Agent", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5");

  is = temp.openStream();

加了头部，几乎可以应付大部分网址了

-------------------------------我是快乐的分割线-------------------------------

将html下载到本地后需要解析啊

解析啊看这里啊

 //解析本地的html

     public static void Get_Localhtml(String path) {

         //读取本地html的路径

         File file = new File(path);

         //生成一个数组用来存储这些路径下的文件名

         File[] array = file.listFiles();

         //写个循环读取这些文件的名字

         for(int i=0;i<array.length;i++){

             try{

                 if(array[i].isFile()){

                 //文件名字

                 System.out.println("正在解析网址：" + array[i].getName());

                 //下面开始解析本地的html

                 Document doc = Jsoup.parse(array[i], "UTF-8");

                 //得到html的所有东西

                 Element content = doc.getElementById("content");

                 //分离出html下<a>...</a>之间的所有东西

                 Elements links = content.getElementsByTag("a");

                 //Elements links = doc.select("a[href]");

                 // 扩展名为.png的图片

                 Elements pngs = doc.select("img[src$=.png]");

                 // class等于masthead的div标签

                 Element masthead = doc.select("div.masthead").first();

                 for (Element link : links) {

                       //得到<a>...</a>里面的网址

                       String linkHref = link.attr("href");

                       //得到<a>...</a>里面的汉字

                       String linkText = link.text();

                       System.out.println(linkText);

                         }

                     }

                 }catch (Exception e) {

                     System.out.println("网址：" + array[i].getName() + "解析出错");

                     e.printStackTrace();

                     continue;

                 }

         }

     }

文字配的很漂亮

就这样解析出来啦

主函数加上

 //main函数

     public static void main(String[] args) {

         String url = "http://www.cnblogs.com/TTyb/";

         String path = "src/temp_html/";

         Get_Localhtml(path);

     }

那么这个文件夹里面的所有的html都要被我解析掉

好啦

3天java1天爬虫的结果就是这样子咯

-------------------------------我是快乐的分割线-------------------------------

其实对于这两种爬取html的方法来说，最好结合在一起

作者测试过

方法二稳定性不足

方法一速度不好

所以自己改正

将方法一放到方法二的catch里面去

当方法二出现错误的时候就会用到方法一

但是当方法一也错误的时候就跳过吧

结合如下：

 import org.jsoup.Jsoup;

 import org.jsoup.nodes.Document;

 import org.jsoup.nodes.Element;

 import org.jsoup.select.Elements;

 import java.io.BufferedInputStream;

 import java.io.BufferedOutputStream;

 import java.io.BufferedReader;

 import java.io.File;

 import java.io.FileOutputStream;

 import java.io.IOException;

 import java.io.InputStream;

 import java.io.InputStreamReader;

 import java.io.OutputStream;

 import java.io.OutputStreamWriter;

 import java.net.HttpURLConnection;

 import java.net.URL;

 import java.net.URLConnection;

 import java.util.Date;

 import java.text.SimpleDateFormat;

 public class JavaSpider {

     //将抓取的网页变成html文件，保存在本地

     public static void Save_Html(String url) {

         try {

             File dest = new File("src/temp_html/" + "我是名字.html");

             //接收字节输入流

             InputStream is;

             //字节输出流

             FileOutputStream fos = new FileOutputStream(dest);

             URL temp = new URL(url);

             URLConnection uc = temp.openConnection();

             uc.addRequestProperty("User-Agent", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5");

             is = temp.openStream();

             //为字节输入流加缓冲

             BufferedInputStream bis = new BufferedInputStream(is);

             //为字节输出流加缓冲

             BufferedOutputStream bos = new BufferedOutputStream(fos);

             int length;

             byte[] bytes = new byte[1024*20];

             while((length = bis.read(bytes, 0, bytes.length)) != -1){

                 fos.write(bytes, 0, length);

             }

             bos.close();

             fos.close();

             bis.close();

             is.close();

         } catch (IOException e) {

             e.printStackTrace();

             System.out.println("openStream流错误，跳转get流");

             //如果上面的那种方法解析错误

             //那么就用下面这一种方法解析

             try{

                 Document doc = Jsoup.connect(url)

                 .userAgent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)")

                 .timeout(3000)

                 .get();

                 File dest = new File("src/temp_html/" + "我是名字.html");

                 if(!dest.exists())

                     dest.createNewFile();

                 FileOutputStream out=new FileOutputStream(dest,false);

                 out.write(doc.toString().getBytes("utf-8"));

                 out.close();

             }catch (IOException E) {

                 E.printStackTrace();

                 System.out.println("get流错误，请检查网址是否正确");

             }

         }

     }

     //解析本地的html

     public static void Get_Localhtml(String path) {

         //读取本地html的路径

         File file = new File(path);

         //生成一个数组用来存储这些路径下的文件名

         File[] array = file.listFiles();

         //写个循环读取这些文件的名字

         for(int i=0;i<array.length;i++){

             try{

                 if(array[i].isFile()){

                     //文件名字

                     System.out.println("正在解析网址：" + array[i].getName());

                     //文件地址加文件名字

                     //System.out.println("#####" + array[i]);

                     //一样的文件地址加文件名字

                     //System.out.println("*****" + array[i].getPath()); 

                     //下面开始解析本地的html

                     Document doc = Jsoup.parse(array[i], "UTF-8");

                     //得到html的所有东西

                     Element content = doc.getElementById("content");

                     //分离出html下<a>...</a>之间的所有东西

                     Elements links = content.getElementsByTag("a");

                     //Elements links = doc.select("a[href]");

                     // 扩展名为.png的图片

                     Elements pngs = doc.select("img[src$=.png]");

                     // class等于masthead的div标签

                     Element masthead = doc.select("div.masthead").first();

                     for (Element link : links) {

                           //得到<a>...</a>里面的网址

                           String linkHref = link.attr("href");

                           //得到<a>...</a>里面的汉字

                           String linkText = link.text();

                           System.out.println(linkText);

                         }

                     }

                 }catch (Exception e) {

                     System.out.println("网址：" + array[i].getName() + "解析出错");

                     e.printStackTrace();

                     continue;

                 }

             }

         }

     //main函数

     public static void main(String[] args) {

         String url = "http://www.cnblogs.com/TTyb/";

         String path = "src/temp_html/";

         //保存到本地的网页地址

         Save_Html(url);

         //解析本地的网页地址

         Get_Localhtml(path);

     }

 }

总的来说

java爬虫的方法比python的多好多

java的库真特么变态

秒客网

java从零到变身爬虫大神（一）

相关文章