Problems sending HTTPS requests with Jsoup

Date: 2022-10-31 11:11:20

Reposted from: http://blog.csdn.net/sonnyching/article/details/53706186

 

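Today I was writing a small crawler with jsoup. Jsoup connects to ordinary http sites without any trouble, but it fell flat on its face as soon as it hit https. I looked through the API and, maybe it's just me, couldn't find anything Jsoup offers for this?? Excuse me??
So I figured I would have to fall back on the native javax.net packages to solve it.

The problem did get solved in the end, but it also showed how much of a beginner I still am, sigh...

Here is my first attempt: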

// imports: java.io.IOException, org.jsoup.Connection, org.jsoup.Jsoup
public static void iAmStudent() {
    String url = "https://www.v2ex.com/t/116724";
    Connection connect = Jsoup.connect(url);
    try {
        // execute() performs the request; the SSL handshake fails here
        Connection.Response response = connect.execute();
        System.out.println(response.body());
    } catch (IOException e) {
        e.printStackTrace();
    }
}

 

It failed immediately:

javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174)
...................

 


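Note one phrase in there: unable to find valid certification path to requested target. Java could not build a valid certificate chain from the server's certificate to any CA it trusts, so the target site is simply not trusted. The local internet cafe attendant has no idea what this site is for and wonders whether it might be some adult site... so, to protect the healthy growth of minors, the request naturally fails.

Fine, then this kid will forge an ID card. Whenever anyone asks "How old are you, kiddo?", I will always answer "I'm 20!", and the attendant just believes it.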
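To make "trusted" a bit more concrete, here is a minimal sketch that asks the JVM which CA certificates its default trust manager accepts. It uses only standard JDK classes; the class name DefaultTrustAnchors is made up for illustration. A server whose certificate chain does not end in one of these CAs will most likely trigger the PKIX error above.

```java
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper, not part of jsoup.
final class DefaultTrustAnchors {

    // Returns the CA certificates accepted by the JVM's default trust manager.
    static X509Certificate[] load() throws Exception {
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        // Passing null tells the factory to use the JVM's default trust store.
        tmf.init((KeyStore) null);
        for (TrustManager tm : tmf.getTrustManagers()) {
            if (tm instanceof X509TrustManager) {
                return ((X509TrustManager) tm).getAcceptedIssuers();
            }
        }
        return new X509Certificate[0];
    }

    public static void main(String[] args) throws Exception {
        X509Certificate[] anchors = load();
        System.out.println("Trusted CA certificates: " + anchors.length);
        for (X509Certificate ca : anchors) {
            System.out.println(ca.getSubjectX500Principal().getName());
        }
    }
}
```

The exact list depends on the JDK build and on any certificates that have been imported into its trust store.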

public static void iAm20() {
    try {
        HttpsURLConnection.setDefaultHostnameVerifier(new HostnameVerifier() {
            // Called when the requested host does not match the name in the server certificate
            // Attendant: "Are you your own dad?" Me: "Yep~"
            public boolean verify(String hostname, SSLSession session) {
                return true;
            }
        });

        SSLContext context = SSLContext.getInstance("SSL");
        context.init(null, new X509TrustManager[] { new X509TrustManager() {
            // Checks a certificate chain presented by a client (used on the server side)
            // Attendant: "Are you over 18?" Me, quietly: "Mm-hm.."
            public void checkClientTrusted(
                    X509Certificate[] chain, String authType) throws CertificateException {
                // Do nothing: every chain is accepted
            }
            // Checks the certificate chain presented by the server (what this crawler relies on)
            // Attendant: "Does your dad let you go online?" Me: "Yeah.."
            public void checkServerTrusted(
                    X509Certificate[] chain, String authType) throws CertificateException {
                // Do nothing: every chain is accepted
            }
            // The attendant wants to see an ID, so hand over a fake one bought at a street stall
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        } }, new SecureRandom());
        // From here on, every HttpsURLConnection in this JVM skips certificate checks
        HttpsURLConnection.setDefaultSSLSocketFactory(context.getSocketFactory());
    } catch (Exception e) {
        e.printStackTrace();
    }
}

 


At this point I can go online directly.

Connection conn = HttpConnection.connect(url);
conn.timeout(timeout);
conn.header("Accept-Encoding", "gzip,deflate,sdch");
conn.header("Connection", "close");
String yellowNews = conn.execute().body();

 

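Note, however, that Connection here is org.jsoup.Connection (and HttpConnection is org.jsoup.helper.HttpConnection).

Later I read through Jsoup's source and found that Jsoup itself can already trust https sites unconditionally. I'm always late to the party...
The Response class in HttpConnection already has a method for this, shown here: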
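One thing to keep in mind about iAm20(): the two setDefault... calls change the defaults for every HttpsURLConnection in the JVM, not just for the crawler's requests. As a rough sketch of a narrower alternative, the same kind of settings can be applied to a single connection. This uses plain HttpsURLConnection rather than jsoup, and the class name ScopedTls and method open are invented for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical helper, not part of jsoup.
final class ScopedTls {

    // Creates (but does not yet connect) an HTTPS connection that uses the given
    // socket factory and hostname verifier. JVM-wide defaults are left untouched.
    static HttpsURLConnection open(URL url, SSLSocketFactory factory, HostnameVerifier verifier)
            throws IOException {
        HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
        conn.setSSLSocketFactory(factory);
        conn.setHostnameVerifier(verifier);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        URL url = URI.create("https://www.v2ex.com/t/116724").toURL();
        // The JDK defaults are used here only to keep the demo harmless.
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        HostnameVerifier verifier = HttpsURLConnection.getDefaultHostnameVerifier();
        HttpsURLConnection conn = open(url, factory, verifier);
        System.out.println("Per-connection factory set: " + (conn.getSSLSocketFactory() == factory));
    }
}
```

In the scenario from this post, the factory and verifier passed in would be the trust-everything ones built in iAm20(); everything else in the JVM would then keep its normal certificate checks.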

/**
 * Initialise Trust manager that does not validate certificate chains and
 * add it to current SSLContext.
 * <p/>
 * please not that this method will only perform action if sslSocketFactory is not yet
 * instantiated.
 *
 * @throws IOException
 */
private static synchronized void initUnSecureTSL() throws IOException {
    if (sslSocketFactory == null) {
        // Create a trust manager that does not validate certificate chains
        final TrustManager[] trustAllCerts = new TrustManager[]{new X509TrustManager() {

            public void checkClientTrusted(final X509Certificate[] chain, final String authType) {
            }

            public void checkServerTrusted(final X509Certificate[] chain, final String authType) {
            }

            public X509Certificate[] getAcceptedIssuers() {
                return null;
            }
        }};

        // Install the all-trusting trust manager
        final SSLContext sslContext;
        try {
            sslContext = SSLContext.getInstance("SSL");
            sslContext.init(null, trustAllCerts, new java.security.SecureRandom());
            // Create an ssl socket factory with our all-trusting manager
            sslSocketFactory = sslContext.getSocketFactory();
        } catch (NoSuchAlgorithmException e) {
            throw new IOException("Can't create unsecure trust manager");
        } catch (KeyManagementException e) {
            throw new IOException("Can't create unsecure trust manager");
        }
    }

}

 

So the method exists, but how does it get called? After some digging, it turns out the call is guarded by a condition:

if (conn instanceof HttpsURLConnection) {
    if (!req.validateTLSCertificates()) {
        initUnSecureTSL();
        ((HttpsURLConnection)conn).setSSLSocketFactory(sslSocketFactory);
        ((HttpsURLConnection)conn).setHostnameVerifier(getInsecureVerifier());
    }
}
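The excerpt calls getInsecureVerifier() without showing its body. Presumably it returns a HostnameVerifier that accepts every host name, much like the one in iAm20(). The sketch below is a guess at an equivalent, not jsoup's actual code; the class and method names are made up.

```java
import javax.net.ssl.HostnameVerifier;

// Hypothetical stand-in for jsoup's getInsecureVerifier(); jsoup's real code may differ.
final class InsecureVerifierSketch {

    // Returns a verifier that accepts any host name, whatever the certificate says.
    static HostnameVerifier insecureVerifier() {
        return (hostname, session) -> true;
    }

    public static void main(String[] args) {
        HostnameVerifier verifier = insecureVerifier();
        // The session argument is ignored by this verifier, so null is enough for a demo.
        System.out.println(verifier.verify("www.v2ex.com", null)); // prints true
    }
}
```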

 


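In other words, Jsoup only trusts https sites unconditionally when req.validateTLSCertificates() returns false, that is, when validation has been switched off. Stepping into validateTLSCertificates() shows that it simply returns the validateTSLCertificates field of the Request class, and the matching setter just assigns that field: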

public void validateTLSCertificates(boolean value) {
    validateTSLCertificates = value;
}

 


So all it takes is setting validateTSLCertificates to false. I then found this method in HttpConnection:

public Connection validateTLSCertificates(boolean value) {
    req.validateTLSCertificates(value);
    return this;
}

 

So now I can simply pretend to be my own dad...

// imports: java.io.IOException, org.jsoup.Connection, org.jsoup.helper.HttpConnection, org.jsoup.nodes.Document
public static void iAmMyDaddy() {
    String url = "https://www.v2ex.com/t/116724";
    Connection connect = HttpConnection.connect(url);
    connect.timeout(3000);
    connect.header("Accept-Encoding", "gzip,deflate,sdch");
    connect.header("Connection", "close");
    // Skip certificate validation for this connection
    connect.validateTLSCertificates(false);
    try {
        // get() sends the request itself, so a separate execute() call is not needed
        Document doc = connect.get();
        System.out.println(doc.html());
    } catch (IOException e) {
        e.printStackTrace();
    }
}

 

 

validateTLSCertificates(false) is the equivalent of being the big shot who walks straight into the internet cafe with no sign-in and no ID check.
Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="zh-CN">
 <head> 
  <meta charset="UTF-8"> 
  <meta content="True" name="HandheldFriendly"> 
  <meta name="detectify-verification" content="d0264f228155c7a1f72c3d91c17ce8fb"> 
  <meta name="alexaVerifyID" content="OFc8dmwZo7ttU4UCnDh1rKDtLlY"> 
  ......................
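A caveat for anyone on a newer jsoup: as far as I know, validateTLSCertificates(boolean) was deprecated and later removed (around 1.12.1, if I remember the changelog correctly), with Connection.sslSocketFactory(SSLSocketFactory) as the replacement. Below is a minimal sketch of building a trust-everything socket factory that could be handed to that method. It uses only JDK classes; the class name TrustAllSsl is made up, and "TLS" is used here instead of the "SSL" protocol name from the earlier code.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper, not part of jsoup.
final class TrustAllSsl {

    // Builds a socket factory whose trust manager accepts every certificate chain.
    // Insecure by design: only suitable for throwaway crawlers and local testing.
    static SSLSocketFactory socketFactory() throws GeneralSecurityException {
        TrustManager[] trustAll = { new X509TrustManager() {
            @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { }
            @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { }
            @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        } };
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, trustAll, new SecureRandom());
        return context.getSocketFactory();
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SSLSocketFactory factory = socketFactory();
        System.out.println("Created " + factory.getClass().getName());
    }
}
```

With such a jsoup version on the classpath, the call would presumably look like Jsoup.connect(url).sslSocketFactory(TrustAllSsl.socketFactory()).get(). Like validateTLSCertificates(false), this turns off the protection HTTPS is supposed to give, so it belongs in toy crawlers only.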