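Using HttpClient 4.3.4 to Log In Automatically and Scrape China Unicom User Profile and Billing Data (GET/POST/Cookie)

Date: 2021-05-14 15:54:56

The following is for learning and discussion only. Do not use it for any other purpose; you bear the consequences if you do.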

 

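I. What is HttpClient?

HTTP is probably the most widely used and most important protocol on the Internet today, and more and more Java applications need to access network resources directly over HTTP. Although the java.net package in the JDK already provides basic HTTP functionality, for most applications what the JDK offers is not rich or flexible enough. HttpClient is a subproject of Apache Jakarta Commons that provides an efficient, up-to-date, feature-rich client-side toolkit for HTTP, and it supports the latest versions and recommendations of the protocol. HttpClient is already used in many projects; for example, Cactus and HTMLUnit, two other well-known open-source projects under Apache Jakarta, both use it. At the time of writing, the latest release was HttpClient 4.3.4 (2014-06-22).

----- Quoted from Baidu Baike

Simply put, HttpClient is an Apache library, shipped as a jar, that wraps the HTTP protocol.

The rest of this article shows how to use GET/POST requests to log in to the China Unicom website and scrape the user's basic information and billing data.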

II. Create a Maven project named httpclient

My environment is JDK 1.7 + IntelliJ IDEA 13.0 + Ubuntu 12.04 + Maven + HttpClient 4.3.4. First, create a Maven project:

[Screenshot: creating the Maven project]

As shown in the screenshot, select quickstart.

[Screenshot: choosing quickstart]

Then keep clicking Next until the wizard finishes.

Once the project is created, it looks like this:

[Screenshot: the project after creation]

Double-click pom.xml and add the required dependency:

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.3.4</version>
    </dependency>

Maven downloads the other required jars automatically. The jars that are actually needed are shown below:

[Screenshot: jars required by the project]

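III. Log in to China Unicom and scrape data

1. Simulate the login with GET and scrape the monthly billing data

China Unicom offers two login forms: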

[Screenshots: the two China Unicom login forms]

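The difference between the two forms above is that one requires a captcha and the other does not. We start with the login that does not require a captcha.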

package com.amos;

import org.apache.http.Header;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;

/**
 * Logs in to China Unicom and scrapes data.
 *
 * @author amosli
 */
public class LoginChinaUnicom {

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        String name = "your China Unicom mobile number";
        String pwd = "your service password";

        // Login URL; the credentials are passed as query parameters.
        String url = "https://uac.10010.com/portal/Service/MallLogin?callback=jQuery17202691898950318097_1403425938090&redirectURL=http%3A%2F%2Fwww.10010.com&userName=" + name + "&password=" + pwd + "&pwdType=01&productType=01&redirectType=01&rememberMe=1";

        // DefaultHttpClient is deprecated as of 4.3 but still works.
        // It keeps the session cookies between the login and the bill requests.
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse loginResponse = httpClient.execute(httpGet);

        if (loginResponse.getStatusLine().getStatusCode() == 200) {
            for (Header head : loginResponse.getAllHeaders()) {
                System.out.println(head);
            }
            HttpEntity loginEntity = loginResponse.getEntity();
            String loginEntityContent = EntityUtils.toString(loginEntity);
            System.out.println("Login status: " + loginEntityContent);

            // If the login succeeded
            if (loginEntityContent.contains("resultCode:\"0000\"")) {
                // Months to fetch, in yyyyMM form
                String months[] = new String[]{"201401", "201402", "201403", "201404", "201405"};

                for (String month : months) {
                    String billurl = "http://iservice.10010.com/ehallService/static/historyBiil/execute/YH102010002/QUERY_YH102010002.processData/QueryYH102010002_Data/" + month + "/undefined";

                    HttpPost httpPost = new HttpPost(billurl);
                    HttpResponse billresponse = httpClient.execute(httpPost);

                    if (billresponse.getStatusLine().getStatusCode() == 200) {
                        saveToLocal(billresponse.getEntity(), "chinaunicom.bill." + month + ".2.html");
                    } else {
                        // Release the connection so the next request can reuse it.
                        EntityUtils.consume(billresponse.getEntity());
                    }
                }
            }
        }
    }

    // saveToLocal(...), shown further below, is also a member of this class.

First find the login URL and the parameters it expects. The phone number and service password are not provided here.
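The login URL above is built by plain string concatenation, which breaks if the password contains characters such as & or =. The following is a minimal sketch of building the same URL with percent-encoded values, using only the JDK. The class and method names are hypothetical, and the jQuery callback parameter is left out on the assumption that the endpoint does not require it (the captcha login URL later in this article omits it as well).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class LoginUrlBuilder {
    // Endpoint and fixed parameters are copied from the login URL used in this article.
    static final String LOGIN_ENDPOINT = "https://uac.10010.com/portal/Service/MallLogin";

    /** Percent-encodes a single query parameter value as UTF-8. */
    static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Builds the GET login URL, encoding the user-supplied values. */
    static String buildLoginUrl(String userName, String password) {
        return LOGIN_ENDPOINT
                + "?redirectURL=" + enc("http://www.10010.com")
                + "&userName=" + enc(userName)
                + "&password=" + enc(password)
                + "&pwdType=01&productType=01&redirectType=01&rememberMe=1";
    }

    public static void main(String[] args) {
        // Hypothetical credentials, for illustration only.
        System.out.println(buildLoginUrl("13000000000", "p&ss=word"));
    }
}
```

The resulting string can be passed straight to new HttpGet(...), exactly as the hand-built URL is in the code above.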

Create a DefaultHttpClient and send the request with GET. If the login succeeds, the result code in the response is 0000.
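The success check in the code above is a substring match on resultCode:"0000". As a slightly more explicit variant, the sketch below pulls the result code out with a regular expression so that other codes can be logged or handled. The class name, the sample response bodies, and the failure code 7007 are made up for illustration; only the resultCode:"0000" form comes from this article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LoginResult {
    // Matches resultCode:"0000" and also the quoted-key form "resultCode":"0000".
    private static final Pattern RESULT_CODE =
            Pattern.compile("\"?resultCode\"?\\s*:\\s*\"(\\d+)\"");

    /** Returns the result code found in the response body, or null if there is none. */
    static String extractResultCode(String body) {
        Matcher m = RESULT_CODE.matcher(body);
        return m.find() ? m.group(1) : null;
    }

    /** A login counts as successful only when the result code is exactly 0000. */
    static boolean isLoginSuccessful(String body) {
        return "0000".equals(extractResultCode(body));
    }

    public static void main(String[] args) {
        // Hypothetical response bodies; the real ones carry more fields.
        String ok = "jQuery123({resultCode:\"0000\",redirectURL:\"http://www.10010.com\"});";
        String failed = "jQuery123({resultCode:\"7007\"});";
        System.out.println(extractResultCode(ok) + " -> " + isLoginSuccessful(ok));
        System.out.println(extractResultCode(failed) + " -> " + isLoginSuccessful(failed));
    }
}
```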

Then request each month's bill with HttpPost and write the response to a local file.
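The bill request is issued once per month, and the months are hard-coded as yyyyMM strings in the code above. The sketch below generates such a list for an arbitrary range and builds the matching bill URL. The class and method names are hypothetical; the URL prefix is the one used in this article.

```java
import java.util.ArrayList;
import java.util.List;

class BillMonths {
    // Bill URL prefix copied from the code above; the month and "/undefined" are appended.
    static final String BILL_URL_PREFIX =
            "http://iservice.10010.com/ehallService/static/historyBiil/execute/YH102010002/"
            + "QUERY_YH102010002.processData/QueryYH102010002_Data/";

    /** Returns yyyyMM strings from the start month to the end month, both inclusive. */
    static List<String> between(int startYear, int startMonth, int endYear, int endMonth) {
        List<String> months = new ArrayList<String>();
        int year = startYear;
        int month = startMonth;
        while (year < endYear || (year == endYear && month <= endMonth)) {
            months.add(String.format("%04d%02d", year, month));
            month++;
            if (month > 12) { // roll over into January of the next year
                month = 1;
                year++;
            }
        }
        return months;
    }

    /** Builds the bill URL for one yyyyMM month. */
    static String billUrl(String month) {
        return BILL_URL_PREFIX + month + "/undefined";
    }

    public static void main(String[] args) {
        for (String month : between(2014, 1, 2014, 5)) {
            System.out.println(billUrl(month));
        }
    }
}
```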

    /**
     * Writes the response body to a local file.
     *
     * @param httpEntity response entity whose content is saved
     * @param filename   name of the file to create in the output directory
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        File dir = new File("/home/amosli/workspace/chinaunicom/");
        if (!dir.isDirectory()) {
            dir.mkdirs(); // also creates any missing parent directories
        }
        File file = new File(dir, filename);

        // try-with-resources (JDK 7) closes both streams even if the copy fails.
        // FileOutputStream creates the file itself, so createNewFile() is not needed.
        try (InputStream inputStream = httpEntity.getContent();
             FileOutputStream fileOutputStream = new FileOutputStream(file)) {
            byte[] bytes = new byte[1024];
            int length;
            // read() returns -1 at end of stream
            while ((length = inputStream.read(bytes)) != -1) {
                fileOutputStream.write(bytes, 0, length);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
} // end of class LoginChinaUnicom
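As an alternative to the manual read/write loop, JDK 7 can do the copy in a single call. The following is a minimal sketch using java.nio.file.Files.copy; the class and method names are hypothetical, and it takes a plain InputStream so that it can be tried without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StreamSaver {
    /** Copies the whole stream into dir/filename, creating dir if needed. Returns the bytes written. */
    static long save(InputStream in, Path dir, String filename) throws IOException {
        Files.createDirectories(dir);
        try (InputStream source = in) { // closes the stream when the copy ends
            return Files.copy(source, dir.resolve(filename), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory and an in-memory body stand in for the real output path and response.
        Path dir = Files.createTempDirectory("chinaunicom");
        byte[] body = "{\"demo\":true}".getBytes(StandardCharsets.UTF_8);
        long written = save(new ByteArrayInputStream(body), dir, "chinaunicom.bill.201401.html");
        System.out.println(written + " bytes written to " + dir);
    }
}
```

With HttpClient, the stream to pass in would be httpEntity.getContent().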

If you only want to print the response, EntityUtils.toString(HttpEntity entity) is enough. It delegates to the overload that also takes a default charset, whose source is as follows:

public static String toString(
        final HttpEntity entity, final Charset defaultCharset) throws IOException, ParseException {
    Args.notNull(entity, "Entity");
    final InputStream instream = entity.getContent();
    if (instream == null) {
        return null;
    }
    try {
        Args.check(entity.getContentLength() <= Integer.MAX_VALUE,
                "HTTP entity too large to be buffered in memory");
        int i = (int)entity.getContentLength();
        if (i < 0) {
            i = 4096;
        }
        Charset charset = null;
        try {
            final ContentType contentType = ContentType.get(entity);
            if (contentType != null) {
                charset = contentType.getCharset();
            }
        } catch (final UnsupportedCharsetException ex) {
            throw new UnsupportedEncodingException(ex.getMessage());
        }
        if (charset == null) {
            charset = defaultCharset;
        }
        if (charset == null) {
            charset = HTTP.DEF_CONTENT_CHARSET;
        }
        final Reader reader = new InputStreamReader(instream, charset);
        final CharArrayBuffer buffer = new CharArrayBuffer(i);
        final char[] tmp = new char[1024];
        int l;
        while((l = reader.read(tmp)) != -1) {
            buffer.append(tmp, 0, l);
        }
        return buffer.toString();
    } finally {
        instream.close();
    }
}

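As you can see, the implementation is fairly easy to follow. The charset can be specified or left out: the charset declared in the Content-Type header wins, then the default passed by the caller, and finally HTTP.DEF_CONTENT_CHARSET.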
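To see why the default charset matters for Chinese content, the sketch below mirrors the selection order from the source above using only the JDK. It assumes that HTTP.DEF_CONTENT_CHARSET is ISO-8859-1, as it is in HttpCore 4.x; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetFallbackDemo {
    /** Same order as EntityUtils.toString: declared charset, then caller default, then ISO-8859-1. */
    static Charset choose(Charset declared, Charset callerDefault) {
        if (declared != null) {
            return declared;
        }
        if (callerDefault != null) {
            return callerDefault;
        }
        return StandardCharsets.ISO_8859_1; // assumed value of HTTP.DEF_CONTENT_CHARSET
    }

    /** Decodes a response body with the charset picked by choose(). */
    static String decode(byte[] body, Charset declared, Charset callerDefault) {
        return new String(body, choose(declared, callerDefault));
    }

    public static void main(String[] args) {
        // Two Chinese characters meaning "bill", encoded as UTF-8 (3 bytes each).
        byte[] body = "\u8d26\u5355".getBytes(StandardCharsets.UTF_8);

        // No charset declared and no default: falls back to ISO-8859-1 and yields 6 garbled chars.
        System.out.println(decode(body, null, null).length());
        // With UTF-8 as the caller default the original two characters come back.
        System.out.println(decode(body, null, StandardCharsets.UTF_8).length());
    }
}
```

With HttpClient itself, the equivalent is to pass a default charset, for example EntityUtils.toString(entity, "UTF-8"), when the server does not declare one.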

2. Log in with a captcha and scrape the basic information

package com.amos;

import org.apache.http.HttpResponse;
import org.apache.http.client.CookieStore;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.*;
import org.apache.http.util.EntityUtils;

import java.io.BufferedReader;
import java.io.InputStreamReader;

/**
 * Created by amosli on 14-6-22.
 */
public class LoginWithCaptcha {

    public static void main(String args[]) throws Exception {
        // URL that generates the captcha image
        String createCaptchaUrl = "http://uac.10010.com/portal/Service/CreateImage";
        // Client used for the captcha requests; its cookie store receives the uacverifykey cookie.
        HttpClient httpClient = new DefaultHttpClient();

        String name = "your China Unicom mobile number";
        String pwd = "your service password";

        // Custom cookies can be added to this store if needed.
        // Note: httpclient (lower case) is a second, separate client used for the login and data requests.
        CookieStore cookieStore = new BasicCookieStore();
        CloseableHttpClient httpclient = HttpClients.custom()
                .setDefaultCookieStore(cookieStore)
                .build();

        // Fetch the captcha
        HttpGet captchaHttpGet = new HttpGet(createCaptchaUrl);
        HttpResponse captchaResponse = httpClient.execute(captchaHttpGet);

        if (captchaResponse.getStatusLine().getStatusCode() == 200) {
            // Save the captcha image locally
            LoginChinaUnicom.saveToLocal(captchaResponse.getEntity(), "chinaunicom.captcha." + System.currentTimeMillis());
        } else {
            // Release the connection so the next request can reuse it.
            EntityUtils.consume(captchaResponse.getEntity());
        }

        // Enter the captcha by hand and verify it
        HttpResponse verifyResponse = null;
        String captcha = null;
        String uvc = null;

        // 1) Read keyboard input. The reader is created once so buffered input is not lost between attempts.
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(System.in));

        do {
            System.out.println("Please enter the captcha:");
            captcha = bufferedReader.readLine();

            // 2) Alternative way to read the input:
            // Scanner scanner = new Scanner(System.in);
            // captcha = scanner.next();

            String verifyCaptchaUrl = "http://uac.10010.com/portal/Service/CtaIdyChk?verifyCode=" + captcha + "&verifyType=1";
            HttpGet verifyCaptchaGet = new HttpGet(verifyCaptchaUrl);
            verifyResponse = httpClient.execute(verifyCaptchaGet);

            // The uvc value is carried in the uacverifykey cookie.
            AbstractHttpClient abstractHttpClient = (AbstractHttpClient) httpClient;
            for (Cookie cookie : abstractHttpClient.getCookieStore().getCookies()) {
                System.out.println(cookie.getName() + ":" + cookie.getValue());
                if (cookie.getName().equals("uacverifykey")) {
                    uvc = cookie.getValue();
                }
            }
        } while (!EntityUtils.toString(verifyResponse.getEntity()).contains("true"));

        // Log in
        String loginurl = "https://uac.10010.com/portal/Service/MallLogin?userName=" + name + "&password=" + pwd + "&pwdType=01&productType=01&verifyCode=" + captcha + "&redirectType=03&uvc=" + uvc;
        HttpGet loginGet = new HttpGet(loginurl);
        CloseableHttpResponse loginResponse = httpclient.execute(loginGet);
        System.out.print("loginResponse:" + EntityUtils.toString(loginResponse.getEntity()));

        // Scrape the basic information
        HttpPost basicHttpPost = new HttpPost("http://iservice.10010.com/ehallService/static/acctBalance/execute/YH102010005/QUERY_AcctBalance.processData/Result");
        LoginChinaUnicom.saveToLocal(httpclient.execute(basicHttpPost).getEntity(), "chinaunicom.basic.html");
    }
}

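There are two tricky points here: the captcha and the uvc code.

The captcha is saved to a local file and then typed in by hand, so it is fairly easy to deal with.

The uvc code is important. It is stored in a cookie (uacverifykey). I searched online for a long time for how to work with cookies in HttpClient and found nothing; I only worked it out after reading the source.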
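The core of the uvc step is simply finding one cookie by name. The sketch below shows that lookup on raw Set-Cookie header values with the JDK's own java.net.HttpCookie, not with HttpClient's CookieStore, so it can be run without the library or a network connection. The class name, method name, and sample cookie values are hypothetical; only the cookie name uacverifykey comes from this article.

```java
import java.net.HttpCookie;
import java.util.Arrays;
import java.util.List;

class CookieLookup {
    /** Returns the value of the named cookie from a list of Set-Cookie header values, or null. */
    static String findCookieValue(List<String> setCookieHeaders, String name) {
        for (String header : setCookieHeaders) {
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                if (name.equals(cookie.getName())) {
                    return cookie.getValue();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up header values standing in for what the captcha check might return.
        List<String> headers = Arrays.asList(
                "JSESSIONID=xyz; Path=/",
                "uacverifykey=abc123; Path=/");
        System.out.println("uvc = " + findCookieValue(headers, "uacverifykey"));
    }
}
```

With HttpClient 4.3, the same loop can run over cookieStore.getCookies() on the BasicCookieStore that was passed to HttpClients.custom().setDefaultCookieStore(...), which avoids the cast to AbstractHttpClient.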

3. Results

[Screenshot: results]

Billing data (this is JSON, so it may not be very easy to read):

[Screenshot: billing data]

4. Source code for this article

https://github.com/amosli/crawl/tree/httpclient