How can curl/wget or another tool fetch a page that goes through multiple 302 redirects and requires login?

Time: 2022-09-03 21:47:46
Hello everyone!
I have a question I'd like to ask.
I'm not very familiar with curl; the few times I've used it were to fetch HTML pages that don't require authentication.
This time the page I want to fetch is:
"https://cares.web.alcatel-lucent.com/cgi-bin/fast/view.cgi?AR=1-6620492"
But when I open this page in a browser, I get redirected to an SSO authentication page: "https://intranetlogin.web.alcatel-lucent.com:1040/siteminderagent/SmMakeCookie.ccc?SMSESSION=QUERY&PERSIST=0&TARGET=$SM$HTTPS%3a%2f%2fcares%2eweb%2ealcatel-lucent%2ecom%2fcgi-bin%2ffast%2fview%2ecgi%3fAR%3d1-6620492"
As shown in the screenshot below:
[Screenshot: SSO login page]
After entering the username and password, the content page loads.
[Screenshot: page shown after login]
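I originally wanted to fetch it with the curl or wget command on Linux. I looked through a lot of material online; some of it says curl's -L option follows redirects and the -u option supplies a username and password for authentication.
But with the following command I still cannot get the content page that appears after authentication.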
curl -v -A "Mozilla/4.0" -c $arcookie -d "USER=$user&PASSWORD=$pass&submit=Login" --location-trusted "http://cares.web.alcatel-lucent.com:80/cgi-bin/fast/view.cgi?AR=1-6620492"
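
A note on the command above, offered as a sketch rather than a diagnosis: -d turns the first request into a POST to the content URL itself, not to the login form, and when curl follows a 302 after a POST it switches to GET by default (--post302 changes that). -u supplies HTTP authentication credentials, which is not the same as filling in an HTML login form. To see what each hop actually returns, something like the following could help. The function name trace_hops, the default jar file cookies.txt, and the User-Agent string are illustrative choices, not from the original post.

```shell
# trace_hops URL [COOKIE_JAR]
# Follow redirects and print only the status line and Location header of each hop.
# (Illustrative helper; the name and defaults are assumptions.)
trace_hops() {
    url=$1
    jar=${2:-cookies.txt}
    # -D - writes every response's headers to stdout; -o /dev/null discards bodies.
    # -c and -b use the same file, so cookies set on one hop are sent on the next
    # and written back when curl exits.
    curl -sS -L -o /dev/null -D - -c "$jar" -b "$jar" -A 'Mozilla/5.0' "$url" |
        tr -d '\r' |
        grep -i -E '^(HTTP/|Location:)'
}
```

It could be called as: trace_hops "http://cares.web.alcatel-lucent.com/cgi-bin/fast/view.cgi?AR=1-6620492"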

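Then I used Wireshark to capture the packets between a browser (local machine IP 192.11.23.245) and the server (135.3.63.53). (It seems attachments can't be uploaded here, so please download the capture from "http://maru.tech:8888/ar.pcap" and the login page source from "http://maru.tech:8888/source.txt".)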

I went through the capture carefully and found only 6 HTTP messages (3 request/response round trips).

------------------------------------------------------------------------------------------------
GET /cgi-bin/fast/view.cgi?AR=1-6620595 HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: zh-CN
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept-Encoding: gzip, deflate
Host: cares.web.alcatel-lucent.com
DNT: 1
Connection: Keep-Alive
------------------------------------------------------------------------------------------------
HTTP/1.1 302 Found
Date: Sat, 04 Feb 2017 11:01:50 GMT
Server: Apache/2.2.22 (Unix) JRun/4.0 mod_ssl/2.2.22 OpenSSL/0.9.8zc
Cache-Control: no-store
Location: https://intranetlogin.web.alcatel-lucent.com:1040/siteminderagent/SmMakeCookie.ccc?SMSESSION=QUERY&PERSIST=0&TARGET=$SM$HTTP%3a%2f%2fcares%2eweb%2ealcatel-lucent%2ecom%2fcgi-bin%2ffast%2fview%2ecgi%3fAR%3d1-6620595
Content-Length: 406
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://intranetlogin.web.alcatel-lucent.com:1040/siteminderagent/SmMakeCookie.ccc?SMSESSION=QUERY&PERSIST=0&TARGET=$SM$HTTP%3a%2f%2fcares%2eweb%2ealcatel-lucent%2ecom%2fcgi-bin%2ffast%2fview%2ecgi%3fAR%3d1-6620595">here</a>.</p>
</body></html>
------------------------------------------------------------------------------------------------
GET /cgi-bin/fast/view.cgi?AR=1-6620595&SMSESSION=NO HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept-Encoding: gzip, deflate
DNT: 1
Host: cares.web.alcatel-lucent.com
Accept-Language: zh-CN
Connection: Keep-Alive
------------------------------------------------------------------------------------------------
HTTP/1.1 302 Found
Date: Sat, 04 Feb 2017 11:02:12 GMT
Server: Apache/2.2.22 (Unix) mod_ssl/2.2.22 OpenSSL/0.9.8zc JRun/4.0
Cache-Control: no-store
Location: https://intranetlogin.web.alcatel-lucent.com:1040/login/login_login_intranet1_https_cares.html?TYPE=33554433&REALMOID=06-74ffb716-1721-4d9b-84ea-99126b91c132&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=$SM$oVcoNP6ZmVxXmgw%2bkWmUFhjiqFpqn%2btBGj6adK2eBdxXbebN%2fWEZ2XkXuJ787kVm&TARGET=$SM$HTTP%3a%2f%2fcares%2eweb%2ealcatel-lucent%2ecom%2fcgi-bin%2ffast%2fview%2ecgi%3fAR%3d1-6620595
Content-Length: 590
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://intranetlogin.web.alcatel-lucent.com:1040/login/login_login_intranet1_https_cares.html?TYPE=33554433&REALMOID=06-74ffb716-1721-4d9b-84ea-99126b91c132&GUID=&SMAUTHREASON=0&METHOD=GET&SMAGENTNAME=$SM$oVcoNP6ZmVxXmgw%2bkWmUFhjiqFpqn%2btBGj6adK2eBdxXbebN%2fWEZ2XkXuJ787kVm&TARGET=$SM$HTTP%3a%2f%2fcares%2eweb%2ealcatel-lucent%2ecom%2fcgi-bin%2ffast%2fview%2ecgi%3fAR%3d1-6620595">here</a>.</p>
</body></html>
------------------------------------------------------------------------------------------------
GET /cgi-bin/fast/view.cgi?AR=1-6620595 HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: zh-CN
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
Accept-Encoding: gzip, deflate
Host: cares.web.alcatel-lucent.com
DNT: 1
Connection: Keep-Alive
Cookie: PROD_HASLOGGEDIN=TRUE; SM_USER=jzhu039; SMSESSION=TIvmY9x/xKLz27J0cuZOObBujbtcWEDZyKWZQuu6Pg8CAduPPZgccj8fVcRvz7k3dYTbWYibU7t7GsWlAhxeXVU0ZcyceByHbJbwmdcgnyL2lG0p4Tcj4PcBzcflbIdzc4JaoqzNx0lZykZdre6FLVxl0IHCMQ+5pWIYuHicsIatHE1DfDn0U5ayOKnOHU+nhq7J6qNV8rOLZe7m69U68FZTg4iufIH59jPH0WLBsyQPX9KftpVloE88O3QE2ezdbihGWb7DMnBQKoWAm/ftBu425uIpofr+IJS/INQemQQVi3hyyaNpfnwvuyKqLY2y3k/zl7Af0ZBS57tB+wupOB7CCPgBZddNvMCTBO4ZHDMcD1ITIkBALLVeE1/2wkBn5/+3NzjIdp7OlfLF8gTrEos8iHrOsrjWkNt7LpmIal2h3kY96oBlKeNGEcZ6Qin9Y+nr8wPiOjFIywRCkLSI+ogOE60Yehnmh5Eygcey5zQRZaR9eJVxXZG7llnAM/0Nr5uY5VQ4YnDALl2SB652/vqSvuzd+s8HBTx1GZ9jmicblpslV0VvmGH6Gvgabc4TyGIajMW44hTKvFFDD+GiwCtK0uCCWjEhm6hWF5q2aVLUcjkluiK9fe1WJ9Y6/AyFbFWZsTQHcg3lHFr1WXPa2agR1RrWQQk+2gh6w6zqWyf8IeylAa6aN1hQxuyDkV+9d/LLUVBBwL7e3n0RQl8OgehBqiQvArFVWsSlxFQ7vE7dtrBMizeHX7kgnPi14L1s0XcUWFqCM1h1ypXnR8ckO3u+L/9xIvCIch9ktfG3DI2LcSST9ccDPCzGyq2N6lTqCuSEtlZmzErHtXMcGWK47xpinJEviouGuQL5eM5MfADbPjRFWaoYQkF11WoPGXzbdRMaLGQ+3bXE7ikcxWHPLYg2awbMq10r3kxo+GiymIjmtsV9hJVXXxakwZfeIbHLzPlZjkKVf8/v5Jw0TBQ+ck+SVedliZD/nk6uqw50MoC+vhgaDXfmh5ErHwhFw1BdFitnX5VWzvFTg5O0tKj1N7aal93IVLTL7sqMBjmXHq88LG1r6IUv1cTJByDM+/t8/lJSajCkf2Jhc2O/C+ACKyUT9U/QyDvyEDtFfcL1Tji4eJq6Jrt6v/y1TdbXfjqq; SMIDENTITY=TKeENBK5q2cQanKqbpuEAUNLkbBaUttOP64VBWSs3Gve7b0l0j7pgyVgtjtFiuz0p3aEZFqUj/NSNVVjnY41CNdtd6SO/e3pN9u3m2Gl857lmBkrKsXor/aj8YUtIrx7MDi/ILtr8LpBd8PJKiX0MqbE6lj81Uh8Ya5OXkmS2OIUMQ/9CXKYYR1FzDYfqNPC8F9MFGD/TVqDlktJZqTKrj/BvdsU9vciwiezdG8bysK6gsxlU+hgmVKT0485GPHV4QjAOmBk8qsLFno8e6EAxqOiCZy8YYj+4gLGqys4PJ+bkF+xkZDDzMcf5FcB+EVR9SIMGFIUXffKGRwus9difkFZazZKH0EjOKYt8coqFCagFtnYZyXqCZp4FsY3omJaGpwK8ph0EKaE4pyg/eadf/7s229FFAGuc2hCLgjOPrV4azdMOqDLs/umAXnrAMiDWBLtnk+9huPMP2ObaQpJCfAIyViNI+sNgC2DmGJ5+uanTGRV4W0ngMaBVs7Nq+FiJiGQao2KLdLEXwCG5fO1LdCnxlkrWFL20j9TrIMBAtVoERf/1JVv72mq8D3S7AeeT+aA5MKrr5OmDhm7/79Z3K1KcpIwXZumZ9AySgtsmpK1RyWEhBrMQHWE25vxe3Pic19xG05kgHmftIy61TBAGL855OwHZnCUc8aYlriiG1AA2aF6OMb4iZovkaryDYdf; SM_SESSION_STAMP=255466963
------------------------------------------------------------------------------------------------
HTTP/1.1 200 OK
Date: Sat, 04 Feb 2017 11:02:44 GMT
Server: Apache/2.2.22 (Unix) mod_ssl/2.2.22 OpenSSL/0.9.8zc JRun/4.0
Cache-Control: max-age=30
Set-Cookie: SMSESSION=mjMo/cVBfFWHH2UqvahUPWKP3UN7w7X+GsfFSJpbJhPmXP+mH5d3kDEdWBbV1AtD7aQ83VQEu9jhystK/i+wRjN4b047vnNp8ZGbBwCHmeefbIs2o/wUzj2BWbUmphESm2vj+Ln8kJrCH24hRchxAdCWbdTRNIKRWLYsMfD6283oDhw02vRlM+eKdA48HAaVGP01ZsbpTyWzALB4PiZNvbWTTleIe58wBr7TReUwpzjP3iRIrXRuwtP7E3TfS9FWpeg4Mbo4UBfPHDWQNStQJWVoGw4a6NtQGpC9AqqHD1jPdzRWZpRDMeNF0UkHRli0+u3RIUPGyWeoLugbNYSLFw5PAa/m9nfUHo6fjzrrWuprWgz9NuoOtFgnrGFaxJ2A+aYDiaR0Vx4i8X2EdIJHztfTmOPv3Zk4XBLXrq3HDQHF26KlcnYzr59xSaZFn3mNFJPxHZo2CocFHy7ysRFCQnlLb9QRPwCK1yAJek0h6r25iA6+H9M2ljrNee0LjXdM+0JN6A5+YUG6pJJucea1WqMVbsNcl0jL9JGRdlgnCC+0ggdtPDC3jgV4NbE4rSaCCVSlvtyXLJthpMMm4Zh37XM1fAJMxL9mrfV2A+uq5ubaPSt+B5HF/VQZ1f1iM7hyF4H8uqii69hprm3h0zNbMi7vkYLfsEdMwozUuqjRHh1ZQRIfHP4R44OaRLiI2XgT6iIUuMJPAvNkyQvOOrZNUGOaVZOnyt35Rq3IfLNgvJiSPufKi0g8PFalqLi+Tsgk0MR//Q5G6IMRWSkQVDiv6ndgcA06ub4y8W/58hzMbs+vz1iqHTI+2CaKXuaJadBFXjKYZug+s9lBWiBuvYA+MUm25RDFPNN8rc0g63Gszhfcxeo5PYW0HAmV8HUTOopig5O+IeatQ+9zQeZNcDZ87oVoyoPLCqPcbdhwjq9yWd49Nu0qvxoYO3qICpnV6MunlRAiMS5EzX+iHfYA+bSAN2sS1QIU0xhzD78k0YspKhMSJh+NECBKdccfBAaUmgf8U1GfPbM+nQrgs4quLphhOebkVcpPvYeJZ0gk/zCU7uJBUvWD8FmuqQ8LGNbqJwGe2tSIjf2/ajHvs0leWILwTWjbuOqzAEwdkP0+mJXESt8JI6DHVmLcU0ZOVADitHOM; path=/; domain=.alcatel-lucent.com
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=ISO-8859-1
------------------------------------------------------------------------------------------------
Below are the cookies saved on the local machine:
SMIDENTITY
TKeENBK5q2cQanKqbpuEAUNLkbBaUttOP64VBWSs3Gve7b0l0j7pgyVgtjtFiuz0p3aEZFqUj/NSNVVjnY41CNdtd6SO/e3pN9u3m2Gl857lmBkrKsXor/aj8YUtIrx7MDi/ILtr8LpBd8PJKiX0MqbE6lj81Uh8Ya5OXkmS2OIUMQ/9CXKYYR1FzDYfqNPC8F9MFGD/TVqDlktJZqTKrj/BvdsU9vciwiezdG8bysK6gsxlU+hgmVKT0485GPHV4QjAOmBk8qsLFno8e6EAxqOiCZy8YYj+4gLGqys4PJ+bkF+xkZDDzMcf5FcB+EVR9SIMGFIUXffKGRwus9difkFZazZKH0EjOKYt8coqFCagFtnYZyXqCZp4FsY3omJaGpwK8ph0EKaE4pyg/eadf/7s229FFAGuc2hCLgjOPrV4azdMOqDLs/umAXnrAMiDWBLtnk+9huPMP2ObaQpJCfAIyViNI+sNgC2DmGJ5+uanTGRV4W0ngMaBVs7Nq+FiJiGQao2KLdLEXwCG5fO1LdCnxlkrWFL20j9TrIMBAtVoERf/1JVv72mq8D3S7AeeT+aA5MKrr5OmDhm7/79Z3K1KcpIwXZumZ9AySgtsmpK1RyWEhBrMQHWE25vxe3Pic19xG05kgHmftIy61TBAGL855OwHZnCUc8aYlriiG1AA2aF6OMb4iZovkaryDYdf
alcatel-lucent.com/
1024
642348288
30719097
871855177
30572246
*
SM_SESSION_STAMP
255466963
alcatel-lucent.com/
1088
1121924352
30572313
886716027
30572246
*
------------------------------------------------------------------------------------------------
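What puzzles me is: why are all 3 requests GETs?
In that case, how are the username and password sent to the server?
How should a page like this be logged into and fetched with a command or a program?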
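
One possible reading of the capture, stated as an assumption since the thread does not confirm it: all three captured requests go to cares.web.alcatel-lucent.com over plain HTTP, while both Location headers point to https://intranetlogin.web.alcatel-lucent.com:1040. If the credentials are submitted to that HTTPS origin, the exchange would be encrypted and addressed to a different host, so it would not appear as readable HTTP in a capture limited to the cares server. The sketch below splits a redirect URL into its origin and its decoded TARGET parameter to make that visible. url_origin and target_of are hypothetical helper names, and target_of decodes only the five escapes that occur in the captured URLs.

```shell
# url_origin URL: print scheme://host[:port] of a URL.
url_origin() {
    printf '%s\n' "$1" | sed -n 's|^\([A-Za-z][A-Za-z0-9+.-]*://[^/?#]*\).*|\1|p'
}

# target_of URL: print the TARGET parameter of a SiteMinder-style redirect URL,
# with the $SM$ prefix removed and %3a %2f %2e %3f %3d decoded.
target_of() {
    printf '%s\n' "$1" |
        sed -n 's/.*[?&]TARGET=\$SM\$\([^&]*\).*/\1/p' |
        sed -e 's/%3[aA]/:/g' -e 's/%2[fF]/\//g' -e 's/%2[eE]/./g' \
            -e 's/%3[fF]/?/g' -e 's/%3[dD]/=/g'
}

# The Location value from the first 302 in the capture.
loc='https://intranetlogin.web.alcatel-lucent.com:1040/siteminderagent/SmMakeCookie.ccc?SMSESSION=QUERY&PERSIST=0&TARGET=$SM$HTTP%3a%2f%2fcares%2eweb%2ealcatel-lucent%2ecom%2fcgi-bin%2ffast%2fview%2ecgi%3fAR%3d1-6620595'
url_origin "$loc"   # prints https://intranetlogin.web.alcatel-lucent.com:1040
target_of "$loc"    # prints HTTP://cares.web.alcatel-lucent.com/cgi-bin/fast/view.cgi?AR=1-6620595
```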

3 answers

#1


You need to log in with the username and password, then save the cookie you get back. Then send that cookie along with the next request.

curl_setopt($ch, CURLOPT_COOKIEJAR, GOOGLE_PLAY_COOKIE_FILE);  
curl_setopt($ch, CURLOPT_COOKIEFILE, GOOGLE_PLAY_COOKIE_FILE);

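The command line has these two corresponding options:
-b/--cookie <name=string/file>    cookie string, or file to read cookies from
-c/--cookie-jar <file>            file to write cookies to when the operation finishes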

Reference: http://blog.csdn.net/fdipzone/article/details/8821957
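
As a sketch of how those two options might be combined once a valid session exists: the capture in the question shows the successful request carrying an SMSESSION cookie and the 200 response issuing a fresh one via Set-Cookie, so reading and writing the same jar on every request seems reasonable. The function name fetch_with_session and the default output file page.html are illustrative, not from the thread.

```shell
# fetch_with_session URL JAR [OUTPUT_FILE]
# Request URL with the cookies stored in JAR, write refreshed cookies back to JAR,
# and report whether the server answered with the page or with another redirect.
fetch_with_session() {
    url=$1
    jar=$2
    out=${3:-page.html}
    # -w '%{http_code}' prints only the status code; the body goes to $out.
    # No -L here, so a 302 is reported instead of being followed silently.
    code=$(curl -sS -o "$out" -w '%{http_code}' -b "$jar" -c "$jar" "$url")
    case $code in
        200) echo "ok: saved to $out" ;;
        30?) echo "redirected ($code): session cookie missing or expired"; return 1 ;;
        *)   echo "unexpected status $code"; return 1 ;;
    esac
}
```

A non-zero return status signals that the login step has to be repeated before the jar is usable again.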

#2


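Thank you for your reply.
I tried submitting the form with the username and password via POST with FOLLOWLOCATION enabled, but the login page does not accept POST at all; the request is forced into a GET, it still returns a 302 status code, and no cookie is generated.
If it were as simple as logging in with a username and password and saving the cookie, I wouldn't have needed to capture packets.
I've already read a lot of material and tried every method suggested online, and none of them worked. I think the only way to find out why the usual approaches fail is to understand the login page's JavaScript source and the captured HTTP exchange.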

#3


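You probably should not set FOLLOWLOCATION.
Instead, get the address from the 302 response,
then use curl to access that address and submit the username and password.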

Reference: http://blog.csdn.net/fdipzone/article/details/8821957
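
The final step of this suggestion could look roughly like the following. It assumes the login page is an ordinary HTML form: login_action stands for the form's action URL and has to be read from the login page source, the field names USER and PASSWORD are taken from the curl command in the question, and any hidden fields the form carries are passed through as extra name=value arguments. submit_login is a hypothetical name.

```shell
# submit_login ACTION_URL USER PASS JAR [name=value ...]
# POST the credentials (plus any hidden form fields) to the form's action URL.
submit_login() {
    [ $# -ge 4 ] || { echo "usage: submit_login ACTION_URL USER PASS JAR [name=value ...]" >&2; return 2; }
    login_action=$1
    user=$2
    pass=$3
    jar=$4
    shift 4
    # Turn each remaining name=value argument into a --data-urlencode option.
    for field in "$@"; do
        set -- "$@" --data-urlencode "$field"
        shift
    done
    # --data-urlencode makes this a POST and URL-encodes each value.
    # Cookies set by the login server are kept in the jar, and -L follows
    # the redirect back to the protected page.
    curl -sS -L -c "$jar" -b "$jar" \
        --data-urlencode "USER=$user" \
        --data-urlencode "PASSWORD=$pass" \
        "$@" \
        "$login_action"
}
```

Because the POST goes straight to the form's action URL, the redirect that follows a successful login is the one place where curl's default switch from POST to GET is the desired behavior.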
