How to scrape content in PHP from a website that requires cookie login?

Date: 2022-11-29 23:02:36

My problem is that it doesn't just require a basic cookie, but rather asks for a session cookie, and for randomly generated IDs. I think this means I need to use a web browser emulator with a cookie jar?

I have tried to use Snoopy, Goutte and a couple of other web browser emulators, but as of yet I have not been able to find tutorials on how to receive cookies. I am getting a little desperate!

Can anyone give me an example of how to accept cookies in Snoopy or Goutte?

Thanks in advance!

2 Answers

#1 (score 1)

Object-Oriented answer

We implement as much as possible of the above answer in one class called Browser that should supply the normal navigation features.

Then we can put the site-specific code, in very simple form, in a new derived class, say FooBrowser, that performs the scraping of the site Foo.

The class deriving from Browser must supply site-specific functionality, such as a path() function that decides where site-specific information gets stored, for example:

function path($basename) {
    return '/var/tmp/www.foo.bar/' . $basename;
}

abstract class Browser
{
    private $options = [];
    private $state   = [];
    protected $cookies;

    abstract protected function path($basename);

    public function __construct($site, $options = []) {
        $this->cookies   = $this->path('cookies');
        $this->options  = array_merge(
            [
                'site'      => $site,
                'userAgent' => 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 - LeoScraper',
                'waitTime'  => 250000,
            ],
            $options
        );
        $this->state = [
            'referer' => '/',
            'url'     => '',
            'curl'    => '',
        ];
        $this->__wakeup();
    }

    /**
     * Reactivates after sleep (e.g. in session) or creation
     */
    public function __wakeup() {
        $this->state['curl'] = curl_init();
        $this->config([
            CURLOPT_USERAGENT       => $this->options['userAgent'],
            CURLOPT_ENCODING        => '',
            CURLOPT_NOBODY          => false,
            // ...retrieving the body...
            CURLOPT_BINARYTRANSFER  => true,
            // ...as binary...
            CURLOPT_RETURNTRANSFER  => true,
            // ...into $ret...
            CURLOPT_FOLLOWLOCATION  => true,
            // ...following redirections...
            CURLOPT_MAXREDIRS       => 5,
            // ...reasonably...
            CURLOPT_COOKIEFILE      => $this->cookies,
            // ...reading cookies from this file...
            CURLOPT_COOKIEJAR       => $this->cookies,
            // ...and writing them back to the same file...
            CURLOPT_CONNECTTIMEOUT  => 30,
            // Seconds
            CURLOPT_TIMEOUT         => 300,
            // Seconds
            CURLOPT_LOW_SPEED_LIMIT => 16384,
            // 16 Kb/s
            CURLOPT_LOW_SPEED_TIME  => 15,
        ]);
    }

    /**
     * Imports an options array.
     *
     * @param array $opts
     * @throws \Exception if an option cannot be set
     */
    private function config(array $opts = []) {
        foreach ($opts as $key => $value) {
            if (true !== curl_setopt($this->state['curl'], $key, $value)) {
                throw new \Exception('Could not set cURL option');
            }
        }
    }

    private function perform($url) {
        $this->state['referer'] = $this->state['url'];
        $this->state['url'] = $url;
        $this->config([
            CURLOPT_URL     => $this->options['site'] . $this->state['url'],
            CURLOPT_REFERER => $this->options['site'] . $this->state['referer'],
        ]);
        $response = curl_exec($this->state['curl']);
        // Should we ever want to randomize waitTime, do so here.
        usleep($this->options['waitTime']);

        return $response;
    }

    /**
     * Gets a configuration option, optionally setting a new value.
     * Returns the value the option had before the call.
     * @param string $key       configuration key name
     * @param mixed  $value     new value to set (omit to only read)
     * @return mixed
     */
    protected function option($key, $value = '__DEFAULT__') {
        $curr   = $this->options[$key];
        if ('__DEFAULT__' !== $value) {
            $this->options[$key]    = $value;
        }
        return $curr;
    }

    /**
     * Performs a POST.
     *
     * @param $url
     * @param $fields
     * @return mixed
     */
    public function post($url, array $fields) {
        $this->config([
            CURLOPT_POST       => true,
            CURLOPT_POSTFIELDS => http_build_query($fields),
        ]);
        return $this->perform($url);
    }

    /**
     * Performs a GET.
     *
     * @param       $url
     * @param array $fields
     * @return mixed
     */
    public function get($url, array $fields = []) {
        $this->config([ CURLOPT_POST => false ]);
        if (empty($fields)) {
            $query = '';
        } else {
            $query = '?' . http_build_query($fields);
        }
        return $this->perform($url . $query);
    }
}

Now to scrape FooSite:

/* WWW_FOO_COM requires username and password to construct */

class WWW_FOO_COM_Browser extends Browser
{
    private $loggedIn   = false;

    public function __construct($username, $password) {
        parent::__construct('http://www.foo.bar.baz', [
            'username'  => $username,
            'password'  => $password,
            'waitTime'  => 250000,
            'userAgent' => 'FooScraper',
            'cache'     => true
        ]);
        // Open the session
        $this->get('/');
        // Navigate to the login page
        $this->get('/login.do');
    }

    /**
     * Perform login.
     */
    public function login() {
        $response = $this->post(
            '/ajax/loginPerform',
            [
                'j_un'    => $this->option('username'),
                'j_pw'    => $this->option('password'),
            ]
        );
        // TODO: verify that the response is OK, e.g.:
        // if (!strstr($response, "Welcome " . $this->option('username'))) {
        //     throw new \Exception("Bad username or password");
        // }
        $this->loggedIn = true;
        return true;
    }

    public function scrape($entry) {
        // We could implement caching to avoid scraping the same entry
        // too often. Save $data into path("entry-" . md5($entry))
        // and verify the filemtime of said file, is it newer than time()
        // minus, say, 86400 seconds? If yes, return file_get_content and
        // leave remote site alone.
        $data = $this->get(
            '/foobars/baz.do',
            [
                'ticker' => $entry
            ]
        );
        return $data;
    }

    // Implement the abstract path(): where this site's files are kept.
    protected function path($basename) {
        return '/var/tmp/www.foo.bar/' . $basename;
    }
}

Now the actual scraping code would be:

    $scraper = new WWW_FOO_COM_Browser('lserni', 'mypassword');
    if (!$scraper->login()) {
        throw new \Exception("bad user or pass");
    }
    foreach ($entries as $entry) {
        $html = $scraper->scrape($entry);
        // Parse HTML
    }

Mandatory notice: use a suitable parser to get data from raw HTML.

#2 (score 23)

You can do that in cURL without needing external 'emulators'.

The code below retrieves a page into a PHP variable to be parsed.

Scenario

There is a page (let's call it HOME) that opens the session. On the server side, if the site is in PHP, that is whichever page (any page, actually) calls session_start() first. In other languages you need a specific page that does all the session setup. On the client side, it is the page supplying the session ID cookie. In PHP, all sessioned pages do this; in other languages the landing page does it, and all the others check whether the cookie is there; if it isn't, instead of creating the session, they drop you back to HOME.

There is a page (LOGIN) that generates the login form and adds a critical piece of information to the session: "This user is logged in". In the code below, this is the page asking for the session ID.

And finally there are N pages where the goodies to be scraped reside.

So we want to hit HOME, then LOGIN, then GOODIES one after another. In PHP (and other languages actually), again, HOME and LOGIN might well be the same page. Or all pages might share the same address, for example in Single Page Applications.
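
To make the scenario concrete, here is a minimal, purely illustrative sketch of the three server-side roles in PHP. The cUSR/cPASS field names are the ones this site actually uses (see below); credentials_ok() is a hypothetical stand-in, since the real server code is unknown:

    <?php
    // HOME (or any sessioned page): the first session_start() in a request
    // makes PHP send "Set-Cookie: PHPSESSID=<random id>" to the client.
    session_start();

    // LOGIN: a successful POST adds the critical information to the session.
    if (isset($_POST['cUSR'], $_POST['cPASS'])
            && credentials_ok($_POST['cUSR'], $_POST['cPASS'])) { // hypothetical check
        $_SESSION['loggedIn'] = true;
    }

    // GOODIES pages: no flag in the session? Back to HOME you go.
    if (empty($_SESSION['loggedIn'])) {
        header('Location: /');
        exit;
    }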

The Code

    $url            = "the url generating the session ID";
    $next_url       = "the url asking for session";

    $ch             = curl_init();
    curl_setopt($ch, CURLOPT_URL,    $url);
    // We do not authenticate, only access page to get a session going.
    // Change to False if it is not enough (you'll see that cookiefile
    // remains empty).
    curl_setopt($ch, CURLOPT_NOBODY, True);

    // You may want to change User-Agent here, too
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookiefile");
    curl_setopt($ch, CURLOPT_COOKIEJAR,  "cookiefile");

    // Just in case
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    $ret    = curl_exec($ch);

    // This page we retrieve, and scrape, with GET method
    foreach(array(
            CURLOPT_POST            => False,       // We GET...
            CURLOPT_NOBODY          => False,       // ...the body...
            CURLOPT_URL             => $next_url,   // ...of $next_url...
            CURLOPT_BINARYTRANSFER  => True,        // ...as binary...
            CURLOPT_RETURNTRANSFER  => True,        // ...into $ret...
            CURLOPT_FOLLOWLOCATION  => True,        // ...following redirections...
            CURLOPT_MAXREDIRS       => 5,           // ...reasonably...
            CURLOPT_REFERER         => $url,        // ...as if we came from $url...
            //CURLOPT_COOKIEFILE      => 'cookiefile', // Save these cookies
            //CURLOPT_COOKIEJAR       => 'cookiefile', // (already set above)
            CURLOPT_CONNECTTIMEOUT  => 30,          // Seconds
            CURLOPT_TIMEOUT         => 300,         // Seconds
            CURLOPT_LOW_SPEED_LIMIT => 16384,       // 16 Kb/s
            CURLOPT_LOW_SPEED_TIME  => 15,          // Seconds
            ) as $option => $value)
            if (!curl_setopt($ch, $option, $value))
                    die("could not set $option to " . serialize($value));

    $ret = curl_exec($ch);
    // Done; cleanup.
    curl_close($ch);

Implementation

First of all we have to get the login page.

We use a special User-Agent to introduce ourselves, both to be recognizable (we don't want to antagonize the webmaster) and to coax the server into sending us the same version of the site it would tailor for a real browser. Ideally, we use the same User-Agent as whatever browser we will use to debug the page, plus a suffix making it clear to whoever checks that they are looking at an automated tool (see the comment by Halfer).

    $ua = 'Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0 (ROBOT)';
    $cookiefile = "cookiefile";
    $url1 = "the login url generating the session ID";

    $ch             = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url1);
    curl_setopt($ch, CURLOPT_USERAGENT,      $ua);
    curl_setopt($ch, CURLOPT_COOKIEFILE,     $cookiefile);
    curl_setopt($ch, CURLOPT_COOKIEJAR,      $cookiefile);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, True);
    curl_setopt($ch, CURLOPT_NOBODY,         False);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, True);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, True);
    $ret    = curl_exec($ch);

This will retrieve the page asking for user/password. By inspecting the page, we find the needed fields (including hidden ones) and can populate them. The FORM tag tells us whether we need to go on with POST or GET.

We might want to inspect the form code to adjust the following operations, so we ask cURL to return the page content as-is into $ret, body included. Sometimes CURLOPT_NOBODY set to True is still enough to trigger session creation and cookie submission, and when it is, it's faster. But CURLOPT_NOBODY ("no body") works by issuing a HEAD request instead of a GET, and sometimes the HEAD request doesn't work because the server will only react to a full GET.

Instead of retrieving the body this way, it is also possible to log in using a real Firefox and sniff the form content being posted with Firebug (or Chrome with its DevTools); some sites will try to populate or modify hidden fields with Javascript, so that the form actually submitted will not be the one you see in the HTML code.

A webmaster who wanted his site not scraped might send a hidden field with a timestamp. A human being (unaided by a too-clever autofilling browser; there are ways to tell browsers not to be clever, and at worst you can rename the user and password fields on every request) takes at least three seconds to fill in a form. A cURL script takes zero. Of course, a delay can be simulated. It's all shadowboxing...
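
For instance, a human-ish pause before submitting the login form is a one-liner (a sketch; tune the bounds to taste):

    usleep(mt_rand(2500000, 6000000)); // wait 2.5-6 seconds, like a human would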

We may also want to be on the lookout for form-appearance tricks. A webmaster could, for example, build a form asking for name, email, and password, and then, through CSS, move the "email" field where you would expect to find the name, and vice versa. So the real form being submitted will have a "@" in a field called username and none in the field called email. The server, which expects this, merely swaps the two fields back. A hand-built "scraper" (or a spambot) would do what seems natural and put an email address in the email field, and by doing so it betrays itself. By working through the form once with a real CSS- and JS-aware browser, sending meaningful data, and sniffing what actually gets sent, we might be able to overcome this particular obstacle. Might, because there are ways of making life difficult. As I said, shadowboxing.

Back to the case at hand: here the form contains three fields and has no Javascript overlay. We have cPASS, cUSR, and checkLOGIN with a value of 'Check login'.

So we prepare the form with the proper fields. Note that the form is to be sent as application/x-www-form-urlencoded, which in PHP cURL means two things:

  • we are to use CURLOPT_POST
  • the option CURLOPT_POSTFIELDS must be a string (an array would signal cURL to submit as multipart/form-data, which might work... or might not).

The form fields are, as the name says, urlencoded; PHP has a function for that (urlencode()).
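
As a side note, the encode-and-implode loop shown further down can be collapsed into a single call to PHP's http_build_query(), which produces the same application/x-www-form-urlencoded string (and is what answer #1 uses):

    // Equivalent to the urlencode()/implode() loop below:
    $string = http_build_query(array(
        'checkLOGIN' => 'Check Login',
        'cUSR'       => 'jb007',
        'cPASS'      => 'astonmartin',
    ));
    // $string is now "checkLOGIN=Check+Login&cUSR=jb007&cPASS=astonmartin"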

We read the action field of the form; that's the URL we are to use to submit our authentication (which we must have).
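
Reading that action (and the method) can itself be done with PHP's built-in DOM; a short sketch, assuming $ret holds the login page HTML retrieved above:

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom->loadHTML($ret);
    libxml_clear_errors();
    $form = $dom->getElementsByTagName('form')->item(0);
    if ($form !== null) {
        $action = $form->getAttribute('action');
        $method = strtoupper($form->getAttribute('method') ?: 'GET');
    }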

So everything being ready...

    $fields = array(
        'checkLOGIN' => 'Check Login',
        'cUSR'       => 'jb007',
        'cPASS'      => 'astonmartin',
    );
    $coded = array();
    foreach($fields as $field => $value)
        $coded[] = $field . '=' . urlencode($value);
    $string = implode('&', $coded);

    curl_setopt($ch, CURLOPT_URL,         $url1); //same URL as before, the login url generating the session ID
    curl_setopt($ch, CURLOPT_POST,        True);
    curl_setopt($ch, CURLOPT_POSTFIELDS,  $string);
    $ret    = curl_exec($ch);

We expect now a "Hello, James - how about a nice game of chess?" page. But more than that, we expect that the session associated with the cookie saved in the $cookiefile has been supplied with the critical information -- "user is authenticated".

So all following page requests made using $ch and the same cookie jar will be granted access, allowing us to 'scrape' pages quite easily - just remember to set request mode back to GET:

    curl_setopt($ch, CURLOPT_POST,        False);

    // Start spidering
    foreach($urls as $url)
    {
        curl_setopt($ch, CURLOPT_URL, $url);
        $HTML = curl_exec($ch);
        if (False === $HTML)
        {
            // Something went wrong, check curl_error() and curl_errno().
        }
    }
    curl_close($ch);

In the loop, you have access to $HTML -- the HTML code of every single page.

Great the temptation of using regexps is. Resist it you must. To better cope with ever-changing HTML, as well as being sure not to turn up false positives or false negatives when the layout stays the same but the content changes (e.g. you discover that you have the weather forecasts of Nice, Tourrette-Levens, Castagniers, but never Asprémont or Gattières, and isn't that cürious?), the best option is to use DOM:

Grabbing the href attribute of an A element
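
In that spirit, a minimal sketch using PHP's built-in DOM and XPath, assuming $HTML is one of the pages fetched in the loop above:

    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate tag soup
    $dom->loadHTML($HTML);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[@href]') as $a) {
        echo $a->getAttribute('href'), "\n";
    }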
