How can I get an HTML page as a string with PHP?

Date: 2022-10-31 13:17:15

I am fetching some info from a webpage via PHP, using simple_html_dom and curl. The problem is that the page is not built correctly, so the DOM object contains erroneous info.

How can I get the HTML file as a string in a PHP var so that I can run a regular expression through it?

cURL doesn't work because it drops the malformed part of the page.
simple_html_dom.php has the same issue.
wget isn't an option because I don't have permission to run it on the server.

3 Solutions

#1


12  

file_get_contents — Reads entire file into a string

    string file_get_contents ( string $filename [, int $flags = 0 [, resource $context [, int $offset = -1 [, int $maxlen = -1 ]]]] )

from the manual:

This function is similar to file(), except that file_get_contents() returns the file in a string, starting at the specified offset up to maxlen bytes. On failure, file_get_contents() will return FALSE.

file_get_contents() is the preferred way to read the contents of a file into a string. It will use memory mapping techniques if supported by your OS to enhance performance.

It works with both web pages and local files. You can grab the HTML just by passing "http://whatever.com/page.html" as $filename, provided allow_url_fopen is enabled on the server.
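
As a rough illustration of the approach above, the sketch below reads a page into a string and then runs a regular expression over the raw text, so malformed markup never goes through a DOM parser. The helper names fetch_html and extract_title, the User-Agent string, and the 10-second timeout are assumptions made up for this example; they are not part of the original answer.

```php
<?php
// Hypothetical helper: read a URL or local file into one raw string.
// Returns null instead of false so callers can use a simple null check.
function fetch_html(string $source): ?string
{
    // Stream context with a browser-like User-Agent (assumed value);
    // the 'http' options are simply ignored for local files.
    $context = stream_context_create([
        'http' => [
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)',
            'timeout'    => 10,
        ],
    ]);

    // @ hides the warning; failure is detected with a strict === false check.
    $html = @file_get_contents($source, false, $context);

    return $html === false ? null : $html;
}

// Hypothetical helper: pull the <title> text out with a regex,
// which works on the raw string even when the HTML is malformed.
function extract_title(string $html): ?string
{
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

// Possible usage (not executed here):
//   $html = fetch_html('http://whatever.com/page.html');
//   if ($html !== null) { echo extract_title($html); }
```

Checking the result with a strict comparison matters because an empty page yields an empty string, which a loose check would confuse with failure.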

#2


4  

With curl you would want to make sure that you're setting the CURLOPT_RETURNTRANSFER parameter to ensure that the page is retrieved as a string, e.g.:

    //return the transfer as a string 
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 

See http://www.php.net/manual/en/function.curl-setopt.php
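
For context, here is a minimal sketch of a complete request built around that option, assuming the curl extension is installed. The function name fetch_with_curl, the User-Agent string, and the timeout are illustrative choices, not something the answer specifies.

```php
<?php
// Hypothetical helper: fetch a URL with cURL and return the body as a string,
// or null if the transfer failed.
function fetch_with_curl(string $url): ?string
{
    $ch = curl_init($url);

    curl_setopt_array($ch, [
        // Return the transfer as a string instead of printing it.
        CURLOPT_RETURNTRANSFER => true,
        // Follow HTTP redirects (an assumption; drop it if not wanted).
        CURLOPT_FOLLOWLOCATION => true,
        // Browser-like User-Agent and timeout are assumed values.
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)',
        CURLOPT_TIMEOUT        => 10,
    ]);

    // With CURLOPT_RETURNTRANSFER set, curl_exec returns the body or false.
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}

// Possible usage (not executed here):
//   $html = fetch_with_curl('http://whatever.com/page.html');
```

Without CURLOPT_RETURNTRANSFER, curl_exec prints the response and returns true, so there would be no string to run a regular expression against.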

#3


0  

I used cURL to get the file into a string (simple_html_dom::load_file just wraps file_get_contents), then used the simple_html_dom load (from string) method to parse it. That works for some URLs, but it fails in this case, when the URL has a parameter string. It fetches the URL as if it had no parameter string. I set a user agent with curl to impersonate a browser, but no dice.

Sorry, this is not really an answer, but maybe using curl will work for some people for whom the allow_url_fopen setting is a problem.
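
The cause of the parameter-string failure is not stated here. One thing that may be worth ruling out is unencoded characters in the query string, and the sketch below builds the URL with http_build_query for that purpose. This is a guess at a possible cause, not a diagnosis; the helper name build_url and the sample parameters are made up for illustration.

```php
<?php
// Hypothetical helper: append URL-encoded query parameters to a base URL.
function build_url(string $base, array $params): string
{
    if ($params === []) {
        return $base;
    }

    // RFC 3986 encoding turns spaces into %20; '&' is passed explicitly
    // so the result does not depend on the arg_separator.output ini setting.
    $query = http_build_query($params, '', '&', PHP_QUERY_RFC3986);

    // Use '&' if the base URL already carries a query string.
    $glue = strpos($base, '?') === false ? '?' : '&';

    return $base . $glue . $query;
}

// Possible usage (not executed here), with made-up parameters:
//   $url = build_url('http://whatever.com/page.php', ['q' => 'a b', 'id' => 5]);
```

The resulting string can then be passed to cURL or file_get_contents unchanged.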
