How can I get page's <title> tag's content if it can't be parsed as XML?

Time: 2021-12-29 01:32:05

I'm using PHP libcurl to load a page. Now I need to get the content of this page's <title> tag, along with some other information. I tried parsing it with SimpleXML, but with no luck, because the page isn't valid XML. Can you suggest another easy way to get the contents of the <title> tag? Thank you.

4 solutions

#1


3  

You can use DOMDocument::loadHTML.

This will echo "The title":

<?php

$doc = <<<HTML
<html>
<head>
<title>The title</title>
<body>
hhhhhh
HTML;

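// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$d = new DOMDocument();
$d->loadHTML($doc);
// loadHTML tolerates invalid markup, so the <title> element is still found.
$ts = $d->getElementsByTagName("title");
if ($ts->length > 0) {
    echo $ts->item(0)->textContent;
}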
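
As a rough sketch of how this could be combined with the libcurl fetch from the question, and extended to "some other information" such as a meta description, something like the following might work. The helper names fetch_html() and extract_page_info() are hypothetical, the sample markup is made up for illustration, and the choice of the meta description as the "other information" is an assumption.

```php
<?php
// Sketch only: fetch_html() and extract_page_info() are hypothetical helpers,
// not part of PHP or of the answer above.

/** Fetch a URL with libcurl; returns the body, or null on failure. */
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

/** Extract the <title> text and the meta description from possibly invalid HTML. */
function extract_page_info(string $html): array
{
    $info = ['title' => null, 'description' => null];
    if (trim($html) === '') {
        return $info; // nothing to parse
    }

    // Collect parse warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $title = $xpath->query('//title')->item(0);
    $description = $xpath->query('//meta[@name="description"]/@content')->item(0);

    if ($title !== null) {
        $info['title'] = trim($title->textContent);
    }
    if ($description !== null) {
        $info['description'] = trim($description->nodeValue);
    }
    return $info;
}

// Made-up sample markup with unclosed tags, standing in for a fetched page.
$sample = '<html><head><title> The title </title>'
    . '<meta name="description" content="A demo page"><body><p>unclosed';
$info = extract_page_info($sample);
echo $info['title'], "\n"; // prints "The title"
```

With a real page, the call would presumably look like extract_page_info(fetch_html($url) ?? ''). Note that loadHTML may mis-decode non-ASCII titles when the page does not declare its character set, so the encoding may need separate handling.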

#2


1  

Or you can use Simple HTML DOM

#3


0  

You can use this script to get the title of a page.

# Script Title.txt
var str page, content
cat $page > $content
stex -r -c "^<title&</title&\>^" $content

Save this short script as C:/Scripts/Title.txt. The code is written in biterscripting. Start biterscripting and enter this command:

script "C:/Scripts/Title.txt" page("http://*.com/questions/3135488/how-can-i-get-pages-title-tags-content-if-it-cant-be-parsed-as-xml")

This retrieves the title of this page (the one you are viewing). You can pass any other URL or local file path, in double quotes, as the value of page(). When I ran this command, I got:

How can I get page's <title> tag's content if it can't be parsed as XML? - Stack Overflow

You can call this script from any executable or batch file.

#4


0  

Try using Yahoo's YQL console. You can query almost any URL and ask for the results back in XML. You can even add an XPath expression to narrow them down.

http://developer.yahoo.com/yql/console/

Maybe you can call this service using curl. It's pretty handy.
