How do I extract links and titles from an .html page?

Time: 2021-10-19 08:11:06

For my website, I'd like to add a new feature.

I would like users to be able to upload their bookmarks backup file (from any browser, if possible) so I can import it into their profile and they don't have to enter every bookmark manually.

The only part I'm missing is extracting the title and URL from the uploaded file. Can anyone give me a clue where to start or what to read?

I used the search option, and "how to extract data from a raw html file" is the question most closely related to mine, but it doesn't cover this.

I really don't mind whether it uses jQuery or PHP.

Thank you very much.

5 answers

#1


45  

Thank you everyone, I got it!

The final code: this shows the anchor text and the href for every link in an .html file.

$html = file_get_contents('bookmarks.html');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ suppresses the warnings that loadHTML()
//emits when the $html string isn't well-formed HTML.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display their text and URLs
foreach ($links as $link){
    //Show the anchor text, then the "href" attribute.
    echo $link->nodeValue, ': ';
    echo $link->getAttribute('href'), '<br>';
}

Again, thanks a lot.
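Since the goal in the question is to store the bookmarks in a profile rather than print them, the same loop can collect the pairs into an array. This is only a minimal sketch: the function name `extractBookmarks`, the array keys, and the sample markup (loosely modelled on a browser bookmark export) are assumptions for illustration, and `libxml_use_internal_errors()` is swapped in for the `@` operator used above.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the links
// as data instead of echoing them.
function extractBookmarks(string $html): array
{
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bookmarks = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $url = trim($link->getAttribute('href'));
        if ($url === '') {
            continue; // skip anchors that have no href
        }
        $bookmarks[] = [
            'title' => trim($link->textContent),
            'url'   => $url,
        ];
    }
    return $bookmarks;
}

// Sample input, assumed for illustration only.
$sample = '<DL><p>
<DT><A HREF="https://www.php.net/" ADD_DATE="1">PHP</A>
<DT><A HREF="https://example.com/docs">Example docs</A>
<DT><A NAME="no-href">Ignored</A>
</DL>';

$bookmarks = extractBookmarks($sample);
print_r($bookmarks);
```

Each entry of `$bookmarks` can then be inserted into whatever storage the profile uses.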

#2


31  

This is probably sufficient:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node)
{
  echo $node->nodeValue.': '.$node->getAttribute("href")."\n";
}
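If anchors without an href should be left out, one possible variation is to select the nodes with XPath instead of `getElementsByTagName`. A small sketch, assuming a made-up `$html` string; the `//a[@href]` expression matches only `a` elements that carry an `href` attribute:

```php
<?php
// Sample markup, assumed for illustration only.
$html = '<p><a href="https://example.com/">Example</a> <a name="top">Top</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($html);

// Query only the anchors that actually have an href attribute.
$xpath = new DOMXPath($dom);
$lines = [];
foreach ($xpath->query('//a[@href]') as $node) {
    $lines[] = $node->nodeValue . ': ' . $node->getAttribute('href');
}

echo implode("\n", $lines), "\n";
```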

#3


5  

Assuming the stored links are in an HTML file, the best solution is probably to use an HTML parser such as PHP Simple HTML DOM Parser (I have never tried it myself). The other option is to search with basic string functions or a regexp, but you should probably never use a regexp to parse HTML.

After reading the HTML file with the parser, use its functions to find the a tags:

From the tutorial:

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>'; 

#4


3  

This is an example. In your case, you can load the file like this:

$content = file_get_contents('bookmarks.html');

Run this:

<?php

$content = '<html>

<title>Random Website I am Crawling</title>

<body>

Click <a href="http://clicklink.com">here</a> for foobar

Another site is http://foobar.com

</body>

</html>';

$regex = "((https?|ftp)\:\/\/)?"; // SCHEME
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$regex .= "(\:[0-9]{2,5})?"; // Port
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor


$matches = array(); //create array
$pattern = "/$regex/";

preg_match_all($pattern, $content, $matches); 

print_r(array_values(array_unique($matches[0])));
echo "<br><br>";
echo implode("<br>", array_values(array_unique($matches[0])));

Output:

Array
(
    [0] => http://clicklink.com
    [1] => http://foobar.com
)

http://clicklink.com
http://foobar.com
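Note that the regex returns every URL-like string, including the bare `http://foobar.com` that is not a link, and it returns no titles. For comparison, here is a rough sketch that runs the same sample content through `DOMDocument`; the `$pairs` array name is an assumption for illustration:

```php
<?php
// Same sample content as in the regex example, condensed.
$content = '<html><title>Random Website I am Crawling</title><body>
Click <a href="http://clicklink.com">here</a> for foobar
Another site is http://foobar.com
</body></html>';

$dom = new DOMDocument;
// @ silences warnings the parser may emit for loosely structured markup.
@$dom->loadHTML($content);

// Map each href to its anchor text; plain-text URLs are not picked up.
$pairs = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $pairs[$a->getAttribute('href')] = $a->nodeValue;
}

print_r($pairs);
```

Only the real link is found here, together with its anchor text, which is what a bookmark import needs.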

#5


1  

$html = file_get_contents('your file path');

$dom = new DOMDocument;

@$dom->loadHTML($html);

$styles = $dom->getElementsByTagName('link');

$links = $dom->getElementsByTagName('a');

$scripts = $dom->getElementsByTagName('script');

foreach ($styles as $style) {
    if ($style->getAttribute('href') != "#") {
        echo $style->getAttribute('href');
        echo '<br>';
    }
}

foreach ($links as $link) {
    if ($link->getAttribute('href') != "#") {
        echo $link->getAttribute('href');
        echo '<br>';
    }
}

foreach ($scripts as $script) {
    echo $script->getAttribute('src');
    echo '<br>';
}
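The `!= "#"` checks above skip only one kind of non-link. If the aim is to keep real web addresses, a possible refinement is to test the URL scheme with `parse_url()`. This is a sketch under assumptions: the helper name `isWebUrl`, the allowed scheme list, and the sample values are all made up for illustration.

```php
<?php
// Hypothetical helper: true only for absolute http, https or ftp URLs.
function isWebUrl(string $url): bool
{
    // parse_url() returns null when the component is absent, false on failure.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https', 'ftp'], true);
}

// Sample href values, assumed for illustration only.
$candidates = [
    'http://clicklink.com',
    'http://foobar.com',
    'javascript:void(0)',
    '#',
    'mailto:someone@example.com',
];

// Keep the web URLs and reindex the result.
$webUrls = array_values(array_filter($candidates, 'isWebUrl'));
print_r($webUrls);
```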
