Remove title content from page HTML

Date: 2022-12-04 23:35:11

I am creating a preview for a URL. It shows:

  1. URL title
  2. URL description (the title should not appear in it)

Here is my attempt:

<?php
function plaintext($html)
{
    $plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html);

    // remove title
    //$plaintext = preg_match('#<title>(.*?)</title>#', $html);

    // remove comments and any content found in the comment area (strip_tags only removes the actual tags).
    $plaintext = preg_replace('#<!--.*?-->#s', '', $plaintext);

    // put a space between list items (strip_tags just removes the tags).
    $plaintext = preg_replace('#</li>#', ' </li>', $plaintext);

    // remove all script and style tags
    $plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);

    // remove br tags (missed by strip_tags)
    $plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);

    // remove all remaining html
    $plaintext = strip_tags($plaintext);

    return $plaintext;
}

function get_title($html)
{
    return preg_match('!<title>(.*?)</title>!i', $html, $matches) ? $matches[1] : '';
}

function trim_display($size, $string)
{
    $trim_string = substr($string, 0, $size);

    $trim_string = $trim_string . "...";
    return $trim_string;
}

$url = "http://www.nextbigwhat.com/indian-startups/";
$data = file_get_contents($url);
//$url = trim_url(5,$url);
$title = get_title($data);
echo "title is ; $title";
$content = plaintext($data);
$Preview = trim_display(100, $content);
echo '<br/>';
echo "preview is: $Preview";

?>

The URL title appears correctly. But even though I have excluded the title content from the description, it still appears there.

I used $plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html); to exclude the title from the plain text.

The regex looks correct to me, yet it does not exclude the title content.

What is the problem here?

The output we get is:

title is ; Indian Startups Archives - NextBigWhat.com
preview is: Indian Startups Archives : NextBigWhat.com [whatever rest text]...

The text that appears in the title part should not appear again in the preview. That is why I want to exclude it and display only the remaining text in the preview.

1 Solution

#1

How to solve the mystery

If you look closely at the title and the preview, they differ: the title contains " - " while the preview contains " : ". Let's look at the output for the fetched page.

echo plaintext($data);

Well, it seems the page has two titles:

<title>
Indian Startups Archives : NextBigWhat.com</title>

and

<title>Indian Startups Archives - NextBigWhat.com</title>

So the get_title function retrieves the second title, and plaintext leaves the first one alone. What is the difference between them? The line break! By default "." does not match newline characters, so your regex does not match a title that spans more than one line. That is exactly what the s modifier (PCRE_DOTALL) in regular expressions is for.
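To make the effect of the newline concrete, here is a minimal sketch. The $html string is a hypothetical sample built from the two titles shown above, not the real page; the pattern is the one from the question.

```php
<?php
// Hypothetical sample mirroring the two <title> elements shown above.
$html = "<title>\nIndian Startups Archives : NextBigWhat.com</title>\n"
      . "<title>Indian Startups Archives - NextBigWhat.com</title>";

// Pattern from the question, without any modifier.
$pattern = '#([<]title)(.*)([<]/title[>])#';

// Without s, "." stops at the newline, so only the single-line title matches.
$withoutS = preg_match_all($pattern, $html, $m1);

// With s, "." also matches newlines, so the match can start at the first title.
$withS = preg_match($pattern . 's', $html, $m2);

var_dump($withoutS);                              // int(1)
var_dump(strpos($m1[0][0], ' - ') !== false);     // bool(true): the single-line title
var_dump(strpos($m2[0], ' : ') !== false);        // bool(true): the multi-line title is now included
```

Only the title written on one line is matched without the modifier, which is why the multi-line one survives into the preview.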

tl;dr

Your regex is missing the 's' modifier; add it. Use

$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#s', ' ', $html);

instead of

$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html);
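One caveat, offered as a possible refinement rather than part of the original answer: once s is added, the greedy (.*) can run from the first <title to the last </title>, so on a page with two titles everything between them would be removed as well. A lazy quantifier avoids that. The helper name strip_titles and the sample $html below are hypothetical, for illustration only.

```php
<?php
// Hypothetical helper (not from the original post): removes each <title> element separately.
function strip_titles($html)
{
    // s: "." matches newlines; i: case-insensitive; ".*?" is lazy, so each
    // match ends at the nearest </title> instead of the last one in the page.
    return preg_replace('#<title\b[^>]*>.*?</title>#si', ' ', $html);
}

// Sample input assumed for illustration, based on the two titles shown above.
$html = "<title>\nIndian Startups Archives : NextBigWhat.com</title>"
      . "<p>Body text</p>"
      . "<title>Indian Startups Archives - NextBigWhat.com</title>";

// Greedy pattern with s, as in the one-line fix: swallows the body text too.
$greedy = preg_replace('#([<]title)(.*)([<]/title[>])#s', ' ', $html);

echo trim(strip_tags(strip_titles($html))); // Body text
```

If the page is known to contain a single title, the one-character fix above is enough; the lazy form only matters when several title elements can occur.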
