PHP: find HTML comments in a string and wrap them in a tag. Regex or DOM?

Date: 2022-10-27 10:55:15

I would like to find comment tags in a string that are NOT already inside a <pre> tag, and wrap them in a <pre> tag.

It seems like there's no way of 'finding' comments using the PHP DOM.

I'm using regex to do some of the processing already, however I am very unfamiliar with (have yet to grasp or truly understand) look aheads and look behinds in regex.

For instance, I may have the following code:

<!-- Comment 1 -->

<pre>
    <div class="some_html"></div>
    <!-- Comment 2 -->
</pre>

I would like to wrap Comment 1 in <pre> tags, but obviously not Comment 2 as it already resides in a <pre>.

How would this usually be done in RegEx?

Here is what I have understood about negative lookarounds so far, and my attempt at one. I'm clearly doing something very wrong!

(?<!<pre>.*?)<!--.*-->(?!.*?</pre>)

4 Answers

#1


2  

You should really use a DOM parser if you are planning on re-using this code. Every regex approach will fail horribly sooner rather than later when presented with real-world HTML.

Having said that, here's what you could (but should not, see above) do:

First, identify comments, e.g. using

<!-- (?:(?!-->).)*-->

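The negative lookahead in front of the repeated . ensures that the match does not run past the end of the comment: the dot may only consume a character if --> does not start at that position.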
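
As a rough illustration of the comment pattern on its own, the sketch below lists every comment in the question's sample string. The `~` delimiters, the `s` modifier (so that `.` also matches newlines) and the variable names are assumptions of this sketch, not part of the original answer.

```php
<?php
// Sample input taken from the question.
$html = "<!-- Comment 1 -->\n\n<pre>\n    <div class=\"some_html\"></div>\n    <!-- Comment 2 -->\n</pre>";

// Comment pattern from the answer. The "s" modifier is an addition so that
// comments spanning several lines are matched as well.
$commentPattern = '~<!-- (?:(?!-->).)*-->~s';

// $found holds the number of matches, $matches[0] the matched comments.
$found = preg_match_all($commentPattern, $html, $matches);

print_r($matches[0]);
```

Note that the pattern expects a space after `<!--`, exactly as written in the answer, so a comment such as `<!--x-->` would not be found.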

Now, you need to figure out whether this comment is inside a <pre> block. The key observation is that every comment that is NOT already inside a <pre> block is followed by an even number of <pre> and </pre> tags in total, provided the <pre> blocks are balanced.

So, scan the rest of the text, consuming <pre> and </pre> tags two at a time, and check whether you arrive at the end of the string.

This would look like

(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)

So, together this would be

<!-- (?:(?!-->).)*-->(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)

A hurray for write-only code =)

The main building block of this expression is (?:(?!</?pre>).) which matches any single character that is not the opening bracket of a <pre> or </pre> tag.
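
A minimal sketch of how the combined expression might be applied from PHP, assuming the input is the sample from the question. The `~` delimiters, the `s` modifier and the `<pre>$0</pre>` replacement string are choices made for this sketch rather than something the answer prescribes.

```php
<?php
// Sample input taken from the question.
$html = "<!-- Comment 1 -->\n\n<pre>\n    <div class=\"some_html\"></div>\n    <!-- Comment 2 -->\n</pre>";

// Comment pattern followed by the "even number of pre tags until the end"
// lookahead. The "s" modifier lets "." cross line breaks.
$pattern = '~<!-- (?:(?!-->).)*-->'
         . '(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)~s';

// $0 is the whole matched comment, so each match is wrapped in <pre>...</pre>.
$wrapped = preg_replace($pattern, '<pre>$0</pre>', $html);

echo $wrapped;
```

On long documents the nested lookaheads can get expensive, which is one more reason to prefer a parser for anything beyond a one-off.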

Allowing attributes on the <pre> and proper escaping are left as an exercise for the reader. See this in action at RegExr.

#2


1  

It seems like there's no way of 'finding' comments using the PHP DOM.

Of course you can... Check this code using PHP Simple HTML DOM Parser:

<?php
// Assumes simple_html_dom.php (PHP Simple HTML DOM Parser) sits next to this script.
require 'simple_html_dom.php';

$text = '<!-- Comment 1 -->

        <pre>
            <div class="some_html"></div>
            <!-- Comment 2 -->
        </pre>';

echo "<div>Original Text: <xmp>$text</xmp></div>";

$html = str_get_html($text);

// The parser exposes comment nodes under the pseudo-tag "comment".
$comments = $html->find('comment');

// find() returns an empty array when nothing matches.
if ($comments) {
    echo '<br>find() returned ' . count($comments) . ' results: ';

    foreach ($comments as $key => $com) {
        echo '<br>' . $key . ': ' . $com->tag . ' which contains = <xmp>' . $com->innertext . '</xmp>';
    }
} else {
    echo 'find() failed!';
}
?>

$com->innertext will give you the comments like <!-- Comment 1 -->...

Now you just have to clean them up as you wish, for example using <!--\s*(.*?)\s*--> (the lazy quantifier keeps trailing whitespace out of the capture group).

Edit:

A note concerning the lookbehind: it MUST have a fixed width, so you cannot use repetition (*, +) or optional items (?) inside it.

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

Source: http://www.regular-expressions.info/lookaround.html
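
The restriction can be seen directly from PHP. The sketch below is only an illustration: the pattern in the first call is the lookbehind half of the question's attempt, and the variable names are made up for this example.

```php
<?php
// Unbounded lookbehind, as in the question's attempt. PCRE refuses to compile
// it, so preg_match() raises a warning (silenced here with @) and returns false.
$variable = @preg_match('~(?<!<pre>.*?)<!--.*-->~s', '<!-- Comment 1 -->');

// A fixed-width lookbehind compiles, but it only inspects the five characters
// immediately before the comment, not everything back to an earlier <pre>.
$fixed = '~(?<!<pre>)<!--.*?-->~s';
$insideDiv = preg_match($fixed, '<div><!-- a --></div>');
$insidePre = preg_match($fixed, '<pre><!-- a --></pre>');

var_dump($variable, $insideDiv, $insidePre);
```

This is why the lookbehind cannot express "no <pre> anywhere before this point", and why the other answers either look ahead, count tags, or use a parser.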

#3


0  

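XPath is your friend:

// $html is assumed to hold the markup to process.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select every comment node that has no <pre> ancestor and wrap it.
foreach ($xpath->query('//comment()[not(ancestor::pre)]') as $comment) {
    $pre = $doc->createElement('pre');
    $comment->parentNode->insertBefore($pre, $comment);
    $pre->appendChild($comment);
}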
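
For completeness, here is a self-contained sketch of the same XPath approach run against the sample from the question. Wrapping the fragment in a root `<div>` and passing `LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD` are assumptions of this sketch, made so that the parser does not add `<html>` and `<body>` elements around the fragment.

```php
<?php
// Sample input taken from the question.
$html = "<!-- Comment 1 -->\n\n<pre>\n    <div class=\"some_html\"></div>\n    <!-- Comment 2 -->\n</pre>";

$doc = new DOMDocument();
// A single root <div> keeps the fragment in one subtree (sketch assumption).
$doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($doc);
$outsidePre = '//comment()[not(ancestor::pre)]';

// Wrap every comment that has no <pre> ancestor in a new <pre> element.
foreach ($xpath->query($outsidePre) as $comment) {
    $pre = $doc->createElement('pre');
    $comment->parentNode->insertBefore($pre, $comment);
    $pre->appendChild($comment); // moves the comment into the new <pre>
}

// After wrapping, no comment should be left outside a <pre>.
$remaining = $xpath->query($outsidePre)->length;
$firstPre  = $doc->getElementsByTagName('pre')->item(0);

echo $doc->saveHTML();
```

Unlike the regex variants, this keeps working when `<pre>` carries attributes, since the check is made on the parsed tree rather than on the raw text.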

#4


0  

It's quite easy, using a principle called the stack counter.
Essentially, you count the number of <pre> tags and the number of </pre> tags that occur before the position of your segment in the HTML code.
If there are more <pre> than </pre> tags, you are in the situation "<pre>..--you are here--..</pre>".
In that case, simply return the match unmodified; otherwise, wrap it. Simple as that.
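
The counting idea could be sketched as follows. The function name `wrapCommentsOutsidePre` and the helper patterns are hypothetical, and the `$flags` argument of `preg_replace_callback()` used here requires PHP 7.4 or later.

```php
<?php
// Hypothetical helper: wraps every HTML comment that is not inside <pre>.
function wrapCommentsOutsidePre(string $html): string
{
    return preg_replace_callback(
        '~<!--.*?-->~s',
        function (array $m) use ($html): string {
            // With PREG_OFFSET_CAPTURE, $m[0] is [matched text, byte offset].
            [$comment, $offset] = $m[0];
            $before = substr($html, 0, $offset);

            // Count opening and closing pre tags before the comment.
            $opened = preg_match_all('~<pre\b[^>]*>~i', $before);
            $closed = preg_match_all('~</pre\s*>~i', $before);

            // More openings than closings means we are inside a <pre>.
            return $opened > $closed ? $comment : '<pre>' . $comment . '</pre>';
        },
        $html,
        -1,
        $count,
        PREG_OFFSET_CAPTURE
    );
}

// Sample input taken from the question.
$html = "<!-- Comment 1 -->\n\n<pre>\n    <div class=\"some_html\"></div>\n    <!-- Comment 2 -->\n</pre>";
$result = wrapCommentsOutsidePre($html);

echo $result;
```

The offsets passed to the callback refer to the original string, so the counts are not disturbed by wrappers that were inserted for earlier comments.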
