preg match text between tags, excluding the same tag in between

Time: 2022-10-27 12:47:39

I know there are several similar questions, but I could not find any that covers this specific case.

I took a piece of code and tweaked it to my needs, but now I have found a bug in it that I can't fix.

Code:

class Tags {
    // Adapted from http://*.com/questions/3404433/get-content-within-a-html-tag-using-7-processing
    // Bug case:    string(56) "<namespaces>
    //      <namespace key="-2">Media</namespace>"
    static function get($xml, $tag) {
        // Opening tag: the tag name followed by any characters up to the next ">"
        $tag_ini = "<{$tag}[^\>]*?>";
        $tag_end = "<\\/{$tag}>";
        $tag_regex = '/' . $tag_ini . '(.*?)' . $tag_end . '/si';

        preg_match_all($tag_regex, $xml, $matches, PREG_OFFSET_CAPTURE);
        return $matches;
    }
}

$tag = 'namespace';
$match = Tags::get($f, $tag); // $f holds the XML text
var_dump($match);

As you can see, there is a bug if the tag is nested:

<namespaces> <namespace key="-2">Media</namespace>

When it should return 'Media', or even the outer '<namespaces>' and then the inside ones.
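
The behaviour described above can be reproduced with a short sketch. The $xml string here is an assumed sample built from the bug case in the code comment, not the full file, and the pattern is the one the get() function builds for $tag = 'namespace'.

```php
<?php
// Assumed sample input, modelled on the bug case quoted in the question.
$xml = "<namespaces>\n      <namespace key=\"-2\">Media</namespace>\n</namespaces>";
$tag = 'namespace';

// Same pattern as in the question: nothing stops "<namespace" from
// matching the first nine letters of "<namespaces>".
$regex = "/<{$tag}[^\>]*?>(.*?)<\\/{$tag}>/si";

preg_match_all($regex, $xml, $matches);

// The capture starts right after "<namespaces>" and therefore contains
// the inner opening tag as well as the wanted text.
var_dump($matches[1][0]);
```

The single capture begins after the outer <namespaces> tag, so it holds the line break, the inner opening tag, and "Media" together instead of "Media" alone.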

I tried using "<{$tag}[^\>|^\r\n ]*?>", adding ^\s+, changing the * to *?, and a few other things that, at best, ended up matching only the buggy case.

I also tried "<{$tag}[^{$tag}]*?>", which returns nothing; I suppose it cancels itself out.

I'm a newbie at regex. I can tell that the fix just needs to stop a new tag of the same type from opening inside the match. I could even use a hack for my use case that excludes matches whose inner text contains a line break.

Can anyone get the right syntax for this?

You can check an extract of the text here: http://pastebin.com/f2naN2S3

After the proposed change, $tag_ini = "<{$tag}\\b[^>]*>"; $tag_end = "<\\/{$tag}>";, it does work for the example case, but not for this one:

<namespace key="0" />
      <namespace key="1">Talk</namespace>

As it results in:

<namespace key="1">Talk"
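
This result can be reproduced with a short sketch, assuming the input is just the two lines quoted above. The comment describes what the pattern does with the self-closing tag.

```php
<?php
// Assumed input: the two lines quoted above.
$xml = "<namespace key=\"0\" />\n      <namespace key=\"1\">Talk</namespace>";
$tag = 'namespace';
$regex = "/<{$tag}\\b[^>]*>(.*?)<\\/{$tag}>/si";

preg_match_all($regex, $xml, $matches);

// [^>]* also consumes the "/" of the self-closing tag, so that tag is
// accepted as an opening tag and (.*?) runs on to the next closing tag.
var_dump($matches[1][0]);
```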

I think it's because digits, the " character, and letters are treated as being inside the word boundary. How could I address that?

3 answers

#1


1  

The main problem is that you did not use a word boundary after the tag name in the opening tag, so namespace in the pattern could also match a namespaces tag, and many others.

The subsequent issue is that the <{$tag}\b[^>]*>(.*?)<\/{$tag}> pattern would overmatch if there is a self-closing namespace tag followed by a "normal" paired open/close namespace tag. So, you need to either use a negative lookbehind (?<!\/) before the >, or use a (?![^>]*\/>) negative lookahead after \b.

So, you can use

$tag_ini = "<{$tag}\\b[^>]*(?<!\\/)>"; $tag_end = "<\\/{$tag}>";
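
As an illustration, both variants described above can be run side by side. This is a minimal sketch: the $xml sample is an assumption that combines the two problem cases from the question, and the variable names are only for illustration.

```php
<?php
// Assumed sample combining the similarly named outer tag and the self-closing tag.
$xml = <<<XML
<namespaces>
      <namespace key="-2">Media</namespace>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
</namespaces>
XML;
$tag = 'namespace';

// Variant 1: negative lookbehind, the closing ">" must not be preceded by "/".
$lookbehind = "/<{$tag}\\b[^>]*(?<!\\/)>(.*?)<\\/{$tag}>/si";
// Variant 2: negative lookahead, reject the tag up front if it ends in "/>".
$lookahead = "/<{$tag}\\b(?![^>]*\\/>)[^>]*>(.*?)<\\/{$tag}>/si";

$results = [];
foreach (['lookbehind' => $lookbehind, 'lookahead' => $lookahead] as $name => $regex) {
    preg_match_all($regex, $xml, $matches);
    $results[$name] = $matches[1]; // captured inner text of each paired tag
}
print_r($results);
```

On this sample both variants capture "Media" and "Talk" and skip the self-closing tag. Both still assume that attribute values never contain a literal ">".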

#2


1  

This is probably not the ideal answer, but I was messing with a regex generator:

<?php
# URL that generated this code:
# http://txt2re.com/index-php.php3?s=%3Cnamespace%3E%3Cnamespace%20key=%22-2%22%3EMedia%3C/namespace%3E&12&11

$txt='arstarstarstarstarstarst<namespace key="-2">Media</namespace>arstarstarstarstarst';

$re1='.*?'; # Non-greedy match on filler
$re2='(?:[a-z][a-z]+)'; # Uninteresting: word
$re3='.*?'; # Non-greedy match on filler
$re4='(?:[a-z][a-z]+)'; # Uninteresting: word
$re5='.*?'; # Non-greedy match on filler
$re6='(?:[a-z][a-z]+)'; # Uninteresting: word
$re7='.*?'; # Non-greedy match on filler
$re8='((?:[a-z][a-z]+))';   # Word 1

if ($c=preg_match_all ("/".$re1.$re2.$re3.$re4.$re5.$re6.$re7.$re8."/is", $txt, $matches))
{
    $word1=$matches[1][0];
    print "($word1) \n";
}

#-----
# Paste the code into a new php file. Then in Unix:
# $ php x.php
#-----
?>

#3


0  

This line is what I needed

   $tag_ini = "<{$tag}\\b[^>|^\\/>]*>"; $tag_end = "<\\/{$tag}>";
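
One caveat about the line above: inside square brackets the | and the second ^ are literal characters, so the class simply excludes >, |, ^ and /. The sketch below suggests what that means in practice; the second input, with a slash inside an attribute value, is a hypothetical case that does not appear in the question.

```php
<?php
$tag = 'namespace';
// Pattern built from the line above.
$regex = "/<{$tag}\\b[^>|^\\/>]*>(.*?)<\\/{$tag}>/si";

// The self-closing case from the question.
$ok = "<namespace key=\"0\" />\n<namespace key=\"1\">Talk</namespace>";
// Hypothetical input: an attribute value that contains "/".
$slash = '<namespace href="a/b">Docs</namespace>';

preg_match_all($regex, $ok, $m1);
preg_match_all($regex, $slash, $m2);

var_dump($m1[1]); // the self-closing tag is skipped, "Talk" is captured
var_dump($m2[1]); // nothing is captured: the "/" in the attribute blocks the match
```

For input like the question's, where attributes hold only keys, this makes no difference, but the lookbehind form would also accept the second input.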

Thank you very much, @Alison and @Wictor, for your help and directions.
