Matching double hyphens in comments of malformed XML

Time: 2022-11-22 21:14:22

I need to parse XML files that violate the "no double hyphens in comments" rule of the XML standard, which makes MSXML complain. I am looking for a way to delete the offending hyphens.

I am using StringRegExpReplace(). I tried the following regular expressions:

<!--(.*)--> : correctly gets comments
<!--(-*)--> : fails to be a correct regex (also tried escaping and using \x2D)

Given the right pattern, I would call:

StringRegExpReplace($xml_string,$correct_pattern,"") ;replace with nothing

How can I match the extra hyphens within an XML comment while leaving the rest of the text alone?

2 answers

#1


4  

You can use this pattern:

(?|\G(?!\A)(?|-{2,}+([^->][^-]*)|(-[^-]+)|-+(?=-->)|-->[^<]*(*SKIP)(*FAIL))|[^<]*<+(?>[^<]+<+)*?(?:!--\K|[^<]*\z\K(*ACCEPT))(?|-*+([^->][^-]*)|-+(?=-->)|-?+([^-]+)|-->[^<]*(*SKIP)(*FAIL)()))

details:

(?| 
    \G(?!\A) # contiguous to the previous match (inside a comment)

    (?|
        -{2,}+([^->][^-]*) # duplicate hyphens, not part of the closing sequence
      |
         (-[^-]+)          # preserve isolated hyphens 
      |
         -+ (?=-->)        # hyphens before closing sequence, break contiguity
      |
         -->[^<]*          # closing sequence, go to next <
         (*SKIP)(*FAIL)    # break contiguity
    )
  |
    [^<]*<+ # reach the next < (outside comment)
    (?> [^<]+ <+ )*?       # next < until !-- or the end of the string 
    (?: !-- \K | [^<]*\z\K (*ACCEPT) ) # new comment or end of the string
    (?|
        -*+ ([^->][^-]*)   # possible hyphens not followed by >
      |
        -+ (?=-->)         # hyphens before closing sequence, break contiguity
      |
        -?+ ([^-]+)        # one hyphen followed by >
      |
        -->[^<]*           # closing sequence, go to next <
        (*SKIP)(*FAIL) ()  # break contiguity (note: "()" avoids a mysterious bug
    )                      # in regex101, you can remove it)
)

With this replacement: \1

The \G anchor ensures that matches are contiguous. Contiguity is broken in two ways:

  • a lookahead (?=-->)

  • the backtracking control verbs (*SKIP)(*FAIL), which force the pattern to fail and prevent the characters matched so far from being retried.

So when contiguity is broken, or at the beginning of the string, the first main branch fails (because of the \G anchor) and the second branch is used.

\K discards everything matched to its left from the match result.

(*ACCEPT) makes the pattern succeed unconditionally.

This pattern makes heavy use of the branch reset feature (?|...(..)...|...(..)...|...), so all capturing groups share the same number (in other words, there is only one group, group 1).

Note: even though this pattern is long, it needs few steps to obtain a match. The impact of non-greedy quantifiers is reduced as much as possible, and the alternatives are sorted so that each is as efficient as possible. One of the goals is to reduce the total number of matches needed to process a string.

#2


3  

(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)

matches -- (or ----, etc.) as long as a closing --> follows, so it picks up the extra hyphens between <!-- and --> without touching the delimiters themselves. You need to set the /s option (written (?s) inline) so that the dot also matches newlines.

Explanation:

(?<!<!)   # Assert that we're not right at the start of a comment
--+       # Match two or more dashes
(?!-?>)   # Make sure we're not at the end of the comment
(?=       # and only match if the following can be matched further onwards:
 (?:      # Match the following group
  (?!-->) # which must not contain -->
  .       # but may contain any character
 )*       # any number of times
 -->      # as long as --> follows.
)         # End of lookahead assertion.
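
To see which hyphen runs these assertions select, the pattern can be run over a few small samples. The sketch below uses Python's re module rather than AutoIt, only because every construct in this particular pattern is also supported there. The sample strings and the helper name hyphen_runs are made up for illustration.

```python
import re

# Pattern from the answer; re.DOTALL plays the role of the /s option.
PATTERN = re.compile(r"(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", re.DOTALL)


def hyphen_runs(text):
    """Return (offset, matched text) for every hyphen run the pattern selects."""
    return [(m.start(), m.group()) for m in PATTERN.finditer(text)]


# Runs inside the comment are selected; the "<!--" opener and the
# final "-->" are not. Of the four hyphens before ">", only the first
# two are reported, so the closing sequence survives a replacement.
print(hyphen_runs("<a><!-- x -- y ----></a>"))

# No "-->" anywhere ahead, so nothing is selected.
print(hyphen_runs('<a t="1--2"/>'))

# The lookahead only checks that some "-->" lies ahead, so a run
# outside a comment is selected too when a comment follows later.
print(hyphen_runs('<a t="1--2"/><!-- c -->'))
```

The last case suggests checking the input for double hyphens in attribute values or text before a comment, since the pattern would remove those as well.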

Test it live on regex101.com.

I suppose the correct AutoIt syntax would be

StringRegExpReplace($xml_string, "(?s)(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)", "")
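
As a rough check of what that call should produce, here is the same replacement sketched in Python (not AutoIt), assuming Python's re module treats this pattern the same way as AutoIt's PCRE engine. The function name strip_comment_hyphens and the sample document are hypothetical.

```python
import re

# Pattern from the answer, without the inline (?s) option.
PATTERN = r"(?<!<!)--+(?!-?>)(?=(?:(?!-->).)*-->)"


def strip_comment_hyphens(xml_text):
    """Delete the hyphen runs the pattern selects (empty replacement)."""
    # "(?s)" is prepended as in the AutoIt call, so "." crosses line breaks.
    return re.sub("(?s)" + PATTERN, "", xml_text)


# A comment that spans two lines and contains two offending runs.
sample = "<root><!-- a -- b\n---- c --><item/></root>"

cleaned = strip_comment_hyphens(sample)
print(cleaned)

# Without (?s) the lookahead cannot reach "-->" on the next line,
# so the "--" on the first line of the comment is left in place.
partial = re.sub(PATTERN, "", sample)
print(partial)
```

Only the hyphens are removed, so the surrounding spaces remain and the cleaned comment reads "a  b" with two spaces.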
