Regular expression for nested tags (Wikimedia content)

Time: 2022-10-27 12:47:51

I haven't done regex in a while and am a bit rusty.

I'm trying to parse the categories out of a Wikipedia entry. What I need are the individual strings contained in a pattern that starts with two open brackets and ends with two closing brackets.

This query works most of the time -

(\[\[)(?<category>.*[^\]#])([\]])

but has issues when the closing brackets have a comma (',') next to them.

This has the unfortunate result that when parsing the following text -

nlocation = [[Seattle, Washington]], [[United States|USA]]|

it extracts the following for "category"

Seattle, Washington]], [[United States|USA

Clearly, the comma is throwing this off and it is finding the next set. What's the best way to capture every value between open and closed double brackets?

4 Solutions

#1


The problem is not the comma; the problem is that .* will match "]], [[" just as well as anything else. The * quantifier is greedy: it will match as much as it possibly can. To fix it, you could use the non-greedy version (as suggested by RichieHindle), or you could change .* to [^\]]*, a greedy match of anything except closing brackets. That should also do the trick.
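
To make the difference concrete, here is a minimal sketch in Python. Python is my assumption, since the question doesn't name a language, and Python spells the named group (?P<category>...) instead of (?<category>...). The variable names are only for illustration.

```python
import re

text = "nlocation = [[Seattle, Washington]], [[United States|USA]]|"

# Original pattern: the greedy .* runs past the first "]]".
greedy = re.compile(r"(\[\[)(?P<category>.*[^\]#])([\]])")

# Same pattern with .* swapped for a negated class that cannot cross a "]".
negated = re.compile(r"(\[\[)(?P<category>[^\]]*[^\]#])([\]])")

greedy_hits = [m.group("category") for m in greedy.finditer(text)]
negated_hits = [m.group("category") for m in negated.finditer(text)]

print(greedy_hits)   # ['Seattle, Washington]], [[United States|USA']
print(negated_hits)  # ['Seattle, Washington', 'United States|USA']
```

The greedy version reproduces the result from the question; the negated class stops at the first closing bracket, so each link is captured separately.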

Also, these are not "nested" tags - that would be [[tag [[inside]] tag]]. That's probably not what you want, as I don't think that means anything in Wikimedia markup.

#2


Make your wildcard non-greedy by appending a question mark:

(\[\[)(?<category>.*?[^\]#])([\]])

                    ^
                    Here is the edit

That will make it match the individual categories.
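
As a quick check, here is a rough sketch of the non-greedy pattern run against the sample text from the question. It assumes Python, where the named group is written (?P<category>...) rather than (?<category>...); the variable names are mine.

```python
import re

text = "nlocation = [[Seattle, Washington]], [[United States|USA]]|"

# .*? is lazy: it expands one character at a time and stops at the
# first position where the rest of the pattern can match.
lazy = re.compile(r"(\[\[)(?P<category>.*?[^\]#])([\]])")

matches = list(lazy.finditer(text))
lazy_hits = [m.group("category") for m in matches]
first_full_match = matches[0].group(0)

print(lazy_hits)         # ['Seattle, Washington', 'United States|USA']
print(first_full_match)  # [[Seattle, Washington]
```

Note that the last group in this pattern consumes only one closing bracket, so the overall match ends at "]" rather than "]]". The named group is not affected by that.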

#3


I think you're making this a lot more complicated than it needs to be. Does this do what you want?

\[\[(?<category>[^\[\]]+)\]\]
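
A small sketch of how this behaves, assuming Python (which writes the named group as (?P<category>...)); the sample inputs and variable names are illustrative only.

```python
import re

text = "nlocation = [[Seattle, Washington]], [[United States|USA]]|"

# Brackets are excluded from the link body, so a match cannot span two links.
simple = re.compile(r"\[\[(?P<category>[^\[\]]+)\]\]")

# With a single capturing group, findall returns that group's text.
simple_hits = simple.findall(text)
print(simple_hits)   # ['Seattle, Washington', 'United States|USA']

# Trade-off: a link whose text contains a lone bracket is skipped entirely.
bracket_hits = simple.findall("[[a]b]]")
print(bracket_hits)  # []
```

The pattern is easy to read, at the cost of rejecting any link text that contains a bracket of its own.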

#4


The comma isn't relevant at all. You could have confirmed that yourself with a simple test.

And there's no nesting involved here. Wikilinks aren't allowed to be nested anyway.

You need to ensure that your inner pattern can't match the double-bracket that closes a wikilink. That way, any time you do encounter a double-bracket sequence, it will stop accumulating more characters into the regex match. The problem in your regular expression is that .* matches everything. The easy way to fix that is to use a non-greedy modifier. That way, the match is terminated as soon as possible. If you don't want to do that or your regex library doesn't support it, though, then you need to explicitly exclude the sequence that should terminate the pattern.

A naïve approach would be to simply exclude closing brackets altogether: [^]]*. That's not good enough, though. A single closing bracket is allowed in a wikilink's text. Therefore, you need to accept a single bracket while excluding double brackets. This should do it:

\[\[       # 2 opening brackets
(?<category>
  (
    ]?     # optional bracket
    [^]]   # always a non-bracket
  )*
)
]]         # 2 closing brackets

That will accept a right bracket, but only if it's followed by a non-bracket to break the closing sequence.
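
For anyone who wants to try it, here is a hedged sketch of the same pattern in Python. Two things are my assumptions, not part of the pattern above: the named group uses Python's (?P<category>...) spelling, and the inner group is made non-capturing with (?:...) so that only the category is captured. The second sample string is made up for illustration.

```python
import re

wikilink = re.compile(
    r"""
    \[\[            # 2 opening brackets
    (?P<category>
      (?:
        ]?          # optional single closing bracket
        [^]]        # always followed by a non-bracket
      )*
    )
    ]]              # 2 closing brackets
    """,
    re.VERBOSE,     # allows the whitespace and comments inside the pattern
)

text = "nlocation = [[Seattle, Washington]], [[United States|USA]]|"
verbose_hits = [m.group("category") for m in wikilink.finditer(text)]
print(verbose_hits)  # ['Seattle, Washington', 'United States|USA']

# A lone closing bracket inside the link text is accepted.
single_hits = [
    m.group("category") for m in wikilink.finditer("see [[Foo [bar] baz]] here")
]
print(single_hits)   # ['Foo [bar] baz']
```

The pattern needs a free-spacing mode (re.VERBOSE here) because of the embedded whitespace and comments; without one, collapse it to a single line.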
