Regex to match Wikipedia internal article links

Date: 2023-01-14 23:58:28

I want to use a regex to match text in Wikipedia article source code according to the following rules:

  1. Match only links to internal articles, i.e. don't match links that carry a namespace such as files, categories, users, etc. (Wikipedia documents the complete list of namespaces)
    • Example link to match: [[Without|namespace]]
    • Example links NOT to match: [[Category:Nope]], [[File:Nopeish]], etc.

  2. Match only links that contain the delimiter "|". Links with this symbol are displayed in the article with text that differs from the title of the article they refer to
    • Example link to match: [[Something|else]]
    • Example link NOT to match: [[text]]

  3. Match links in two groups
    • Example link [[Something|else]] will be matched into two groups with text:
      1. group: "Something"
      2. group: "else"

So far I've come up with the following regex: \[\[(?!.+?:)(.+?)\|(.+?)\]\]. It does not work as expected, because it also matches text like this:

[[Problem]] non link text [[Another link|problemAgain]]
  ^------------ group 1 (wrong) -------^ ^-group 2 -^

[[This should be|matched|]]

Thanks

1 Solution

#1


Just use a negated character class instead of .+?, so that neither group can run across a closing ]] into the next link:

\[\[(?!.+?:)([^\]\[]+)\|([^\]\[]+)\]\]

As a Java string literal, this would be:

"\\[\\[(?!.+?:)([^\\]\\[]+)\\|([^\\]\\[]+)\\]\\]"

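A minimal sketch of how this first pattern might be applied from Java. The class name WikiLinkDemo and the method extract are made up for illustration and are not part of the answer. One thing to watch: with default flags the "." in the lookahead is not confined to the current link, so a colon anywhere later on the same line makes the lookahead reject the link.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class WikiLinkDemo {
    // First pattern from the answer: negative lookahead for ':' plus
    // negated character classes that cannot cross '[' or ']'.
    static final Pattern LINK =
        Pattern.compile("\\[\\[(?!.+?:)([^\\]\\[]+)\\|([^\\]\\[]+)\\]\\]");

    // Returns one {target, label} pair per matched link.
    static List<String[]> extract(String wikitext) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(wikitext);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "[[Problem]] non link text [[Another link|problemAgain]]";
        for (String[] link : extract(text)) {
            // prints: Another link -> problemAgain
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

With this pattern, an input such as "[[A|b]] see [[Category:X]]" yields no match at all, because the lookahead at [[A|b]] finds the colon of the later Category link on the same line.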

Or, more simply, exclude the colon inside both character classes:

\[\[([^\]\[:]+)\|([^\]\[:]+)\]\]

As a Java string literal, this would be:

"\\[\\[([^\\]\\[:]+)\\|([^\\]\\[:]+)\\]\\]"

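A short sketch of this second pattern in use, under the same caveats as any illustration: the class name ColonFreeLinkDemo and its extract method are invented here. Because the colon is excluded from both groups rather than checked by a lookahead, a namespaced link later on the line no longer blocks an earlier match, but a link whose display text contains a colon is skipped as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class ColonFreeLinkDemo {
    // Second pattern from the answer: ':' is excluded from both groups,
    // so no lookahead is needed.
    static final Pattern LINK =
        Pattern.compile("\\[\\[([^\\]\\[:]+)\\|([^\\]\\[:]+)\\]\\]");

    // Returns each matched link formatted as "target => label".
    static List<String> extract(String wikitext) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(wikitext);
        while (m.find()) {
            links.add(m.group(1) + " => " + m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // The namespaced link is skipped, the piped article link is kept.
        System.out.println(extract("[[A|b]] see [[Category:X]]")); // prints [A => b]
        // A colon in the display text also prevents a match.
        System.out.println(extract("[[Foo|bar: baz]]"));           // prints []
    }
}
```

Which trade-off is acceptable depends on the articles being processed; if display text with colons matters, only the first group would need the colon exclusion.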