Efficiently matching correctly spelled partial words with regex

Date: 2022-09-13 13:26:13

I'm trying to write a GtkSourceView language file to highlight some of my files in gedit. The problem I'm encountering is that I want to highlight words that contain at least the first four characters of a keyword and are correctly spelled. To illustrate, say I have four words:

variable
vari
variab
variabel

and I want to match the first three, but not the fourth, because the first three are all correctly spelled prefixes of the target "variable". What gets the job done is using

\bvari(a|ab|abl|able)?\b

but this can become quite tedious with longer words. So in a full lang-file it would look something like this:

<?xml version="1.0" encoding="UTF-8"?>
  <language id="foo" _name="foo" version="2.0" _section="Other">
  <metadata>
     <property name="mimetypes">text/x-foo</property>
     <property name="globs">*.foo</property>
  </metadata>

  <styles>
    <style id="keyword" _name="Keyword" map-to="def:keyword"/>
  </styles>

  <default-regex-options case-sensitive="false"/>

  <definitions>
    <context id="foo">
      <include>
        <context id="keyword" style-ref="keyword">
          <keyword>\bvari(a|ab|abl|able)?\b</keyword>
        </context>
      </include>
    </context>
  </definitions>
</language>

I was not able to find a solution to this - because I'm extremely unfamiliar with regex and do not know the correct phrasing for this question. Is there a simple and efficient solution to this problem?

1 answer

#1


Unfortunately, there isn't really a less tedious way to do it.

About your pattern: GtkSourceView uses the PCRE regex engine, which is a backtracking (NFA) engine. When you write an alternation, the first alternative (from left to right) that matches succeeds, and the engine does not try the alternatives further to the right. For example, against the string abcdef the pattern (a|ab|abc|abcde|abcdef) returns a, whereas a DFA engine would return the longest alternative that matches, abcdef.
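
The leftmost-first behaviour described above can be observed with a short script. This is only a sketch: it uses Python's `re` module as a stand-in for PCRE, on the assumption that both backtracking engines treat this simple alternation the same way. The variable names are made up for illustration.

```python
import re

# Assumption: Python's re is used in place of PCRE; both try the
# alternatives of a group from left to right and keep the first that works.
subject = "abcdef"

# Shortest alternative first: the engine stops at the first success.
shortest_first = re.match(r"(a|ab|abc|abcde|abcdef)", subject)

# Longest alternative first: the first success is also the longest match.
longest_first = re.match(r"(abcdef|abcde|abc|ab|a)", subject)

print(shortest_first.group(0))  # a
print(longest_first.group(0))   # abcdef
```

Nothing after the group forces the engine to reconsider its choice, so the order of the alternatives alone decides what is returned.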

This means that your pattern works only because there is a word boundary at the end. For the whole word variable, each alternative matches, but when the engine then fails to find a word boundary, it must backtrack and try the next alternative, and so on until the last one.
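
To see the effect of the trailing word boundary, the sketch below runs the asker's four words through the pattern with and without the final `\b`. Python's `re` is again assumed to behave like PCRE for these patterns, and `matched_text` is a hypothetical helper written only for this example.

```python
import re

words = ["variable", "vari", "variab", "variabel"]

# Same alternation as in the question, without and with the trailing \b.
unanchored = re.compile(r"\bvari(a|ab|abl|able)?")
anchored = re.compile(r"\bvari(a|ab|abl|able)?\b")

def matched_text(pattern, word):
    """Return the text matched by pattern in word, or None if nothing matches."""
    m = pattern.search(word)
    return m.group(0) if m else None

unanchored_results = {w: matched_text(unanchored, w) for w in words}
anchored_results = {w: matched_text(anchored, w) for w in words}

for w in words:
    print(w, unanchored_results[w], anchored_results[w])
```

Without the final `\b` the engine settles for the first alternative and stops in the middle of the word, so even the misspelled variabel yields a partial match. With it, every failed boundary check sends the engine back to try the next alternative, and variabel is rejected.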

In conclusion, it is better to order your alternation from the longest alternative to the shortest, to spare the engine unnecessary work:

\bvari(able|abl|ab|a)?\b

Another possibility is to design your pattern like this:

\bvari(a(b(le?)?)?)?\b

In this case the regex engine goes straight through to the end of the pattern without having to search for the right alternative. Note that this is not simpler to write, only a little shorter, since you do not have to repeat letters.
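
Since the nested form follows mechanically from the word and the number of required leading characters, it could be generated rather than typed by hand. The following is a minimal sketch of that idea, not part of GtkSourceView or gedit: `prefix_pattern` is a hypothetical helper, and it emits non-capturing groups `(?:...)`, which PCRE supports, instead of the capturing groups shown above. The result is checked here with Python's `re`, assumed to agree with PCRE for this pattern.

```python
import re

def prefix_pattern(word, min_len):
    """Build a regex matching any prefix of word that is at least min_len long."""
    if not 0 < min_len <= len(word):
        raise ValueError("min_len must be between 1 and len(word)")
    required, optional = word[:min_len], word[min_len:]
    tail = ""
    # Wrap from the last letter inwards, so each letter is optional
    # only together with everything that follows it.
    for ch in reversed(optional):
        tail = "(?:" + re.escape(ch) + tail + ")?"
    return r"\b" + re.escape(required) + tail + r"\b"

pattern = prefix_pattern("variable", 4)
print(pattern)

compiled = re.compile(pattern)
results = {w: bool(compiled.search(w))
           for w in ["variable", "vari", "variab", "variabel"]}
print(results)
```

The printed pattern can then be pasted into the `<keyword>` element of the language file; it accepts the same words as the hand-written alternation.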
