Flex lexer rule with a positive lookahead assertion for alphanumeric strings containing hyphens and slashes

Date: 2022-03-23 09:40:48

I am having some trouble building a flex lexer rule with a positive lookahead assertion for a certain type of token and could use some help. I am sure I am missing something simple here.

The token string I want to match looks like this:

33-abc-13/12
99-ab-33
o3sehh04/00
glu6-840d/00
vm-22hd
xyz-3

The token to match is a string containing letters and digits that also has slashes and/or hyphens and, in rare cases, a dot, for example something like xx-3006/10.00

What must not be matched (because other rules cover these cases) are tokens such as:

numeric370
hyphen-term
plainterm
00/40

What I tried so far is this rule with a lookahead:

([a-z0-9/-]*)/[-/]+[0-9/-]+

With the above I get results that come close to what I would like to achieve. It matches all the strings listed above, but the last character or digit is dropped. The matched tokens look like:

33-abc-13/1
o3sehh04/0
...

Unfortunately the rule also matches 00/40 (resulting in 00/4).

So my question is: what am I missing here? It would be nice to cover these cases with one rule, if that is possible and fast enough. I am aware of the order in which rules are processed in the lexer script, so that rule would be one of the first in the entire set. If a single rule is not possible, perhaps breaking it down into several rules would be another way to go.

For this project I use the RE-flex package (https://github.com/Genivia/RE-flex) because it covers the flex lexer interface and provides Unicode support (I need to work with wchar_t character strings). My lexer is a whitespace tokenizer with token classification; it was originally built on the flex 2.5 package a few years back. I've refactored a few things in the token processing and moved to RE-flex as it gives me more options. The tokenizer input strings are short, simple text snippets that do not exceed a length of, let's say, 250-300 characters. So much for the background.

NOTE: I use regex101.com to check and experiment when building rules, before I adapt them for the lexer. It helps a little in pointing me in the right direction, but that's all.

Any help is greatly appreciated, thanks for your efforts in advance!

Update: Based on rici's answer the final pattern now looks like this:

[a-z0-9/.-]*[/.-][0-9/-]+

This now also covers tokens containing a ., for example

xx33-4.00
f/44-7.87
...

As for the sentence separator problem mentioned in my comment below, the cause was simply a . in the last character group of the pattern. I removed it and now it works as expected.
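
As a quick sanity check outside the lexer, the final pattern can be tried against the sample tokens from the question. This is only a sketch under assumptions: it uses Python's re module rather than flex or RE-flex (the pattern consists only of character classes with * and +, which should behave the same way), and the helper name whole_token is made up for illustration. fullmatch stands in for "the rule consumes the entire whitespace-delimited token".

```python
import re

# Final pattern from the update, copied verbatim.
# Assumption: Python's re is used here only as a stand-in for the lexer's regex engine.
FINAL = re.compile(r"[a-z0-9/.-]*[/.-][0-9/-]+")

def whole_token(tok: str) -> bool:
    """Return True if the pattern consumes the entire token (hypothetical helper)."""
    return FINAL.fullmatch(tok) is not None

# Sample tokens taken from the question.
wanted = ["33-abc-13/12", "99-ab-33", "o3sehh04/00", "glu6-840d/00",
          "vm-22hd", "xyz-3", "xx-3006/10.00", "xx33-4.00", "f/44-7.87"]
unwanted = ["numeric370", "hyphen-term", "plainterm", "00/40"]

for tok in wanted + unwanted:
    print(f"{tok:15} {whole_token(tok)}")
```

In this sketch vm-22hd is not consumed in full, because the pattern has to end in digits, slashes or hyphens, and 00/40 still matches, so excluding it depends on rule order rather than on the pattern itself.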

1 Answer

#1


I don't know anything about RE-flex (although it looks cool), but assuming it really is compatible with flex, the same approach should work: forget about lookahead assertions (the matched string will not include the lookahead pattern, and you want to match the whole string) and put the rule after all the other rules which might match the same thing.

The flex rule is:

  • the pattern which has the longest match wins, but
  • if two or more patterns both match the longest match, the first pattern in the file wins.

So, for example, say you have the patterns:

[0-9]+("/"[0-9]+)*          { return SLASHED_NUMBERS; }
[a-z0-9/-]*[/-][0-9/-]+     { return GENERAL_TOKEN;   }

[Note 1]

Both of those will match 00/40, so if that is the token at the input point, that token will be detected as SLASHED_NUMBERS (the first rule in the file). On the other hand, if you have 00/49-23, it will be detected as GENERAL_TOKEN because that rule matched more characters.
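
The two-part rule can be mimicked outside the lexer to see how it plays out on these inputs. The following is a rough sketch, not how flex is implemented: it uses Python's re module as a stand-in, the function name classify is invented for illustration, and re.match is a greedy backtracking matcher that is not guaranteed to find the longest match in general (it does for these two patterns and inputs).

```python
import re

# The two example rules in file order.
# Assumption: Python's re stands in for flex; the flex "/" (quoted slash) is written as a plain /.
RULES = [
    ("SLASHED_NUMBERS", re.compile(r"[0-9]+(/[0-9]+)*")),
    ("GENERAL_TOKEN",   re.compile(r"[a-z0-9/-]*[/-][0-9/-]+")),
]

def classify(text: str):
    """Pick the rule with the longest match at the start of text.

    The strict > comparison means a later rule only wins if it matches
    more characters, so ties go to the earlier rule.
    """
    best = None
    for name, rx in RULES:
        m = rx.match(text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(classify("00/40"))     # ('SLASHED_NUMBERS', '00/40')
print(classify("00/49-23"))  # ('GENERAL_TOKEN', '00/49-23')
```

Swapping the order of the two entries in RULES would make 00/40 come out as GENERAL_TOKEN, which is the reason for placing the general rule last.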


Notes

  1. I based that on your regex. I didn't understand "a rare cases a dot" and it doesn't seem to be reflected in your pattern; furthermore, your pattern seems to be more specific than just "letters, numbers, hyphens and slashes", but I'm not sure exactly what the specifics are.