How do I define a number format in Flex (lexical analyzer)?

What I need :

Acceptable > 1234 & 12.34

Error (Non acceptable) > 12.34.56

Scanner.L :

      ...
%%

[0-9]+                printf("Number ");
[0-9]+"."[0-9]+       printf("Decimal_Number ");
"."                   printf("Dot ");

%%
      ...

After compiling and running:

Input :
1234    12.34    12.34.65

Output :
Number    Decimal_Number      Decimal_Number Dot Number

How can I print Error instead of Decimal_Number Dot Number (or just ignore it)?

~~Is it possible to define a space before and after a number as a separator?~~

3 solutions

#1

It's often considered better to detect errors like 12.34.56 in your parser rather than your scanner. But there is also an argument that you can produce better error messages by detecting the error lexically.

If you want to do so, you can use two patterns; the first one detects only correct numbers and the second one detects a larger set of strings, including all the erroneous strings (but not anything which could be legitimate). This relies on the matching behaviour of (f)lex: it always accepts the longest match, and if the longest token is matched by two or more rules, it uses the first matching rule.
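
The two matching rules can be seen in isolation with a minimal scanner. This is only a sketch: the labels INTEGER and DOTS_AND_DIGITS are made up for the demonstration, and the file is assumed to be saved as something like demo.l.

```lex
%option noyywrap
%{
#include <stdio.h>
%}
%%
  /* Rule 1: digits only. */
[[:digit:]]+     { printf("INTEGER(%s) ", yytext); }
  /* Rule 2: any run of dots and digits. It also matches plain
   * integers, but for those rule 1 matches the same length and
   * is listed first, so rule 1 wins. */
[.[:digit:]]+    { printf("DOTS_AND_DIGITS(%s) ", yytext); }
[ \t\n]+         { /* skip whitespace */ }
%%
int main(void)
{
    yylex();
    putchar('\n');
    return 0;
}
```

Built with `flex demo.l && cc lex.yy.c -o demo`, the input `1234 12.34` should print `INTEGER(1234) DOTS_AND_DIGITS(12.34)`: the first token is a tie resolved by rule order, the second is won by rule 2 because its match is longer.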

For example, suppose you wanted to accept dots by themselves as '.', integers as INTEGER tokens, numbers with a decimal point as FLOAT tokens, and produce an error on numeric strings with more than one dot. You could do that with four rules:

  /* If the token is just a dot, match it here */
\.                             { return '.';    }
  /* Match integers without decimal points */
[[:digit:]]+                   { return INTEGER; }
  /* If the token is a number including a decimal point,
   * match it here. This pattern will also match just '.',
   * but the previous rules will be preferred. */
[[:digit:]]*\.[[:digit:]]*     { return FLOAT; }
  /* This rule matches any sequence of dots and digits.
   * That will also match single dots and correct numbers, but
   * again, the previous rules are preferred. */
[.[:digit:]]+                  { /* signal error */
                                 return BADNUMBER; }

You need to be very careful with solutions like the one above. For example, the last rule will match .. and ..., which might be valid tokens (or even valid sequences of . tokens).

Suppose, for example, that your language permits "range" expressions like 4 .. 17 (meaning the list of integers from 4 to 17, or some such). Your users might expect 4..17 to be accepted as a range expression, but the above will produce a BADNUMBER error, even when you add the rule

".."                           { return RANGE; }

at the beginning, because the longest-match rule makes the scanner take 4..17 as a single BADNUMBER, so it never gets to match the .. on its own.
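
The false alert is easy to reproduce. The following sketch reuses the first set of rules with the range rule placed at the top; the printed labels and the file name range.l are assumptions made for the demonstration.

```lex
%option noyywrap
%{
#include <stdio.h>
%}
%%
".."                          { printf("RANGE "); }
\.                            { printf("DOT "); }
[[:digit:]]+                  { printf("INTEGER "); }
[[:digit:]]*\.[[:digit:]]*    { printf("FLOAT "); }
  /* Any run of dots and digits: for "4..17" this is the longest
   * match, so the RANGE rule above never gets a chance. */
[.[:digit:]]+                 { printf("BADNUMBER(%s) ", yytext); }
[ \t\n]+                      { /* skip whitespace */ }
%%
int main(void)
{
    yylex();
    putchar('\n');
    return 0;
}
```

After `flex range.l && cc lex.yy.c -o range`, the input `4 .. 17` should print `INTEGER RANGE INTEGER`, while `4..17` should print `BADNUMBER(4..17)`.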

In order to avoid false alerts, we need to modify the BADNUMBER rule to avoid matching strings which include two (or more) consecutive dots. And we also need to make sure that 4..17 is not lexed as 4. followed by .17. (This second problem could be avoided by insisting that . neither start nor end a numeric token, but that might annoy some users.)

So, we start with the actual dot tokens:

"."                            { return '.'; }
".."                           { return RANGE; }
"..."                          { return ELLIPSIS; }

To avoid overmatching a number followed by .., we can use flex's trailing context operator (/). Here, we recognize a sequence of digits terminated by a . as a number only if the string is followed by something other than a .:

[[:digit:]]+                   { return INTEGER; }
  /* Change the final * to + so that this rule does not match
   * numbers ending with . */
[[:digit:]]*\.[[:digit:]]+     { return FLOAT; }
  /* Numbers which end with dot not followed by dot */
[[:digit:]]+\./[^.]            { return FLOAT; }

Now we need to fix the error rule. First, we limit it to recognizing strings where every dot is followed by a digit. Then, similar to the above, we also match the case where there is a trailing dot not followed by another dot:

[[:digit:]]*(\.[[:digit:]]+)+  { return BADNUMBER; }
[[:digit:]]*(\.[[:digit:]]+)+\./[^.] { return BADNUMBER; }
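
Put together, the pieces could look like the complete scanner below. Treat it as a sketch under a few assumptions: the token codes are declared in a local enum (a real project would take them from the parser's generated header), the FLOAT rule is written with a mandatory fractional part, and the whitespace rule, the catch-all rule, the name() helper and the main() driver are additions for testing only.

```lex
%option noyywrap
%{
#include <stdio.h>
/* Hypothetical token codes, normally supplied by the parser. */
enum { INTEGER = 258, FLOAT, RANGE, ELLIPSIS, BADNUMBER };
%}
%%
"."                                   { return '.'; }
".."                                  { return RANGE; }
"..."                                 { return ELLIPSIS; }
[[:digit:]]+                          { return INTEGER; }
[[:digit:]]*\.[[:digit:]]+            { return FLOAT; }
  /* Trailing dot, accepted only if the next character is not a dot. */
[[:digit:]]+\./[^.]                   { return FLOAT; }
  /* More than one ".digits" group: longer than any valid number. */
[[:digit:]]*(\.[[:digit:]]+)+         { return BADNUMBER; }
[[:digit:]]*(\.[[:digit:]]+)+\./[^.]  { return BADNUMBER; }
[ \t\n]+                              { /* skip whitespace */ }
.                                     { return yytext[0]; }
%%
/* Map a token code to a printable name (test helper only). */
static const char *name(int tok)
{
    switch (tok) {
    case INTEGER:   return "INTEGER";
    case FLOAT:     return "FLOAT";
    case RANGE:     return "RANGE";
    case ELLIPSIS:  return "ELLIPSIS";
    case BADNUMBER: return "BADNUMBER";
    default:        return "CHAR";
    }
}

int main(void)
{
    int tok;
    /* yylex() returns 0 at end of input. */
    while ((tok = yylex()) != 0)
        printf("%s(%s) ", name(tok), yytext);
    putchar('\n');
    return 0;
}
```

For the input line `1234 12.34 12.34.56 4..17` this should print `INTEGER(1234) FLOAT(12.34) BADNUMBER(12.34.56) INTEGER(4) RANGE(..) INTEGER(17)`. One edge case to be aware of: because the trailing context `[^.]` needs an actual character, a number ending in a dot at the very end of the input (with no newline after it) is scanned as an INTEGER followed by '.'.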

#2

That's not the duty of the lexer, but of the parser (yacc or bison). If you define . as a valid symbol then there is no surprise that

12.34.56

is tokenized as

Decimal_Number Dot Number

The point is that the parser won't have a rule that accepts that sequence of tokens, so the error will be raised later. Whitespace is usually ignored, so requiring a space between numbers doesn't make sense, especially in a context where you could have 12.34+56.78, which would then fail to be tokenized as Decimal_Number Binary_Operator Decimal_Number because it lacks whitespace.
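
As a rough illustration of that division of labour, the sketch below keeps the scanner from the question and lets a stand-in "parser" reject the unexpected token. Everything outside the three original patterns is an assumption: the token codes, the whitespace and catch-all rules, and the loop in main(), which only imitates what a yacc or bison grammar accepting a list of numbers would do.

```lex
%option noyywrap
%{
#include <stdio.h>
/* Hypothetical token codes, normally generated by yacc/bison. */
enum { NUMBER = 258, DECIMAL_NUMBER, DOT };
%}
%%
[0-9]+            { return NUMBER; }
[0-9]+"."[0-9]+   { return DECIMAL_NUMBER; }
"."               { return DOT; }
[ \t\n]+          { /* whitespace is skipped, not required */ }
.                 { return yytext[0]; }
%%
/* Stand-in for a real parser whose only rule is
 *   input: (NUMBER | DECIMAL_NUMBER)*
 * Any other token is reported as a syntax error. */
int main(void)
{
    int tok;
    while ((tok = yylex()) != 0) {
        if (tok == NUMBER) {
            printf("Number ");
        } else if (tok == DECIMAL_NUMBER) {
            printf("Decimal_Number ");
        } else {
            printf("\nsyntax error: unexpected '%s'\n", yytext);
            return 1;
        }
    }
    putchar('\n');
    return 0;
}
```

With the input `1234 12.34 12.34.65`, the scanner still produces Decimal_Number Dot Number for the last item, but the driver stops at the Dot token and reports a syntax error, which is where a generated parser would complain as well.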

#3

Here is my approach to your problem. Keep in mind that lex runs the action of whichever rule gives the longest match, so you can add a rule that matches the erroneous form. Change your rules as follows:

%%

[0-9]+                {printf("Number ");}
[0-9]+[.][0-9]*[.]+[0-9.]*        {printf("Error ");}
[0-9]+[.][0-9]+       {printf("Decimal_Number ");}
%%

Now the program works as you want.

Input :
1234    12.34    12.34.65

Output :
Number    Decimal_Number     Error
