Whitespace inserted by the C preprocessor

Date: 2022-11-25 10:48:41

Suppose we are given this C code as input:

#define Y 20
#define A(x) (10+x+Y)

A(A(40))

gcc -E outputs (10+(10+40 +20)+20).

gcc -E -traditional-cpp outputs (10+(10+40+20)+20).

Why does the default cpp insert a space after 40?

Where can I find the most detailed specification of cpp that covers this logic?

2 answers

#1


10  

The C standard doesn't specify this behaviour, since the output of the preprocessing phase is simply a stream of tokens and whitespace. Serializing the stream of tokens back into a character string, which is what gcc -E does, is not required or even mentioned by the standard, and does not form part of the translation process specified by the standard.

In phase 3, the program "is decomposed into preprocessing tokens and sequences of white-space characters." Aside from the result of the concatenation operator, which ignores whitespace, and the stringification operator, which preserves whitespace, tokens are then fixed and whitespace is no longer needed to separate them. However, the whitespace is needed in order to:

  • parse preprocessor directives
  • correctly process the stringification operator

The whitespace elements in the stream are not eliminated until phase 7, although they are no longer relevant after phase 4 concludes.

Gcc is capable of producing a variety of information useful to programmers, but not corresponding to anything in the standard. For example, the preprocessor phase of the translation can also produce dependency information useful for inserting into a Makefile, using one of the -M options. Alternatively, a human-readable version of the compiled code can be output using the -S option. And a compilable version of the preprocessed program, roughly corresponding to the token stream produced by phase 4, can be output using the -E option. None of these output formats are in any way controlled by the C standard, which is only concerned with actually executing the program.

In order to produce the -E output, gcc must serialize the stream of tokens and whitespace in a format which does not change the semantics of the program. There are cases in which two consecutive tokens in the stream would be incorrectly glued together into a single token if they are not separated from each other, so gcc must take some precautions. It cannot actually insert whitespace into the stream being processed, but nothing stops it from adding whitespace when it presents the stream in response to gcc -E.

For example, if the macro invocation in your example were modified to

A(A(0x40E))

then a naive output of the token stream would result in

(10+(10+0x40E+20)+20)

which could not be compiled because 0x40E+20 is a single pp-number token which cannot be converted into a numeric token. The space before the + prevents this from happening.
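
To make the gluing concrete, here is a minimal sketch of a pp-number scanner. It follows the pp-number grammar in the C standard but ignores universal character names, and the function name pp_number_len is made up for this illustration; it is not part of gcc.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the pp-number that starts at s, or 0 if none starts there.
   Hypothetical helper for illustration; universal character names are
   not handled. */
size_t pp_number_len(const char *s)
{
    assert(s != NULL);
    size_t i;

    /* A pp-number starts with a digit, or with '.' followed by a digit. */
    if (isdigit((unsigned char)s[0]))
        i = 1;
    else if (s[0] == '.' && isdigit((unsigned char)s[1]))
        i = 2;
    else
        return 0;

    for (;;) {
        unsigned char c = (unsigned char)s[i];
        /* e, E, p or P followed by a sign extends the number by two chars. */
        if (c != '\0' && strchr("eEpP", c) != NULL &&
            (s[i + 1] == '+' || s[i + 1] == '-'))
            i += 2;
        /* Digits, identifier characters and '.' extend it by one char. */
        else if (isalnum(c) || c == '_' || c == '.')
            i += 1;
        else
            return i;
    }
}
```

With this sketch, pp_number_len("0x40E+20") is 8, so the whole string is swallowed as one pp-number, while pp_number_len("40+20") is 2 and pp_number_len("0x40E +20") is 5.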

If you attempt to implement a preprocessor as some kind of string transformation, you will undoubtedly confront serious issues in the corner cases. The correct implementation strategy is to tokenize first, as indicated in the standard, and then perform phase 4 as a function on a stream of tokens and whitespace.

Stringification is a particularly interesting case where whitespace affects semantics, and it can be used to see what the actual token stream looks like. If you stringify the expansion of A(A(40)), you can see that no whitespace was actually inserted:

$ gcc -E -x c - <<<'
#define Y 20
#define A(x) (10+x+Y)
#define Q_(x) #x
#define Q(x) Q_(x)         
Q(A(A(40)))'

"(10+(10+40+20)+20)"

The handling of whitespace in stringification is precisely specified by the standard (§6.10.3.2, paragraph 2; many thanks to John Bollinger for finding the specification):

Each occurrence of white space between the argument’s preprocessing tokens becomes a single space character in the character string literal. White space before the first preprocessing token and after the last preprocessing token composing the argument is deleted.
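
A small sketch of that rule in action; the macro name STR and the two array names are chosen only for this example.

```c
#include <string.h>

/* Stringify the argument exactly as written (no prior macro expansion). */
#define STR(x) #x

/* Runs of white space between tokens collapse to one space;
   leading and trailing white space is deleted. */
const char spaced[] = STR(   a   +   b   );

/* No white space between the tokens, so none appears in the string. */
const char tight[] = STR(a+b);
```

Here spaced holds "a + b" and tight holds "a+b", which is why stringification is a reliable way to see whether white space is really present in the token stream.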

Here is a more subtle example where additional whitespace is required in the gcc -E output, but is not actually inserted into the token stream (again shown by using stringification to reveal the real token stream). The I (identity) macro is used to allow two tokens to be inserted into the token stream without intervening whitespace; that's a useful trick if you want to use macros to compose the argument to the #include directive (not recommended, but it can be done).

Maybe this could be a useful test case for your preprocessor:

#define Q_(x) #x
#define Q(x) Q_(x)
#define I(x) x
#define C(x,...) x(__VA_ARGS__)
// Uncomment the following line to run the program
//#include <stdio.h>

char*quoted=Q(C(I(int)I(main),void){I(return)I(C(puts,quoted));});
C(I(int)I(main),void){I(return)I(C(puts,quoted));}

Here's the output of gcc -E (just the good stuff at the end):

$ gcc -E squish.c | tail -n2
char*quoted="intmain(void){returnputs(quoted);}";
int main(void){return puts(quoted);}

In the token stream which is passed out of phase 4, the tokens int and main are not separated by whitespace (and neither are return and puts). That's clearly shown by the stringification, in which no whitespace separates the tokens. However, the program compiles and executes fine, even if passed explicitly through gcc -E:

$ gcc -E squish.c | gcc -x c - && ./a.out 
intmain(void){returnputs(quoted);}

The second gcc in that pipeline compiles the output of gcc -E.


Different compilers and different versions of the same compiler may produce different serializations of a preprocessed program. So I don't think you will find any algorithm which is testable with a character-by-character comparison with the -E output of a given compiler.

The simplest possible serialization algorithm would be to unconditionally output a space between two consecutive tokens. Obviously, that would output unnecessary spaces, but it would never syntactically alter the program.
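
As a rough sketch of that always-space strategy, assuming tokens are already available as an array of spellings (the function name serialize_spaced and its buffer interface are invented for this illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write the n token spellings in toks into out (capacity cap), with a
   single space between consecutive tokens. Returns the number of
   characters written, or 0 if out is too small (out may then hold a
   partial result). */
size_t serialize_spaced(const char *const *toks, size_t n,
                        char *out, size_t cap)
{
    assert(toks != NULL && out != NULL);
    size_t len = 0;

    if (cap == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        size_t tlen = strlen(toks[i]);
        size_t sep = (i > 0) ? 1 : 0;

        /* Room for separator, token and terminating NUL? */
        if (len + sep + tlen + 1 > cap)
            return 0;
        if (sep)
            out[len++] = ' ';
        memcpy(out + len, toks[i], tlen + 1);
        len += tlen;
    }
    return len;
}
```

Fed the seven tokens ( 10 + 0x40E + 20 ), this produces "( 10 + 0x40E + 20 )": more spaces than needed, but 0x40E and + can never be glued together.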

I think the minimal space algorithm would be to record the DFA state at the end of the last character in a token so that you can later output a space between two consecutive tokens if there exists a transition from the state at the end of the first token on the first character of the following token. (Keeping the DFA state as part of the token is not intrinsically different from keeping the token type as part of the token, since you can derive the token type from a simple lookup from the DFA state.) That algorithm would not insert a space after 40 in your original test case, but it would insert a space after 0x40E. So it is not the algorithm being used by your version of gcc.

If you use the above algorithm, you will need to rescan tokens created by token concatenation. However, that is necessary anyway, because you need to flag an error if the result of the concatenation is not a valid preprocessing token.

If you don't want to record states (although, as I said, there is essentially no cost in doing so) and you don't want to regenerate the state by rescanning the token as you output it (which would also be quite cheap), you could precompute a two-dimensional boolean array keyed by token type and following character. The computation would essentially be the same as the above: for every accepting DFA state which returns a particular token type, enter a true value in the array for that token type and any character with a transition out of the DFA state. Then you can look up the token type of a token and the first character of the following token to see if a space may be necessary. This algorithm does not produce a minimally-spaced output: it would, for example, put a space after the 40 in your example, since 40 is a pp-number and it is possible for some pp-number to be extended with a + (even though you cannot extend 40 in that way). So it's possible that gcc uses some version of this algorithm.
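
A minimal sketch of such a table follows. To stay short it is filled in by hand for a few token kinds rather than derived from a real DFA, and every name in it (tok_kind, may_extend, space_may_be_needed) is hypothetical. It also ignores cases such as an identifier followed by a quote, which a complete table would have to cover.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token kinds; a real lexer has many more. TK_PAREN stands
   for '(' and ')', which never start a longer token. */
enum tok_kind { TK_IDENT, TK_NUMBER, TK_PLUS, TK_MINUS, TK_PAREN, TK_KIND_COUNT };

/* may_extend[k][c]: could SOME token of kind k be continued by char c? */
static bool may_extend[TK_KIND_COUNT][256];
static bool table_ready = false;

static void mark(enum tok_kind k, const char *chars)
{
    for (; *chars != '\0'; chars++)
        may_extend[k][(unsigned char)*chars] = true;
}

/* Hand-written stand-in for "walk every accepting DFA state". */
static void init_table(void)
{
    static const char ident_chars[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_";

    mark(TK_IDENT, ident_chars);
    mark(TK_NUMBER, ident_chars);
    mark(TK_NUMBER, ".+-");   /* some pp-numbers end in e/E/p/P */
    mark(TK_PLUS, "+=");      /* ++ and += */
    mark(TK_MINUS, "-=>");    /* --, -= and -> */
    table_ready = true;
}

/* Look up the kind of the left token and the first character of the
   right token's spelling. */
bool space_may_be_needed(enum tok_kind left, const char *right_spelling)
{
    assert(right_spelling != NULL && right_spelling[0] != '\0');
    if (!table_ready)
        init_table();
    return may_extend[left][(unsigned char)right_spelling[0]];
}
```

With this table, space_may_be_needed(TK_NUMBER, "+") is true whatever the number is, which matches the space after 40 in the question, while space_may_be_needed(TK_PAREN, "10") is false.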

#2


1  

Adding some historical context to rici's excellent answer.

If you can get your hands on a working copy of gcc 2.7.2.3, experiment with its preprocessor. At that time the preprocessor was a separate program from the compiler, and it used a very naive algorithm for text serialization, which tended to insert far more spaces than were necessary. When Neil Booth, Per Bothner and I implemented the integrated preprocessor (appearing in gcc 3.0 and since), we decided to make -E output a bit smarter at the same time, but without making the implementation too complicated. The core of this algorithm is the library function cpp_avoid_paste, defined at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libcpp/lex.c#l2990 , and its caller is here: https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/c-family/c-ppoutput.c#l177 (look for "Subtle logic to output a space...").

In the case of your example

#define Y 20
#define A(x) (10+x+Y)
A(A(40))

cpp_avoid_paste will be called with a CPP_NUMBER token (what rici called a "pp-number") on the left, and a '+' token on the right. In this case it unconditionally says "yes, you need to insert a space to avoid pasting" rather than checking whether the last character of the number token is one of eEpP.
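
The difference between the two approaches can be sketched as follows. This is not the code of cpp_avoid_paste; both functions and their signatures are invented for illustration, and treating '-' the same way as '+' is an assumption based on the pp-number grammar, which accepts either sign after e, E, p or P.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Conservative rule: a number followed by a sign character always gets
   a space, without looking at the number's spelling. */
bool avoid_paste_conservative(const char *number, char next)
{
    assert(number != NULL && number[0] != '\0');
    (void)number; /* deliberately not inspected */
    return next == '+' || next == '-';
}

/* Precise rule: a space is needed only when the number ends in one of
   e, E, p, P, because only then would the sign be absorbed into it. */
bool avoid_paste_precise(const char *number, char next)
{
    assert(number != NULL && number[0] != '\0');
    char last = number[strlen(number) - 1];
    return (next == '+' || next == '-') && strchr("eEpP", last) != NULL;
}
```

For the number 40 followed by +, the conservative rule asks for a space and the precise rule does not; for 0x40E followed by +, both do.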

Compiler design often comes down to a trade-off between accuracy and implementation simplicity.
