What makes Java easier to parse than C?

Time: 2022-09-06 13:25:16

I'm acquainted with the fact that the grammars of C and C++ are context-sensitive, and in particular you need a "lexer hack" in C. On the other hand, I'm under the impression that you can parse Java with only 2 tokens of look-ahead, despite considerable similarity between the two languages.

What would you have to change about C to make it more tractable to parse?

I ask because all of the examples I've seen of C's context-sensitivity are technically allowable but awfully weird. For example,

foo (a);

could be calling the void function foo with argument a. Or it could be declaring a to be an object of type foo, but you could just as easily get rid of the parentheses. In part, this weirdness occurs because the "direct declarator" production rule in the C grammar serves the dual purpose of declaring both functions and variables.
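
To make the two readings concrete, here is a minimal C sketch in which the same tokens, foo (a);, are a call in one function and a declaration in another. The names total, as_call and as_declaration are made up for illustration; the only thing that changes the meaning is whether foo names a function or a typedef in the enclosing scope.

```c
#include <assert.h>

int total = 0;

/* At file scope, foo names a function. */
void foo(int a) { total += a; }

/* Reading 1: foo is a function here, so `foo (a);` is a call. */
int as_call(void) {
    int a = 3;
    foo (a);
    return total;
}

/* Reading 2: a block-scope typedef hides the function, so the
   same tokens now declare a variable `a` of type foo (int). */
int as_declaration(void) {
    typedef int foo;
    foo (a);
    assert(sizeof a == sizeof(int)); /* `a` really is an int object */
    a = 7;
    return a;
}
```

A parser that looks only at the token sequence cannot tell these two statements apart; it needs to know what foo currently names.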

On the other hand, the Java grammar has separate production rules for variable declaration and function declaration. If you write

foo a;

then you know it's a variable declaration and foo can unambiguously be parsed as a typename. This might not be valid code if the class foo hasn't been defined somewhere in the current scope, but that's a job for semantic analysis that can be performed in a later compiler pass.

I've seen it said that C is hard to parse because of typedef, but you can declare your own types in Java too. Which C grammar rules, besides direct_declarator, are at fault?

1 solution

#1


76  

Parsing C++ is getting hard. Parsing Java is getting to be just as hard.

See this SO answer discussing why C (and C++) is "hard" to parse. The short summary is that C and C++ grammars are inherently ambiguous; they will give you multiple parses and you must use context to resolve the ambiguities. People then make the mistake of assuming you have to resolve ambiguities as you parse; not so, see below. If you insist on resolving ambiguities as you parse, your parser gets more complicated and that much harder to build; but that complexity is a self-inflicted wound.
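
Resolving the ambiguity while parsing usually means something like the "lexer hack" the question mentions: the lexer asks a table of known typedef names before deciding what kind of token an identifier is. The following is only a rough sketch of that idea; typedef_names, record_typedef and classify are hypothetical names, and scoping is ignored entirely, which a real front end could not get away with.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_IDENTIFIER, TOK_TYPE_NAME };

#define MAX_TYPEDEFS 16
static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* Called by the parser once it has finished a typedef declaration.
   Returns 0 if the (fixed-size, illustrative) table is full. */
int record_typedef(const char *name) {
    assert(name != NULL);
    if (typedef_count == MAX_TYPEDEFS)
        return 0;
    typedef_names[typedef_count++] = name;
    return 1;
}

/* Called by the lexer for every identifier-shaped lexeme: the
   answer depends on what the parser has seen so far. */
enum token_kind classify(const char *lexeme) {
    assert(lexeme != NULL);
    for (int i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], lexeme) == 0)
            return TOK_TYPE_NAME;
    return TOK_IDENTIFIER;
}
```

The feedback loop from parser to lexer is the extra complexity the paragraph above calls self-inflicted.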

IIRC, Java 1.4's "obvious" LALR(1) grammar was not ambiguous, so it was "easy" to parse. I'm not so sure that modern Java hasn't got at least long distance local ambiguities; there's always the problem of deciding whether "...>>" closes off two templates (generic type argument lists, in Java terms) or is a "right shift operator". I suspect modern Java does not parse with LALR(1) anymore.
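
One context-tracking way to handle the ">>" question is to count open angle brackets and split ">>" into two closers only when at least two lists are open. The sketch below does that over a plain string. It is a toy under loud assumptions: count_right_shifts is a made-up name, every '<' in the input is assumed to open a type argument list (never a less-than operator), and ">>>" is not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many `>>` lexemes are read as a shift operator.
   Assumption for this sketch: every '<' opens a type argument list. */
int count_right_shifts(const char *src) {
    assert(src != NULL);
    int depth = 0;   /* number of currently open type argument lists */
    int shifts = 0;
    for (const char *p = src; *p != '\0'; p++) {
        if (*p == '<') {
            depth++;
        } else if (p[0] == '>' && p[1] == '>') {
            if (depth >= 2)
                depth -= 2;   /* split: two closing brackets */
            else
                shifts++;     /* no nested list open: operator */
            p++;              /* consume the second '>' */
        } else if (*p == '>' && depth > 0) {
            depth--;
        }
    }
    return shifts;
}
```

Even this toy has to carry state across an arbitrary distance between the opening brackets and the ">>", which is the kind of long distance dependency that strains a fixed-lookahead parser.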

But one can get past the parsing problem by using strong parsers (or weak parsers and context collection hacks as C and C++ front ends mostly do now), for both languages. C and C++ have the additional complication of having a preprocessor; preprocessors are more complicated in practice than they look. One claim is that C and C++ parsers are so hard they have to be written by hand. It isn't true; you can build Java and C++ parsers just fine with GLR parser generators.
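
With a parser that tolerates ambiguity, both readings of a phrase can stay in the tree and the choice can be made in a later pass, once the declared type names are known. Here is a small sketch of such a later pass, assuming a hypothetical node layout (struct ambiguous_stmt) and a hypothetical list of type names collected after parsing; it is not how any particular GLR tool represents its output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum reading { READING_CALL, READING_DECLARATION };

/* Hypothetical node for `head (inner);`, left ambiguous by the parser. */
struct ambiguous_stmt {
    const char *head;   /* e.g. "foo" */
    const char *inner;  /* e.g. "a" */
};

/* Runs after parsing: pick the reading using the collected type names. */
enum reading resolve_later(const struct ambiguous_stmt *stmt,
                           const char *const *type_names, size_t n) {
    assert(stmt != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(type_names[i], stmt->head) == 0)
            return READING_DECLARATION;
    return READING_CALL;
}
```

The parser itself never consults the type names here, so the grammar can stay close to the one in the language reference.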

But parsing isn't really where the problem is.

Once you parse, you will want to do something with the AST/parse tree. In practice, you need to know, for every identifier, what its definition is and where it is used ("name and type resolution", sloppily, building symbol tables). This turns out to be a LOT more work than getting the parser right, compounded by inheritance, interfaces, overloading and templates, and confounded by the fact that the semantics for all this are written in informal natural language spread across tens to hundreds of pages of the language standard. C++ is really bad here. Java 7 and 8 are getting to be pretty awful from this point of view. (And symbol tables aren't all you need; see my bio for a longer essay on "Life After Parsing").
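
As a very small illustration of what "building symbol tables" involves, the sketch below resolves a name through nested scopes, innermost first. The types and the lookup function are hypothetical, and everything that makes the real job hard (inheritance, overloading, templates, the rules in the standard) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sym_kind { SYM_VARIABLE, SYM_FUNCTION, SYM_TYPE };

struct symbol {
    const char *name;
    enum sym_kind kind;
};

/* A scope owns some symbols and points at its enclosing scope. */
struct scope {
    const struct scope *parent;   /* NULL for the outermost scope */
    const struct symbol *symbols;
    size_t count;
};

/* Search the innermost scope first, then walk outward.
   Returns NULL if the name is not declared anywhere. */
const struct symbol *lookup(const struct scope *s, const char *name) {
    assert(name != NULL);
    for (; s != NULL; s = s->parent)
        for (size_t i = 0; i < s->count; i++)
            if (strcmp(s->symbols[i].name, name) == 0)
                return &s->symbols[i];
    return NULL;
}
```

With an inner scope that declares foo as a type and an outer scope that declares it as a function, the same name resolves differently depending on where the lookup starts.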

Most folks struggle with the pure parsing part (often never finishing; check SO itself for the many, many questions about how to build working parsers for real languages), so they don't ever see life after parsing. And then we get folk theorems about what is hard to parse and no signal about what happens after that stage.

Fixing C++ syntax won't get you anywhere.

Regarding changing the C++ syntax: you'll find you need to patch a lot of places to take care of the variety of local and real ambiguities in any C++ grammar. If you insist, the following list might be a good starting place. I contend there is no point in doing this if you are not the C++ standards committee; if you did so, and built a compiler using that, nobody sane would use it. There's too much invested in existing C++ applications to switch for convenience of the guys building parsers; besides, their pain is over and existing parsers work fine.

You may want to write your own parser. OK, that's fine; just don't expect the rest of the community to let you change the language they must use to make it easier for you. They all want it easier for them, and that's to use the language as documented and implemented.
