从抽象语法树获取控制流图

时间:2023-02-02 20:22:36

I have an AST derived from the ANTLR Parser Generator for Java. What I want to do is somehow construct a control flow graph of the source code, where each statement or expression is a unique Node. I understand there must be some recursiveness to this identification, I was wondering what you would suggest as the best option and if ANTLR has a toolset I can use for this job. Cheers, Chris

我有一个源自ANTLR Parser Generator for Java的AST。我想要做的是以某种方式构建源代码的控制流图,其中每个语句或表达式是唯一的节点。我知道必须有一些这种识别的递归,我想知道你会建议什么是最好的选择,如果ANTLR有一个工具集我可以用于这项工作。干杯,克里斯


EDIT - My main concern is to get a control flow graph(CFG) from the AST. This way I can get a tree representation of the source. To clarify, both the source code and the implementation language is Java.

编辑 - 我主要关心的是从AST获得控制流图(CFG)。这样我就可以获得源代码的树形表示。为了澄清,源代码和实现语言都是Java。

6 个解决方案

#1


9  

Usually CFGs are computed on a lower-level representation (e.g. JVM bytecode). Someone did a thesis on such things a few years ago. There might be a helpful way described in there for how to get at that representation.

通常,CFG是在较低级别的表示(例如JVM字节码)上计算的。几年前有人就此类事做过论文。在那里可能有一种有用的方式来描述如何获得该表示。

Since your source and target languages are the same, there's no code generation step -- you're already done! However, now you get to walk the AST. At each node of the AST, you have to ask yourself: is this a "jumping" instruction or not? Method calls and if statements are examples of jumping instructions. So are loop constructs (such as for and while). Instructions such as addition and multiplication are non-jumping.

由于源语言和目标语言相同,因此没有代码生成步骤 - 您已经完成了!但是,现在你可以走AST了。在AST的每个节点,你必须问自己:这是一个“跳跃”指令吗?方法调用和if语句是跳转指令的示例。循环结构也是如此(例如for和while)。加法和乘法等指令不跳转。

First associate with each java statement a node in the CFG, along with an entry and exit node. As a first approximation, walk the tree and:

首先将每个java语句与CFG中的节点以及入口和出口节点相关联。作为第一个近似,走树和:

  1. if the current statement is a method call, figure out where the entry node is for the corresponding body of that method call, and make an edge pointing from the current statement to that entry node. if the statement is a method return, enumerate the places that could have called it and add an edge to those.
  2. 如果当前语句是方法调用,请确定该条目调用的相应主体的入口节点的位置,并使边缘从当前语句指向该入口节点。如果语句是方法返回,则枚举可以调用它的位置并为其添加边。

  3. for each non-jumping statement, make an edge between it and the next statement.
  4. 对于每个非跳转语句,在它与下一个语句之间建立一个优势。

This will give you some kind of CFG. The procedure is slightly hairy in step 2 because the method called may be declared in a library, and not elsewhere in the AST -- if so, either don't make an edge or make an edge to a special node representing the entry to that library method.

这会给你一些CFG。在步骤2中,该过程略显毛茸茸,因为调用的方法可以在库中声明,而不是在AST中的其他地方声明 - 如果是这样,要么不为表示该条目的特殊节点创建边缘或边缘库方法。

Does this make sense?

这有意义吗?

#2


4  

Producing a full control flow graph that really takes into account all the language issues is harder than it looks. Not only do you have to identify what appears to be the "basic blocks", but you have to identify function calls (sort of easy, but identifying the target might be harder), where behind-the-scenes operations such as class initializers can happen. and to worry about the points where exceptions can occur and where control goes if an exception does occur.

生成一个真正考虑到所有语言问题的完整控制流图表比看起来更难。你不仅需要识别看起来像“基本块”的东西,而且你必须识别函数调用(有点简单,但识别目标可能更难),其中幕后操作如类初始化程序可以发生。并且担心可能发生异常的点以及如果发生异常则控制发生的地方。

If you examine most languages carefully, they will also be clear about ordering of evaluation of computations in expressions, and this matters if you have two side-effects in an expression; the control flow should reflect the order (or the non-order, if it isn't defined).

如果仔细检查大多数语言,他们也会清楚地了解表达式中计算的评估顺序,如果你在表达式中有两个副作用,这就很重要;控制流应反映顺序(或非顺序,如果未定义)。

Maybe you only want an abstraction of the control flow having the basic blocks and the conditionals. That's obviously a bit easier.

也许你只想要一个具有基本块和条件的控制流的抽象。这显然有点容易。

In either case (simple CFG or full CFG), you need to walk the AST, at each point having a reference to possible control flow targets (e.g., for most cases, such as IF statements, there are two flow targets: the THEN and ELSE clauses). At each node, link that node to the appropriate control flow target, possibly replacing the flow targets (e.g., when you encounter an IF).

在任何一种情况下(简单的CFG或完整的CFG),您需要在每个点上引用AST,参考可能的控制流目标(例如,对于大多数情况,例如IF语句,有两个流目标:THEN和ELSE条款)。在每个节点处,将该节点链接到适当的控制流目标,可能替换流目标(例如,当您遇到IF时)。

To do this for the full language semantics of Java (or C) is quite a lot of work. You may want to simply use a tool that computes this off-the-shelf. See http://www.semanticdesigns.com/Products/DMS/FlowAnalysis.html for what this really looks like, coming out of our tools.

为Java(或C)的完整语言语义执行此操作是相当多的工作。您可能只想使用一种计算现成的工具。请参阅http://www.semanticdesigns.com/Products/DMS/FlowAnalysis.html,了解这些工具的真实情况。

#3


1  

Based on some comments, it sounds like the OP really wants to do code generation -- to convert the AST into a lower-level sequence of instructions based on basic blocks and jump points.

根据一些评论,听起来OP确实想要进行代码生成 - 将AST转换为基于基本块和跳转点的低级指令序列。

Code generation is very language-specific, and a lot of work has been put into this topic. Before you do code generation you need to know your target language -- whether it be assembler or simply some other high-level language. Once you have identified this, you simply need to walk the AST and generate a sequence of instructions that implements the code in the AST. (I say this is simple, but it can be hard -- it's hard to generalise because the considerations here are pretty language-specific.)

代码生成是特定于语言的,并且已经在该主题中进行了大量工作。在进行代码生成之前,您需要了解目标语言 - 无论是汇编语言还是其他高级语言。一旦确定了这一点,您只需要遍历AST并生成一系列指令,以实现AST中的代码。 (我说这很简单,但可能很难 - 很难概括,因为这里的考虑是非常特定的语言。)

The representation you choose for code generation will contain the control-flow graph, implicitly or explicitly. If your target language is fairly low-level (close to assembler), then the control-flow graph should be relatively easy to extract.

您为代码生成选择的表示将隐式或显式地包含控制流图。如果您的目标语言相当低级(接近汇编程序),那么控制流图应该相对容易提取。

(Please comment if you'd like more clarification.)

(如果您想进一步澄清,请发表评论。)

#4


-1  

Did you ever tryed ANTLR Studio? It does not generate the hole AST graph, but for review, its already quite helpful.

你有没有尝试过ANTLR Studio?它没有生成孔AST图,但是对于审查,它已经非常有用。

#5


-1  

When I have done this in the past, I used graphviz, in particular the dot tool, to generate the graph. I created the dot input file by actually traversing the control-flow graph at compile time.

当我在过去做过这个时,我使用graphviz,特别是点工具来生成图形。我通过在编译时实际遍历控制流图来创建点输入文件。

Graph layout is a hard problem, and graphviz does an excellent job. It can output to ps, pdf, and various image formats, and the layout is usually pretty intuitive to look at. I highly recommend it.

图形布局是一个难题,graphviz做得很好。它可以输出为ps,pdf和各种图像格式,布局通常非常直观。我强烈推荐它。

#6


-1  

I don't think I'll be able to answer your question in a way that you are maybe looking for since I don't know of any way in ANTLR to produce a CFG with or without an AST. But, in short you would use what ANTLR produces to generate a separate Java program to produce a CFG. You would utilize the ANTLR generated syntax tree as input to generate your CFG in a separate Java program of your own creation. At this point you are, in essence, building a compiler. The difference between your "compiler" and a JVM is that your output is a visual representation (CFG) of how a program branches its various execution paths and a JVM/Java compiler produces code for execution on a real machine (CPU).

我不认为我能够以你可能正在寻找的方式回答你的问题,因为我不知道ANTLR中是否有任何方式生成有或没有AST的CFG。但是,简而言之,您将使用ANTLR生成的单独Java程序来生成CFG。您可以使用ANTLR生成的语法树作为输入,在您自己创建的单独Java程序中生成CFG。此时,您实际上构建了一个编译器。 “编译器”和JVM之间的区别在于输出是程序如何分支其各种执行路径的可视化表示(CFG),JVM / Java编译器生成用于在真实机器(CPU)上执行的代码。

An analogy is if someone sits down to write a book (in English for example), the individual words used in sentences are the TOKENS of a computer language, sentences are formed in a similar manner that context free grammars express valid computer code, and paragraphs & whole novels tell a story in a similar manner that semantic analysis/compilers/CFGs might produce/represent logically valid programs that actually do something useful and are more or less free of logic bugs. In other words, once you get past the issue of valid syntax (correct sentence structure), anyone can write a bunch of sentences on a page but only certain combinations of sentences produce text that actually does something (tell a story).

一个类比是,如果有人坐下来写一本书(例如英文),句子中使用的单词是计算机语言的TOKENS,句子的形成方式类似于上下文无关语法表达有效的计算机代码,段落整个小说以类似的方式讲述一个故事,语义分析/编译器/ CFG可能产生/代表逻辑上有效的程序,这些程序实际上做了一些有用的事情,或多或少地没有逻辑错误。换句话说,一旦你超越了有效语法(正确的句子结构)的问题,任何人都可以在页面上写一堆句子,但只有某些句子组合产生实际做某事的文本(讲故事)。

What you're asking about is that last piece - how to go about taking a syntax tree and transforming or interpreting what the AST actually does logically. And of course you would need to build a "compiler" for each language you want to do this for. Having a correct grammar doesn't tell you what a program does - just that a program is correct from a grammar perspective.

您要问的是最后一部分 - 如何采用语法树并转换或解释AST实际上逻辑上的内容。当然,您需要为每个要执行此操作的语言构建“编译器”。拥有正确的语法并不能告诉你程序的功能 - 从语法的角度来看,程序是正确的。

Linters and syntax highlighters and IDEs are all built around the idea of trying to make this last piece of the puzzle an easier and a more efficient task for humans.

Linters和语法高亮显示器和IDE都围绕着试图使这个难题的最后一部分变得更容易和更有效的人类任务的想法。

#1


9  

Usually CFGs are computed on a lower-level representation (e.g. JVM bytecode). Someone did a thesis on such things a few years ago. There might be a helpful way described in there for how to get at that representation.

通常,CFG是在较低级别的表示(例如JVM字节码)上计算的。几年前有人就此类事做过论文。在那里可能有一种有用的方式来描述如何获得该表示。

Since your source and target languages are the same, there's no code generation step -- you're already done! However, now you get to walk the AST. At each node of the AST, you have to ask yourself: is this a "jumping" instruction or not? Method calls and if statements are examples of jumping instructions. So are loop constructs (such as for and while). Instructions such as addition and multiplication are non-jumping.

由于源语言和目标语言相同,因此没有代码生成步骤 - 您已经完成了!但是,现在你可以走AST了。在AST的每个节点,你必须问自己:这是一个“跳跃”指令吗?方法调用和if语句是跳转指令的示例。循环结构也是如此(例如for和while)。加法和乘法等指令不跳转。

First associate with each java statement a node in the CFG, along with an entry and exit node. As a first approximation, walk the tree and:

首先将每个java语句与CFG中的节点以及入口和出口节点相关联。作为第一个近似,走树和:

  1. if the current statement is a method call, figure out where the entry node is for the corresponding body of that method call, and make an edge pointing from the current statement to that entry node. if the statement is a method return, enumerate the places that could have called it and add an edge to those.
  2. 如果当前语句是方法调用,请确定该条目调用的相应主体的入口节点的位置,并使边缘从当前语句指向该入口节点。如果语句是方法返回,则枚举可以调用它的位置并为其添加边。

  3. for each non-jumping statement, make an edge between it and the next statement.
  4. 对于每个非跳转语句,在它与下一个语句之间建立一个优势。

This will give you some kind of CFG. The procedure is slightly hairy in step 2 because the method called may be declared in a library, and not elsewhere in the AST -- if so, either don't make an edge or make an edge to a special node representing the entry to that library method.

这会给你一些CFG。在步骤2中,该过程略显毛茸茸,因为调用的方法可以在库中声明,而不是在AST中的其他地方声明 - 如果是这样,要么不为表示该条目的特殊节点创建边缘或边缘库方法。

Does this make sense?

这有意义吗?

#2


4  

Producing a full control flow graph that really takes into account all the language issues is harder than it looks. Not only do you have to identify what appears to be the "basic blocks", but you have to identify function calls (sort of easy, but identifying the target might be harder), where behind-the-scenes operations such as class initializers can happen. and to worry about the points where exceptions can occur and where control goes if an exception does occur.

生成一个真正考虑到所有语言问题的完整控制流图表比看起来更难。你不仅需要识别看起来像“基本块”的东西,而且你必须识别函数调用(有点简单,但识别目标可能更难),其中幕后操作如类初始化程序可以发生。并且担心可能发生异常的点以及如果发生异常则控制发生的地方。

If you examine most languages carefully, they will also be clear about ordering of evaluation of computations in expressions, and this matters if you have two side-effects in an expression; the control flow should reflect the order (or the non-order, if it isn't defined).

如果仔细检查大多数语言,他们也会清楚地了解表达式中计算的评估顺序,如果你在表达式中有两个副作用,这就很重要;控制流应反映顺序(或非顺序,如果未定义)。

Maybe you only want an abstraction of the control flow having the basic blocks and the conditionals. That's obviously a bit easier.

也许你只想要一个具有基本块和条件的控制流的抽象。这显然有点容易。

In either case (simple CFG or full CFG), you need to walk the AST, at each point having a reference to possible control flow targets (e.g., for most cases, such as IF statements, there are two flow targets: the THEN and ELSE clauses). At each node, link that node to the appropriate control flow target, possibly replacing the flow targets (e.g., when you encounter an IF).

在任何一种情况下(简单的CFG或完整的CFG),您需要在每个点上引用AST,参考可能的控制流目标(例如,对于大多数情况,例如IF语句,有两个流目标:THEN和ELSE条款)。在每个节点处,将该节点链接到适当的控制流目标,可能替换流目标(例如,当您遇到IF时)。

To do this for the full language semantics of Java (or C) is quite a lot of work. You may want to simply use a tool that computes this off-the-shelf. See http://www.semanticdesigns.com/Products/DMS/FlowAnalysis.html for what this really looks like, coming out of our tools.

为Java(或C)的完整语言语义执行此操作是相当多的工作。您可能只想使用一种计算现成的工具。请参阅http://www.semanticdesigns.com/Products/DMS/FlowAnalysis.html,了解这些工具的真实情况。

#3


1  

Based on some comments, it sounds like the OP really wants to do code generation -- to convert the AST into a lower-level sequence of instructions based on basic blocks and jump points.

根据一些评论,听起来OP确实想要进行代码生成 - 将AST转换为基于基本块和跳转点的低级指令序列。

Code generation is very language-specific, and a lot of work has been put into this topic. Before you do code generation you need to know your target language -- whether it be assembler or simply some other high-level language. Once you have identified this, you simply need to walk the AST and generate a sequence of instructions that implements the code in the AST. (I say this is simple, but it can be hard -- it's hard to generalise because the considerations here are pretty language-specific.)

代码生成是特定于语言的,并且已经在该主题中进行了大量工作。在进行代码生成之前,您需要了解目标语言 - 无论是汇编语言还是其他高级语言。一旦确定了这一点,您只需要遍历AST并生成一系列指令,以实现AST中的代码。 (我说这很简单,但可能很难 - 很难概括,因为这里的考虑是非常特定的语言。)

The representation you choose for code generation will contain the control-flow graph, implicitly or explicitly. If your target language is fairly low-level (close to assembler), then the control-flow graph should be relatively easy to extract.

您为代码生成选择的表示将隐式或显式地包含控制流图。如果您的目标语言相当低级(接近汇编程序),那么控制流图应该相对容易提取。

(Please comment if you'd like more clarification.)

(如果您想进一步澄清,请发表评论。)

#4


-1  

Did you ever tryed ANTLR Studio? It does not generate the hole AST graph, but for review, its already quite helpful.

你有没有尝试过ANTLR Studio?它没有生成孔AST图,但是对于审查,它已经非常有用。

#5


-1  

When I have done this in the past, I used graphviz, in particular the dot tool, to generate the graph. I created the dot input file by actually traversing the control-flow graph at compile time.

当我在过去做过这个时,我使用graphviz,特别是点工具来生成图形。我通过在编译时实际遍历控制流图来创建点输入文件。

Graph layout is a hard problem, and graphviz does an excellent job. It can output to ps, pdf, and various image formats, and the layout is usually pretty intuitive to look at. I highly recommend it.

图形布局是一个难题,graphviz做得很好。它可以输出为ps,pdf和各种图像格式,布局通常非常直观。我强烈推荐它。

#6


-1  

I don't think I'll be able to answer your question in a way that you are maybe looking for since I don't know of any way in ANTLR to produce a CFG with or without an AST. But, in short you would use what ANTLR produces to generate a separate Java program to produce a CFG. You would utilize the ANTLR generated syntax tree as input to generate your CFG in a separate Java program of your own creation. At this point you are, in essence, building a compiler. The difference between your "compiler" and a JVM is that your output is a visual representation (CFG) of how a program branches its various execution paths and a JVM/Java compiler produces code for execution on a real machine (CPU).

我不认为我能够以你可能正在寻找的方式回答你的问题,因为我不知道ANTLR中是否有任何方式生成有或没有AST的CFG。但是,简而言之,您将使用ANTLR生成的单独Java程序来生成CFG。您可以使用ANTLR生成的语法树作为输入,在您自己创建的单独Java程序中生成CFG。此时,您实际上构建了一个编译器。 “编译器”和JVM之间的区别在于输出是程序如何分支其各种执行路径的可视化表示(CFG),JVM / Java编译器生成用于在真实机器(CPU)上执行的代码。

An analogy is if someone sits down to write a book (in English for example), the individual words used in sentences are the TOKENS of a computer language, sentences are formed in a similar manner that context free grammars express valid computer code, and paragraphs & whole novels tell a story in a similar manner that semantic analysis/compilers/CFGs might produce/represent logically valid programs that actually do something useful and are more or less free of logic bugs. In other words, once you get past the issue of valid syntax (correct sentence structure), anyone can write a bunch of sentences on a page but only certain combinations of sentences produce text that actually does something (tell a story).

一个类比是,如果有人坐下来写一本书(例如英文),句子中使用的单词是计算机语言的TOKENS,句子的形成方式类似于上下文无关语法表达有效的计算机代码,段落整个小说以类似的方式讲述一个故事,语义分析/编译器/ CFG可能产生/代表逻辑上有效的程序,这些程序实际上做了一些有用的事情,或多或少地没有逻辑错误。换句话说,一旦你超越了有效语法(正确的句子结构)的问题,任何人都可以在页面上写一堆句子,但只有某些句子组合产生实际做某事的文本(讲故事)。

What you're asking about is that last piece - how to go about taking a syntax tree and transforming or interpreting what the AST actually does logically. And of course you would need to build a "compiler" for each language you want to do this for. Having a correct grammar doesn't tell you what a program does - just that a program is correct from a grammar perspective.

您要问的是最后一部分 - 如何采用语法树并转换或解释AST实际上逻辑上的内容。当然,您需要为每个要执行此操作的语言构建“编译器”。拥有正确的语法并不能告诉你程序的功能 - 从语法的角度来看,程序是正确的。

Linters and syntax highlighters and IDEs are all built around the idea of trying to make this last piece of the puzzle an easier and a more efficient task for humans.

Linters和语法高亮显示器和IDE都围绕着试图使这个难题的最后一部分变得更容易和更有效的人类任务的想法。