Representing an abstract syntax tree in C

Time: 2023-01-25 09:19:21

I'm implementing a compiler for a simple toy language in C. I have a working scanner and parser, and a reasonable background on the conceptual function/construction of an AST. My question is related to the specific way to represent an AST in C. I've come across three styles pretty frequently in different texts/resources online:

One struct per type of node.

This has a base node "class" (struct) that is the first field in all the child structs. The base node contains an enum that stores the type of node (constant, binary operator, assignment, etc.). Members of the structs are accessed using a set of macros, with one set per struct. It looks something like this:

struct ast_node_base {
    enum {CONSTANT, ADD, SUB, ASSIGNMENT} class;
};

/* The base is embedded by value (not as a pointer) so that it really is
   the first field, and a pointer to any node can be cast to the base. */
struct ast_node_constant {
    struct ast_node_base base;
    int value;
};

struct ast_node_add {
    struct ast_node_base base;
    struct ast_node_base *left;
    struct ast_node_base *right;
};

struct ast_node_assign {
    struct ast_node_base base;
    struct ast_node_base *left;
    struct ast_node_base *right;
};

/* No trailing semicolons and fully parenthesized, so the macros can be
   used inside larger expressions. */
#define CLASS(node) (((struct ast_node_base *)(node))->class)

#define ADD_LEFT(node) (((struct ast_node_add *)(node))->left)
#define ADD_RIGHT(node) (((struct ast_node_add *)(node))->right)

#define ASSIGN_LEFT(node) (((struct ast_node_assign *)(node))->left)
#define ASSIGN_RIGHT(node) (((struct ast_node_assign *)(node))->right)
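
As a rough illustration of how this first style is typically walked, here is a minimal sketch of a recursive evaluator. It assumes the base struct is embedded by value as the first member, so a pointer to any node can be converted to a pointer to its base and back. The names new_constant, new_add and eval are hypothetical helpers, not part of the question, and the enum is cut down to two classes to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>

enum ast_class { CONSTANT, ADD };

struct ast_node_base { enum ast_class class; };

struct ast_node_constant {
    struct ast_node_base base;   /* must be the first member */
    int value;
};

struct ast_node_add {
    struct ast_node_base base;   /* must be the first member */
    struct ast_node_base *left;
    struct ast_node_base *right;
};

/* Hypothetical constructor: allocates a constant node, returns its base. */
struct ast_node_base *new_constant(int value) {
    struct ast_node_constant *n = malloc(sizeof *n);
    assert(n != NULL);
    n->base.class = CONSTANT;
    n->value = value;
    return &n->base;
}

/* Hypothetical constructor: allocates an add node, returns its base. */
struct ast_node_base *new_add(struct ast_node_base *left,
                              struct ast_node_base *right) {
    struct ast_node_add *n = malloc(sizeof *n);
    assert(n != NULL);
    n->base.class = ADD;
    n->left = left;
    n->right = right;
    return &n->base;
}

/* Dispatch on the class tag, then cast down to the concrete struct. */
int eval(const struct ast_node_base *node) {
    assert(node != NULL);
    switch (node->class) {
    case CONSTANT:
        return ((const struct ast_node_constant *)node)->value;
    case ADD: {
        const struct ast_node_add *add = (const struct ast_node_add *)node;
        return eval(add->left) + eval(add->right);
    }
    }
    assert(!"unknown node class");
    return 0;
}
```

Every access to a concrete field goes through a cast that the compiler cannot verify against the tag, which is the risk the question is pointing at.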

One struct per layout of node.

This appears to be mostly the same as the above layout, except that instead of having ast_node_add and ast_node_assign it would have a single ast_node_binary to represent both, because the two structs have the same layout and differ only in the class stored in their base node. The advantage seems to be a more uniform set of macros (LEFT(node) for every node with a left and a right, instead of one pair of macros per node type), but the disadvantage is that C type checking becomes less useful (there would be no way to detect an ast_node_assign where there should only be an ast_node_add, for example).
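
A minimal sketch of what this second style might look like, assuming the base struct is embedded by value as the first member. The LEFT, RIGHT and CLASS macros follow the naming used in the question; count_nodes is a hypothetical traversal added only to show the shared macros in use.

```c
#include <assert.h>
#include <stddef.h>

enum ast_class { CONSTANT, ADD, SUB, ASSIGNMENT };

struct ast_node_base { enum ast_class class; };

struct ast_node_constant {
    struct ast_node_base base;
    int value;
};

/* One struct shared by ADD, SUB and ASSIGNMENT: only the tag differs. */
struct ast_node_binary {
    struct ast_node_base base;
    struct ast_node_base *left;
    struct ast_node_base *right;
};

#define CLASS(node) (((struct ast_node_base *)(node))->class)
#define LEFT(node)  (((struct ast_node_binary *)(node))->left)
#define RIGHT(node) (((struct ast_node_binary *)(node))->right)

/* Hypothetical traversal: counts the nodes in a tree. All binary
   classes share one case because they share one layout. */
size_t count_nodes(struct ast_node_base *node) {
    assert(node != NULL);
    switch (CLASS(node)) {
    case ADD:
    case SUB:
    case ASSIGNMENT:
        return 1 + count_nodes(LEFT(node)) + count_nodes(RIGHT(node));
    case CONSTANT:
        return 1;
    }
    return 0;
}
```

Because ADD and ASSIGNMENT nodes now have the same C type, a function that should only accept additions has to check the tag at run time.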

One struct total, with a union to hold different types of node data.

A better explanation of this than I can give can be found here. Using the types from the previous example it would look like:

struct ast_node {
    enum { CONSTANT, ADD, SUB, ASSIGNMENT } class;
    union {                            /* anonymous union (C11) */
        int value;                     /* used when class == CONSTANT */
        struct {
            struct ast_node *left;
            struct ast_node *right;
        } op;                          /* used by the binary classes */
    };
};

I'm inclined to like the third option the most because it makes recursive traversal much easier (lots of pointer casting is avoided in favor of the union), but it also doesn't take advantage of C type checking. The first option seems the most dangerous in that it relies on pointers to structs being cast to access the members of any node (even different members of the same node require different casts to access: base vs. left), but these casts are type checked, so that might be moot. The second option seems to me like the worst of both worlds, although maybe I'm missing something.

Which of these three schemes is the best, and why? Is there a better fourth option I haven't come across yet? I'm assuming none of them is a "one size fits all" solution, so, if it matters, the language I'm implementing is a statically typed imperative language, almost a small subset of C.

I also have a specific question about the third (union) layout: if I use only the value field, will there be empty space following the value to accommodate the possibility of op being written to?
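
For reference, this can be checked with sizeof. A union is at least as large as its largest member, so a node that only stores value still occupies the room that op would need. The sketch below gives the union a tag (ast_payload, a hypothetical name) so its size can be inspected on its own; unused_bytes_in_constant is likewise a hypothetical helper, and the actual numbers depend on the platform.

```c
#include <assert.h>
#include <stddef.h>

struct ast_node;

/* Hypothetical tagged version of the union, so sizeof can be applied. */
union ast_payload {
    int value;
    struct {
        struct ast_node *left;
        struct ast_node *right;
    } op;
};

struct ast_node {
    enum { CONSTANT, ADD, SUB, ASSIGNMENT } class;
    union ast_payload data;
};

/* The union must be able to hold its largest member. */
static_assert(sizeof(union ast_payload) >= 2 * sizeof(struct ast_node *),
              "union is at least as large as the op member");
static_assert(sizeof(struct ast_node) >= sizeof(union ast_payload),
              "every node carries the full union");

/* Bytes of the union left unused when only value is stored. */
size_t unused_bytes_in_constant(void) {
    return sizeof(union ast_payload) - sizeof(int);
}
```

On a typical 64-bit platform with 4-byte int and 8-byte pointers this would come to 12 unused bytes per constant node, though that figure is an assumption about the target and not something the C standard fixes.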

2 solutions

#1


16  

You can make any of these work.

I prefer the union layout, because then all nodes have "the same" layout.

[You may find it useful to have a "child sublist" option, e.g., an arbitrarily big, dynamic array of children, instead of having left- or right-leaning lists.]
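
One way the "child sublist" idea could be sketched is a node that owns a growable array of child pointers. Everything here is an assumption made for illustration: the BLOCK class, the field names, the doubling growth policy, and the helpers ast_new, ast_add_child and ast_free.

```c
#include <assert.h>
#include <stdlib.h>

enum ast_class { CONSTANT, BLOCK };

struct ast_node {
    enum ast_class class;
    int value;                   /* only meaningful for CONSTANT */
    size_t child_count;          /* children currently stored */
    size_t child_capacity;       /* slots allocated in children */
    struct ast_node **children;  /* dynamic array, NULL when empty */
};

/* Allocates a node with no children; returns NULL on failure. */
struct ast_node *ast_new(enum ast_class class, int value) {
    struct ast_node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->class = class;
    n->value = value;
    n->child_count = 0;
    n->child_capacity = 0;
    n->children = NULL;
    return n;
}

/* Appends child, doubling the array when full. Returns 0 on success,
   -1 if the array could not be grown (parent is left unchanged). */
int ast_add_child(struct ast_node *parent, struct ast_node *child) {
    assert(parent != NULL && child != NULL);
    if (parent->child_count == parent->child_capacity) {
        size_t new_cap = parent->child_capacity ? parent->child_capacity * 2 : 4;
        struct ast_node **grown =
            realloc(parent->children, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        parent->children = grown;
        parent->child_capacity = new_cap;
    }
    parent->children[parent->child_count++] = child;
    return 0;
}

/* Frees a node and, recursively, all of its children. */
void ast_free(struct ast_node *node) {
    if (node == NULL)
        return;
    for (size_t i = 0; i < node->child_count; i++)
        ast_free(node->children[i]);
    free(node->children);
    free(node);
}
```

A statement list or argument list then becomes a single node with N children, instead of a chain of N binary nodes leaning to one side.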

You are going to find that this issue isn't the one that makes building your compiler hard. Rather, it is having symbol tables, performing various kinds of analyses, choosing a machine-level IR, building a code generator, and doing code optimizations. Then you're going to encounter real users and you'll discover what you really did wrong :-}

I'd pick one and run with it, so that you have a chance to get near the other issues.

#2


1  

Ira Baxter gave you a good, simple, and forward-looking answer; especially of note are the problems one will encounter down the road. So I will focus on this question:

Is there a better fourth option I haven't come across yet?

You are using an imperative language to write a compiler and are having problems designing the data structure for the concept of a node in the AST. In the world of functional languages such as ML, OCaml, Haskell, and F#, one would use a tagged union to hold all of the different node types in one data structure, which is basically what you have created.

I don't expect that the OP will switch to a functional language for this problem, but if others regularly deal with trees then they might find it of value to learn a functional language and use it for problems related to trees.
