What opcode dispatch strategies are used for efficient interpreters?

Asked: 2022-09-11 20:45:24

What techniques promote efficient opcode dispatch to make a fast interpreter? Are there techniques that only work well on modern hardware, and others that no longer work well because of hardware advances? What trade-offs must be made between ease of implementation, speed, and portability?

I'm pleased that Python's C implementation is finally moving beyond a simple switch (opcode) {...} implementation for opcode dispatch to indirect threading as a compile-time option, but I'm less pleased that it took them 20 years to get there. Maybe if we document these strategies on * the next language will get fast faster.

8 Answers

#1


There are a number of papers on different kinds of dispatch:

M. Anton Ertl and David Gregg, Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters, in Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI 03), pp. 278-288, San Diego, California, June 2003.

M. Anton Ertl and David Gregg, The behaviour of efficient virtual machine interpreters on modern architectures, in Proceedings of the 7th European Conference on Parallel Computing (Europar 2001), pp. 403-412, LNCS 2150, Manchester, August 2001.

An excellent summary is provided by Yunhe Shi in his PhD thesis.

Also, someone discovered a new technique a few years ago that is valid ANSI C.

#2


Before you start anything, check out Lua.

It's small (about 150 KB), pure ANSI C, and works on anything that has a C compiler. Very fast.

And most important, the source code is clean and readable. Worth checking out.

#3


Indirect threading is a strategy where each opcode implementation ends with its own jump to the next opcode, instead of returning to a central dispatch loop. The patch to the Python interpreter looks something like this:

add:
    result = a + b;
    goto *opcode_targets[*next_instruction++];

opcode_targets maps each instruction in the language's bytecode to the memory location of that opcode's implementation. This is faster because the processor's branch predictor can make a separate prediction for each bytecode, in contrast to a switch statement, whose single indirect branch must predict for every opcode at once.

The compiler must support computed goto for this to work, which mostly means gcc (clang supports the same extension).

Direct threading is similar, but in direct threading the array of opcodes is replaced with pointers to the opcode implementations, like so:

goto *next_opcode_target++;

These techniques only help because modern processors are pipelined and must flush their pipelines (which is slow) on a mispredicted branch. Processor designers added branch prediction to avoid flushing the pipeline so often, but prediction only works for branches that tend to take a particular path.

#4


Just-in-time compilation is one.

#5


One big win is to store the source code in an intermediate form, rather than redoing lexical analysis and parsing during execution.

This can range all the way from just storing the tokens, through Forth-style threaded code, on up to JIT compilation.

#6


Benchmarking is a good technique for making anything fast on a given platform. Test, refine, test again, improve.

I don't think you can get a better answer than that. There are lots of techniques for building interpreters. But here's a tip: don't worry about trade-offs; just choose what you really need and pursue those goals.

#7


The question is a bit vague, but it seems you're asking about writing an interpreter.

Interpreters typically utilize traditional parsing components: lexer, parser, and abstract syntax tree (AST). This allows the designer to read and interpret valid syntax and build a tree structure of commands with associated operators, parameters, etc.

Once the input has been tokenized and parsed into AST form, the interpreter can begin executing by traversing the tree.

There are many options, but I've recently used ANTLR as a parser generator that can build parsers in various target languages, including C/C++ and C#.

#8


I found a blog post on threaded interpreter implementation that was useful.

The author describes GCC label-based threading and also how to do this in Visual Studio using inline assembler.

http://abepralle.wordpress.com/2009/01/25/how-not-to-make-a-virtual-machine-label-based-threading/

The results are interesting: he reports a 33% performance improvement when using GCC, but surprisingly the Visual Studio inline-assembly implementation is three times slower!
