How to take advantage of Spark 2.0 "whole-stage code generation"

Time: 2022-06-22 02:31:57

I have been reading many articles about Spark 2.0 "whole-stage code generation". Since the technique optimizes the code at the code-generation stage, I have several questions about it:

Q1. Can Python or R take advantage of this technique?

Q2. In Scala/Java, how can I take advantage of this technique? Do I have to express the whole query using Spark's API, or is a plain SQL query string good enough? For example, can each of the following programs take advantage of "whole-stage code generation"?

case 1:

sparksession.sql("select * from a join b on a.id = b.id")

case 2:

val table_a = sparksession.sql("select * from a")
val table_b = sparksession.sql("select * from b")
val table_c = table_a.join(table_b, table_a(COL_ADID) === table_b(COL_ADID))

Q3. If case 1 in Q2 is able to utilize "whole-stage code generation", what about reading the query string from an external file, like this:

val query = scala.io.Source.fromFile(queryfile).mkString
sparksession.sql(query)

In the above code, the compiler really doesn't know what the query string looks like at compile time. Can Spark still utilize the "whole-stage code generation" technique?

1 Solution

#1

  1. All languages that use the Spark SQL API can benefit from codegen, as long as they don't use language-specific extensions (Python UDFs, or dapply and gapply in R).

  2. Both the SQL and DataFrame APIs are supported, and the way you provide the query doesn't matter. Codegen is an internal process applied between user input and query execution, so it happens at runtime during query planning, not at Scala compile time; see the sketch below for one way to verify it.

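A quick way to check whether whole-stage code generation actually applies to a given plan, regardless of how the query was written, is to inspect the physical plan. Here is a minimal sketch, assuming a Spark 2.x or later session; the app name, local master, and the range-based DataFrame are illustrative stand-ins for the asker's tables:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("codegen-check")   // illustrative name
  .master("local[*]")          // illustrative: any master works
  .getOrCreate()

// A small aggregation built from an in-memory range instead of real tables
val df = spark.range(1000).selectExpr("id % 10 as key")
  .groupBy("key")
  .count()

// Operators fused into a generated function appear under a WholeStageCodegen
// node and are printed with a leading "*" (or "*(n)" in later versions)
df.explain()

// Optionally dump the generated Java source itself
import org.apache.spark.sql.execution.debug._
df.debugCodegen()

Because explain() reflects the plan Spark actually executes, the same check works whether the query came from a SQL string (even one read from a file at runtime) or from the DataFrame API.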