Can I create an RDD on a worker in Spark Streaming?

Time: 2021-07-09 20:48:55

I want to know how I can create an RDD on a worker, say one containing a Map. This Map/RDD will be small, and I want the RDD to reside entirely on one machine/executor (I guess repartition(1) can achieve this). Further, I want to cache this Map/RDD on the local executor and use it for lookups in tasks running on that executor.

How can I do this?

2 solutions

#1


0  

You can create an RDD in your driver program using sc.parallelize(data). To store a Map, split it into two parts, key and value, and store them in an RDD/DataFrame as two separate columns.
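
As a rough illustration of this suggestion, here is a minimal PySpark sketch. The helper names map_to_pairs and build_lookup_rdd are made up for this example, and sc is assumed to be an already running SparkContext on the driver; only sc.parallelize comes from the answer itself.

```python
def map_to_pairs(lookup):
    """Turn a dict into a list of (key, value) tuples, one tuple per entry."""
    return list(lookup.items())


def build_lookup_rdd(sc, lookup):
    """Create a single-partition pair RDD from a small dict.

    `sc` is assumed to be an existing pyspark.SparkContext. This must be
    called from the driver program, not from inside a task.
    """
    # numSlices=1 keeps all pairs in one partition; Spark still decides
    # which executor ends up holding that partition.
    return sc.parallelize(map_to_pairs(lookup), numSlices=1)


# Usage on the driver (assumes pyspark is installed and `sc` exists):
#   rdd = build_lookup_rdd(sc, {"a": 1, "b": 2})
#   rdd.cache()       # keep the partition in memory once it is computed
#   rdd.lookup("a")   # returns the list of values stored under key "a"
```

Note that rdd.lookup is an action launched from the driver, so this form does not give tasks a local lookup table by itself.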

#2


0  

No, you cannot create an RDD on a worker node. Only the driver can create RDDs.

A broadcast variable seems to be the solution in your situation. It will send the data to all workers; however, if your map is small, that won't be an issue.

You cannot control which node your RDD's partitions will be placed on, so you cannot just do repartition(1): you don't know whether this RDD will end up on the same node as your tasks ;) A broadcast variable will be available on every node, so lookups will be very fast.
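
To make the broadcast approach concrete, here is a small PySpark sketch. The function names lookup_partition and lookup_with_broadcast and the parameter keys_rdd are placeholders invented for this example; sc is assumed to be an existing SparkContext, and sc.broadcast and mapPartitions are the standard PySpark calls.

```python
def lookup_partition(keys, table):
    """Pair each key with its value from `table` (None when the key is missing)."""
    return [(k, table.get(k)) for k in keys]


def lookup_with_broadcast(sc, lookup, keys_rdd):
    """Resolve every key in `keys_rdd` against a small dict shared via broadcast.

    `sc` is assumed to be an existing pyspark.SparkContext and `keys_rdd`
    an RDD of keys; both names are placeholders for this sketch.
    """
    # Created on the driver; Spark ships a read-only copy to each executor.
    table = sc.broadcast(lookup)
    # table.value is read inside the task, so it resolves to the
    # executor's local copy rather than going back to the driver.
    return keys_rdd.mapPartitions(lambda keys: lookup_partition(keys, table.value))


# Usage on the driver (assumes pyspark is installed and `sc` exists):
#   keys = sc.parallelize(["a", "b", "c"])
#   lookup_with_broadcast(sc, {"a": 1, "b": 2}, keys).collect()
```

Broadcasting once on the driver and reusing the same broadcast handle across jobs avoids shipping the map again for every task.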
