Reading data from Apache Spark and writing data to S3

Time: 2020-12-19 19:14:17

I have run the same application twice: once on Community Edition (only 6 GB of memory, us-west) and once on a cluster with one driver and one worker (60 GB of memory, eu-central). Surprisingly, the app on Community Edition ran much faster at reading data from and writing data to S3.

I haven't found any explanation for this poor result, as our clusters are much more powerful than Community Edition. I have even tried one driver with one worker (up to 60), and it still takes a lot longer than Community Edition. We use S3 as the data source in our application: we read a 9-million-row .csv file, run some analysis on it, and write the result back to S3. We have mounted our buckets to DBFS.

 df=sqlContext.read.format('com.databricks.spark.csv').options(delimiter=',',header='true',inferschema='true').load("dbfs:/mnt/mount1/2016/rrdb_succesful_sales/*")

The code I use to write to S3:

top_profit_product.coalesce(1).write.csv("dbfs:/mnt/mount2/tappalytics/profitability_report/weekly/top_profit_product",mode='overwrite',header=True)

I don't think there is any problem with my code, is there? Any advice?

1 Answer

#1


This is the Databricks filesystem here, not the OSS Apache S3 clients or the Amazon EMR driver, so you'll have to take it up with them.

For the ASF code, the s3a client delays come from: the number of HTTP requests, bandwidth to S3, and seek times on HDD. HTTPS request setup/teardown is very expensive; the latest s3a clients do a lot less seeking, though you have to choose the right option for your datasource.
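
The "right option" is not named above. One setting that fits the description is the S3A input policy `fs.s3a.experimental.input.fadvise` (Hadoop 2.8+), which accepts `sequential`, `normal`, or `random`. A minimal sketch, assuming that this is the setting meant and that it is handed to Spark through the `spark.hadoop.` prefix; the helper name and the format-to-policy mapping are illustrative only, and it applies to the s3a client rather than to DBFS mounts:

```python
# Hypothetical helper: pick an S3A input policy for a given file format.
# Assumption: columnar formats are read with many seeks, CSV start to end.
COLUMNAR_FORMATS = {"orc", "parquet"}


def s3a_input_policy_conf(file_format):
    """Return Spark conf entries that select an S3A read policy."""
    if file_format.lower() in COLUMNAR_FORMATS:
        policy = "random"      # many short ranged reads
    else:
        policy = "sequential"  # one long GET per file
    return {"spark.hadoop.fs.s3a.experimental.input.fadvise": policy}


print(s3a_input_policy_conf("csv"))
```

The returned entries could be passed to `SparkSession.builder.config(key, value)` before the session is created.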

If you are working with an S3 bucket on a site different from where your VMs are, that'll be your bottleneck. You will be bandwidth limited and billed per MB, and you are better off skipping 500K of data than seeking to a new location by aborting the active HTTP GET and setting up a new TCP stream.
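
The skip-versus-seek trade-off can be put into rough numbers. This is a back-of-the-envelope sketch, not a measurement: the 10 MB/s bandwidth and the 0.2 s cost of aborting the GET and opening a new HTTPS stream are assumed figures, and `cheaper_to_skip` is a made-up name.

```python
def cheaper_to_skip(gap_bytes, bandwidth_bytes_per_s, reconnect_s):
    """True if reading and discarding the gap beats opening a new stream."""
    skip_s = gap_bytes / bandwidth_bytes_per_s
    return skip_s < reconnect_s


MB = 1024 * 1024
# Assumed link: 10 MB/s to the bucket, 0.2 s to abort and reconnect.
print(cheaper_to_skip(500 * 1024, 10 * MB, 0.2))  # 500K gap -> True
print(cheaper_to_skip(50 * MB, 10 * MB, 0.2))     # 50 MB gap -> False
```

With a slower cross-region link the skip takes longer, but the reconnect cost usually grows as well, so the figures are worth measuring for your own setup.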

Tip: s3a://landsat-pds/scene_list.gz makes for a good 20MB test source; it is hosted in US-east, and AWS pays for your downloads. Spark 2 also adds its own CSV reader.
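
As a sketch of that last point: on Spark 2.x the read from the question could go through the built-in `spark.read.csv` instead of the `com.databricks.spark.csv` package. The function name `read_sales_csv` is made up for illustration, and `spark` is assumed to be an existing SparkSession; the options mirror the ones in the question.

```python
def read_sales_csv(spark, path):
    """Read CSV files with the CSV reader built into Spark 2.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.read.csv(path, sep=",", header=True, inferSchema=True)


# Usage (needs a running Spark 2.x session):
# df = read_sales_csv(spark, "dbfs:/mnt/mount1/2016/rrdb_succesful_sales/*")
```

Note that `inferSchema=True` makes Spark read the data an extra time to work out column types, which matters when every read goes to S3; passing an explicit `schema` avoids that pass.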
