Sparkling Water

Date: 2021-04-18 10:26:21


2 It provides a way to initialize H2O services on each node in the Spark cluster and to access data stored in data structures of Spark and H2O.

3 The internal backend is the easiest to deploy; however, when Spark or YARN kills an executor, which is not unusual, the entire H2O cluster goes down because H2O does not support high availability.

4 The internal backend is the default behavior for Sparkling Water. Another way to change the backend type is to call the setExternalClusterMode() or setInternalClusterMode() method on the H2OConf class. H2OConf is a simple wrapper around SparkConf and inherits all properties in the Spark configuration.
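
Since H2OConf inherits the Spark configuration, the backend can presumably also be selected with a Spark property at submit time. The sketch below only builds the corresponding spark-submit arguments. It assumes the property name spark.ext.h2o.backend.cluster.mode from the Sparkling Water configuration reference; backend_conf_args is a hypothetical helper, not part of any library.

```python
# Spark property that Sparkling Water is assumed to read when picking the backend
# (name taken from the Sparkling Water configuration reference; verify for your version).
BACKEND_PROPERTY = "spark.ext.h2o.backend.cluster.mode"
VALID_MODES = ("internal", "external")


def backend_conf_args(mode):
    """Return the spark-submit arguments that select the given backend mode."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of %s, got %r" % (VALID_MODES, mode))
    return ["--conf", "%s=%s" % (BACKEND_PROPERTY, mode)]


if __name__ == "__main__":
    # Print the arguments as they would appear on the command line
    print(" ".join(backend_conf_args("external")))
```

The returned list can be appended to a spark-submit command; an unknown mode fails early with a ValueError instead of being passed on to Spark.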

5 It seems that installing Sparkling Water also installs PySpark and H2O: pip install h2o_pysparkling_2.3


1 Start Spark:  ./sbin/      ./sbin/ spark://zcy-VirtualBox:7077

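2 First run a very simple script to check that the environment is ready. For it to run successfully, the virtual machine's memory has to be increased (I changed it to 2 GB).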

from pysparkling import *
from pyspark.sql import SparkSession
import h2o

# Initiate SparkSession
spark = SparkSession.builder.appName("App name").getOrCreate()

# Initiate H2OContext
hc = H2OContext.getOrCreate(spark)

# Stop H2O and Spark services
print("")

./bin/spark-submit --master spark://zcy-VirtualBox:7077  --conf "spark.executor.memory=1g" /home/zcy/working/
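
The same command can be assembled from Python, which helps when the master URL or memory setting changes between runs. This is only a sketch: build_spark_submit is a hypothetical helper, and the script name check_env.py is a placeholder because the real file name is not shown above.

```python
import shlex


def build_spark_submit(master, script, conf=None, spark_home="."):
    """Build the spark-submit argument list for a PySparkling script."""
    cmd = [spark_home.rstrip("/") + "/bin/spark-submit", "--master", master]
    # Emit one --conf flag per Spark property, in a stable order
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", "%s=%s" % (key, value)]
    cmd.append(script)
    return cmd


cmd = build_spark_submit(
    master="spark://zcy-VirtualBox:7077",
    script="/home/zcy/working/check_env.py",  # hypothetical script name
    conf={"spark.executor.memory": "1g"},
)
# Show the command as a shell-safe string
print(" ".join(shlex.quote(part) for part in cmd))
```

The list form can be passed directly to subprocess.run(cmd) from the Spark home directory, which avoids shell quoting problems.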


3 Run a slightly more complex script:

import h2o
from datetime import datetime
from pyspark import SparkConf, SparkFiles
from pyspark.sql import Row, SparkSession
import os
from pysparkling import *


# Refine date column
def refine_date_col(data, col):
    data["Day"] = data[col].day()
    data["Month"] = data[col].month()
    data["Year"] = data[col].year()
    data["WeekNum"] = data[col].week()
    data["WeekDay"] = data[col].dayOfWeek()
    data["HourOfDay"] = data[col].hour()

    # Create weekend and season cols
    # Spring = Mar, Apr, May. Summer = Jun, Jul, Aug. Autumn = Sep, Oct. Winter = Nov, Dec, Jan, Feb.
    # data["Weekend"] = [ if x in ("Sun", "Sat") else for x in data["WeekDay"]]
    data["Weekend"] = ((data["WeekDay"] == "Sun") | (data["WeekDay"] == "Sat"))
    data["Season"] = data["Month"].cut([, , , , , ], ["Winter", "Spring", "Summer", "Autumn", "Winter"])


# This is just a helper function returning the path to the data files
def _locate(file_name):
    if os.path.isfile("/home/zcy/working/data_tst/" + file_name):
        return "/home/zcy/working/data_tst/" + file_name
    # Debug marker printed when the file is not found (the function then returns None)
    print("eeeeeeeeeeee")


spark = SparkSession.builder.appName("ChicagoCrimeTest").getOrCreate()
# Start H2O services
h2oContext = H2OContext.getOrCreate(spark)
# Define file names
chicagoAllWeather = "chicagoAllWeather.csv"
chicagoCensus = "chicagoCensus.csv"
chicagoCrimes10k = ""

# h2o.import_file expects cluster-relative path
f_weather = h2o.upload_file(_locate(chicagoAllWeather))
f_census = h2o.upload_file(_locate(chicagoCensus))
f_crimes = h2o.upload_file(_locate(chicagoCrimes10k))
print("")

# Transform weather table
# Remove 1st column (date)
f_weather = f_weather[:]

# Transform census table
# Remove all spaces from column names (causing problems in Spark SQL)
col_names = list(map(lambda s: s.strip().replace(' ', '_').replace('+', '_'), f_census.col_names))

# Update column names in the table
# f_weather.names = col_names
f_census.names = col_names

# Transform crimes table
# Drop useless columns
f_crimes = f_crimes[:]

# Set time zone to UTC for date manipulation
h2o.cluster().timezone = "Etc/UTC"

# Replace ' ' by '_' in column names
col_names = list(map(lambda s: s.replace(' ', '_'), f_crimes.col_names))
f_crimes.names = col_names
refine_date_col(f_crimes, "Date")
f_crimes = f_crimes.drop("Date")

# Expose H2O frames as Spark DataFrame
print("")
df_weather = h2oContext.as_spark_frame(f_weather)
df_census = h2oContext.as_spark_frame(f_census)
df_crimes = h2oContext.as_spark_frame(f_crimes)

# Register DataFrames as tables (the query below joins all three)
df_weather.createOrReplaceTempView("chicagoWeather")
df_census.createOrReplaceTempView("chicagoCensus")
df_crimes.createOrReplaceTempView("chicagoCrime")

crimeWithWeather = spark.sql("""SELECT
a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,
a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,
a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,
b.minTemp, b.maxTemp, b.meanTemp,
FROM chicagoCrime a
JOIN chicagoWeather b
ON a.Year = b.year AND a.Month = b.month AND a.Day =
JOIN chicagoCensus c
ON a.Community_Area = c.Community_Area_Number""")

# Publish Spark DataFrame as H2OFrame with given name
crimeWithWeatherHF = h2oContext.as_h2o_frame(crimeWithWeather, "crimeWithWeatherTable")
print("")
# Transform selected String columns to categoricals
cat_cols = ["Arrest", "Season", "WeekDay", "Primary_Type", "Location_Description", "Domestic"]
for col in cat_cols:
    crimeWithWeatherHF[col] = crimeWithWeatherHF[col].asfactor()

# Split frame into two - we use one as the training frame and the second one as the validation frame
splits = crimeWithWeatherHF.split_frame(ratios=[0.8])
train = splits[0]
test = splits[1]
print("")
h2o.download_csv(test, '/home/zcy/working/data_tst/ret/test.csv')

# stop H2O and Spark services
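
To make it easier to see what refine_date_col derives from the Date column, here is a plain-Python sketch of the same features for a single timestamp. It does not use H2O. The season mapping follows the comment in the script (with 1-based months), WeekNum uses ISO week numbering, which may differ from H2O's week(), and date_features is a hypothetical name.

```python
from datetime import datetime

WEEKDAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Season boundaries taken from the comment in refine_date_col (months are 1-based here)
SEASON_BY_MONTH = {
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn",
    11: "Winter", 12: "Winter", 1: "Winter", 2: "Winter",
}


def date_features(ts):
    """Return the calendar features derived from one timestamp."""
    week_day = WEEKDAY_NAMES[ts.weekday()]
    return {
        "Day": ts.day,
        "Month": ts.month,
        "Year": ts.year,
        "WeekNum": ts.isocalendar()[1],  # ISO week; an assumption, not H2O's definition
        "WeekDay": week_day,
        "HourOfDay": ts.hour,
        "Weekend": week_day in ("Sat", "Sun"),
        "Season": SEASON_BY_MONTH[ts.month],
    }


features = date_features(datetime(2015, 2, 8, 23, 43))
print(features)
```

Checking a few known dates this way is a quick sanity test before comparing against the columns that H2O produces.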

4 Run the script:

./bin/spark-submit --master spark://zcy-VirtualBox:7077  --conf "spark.executor.memory=1g" /home/zcy/working/
