Sparkling Water

Date: 2021-04-18 10:26:21

1 [figure: Sparkling Water; image not preserved]

2 It provides a way to initialize H2O services on each node in the Spark cluster and to access data stored in Spark and H2O data structures.

3 The internal backend is the easiest to deploy; however, when Spark or YARN kills an executor (which is not unusual), the entire H2O cluster goes down, because H2O does not support high availability.

4 The internal backend is the default behavior for Sparkling Water. Another way to change the backend type is to call the setExternalClusterMode() or setInternalClusterMode() method on the H2OConf class (see the sketch after this list). H2OConf is a simple wrapper around SparkConf and inherits all properties of the Spark configuration.

5 It seems that installing Sparkling Water pulls in pyspark and H2O as well: pip install h2o_pysparkling_2.3
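
To make items 2 and 4 concrete, here is a minimal sketch. It assumes the PySparkling Python API, where the Scala setters named above are exposed as snake_case methods, and that H2OContext.getOrCreate accepts the configuration as a second argument:

from pysparkling import H2OConf, H2OContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BackendDemo").getOrCreate()

# H2OConf wraps SparkConf and picks up everything already set in the Spark configuration
conf = H2OConf(spark)
conf.set_internal_cluster_mode()  # the default; set_external_cluster_mode() selects the external backend

# Start the H2O services with this configuration
hc = H2OContext.getOrCreate(spark, conf)

# Item 2 in action: move data between Spark and H2O representations
df = spark.range(100).toDF("x")     # a small Spark DataFrame
hf = hc.as_h2o_frame(df, "xTable")  # expose it to H2O as a named H2OFrame

The same H2OContext also offers as_spark_frame() for the opposite direction, as the longer script below demonstrates.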

=======================

1 Start Spark: ./sbin/start-master.sh, then ./sbin/start-slave.sh spark://zcy-VirtualBox:7077

2 First run a very simple script to check whether the environment is ready. For it to run successfully, the VM memory needs to be increased (I changed it to 2 GB).


from pysparkling import *
from pyspark.sql import SparkSession
import h2o

# Initiate SparkSession
spark = SparkSession.builder.appName("App name").getOrCreate()

# Initiate H2OContext
hc = H2OContext.getOrCreate(spark)

# Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()
print("")

./bin/spark-submit --master spark://zcy-VirtualBox:7077  --conf "spark.executor.memory=1g" /home/zcy/working/tst.py

The result is as follows:

[screenshot of the output; image not preserved]

3 Run a slightly more complex script:

import h2o
from datetime import datetime
from pyspark import SparkConf, SparkFiles
from pyspark.sql import Row, SparkSession
import os
from pysparkling import *

# Refine the date column into individual time features
def refine_date_col(data, col):
    data["Day"] = data[col].day()
    data["Month"] = data[col].month()
    data["Year"] = data[col].year()
    data["WeekNum"] = data[col].week()
    data["WeekDay"] = data[col].dayOfWeek()
    data["HourOfDay"] = data[col].hour()

    # Create weekend and season columns
    # Spring = Mar, Apr, May. Summer = Jun, Jul, Aug. Autumn = Sep, Oct. Winter = Nov, Dec, Jan, Feb.
    # data["Weekend"] = [1 if x in ("Sun", "Sat") else 0 for x in data["WeekDay"]]
    data["Weekend"] = ((data["WeekDay"] == "Sun") | (data["WeekDay"] == "Sat"))
    # Bin months into the seasons listed above (breaks reconstructed from that comment)
    data["Season"] = data["Month"].cut([0, 2, 5, 8, 10, 12],
                                       ["Winter", "Spring", "Summer", "Autumn", "Winter"])

# Helper function returning the path to the data files
def _locate(file_name):
    if os.path.isfile("/home/zcy/working/data_tst/" + file_name):
        return "/home/zcy/working/data_tst/" + file_name
    else:
        print("eeeeeeeeeeee")

spark = SparkSession.builder.appName("ChicagoCrimeTest").getOrCreate()
# Start H2O services
h2oContext = H2OContext.getOrCreate(spark)

# Define file names
chicagoAllWeather = "chicagoAllWeather.csv"
chicagoCensus = "chicagoCensus.csv"
chicagoCrimes10k = "chicagoCrimes10k.csv.zip"

# h2o.import_file expects a cluster-relative path, so upload from the local file system instead
f_weather = h2o.upload_file(_locate(chicagoAllWeather))
f_census = h2o.upload_file(_locate(chicagoCensus))
f_crimes = h2o.upload_file(_locate(chicagoCrimes10k))
print("")

# Transform weather table: remove the 1st column (date)
f_weather = f_weather[1:]

# Transform census table: remove all spaces from column names (they cause problems in Spark SQL)
col_names = list(map(lambda s: s.strip().replace(' ', '_').replace('+', '_'), f_census.col_names))
# Update column names in the table
f_census.names = col_names

# Transform crimes table: drop unused leading columns (slice assumed; the exact index was lost)
f_crimes = f_crimes[2:]
# Set time zone to UTC for date manipulation
h2o.cluster().timezone = "Etc/UTC"
# Replace ' ' by '_' in column names
col_names = list(map(lambda s: s.replace(' ', '_'), f_crimes.col_names))
f_crimes.names = col_names
refine_date_col(f_crimes, "Date")
f_crimes = f_crimes.drop("Date")

# Expose H2O frames as Spark DataFrames
print("")
df_weather = h2oContext.as_spark_frame(f_weather)
df_census = h2oContext.as_spark_frame(f_census)
df_crimes = h2oContext.as_spark_frame(f_crimes)

# Register DataFrames as tables
df_weather.createOrReplaceTempView("chicagoWeather")
df_census.createOrReplaceTempView("chicagoCensus")
df_crimes.createOrReplaceTempView("chicagoCrime")

crimeWithWeather = spark.sql("""SELECT
a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,
a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,
a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,
b.minTemp, b.maxTemp, b.meanTemp,
c.PERCENT_AGED_UNDER_18_OR_OVER_64, c.PER_CAPITA_INCOME, c.HARDSHIP_INDEX,
c.PERCENT_OF_HOUSING_CROWDED, c.PERCENT_HOUSEHOLDS_BELOW_POVERTY,
c.PERCENT_AGED_16__UNEMPLOYED, c.PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
FROM chicagoCrime a
JOIN chicagoWeather b
ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
JOIN chicagoCensus c
ON a.Community_Area = c.Community_Area_Number""")

# Publish the Spark DataFrame as an H2OFrame with a given name
crimeWithWeatherHF = h2oContext.as_h2o_frame(crimeWithWeather, "crimeWithWeatherTable")
print("")

# Transform selected String columns to categoricals
cat_cols = ["Arrest", "Season", "WeekDay", "Primary_Type", "Location_Description", "Domestic"]
for col in cat_cols:
    crimeWithWeatherHF[col] = crimeWithWeatherHF[col].asfactor()

# Split the frame in two: one part as the training frame, the other as the validation frame
splits = crimeWithWeatherHF.split_frame(ratios=[0.8])
train = splits[0]
test = splits[1]
print("")
h2o.download_csv(train, '/home/zcy/working/data_tst/ret/train.csv')
h2o.download_csv(test, '/home/zcy/working/data_tst/ret/test.csv')

# Stop H2O and Spark services
h2o.cluster().shutdown()
spark.stop()

4 Run the script:

./bin/spark-submit --master spark://zcy-VirtualBox:7077  --conf "spark.executor.memory=1g" /home/zcy/working/sparkH2O.py

[screenshots of the run output; images not preserved]
