Big data: global sort and top-N with MapReduce, implemented in Python

Date: 2023-03-08 22:22:09

The files a.txt and b.txt are as follows:

a.txt

       hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop
hadoop

b.txt:

       java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java
java

Upload a.txt and b.txt to the HDFS directory /mapreduce/allsort:

hadoop fs -put a.txt b.txt  /mapreduce/allsort

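Experiment 1: the first approach to global sorting uses the numeric column as the key and the rest of the line as the value, runs a single reducer, and relies on the shuffle phase to do the sorting. (Note that the shuffle sorts keys as strings by default.)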
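The string-ordering caveat can be illustrated with a small sketch. The key values below are made up for illustration (they are not taken from a.txt or b.txt), and to_sort_key is a hypothetical helper; only the constant 9000000000 comes from the scripts in this post. The sketch assumes keys are small non-negative integers, so that every transformed key has the same number of digits.

```python
# Hypothetical keys; the shuffle compares keys as strings, not as numbers.
keys = [5, 40, 100, 9]

# Sorting the string forms gives lexicographic order, not numeric order.
as_strings = sorted(str(k) for k in keys)

BASE_COUNT = 9000000000  # same constant as base_count in map.py


def to_sort_key(k):
    """Map a small non-negative int to an equal-width string key."""
    return str(BASE_COUNT - k)


# Ascending string order of the transformed keys ...
transformed = sorted(to_sort_key(k) for k in keys)
# ... corresponds to descending numeric order of the original keys.
restored = [BASE_COUNT - int(s) for s in transformed]

print(as_strings)  # ['100', '40', '5', '9']
print(restored)    # [100, 40, 9, 5]
```

Because all transformed keys have ten digits, comparing them as strings gives the same result as comparing them as numbers, which is why the subtraction trick is safe under the default string sort.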

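map.py:

#!/usr/bin/python
import sys

# Emit base_count - key so that the ascending shuffle sort
# yields descending order of the original numeric key.
base_count = 9000000000

for line in sys.stdin:
    ss = line.strip().split('\t')
    key, val = ss
    new_key = base_count - int(key)
    print("%s\t%s" % (new_key, val))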

red.py:

#!/usr/bin/python
import sys

base_count = 9000000000

for line in sys.stdin:
    ss = line.strip().split()
    # Undo the transformation applied in map.py to restore the original key.
    key = base_count - int(ss[0])
    print("%s\t%s" % (key, ss[1]))

run.sh:

HADOOP=/usr/local/src/hadoop-1.2.1/bin/hadoop
HADOOP_STREAMING=/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar
INPUT_PATH=/mapreduce/allsort
OUT_PATH=/mapreduce/allsort/out
$HADOOP jar $HADOOP_STREAMING \
-input $INPUT_PATH \
-output $OUT_PATH \
-mapper "python map.py" \
-reducer "python red.py" \
-file "./map.py" \
-file "./red.py"

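With the number of reducers left unset, red.py runs as a single reducer by default, so the result is written to one file, sorted in descending order.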
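The single-reducer flow can be checked locally without a cluster. The following is a minimal sketch under stated assumptions: the input rows are hypothetical "number<TAB>word" lines, run_local is an illustrative helper, and Python's sorted() stands in for the shuffle. It is a simplification, since Hadoop sorts by key only, while sorted() here also breaks ties by value.

```python
BASE_COUNT = 9000000000  # same constant as base_count in map.py and red.py


def map_line(line):
    """Mirror map.py: replace the numeric key with BASE_COUNT - key."""
    key, val = line.strip().split("\t")
    return "%s\t%s" % (BASE_COUNT - int(key), val)


def reduce_line(line):
    """Mirror red.py: restore the original key."""
    new_key, val = line.strip().split("\t")
    return "%s\t%s" % (BASE_COUNT - int(new_key), val)


def run_local(lines):
    """Map, sort as strings (stand-in for the shuffle), then reduce."""
    mapped = [map_line(l) for l in lines]
    shuffled = sorted(mapped)
    return [reduce_line(l) for l in shuffled]


# Hypothetical input rows, not the actual contents of a.txt or b.txt.
sample = ["3\thadoop", "12\tjava", "7\thadoop", "100\tjava"]
for row in run_local(sample):
    print(row)
```

With this sample, the rows come out with keys 100, 12, 7, 3, in that order.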

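Experiment 2: set 3 reducers so that each reducer sorts its own data and the 3 reducers are also ordered relative to one another. Idea: use 3 buckets, with 60-40 in bucket 0, 40-20 in bucket 1, and below 20 in bucket 2; within each bucket, rely on the shuffle sort.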

map.py

#!/usr/bin/python
import sys

base_count = 9000000000

for line in sys.stdin:
    ss = line.strip().split('\t')
    key, val = ss
    new_key = base_count - int(key)
    # The first output field is the bucket ID used for partitioning.
    if int(key) >= 40:
        print("%s\t%s\t%s" % ("0", new_key, val))
    elif int(key) >= 20:
        print("%s\t%s\t%s" % ("1", new_key, val))
    else:
        print("%s\t%s\t%s" % ("2", new_key, val))

red.py

#!/usr/bin/python
import sys

base_count = 9000000000

for line in sys.stdin:
    ss = line.strip().split()
    # ss[0] is the bucket ID, ss[1] the transformed key, ss[2] the value.
    key = base_count - int(ss[1])
    print("%s\t%s" % (key, ss[2]))

run.sh

HADOOP="/usr/local/src/hadoop-1.2.1/bin/hadoop"
HADOOP_STREAMING="/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_PATH="/mapreduce/allsort"
OUT_PATH="/mapreduce/allsort/out"
$HADOOP fs -rmr $OUT_PATH
# stream.num.map.output.key.fields=2: the first 2 fields form the key
# num.key.fields.for.partition=1: partition on the first field only
$HADOOP jar $HADOOP_STREAMING \
-input $INPUT_PATH \
-output $OUT_PATH \
-mapper "python map.py" \
-reducer "python red.py" \
-file "./map.py" \
-file "./red.py" \
-jobconf "mapred.reduce.tasks=3" \
-jobconf "stream.num.map.output.key.fields=2" \
-jobconf "num.key.fields.for.partition=1" \
-partitioner "org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner"
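The bucket idea can be sketched locally as follows. This is an illustration under assumptions, not the job itself: the sample rows are hypothetical, bucket_of and run_local are illustrative helpers, and the sketch assumes that bucket i ends up in the i-th output file, which on a real cluster depends on how the partitioner hashes the bucket ID.

```python
BASE_COUNT = 9000000000  # same constant as base_count in map.py
NUM_BUCKETS = 3          # matches mapred.reduce.tasks=3


def bucket_of(key):
    """Thresholds from the text: >= 40 -> 0, >= 20 -> 1, otherwise 2."""
    if key >= 40:
        return 0
    if key >= 20:
        return 1
    return 2


def run_local(rows):
    """rows: list of (int key, value). Returns one sorted list per bucket."""
    parts = [[] for _ in range(NUM_BUCKETS)]
    for key, val in rows:
        parts[bucket_of(key)].append((str(BASE_COUNT - key), val))
    result = []
    for part in parts:
        part.sort()  # stand-in for the per-reducer shuffle sort
        result.append([(BASE_COUNT - int(k), v) for k, v in part])
    return result


# Hypothetical rows, not the actual contents of a.txt or b.txt.
sample = [(5, "hadoop"), (58, "java"), (21, "hadoop"), (40, "java"), (33, "hadoop")]
parts = run_local(sample)

# Concatenating the parts in bucket order gives a globally descending list.
merged = [row for part in parts for row in part]
print(merged)
```

Each bucket covers a disjoint key range and the ranges themselves are ordered, so reading the parts in bucket order preserves the global order without any extra merge step.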

Experiment 3: global sorting controlled through job parameters:

The data is as follows:

aaa.txt

d.1.5.
e.9.4.
e.5.9.
e.5.1.
e.5.1.
a.7.2.
f.8.3.

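Purpose: in streaming mode, Hadoop by default treats the part of each map output line before the first occurrence of the configured field separator as the key and everything after it as the value. Here we use the first 2 fields of the map output as the key and the remaining fields as the value, and instead of Hadoop's default "\t" field separator we split on ".", which suits this text.

Sort with the first 3 fields as the key and the rest as the value;

the second and third fields are used for partitioning.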
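The key/value split described above can be modelled with a short sketch. It is a simplified model of the behaviour, not Hadoop's actual code; split_key_value and its parameter names are hypothetical and stand in for the field-separator and key-field-count settings.

```python
def split_key_value(line, separator=".", num_key_fields=2):
    """Return (key, value): the key is the text before the n-th separator."""
    fields = line.split(separator)
    if len(fields) <= num_key_fields:
        # Fewer separators than requested: the whole line is the key.
        return line, ""
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value


line = "e.9.4."  # a row from aaa.txt
print(split_key_value(line, num_key_fields=1))  # ('e', '9.4.')
print(split_key_value(line, num_key_fields=2))  # ('e.9', '4.')
print(split_key_value(line, num_key_fields=3))  # ('e.9.4', '')
```

Raising the number of key fields moves more of the line into the key, which is what lets the shuffle sort on several fields at once.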

run.sh:

HADOOP="/usr/local/src/hadoop-1.2.1/bin/hadoop"
HADOOP_STREAMING="/usr/local/src/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_PATH="/mapreduce/allsort/aaa.txt"
OUT_PATH="/mapreduce/allsort/out"
$HADOOP fs -rmr $OUT_PATH
$HADOOP jar $HADOOP_STREAMING \
-input $INPUT_PATH \
-output $OUT_PATH \
-mapper "cat" \
-reducer "cat" \
-jobconf stream.num.map.output.key.fields= \
-jobconf stream.map.output.field.separator=. \
-jobconf map.output.key.field.separator=. \
-jobconf mapred.text.key.partitioner.options=-k2, \
-jobconf mapred.reduce.tasks= \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
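To see which rows would travel together, the partitioning step can be approximated locally. This sketch only groups keys by the second and third fields, as stated in the text; it does not model which reducer each group is assigned to, since that depends on the partitioner's hash and on the number of reducers. The helpers partition_fields and group_by_partition are hypothetical, and the keys are built from the first three fields of the aaa.txt rows.

```python
def partition_fields(key, start, end, separator="."):
    """Return key fields start..end (1-based, inclusive), joined by separator."""
    fields = key.split(separator)
    return separator.join(fields[start - 1:end])


def group_by_partition(keys, start, end):
    """Group keys by their partition fields; sort each group as strings."""
    groups = {}
    for key in keys:
        groups.setdefault(partition_fields(key, start, end), []).append(key)
    return {p: sorted(ks) for p, ks in groups.items()}


# Keys built from the first three fields of the aaa.txt rows.
keys = ["d.1.5", "e.9.4", "e.5.9", "e.5.1", "e.5.1", "a.7.2", "f.8.3"]
groups = group_by_partition(keys, 2, 3)
for part in sorted(groups):
    print("%s -> %s" % (part, groups[part]))
```

Rows whose second and third fields match, such as the two e.5.1 rows, always fall into the same group and therefore reach the same reducer.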