How to use Hadoop MapReduce or Spark for data preprocessing?

Time: 2023-01-20 19:29:15

I'm very new to Hadoop MapReduce/Spark. For my target project, I want to perform data preprocessing with Hadoop MapReduce or Spark. I know the basics of Hadoop MapReduce, but I don't know how to implement preprocessing algorithms/methods with this framework. For Hadoop MapReduce, I have to define Map() and Reduce(), which take <key, value> pairs as the transmission type from mappers to reducers. But with database tables, how can I handle table entries in <key, value> format? Using the primary key as the key seems pointless. It's a similar case for Spark, since I also need to specify a key.


For example, for each data entry in the database table, some fields of some entries may be missing, so I want to fill in default values for those fields using some kind of imputation strategy. How can I process the data entries in a <key, value> way? Setting the primary key as the key is pointless here: if that's the case, each <key, value> pair has a key unlike every other, so aggregation doesn't help.
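To make this concrete, here is a minimal pure-Python sketch of the per-record imputation I have in mind (the field names and default values are made up). Note that it is a record-by-record transformation with no aggregation across records, which is why a key-based reduce step seems unnecessary:

```python
# Toy records standing in for database table rows; None marks a missing field.
records = [
    {"id": 1, "age": 34, "city": "Paris"},
    {"id": 2, "age": None, "city": "Tokyo"},
    {"id": 3, "age": 28, "city": None},
]

# Imputation strategy: a default value per field (could be a mean, mode, or constant).
defaults = {"age": 0, "city": "unknown"}

def impute(record, defaults):
    """Fill in the default for every field whose value is missing (None)."""
    return {k: (defaults.get(k, v) if v is None else v) for k, v in record.items()}

cleaned = [impute(r, defaults) for r in records]
print(cleaned)
# [{'id': 1, 'age': 34, 'city': 'Paris'},
#  {'id': 2, 'age': 0, 'city': 'Tokyo'},
#  {'id': 3, 'age': 28, 'city': 'unknown'}]
```

In MapReduce terms this is the body of a mapper applied independently to each row, which is exactly where I don't see what the key should be.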


1 Solution

#1



MapReduce is a fairly low-level programming model. You can start with higher-level abstractions like Hive and Pig instead.


If you are dealing with structured data, go with Hive, which provides a SQL-like interface and internally converts the SQL queries into MR jobs.
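For example, the missing-value imputation from the question is a one-line COALESCE in SQL. The table and column names below are hypothetical, and plain sqlite3 stands in for Hive here just to show the shape of the query; in Hive, an equivalent statement would be compiled down to MR jobs for you, with no keys to choose:

```python
import sqlite3

# In-memory table standing in for a warehouse table; NULL marks a missing field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, age INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, 34, "Paris"), (2, None, "Tokyo"), (3, 28, None)],
)

# COALESCE replaces NULL with a default value -- the SQL form of imputation.
rows = conn.execute(
    "SELECT id, COALESCE(age, 0), COALESCE(city, 'unknown') FROM users ORDER BY id"
).fetchall()
print(rows)  # [(1, 34, 'Paris'), (2, 0, 'Tokyo'), (3, 28, 'unknown')]
```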


Hope this helps.

