Splitting an input ARFF file into smaller chunks to handle very large datasets

Time: 2022-01-10 02:02:13

I am trying to run a Weka classifier on MapReduce. Loading an entire ARFF file, even one of only 200 MB, causes a heap space error, so I want to split the ARFF file into chunks. The catch is that every chunk has to keep the header information, i.e. the ARFF attribute declarations, so that the classifier can run in each mapper. Here is the code I am using to split the data, but it does not do so efficiently:

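        // Body of getSplits(JobContext job) in a FileInputFormat subclass.
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus file : listStatus(job)) {
            Path path = file.getPath();
            FileSystem fs = path.getFileSystem(job.getConfiguration());

            // number of bytes in this file
            long length = file.getLen();

            // make sure this is actually a non-empty file
            if (length != 0) {
                BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
                // number of splits to make; configurable, defaults to 1
                int count = Math.max(1, job.getConfiguration().getInt("Run-num.splits", 1));
                long chunk = Math.max(1, length / count);
                for (int t = 0; t < count; t++) {
                    // give each split its own byte range instead of the whole file
                    long start = t * chunk;
                    if (start >= length) {
                        break;
                    }
                    // the last split absorbs the remainder
                    long size = (t == count - 1) ? length - start : chunk;
                    // prefer the hosts that store the block containing this offset
                    int blk = getBlockIndex(blkLocations, start);
                    splits.add(new FileSplit(path, start, size, blkLocations[blk].getHosts()));
                }
            }
            else {
                // create a split with no hosts for zero-length files
                splits.add(new FileSplit(path, 0, length, new String[0]));
            }
        }
        // NOTE: byte-range splits do not contain the ARFF header and may start
        // mid-line; the record reader still has to handle both for each mapper.
        return splits;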
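
For reference, the header-preserving split described above can be sketched outside Hadoop as a streaming pre-processing step. This is only a minimal sketch under stated assumptions: `ArffChunker`, `split`, `rowsPerChunk` and the `chunk-N.arff` file names are hypothetical and not part of Weka or Hadoop, and the file is assumed to be UTF-8 with one instance per line after `@data`. It reads the input line by line, so the whole file is never held in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: splits an ARFF file into chunks that each repeat the header. */
public class ArffChunker {

    /**
     * Writes chunk-0.arff, chunk-1.arff, ... into outDir. Each chunk starts with
     * the full header (everything up to and including the @data line) followed
     * by at most rowsPerChunk instances. Assumes one instance per line.
     */
    public static List<Path> split(Path input, Path outDir, int rowsPerChunk) throws IOException {
        if (rowsPerChunk < 1) {
            throw new IllegalArgumentException("rowsPerChunk must be >= 1");
        }
        List<Path> chunks = new ArrayList<>();
        StringBuilder header = new StringBuilder();
        BufferedWriter out = null;
        int rowsInChunk = 0;
        boolean inData = false;
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!inData) {
                    // Collect the header; the @data keyword is case-insensitive.
                    header.append(line).append('\n');
                    if (line.trim().toLowerCase(Locale.ROOT).startsWith("@data")) {
                        inData = true;
                    }
                    continue;
                }
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("%")) {
                    continue; // skip blank lines and comments in the data section
                }
                if (out == null || rowsInChunk == rowsPerChunk) {
                    // Start a new chunk and repeat the header at its top.
                    if (out != null) {
                        out.close();
                    }
                    Path chunk = outDir.resolve("chunk-" + chunks.size() + ".arff");
                    out = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                    out.write(header.toString());
                    chunks.add(chunk);
                    rowsInChunk = 0;
                }
                out.write(line);
                out.newLine();
                rowsInChunk++;
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return chunks;
    }
}
```

Each resulting chunk is a valid ARFF file in its own right, so it could be uploaded as a separate input file and read by one mapper without a custom record reader having to reconstruct the header.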

1 solution

#1


Have you tried this first?

In mapred-site.xml, add this property:

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
</property>

// JVM options for MR task processes; here a maximum heap of 2048 MB per task
