Splitting a file into unequal chunks in Linux

Date: 2022-04-26 01:27:10

I want to split a large file (about 17M lines of strings) into multiple files, with a different number of lines in each chunk. Is it possible to pass an array of line counts to the 'split -l' command, like this:

[
 1=>1000000,
 2=>1000537,
 ...
]

so that each chunk receives the corresponding number of lines?

3 answers

#1


6  

Use a compound command:

{
  head -n 10000 > output1
  head -n   200 > output2
  head -n  1234 > output3
  cat > remainder
} < yourbigfile

This also works with loops:

{
  i=1
  for n in 10000 200 1234
  do
      head -n "$n" > "output$i"
      i=$((i + 1))
  done
  cat > remainder
} < yourbigfile

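This relies on head leaving the file offset just past the last line it printed, which GNU head does when its input is a regular file. It does not work on OS X, where head reads ahead and discards the extra input.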
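
To connect this back to the array of line counts in the question, the loop can be wrapped in a small function that takes the counts as arguments. This is only a sketch: the name `split_by_counts` and the output naming scheme are made up for illustration, and it assumes GNU head reading from a regular file, as described above.

```shell
#!/bin/sh
# Hypothetical helper: split_by_counts FILE PREFIX COUNT...
# Writes PREFIX1, PREFIX2, ... with COUNT lines each, then PREFIX.remainder.
# Assumes GNU head, which leaves the shared file offset after the last
# line it printed when the input is a regular file.
split_by_counts() (
    file=$1
    prefix=$2
    shift 2
    i=1
    {
        for n in "$@"; do
            head -n "$n" > "$prefix$i"
            i=$((i + 1))
        done
        # Whatever is left after the listed chunks.
        cat > "$prefix.remainder"
    } < "$file"
)

# Example call with the counts from the question:
#   split_by_counts yourbigfile output 1000000 1000537
```

The function body runs in a subshell, so the helper variables do not leak into the calling shell.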

#2


1  

The split command has no such capability, so you'll have to use a different tool or write one of your own.
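
As one possible way of writing your own, here is a minimal sketch built on awk, which reads the file once and does not depend on how head treats its input. The function name `split_unequal`, the numbered output names, and the `rest` suffix for leftover lines are assumptions made for this example, not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: split_unequal FILE PREFIX COUNT...
# Writes PREFIX1, PREFIX2, ... with COUNT lines each; any lines beyond
# the listed counts go to PREFIXrest.
split_unequal() (
    file=$1
    prefix=$2
    shift 2
    awk -v sizes="$*" -v prefix="$prefix" '
        BEGIN {
            n = split(sizes, size, " ")   # per-chunk line counts
            chunk = 1
            written = 0
        }
        {
            # Advance to the next chunk once the current one is full.
            while (chunk <= n && written >= size[chunk]) {
                close(out)
                chunk++
                written = 0
            }
            out = (chunk <= n) ? prefix chunk : prefix "rest"
            print > out
            written++
        }
    ' "$file"
)

# Example call with the counts from the question:
#   split_unequal yourbigfile output 1000000 1000537
```

Each finished chunk is closed before the next one is opened, so only one output file is open at a time.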

#3


1  

You could use sed, with another script generating the sed commands for you.

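# split_gen.py (a Perl script, despite the .py extension)
use strict;
use warnings;

# Cumulative last line number of each chunk, not per-chunk sizes.
my @limits = (100, 250, 340, 999);
my $filename = "joker";

my $start = 1;
foreach my $end (@limits) {
    # Print lines $start..$end, then quit so sed stops reading at $end.
    print qq{sed -n '$start,${end}p;${end}q' $filename > $filename.$start-$end\n};
    $start = $end + 1;
}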

Running perl split_gen.py prints:

sed -n '1,100p;100q' joker > joker.1-100
sed -n '101,250p;250q' joker > joker.101-250
sed -n '251,340p;340q' joker > joker.251-340
sed -n '341,999p;999q' joker > joker.341-999

If you're happy with the commands, you can run them by piping the output to sh:

perl split_gen.py | sh 

Then be patient: each sed command rereads the file from the beginning, so this may be slow with big files.
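
Note that the limits in the script are cumulative ending line numbers, whereas the question lists per-chunk line counts. A small sketch of the conversion, using a made-up helper named `cumulative_limits`, could look like this:

```shell
#!/bin/sh
# Hypothetical helper: cumulative_limits COUNT...
# Prints the running total after each count, one per line, which is the
# form the limits list in the generator script expects.
cumulative_limits() (
    total=0
    for n in "$@"; do
        total=$((total + n))
        printf '%s\n' "$total"
    done
)

# Example: chunk sizes 100, 150, 90 and 659 give the limits used above.
#   cumulative_limits 100 150 90 659
```

With the sizes in the comment, the running totals are 100, 250, 340 and 999, matching the limits in the script.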
