How to split a multi-GB file into chunks of about 1.5 GB using the Linux split command?

Posted: 2022-09-29 21:35:24

I have a file that can be bigger than 4 GB. I am using the Linux split command to split it by lines (that is the requirement). But after splitting the original file, I want the size of each split file to always be less than 2 GB. The original file size can vary from 3 to 5 GB. I want to write some logic for this in my shell script and feed the number of lines into the split command below to keep the split file sizes under 2 GB.

split -l 100000 -d abc.txt abc

3 Answers

#1



That's how I solved this problem. Sorry for posting the solution late.


1. Declared a global variable DEFAULT_SPLITFILE_SIZE of 1.5 GB.

DEFAULT_SPLITFILE_SIZE=1500000000

2. Counted the number of lines in the file.

LINES_IN_FILE=$(wc -l < "$file")

echo "$(date) Total line count = ${LINES_IN_FILE}."

3. Calculated the size of the file in bytes.

FILE_SIZE=`stat -c %s "${file}"`

4. Calculated the average size of each line in the file.

SIZE_PER_LINE=$(( FILE_SIZE / LINES_IN_FILE ))

echo `date`  "Bytes Per Line = $SIZE_PER_LINE"

5. Calculated the number of lines needed to make each split file about 1.5 GB.

SPLIT_LINE=$(( DEFAULT_SPLITFILE_SIZE / SIZE_PER_LINE ))

echo `date`  "Lines for Split = $SPLIT_LINE"
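Putting the steps above together, the whole flow can be sketched as a short shell script. The input path abc.txt and the output prefix abc are placeholders taken from the question:

```shell
#!/bin/sh
# Sketch of the full flow; abc.txt and the abc output prefix are
# placeholders taken from the question.
file=abc.txt
DEFAULT_SPLITFILE_SIZE=1500000000                         # target chunk size, ~1.5 GB

LINES_IN_FILE=$(wc -l < "$file")                          # total number of lines
FILE_SIZE=$(stat -c %s "$file")                           # file size in bytes
SIZE_PER_LINE=$(( FILE_SIZE / LINES_IN_FILE ))            # average bytes per line
SPLIT_LINE=$(( DEFAULT_SPLITFILE_SIZE / SIZE_PER_LINE ))  # lines per chunk

echo "$(date) Lines for Split = $SPLIT_LINE"
split -l "$SPLIT_LINE" -d "$file" abc                     # split on whole lines
```

Because SIZE_PER_LINE is an average, a chunk can exceed the target when long lines cluster together; aiming at 1.5 GB rather than 2 GB leaves headroom for exactly that case.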

#2



Transferring comments into an answer.


Seeking clarification: How many lines are in the typical file? How much do the line lengths vary? Can you do some arithmetic, including a margin for error, on how many lines to request? Have you looked at the options on your split command? Does it support the -C option? (GNU split says: -C, --line-bytes=SIZE, put at most SIZE bytes of lines per output file; that sounds like it might be what you want.)

This is what I thought of doing.


  1. Run wc -l abc.txt; this gives the total number of lines in the file.
  2. Get the file size of the original file abc.txt and divide it by the number of lines; that gives the size per line.
  3. Divide 1.5 GB (or any number less than 2 GB) by the size per line; that gives a number of lines.
  4. Use the number of lines from step 3 in the split command.

That's why I asked the questions about the file and line sizes. You could run into problems if your file has many lines that are 10 bytes long and a few that are 20 KiB long; you might accidentally get a huge block of 20 KiB lines that blows your limit because they are all grouped together. However, the chances are that your data is uniform enough that you won't run into such problems.


Consider whether it is worth installing GNU split on your machine (not in place of the standard issue split; install it in a separate directory, such as /usr/gnu/bin).


The number of lines varies from file to file, but one of the files I am working on has 328969322 lines, and the file size is 52.5GB. Yes, I checked the options of my split and it does support -C option. How do I use that in my problem?


I note that this data file is considerably bigger (about ten times bigger) than the sizes mentioned in the question. However, that's not a major problem.


split -C 1500000000 datafile

Or, if you want 1.5 GiB rather than 1.5 GB, then use:


split -C 1610612736 datafile

When I experimented with split -C 20 and some of the lines were 40 bytes long, the long lines were split (maximum length 20 bytes), but the shorter lines were grouped to make files up to 20 bytes long. Test your code on small data files (and small chunk sizes).
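That small-scale experiment is easy to reproduce; here is a sketch (the filenames are illustrative):

```shell
#!/bin/sh
# Reproduce the small-scale experiment: with split -C 20, short lines
# are grouped up to 20 bytes per chunk, while a 40-byte line is cut
# into 20-byte pieces.
cd "$(mktemp -d)" || exit 1
{
  printf '%s\n' AAAAAAAAA BBBBBBBBB   # two 10-byte lines
  printf '%39s\n' '' | tr ' ' C       # one 40-byte line
} > datafile
split -C 20 datafile chunk.
wc -c chunk.*                         # every chunk is at most 20 bytes
```

The two 10-byte lines land together in the first 20-byte chunk; the 40-byte line, longer than the limit, is cut into two 20-byte pieces.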

From the data you give, it appears your lines are about 170 bytes each on average, so you shouldn't have any problems with untoward splits. If need be, you can experiment with something like:


sed 100q datafile | split -C 1700 -

That should give you about 10 files with about 10 lines in each.


#3



It's always advisable to consult the manual before posting a question. The split command provides an option to split files by bytes. Below is the relevant option from the split manual page.

   -b, --bytes=SIZE
          put SIZE bytes per output file

split --bytes=1500000000 abc.txt abc

You need not explicitly specify the number of lines. Note, though, that --bytes splits at byte boundaries, so a chunk may end in the middle of a line.
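A scaled-down sketch of this approach (small sizes so the result is easy to inspect; in practice the size would be 1500000000):

```shell
#!/bin/sh
# Scaled-down demonstration of byte-based splitting; in practice the
# size would be 1500000000 rather than 100.
cd "$(mktemp -d)" || exit 1
seq 1 100 > abc.txt              # 292-byte sample file
split --bytes=100 abc.txt abc    # chunks of at most 100 bytes
wc -c abcaa abcab abcac          # a chunk boundary may fall mid-line
```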
