How can I count the number of rows in a large CSV file using Perl?

Time: 2021-08-18 02:17:58

I have to use Perl in a Windows environment at work, and I need to find out the number of rows in a large CSV file (about 1.4 GB). Any idea how to do this with minimal waste of resources?

Thanks

PS: This must be done within the Perl script, and we're not allowed to install any new modules onto the system.

6 answers

#1


Do you mean lines or rows? A cell may contain line breaks, which would add lines to the file but not rows. If you are guaranteed that no cells contain newlines, then just use the technique in the Perl FAQ. Otherwise, you will need a proper CSV parser like Text::xSV.
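For reference, the Perl FAQ technique (perlfaq5, "How do I count the number of lines in a file?") reads the file in fixed-size chunks and counts newline characters with `tr`. The following is a minimal sketch along those lines; the sub name `count_lines` and the 64 KB chunk size are choices made here for illustration, not something from the question.

```perl
use strict;
use warnings;

# Count newline characters by reading the file in fixed-size chunks,
# so memory use stays constant no matter how large the file is.
sub count_lines {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!";
    my $lines = 0;
    my $buffer;
    # sysread returns 0 at end of file and undef on error; both end the loop.
    while (sysread $fh, $buffer, 65536) {
        $lines += ($buffer =~ tr/\n//);
    }
    close $fh;
    return $lines;
}

# Illustrative usage: pass the CSV path as the first command-line argument.
print count_lines($ARGV[0]), "\n" if @ARGV;
```

This counts newline characters, so a final line without a trailing newline is not included, and a newline inside a quoted cell is counted as a line of its own.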

#2


Yes, don't use Perl.

Instead, use the simple utility for counting lines: wc.exe

It's part of a suite of Windows utilities ported from the Unix originals.

http://unxutils.sourceforge.net/

For example:

PS D:\> wc test.pl
     12      26     271 test.pl
PS D:\>

Where 12 is the number of lines, 26 the number of words, and 271 the number of bytes.

If you really have to use Perl:

D:\>perl -lne "END{print $.;}" < test.pl
12

#3


perl -lne "END { print $. }" myfile.csv

This reads only one line at a time, so it doesn't waste any memory unless each line is enormously long.

#4


This one-liner handles newlines within the rows:

  1. It looks for lines with an odd number of quotes.
  2. It relies on doubled quotes being the way to indicate quotes within a field.
  3. It uses the awesome flip-flop operator.

    perl -ne 'BEGIN{$re=qr/^[^"]*(?:"[^"]*"[^"]*)*?"[^"]*$/;}END{print"Count: $t\n";}$t++ unless /$re/../$re/'
    

Consider:

  • wc is not going to work. It's awesome for counting lines, but not CSV rows.

  • You should install (or fight to install) Text::CSV or some similar standard package for proper handling.

  • This may get you there, nonetheless.


EDIT: It slipped my mind that this was Windows:

perl -ne "BEGIN{$re=qr/^[^\"]*(?:\"[^\"]*\"[^\"]*)*?\"[^\"]*$/;}END{print qq/Count: $t\n/;};$t++ unless $pq and $pq = /$re/../$re/;"

The weird thing is that The Broken OS's shell interprets && as its own conditional-execution operator, and I couldn't do anything to change its mind! If I escaped it, the shell just passed the escaped form through to perl.
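The quote-parity idea in this answer can also be written as an explicit loop, which is easier to test than a flip-flop one-liner. Note that `..` tests its right operand on the same line where the left operand matched, so it is worth checking the one-liners against a file with known multi-line fields. The sketch below is not the author's code: the sub name `count_csv_rows` is made up here, and it assumes well-formed CSV in which double quotes appear only as field delimiters or as doubled escapes.

```perl
use strict;
use warnings;

# Count CSV records rather than physical lines. A record ends on a line
# only when no quoted field is still open. Escaped quotes ("") contribute
# two quote characters, so they never change the open/closed state.
sub count_csv_rows {
    my ($filename) = @_;
    open my $fh, '<', $filename
        or die "Cannot open $filename: $!";
    my $rows      = 0;
    my $in_quotes = 0;    # true while a quoted field spans a line break
    while (my $line = <$fh>) {
        my $quotes = ($line =~ tr/"//);    # quote characters on this line
        $in_quotes = !$in_quotes if $quotes % 2;
        $rows++ unless $in_quotes;
    }
    close $fh;
    return $rows;
}

# Illustrative usage: pass the CSV path as the first command-line argument.
print "Count: ", count_csv_rows($ARGV[0]), "\n" if @ARGV;
```

It uses only core Perl and reads one line at a time, so it fits the no-new-modules constraint. Blank lines outside quoted fields are counted as rows, and a file that ends inside an unclosed quote leaves its last record uncounted; a real parser such as Text::CSV remains the safer choice when it is available.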

#5


Upvote for edg's answer. Another option is to install Cygwin to get wc and a bunch of other handy utilities on Windows.

#6


I was being idiotic; the simple way to do it in the script is:

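# Three-argument open with a lexical handle; $! reports why the open failed.
open my $extract, '<', $extractFileName
    or die "Cannot read row count of $extractFileName: $!";
my $rowCount = 0;
# Read one line at a time so the whole file is never held in memory.
while (<$extract>)
{
    $rowCount++;
}

close($extract);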
