How can I make this PowerShell script parse large files faster?

Date: 2023-01-15 10:23:29

I have the following PowerShell script that parses some very large files for ETL purposes. For starters, my test file is ~30 MB; larger files of around 200 MB are expected. So I have a few questions.


The script below works, but it takes a very long time to process even a 30 MB file.


PowerShell Script:

$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"
$array = @()

$content = gc $path\$infile |
    select -skip 4 |
    where {$_ -match "[|].*[|].*"} |
    foreach {$_ -replace "^[|]","" -replace "[|]$",""}

$header = $content[0]

$array = $content[0]
for ($i = 1; $i -le $content.length; $i+=1) {
    if ($array[$i] -ne $content[0]) {$array += $content[$i]}
}

$array | out-file $path\$outfile -encoding ASCII

Data File Excerpt:

---------------------------
|Data statistics|Number of|
|-------------------------|
|Records passed |   93,118|
---------------------------
02/14/2012                                                                                                                                                           Production Operations and Confirmations                                                                                                                                                              2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Production Operations and Confirmations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|ProductionOrderNumber|MaterialNumber                       |ModifiedDate|Plant|OperationRoutingNumber|WorkCenter|OperationStatus|IsActive|     WbsElement|SequenceNumber|OperationNumber|OperationDescription                    |OperationQty|ConfirmedYieldQty|StandardValueLabor|ActualDirectLaborHrs|ActualContractorLaborHrs|ActualOvertimeLaborHrs|ConfirmationNumber|
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|180849518            |011255486L1                          |02/08/2012  |2101 |            9901123118|56B30     |I9902          |        |SOC10MA2302SOCJ31|              |0140           |Operation 1                             |          1 |               0 |              0.0 |                    |                499.990 |                      |        9908651250|
|180849518            |011255486L1                          |02/08/2012  |2101 |            9901123118|56B30     |I9902          |        |SOC10MA2302SOCJ31|14            |9916           |Operation 2                             |          1 |               0 |            499.0 |                    |                        |                      |        9908532289|
|181993564            |011255486L1                          |02/09/2012  |2101 |            9901288820|56B30     |I9902          |        |SOC10MD2302SOCJ31|14            |9916           |Operation 1                             |          1 |               0 |            499.0 |                    |                399.599 |                      |        9908498544|
|180885825            |011255486L1                          |02/08/2012  |2101 |            9901162239|56B30     |I9902          |        |SOC10MG2302SOCJ31|              |0150           |Operation 3                             |          1 |               0 |              0.0 |                    |                882.499 |                      |        9908099659|
|180885825            |011255486L1                          |02/08/2012  |2101 |            9901162239|56B30     |I9902          |        |SOC10MG2302SOCJ31|14            |9916           |Operation 4                             |          1 |               0 |            544.0 |                    |                        |                      |        9908858514|
|181638583            |990104460I0                          |02/10/2012  |2101 |            9902123289|56G99     |I9902          |        |SOC11MAR105SOCJ31|              |0160           |Operation 5                             |          1 |               0 |          1,160.0 |                    |                        |                      |        9914295010|
|181681218            |990104460B0                          |02/08/2012  |2101 |            9902180981|56G99     |I9902          |        |SOC11MAR328SOCJ31|0             |9910           |Operation 6                             |          1 |               0 |            916.0 |                    |                        |                      |        9914621885|
|181681036            |990104460I0                          |02/09/2012  |2101 |            9902180289|56G99     |I9902          |        |SOC11MAR108SOCJ31|              |0180           |Operation 8                             |          1 |               0 |              1.0 |                    |                        |                      |        9914619196|
|189938054            |011255486A2                          |02/10/2012  |2101 |            9999206805|5AD99     |I9902          |        |RS08MJ2305SOCJ31 |              |0599           |Operation 8                             |          1 |               0 |              0.0 |                    |                        |                      |        9901316289|
|181919894            |012984532A3                          |02/10/2012  |2101 |            9902511433|A199399Z  |I9902          |        |SOC12MCB101SOCJ31|0             |9935           |Operation 9                             |          1 |               0 |              0.5 |                    |                        |                      |        9916914233|
|181919894            |012984532A3                          |02/10/2012  |2101 |            9902511433|A199399Z  |I9902          |        |SOC12MCB101SOCJ31|22            |9951           |Operation 10                            |          1 |               0 |           68.080 |                    |                        |                      |        9916914224|

3 Answers

#1 (13 votes)

Your script reads one line at a time (slow!) and stores almost the entire file in memory (big!).


Try this (not tested extensively):


$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"

$batch = 1000

[regex]$match_regex = '^\|.+\|.+\|.+'
[regex]$replace_regex = '^\|(.+)\|$'

$header_line = (Select-String -Path $path\$infile -Pattern $match_regex -list).line

[regex]$header_regex = [regex]::escape($header_line)

$header_line.trim('|') | Set-Content $path\$outfile

Get-Content $path\$infile -ReadCount $batch |
    ForEach {
             $_ -match $match_regex -NotMatch $header_regex -Replace $replace_regex ,'$1' | Out-File $path\$outfile -Append
    }

That's a compromise between memory usage and speed. The -match and -replace operators work on arrays, so you can filter and replace an entire array at once without having to foreach through every record. -ReadCount causes the file to be read in chunks of $batch records, so you're basically reading in 1000 records at a time, doing the match and replace on that batch, then appending the result to your output file before going back for the next 1000. Increasing $batch should speed things up but will use more memory; adjust it to suit your resources.
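
To see why this helps, note that PowerShell's comparison operators switch to filtering mode when the left-hand side is an array. A minimal sketch with made-up sample lines (the patterns here are illustrative, not the real ones from the script above):

$lines = '|col1|col2|', '|a|b|', 'garbage', '|col1|col2|', '|c|d|'
[regex]$data = '^\|.+\|.+\|'                  # keeps pipe-delimited rows
[regex]$hdr  = [regex]::Escape('|col1|col2|') # escaped header text

# With an array on the left: -match keeps the matching elements, -notmatch
# drops the repeated header rows, and -replace strips the outer pipes from
# every remaining element, all in one pass with no explicit loop.
$lines -match $data -notmatch $hdr -replace '^\|(.+)\|$', '$1'
# Output:
#   a|b
#   c|d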


#2 (4 votes)

The Get-Content cmdlet does not perform as well as a StreamReader when dealing with very large files. You can read a file line by line using a StreamReader like this:


$path = 'C:\A-Very-Large-File.txt'
$r = [IO.File]::OpenText($path)
while ($r.Peek() -ge 0) {
    $line = $r.ReadLine()
    # Process $line here...
}
$r.Dispose()

Some performance comparisons:


Measure-Command {Get-Content .\512MB.txt > $null}

Total Seconds: 49.4742533


Measure-Command {
    $r = [IO.File]::OpenText('512MB.txt')
    while ($r.Peek() -ge 0) {
        $r.ReadLine() > $null
    }
    $r.Dispose()
}

Total Seconds: 27.666803

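Applied to the question's task, the same approach can be paired with a StreamWriter so that only one line is ever held in memory. A rough, untested sketch reusing $path, $infile, and $outfile from the question; the filter regex is borrowed from answer #1, and the header handling is my assumption, not part of this answer:

$reader = [IO.File]::OpenText("$path\$infile")
$writer = New-Object IO.StreamWriter("$path\$outfile", $false, [Text.Encoding]::ASCII)
$header = $null
while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine()
    # Skip rows with fewer than three pipe-delimited fields (the statistics
    # box, separator rules, and page headers in the excerpt all fail this).
    if ($line -notmatch '^\|.+\|.+\|.+') { continue }
    $line = $line.Trim('|')
    if ($null -eq $header) {
        $header = $line              # first matching row is the header
        $writer.WriteLine($line)     # write it exactly once
    } elseif ($line -ne $header) {
        $writer.WriteLine($line)     # data row; repeated header rows dropped
    }
}
$reader.Dispose()
$writer.Dispose()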

#3 (3 votes)

This is almost a non-answer...I love PowerShell...but I will not use it to parse log files, especially large log files. Use Microsoft's Log Parser.


C:\>type input.txt | logparser "select substr(field1,1) from STDIN" -i:TSV -nskiplines:14 -headerrow:off -iseparator:spaces -o:tsv -headers:off -stats:off
