How to find all regular expression matches in a file

Date: 2021-08-05 05:15:59

I have a list of about 2000 regular expressions and over a million HTML files. For every file, I want to check whether each regular expression matches. How can I do this in PowerShell?

Performance is important, so I don't want to loop through regular expressions.

I tried:

$text | Select-String -Pattern pattern1, pattern2,...

This returns all matches, but I also want to know which patterns matched and which did not. I need to build a list of the matching regular expressions for each file.

2 Answers

#1


2  

You could try something like this:

$regex = '^test', 'e2$'  # Or use (Get-Content <path to your regex file>)
$ht = @{}

# Adjust Get-ChildItem to your criteria (filter, path, recurse, etc.)
Get-ChildItem -Filter *.txt | Select-String -Pattern $regex | ForEach-Object {
    # Each match carries the pattern that produced it; collect the patterns per file path
    $ht[$_.Path] += @($_.Pattern)
}

Test output:

$ht | Format-Table -AutoSize

Name                                               Value
----                                               -----
C:\Users\graimer\Desktop\New Text Document (2).txt {e2$}
C:\Users\graimer\Desktop\New Text Document.txt     {^test, e2$}

You didn't specify how you wanted the output.

UPDATE: When given several patterns, Select-String reports only the first one that matches a given line. To catch multiple patterns matching the same line, run the patterns one at a time (mjolinor's answer is probably faster than this):

$regex = '^test', 'e2$'  # Or use (Get-Content <path to your regex file>)
$ht = @{}

# Run one pattern at a time so a line that matches several patterns is recorded for each
$regex | ForEach-Object {
    $pattern = $_
    # Adjust Get-ChildItem to your criteria (filter, path, recurse, etc.)
    Get-ChildItem -Filter *.txt | Select-String -Pattern $pattern | ForEach-Object {
        $ht[$_.Path] += @($_.Pattern)
    }
}

UPDATE2: I don't have enough samples to test it, but since you have such a huge number of files, you might want to try reading each file into memory before looping through the patterns. It may be faster.

$regex = '^test', 'e2$'  # Or use (Get-Content <path to your regex file>)
$ht = @{}

# Adjust Get-ChildItem to your criteria (filter, path, recurse, etc.)
Get-ChildItem -Filter *.txt | ForEach-Object {
    $text = $_ | Get-Content   # Read the file into memory once
    $filename = $_.FullName
    $regex | ForEach-Object {
        # Test the in-memory lines against one pattern at a time
        $text | Select-String -Pattern $_ | ForEach-Object {
            $ht[$filename] += @($_.Pattern)
        }
    }
}
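
The read-once idea can also be sketched with precompiled .NET regex objects, so that each pattern is parsed only once instead of once per file. This is only a sketch and not part of the original answer: the function name `Get-MatchingPattern` is hypothetical, `IgnoreCase` is an assumption made to mirror Select-String's default case-insensitive matching, and whether `Compiled` pays off for about 2000 patterns would need measuring.

```powershell
# Hypothetical helper: returns the patterns that match at least one line of a file.
function Get-MatchingPattern {
    param(
        [string]$Path,
        [regex[]]$CompiledRegex
    )
    # Read the file once; every pattern is tested against the same in-memory lines.
    $lines = [System.IO.File]::ReadAllLines($Path)
    foreach ($re in $CompiledRegex) {
        foreach ($line in $lines) {
            if ($re.IsMatch($line)) {
                $re.ToString()   # Regex.ToString() returns the pattern text
                break            # one hit is enough for this pattern
            }
        }
    }
}

$patterns = '^test', 'e2$'   # Or use (Get-Content <path to your regex file>)
$options  = [System.Text.RegularExpressions.RegexOptions]'IgnoreCase, Compiled'
$compiled = foreach ($p in $patterns) {
    New-Object System.Text.RegularExpressions.Regex -ArgumentList $p, $options
}

$ht = @{}
# Adjust Get-ChildItem to your criteria (filter, path, recurse, etc.)
Get-ChildItem -Filter *.txt | ForEach-Object {
    $ht[$_.FullName] = @(Get-MatchingPattern -Path $_.FullName -CompiledRegex $compiled)
}
```

Unlike the snippets above, each pattern appears at most once per file here, because the inner loop stops at the first matching line.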

#2


1  

I don't see any way around doing a foreach through the regex collection.

This is the best I could come up with performance-wise:

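$regexes = 'pattern1', 'pattern2'
$files = Get-ChildItem -Path <file path> |
    Select-Object -ExpandProperty FullName

$ht = @{}

foreach ($file in $files)
{
    # One list of matching patterns per file
    $ht[$file] = New-Object System.Collections.ArrayList
    foreach ($regex in $regexes)
    {
        # -Quiet returns $true when the pattern is found instead of MatchInfo objects
        if (Select-String -Pattern $regex -Path $file -Quiet)
        {
            [void]$ht[$file].Add($regex)
        }
    }
}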

$ht

You could speed up the process by using background jobs and dividing up the file collection among the jobs.
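
A rough sketch of that background-job idea, assuming the files are dealt round-robin into a fixed number of chunks. The `*.html` filter, the `$jobCount` value, and the names `$work` and `$chunk` are illustrative assumptions, not something from the original answer, and the actual speed-up has not been measured.

```powershell
# Assumed inputs: adjust the patterns, the file filter, and the job count to your setup.
$regexes  = 'pattern1', 'pattern2'
$files    = @(Get-ChildItem -Filter *.html | Select-Object -ExpandProperty FullName)
$jobCount = 4   # hypothetical value; tune it to the number of CPU cores

# Work done inside each job: test every pattern against every file in the chunk.
$work = {
    param($chunk, $patterns)
    foreach ($file in $chunk) {
        $hits = @($patterns | Where-Object {
            Select-String -Pattern $_ -LiteralPath $file -Quiet
        })
        [pscustomobject]@{ Path = $file; Patterns = $hits }
    }
}

# Deal the files round-robin into $jobCount chunks; start one job per non-empty chunk.
$jobs = for ($i = 0; $i -lt $jobCount; $i++) {
    $chunk = @(for ($j = $i; $j -lt $files.Count; $j += $jobCount) { $files[$j] })
    if ($chunk.Count -gt 0) {
        Start-Job -ScriptBlock $work -ArgumentList $chunk, $regexes
    }
}

# Collect the per-file results from all jobs into one hashtable.
$ht = @{}
if ($jobs) {
    $jobs | Wait-Job | Receive-Job | ForEach-Object { $ht[$_.Path] = @($_.Patterns) }
    $jobs | Remove-Job
}

$ht
```

Each job runs in its own process, so the start-up cost only pays off when every chunk holds a large number of files.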
