Search multiple .txt files to find all occurrences of a string?

Date: 2023-01-14 11:08:47

I am trying to create a tool that will search 300+ .txt files for a string that may be used several times in each file.

I want to go through each file and, for every occurrence, get the text between the delimiters (for example, the argument inside each print(...) call).

It sounds a bit twisted, I know. I have been scratching my head for hours while testing code.

What I have tried

I read through each file and check whether it contains my search text at least once. If it does, I add the full path of that file to a list:

    Dim FileNamesList As New List(Of String)
    Dim occurList As New List(Of String)

    Dim textSearch As String = TextBox1.Text.ToLower

    'check each file to see if it even contains textbox1.text
    'if it does, then add matching files to list
    For Each f As FileInfo In dir.GetFiles("*.txt")

        Dim tmpRead = File.ReadAllText(f.FullName).ToLower

        Dim tIndex As Integer = tmpRead.IndexOf(textSearch)

        If tIndex > -1 Then
            FileNamesList.Add(f.FullName)

        End If

    Next

Then I thought: now all I need to do is go through each path in that 'approved' files list and add the entire contents of each file to a new list.

Then I go through each entry in that list and get the string between two delimiters.

And... I just get lost from there...

Here is the get-string-between-delimiters function I have tried using:

Private Function GetStringBetweenTags(ByVal startIdentifer As String, ByVal endIndentifier As String, ByVal textsource As String) As String
    Dim idLength As Integer = startIdentifer.Length

    Dim s As String = textsource

    Try

        s = s.Substring(s.IndexOf(startIdentifer) + idLength)
        s = s.Substring(0, s.IndexOf(endIndentifier))
        'MsgBox(s)

    Catch
    End Try
    Return s
End Function

In simple terms...

  • I have 300 .txt files

  • Some may contain a string that I am after

  • I want the substring from each occurrence of that string

Normally I am fine and never need to ask questions, but there are too many forceptions going on.

Logical Example

== Table.txt ==

print("I am tony")
print("pineapple")
print("brown cows")
log("cable ties")
log("bad ocd")
log("bingo")

== Cherry.txt ==

print("grapes")
print("pie")
print("apples")
log("laugh")
log("tuna")
log("gonuts")

== Tower.txt ==

print("tall")
print("clouds")
print("nomountain")
log("goggles?")
log("kuwait")
log("india")

I want to end up with a list of the text inside only the print calls from all 3 files.

I haven't found any other thread about this, probably because it's stupid.

So I should end up with:

== ResultList ==

    I am tony
    pineapple
    brown cows
    grapes
    pie
    apples
    tall
    clouds
    nomountain

2 answers

#1


RegEx is probably your best choice for something like this. For instance:

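' Requires: Imports System.IO and Imports System.Text.RegularExpressions
' filePaths is assumed to be a collection of the .txt file paths to search
Dim results As New List(Of String)()
Dim r As New Regex("print\(""(.*)""\)")
For Each path As String In filePaths
    Dim contents As String = File.ReadAllText(path)
    For Each m As Match In r.Matches(contents)
        If m.Success Then
            ' Groups(1) is the text captured between the quotes
            results.Add(m.Groups(1).Value)
        End If
    Next
Next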

As you can see, the code loops through a list of file paths. For each one, it loads the entire contents of the file into a string. It then searches the file contents string for all matches to the following regular expression pattern: print\("(.*)"\). It then loops through all of those pattern matches and grabs the value of the first capture group from each one. Those are added to the results list, which contains your desired strings. Here's the meaning of the parts of the RegEx:

  • print - Matches the literal text "print"

  • \( - The next character after the word "print" must be an opening parenthesis (the backslash is an escape character)

  • " - The next character after the opening parenthesis must be a double-quote character (in the VB source it is written twice, "", to escape it so that VB doesn't treat it as the end of the string literal).

  • (.*) - The parentheses define this as a capturing group (so that we can pull out just this value from the matches). The .* means zero or more of any character except a newline, and it is greedy, so it matches as much of the line as it can.

  • "\) - Matching strings must end with a double quote followed by a closing parenthesis.
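
One thing worth checking about the (.*) group: because .* is greedy, a line that holds more than one print call is captured as a single long match. The sketch below compares that pattern with a negated character class that stops at the first closing quote. The module name GreedyDemo, the helper Captures, and the sample line are made up for illustration and are not part of the original answer.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module GreedyDemo

    ' Returns the value of capture group 1 for every match of pattern in text.
    Function Captures(ByVal pattern As String, ByVal text As String) As List(Of String)
        Dim found As New List(Of String)()
        For Each m As Match In Regex.Matches(text, pattern)
            found.Add(m.Groups(1).Value)
        Next
        Return found
    End Function

    Sub Main()
        ' Hypothetical input: two print calls on the same line.
        Dim sample As String = "print(""pie"") print(""apples"")"

        ' Greedy .* runs on to the last quote-plus-parenthesis on the line,
        ' so both calls end up inside one capture.
        For Each s As String In Captures("print\(""(.*)""\)", sample)
            Console.WriteLine("greedy:  " & s)
        Next

        ' The negated class (anything except a double quote) stops at the
        ' first closing quote, giving one capture per call.
        For Each s As String In Captures("print\(""([^""]*)""\)", sample)
            Console.WriteLine("negated: " & s)
        Next
    End Sub

End Module
```

With this sample line, the greedy pattern should give a single capture, pie") print("apples, while the negated class gives pie and apples separately. If every print call sits on its own line, as in the example files in the question, both patterns return the same results.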

#2


Use Regex:

Imports System.Text.RegularExpressions
Module Module1

    Sub Main()
        Dim input1 As String = _
            "print(""I am tony"") " + _
            "print(""pineapple"") " + _
            "print(""brown cows"") " + _
            "log(""cable ties"") " + _
            "log(""bad ocd"") " + _
            "log(""bingo"")"

        Dim input2 As String = _
            "print(""grapes"") " + _
            "print(""pie"") " + _
            "print(""apples"") " + _
            "log(""laugh"") " + _
            "log(""tuna"") " + _
            "log(""gonuts"")"

        Dim input3 As String = _
           "print(""tall"") " + _
           "print(""clouds"") " + _
           "print(""nomountain"") " + _
           "log(""goggles?"") " + _
           "log(""kuwait"") " + _
           "log(""india"")"

        Dim pattern As String = "print\(""([^""]*)""\)"
        Dim expr As Regex = New Regex(pattern, RegexOptions.Singleline)
        Dim matches As MatchCollection = Nothing
        Dim data As List(Of String) = New List(Of String)()

        ' Run the same pattern over each input and collect capture group 1
        For Each source As String In New String() {input1, input2, input3}
            matches = expr.Matches(source)
            For Each mat As Match In matches
                data.Add(mat.Groups(1).Value)
            Next mat
        Next source

        ' Print the collected strings, one per line
        For Each item As String In data
            Console.WriteLine(item)
        Next item

    End Sub

End Module
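
The example above runs the pattern over hard-coded strings. To apply it to the .txt files in a folder, as the question asks, something like the following might work. It is only a sketch: the module name PrintExtractor, the function ExtractPrintArguments, and the choice to take the folder from the first command-line argument are assumptions for illustration, not part of either answer.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text.RegularExpressions

Module PrintExtractor

    ' Captures the text between print(" and ") without crossing a quote.
    Private ReadOnly PrintCall As New Regex("print\(""([^""]*)""\)")

    ' Collects the captured text from every .txt file directly inside folder.
    ' Paths are sorted so that the order of the results is predictable.
    Function ExtractPrintArguments(ByVal folder As String) As List(Of String)
        Dim paths As New List(Of String)(Directory.GetFiles(folder, "*.txt"))
        paths.Sort(StringComparer.OrdinalIgnoreCase)

        Dim results As New List(Of String)()
        For Each filePath As String In paths
            Dim contents As String = File.ReadAllText(filePath)
            For Each m As Match In PrintCall.Matches(contents)
                results.Add(m.Groups(1).Value)
            Next
        Next
        Return results
    End Function

    Sub Main(ByVal args As String())
        ' Assumption: the folder is the first argument, else the current directory.
        Dim folder As String = If(args.Length > 0, args(0), ".")
        For Each s As String In ExtractPrintArguments(folder)
            Console.WriteLine(s)
        Next
    End Sub

End Module
```

Reading each file once and collecting every match in that same pass avoids the separate "does this file contain the text at all" step from the question, since a file with no print calls simply contributes no matches.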
