Regex to match all lines after a specific string

Time: 2022-09-13 16:27:55

Possible duplicate of "Regex - find all lines after a match", although my need is a little different.

I want to parse a plain text file with multiple date/value data separated by specific strings. I want to skip the first half of the file until a specific line where I want to match the results.

Here is an example of the file in question (including the messy mix of tabs and spaces):

   I dont want to capture the following measures. This  text is     on a single line and        contains tabs and spaces    is also ends with this token : Token1
05/01/1969         0.01846  
15/01/1969         0.16730  
25/01/1969         0.33988  
05/04/1969         0.81319  
15/04/1969         0.76973  
25/11/2011             0.24210
05/12/2011             0.25220
15/12/2011             0.31160
25/12/2011             0.36845
            End :  bla bla bla
   This text        is also on a single line        and marks the beginning of a new series of      results. These are the results that I want. it also ends with the following         token : Token2
05/01/1969       109.46333  
15/01/1969       110.06998       118.18000
25/01/1969       110.82954  
05/02/1969       111.51394       118.83000
25/02/1969       112.36483  
05/10/2011       114.38798       114.31000
05/10/2011           114.31000       114.38798       114.38798       114.38798       114.38798       114.38798       114.38798
25/12/2011           112.64000       112.41261       112.86301       113.25494       114.06421       115.93219       116.38780
05/01/2012               112.22834       112.92301       113.40561       114.78823       116.62931       117.43421
05/09/2012               110.01410       112.16391       112.88199       115.23640       117.04756       118.04632
15/09/2012               109.97572       112.00809       112.70266       114.91247       116.65256       117.57412
25/09/2012               109.93967       111.87272       112.53305       114.60381       116.26935       117.12756 
            End :  Marks the    end of          the      file

What I want to do is match every line after the line that ends with Token2. I have tried different solutions from other similar questions, but none worked. I ended up matching all the results in the file and considered splitting the file before applying the following pattern. Is there a pure regex solution to this?

Here is the pattern, with named capture groups, that works for the whole file:

(?P<date>\d\d\/\d\d\/\d\d\d\d)\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*){0,1}[\t ]*(?P<prev_no_rain>\d+\.*\d*){0,1}[\t ]*(?P<prev_10_dry>\d+\.*\d*){0,1}[\t ]*(?P<prev_20_dry>\d+\.*\d*){0,1}[\t ]*(?P<prev_50>\d+\.*\d*){0,1}[\t ]*(?P<prev_20_wet>\d+\.*\d*){0,1}[\t ]*(?P<prev_10_wet>\d+\.*\d*){0,1}

Regex101 link : https://regex101.com/r/a0mCZ2/3

1 Answer

#1


You may leverage the \G anchor, which matches either at the start of the string (a case that can be excluded with the negative lookahead (?!\A)) or at the position where the previous successful match ended. With (?:\G(?!\A)|\bToken2[\r\n]+), we tell the regex engine to first find the whole word Token2 at the end of a line (together with the line break characters), and then to match the subsequent subpatterns only if they follow in immediate succession.

A regex that can be used:

(?:\G(?!\A)[\r\n]*|Token2[\r\n]+)\K(?P<date>\d\d\/\d\d\/\d{4})\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*)?[\t ]*(?P<prev_no_rain>\d+(?:\.\d+)*)?[\t ]*(?P<prev_10_dry>\d+\.*\d*)?[\t ]*(?P<prev_20_dry>\d+\.*\d*)?[\t ]*(?P<prev_50>\d+\.*\d*)?[\t ]*(?P<prev_20_wet>\d+\.*\d*)?[\t ]*(?P<prev_10_wet>\d+\.*\d*)?

See the regex demo. Note I replaced {0,1} with ? to shorten it a bit.

The part you are interested in is (?:\G(?!\A)[\r\n]*|Token2[\r\n]+)\K.

  • (?:\G(?!\A)[\r\n]*|Token2[\r\n]+) - one of two alternatives:
    • \G(?!\A)[\r\n]* - the end of the previous successful match, followed by 0+ line break characters
    • | - or
    • Token2[\r\n]+ - Token2 followed by 1+ CR or LF characters. (If you need to match Token2 as a whole word, you might add \b before it.)
  • \K - omit the text matched so far from the returned match.
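If you are working in Python, note that the built-in re module does not accept \G or \K, so the pattern above needs a PCRE-style engine (I believe the third-party regex package supports both, but check its documentation). The "continue only where the previous match ended" behaviour can be approximated with Pattern.match(text, pos), which only succeeds at pos. Below is a minimal sketch of that idea, not the regex above: ROW is a simplified, hypothetical stand-in for the full named-group pattern, and rows_after_token is a name made up for illustration.

```python
import re

# Hypothetical simplified row: a date, then one or more numbers, then end of line.
ROW = re.compile(r"(\d\d/\d\d/\d{4})((?:[ \t]+\d+(?:\.\d+)?)+)[ \t]*(?:\r?\n|$)")
# The line that ends with Token2 (trailing blanks tolerated, an assumption).
START = re.compile(r"Token2[ \t]*\r?\n")


def rows_after_token(text, start=START, row=ROW):
    """Yield (date, [values]) for the consecutive rows right after the token line."""
    m = start.search(text)
    if m is None:
        return
    pos = m.end()
    while True:
        # match() with a pos argument succeeds only AT pos, much like \G in PCRE.
        r = row.match(text, pos)
        if r is None:
            break  # chain broken: the first line that is not a data row
        yield r.group(1), [float(v) for v in r.group(2).split()]
        pos = r.end()


if __name__ == "__main__":
    sample = (
        "skip this series : Token1\n"
        "05/01/1969         0.01846  \n"
        "   End :  bla bla bla\n"
        "wanted series : Token2\n"
        "05/01/1969       109.46333  \n"
        "15/01/1969       110.06998       118.18000\n"
        "   End :  end of the file\n"
    )
    for date, values in rows_after_token(sample):
        print(date, values)
```

As with the \G chain, the loop stops at the first line that is not a data row (here the "End :" line), so rows belonging to an earlier series are never visited.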

The rest, (?P<date>\d\d\/\d\d\/\d{4})\s*(?P<simul>\d+\.*\d*)[\t ]*(?P<observ>\d+\.*\d*)?[\t ]*(?P<prev_no_rain>\d+(?:\.\d+)*)?[\t ]*(?P<prev_10_dry>\d+\.*\d*)?[\t ]*(?P<prev_20_dry>\d+\.*\d*)?[\t ]*(?P<prev_50>\d+\.*\d*)?[\t ]*(?P<prev_20_wet>\d+\.*\d*)?[\t ]*(?P<prev_10_wet>\d+\.*\d*)?, is your pattern, which I did not modify much. It matches a line with specific data (note that the fact that it matches a whole line justifies the use of [\r\n]* after \G(?!\A)).
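For comparison, here is a rough sketch of the split-first route mentioned in the question, written for Python's built-in re module. It is not the \G...\K regex: it locates the Token2 line with one search and then runs a line pattern with finditer from that offset. The group names are taken from the question; the slightly tightened number pattern NUM, the FIELDS list, and the parse_after helper are assumptions made for illustration.

```python
import re

# Optional columns, in the order used by the question's named groups.
FIELDS = ["observ", "prev_no_rain", "prev_10_dry", "prev_20_dry",
          "prev_50", "prev_20_wet", "prev_10_wet"]
NUM = r"\d+(?:\.\d+)?"  # assumption: plain decimals such as 109.46333

# date, mandatory first value, then up to seven optional values on the same line.
LINE = re.compile(
    r"(?P<date>\d\d/\d\d/\d{4})[ \t]+(?P<simul>" + NUM + ")"
    + "".join(r"(?:[ \t]+(?P<%s>%s))?" % (name, NUM) for name in FIELDS)
)


def parse_after(text, token="Token2"):
    """Return one dict of named groups per data line found after the token line."""
    start = re.search(re.escape(token) + r"[ \t]*\r?\n", text)
    if start is None:
        return []
    # finditer(text, pos) begins scanning at pos, so earlier series are skipped.
    return [m.groupdict() for m in LINE.finditer(text, start.end())]


if __name__ == "__main__":
    sample = (
        "skip this series : Token1\n"
        "05/01/1969         0.01846  \n"
        "wanted series : Token2\n"
        "05/01/1969       109.46333  \n"
        "15/01/1969       110.06998       118.18000\n"
    )
    for row in parse_after(sample):
        print(row["date"], row["simul"], row["observ"])
```

Unlike the \G chain, this version keeps scanning to the end of the text, so it only fits files where the wanted series is the last one, as in the example file.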
