How can I search a text file for a pattern in Python, combining regex with string/file operations, and store every instance of the pattern?

Time: 2022-09-13 07:48:21

So essentially I'm looking for specifically a 4 digit code within two angle brackets within a text file. I know that I need to open the text file and then parse line by line, but I am not sure the best way to go about structuring my code after checking "for line in file".

I think I can split, strip, or partition the line, but I also wrote a regex and compiled it with re.compile. If searching with it returns a match object, I don't think I can use that with those string-based operations. I'm also not sure whether my regex is greedy enough...

I'd like to store all instances of those found hits as strings within either a tuple or a list.

Here is my regex:

regex = re.compile("(<(\d{4,5})>)?")

I don't think I need to include much more code, considering it's fairly basic so far.

2 Answers

#1


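import re

# Raw string so \d reaches the regex engine unchanged; the group captures only the digits
pattern = re.compile(r"<(\d{4,5})>")

# The with block closes the file once the loop finishes
with open('test.txt') as f:
    for i, line in enumerate(f, start=1):
        for match in pattern.finditer(line):
            print('Found on line %s: %s' % (i, match.groups()))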

A couple of notes about the regex:

  • You don't need the ? at the end or the outer (...) if you only want to capture the number itself rather than the number together with the angle brackets.
  • It matches either 4 or 5 digits between the angle brackets; use \d{4} instead of \d{4,5} if you want exactly four.
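
To make the two notes concrete, here is a small sketch comparing the question's pattern with the simplified one. The sample string and the variable names are made up for illustration; only the regexes come from the thread.

```python
import re

# Hypothetical sample text, not taken from the question
sample = "a <1234> b <56789> c <123> d"

# Pattern from the question: the whole group is optional, so it also
# matches the empty string wherever no code starts.
optional = re.compile(r"(<(\d{4,5})>)?")
# Simplified pattern: one capture group, no trailing "?".
required = re.compile(r"<(\d{4,5})>")
# Variant that accepts exactly four digits.
exactly_four = re.compile(r"<(\d{4})>")

print(required.findall(sample))      # ['1234', '56789']
print(exactly_four.findall(sample))  # ['1234']

# With two groups, findall returns tuples, and most of them are empty here,
# so the real hits have to be filtered out by hand.
hits = [pair for pair in optional.findall(sample) if pair[0]]
print(hits)                          # [('<1234>', '1234'), ('<56789>', '56789')]
```

With the simplified pattern, findall already gives a plain list of strings, which is what the question asks to store.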

Update: It's important to understand that the match and capture in a regex can be quite different. The regex in my snippet above matches the pattern with angle brackets, but I ask to capture only the internal number, without the angle brackets.
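
A minimal sketch of that difference, using a made-up input string: group(0) is the whole match and group(1) is only what the parentheses captured.

```python
import re

pattern = re.compile(r"<(\d{4,5})>")
# Hypothetical input line, for illustration only
m = pattern.search("error code <4021> raised")

print(m.group(0))  # '<4021>'   the whole match, brackets included
print(m.group(1))  # '4021'     only the captured digits
print(m.groups())  # ('4021',)  all capture groups as a tuple
print(m.span())    # (11, 17)   where the whole match sits in the string
```

So a match object is not a string itself, but group() hands back ordinary strings that can be stored in a list or passed to any string method.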

#2


Doing it in one bulk read:

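import re

# filename is assumed to hold the path to your text file
with open(filename, 'r') as textfile:
    filetext = textfile.read()
# No trailing "?": an optional group would also match the empty string everywhere
matches = re.findall(r"<(\d{4,5})>", filetext)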

Line by line:

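import re

# filename is assumed to hold the path to your text file
matches = []
reg = re.compile(r"<(\d{4,5})>")
with open(filename, 'r') as textfile:
    for line in textfile:
        # findall returns the captured digit strings found on this line
        matches += reg.findall(line)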

But again, the list this returns holds only the matched text, not where each hit occurred, so it is of little use beyond collecting or counting the codes unless you add an offset counter:

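import re

# filename is assumed to hold the path to your text file
matches = []
offset = 0  # character offset of the current line's start within the file
reg = re.compile(r"<(\d{4,5})>")
with open(filename, 'r') as textfile:
    for line in textfile:
        # Pair this line's hits with the offset where the line begins
        matches.append((reg.findall(line), offset))
        offset += len(line)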

But it still just makes more sense to read the whole file in at once.
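
As a rough sketch of the whole-file approach, re.finditer gives each hit's position directly, so no manual offset counter is needed. The function names find_codes and find_codes_in_file, and the sample text, are hypothetical and not from the answers above.

```python
import re

CODE = re.compile(r"<(\d{4,5})>")

def find_codes(text):
    """Return (code, offset, line_number) for every hit in text."""
    hits = []
    for m in CODE.finditer(text):
        # Count the newlines before the match to get a 1-based line number
        line_number = text.count("\n", 0, m.start()) + 1
        hits.append((m.group(1), m.start(), line_number))
    return hits

def find_codes_in_file(path):
    """Read the whole file at once and search it in a single pass."""
    with open(path, encoding="utf-8") as f:
        return find_codes(f.read())

# Hypothetical sample standing in for the file's contents
sample = "first <1234>\nnothing here\n<56789> and <0007>\n"
for code, offset, line_number in find_codes(sample):
    print(f"line {line_number}, offset {offset}: {code}")
```

Reading everything in one go assumes the file fits comfortably in memory; for very large files the line-by-line loop is the safer choice.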
