How do I run a script on all *.txt files in the current directory? [duplicate]

Date: 2021-06-11 07:08:48

I am trying to run the script below on all *.txt files in the current directory. Currently it processes only test.txt and prints a block of text based on a regular expression. What is the quickest way to scan the current directory for *.txt files and run the script on every file found? Also, how can I include the lines containing 'word1' and 'word3'? At the moment the script prints only the content between those two lines, and I would like to print the whole block.

#!/usr/bin/env python
import re

filename = 'test.txt'
with open(filename) as fp:
    # re.S lets '.' match newlines, so a block may span several lines
    for result in re.findall('word1(.*?)word3', fp.read(), re.S):
        print(result)

I would appreciate any advice on how to improve the code above, e.g. its speed when running on a large set of text files. Thank you.

2 Answers

#1


6  

Use glob.glob:

import glob
import re

# Compile once, reuse for every file
pattern = re.compile('word1(.*?)word3', flags=re.S)
for filename in glob.glob('*.txt'):
    with open(filename) as fp:
        for result in pattern.findall(fp.read()):
            print(result)
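
The loop above prints only the text between the two markers, because findall() returns the contents of the capture group. The question also asks for the lines containing 'word1' and 'word3' themselves. One possible way, sketched here under the assumption of Python 3, is to drop the group and stretch the match to the line boundaries. The names `BLOCK`, `find_blocks`, `scan` and the sample text are made up for illustration.

```python
import glob
import re

# No capture group, so findall() returns the whole match.
# The leading [^\n]* pulls in the start of the line holding word1,
# the trailing [^\n]* pulls in the rest of the line holding word3.
BLOCK = re.compile(r'^[^\n]*word1.*?word3[^\n]*', flags=re.S | re.M)

def find_blocks(text):
    """Return every block from the word1 line through the word3 line."""
    return BLOCK.findall(text)

def scan(pattern='*.txt'):
    """Yield (filename, block) pairs for files matching the glob pattern."""
    for name in sorted(glob.glob(pattern)):
        with open(name) as fp:
            for block in find_blocks(fp.read()):
                yield name, block

# Small demonstration on an in-memory sample instead of real files
sample = 'intro\nfoo word1 bar\nmiddle\nbaz word3 qux\noutro\n'
for block in find_blocks(sample):
    print(block)
```

To process real files, iterate over `scan()` (or `scan('*.txt')`) and print each block as it arrives.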

#2


0  

Inspired by falsetru's answer, I rewrote my code to make it more generic.

Now the files to explore:

  • can be described either by a string passed as the second argument, which is handed to glob(),
    or by a function written for the purpose, in case the set of desired files can't be described with a glob pattern

  • and may be in the current directory if no third argument is passed,
    or in a specified directory if its path is passed as the third argument

import re, glob
from os import getcwd, listdir, path

# Matches whole lines: from the line containing word1 to the line containing word3
regx = re.compile('^[^\n]*word1.*?word3.*?$', re.S|re.M)

G = '\n\n'\
    'MWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMW\n'\
    'MWMWMW  %s\n'\
    'MWMWMW  %s\n'\
    '%s%s'

def search(REGX, how_to_find_files, dirpath='',
           G=G, sepm='\n======================\n'):
    if dirpath == '':
        dirpath = getcwd()

    if callable(how_to_find_files):
        # Build full paths so isfile() and open() also work outside the cwd
        gen = (p for p in (path.join(dirpath, n) for n in listdir(dirpath))
               if path.isfile(p) and how_to_find_files(p))
    elif isinstance(how_to_find_files, str):
        gen = glob.glob(path.join(dirpath, how_to_find_files))
    else:
        raise TypeError('how_to_find_files must be a function or a glob string')

    for fn in gen:
        with open(fn) as fp:
            found = REGX.findall(fp.read())
            if found:
                yield G % (dirpath, path.basename(fn),
                           sepm, sepm.join(found))

# Example of searching in .txt files

#============ one use ===================
def select(fn):
    return fn[-4:] == '.txt'
print(''.join(search(regx, select)))

#============= another use ==============
print(''.join(search(regx, '*.txt')))

The advantage of chaining the processing of several files through a succession of generators is that the final ''.join() builds a single string that is written in one go. Printing several individual strings one after the other takes longer, because each print interrupts the program to display its output.
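
The difference between the two output strategies can be sketched as follows. This is only an illustration, assuming Python 3: `chunks` is a hypothetical stand-in for the search() generator, and `write_joined` and `write_each` are made-up helper names. Both strategies produce identical text; whether the single write is noticeably faster depends on how the output stream is buffered, so measure before relying on it.

```python
import io
import sys

def chunks():
    """Hypothetical generator standing in for search(): yields report pieces."""
    for i in range(3):
        yield 'block %d\n' % i

def write_joined(pieces, out=sys.stdout):
    """Build one string from all pieces and write it in a single call."""
    out.write(''.join(pieces))

def write_each(pieces, out=sys.stdout):
    """Write every piece separately, one call per piece."""
    for piece in pieces:
        out.write(piece)

# Capture both outputs in memory to compare them
a, b = io.StringIO(), io.StringIO()
write_joined(chunks(), a)
write_each(chunks(), b)
assert a.getvalue() == b.getvalue()
```

Note that ''.join() holds the whole report in memory before anything is written, which may matter for a very large set of files.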
