Does Python's sys.argv limit the maximum number of arguments?

Date: 2023-02-01 23:16:41

I have a Python script that needs to process a large number of files. To get around Linux's relatively small limit on the number of arguments that can be passed to a command, I am using find -print0 with xargs -0.

I know another option would be to use Python's glob module, but that won't help when I have a more advanced find command, looking for modification times, etc.

When running my script on a large number of files, Python only accepts a subset of the arguments, a limitation I first thought was in argparse, but appears to be in sys.argv. I can't find any documentation on this. Is it a bug?

Here's a sample Python script illustrating the point:

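import argparse
import os
import sys

# Collect every positional argument as an input file.
parser = argparse.ArgumentParser()
parser.add_argument('input_files', nargs='+')
args = parser.parse_args(sys.argv[1:])

# Compare what sys.argv received with what argparse parsed.
# The %-formatted call prints the same line under Python 2 and 3.
print('pid: %d argv files %d number of files: %d'
      % (os.getpid(), len(sys.argv[1:]), len(args.input_files)))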

I have a lot of files to run this on:

$ find ~/ -name "*" -print0 | xargs -0 ls > filelist
$ wc -l filelist
748709 filelist

But it appears xargs or Python is chunking my big list of files and processing it with several different Python runs:

$ find ~/ -name "*" -print0 | xargs -0 python test.py
pid: 4216 argv files 1819 number of files: 1819
pid: 4217 argv files 1845 number of files: 1845
pid: 4218 argv files 1845 number of files: 1845
pid: 4219 argv files 1845 number of files: 1845
pid: 4220 argv files 1845 number of files: 1845
pid: 4221 argv files 1845 number of files: 1845
...

Why are multiple processes being created to process the list? Why is it being chunked at all? I don't think there are newlines in the file names and shouldn't -print0 and -0 take care of that issue? If there were newlines, I'd expect sed -n '1810,1830p' filelist to show some weirdness for the above example. What gives?

I almost forgot:

$ python -V
Python 2.7.2+

4 Answers

#1


7  

xargs will chunk your arguments by default. Have a look at the --max-args and --max-chars options of xargs. Its man page also explains the limits (under --max-chars).
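
The chunking described above can be imitated in a few lines of Python. This is only a rough sketch of the idea, not how xargs is actually implemented: the function name batch_args and its parameters are made up for illustration, and real xargs also accounts for the environment and other overhead when it sizes a command line.

```python
def batch_args(args, max_chars, max_args=None):
    """Greedily pack args into batches, roughly the way xargs splits input.

    max_chars and max_args loosely mirror the xargs options --max-chars and
    --max-args. This is a simplification: an argument longer than max_chars
    simply gets a batch of its own here.
    """
    batches, current, used = [], [], 0
    for arg in args:
        cost = len(arg.encode()) + 1  # +1 for the terminating NUL byte
        over_chars = used + cost > max_chars
        over_count = max_args is not None and len(current) >= max_args
        if current and (over_chars or over_count):
            # The current command line is full: start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(arg)
        used += cost
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    names = ["file%04d.txt" % i for i in range(10)]
    for batch in batch_args(names, max_chars=40):
        print(len(batch), batch)
```

Each batch corresponds to one invocation of the command, which is why the question's output shows several pids, each with a similar number of files.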

#2


3  

Everything that you want from find is available from os.walk.

Don't use find and the shell for any of this.

Use os.walk and write all your rules and filters in Python.
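
As a minimal sketch of that approach, assuming the filter you need is "modified within the last N seconds" (the function name find_recent and its parameters are hypothetical, chosen only for this example):

```python
import os
import time


def find_recent(root, max_age_seconds):
    """Yield paths under root that were modified within max_age_seconds."""
    cutoff = time.time() - max_age_seconds
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                # The file vanished, is a broken symlink, or is unreadable.
                continue
            if mtime >= cutoff:
                yield path


if __name__ == "__main__":
    # Example: everything under the home directory changed in the last day.
    count = sum(1 for _ in find_recent(os.path.expanduser("~"), 24 * 60 * 60))
    print("recently modified files:", count)
```

Because the paths are produced inside a single Python process, no argument list is ever built, so neither the xargs chunking nor the operating system limit comes into play.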

"looking for modification times" means that you'll be using os.stat or some similar library function.

#3


2  

Python does not seem to place a limit on the number of arguments, but the operating system does.

Have a look here for a more comprehensive discussion.
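
If you want to see the operating system's limit from Python, something like the following should work on Linux and most other Unix-like systems. It is a sketch under that assumption; the helper names arg_max and environ_bytes are invented for this example, and the value reported is system-dependent.

```python
import os


def arg_max():
    """Return the system's ARG_MAX in bytes, or None if it is unavailable."""
    try:
        return os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing (e.g. Windows) or the name is not supported.
        return None


def environ_bytes():
    """Approximate bytes used by the environment, which shares the limit."""
    # Each entry is stored as KEY=VALUE plus a terminating NUL byte.
    return sum(
        len(os.fsencode(key)) + len(os.fsencode(value)) + 2
        for key, value in os.environ.items()
    )


if __name__ == "__main__":
    print("ARG_MAX:", arg_max())
    print("environment size:", environ_bytes())
```

The limit covers the arguments and the environment together, so the room left for file names is smaller than the raw ARG_MAX figure.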

#4


1  

xargs will pass as much as it can, but there's still a limit. For instance,

find ~/ -name "*" -print0 | xargs -0 wc -l | grep total

will give you multiple lines of output.

You probably want to have your script either take a file containing a list of filenames, or accept filenames on its stdin.
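
Here is one way the stdin variant could look, assuming Python 3 (the question uses 2.7, where sys.stdin.buffer and os.fsdecode are not available) and assuming the names arrive NUL-terminated from find -print0. The function names are illustrative only.

```python
import os
import sys


def read_nul_delimited(stream, chunk_size=65536):
    """Yield filenames from a binary stream of NUL-terminated paths."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last NUL is complete; keep the remainder.
        *complete, pending = pending.split(b"\0")
        for raw in complete:
            if raw:
                yield os.fsdecode(raw)
    if pending:
        # Tolerate input that lacks a final terminator.
        yield os.fsdecode(pending)


def main():
    """Count the filenames received on standard input in one process."""
    files = list(read_nul_delimited(sys.stdin.buffer))
    print("pid:", os.getpid(), "files from stdin:", len(files))
```

With main() called from the script's entry point, the pipeline becomes find ~/ -name "*" -print0 | python3 script.py, with no xargs involved, so a single process sees every name regardless of how many there are.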
