How can I select random files from a directory in bash?

I have a directory with about 2000 files. How can I select a random sample of N files using either a bash script or a list of piped commands?

11 answers

#1

Here's a script that uses GNU sort's random option:

ls | sort -R | tail -n "$N" | while IFS= read -r file; do
    # Something involving $file, or you can leave
    # off the while to just get the filenames
    printf '%s\n' "$file"
done

#2

You can use shuf (from the GNU coreutils package) for that. Just feed it a list of file names and ask it to return the first line from a random permutation:

ls dirname | shuf -n 1
# probably faster and more flexible:
find dirname -type f | shuf -n 1
# etc..

Adjust the -n, --head-count=COUNT value to return the number of wanted lines. For example to return 5 random filenames you would use:

find dirname -type f | shuf -n 5
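
The pipelines above are newline-delimited, so a file name that itself contains a newline would be read as two names. Here is a minimal sketch of a NUL-delimited variant, assuming GNU find and a GNU shuf that supports -z. The temporary directory, the file names and the array name picked are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Demo directory with awkward names (hypothetical, for illustration only).
dir=$(mktemp -d); trap 'rm -rf "$dir"' EXIT
touch "$dir/plain.txt" "$dir/with space.txt" "$dir/with"$'\n'"newline.txt" "$dir/d.txt"

# Read NUL-delimited names from shuf into an array, one element per file.
picked=()
while IFS= read -r -d '' f; do
    picked+=( "$f" )
done < <(find "$dir" -type f -print0 | shuf -z -n 2)

# %q shows each name in shell-quoted form, so odd characters stay visible.
printf 'picked: %q\n' "${picked[@]}"
```

Collecting the names in an array keeps each one intact for later use as "${picked[@]}".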

#3

Here are a few possibilities that don't parse the output of ls and that are 100% safe regarding files with spaces and funny symbols in their name. All of them will populate an array randf with a list of random files. This array is easily printed with printf '%s\n' "${randf[@]}" if needed.

This one will possibly output the same file several times, and N needs to be known in advance. Here I chose N=42.

```
a=( * )
randf=( "${a[RANDOM%${#a[@]}]"{1..42}"}" )
```
This feature is not very well documented.

If N is not known in advance, but you really liked the previous possibility, you can use eval. But it's evil, and you must really make sure that N doesn't come directly from user input without being thoroughly checked!

```
N=42
a=( * )
eval randf=( \"\${a[RANDOM%\${#a[@]}]\"\{1..$N\}\"}\" )
```
I personally dislike eval and hence this answer!

The same using a more straightforward method (a loop):

```
N=42
a=( * )
randf=()
for((i=0;i<N;++i)); do
    randf+=( "${a[RANDOM%${#a[@]}]}" )
done
```

If you don't want the same file to appear more than once:

N=42
a=( * )
randf=()
for((i=0;i<N && ${#a[@]};++i)); do
    ((j=RANDOM%${#a[@]}))
    randf+=( "${a[j]}" )
    a=( "${a[@]:0:j}" "${a[@]:j+1}" )
done
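
One caveat with a=( * ): in an empty directory the pattern stays unexpanded, so the array would hold a literal '*'. The sketch below is a variation on the loop above rather than part of the original answer. It turns on nullglob so that an empty directory gives an empty array, and asks for more files than exist to show the loop stopping early. The temporary directory and file names are made up for the demonstration.

```shell
#!/usr/bin/env bash
shopt -s nullglob      # without this, an empty directory gives a=( '*' )

# Throwaway demo directory (hypothetical names, for illustration only).
dir=$(mktemp -d); trap 'rm -rf "$dir"' EXIT
cd "$dir" || exit 1
touch one two three

N=5                    # deliberately larger than the number of files
a=( * )
randf=()
for (( i = 0; i < N && ${#a[@]} > 0; ++i )); do
    j=$(( RANDOM % ${#a[@]} ))           # random index into what is left
    randf+=( "${a[j]}" )
    a=( "${a[@]:0:j}" "${a[@]:j+1}" )    # drop the chosen element
done
printf '%s\n' "${randf[@]}"
```

With only three files present, randf ends up holding all three in random order, each exactly once.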

Note. This is a late answer to an old post, but the accepted answer links to an external page that shows terrible bash practice, and the other answer is not much better as it also parses the output of ls. A comment to the accepted answer points to an excellent answer by Lhunath which obviously shows good practice, but doesn't exactly answer the OP.

#4

ls | shuf -n 10 # ten random files

#5

If you have Python installed (works with either Python 2 or Python 3):

To select one file (or line from an arbitrary command), use

ls -1 | python -c "import sys; import random; print(random.choice(sys.stdin.readlines()).rstrip())"

To select N files/lines, use (note N is at the end of the command, replace this by a number)

ls -1 | python -c "import sys; import random; print(''.join(random.sample(sys.stdin.readlines(), int(sys.argv[1]))).rstrip())" N

#6

This is an even later response to @gniourf_gniourf's late answer, which I just upvoted because it's by far the best answer, twice over. (Once for avoiding eval and once for safe filename handling.)

But it took me a few minutes to untangle the "not very well documented" feature(s) this answer uses. If your Bash skills are solid enough that you saw immediately how it works, then skip this comment. But I didn't, and having untangled it I think it's worth explaining.

Feature #1 is the shell's own file globbing. a=(*) creates an array, $a, whose members are the files in the current directory. Bash understands all the weirdnesses of filenames, so that list is guaranteed correct, guaranteed escaped, etc. No need to worry about properly parsing textual file names returned by ls.

Feature #2 is Bash parameter expansions for arrays, one nested within another. This starts with ${#ARRAY[@]}, which expands to the length of $ARRAY.

That expansion is then used to subscript the array. The standard way to get a random number between 0 and N-1 is to take a random number modulo N. We want a random index between 0 and one less than the length of our array. Here's the approach, broken into two lines for clarity's sake:

LENGTH=${#a[@]}
# Store the result under another name: assigning to RANDOM reseeds the generator.
RANDOM_FILE=${a[RANDOM%LENGTH]}

But this solution does it in a single line, removing the unnecessary variable assignment.
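
To make the index arithmetic concrete, here is a small sketch that checks RANDOM modulo the array length always lands on a valid subscript. The array contents and the variable names length, idx, ok and pick are placeholders chosen for this illustration.

```shell
#!/usr/bin/env bash
a=( alpha "beta gamma" delta )     # stand-in for a=( * )
length=${#a[@]}

ok=1
for (( t = 0; t < 1000; ++t )); do
    idx=$(( RANDOM % length ))     # always in 0 .. length-1
    (( idx >= 0 && idx < length )) || ok=0
done

pick=${a[RANDOM % length]}         # same idea, done inside the subscript
echo "in range: $ok, sample pick: $pick"
```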

Feature #3 is Bash brace expansion, although I have to confess I don't entirely understand it. Brace expansion is used, for instance, to generate a list of 25 file names, filename1.txt, filename2.txt, etc: echo "filename"{1..25}".txt".

The expression inside the array assignment above, "${a[RANDOM%${#a[@]}]"{1..42}"}", uses that trick to produce 42 separate expansions. The brace expansion places a number in between the ] and the }, which at first I thought was subscripting the array, but if so it would be preceded by a colon. (It would also have returned 42 consecutive items from a random spot in the array, which is not at all the same thing as returning 42 random items from the array.) I think it's just making the shell run the expansion 42 times, thereby returning 42 random items from the array. (But if someone can explain it more fully, I'd love to hear it.)

The reason N has to be hardcoded (to 42) is that brace expansion happens before variable expansion.
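
The ordering is easy to check directly. In this small sketch (the variable names literal and variable are chosen just for the illustration), the braces are processed while $N is still unexpanded text, so the range is never generated:

```shell
#!/usr/bin/env bash
N=3
literal=$(echo {1..3})     # brace expansion sees a literal range
variable=$(echo {1..$N})   # braces are processed before $N is substituted
echo "$literal"            # prints: 1 2 3
echo "$variable"           # prints: {1..3}
```

This is why the eval variant, or a plain loop, is needed when N is held in a variable.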

Finally, here's Feature #4, if you want to do this recursively for a directory hierarchy:

shopt -s globstar
a=( ** )

This turns on a shell option (available in Bash 4.0 and later) that causes ** to match recursively. Now your $a array contains every file and directory in the entire hierarchy.
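
Since ** also matches directories, a random pick from it may not be a regular file. Here is a minimal sketch that filters the matches down to regular files before choosing one. The temporary directory tree and the names files, path and randf are assumptions made for this example.

```shell
#!/usr/bin/env bash
shopt -s globstar nullglob     # globstar needs bash 4.0 or later

# Throwaway demo tree (hypothetical names, for illustration only).
dir=$(mktemp -d); trap 'rm -rf "$dir"' EXIT
cd "$dir" || exit 1
mkdir -p sub/deeper
touch top.txt sub/mid.txt "sub/deeper/low file.txt"

files=()
for path in **; do
    [[ -f $path ]] && files+=( "$path" )   # skip directories
done

randf=${files[RANDOM % ${#files[@]}]}
echo "picked: $randf"
```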

#7

A simple solution for selecting 5 random files without parsing the output of ls. It also works with file names containing spaces, newlines and other special characters:

shuf -ezn 5 * | xargs -0 -n1 echo

Replace echo with the command you want to execute for your files.

#8

This is the only script I can get to play nice with bash on MacOS. I combined and edited snippets from the following two links:

ls command: how can I get a recursive full-path listing, one line per file?

http://www.linuxquestions.org/questions/linux-general-1/is-there-a-bash-command-for-picking-a-random-file-678687/

#!/bin/bash

# Reads a given directory and picks a random file.

# The directory you want to use. You could use "$1" instead if you
# wanted to parametrize it.
DIR="/path/to/"
# DIR="$1"

# Internal Field Separator set to newline, so file names with
# spaces do not break our script.
IFS='
'

if [[ -d "${DIR}" ]]
then
  # Runs ls on the given dir, and dumps the output into a matrix,
  # it uses the new lines character as a field delimiter, as explained above.
  #  file_matrix=($(ls -LR "${DIR}"))

  file_matrix=($(ls -R $DIR | awk '; /:$/&&f{s=$0;f=0}; /:$/&&!f{sub(/:$/,"");s=$0;f=1;next}; NF&&f{ print s"/"$0 }'))
  num_files=${#file_matrix[*]}

  # This is the command you want to run on a random file.
  # Change "ls -l" by anything you want, it's just an example.
  ls -l "${file_matrix[$((RANDOM%num_files))]}"
fi

exit 0

#9

MacOS does not have the sort -R and shuf commands, so I needed a bash only solution that randomizes all files without duplicates and did not find that here. This solution is similar to gniourf_gniourf's solution #4, but hopefully adds better comments.

The script should be easy to modify to stop after N samples, using either a counter with if or gniourf_gniourf's for loop with N. Because $RANDOM only ranges from 0 to 32767, this is limited to about 32000 files, but that should do for most cases.

#!/bin/bash

array=(*)  # this is the array of files to shuffle
# echo ${array[@]}
for dummy in "${array[@]}"; do  # do loop length(array) times; once for each file
    length=${#array[@]}
    randomi=$(( $RANDOM % $length ))  # select a random index

    filename=${array[$randomi]}
    echo "Processing: '$filename'"  # do something with the file

    unset -v "array[$randomi]"  # set the element at index $randomi to NULL
    array=("${array[@]}")  # remove NULL elements introduced by unset; copy array
done
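
As one possible way to stop after N samples, here is a sketch that swaps in a different technique: a partial Fisher-Yates shuffle, which exchanges elements in place instead of unsetting and copying the array on each pass. It uses only bash built-ins, so it does not rely on shuf or sort -R. The array contents, the value of N and the names len, tmp and sample are assumptions for the illustration.

```shell
#!/usr/bin/env bash
array=( f1 f2 f3 f4 f5 f6 )    # stand-in for array=(*)
N=3                            # how many files to take (assumed value)

len=${#array[@]}
(( N > len )) && N=$len        # cannot take more than there are
for (( i = 0; i < N; ++i )); do
    j=$(( i + RANDOM % (len - i) ))   # random index in i .. len-1
    tmp=${array[i]}                   # swap elements i and j
    array[i]=${array[j]}
    array[j]=$tmp
done
sample=( "${array[@]:0:N}" )          # first N slots hold the sample
printf '%s\n' "${sample[@]}"
```

Each pass fixes one more position, so the loop does N swaps and never returns the same file twice.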

#10

I use this. It uses a temporary file, and descends into the directory tree until it finds a regular file, which it returns.

# find a quasi-random file in a directory tree:

# directory to start search from:
ROOT="/";  

tmp=/tmp/mytempfile    
TARGET="$ROOT"
FILE=""; 
n=
r=
while [ -e "$TARGET" ]; do 
    TARGET="$(readlink -f "${TARGET}/$FILE")" ; 
    if [ -d "$TARGET" ]; then
      ls -1 "$TARGET" 2> /dev/null > $tmp || break;
      n=$(cat $tmp | wc -l); 
      if [ $n != 0 ]; then
        FILE=$(shuf -n 1 $tmp)
# or if you dont have/want to use shuf:
#       r=$(($RANDOM % $n)) ; 
#       FILE=$(tail -n +$(( $r + 1 ))  $tmp | head -n 1); 
      fi ; 
    else
      if [ -f "$TARGET"  ] ; then
        rm -f $tmp
        echo "$TARGET"
        break;
      else 
        # is not a regular file, restart:
        TARGET="$ROOT"
        FILE=""
      fi
    fi
done;

#11

How about a Perl solution slightly doctored from Mr. Kang over here:
How can I shuffle the lines of a text file on the Unix command line or in a shell script?

$ ls | perl -MList::Util=shuffle -e '@lines = shuffle(<>); print @lines[0..4]'
