Regex: a character occurring exactly x times

I'm working in bash and I have a large file from which I want to remove all the lines that do not match a certain regex, probably using $ grep -e "<regex>" <file> > output.txt

What I want to keep is any line that contains a specified character exactly x times. For example, from the binary sequences

0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111

I would like to keep only those that contain exactly two 1s, leaving me with

0011, 0101, 0110, 1001, 1010, 1100

I would then use a bash variable to vary the count I need. It is always exactly half of the line length, and all the strings have the same length: I'm literally looking for lines that are half 0s and half 1s.

I have this right now. It's not using regex. It works, but is very slow:

($1 is the length of every string, $d is just a directory)

sed -e 's/\(.\)/\1 /g' < $d/input.txt > $d/spaces.txt
awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print c}' $d/spaces.txt > $d/sums.txt
grep -n "$(($1/2))" $d/sums.txt | cut -f1 -d: > $d/linenums.txt
for i in $(cat $d/linenums.txt)
do
    sed "${i}q;d" $d/input.txt 
done > $d/valids.txt

In case you wonder, this puts a space after every digit, turning 1010 into 1 0 1 0. It then adds the digits of each line together and saves the results in sums.txt, greps for length/2 and saves only the matching line numbers in linenums.txt, and finally reads linenums.txt and writes the corresponding lines of input.txt to valids.txt.

I need something quicker; the for loop is what's taking way too long.

Thanks for your time and for sharing your knowledge with me.

1 solution

#1

You can definitely make this faster.

Here is a grep regex example that matches any line with exactly two occurrences of 1:

grep '^\([^1]*1[^1]*\)\{2\}$' input.txt
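
As a quick sanity check, the same pattern can be run against the sixteen 4-bit strings from the question. This is only a sketch: the function name exactly_two_ones and the variable result are made up for illustration, and the input is fed through a pipe instead of input.txt.

```shell
# Hypothetical helper: print the lines of stdin that contain exactly two 1s.
# Each repetition of the group consumes exactly one 1 plus any surrounding
# non-1 characters, and the anchors force the whole line to be covered.
exactly_two_ones() {
    grep '^\([^1]*1[^1]*\)\{2\}$'
}

# Feed all sixteen 4-bit strings, one per line, through the filter.
result=$(printf '%s\n' 0000 0001 0010 0011 0100 0101 0110 0111 \
                       1000 1001 1010 1011 1100 1101 1110 1111 | exactly_two_ones)
printf '%s\n' "$result"
```

This should print the six strings listed in the question: 0011, 0101, 0110, 1001, 1010, 1100.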

You can generalize this to match exactly n occurrences of c:

grep "^\([^$c]*$c[^$c]*\)\{$n\}\$" input.txt
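
If the filter is needed in more than one place, the generalized command could be wrapped in a small function. The sketch below assumes that c is a single ordinary character (not one with a special meaning inside a bracket expression, such as ^, ], \ or -) and that n is at least 1; the name exactly_n and the sample data are hypothetical.

```shell
# Hypothetical helper: keep the lines of stdin with exactly n occurrences of c.
# Usage: exactly_n CHAR COUNT < file
# Assumption: CHAR is one ordinary character and COUNT is a positive integer.
exactly_n() {
    local c=$1 n=$2
    grep "^\([^$c]*$c[^$c]*\)\{$n\}\$"
}

# Keep only the 5-character strings that contain exactly three 0s.
three_zeros=$(printf '%s\n' 00011 01010 11111 00001 10100 | exactly_n 0 3)
printf '%s\n' "$three_zeros"
```

Here 11111 (no 0s) and 00001 (four 0s) are dropped, and 00011, 01010 and 10100 are kept.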

You also mentioned wanting to match lines that are half 0s and half 1s. Since you stipulated that all the lines have the same length, you only need to look at the first line, and can use awk (or wc) to get its length and choose n:

n=$(head -n1 input.txt | awk '{printf "%d\n", length($0)/2}')
c=1
grep "^\([^$c]*$c[^$c]*\)\{$n\}\$" input.txt
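
As an alternative that is not part of the answer above, the counting could also be done in a single awk pass, which compares the number of 1s with the length of each line and so does not depend on every line having the same length. This is a sketch only; the function name half_ones and the sample data are made up.

```shell
# Hypothetical helper: keep the non-empty lines of stdin in which exactly
# half of the characters are 1.
half_ones() {
    # gsub returns the number of replacements it made, which is the number
    # of 1s; it runs on a copy (s) so the original line is printed unchanged.
    awk '{ s = $0; ones = gsub(/1/, "", s) } length($0) && ones * 2 == length($0)'
}

balanced=$(printf '%s\n' 0011 0111 101010 1 110100 | half_ones)
printf '%s\n' "$balanced"
```

With this input, 0011, 101010 and 110100 are kept. Whether this is faster than the grep version on a large file would need to be measured.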
