sed find and replace in a FASTQ file with a regex

Date: 2021-09-11 08:43:13

I have a file such as this:

$ head testSed.fastq
@M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:NGTCACTN+TATCCTCTCTTGAAGA
NGTCACTN
+
#>AAAAF#
@M01551:51:000000000-BCB7H:1:1101:15605:1331 1:N:0:NATCAGCN+TAGATCGCCAAGTTAA
NATCAGCN
+
#>>AA?C#
@M01551:51:000000000-BCB7H:1:1101:15557:1332 1:N:0:NCAGCAGN+TATCTTCTATAAATAT
NCAGCAGN

I am trying to use a regular expression to replace the string after the final colon with 0 on every header line (lines 1, 5, and 9 in this example).

I have checked my regex with egrep: egrep '[ATGCN]{8}\+[ATGCN]{16}$' testSed.fastq returns all the lines I would expect.

However, when I try sed -i 's/[ATGCN]{8}\+[ATGCN]{16}$/0/g' testSed.fastq, the original file is unchanged and no replacement occurs.

How can I fix this? Is my regex not specific enough?

2 answers

#1

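Your regex is written as an ERE (extended regular expression), but sed interprets patterns as BREs (basic regular expressions) by default. In a BRE, {8} is literal text rather than a bound, so the pattern never matches. Not all sed implementations support ERE, so check man sed in your environment for a -r or -E option. Alternatively, stay with BRE and write the bounds with a backslash before each curly brace, as in \{8\}; in that case leave the + unescaped, because GNU sed treats \+ in a BRE as a repetition operator.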
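As a minimal sketch of both fixes, the commands below run the question's pattern against one header line from the sample, first as an ERE and then as a BRE. The variable names are illustrative only, and the -E flag is assumed to be available (recent GNU and BSD sed both accept it).

```shell
# One header line copied from the question's sample.
line='@M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:NGTCACTN+TATCCTCTCTTGAAGA'

# ERE: with -E, {8} is a bound and \+ is a literal plus sign.
ere_out=$(printf '%s\n' "$line" | sed -E 's/[ATGCN]{8}\+[ATGCN]{16}$/0/')

# BRE: bounds are written \{8\}, and an unescaped + is a literal plus sign.
bre_out=$(printf '%s\n' "$line" | sed 's/[ATGCN]\{8\}+[ATGCN]\{16\}$/0/')

# Both lines should end in ":0".
printf '%s\n%s\n' "$ere_out" "$bre_out"
```

Once the output looks right, the same expression can be passed to sed -i to modify the file itself.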

That said, rather than matching the exact text of the last field, why not match a colon followed by no further colons up to the end of the line? The following RE is valid as both a BRE and an ERE.

$ sed '/^@/s/:[^:]*$/:0/' testq
@M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:0
NGTCACTN
+
#>AAAAF#
@M01551:51:000000000-BCB7H:1:1101:15605:1331 1:N:0:0
NATCAGCN
+
#>>AA?C#
@M01551:51:000000000-BCB7H:1:1101:15557:1332 1:N:0:0
NCAGCAGN
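The command above only prints the result, whereas the question used -i. A hedged sketch of the same substitution applied in place, keeping a backup copy, might look like this. The scratch file created by mktemp is a stand-in for the real FASTQ file, and the attached-suffix form -i.bak is assumed here because, as far as I know, both GNU and BSD sed accept it.

```shell
# Hypothetical scratch file holding one record from the question's sample.
fq=$(mktemp)
printf '%s\n' \
  '@M01551:51:000000000-BCB7H:1:1101:15800:1330 1:N:0:NGTCACTN+TATCCTCTCTTGAAGA' \
  'NGTCACTN' \
  '+' \
  '#>AAAAF#' > "$fq"

# Edit in place; the untouched original is kept as "$fq.bak".
sed -i.bak '/^@/s/:[^:]*$/:0/' "$fq"

# Only the header line should have changed.
cat "$fq"
```

If the result is wrong, the original can be restored by moving the .bak file back over the edited one.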

#2

Do you need a regex for this?

awk -F: -v OFS=: '/^@/ {$NF = "0"} 1' testfile

That won't edit the file in place. If you have GNU awk (4.1 or later), you can use its inplace extension:

gawk -F: -v OFS=: -i inplace '...' file

ref: https://www.gnu.org/software/gawk/manual/html_node/Extension-Sample-Inplace.html
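Where GNU awk is not available, one common fallback is to write to a temporary file and move it over the original. The sketch below assumes that approach; the file names come from mktemp and are placeholders, not part of the answer above.

```shell
# Hypothetical scratch file standing in for the real FASTQ file.
fq=$(mktemp)
printf '%s\n' \
  '@M01551:51:000000000-BCB7H:1:1101:15557:1332 1:N:0:NCAGCAGN+TATCTTCTATAAATAT' \
  'NCAGCAGN' > "$fq"

# Write the edited copy to a temp file, then replace the original
# only if awk exited successfully.
tmp=$(mktemp)
awk -F: -v OFS=: '/^@/ {$NF = "0"} 1' "$fq" > "$tmp" && mv "$tmp" "$fq"

cat "$fq"
```

Note that the replaced file takes on the temporary file's permissions, which may differ from the original's.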
