How to use sed/grep to extract text between two words?

Date: 2022-09-13 11:37:19

I am trying to output a string that contains everything between two words of a string:

input:

"Here is a String"

output:

"is a"

Using:

sed -n '/Here/,/String/p'

includes the endpoints, but I don't want to include them.

10 answers

#1


66  

sed -e 's/Here\(.*\)String/\1/'

#2


110  

GNU grep can do this on its own: with -P (Perl-compatible regular expressions) it supports positive and negative look-ahead and look-behind. For your case, the command would be:

 echo "Here is a string" | grep -o -P '(?<=Here).*(?=string)'

#3


24  

You can strip strings in Bash alone:

$ foo="Here is a String"
$ foo=${foo##*Here }
$ echo "$foo"
is a String
$ foo=${foo%% String*}
$ echo "$foo"
is a
$

And if you have a GNU grep that includes PCRE, you can use a zero-width assertion:

$ echo "Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a

#4


20  

The accepted answer does not remove text that could be before Here or after String. This will:

sed -e 's/.*Here\(.*\)String.*/\1/'

The main difference is the addition of .* immediately before Here and after String.
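
To see the difference side by side, here is a small sketch. The sample line and the two function names are made up for illustration; the sed expressions are the ones quoted above. The brackets in the output only make the leading and trailing spaces visible.

```shell
# Hypothetical sample input with text before "Here" and after "String".
line='prefix Here is a String suffix'

# Accepted answer: only the matched part is replaced, so the surrounding text survives.
unanchored() { sed -e 's/Here\(.*\)String/\1/'; }

# This answer: the leading and trailing .* swallow the surrounding text as well.
anchored() { sed -e 's/.*Here\(.*\)String.*/\1/'; }

printf '[%s]\n' "$(printf '%s\n' "$line" | unanchored)"   # prints: [prefix  is a  suffix]
printf '[%s]\n' "$(printf '%s\n' "$line" | anchored)"     # prints: [ is a ]
```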

#5


16  

With GNU awk:

$ echo "Here is a string" | awk -v FS="(Here|string)" '{print $2}'
 is a 
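
The output above keeps the spaces around "is a". One possible way to drop them, sketched here as an assumption rather than part of the original answer, is to fold the spaces next to each marker into the field separator. The name between_awk is made up, and the marker words are hard-coded for illustration.

```shell
# Sketch: treat each marker, plus any spaces touching it, as the field separator.
between_awk() { awk -F' *(Here|string) *' '{ print $2 }'; }

# Brackets make it visible that no surrounding spaces are left.
printf '[%s]\n' "$(echo "Here is a string" | between_awk)"   # prints: [is a]
```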

grep with the -P (--perl-regexp) option supports \K, which discards everything matched so far from the reported match. In our case, the previously matched string was Here, so it is dropped from the final output.

$ echo "Here is a string" | grep -oP 'Here\K.*(?=string)'
 is a 
$ echo "Here is a string" | grep -oP 'Here\K(?:(?!string).)*'
 is a 

If you want the output to be "is a" without the surrounding spaces, you could try the following:

$ echo "Here is a string" | grep -oP 'Here\s*\K.*(?=\s+string)'
is a
$ echo "Here is a string" | grep -oP 'Here\s*\K(?:(?!\s+string).)*'
is a

#6


15  

If you have a long file with many multi-line occurrences, it is useful to number the lines first:

cat -n file | sed -n '/Here/,/String/p'
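
The range above still prints the marker lines themselves, which is what the question wanted to avoid. When whole lines between the markers are needed without the endpoint lines, a flag-based awk filter is one option. This is only a sketch: inner_lines is a made-up name, and it swaps in awk instead of sed.

```shell
# Print only the lines strictly between a line matching Here and the next
# line matching String; the marker lines themselves are skipped.
inner_lines() {
  awk '/String/ { inside = 0 }   # closing marker: stop before printing it
       inside                    # print while we are between the markers
       /Here/   { inside = 1 }   # opening marker: start with the next line
      ' "$@"
}

printf '%s\n' 'a' 'Here' 'b' 'c' 'String' 'd' | inner_lines   # prints b and c, one per line
```

The order of the three rules matters: the flag is cleared before the print rule and set after it, so neither marker line is printed.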

#7


6  

This might work for you (GNU sed):

sed '/Here/!d;s//&\n/;s/.*\n//;:a;/String/bb;$!{n;ba};:b;s//\n&/;P;D' file 

This prints each stretch of text between the two markers (in this instance Here and String) on its own line and preserves newlines within the text.

#8


3  

All of the above solutions fall short when the closing search string is repeated elsewhere in the input. I found it best to write a bash function.

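    function str_str {
      local str
      # Drop everything up to and including the first occurrence of $2.
      str="${1#*"$2"}"
      # Drop everything from the first occurrence of $3 onward.
      str="${str%%"$3"*}"
      # printf avoids the portability quirks of echo -n.
      printf '%s' "$str"
    }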

    # test it ...
    mystr="this is a string"
    str_str "$mystr" "this " " string"
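
To illustrate the repeated-marker problem, here is a minimal sketch with a made-up input in which " string" occurs twice. It compares a greedy sed expression with the same parameter expansions used in the function.

```shell
# Hypothetical input in which the closing marker " string" occurs twice.
s='this is a string and another string'

# Greedy sed: the captured .* runs up to the LAST " string".
greedy=$(printf '%s\n' "$s" | sed 's/.*this \(.*\) string.*/\1/')

# Parameter expansion: # removes the shortest matching prefix and %% the
# longest matching suffix, so the result ends at the FIRST " string".
shortest=${s#*this }
shortest=${shortest%% string*}

printf '[%s]\n' "$greedy" "$shortest"
# prints: [is a string and another]
#         [is a]
```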

#9


1  

You can use \1 (refer to http://www.grymoire.com/Unix/Sed.html#uh-4):

echo "Hello is a String" | sed 's/Hello\(.*\)String/\1/g'

The text matched inside the escaped parentheses \( and \) is captured and can be referenced as \1.

#10


0  

Problem. My stored Claws Mail messages are wrapped as follows, and I am trying to extract the Subject lines:

Subject: [SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular
 link in major cell growth pathway: Findings point to new potential
 therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is
 Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as
 a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway
 identified [Lysosomal amino acid transporter SLC38A9 signals arginine
 sufficiency to mTORC1]]
Message-ID: <20171019190902.18741771@VictoriasJourney.com>

Per answer #2 in this thread (How to use sed/grep to extract text between two words?), the first expression below "works" as long as the matched text does not contain a newline:

grep -o -P '(?<=Subject: ).*(?=molecular)' corpus/01

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key

However, despite trying numerous variants (.+?; /s; ...), I could not get these to work:

grep -o -P '(?<=Subject: ).*(?=link)' corpus/01
grep -o -P '(?<=Subject: ).*(?=therapeutic)' corpus/01
etc.

Solution 1.

Per the question "Extract text between two strings on different lines":

sed -n '/Subject: /{:a;N;/Message-ID:/!ba; s/\n/ /g; s/\s\s*/ /g; s/.*Subject: \|Message-ID:.*//g;p}' corpus/01

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]                              

Solution 2.*

Per the question "How can I replace a newline (\n) using sed?":

sed ':a;N;$!ba;s/\n/ /g' corpus/01

will replace newlines with a space.
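
For readers who find the one-liner opaque, here is the same script with one command per -e option and a note on each step. The wrapper name join_lines and the sample input are made up; the sed commands themselves are unchanged.

```shell
# join_lines is a made-up name; the sed commands are the ones from the one-liner.
#   :a        define a label named "a"
#   N         append the next input line to the pattern space
#   $!ba      unless this is the last line, jump back to label "a"
#   s/\n/ /g  once all lines are collected, turn each newline into a space
join_lines() {
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' "$@"
}

printf '%s\n' 'one' ' two' 'three' | join_lines   # prints: one  two three
```

Note that the whole input ends up in the pattern space, so this is probably best kept to files of modest size.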

Chaining that with answer #2 in this thread, we get:

sed ':a;N;$!ba;s/\n/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

which gives

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular  link in major cell growth pathway: Findings point to new potential  therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is  Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as  a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway  identified [Lysosomal amino acid transporter SLC38A9 signals arginine  sufficiency to mTORC1]] 

This variant removes double spaces:

sed ':a;N;$!ba;s/\n/ /g; s/\s\s*/ /g' corpus/01 | grep -o -P '(?<=Subject: ).*(?=Message-ID:)'

giving

[SLC38A9 lysosomal arginine sensor; mTORC1 pathway] Key molecular link in major cell growth pathway: Findings point to new potential therapeutic target in pancreatic cancer [mTORC1 Activator SLC38A9 Is Required to Efflux Essential Amino Acids from Lysosomes and Use Protein as a Nutrient] [Re: Nutrient sensor in key growth-regulating metabolic pathway identified [Lysosomal amino acid transporter SLC38A9 signals arginine sufficiency to mTORC1]]
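
A possible alternative that does not depend on Message-ID: being the next header is to rely on the folding convention itself: continuation lines of a folded header begin with whitespace. The following awk-based sketch makes that assumption; extract_subject is a made-up name and I have only reasoned it through on small inputs like the one shown.

```shell
# Collect the Subject header and its continuation lines (lines that start
# with a space or tab), joining them with single spaces.
extract_subject() {
  awk '
    /^Subject: /        { sub(/^Subject: /, ""); subj = $0; in_subj = 1; next }
    in_subj && /^[ \t]/ { sub(/^[ \t]+/, " "); subj = subj $0; next }
    in_subj             { print subj; in_subj = 0 }
    END                 { if (in_subj) print subj }
  ' "$@"
}

printf '%s\n' 'Subject: Key molecular' ' link in pathway' 'Message-ID: <x>' | extract_subject
# prints: Key molecular link in pathway
```

With a file argument, for example extract_subject corpus/01, it reads the file instead of standard input.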
