Get text between multiple strings on the same line

Date: 2022-10-25 00:02:44

I would like to use bash to extract the text that lies between two strings in a file. There are already some answers to this, e.g.:

Print text between two strings on the same line

But I would like to do this for multiple occurrences, sometimes on the same line and sometimes on separate lines. For example, starting with a file like this:

\section{The rock outcrop pools experimental system} \label{intro:rockpools}
contain pools at their summit \parencite{brendonck_pools_2010} that have weathered into the rock over time \parencite{bayly_aquatic_2011} through chemical weathering after water collecting at the rock surface \parencite{lister_microgeomorphology_1973}.
Classification depends on dimensions \parencite{twidale_gnammas_1963}.

I would like to retrieve:

brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

I imagine sed should be able to do this but I'm not sure where to start.

3 solutions

#1


1  

This two-stage extraction might be easier to understand, and it avoids Perl regex.

$ grep -o "parencite{[^}]*}" cite | sed 's/parencite{//;s/}//'
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

Or, as always, awk to the rescue!

$ awk -F'[{}]' -v RS=" " '/parencite/{print $2}' cite
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963
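
To try the two-stage approach end to end, here is a minimal sketch that writes the question's sample text to a scratch file and runs each stage separately. The variable names (`cite`, `stage1`, `keys`) and the use of `mktemp` are illustrative assumptions, not part of the answer above. It also assumes a grep that supports `-o` (GNU and BSD grep do).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch copy of the sample text from the question (hypothetical temp file).
cite=$(mktemp)
trap 'rm -f "$cite"' EXIT

cat > "$cite" <<'EOF'
\section{The rock outcrop pools experimental system} \label{intro:rockpools}
contain pools at their summit \parencite{brendonck_pools_2010} that have weathered into the rock over time \parencite{bayly_aquatic_2011} through chemical weathering after water collecting at the rock surface \parencite{lister_microgeomorphology_1973}.
Classification depends on dimensions \parencite{twidale_gnammas_1963}.
EOF

# Stage 1: -o prints every match on its own line, so several
# citations on one input line become several output lines.
stage1=$(grep -o 'parencite{[^}]*}' "$cite")

# Stage 2: strip the "parencite{" prefix and the closing brace.
keys=$(printf '%s\n' "$stage1" | sed 's/parencite{//;s/}//')

printf '%s\n' "$keys"
```

Splitting the stages this way makes it easy to inspect the intermediate matches (`stage1`) when the pattern needs adjusting.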

#2


1  

Using grep -oP:

grep -oP '\\parencite\{\K[^}]+' file
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

Or using gnu-awk:

awk -v FPAT='\\\\parencite{[^}]+' '{for (i=1; i<=NF; i++) {
    sub(/\\parencite{/, "", $i); print $i}}' file
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

#3


0  

This might work for you (GNU sed):

sed '/\\parencite{\([^}]*\)}/!d;s//\n\1\n/;s/^[^\n]*\n//;P;D' file

Delete any lines that don't contain the required string. Surround the first occurrence with newlines, then remove everything up to and including the first newline. Print up to and including the following newline, then delete what was printed and repeat on the remainder of the line.
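
The same one-liner can be spread over several lines with one comment per step, which may make the loop easier to follow. This is a sketch that assumes GNU sed (for `\n` in the replacement and inside the bracket expression). The scratch file and the variable names `sample` and `keys` are illustrative assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch copy of the sample text from the question (hypothetical temp file).
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT

cat > "$sample" <<'EOF'
\section{The rock outcrop pools experimental system} \label{intro:rockpools}
contain pools at their summit \parencite{brendonck_pools_2010} that have weathered into the rock over time \parencite{bayly_aquatic_2011} through chemical weathering after water collecting at the rock surface \parencite{lister_microgeomorphology_1973}.
Classification depends on dimensions \parencite{twidale_gnammas_1963}.
EOF

keys=$(sed '
  # Step 1: drop the pattern space if it holds no citation.
  /\\parencite{\([^}]*\)}/!d
  # Step 2: the empty regex reuses the pattern from step 1;
  # wrap the first captured key in newlines.
  s//\n\1\n/
  # Step 3: remove everything before the key, including the first newline.
  s/^[^\n]*\n//
  # Step 4: print up to the first newline, which is the key itself.
  P
  # Step 5: delete what was printed and restart the script on the rest
  # of the line without reading new input.
  D
' "$sample")

printf '%s\n' "$keys"
```

Because `D` restarts the script on whatever follows the printed key, each remaining citation on the same line is handled in turn until step 1 finally deletes the leftover text.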
