Extract text between two strings and perform operations on it

I have a file that contains the following text:

<MY_TEXT="XYZ" PATH="MNO"       #First occurrence of MY_TEXT
<location= "XYZ" path="ABC"
\location>
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
<Blah>
\MY_TEXT>                       #Second occurrence of MY_TEXT
<MY_TEXT="ABC" PATH="EFG"       #Third occurrence of MY_TEXT
<location= "QQQ" path="LLL"
\location>
<R_DATA = MNOP     
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
<Blah>
\MY_TEXT>         #Fourth occurrence of MY_TEXT

My task is to find the line that contains <MY_TEXT="XYZ" (it may be preceded by spaces) and then find its closing \MY_TEXT>, so that the extracted block looks like this:

<MY_TEXT="XYZ" PATH="MNO"       #First occurrence of MY_TEXT
<location= "XYZ" path="ABC"
\location>
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >  #First occurrence of Mylocation
<Mylocation ="ghdf" stime=20150401 etime=20150501 >  #Second occurrence of Mylocation
\R_DATA>
<Blah>
\MY_TEXT>

The script should then find the last occurrence of Mylocation within that block (the line marked #Second occurrence of Mylocation here), change the text etime=20150501 to something else, and append a new line after it, editing the file in place.

I came across the question "Sed to extract text between two strings", but the sed command from it either prints nothing when I use the -n option or prints the entire file when I remove -n. Since I cannot extract the text I want in the first place, I cannot process it any further.

I also tried sed -n '/^ *START=A *$/,/^ *END *$/p' yourfile, but to no avail. Could you help me? My scripting is not great. Thanks in advance.

2 solutions

#1

This is a little tricky with sed, but I'll have a go at it.

Important note: This looks like a well-defined file format, but I don't recognize it. It might be prudent to see if there are tools that work on this format directly rather than treating it like a flat file the way sed must. It is very probable that such a solution would be shorter, easier to understand, and more robust than direct-text hackery.

That said, with GNU sed you can use:

sed -n '/<MY_TEXT="XYZ"/ { :a /\\MY_TEXT>/! { N; ba }; s/\(.*\)\(<Mylocation\)/\1\\MY_TEXT>\n\2/; h; s/.*\\MY_TEXT>\n//; s/etime=[0-9]\+/etime=something/; s/\n/\n\n/; s/$/\\MY_TEXT>/; G; s/\(.*\)\\MY_TEXT>\n\(.*\)\\MY_TEXT>\n\(.*\)/\2\1/; p }' filename

Output:

<MY_TEXT="XYZ" PATH="MNO"       #First occurrence of MY_TEXT
<location= "XYZ" path="ABC"
\location>
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=something >

\R_DATA>
<Blah>
\MY_TEXT>

The most confusing bit of this is the use of \MY_TEXT>\n as a marker to separate the working chunks. This is safe because we know that sequence doesn't appear anywhere else in the collected text: \MY_TEXT> first appears in the last line of the block we're working on, so it is never followed by a newline in the pattern space. (The code might be clearer with some other marker that doesn't appear in the text, but I can't think of anything more obvious that is certain not to occur.)
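
The other half of the trick is how the last <Mylocation is located: a greedy \(.*\) consumes as much as it can, so the literal that follows it can only match the final occurrence. A minimal sketch on toy data (the variable names and the @@ marker are made up for illustration and are not part of the answer's script):

```shell
# Toy input with three occurrences of "tag" on one line.
line='tag=1 tag=2 tag=3'

# \(.*\) is greedy, so the "tag" after it matches the LAST occurrence.
# @@ is an arbitrary marker inserted just before that occurrence.
marked=$(printf '%s\n' "$line" | sed 's/\(.*\)\(tag\)/\1@@\2/')

printf '%s\n' "$marked"   # prints: tag=1 tag=2 @@tag=3
```

The real script does the same thing across several lines, because N has already joined the whole block into the pattern space.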

The code works as follows:

#!/bin/sed -nf

/<MY_TEXT="XYZ"/ {                                    # If we find the starter
                                                      # line:
  :a
  /\\MY_TEXT>/! {                                     # fetch the rest of the
    N                                                 # block into the
    ba                                                # pattern space
  }
  s/\(.*\)\(<Mylocation\)/\1\\MY_TEXT>\n\2/           # mark the place before
                                                      # the last Mylocation tag
  h                                                   # copy that to the hold
                                                      # buffer
  s/.*\\MY_TEXT>\n//                                  # remove the stuff before
                                                      # the marker
  s/etime=[0-9]\+/etime=something/                    # replace  the etime
                                                      # attribute
  s/\n/\n\n/                                          # insert the new line
  s/$/\\MY_TEXT>/                                     # put a marker at the end
  G                                                   # fetch back the stuff
                                                      # from the hold buffer
  s/\(.*\)\\MY_TEXT>\n\(.*\)\\MY_TEXT>\n\(.*\)/\2\1/  # replace the end chunk
                                                      # with the edited version
  p                                                   # print the result.
}
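
For comparison, the same steps can be sketched in awk, which may be easier to follow than the hold-buffer juggling. This is only a sketch under assumptions: edit_block, its arguments, and the sample data are hypothetical and invented for illustration, and the appended line is assumed to contain no backslashes (awk -v would interpret them as escapes). Unlike the sed script, it prints the whole file, not just the block.

```shell
# Hypothetical helper: edit_block FILE NEW_ETIME NEW_LINE
# Inside the <MY_TEXT="XYZ" ... \MY_TEXT> block, the last <Mylocation line
# gets its etime replaced and NEW_LINE is printed right after it.
edit_block() {
  awk -v new_etime="$2" -v extra="$3" '
    # Start line (leading blanks allowed): begin buffering the block.
    !inblock && /^[ \t]*<MY_TEXT="XYZ"/ { inblock = 1; n = 0; last = 0 }

    inblock {
      buf[++n] = $0
      if ($0 ~ /<Mylocation/) last = n     # remember the latest Mylocation
      if ($0 ~ /\\MY_TEXT>/) {             # closing line: flush the buffer
        for (i = 1; i <= n; i++) {
          if (i == last) sub(/etime=[0-9]+/, "etime=" new_etime, buf[i])
          print buf[i]
          if (i == last) print extra       # new line after the edited one
        }
        inblock = 0
      }
      next
    }

    { print }                              # lines outside the block

    # An unterminated block is printed unchanged instead of being lost.
    END { if (inblock) for (i = 1; i <= n; i++) print buf[i] }
  ' "$1"
}

# Shortened sample data, assumed for the demonstration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<MY_TEXT="XYZ" PATH="MNO"
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
\MY_TEXT>
<MY_TEXT="ABC" PATH="EFG"
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\MY_TEXT>
EOF

result=$(edit_block "$sample" 20150601 '<Mylocation ="ghdf" stime=20150601 etime=20150701 >')
printf '%s\n' "$result"
rm -f "$sample"
```

Because the function writes the complete file to standard output, an in-place edit could be done by redirecting to a temporary file and moving it over the original once the result has been checked.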

#2

A simple solution is to use a range pattern:

awk '/<MY_TEXT="XYZ"/,/\\MY_TEXT/' file
<MY_TEXT="XYZ" PATH="MNO"       #First occurrence of MY_TEXT
<location= "XYZ" path="ABC"
\location>
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
<Blah>
\MY_TEXT>                       #Second occurrence of MY_TEXT

Or with sed:

sed -n '/<MY_TEXT="XYZ"/,/\\MY_TEXT/p' file
<MY_TEXT="XYZ" PATH="MNO"       #First occurrence of MY_TEXT
<location= "XYZ" path="ABC"
\location>
<R_DATA = MNOP
<Mylocation ="ghdf" stime=20150301 etime=20150401 >
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
\R_DATA>
<Blah>
\MY_TEXT>                       #Second occurrence of MY_TEXT
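
The question mentions that the start line may be indented. A possible variant of the sed range, sketched under assumptions (extract_block and the sample data are hypothetical), anchors both addresses at the start of the line but allows leading whitespace. Neither address ends in $, since the start line continues with PATH=... and the end line may carry trailing text; a pattern anchored at both ends, like the one tried in the question, would only match lines containing nothing else.

```shell
# Hypothetical helper: print only the <MY_TEXT="XYZ" ... \MY_TEXT> block.
extract_block() {
  # ^[[:space:]]* tolerates indentation; there is deliberately no trailing $.
  sed -n '/^[[:space:]]*<MY_TEXT="XYZ"/,/^[[:space:]]*\\MY_TEXT>/p' "$1"
}

# Small indented sample, assumed for the demonstration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Other>
   <MY_TEXT="XYZ" PATH="MNO"
<Mylocation ="ghdf" stime=20150401 etime=20150501 >
   \MY_TEXT>
<MY_TEXT="ABC" PATH="EFG"
\MY_TEXT>
EOF

block=$(extract_block "$sample")
printf '%s\n' "$block"
rm -f "$sample"
```

With this sample, only the three lines of the XYZ block are printed; the leading <Other> line and the ABC block are skipped.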
