Regex: extracting a substring between two markers in a string

Time: 2022-09-13 07:47:51

I have a file in the following format:

Data Data
Data
[Start]
Data I want
[End]
Data

I'd like to grab the Data I want from between the [Start] and [End] tags using a Regex. Can anyone show me how this might be done?

9 answers

#1


22  

\[start\]\s*(((?!\[start\]|\[end\]).)+)\s*\[end\]

This should hopefully drop the [start] and [end] markers as well.
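
As a rough check, here is the same pattern run in JavaScript against the file from the question. The `i` flag is my addition (the markers in the question are capitalized), and the variable names are only illustrative. Note that `.` does not cross newlines here, so this sketch assumes the wanted data sits on a single line.

```javascript
// Pattern from this answer, plus the `i` flag so [Start]/[End] also match.
const tempered = /\[start\]\s*(((?!\[start\]|\[end\]).)+)\s*\[end\]/i;

// The sample file from the question, as one string.
const sample = "Data Data\nData\n[Start]\nData I want\n[End]\nData";

// Group 1 holds the text between the markers; the markers themselves are dropped.
const m = sample.match(tempered);
const extracted = m ? m[1] : null;

console.log(JSON.stringify(extracted)); // "Data I want"
```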

#2


63  

\[start\](.*?)\[end\]

Which'll put the text in the middle into a capture group.

#3


5  

$text ="Data Data Data start Data i want end Data";
($content) = $text =~ m/ start (.*) end /;
print $content;

I had a similar problem for a while & I can tell you this method works...

#4


4  

A more complete discussion of the pitfalls of using a regex to find matching tags can be found at: http://faq.perl.org/perlfaq4.html#How_do_I_find_matchi. In particular, be aware that nesting tags really need a full-fledged parser in order to be interpreted correctly.

Note that case sensitivity will need to be turned off in order to answer the question as stated. In perl, that's the i modifier:

$ echo "Data Data Data [Start] Data i want [End] Data" \
  | perl -ne '/\[start\](.*?)\[end\]/i; print "$1\n"'
 Data i want 

The other trick is to use the *? quantifier, which makes the captured match non-greedy. For instance, if you have an extra, unmatched [end] tag:

Data Data [Start] Data i want [End] Data [end]

you probably don't want to capture:

 Data i want [End] Data
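
To make the difference concrete, here is a small JavaScript sketch (not the Perl one-liner above) that runs both the lazy and the greedy form against that line. The variable names are just for illustration.

```javascript
const line = "Data Data [Start] Data i want [End] Data [end]";

// Lazy: (.*?) stops at the first [end] it can reach.
const lazy = line.match(/\[start\](.*?)\[end\]/i);

// Greedy: (.*) runs to the end of the line and backs up to the LAST [end].
const greedy = line.match(/\[start\](.*)\[end\]/i);

console.log(JSON.stringify(lazy[1]));   // " Data i want "
console.log(JSON.stringify(greedy[1])); // " Data i want [End] Data "
```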

#5


4  

While you can use a regular expression to parse the data between opening and closing tags, you need to think long and hard as to whether this is a path you want to go down. The reason for it is the potential of tags to nest: if nesting tags could ever happen or may ever happen, the language is said to no longer be regular, and regular expressions cease to be the proper tool for parsing it.

Many regular expression implementations, such as PCRE or perl's regular expressions, support backtracking which can be used to achieve this rough effect. But PCRE (unlike perl) doesn't support unlimited backtracking, and this can actually cause things to break in weird ways as soon as you have too many tags.

There's a very commonly cited blog post that discusses this more, http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html (google for it and check the cache currently, they seem to be having some downtime)

#6


3  

Well, if you guarantee that each start tag is followed by an end tag then the following would work.

\[start\](.*?)\[end\]

However, if you have complex text such as the following:

[start] sometext [start] sometext2 [end] sometext [end]

then you would run into problems with regex.
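
To illustrate the problem, the JavaScript sketch below shows what the lazy pattern returns on that nested text, followed by one possible way around it: scan the tags and keep a depth counter. The `outermost` helper is hypothetical (my own name, not from any library) and only handles this exact [start]/[end] tag pair.

```javascript
const nested = "[start] sometext [start] sometext2 [end] sometext [end]";

// The lazy regex pairs the FIRST [start] with the FIRST [end].
const naive = nested.match(/\[start\](.*?)\[end\]/i)[1];
console.log(JSON.stringify(naive)); // " sometext [start] sometext2 "

// Hypothetical helper: content of the outermost balanced [start]...[end] pair,
// or null if no balanced pair is found.
function outermost(text) {
  const tag = /\[(start|end)\]/gi;
  let depth = 0;
  let begin = -1;
  let t;
  while ((t = tag.exec(text)) !== null) {
    if (t[1].toLowerCase() === "start") {
      if (depth === 0) begin = tag.lastIndex; // content starts after the tag
      depth++;
    } else if (depth > 0) {
      depth--;
      if (depth === 0) return text.slice(begin, t.index);
    }
  }
  return null;
}

console.log(JSON.stringify(outermost(nested)));
// " sometext [start] sometext2 [end] sometext "
```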

Now the following example will pull out all the hot links in a page:

'/<a(.*?)a>/i'

In the above case we can guarantee that there would not be any nested cases of:

'<a></a>'

So, this is a complex question and can't just be solved with a simple answer.

#7


1  

With Perl you can surround the data you want with parentheses and pull it out later. Other languages probably have a similar feature.

if ($s_output =~ /(data data data data START(data data data)END (data data))/) 
{
    $dataAllOfIt = $1;      # 1 full string
    $dataInMiddle = $2;     # 2 Middle Data
    $dataAtEnd = $3;        # 3 End Data
}
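
JavaScript is one such language. The sketch below mirrors the Perl snippet with the same three groups. The sample string and the group names `middle` and `tail` are made up for illustration.

```javascript
const sOutput = "header data START middle data END trailing data";

// Group 1 wraps everything; the named groups are groups 2 and 3.
const found = sOutput.match(/^(.*START(?<middle>.*?)END(?<tail>.*))$/);

const whole = found ? found[1] : null;              // like $1
const middle = found ? found.groups.middle : null;  // like $2
const tail = found ? found.groups.tail : null;      // like $3

console.log(JSON.stringify(middle)); // " middle data "
console.log(JSON.stringify(tail));   // " trailing data"
```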

#8


0  

Refer to this question to pull out text between tags with space characters and dots (.)

[\S\s] is the one I used

Regex to match any character including new lines
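
Here is a short JavaScript sketch of why that matters for a file like the one in the question, where the markers sit on their own lines. The sample text is my own variation with two data lines; trimming the result is optional.

```javascript
const fileText = "Data Data\nData\n[Start]\nData I want\nmore data\n[End]\nData";

// A plain dot does not match "\n", so this finds nothing.
const dotOnly = fileText.match(/\[start\](.*?)\[end\]/i);
console.log(dotOnly); // null

// [\s\S] matches any character, newlines included.
const anyChar = fileText.match(/\[start\]([\s\S]*?)\[end\]/i);
const body = anyChar[1].trim();
console.log(JSON.stringify(body)); // "Data I want\nmore data"
```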

#9


0  

Read the text within the square brackets [], e.g. [Start] and [End], and validate the resulting array against a list of values. jsfiddle: http://jsfiddle.net/muralinarisetty/r4s4wxj4/1/

var mergeFields = ["[sitename]",
                   "[daystoholdquote]",
                   "[expires]",
                   "[firstname]",
                   "[lastname]",
                   "[sitephonenumber]",
                   "[hoh_firstname]",
                   "[hoh_lastname]"];       

var str = "fee [sitename] [firstname] \
sdfasd [lastname] ";
var res = validateMeargeFileds(str);
console.log(res);

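// Returns true only if every [field] found in the input is a known merge field.
function validateMeargeFileds(input) {
    var re = /\[\w+]/ig;
    // match() returns null when nothing matches, so fall back to an empty array
    // (an input with no bracketed fields then counts as valid).
    var myArray = input.match(re) || [];

    // every() stops at the first unknown field, which replaces the original
    // "throw e" trick (e was undefined there) for breaking out of forEach.
    return myArray.every(function (field) {
        return isMergeField(field);
    });
}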

function isMergeField(mergefield) {
    return mergeFields.indexOf(mergefield.toLowerCase()) > -1;
}
