Regex performance is very poor when there is no match

Time: 2022-09-17 21:38:30

I have a problem with a slow regex, but only when the pattern doesn't match. In all other cases performance is acceptable, even if the pattern matches at the end of the text. I am testing performance on a 100KB text input.

What I am trying to do is convert input written in an HTML-like syntax, which uses [] instead of <> brackets, into valid XML.

Sample input:

...some content[vc_row param="test1"][vc_column]text [brackets in text] content[/vc_column][/vc_row][vc_row param="xxx"]text content[/vc_row]...some more content

Sample output:

...some content<div class="vc_row" param="test1"><div class="vc_column" >text [brackets in text] content</div></div><div class="vc_row" param="xxx">text content</div>...some more content

To do this I am using this regex:

/(.*)(\[\/?vc_column|\[\/?vc_row)( ?)(.*?)(\])(.*)/

And I run it in a while loop for as long as the pattern matches.

As I mentioned before, this works, but the last iteration is extremely slow (or the first one, if nothing matches). Here is the complete JavaScript I am using:

var str   = '...some content[vc_row param="test1"][vc_column]text content[/vc_column][/vc_row][vc_row param="xxx"]text content[/vc_row]...some more content';

var regex = /(.*)(\[\/?vc_column|\[\/?vc_row)( ?)(.*?)(\])(.*)/;
var matches;
// Rewrite one tag per iteration until no tag is left.
while ((matches = str.match(regex))) {
    if (matches[2].slice(1, 2) !== '/') {
        // Opening tag: matches[2] is e.g. "[vc_row", matches[4] holds the attributes.
        str = matches[1] + "<div class=\"" + matches[2].slice(1) + "\"" + " " + matches[4] + ">" + matches[6];
    } else {
        // Closing tag.
        str = matches[1] + "</div>" + matches[6];
    }
}

How could I improve my regex's "no match" performance?

2 solutions

#1


You can split it up into 2 regexes: one for the opening tags, one for the closing tags.

Then chain 2 global (g flag) replaces.

var str   = '...some content[vc_row param="test1"][vc_column]text with [brackets in text] content[/vc_column][/vc_row][vc_row param="xxx"]text content[/vc_row]...some more content';

const reg1 = /\[(vc_(?:column|row))(\s+[^\]]+)?\s*\]/g;
const reg2 = /\[\/(vc_(?:column|row))\s*\]/g;

var result = str.replace(reg1, "<div class=\"$1\"$2>").replace(reg2, "</div>");

console.log(result);

Note that the (.*) groups from the original regex aren't needed this way.
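A likely reason the original pattern is so slow when nothing matches: the regex is not anchored, so the engine retries it from every start position, and at each one the leading (.*) first consumes the rest of the line and then backs off one character at a time looking for a tag. On a line of n characters with no tag that is on the order of n² steps. The sketch below is only an illustration of that effect: the simplified patterns, the input size, and the names `slow`, `fast`, and `timeIt` are assumptions, and the actual timings depend on the JavaScript engine.

```javascript
// Illustrative input: a single line that contains no "[vc_" tag at all.
const text = "x".repeat(5000);

// Simplified version of the original pattern, with the leading (.*).
const slow = /(.*)(\[\/?vc_row)/;
// The same tag test without the leading wildcard.
const fast = /\[\/?vc_row/;

// Runs one regex against the text and reports whether it matched and how long it took.
function timeIt(label, re) {
    const start = Date.now();
    const found = re.test(text);
    console.log(label, "matched:", found, "ms:", Date.now() - start);
    return found;
}

const slowFound = timeIt("with leading (.*)", slow);
const fastFound = timeIt("without leading (.*)", fast);
```

Both calls report no match; increasing the repeat count should make the gap between the two timings grow quickly.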

Using an anonymous function, it can also be done with a single regex replace.

var str   = '...some content[vc_row param="test1"][vc_column]text with [brackets in text] content[/vc_column][/vc_row][vc_row param="xxx"]text content[/vc_row]...some more content';

const reg = /\[(\/)?(vc_(?:column|row))(\s+[^\]]+)?\s*\]/g;

var result = str.replace(reg, function (m, c1, c2, c3) {
    // c1 is "/" for a closing tag; c3 holds the attributes, if there are any.
    if (c1) return "</div>";
    return "<div class=\"" + c2 + "\"" + (c3 ? c3 : "") + ">";
});

console.log(result);
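To make the callback's arguments concrete: String.prototype.replace calls the function with the full match followed by one argument per capture group, and a group that did not take part in the match arrives as undefined. Here is a small sketch that records what the callback receives. The short sample string and the names `tagRe` and `seen` are made up for illustration; the regex is the one shown above.

```javascript
const sample = 'a[vc_row param="x"]b[/vc_row]c';
const tagRe = /\[(\/)?(vc_(?:column|row))(\s+[^\]]+)?\s*\]/g;

// Collect the arguments passed to the callback for each match.
const seen = [];
const out = sample.replace(tagRe, function (m, c1, c2, c3) {
    seen.push([m, c1, c2, c3]);
    // c1 is "/" only for closing tags; c3 is undefined when there are no attributes.
    return c1 ? "</div>" : '<div class="' + c2 + '"' + (c3 || "") + ">";
});

console.log(seen);
console.log(out);
```

The opening tag is seen with c1 undefined and c3 set to the attribute text (including its leading space), while the closing tag has c1 equal to "/" and c3 undefined.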

#2


How about a replace, like this:

str.replace(/\[(\/?)(vc_column|vc_row)([^\]]*?)\]/g, function(a,b,c,d) {
    return '<' + b + 'div' + (b==='/' ? '' : ' class="' + c + '"') + d + '>';
    });

This matches a tag (opening or closing) together with all of its attributes, capturing everything except the enclosing square brackets. It then puts the pieces back together in the correct format (divs with classes).

And the global flag (/.../g) removes the need for any loop.

var sInput = '...some content[vc_row param="test1"][vc_column]text [brackets in text] content[/vc_column][/vc_row][vc_row param="xxx"]text content[/vc_row]...some more content';

console.log(sInput.replace(/\[(\/?)(vc_column|vc_row)([^\]]*?)\]/g, function(a,b,c,d) {
    return '<' + b + 'div' + (b==='/' ? '' : ' class="' + c + '"') + d + '>';
    })
    );
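As a quick check of the case the question is about, here is a minimal sketch that wraps the replace above in a helper and runs it on an input with tags and on an input with none. The function name `convertTags` and both test strings are hypothetical; the regex and callback are the ones from this answer.

```javascript
// Hypothetical helper around the replace shown above.
function convertTags(input) {
    return input.replace(/\[(\/?)(vc_column|vc_row)([^\]]*?)\]/g, function (a, b, c, d) {
        // b is "/" for a closing tag, c is the tag name, d is the attribute text.
        return '<' + b + 'div' + (b === '/' ? '' : ' class="' + c + '"') + d + '>';
    });
}

const converted = convertTags('x[vc_row param="test1"][vc_column]text [brackets in text] content[/vc_column][/vc_row]y');
const untouched = convertTags('plain text with [brackets in text] only');

console.log(converted);
console.log(untouched);
```

With no tag in the input, the string comes back unchanged after a single pass, so there is no extra "last iteration" to pay for.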
