How to use a regular expression in JavaScript to replace everything in a string except specific words

Time: 2022-10-29 15:02:54

Imagine you have a string like this: "This is a sentence with words."

I have an array of words like $wordList = ["sentence", "words"];

I want to highlight words that aren't on the list. Which means I need to find and replace everything else and I can't seem to crack how to do that (if it's possible) with RegEx.

If I want to match the words I can do something like:

text = text.replace(/(sentence|words)\b/g, '<mark>$&</mark>');

This wraps the matching words in <mark> tags and, assuming I have some CSS for <mark>, highlights them, which works perfectly. But I need the opposite! I need it to basically select the entire string and then exclude the words listed. I've tried /^((?!sentence|words)*)*$/gm, but this gives me a strange infinity issue, I think because it's too open-ended.

Taking that original sentence, what I would hope to end up with is "<mark>This is a </mark>sentence<mark> with </mark>words."

Basically wrapping (via replace) everything except the words listed.

The closest I can seem to get is something like /^(?!sentence|words).*\b/igm which will successfully do it if a line starts with one of the words (ignoring that entire line).

So to summarize: 1) Take a string 2) take a list of words 3) replace everything in the string except the list of words.

Possible? (jQuery is loaded for something else already, so raw JS or jQuery are both acceptable).

4 solutions

#1


5  

Create the regex from the word list.
Then do a string replace with the regex.
(It's a tricky regex)

var wordList = ["sentence", "words"];

// join the array into a string using '|'.  
var str = wordList.join('|');
// finalize the string with a negative assertion
str = '\\W*(?:\\b(?!(?:' + str + ')\\b)\\w+\\W*|\\W+)+';

//create a regex from the string
var Rx = new RegExp( str, 'g' );
console.log( Rx ); 

var text = "%%%555This is a sentence with words, but not sentences ?!??!!...";
text = text.replace( Rx, '<mark>$&</mark>');

console.log( text );

Output

/\W*(?:\b(?!(?:sentence|words)\b)\w+\W*|\W+)+/g
<mark>%%%555This is a </mark>sentence<mark> with </mark>words<mark>, but not sentences ?!??!!...</mark>

Addendum

The regex above assumes the word list contains only word characters.
If that's not the case, you must match the words to advance the match position
past them. This is easily accomplished with a simplified regex and a callback function.

var wordList = ["sentence", "words", "won't"];

// join the array into a string using '|'.  
var str = wordList.join('|');
str = '([\\S\\s]*?)(\\b(?:' + str + ')\\b|$)';

//create a regex from the string
var Rx = new RegExp( str, 'g' );
console.log( Rx ); 

var text = "%%%555This is a sentence with words, but won't be sentences ?!??!!...";

// Use a callback to insert the 'mark'
text = text.replace(
        Rx,
        function(match, p1,p2)
        {
           var retStr = '';
           if ( p1.length > 0 )
              retStr = '<mark>' + p1 + '</mark>';
           return retStr + p2;
        }
      );

console.log( text );

Output

/([\S\s]*?)(\b(?:sentence|words|won't)\b|$)/g
<mark>%%%555This is a </mark>sentence<mark> with </mark>words<mark>, but </mark>won't<mark> be sentences ?!??!!...</mark>
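
If the word list can come from user input, entries may also contain regex metacharacters such as "+" or ".". The following is only a minimal sketch of how the callback approach might be hardened for that case. The names escapeRegExp and markExcept are hypothetical helpers, not part of the answer above, and the sketch assumes an engine that supports lookbehind assertions. It swaps \b for (?<!\w) and (?!\w) so that entries ending in punctuation can still match.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical wrapper around the callback approach shown above.
function markExcept(text, wordList) {
  const alternation = wordList.map(escapeRegExp).join('|');
  // (?<!\w) and (?!\w) replace \b (an assumption of this sketch) so that
  // entries such as "C++", which end in a non-word character, still match.
  const rx = new RegExp(
    '([\\S\\s]*?)((?<!\\w)(?:' + alternation + ')(?!\\w)|$)', 'g');
  return text.replace(rx, (match, p1, p2) =>
    (p1.length > 0 ? '<mark>' + p1 + '</mark>' : '') + p2);
}

console.log(markExcept("Use C++ or a.b here", ["C++", "a.b"]));
// <mark>Use </mark>C++<mark> or </mark>a.b<mark> here</mark>
```

Without the escaping step, "a.b" would also match "axb", because an unescaped dot matches any character.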

#2


3  

You could still perform the replacement on the positive matches, but reverse the closing and opening tags, then add an opening tag at the start of the string and a closing one at the end. I use your regular expression here, which could be anything you want, so I'll assume it correctly matches what needs to be matched:

var text = "This is a sentence with words.";

text = "<mark>" + text.replace(/\b(sentence|words)\b/g, '</mark>$&<mark>') + "</mark>";

// If empty tags bother you, you can add:
text = text.replace(/<mark><\/mark>/g, "");

console.log(text);
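
The same idea can be sketched as a reusable function that builds the pattern from a word list. The name markExceptInverted is hypothetical, and the sketch assumes every entry in the list consists of word characters only, so the entries can be joined without escaping.

```javascript
// Hypothetical helper built on the tag-inversion idea above.
// Assumes every entry in wordList contains word characters only.
function markExceptInverted(text, wordList) {
  const rx = new RegExp('\\b(?:' + wordList.join('|') + ')\\b', 'g');
  // Close the mark before each listed word and reopen it afterwards.
  const inverted = '<mark>' + text.replace(rx, '</mark>$&<mark>') + '</mark>';
  // Drop empty pairs, which appear when a listed word starts or ends the text.
  return inverted.replace(/<mark><\/mark>/g, '');
}

console.log(markExceptInverted("This is a sentence with words.", ["sentence", "words"]));
// <mark>This is a </mark>sentence<mark> with </mark>words<mark>.</mark>
```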

Time Complexity

In comments below someone makes a point that the second replacement (which is optional) is a waste of time. But it has linear time complexity as is illustrated in the following snippet which charts the duration for increasing string sizes.

The X axis represents the number of characters in the input string, and the Y-axis represents the number of milliseconds it takes to execute the replacement with /<mark><\/mark>/g on such input string:

// Reserve memory for the longest string
const s = '<mark></mark>' + '<mark>x</mark>'.repeat(2000),
    regex = /<mark><\/mark>/g,
    millisecs = {};
// Collect timings for several string sizes:
for (let size = 100; size < 25000; size+=100) {
	millisecs[size] = test(15, 8, _ => s.substr(0, size).replace(regex, ''));
}
// Show results in a chart:
chartFunction(canvas, millisecs, "len", "ms");

// Utilities
function test(countPerRun, runs, f) {
    let fastest = Infinity;
    for (let run = 0; run < runs; run++) {
        const started = performance.now();
        for (let i = 0; i < countPerRun; i++) f();
        // Keep the duration of the fastest run:
        fastest = Math.min(fastest, (performance.now() - started) / countPerRun);
    }
    return fastest;
}

function chartFunction(canvas, y, labelX, labelY) {
    const ctx = canvas.getContext('2d'),
        axisPix = [40, 20],
        largeY = Object.values(y).sort( (a, b) => b - a )[
                    Math.floor(Object.keys(y).length / 10)
                ] * 1.3, // add 30% to value at the 90th percentile
        max = [+Object.keys(y).pop(), largeY],
        coeff = [(canvas.width-axisPix[0]) / max[0], (canvas.height-axisPix[1]) / max[1]],
        textAlignPix = [-8, -13];
    ctx.translate(axisPix[0], canvas.height-axisPix[1]);
    text(labelY + "/" + labelX, [-5, -13], [1, 1], false, 2);
    // Draw axis lines
    for (let dim = 0; dim < 2; dim++) {
        const c = coeff[dim], world = [c, 1];
        let interval = 10**Math.floor(Math.log10(60 / c));
        while (interval * c < 30) interval *= 2;
        if (interval * c > 60) interval /= 2;
        let decimals = ((interval+'').split('.')[1] || '').length;
        line([[0, 0], [max[dim], 0]], world, dim);
        for (let x = 0; x <= max[dim]; x += interval) {
            line([[x, 0], [x, -5]], world, dim);
            text(x.toFixed(decimals), [x, textAlignPix[1-dim]], world, dim, dim+1);
        }
    }
    // Draw function
    line(Object.entries(y), coeff);

    function translate(coordinates, world, swap) {
        return coordinates.map( p => {
            p = [p[0] * world[0], p[1] * world[1]];
            return swap ? p.reverse() : p;
        });
    }
    
    function line(coordinates, world, swap) {
        coordinates = translate(coordinates, world, swap);
        ctx.beginPath();
        ctx.moveTo(coordinates[0][0], -coordinates[0][1]);
        for (const [x, y] of coordinates.slice(1)) ctx.lineTo(x, -y);
        ctx.stroke();
    }

    function text(s, p, world, swap, align) { // align: 0=left,1=center,2=right
        const [[x, y]] = translate([p], world, swap);
        ctx.font = '9px courier';
        ctx.fillText(s, x - 2.5*align*s.length, 2.5-y);
    }
}
<canvas id="canvas" width="600" height="200"></canvas>

For each string size (which is incremented with steps of 100 characters), the time to run the regex 15 times is measured. This measurement is repeated 8 times and the duration of the fastest run is reported in the graph. On my PC the regex runs in 25µs on a string with 25 000 characters (consisting of <mark> tags). So not something to worry about ;-)

You may see some spikes in the chart (due to browser and OS interference), but the overall tendency is linear. Given that the main regex has linear time complexity, the overall time complexity is not negatively affected by it.

However that optional part can be performed without regular expression as follows:

if (text.substr(6, 7) === '</mark>') text = text.substr(13);
if (text.substr(-13, 6) === '<mark>') text = text.substr(0, text.length-13);

Due to how JavaScript engines deal with strings (immutable), this longer code runs in constant time.

Of course, it does not change the overall time complexity, which remains linear.
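
For readers who find the index arithmetic hard to follow, the two checks above could also be written with startsWith and endsWith. This is only an illustrative sketch; trimEmptyMarks is a hypothetical name, and it assumes the text was already wrapped in <mark>...</mark> as shown earlier.

```javascript
// Hypothetical helper: strip an empty <mark></mark> pair from either end.
function trimEmptyMarks(text) {
  const empty = '<mark></mark>';
  // A listed word at the very start leaves an empty pair in front of it.
  if (text.startsWith(empty)) text = text.slice(empty.length);
  // A listed word at the very end leaves an empty pair behind it.
  if (text.endsWith(empty)) text = text.slice(0, -empty.length);
  return text;
}

console.log(trimEmptyMarks('<mark></mark>words<mark> first</mark>'));
// words<mark> first</mark>
```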

当然,它并没有改变整体时间复杂度,这仍然是线性的。

#3


1  

I'm not sure if this will work for every case, but for the given string it does.

let s1 = "This is a sentence with words.";
let wordList = ["sentence", "words"];

let reg = new RegExp("([\\s\\S]*?)(" + wordList.join("|") + ")", "g");

console.log(s1.replace(reg, "<mark>$1</mark>$2"))
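
Note that the pattern above leaves any text after the last listed word unwrapped (the final "." here) and, lacking word boundaries, would also match "sentence" inside "sentences". One possible way around both, sketched here under the assumption that the list holds word characters only, is to split on a capturing group. The name markExceptSplit is hypothetical. With a single capturing group, split returns the listed words at the odd indexes and everything else at the even ones.

```javascript
// Hypothetical helper: split on the listed words and wrap everything else.
// Assumes every entry in wordList contains word characters only.
function markExceptSplit(text, wordList) {
  const rx = new RegExp('\\b(' + wordList.join('|') + ')\\b');
  return text
    .split(rx) // odd indexes hold the captured (listed) words
    .map((part, i) =>
      (i % 2 === 1 || part === '') ? part : '<mark>' + part + '</mark>')
    .join('');
}

console.log(markExceptSplit("This is a sentence with words.", ["sentence", "words"]));
// <mark>This is a </mark>sentence<mark> with </mark>words<mark>.</mark>
```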

#4


1  

Do it the opposite way: Mark everything and unmark the matched words you have.

text = `<mark>${text.replace(/\b(sentence|words)\b/g, '</mark>$&<mark>')}</mark>`;

Negated regex is possible but inefficient for this. In fact, regex is not the right tool. The viable method is to go through the string and manually construct the resulting string:

// Example inputs (the loop below expects `text` and `wordlist` to be defined):
var text = "This is a sentence with words.";
var wordlist = ["sentence", "words"];
var result = "";
var marked = false;
var nextIndex = 0;

while (nextIndex != -1) {
    var endIndex = text.indexOf(" ", nextIndex + 1);
    var substring = text.slice(nextIndex, endIndex == -1 ? text.length : endIndex);
    var contains = wordlist.some(word => substring.includes(word));
    if (!contains && !marked) {
        result += "<mark>";
        marked = true;
    }
    if (contains && marked) {
        result += "</mark>";
        marked = false;
    }
    result += substring;
    nextIndex = endIndex;
}

if (marked) {
    result += "</mark>";
}
text = result;
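
The loop above splits on spaces and uses includes, so punctuation stays attached to a listed word and "sentences" counts as containing "sentence". A possible variant, sketched here with the hypothetical name markExceptTokens, tokenizes the text into runs of word and non-word characters and looks each token up in a Set. It still uses a small regex for tokenizing and assumes the listed entries contain word characters only.

```javascript
// Hypothetical helper: walk the tokens and open or close <mark> as needed.
// Assumes every entry in wordList contains word characters only.
function markExceptTokens(text, wordList) {
  const listed = new Set(wordList);
  // Alternating runs of word characters and non-word characters.
  const tokens = text.match(/\w+|\W+/g) || [];
  let result = '';
  let marked = false;
  for (const token of tokens) {
    const keep = listed.has(token); // exact match only
    if (!keep && !marked) { result += '<mark>'; marked = true; }
    if (keep && marked) { result += '</mark>'; marked = false; }
    result += token;
  }
  if (marked) result += '</mark>';
  return result;
}

console.log(markExceptTokens("This is a sentence with words.", ["sentence", "words"]));
// <mark>This is a </mark>sentence<mark> with </mark>words<mark>.</mark>
```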
