PHP regex to match words in a string, excluding one specific word

I have a text ($txt), an array of words ($words) that I want to wrap in a link, and a word ($wordToExclude) that must not be replaced.

$words = array ('adipiscing','molestie','fringilla');
$wordToExclude = 'consectetur adipiscing';


$txt = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque
mattis tincidunt dolor sed consequat. Sed rutrum, mauris convallis bibendum 
dignissim, ligula sem molestie massa, vitae condimentum neque sem non tellus.
Aenean dolor enim, cursus vel sodales ac, condimentum ac erat. Quisque
lobortis libero nec arcu fringilla imperdiet. Pellentesque commodo, 
arcu et dictum tincidunt, ipsum elit molestie ipsum, ut ultricies nisl
neque in velit. Curabitur luctus dui id urna consequat vitae mattis
turpis pretium. Donec nec adipiscing velit.';

I want to obtain this result:

$txt = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque
mattis tincidunt dolor sed consequat. Sed rutrum, mauris convallis bibendum 
dignissim, ligula sem <a href="#">molestie</a> massa, vitae condimentum neque sem non tellus.
Aenean dolor enim, cursus vel sodales ac, condimentum ac erat. Quisque
lobortis libero nec arcu <a href="#">fringilla</a> imperdiet. Pellentesque commodo, 
arcu et dictum tincidunt, ipsum elit <a href="#">molestie</a> ipsum, ut ultricies nisl
neque in velit. Curabitur luctus dui id urna consequat vitae mattis
turpis pretium. Donec nec <a href="#">adipiscing</a> velit.';

3 solutions

#1

$result = preg_replace(
    '/\b                 # Word boundary
    (                    # Match one of the following:
     (?<!consectetur\s)  #  (unless preceded by "consectetur ")
     adipiscing          #  adipiscing
    |                    # or
     molestie            #  molestie
    |                    # etc.
     fringilla           #  fringilla
    )                    # End of alternation
    \b                   # Word boundary
    /ix',
    '<a href="#">\1</a>', $txt);
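The pattern above hard-codes the words. A minimal sketch of building the same kind of pattern from $words and $wordToExclude might look like this. The function name linkWords is made up for illustration, and the sketch assumes the excluded phrase ends with one of the words and is separated from its prefix by a single whitespace character:

```php
<?php
// Hypothetical helper (not from the original answer): links every word in $words,
// except where the word is the last word of the excluded phrase.
function linkWords(string $txt, array $words, string $wordToExclude): string
{
    // Split the excluded phrase into a prefix and its final word,
    // e.g. "consectetur adipiscing" -> "consectetur" + "adipiscing".
    $pos    = strrpos($wordToExclude, ' ');
    $prefix = $pos === false ? '' : substr($wordToExclude, 0, $pos);
    $last   = $pos === false ? $wordToExclude : substr($wordToExclude, $pos + 1);

    $alternatives = [];
    foreach ($words as $word) {
        $quoted = preg_quote($word, '/');
        // Only the word that ends the excluded phrase gets the negative lookbehind.
        if ($prefix !== '' && strcasecmp($word, $last) === 0) {
            $quoted = '(?<!' . preg_quote($prefix, '/') . '\s)' . $quoted;
        }
        $alternatives[] = $quoted;
    }

    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';
    return preg_replace($pattern, '<a href="#">$1</a>', $txt);
}
```

preg_quote keeps words containing regex metacharacters from breaking the pattern. The lookbehind stays fixed-length, which PCRE requires.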

#2

Okie doke! While I think this is technically doable, the solution I'm providing is kind of soft at this point:

s%(?!consectetur adipiscing)(adipiscing|molestie|fringilla)(?<!consectetur adipiscing)%<a href="#LinkBasedUpon$1">$1</a>%s

turns...

sit amet, consectetur adipiscing elit. Quisque... ligula sem molestie massa... nec arcu fringilla imperdiet... nec adipiscing velit.

into...

sit amet, consectetur adipiscing elit. Quisque... ligula sem <a href="#LinkBasedUponmolestie">molestie</a> massa... nec arcu <a href="#LinkBasedUponfringilla">fringilla</a> imperdiet... nec <a href="#LinkBasedUponadipiscing">adipiscing</a> velit.
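The expression above is written as a Perl-style substitution. Since the question is about PHP, here is a rough equivalent using preg_replace, as a sketch that keeps the same % delimiter and the same lookahead and lookbehind pair:

```php
<?php
$txt = 'sit amet, consectetur adipiscing elit. Quisque... ligula sem molestie massa... '
     . 'nec arcu fringilla imperdiet... nec adipiscing velit.';

// Negative lookahead before the word, negative lookbehind after it.
$pattern = '%(?!consectetur adipiscing)(adipiscing|molestie|fringilla)(?<!consectetur adipiscing)%s';

// $1 is the matched word; it is used both in the href and as the link text.
$result = preg_replace($pattern, '<a href="#LinkBasedUpon$1">$1</a>', $txt);

echo $result, PHP_EOL;
```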

The reason it is a soft solution is that it does not handle partial words, or other cases where the phrase to exclude neither begins nor ends with one of the words to be matched. For example, if we extended the excluded 'word' to consectetur adipiscing elit, this expression would end up matching the adipiscing in consectetur adipiscing elit, because adipiscing is neither at the start nor at the end of consectetur adipiscing elit.

It should work as long as your exclude 'word' (A B C) always begins or ends with one of the words to be found (the list C|X|E contains C, and A B C ends with the word C, so it should work).
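To make the limitation concrete, here is a small sketch in PHP. It extends the excluded phrase to consectetur adipiscing elit, so adipiscing sits in the middle of it, and the lookarounds no longer protect it:

```php
<?php
$txt = 'sit amet, consectetur adipiscing elit. Quisque';

// Same construction as before, but the excluded phrase now has a word on both sides.
$pattern = '%(?!consectetur adipiscing elit)(adipiscing|molestie|fringilla)(?<!consectetur adipiscing elit)%';

$result = preg_replace($pattern, '<a href="#">$1</a>', $txt);

// The word inside the excluded phrase gets linked anyway.
echo $result, PHP_EOL;
```

Neither assertion fires here: the text at the start of the match is not the excluded phrase, and the text ending at the end of the match is not the excluded phrase either.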

EDIT {

The reason the 'not matched' phrase must begin or end with one of the matched words is that this solution uses a negative lookahead before the match and a negative lookbehind after the match, to ensure that the matched word does not sit at the start or the end of the phrase that must not be matched (does that make sense?)

}

There are more robust solutions to this, but they are intensive in processor time, programming effort, or both, and get considerably more so depending on the size of the word lists, the length of the searched text AND the specific requirements. You never specified anything else, so I'm not going to go into it at this point. Let me know if this is good enough for your situation!
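For what it's worth, one approach that might cover the general case is sketched below. It is not part of the original answer, and the function name linkWordsExcept is made up. It puts the excluded phrase first in the alternation and lets a callback decide what to do with each match, so the phrase is consumed whole before any word inside it can match:

```php
<?php
// Hypothetical helper: the excluded phrase is matched (and left untouched) first,
// so words inside it are never seen as separate matches.
function linkWordsExcept(string $txt, array $words, string $wordToExclude): string
{
    $quotedWords = array_map(
        function ($word) {
            return preg_quote($word, '/');
        },
        $words
    );

    // Group 1: the excluded phrase. Group 2: any of the words to link.
    $pattern = '/\b(?:(' . preg_quote($wordToExclude, '/') . ')|('
             . implode('|', $quotedWords) . '))\b/i';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Group 2 is only present and non-empty when a word matched on its own.
            if (isset($m[2]) && $m[2] !== '') {
                return '<a href="#">' . $m[2] . '</a>';
            }
            return $m[0]; // excluded phrase: return it unchanged
        },
        $txt
    );
}
```

This sketch assumes the excluded phrase appears in the text exactly as written, for example not split across a line break.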

#3

I see you're doing it in PHP. As I understand it, you have an ARRAY of words to find in a text and you need to replace those with links. You also have ONE string that needs to be excluded when doing the replacing. Instead of writing a cool and clean yet complicated regular expression, how about this practical, albeit probably not the nicest, solution:

You split the task into subtasks:

1. Use preg_match_all to find the offsets of all occurrences of the excluded string. You know the string's length (strlen), and with the PREG_OFFSET_CAPTURE flag for preg_match_all you can work out the exact start and end of each occurrence, if there is more than one.

2. Do a foreach over your word list and again use preg_match_all to get all occurrences of the words you need to replace with links.

3. Compare the positions you found in step 2 with those found in step 1. If a word lies outside every excluded occurrence, do the replace; if it overlaps one, skip it.

It surely won't be a one-liner but would be quite easy to code and then probably quite easy to read later too.
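The steps above could be sketched roughly as follows. The function name linkWordsByOffsets is made up for illustration, and offsets are byte offsets, as returned by PREG_OFFSET_CAPTURE:

```php
<?php
// Hypothetical helper following the three steps described above.
function linkWordsByOffsets(string $txt, array $words, string $wordToExclude): string
{
    // Step 1: byte ranges [start, end) of every occurrence of the excluded string.
    $excluded = [];
    preg_match_all('/' . preg_quote($wordToExclude, '/') . '/i', $txt, $m, PREG_OFFSET_CAPTURE);
    foreach ($m[0] as [$match, $start]) {
        $excluded[] = [$start, $start + strlen($match)];
    }

    // Step 2: every occurrence of every word, keyed by its start offset.
    $hits = [];
    foreach ($words as $word) {
        preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $txt, $m, PREG_OFFSET_CAPTURE);
        foreach ($m[0] as [$match, $start]) {
            $hits[$start] = $match;
        }
    }

    // Step 3: replace from the end of the text so earlier offsets stay valid,
    // skipping any hit that overlaps an excluded range.
    krsort($hits);
    foreach ($hits as $start => $match) {
        $end = $start + strlen($match);
        foreach ($excluded as [$exStart, $exEnd]) {
            if ($start < $exEnd && $end > $exStart) {
                continue 2; // overlaps an excluded occurrence, leave it alone
            }
        }
        $txt = substr_replace($txt, '<a href="#">' . $match . '</a>', $start, strlen($match));
    }

    return $txt;
}
```

Working backwards through the hits avoids having to adjust offsets after each insertion.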
