Searching and replacing words in HTML

Date: 2022-09-13 09:40:06

What I'm trying to do is make a 'jargon buster'. Basically, I have some HTML and some glossary terms in a database. When the person clicks on the jargon buster, it replaces the words in the text with a nice tooltip (wztooltip) that shows them the meanings.

I've been trying hard on this one and have been looking closely at this question: Regex / DOMDocument - match and replace text not in a link

and it seems like the answer lies in the simple_html_dom library, but I'm having trouble getting it to work. Obviously, any words that are already linked shouldn't get touched. Here is a stripped-down version of what I've got.

$html = str_get_html($article['content']);

$query_glossary = "SELECT word,glossary_term_id,info FROM glossary_terms WHERE status = 1  ORDER BY LENGTH(word) DESC";
$result_glossary = mysql_query_run($query_glossary);

while($glossary = mysql_fetch_array($result_glossary)) {
    $glossary_link = SITEURL.'/glossary/term/'.string_to_url($glossary['word']).'-'.$glossary['glossary_term_id'];
    if(strlen($glossary['info'])>400) {
        $glossary_info = substr(strip_tags($glossary['info']),0,350).' ...<br /> <a href="'.$glossary_link.'">Read More</a>';
    }
    else {
        $glossary_info = $glossary['info'];
    }
    $glossary_tip = 'href="javascript:;" onmouseout="UnTip();" class="article_jargon_highligher" onmouseover="'.tooltip_javascript('<a href="'.$glossary_link.'">'.$glossary['word'].'</a>',$glossary_info,400,1,0,1).'"';
    $glossary_word = $glossary['word'];
    $glossary_word = preg_quote($glossary_word,'/');

    //once done we can replace the words with a nice tip    
    foreach ($html->find('text') as $element) {
        if (!in_array($element->parent()->tag,array())) {
            // problems: case isn't taken into account, and neither is grammar (e.g. punctuation after the word)
            $element->innertext = str_ireplace(''.$glossary['word'].' ',' <a '.$glossary_tip.' >'.$glossary['word'].'</a> ', $element->innertext);

           //$element->innertext = str_ireplace(''.$glossary['word'].',',' <a '.$glossary_tip.'>'.$glossary['word'].'</a> ', $element->innertext);
           //$element->innertext = preg_replace ("/\s(".$glossary_word.")\s/ise","nothing(' <a'.'$glossary_tip.'>'.'$1'.'</a> ')" , $element->innertext);
          // $element->innertext = str_replace('__glossary_tip_replace__',$glossary_tip, $element->innertext);
        }
    }
}
$article['content'] = $html->save();
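
For comparison, the "already linked words don't get touched" requirement can be sketched with PHP's bundled DOM extension instead of simple_html_dom. This is a substitute technique, not the code from the question: the helper name link_term, the wrapper div with id "root" and the class "jargon" are all made up for illustration, and PHP 7 or later with the DOM extension enabled is assumed.

```php
<?php
// Hypothetical helper (not from the question): wrap whole-word matches of
// $word in <a class="jargon">, skipping text that is already inside a link.
function link_term(string $html, string $word): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
    // The XML prolog hints UTF-8 to libxml; the wrapper div gives us a root to export.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b/iu';
    // Text nodes that are not inside <a>, <script> or <style>.
    $query = '//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::style)]';

    // Copy the node list first, because the loop below changes the tree.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        // With DELIM_CAPTURE, odd indexes hold the matched words.
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $doc->createElement('a');
                $a->setAttribute('class', 'jargon');
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Export only the children of the wrapper div.
    $out = '';
    foreach ($xpath->query('//div[@id="root"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo link_term('<p>An API and <a href="/x">API docs</a>.</p>', 'api'), "\n";
```

Building the replacement from DOM nodes rather than from HTML strings means the existing link is never re-parsed, and the tooltip attributes could be attached with setAttribute in the same place.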

3 solutions

#1


11  

Use the negated word-character class \W in your regex pattern to match any character other than letters, digits and the underscore. Because this alone would still fail at the boundaries of the text blob, you also need to test for those conditions. Thus, using the word 'term' as the text you are searching for:

(^term$)|(^term\W)|(\Wterm\W)|(\Wterm$)

The first alternative matches when term is the entire contents of the blob, the second when it is the first word, the third when it is contained within the blob, and the last when it is the last word.

If you want to treat any other characters as word characters (say, a hyphen), you would need to replace the \W with [^\w\-].
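
A minimal PHP sketch of this pattern, assuming PHP 7 or later. The helper name wrap_term_nonword and the bare <a> wrapper are made up for illustration. The four alternatives are folded into (^|sep)term(sep|$), and the separators are captured so the replacement can put them back.

```php
<?php
// Hypothetical helper illustrating the \W-delimited pattern described above.
function wrap_term_nonword(string $text, string $term, bool $hyphenInWords = false): string
{
    // Either plain \W, or the variant that treats "-" as a word character.
    $sep = $hyphenInWords ? '[^\w\-]' : '\W';
    $t = preg_quote($term, '/');
    // $1 and $3 restore the separator characters consumed by the match.
    $pattern = '/(^|' . $sep . ')(' . $t . ')(' . $sep . '|$)/i';
    return preg_replace($pattern, '$1<a>$2</a>$3', $text);
}

echo wrap_term_nonword('Term, terminal and term.', 'term'), "\n";
// <a>Term</a>, terminal and <a>term</a>.
echo wrap_term_nonword('long-term plan', 'term'), "\n";
// long-<a>term</a> plan
echo wrap_term_nonword('long-term plan', 'term', true), "\n";
// long-term plan
echo wrap_term_nonword('term term', 'term'), "\n";
// <a>term</a> term
```

The last call shows a limitation of consuming the separators: the space after the first match is already used up, so an immediately following second occurrence is missed. Zero-width assertions such as \b or lookarounds avoid this.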

Hope this helps. There are probably optimizations that can be performed as well, but this should at least be a good starting point.

#2


8  

Assuming all your glossary "words" consist of standard "word" characters (i.e. [A-Za-z0-9_]), a simple word boundary assertion (\b) can be placed before and after the word in the regex pattern. Try replacing the pertinent statement with this:

$element->innertext = preg_replace(
    '/\b'. $glossary_word .'\b/i',
    '<a '. $glossary_tip .' >'. $glossary['word'] .'</a>',
    $element->innertext);

This assumes that $glossary_word has been run through preg_quote (which your code does).
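
A standalone way to try the \b approach outside the glossary loop might look like the sketch below, assuming PHP 7 or later. The helper name wrap_term_boundary and the sample attribute string are made up. It is a small variation on the statement above: it uses preg_replace_callback so that the link text keeps the spelling found in the article, and so that any "$" or "\" inside the attribute string is not read as a backreference.

```php
<?php
// Hypothetical helper: wrap whole-word, case-insensitive matches of $word.
function wrap_term_boundary(string $text, string $word, string $attrs): string
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    return preg_replace_callback(
        $pattern,
        function (array $m) use ($attrs): string {
            // $m[0] is the text exactly as it appeared, e.g. "API" or "Api".
            return '<a ' . $attrs . '>' . $m[0] . '</a>';
        },
        $text
    );
}

echo wrap_term_boundary('API, Api and apiary.', 'api', 'class="jargon"'), "\n";
// <a class="jargon">API</a>, <a class="jargon">Api</a> and apiary.
```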

However, if the glossary words may contain other non-standard word characters (such as a '-' dash), a more complex regex can be formulated which incorporates lookahead and lookbehind to ensure that only whole words are matched. For example:

$re_pattern = "/         # Match a glossary whole word.
    (?<=[\s'\"]|^)       # Word preceded by whitespace, quote or BOS.
    {$glossary_word}     # Word to be matched.
    (?=[\s'\".?!,;:]|$)  # Word followed by ws, quote, punct or EOS.
    /ix";
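
As a rough usage sketch (PHP 7 or later assumed; the helper name wrap_term_lookaround and the bare <a> wrapper are made up), the same lookbehind and lookahead can be applied like this. One deliberate difference: the sketch drops the x flag, because in extended mode unescaped whitespace in the pattern is ignored and preg_quote does not escape spaces, so a multi-word glossary term would lose its spaces.

```php
<?php
// Hypothetical helper built around the lookaround pattern shown above.
function wrap_term_lookaround(string $text, string $word): string
{
    $w = preg_quote($word, '/');
    // Preceded by whitespace, a quote or start of string;
    // followed by whitespace, a quote, punctuation or end of string.
    $pattern = '/(?<=[\s\'"]|^)' . $w . '(?=[\s\'".?!,;:]|$)/i';
    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            return '<a>' . $m[0] . '</a>';
        },
        $text
    );
}

echo wrap_term_lookaround('Use e-mail, not e-mailing.', 'e-mail'), "\n";
// Use <a>e-mail</a>, not e-mailing.
echo wrap_term_lookaround('e-mail e-mail', 'e-mail'), "\n";
// <a>e-mail</a> <a>e-mail</a>
```

Because the assertions are zero-width, the separator between two adjacent occurrences is not consumed and both are matched.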

#3


3  

I had this problem in JS getting individual words. What I did was the following (you can translate it from JS to PHP):

It actually works REALLY well for me. :)

var words = document.body.innerHTML;

// FIRST PASS

// remove scripts
words = words.replace(/<script[\s\S]*?>[\s\S]*?<\/script>/gi, '');
// remove CSS
words = words.replace(/<style[\s\S]*?>[\s\S]*?<\/style>/gi, '');
// remove comments
words = words.replace(/<!--[\s\S]*?-->/g, '');
// remove html character entities
words = words.replace(/&.*?;/g, ' ');
// remove all HTML
words = words.replace(/<[\s\S]*?>/g, '');

// SECOND PASS

// remove all newlines
words = words.replace(/\n/g, ' ');
// replace multiple spaces with 1 space
words = words.replace(/\s{2,}/g, ' ');

// split each word
words = words.split(/[^a-z-']+/gi);
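
A possible PHP translation of the steps above, as a sketch only (PHP 7 or later assumed; the function name extract_words is made up). Two small differences from the JavaScript: the newline and multiple-space steps are merged into a single \s+ replacement, and PREG_SPLIT_NO_EMPTY drops the empty strings that split() can leave at the ends.

```php
<?php
// Hypothetical PHP port of the JavaScript above; $html is any HTML string.
function extract_words(string $html): array
{
    // First pass: strip scripts, CSS, comments, entities, then all tags.
    $text = preg_replace('/<script[\s\S]*?>[\s\S]*?<\/script>/i', '', $html);
    $text = preg_replace('/<style[\s\S]*?>[\s\S]*?<\/style>/i', '', $text);
    $text = preg_replace('/<!--[\s\S]*?-->/', '', $text);
    $text = preg_replace('/&.*?;/', ' ', $text);
    $text = preg_replace('/<[\s\S]*?>/', '', $text);

    // Second pass: collapse newlines and runs of whitespace to one space.
    $text = preg_replace('/\s+/', ' ', $text);

    // Split on anything that is not a letter, apostrophe or hyphen.
    return preg_split("/[^a-z'-]+/i", $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(extract_words('<p>Don\'t <b>re-use</b> this &amp; that.</p><script>var x = 1;</script>'));
// Words found: Don't, re-use, this, that
```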
