How do I create an efficient content filter for certain posts?

Date: 2022-06-25 16:06:27

I've tagged this post as WordPress, but I'm not entirely sure it's WordPress-specific, so I'm posting it on * rather than WPSE. The solution doesn't have to be WordPress-specific, simply PHP.

The Scenario
I run a fishkeeping website with a number of tropical fish Species Profiles and Glossary entries.

Our website is oriented around our profiles. They are, as you may term it, the bread and butter of the website.

What I'm hoping to achieve is that, in every species profile which mentions another species or a glossary entry, I can replace those words with a link - such as you'll see here. Ideally, I would also like this to occur in news, articles and blog posts too.

We have nearly 1400 species profiles and 1700 glossary entries. Our species profiles are often lengthy; at last count, the profiles alone ran to more than 1.7 million words of information.

What I'm Currently Attempting
Currently, I have a filter.php with a function that - I believe - does what I need it to do. The code is quite lengthy, and can be found in full here.

In addition, in my WordPress theme's functions.php, I have the following:

# ==============================================================================================
# [Filter]
#
# Every hour, using WP_Cron, `my_updated_posts` is checked. If there are new Post IDs in there,
# it will run a filter on all of the post's content. The filter will search for Glossary terms
# and scientific species names. If found, it will replace those names with links including a 
# pop-up.

    include "filter.php";

# ==============================================================================================
# When saving a post (new or edited), check to make sure it isn't a revision then add its ID
# to `my_updated_posts`.

    add_action( 'save_post', 'my_set_content_filter' );
    function my_set_content_filter( $post_id ) {
        if ( !wp_is_post_revision( $post_id ) ) {

            $post_type = get_post_type( $post_id );

            if ( $post_type == "species" || ( $post_type == "post" && in_category( array( "articles", "blogs" ), $post_id ) ) ) {
                //get the current queue (get_option() returns false if the option doesn't exist yet)
                $ids = get_option( 'my_updated_posts' );
                if ( !is_array( $ids ) ) {
                    $ids = array();
                }

                //queue this post's ID if it isn't already waiting
                if ( !in_array( $post_id, $ids ) ) {
                    $ids[] = $post_id;
                    update_option( 'my_updated_posts', $ids );
                }
            }
        }
    }

# ==============================================================================================
# Add the filter to WP_Cron.

    add_action( 'my_filter_posts_content', 'my_filter_content' );
    if( !wp_next_scheduled( 'my_filter_posts_content' ) ) {
        wp_schedule_event( time(), 'hourly', 'my_filter_posts_content' );
    }

# ==============================================================================================
# Run the filter.

    function my_filter_content() {
        //check to see if posts need to be parsed
        if ( !get_option( 'my_updated_posts' ) )
            return false;

        //parse posts
        $ids = get_option( 'my_updated_posts' );

        update_option( 'error_check', $ids );

        foreach( $ids as $v ) {
            if ( get_post_status( $v ) == 'publish' )
                run_filter( $v );

            update_option( 'error_check', "filter has run at least once" );
        }

        //make sure no values have been added while loop was running
        $id_recheck = get_option( 'my_updated_posts' );
        my_close_out_filter( $ids, $id_recheck );

        //once every ID, including any queued while this potentially long cron job ran, has been processed, remove the option and close out
        delete_option( 'my_updated_posts' );
        update_option( 'error_check', 'working m8' );
        return true;
    }

# ==============================================================================================
# A "difference" function to make sure no new posts have been added to `my_updated_posts` whilst
# the potentially time-consuming filter was running.

    function my_close_out_filter( $processed_ids, $current_ids ) {
        //anything queued while the loop ran is in $current_ids but not $processed_ids
        $diff = array_diff( $current_ids, $processed_ids );

        //base case: nothing new was queued, so stop recursing
        if ( empty( $diff ) )
            return;

        foreach ( $diff as $v ) {
            run_filter( $v );
        }

        //repeat until the queue stops growing
        my_close_out_filter( $current_ids, (array) get_option( 'my_updated_posts' ) );
    }

The way this works, as (hopefully) described by the code's comments, is that each hour WordPress runs a cron job via WP-Cron (really a pseudo-cron: it only fires when the site receives visits, but that doesn't much matter here since exact timing isn't important) which executes the filter found above.
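
As an aside, my understanding is that the pseudo-cron can be swapped for a real schedule if it ever proves too unreliable: disable the page-load trigger in wp-config.php and have the system cron hit wp-cron.php instead. Something like:

    # In wp-config.php - stop WP-Cron firing on page loads:
    define( 'DISABLE_WP_CRON', true );

    # Then trigger it hourly from the system crontab, e.g.:
    # 0 * * * * wget -q -O /dev/null "http://example.com/wp-cron.php?doing_wp_cron"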

The rationale behind running it on an hourly basis was that if we tried to run it when each post was saved, it would be to the detriment of the author. Once we get guest authors involved, that is obviously not an acceptable way of going about it.

The Problem...
For months now I've been having problems getting this filter running reliably. I don't believe the problem lies with the filter itself, but with one of the functions that enables it - i.e. the cron job, the function that chooses which posts are filtered, or the function which prepares the word lists for the filter.

Unfortunately, diagnosing the problem is quite difficult (as far as I can see), because it runs in the background and only on an hourly basis. I've been trying to use WordPress' update_option function (which basically writes a simple database value) to error-check, but I haven't had much luck - and to be honest, I'm quite confused as to where the problem lies.
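
To illustrate the sort of error-checking I mean: I suspect I'd have better luck appending timestamped lines to a log file rather than overwriting a single option value each time. A minimal sketch (the log path is arbitrary):

    # Minimal logging helper (sketch) - appends rather than overwrites,
    # so the history of a cron run survives for inspection.
    function my_filter_log( $message ) {
        error_log( date( 'Y-m-d H:i:s' ) . ' ' . $message . "\n", 3, WP_CONTENT_DIR . '/filter-debug.log' );
    }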

We ended up putting the website live without this filter working correctly. Sometimes it seems to work, sometimes it doesn't. As a result, we now have quite a few species profiles which aren't correctly filtered.

What I'd Like...
I'm basically seeking advice on the best way to go about running this filter.

Is a Cron Job the answer? I can set up a .php file which runs every day; that wouldn't be a problem. How would it determine which posts need to be filtered? What impact would it have on the server while it ran?
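
To illustrate what I have in mind - a rough sketch only, where run_filter() is my existing function and the wp-load.php path assumes the script sits in the WordPress root:

    <?php
    # filter-cron.php (sketch) - run daily from the system crontab.
    require_once __DIR__ . '/wp-load.php';

    set_time_limit( 0 ); // this could be a long run

    $ids = get_option( 'my_updated_posts' );
    if ( is_array( $ids ) ) {
        foreach ( $ids as $post_id ) {
            if ( get_post_status( $post_id ) == 'publish' )
                run_filter( $post_id );
        }
        // NB: anything queued while this ran would be lost here;
        // a close-out re-check like the one above would still be needed.
        delete_option( 'my_updated_posts' );
    }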

Alternatively, is a WordPress admin page the answer? If I knew how to do it, something along the lines of a page - utilising AJAX - which allowed me to select the posts to run the filter on would be perfect. There's a plugin called AJAX Regenerate Thumbnails which works like this, maybe that would be the most effective?
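
From what I can tell, the server-side half of that would be an admin-ajax handler which filters one post per request - something like this sketch (the my_run_filter action name is made up):

    # Sketch: filter a single post per AJAX request from an admin page.
    add_action( 'wp_ajax_my_run_filter', 'my_ajax_run_filter' );
    function my_ajax_run_filter() {
        if ( !current_user_can( 'manage_options' ) )
            wp_die( 'Insufficient permissions' );

        $post_id = intval( $_POST['post_id'] );
        run_filter( $post_id );

        echo "Filtered post $post_id";
        wp_die(); // end the AJAX request cleanly
    }

The admin page would then fire one request per selected post in sequence, the way Regenerate Thumbnails does, so no single request ever comes near the 32MB limit.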

Considerations

  • The size of the database/information being affected/read/written
  • Which posts are filtered
  • The impact the filter has on the server; especially considering I don't seem to be able to increase the WordPress memory limit past 32MB.
  • Is the actual filter itself efficient, effective and reliable?

This is quite a complex question and I've inevitably (as I was distracted roughly 18 times by colleagues in the process) left out some details. Please feel free to probe me for further information.

Thanks in advance,

1 Answer

#1 (score: 5)

Do it when the profile is created.

Try reversing the whole process. Rather than scanning the content for every word on your lists, check the content's own words against the lists.

  1. Break the post's content into words (split on spaces) as it is entered.
  2. Eliminate duplicates, words shorter than the shortest term in your database, words longer than the longest, and anything on a 'common words' list that you maintain.
  3. Check what remains against each table. If some of your tables contain phrases with spaces, do a %text% search; otherwise do a straight match (much faster), or even build a hash table if it really is that big a problem. (I would do this as a PHP array and cache the result somehow - no sense reinventing the wheel.)
  4. Create your links from the now dramatically smaller list - see the sketch after this list.
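
A minimal sketch of steps 1-3, assuming the term-to-URL lookup is already cached as a PHP array (all names here are placeholders, not a real API):

    # Sketch of the reversed lookup. $term_lookup maps lowercased term => URL
    # and $common_words is the stop list; both are assumed to be loaded already.
    function extract_candidate_words( $content, array $term_lookup, array $common_words ) {
        // 1. break the content into words on whitespace
        $words = preg_split( '/\s+/', strip_tags( $content ), -1, PREG_SPLIT_NO_EMPTY );

        // 2. find the shortest/longest term so impossible lengths can be skipped
        $lengths = array_map( 'strlen', array_keys( $term_lookup ) );
        $min_len = min( $lengths );
        $max_len = max( $lengths );

        $links = array();
        foreach ( array_unique( array_map( 'strtolower', $words ) ) as $word ) {
            $len = strlen( $word );
            if ( $len < $min_len || $len > $max_len || in_array( $word, $common_words ) )
                continue;

            // 3. straight hash-table match against the cached term list
            if ( isset( $term_lookup[ $word ] ) )
                $links[ $word ] = $term_lookup[ $word ];
        }
        return $links; // word => URL, ready for step 4's find-and-replace
    }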

You should easily be able to keep this under a second, even as the list you are checking against grows to 100,000 words. I've done exactly this before, without caching the word lists, for a Bayesian filter.

Even if the matching is greedy and picks up words it shouldn't (matching "clown" will also catch "clown loach"), the resulting list should come to only a few to a few dozen words with links, and a find-and-replace over a chunk of text with a list that small takes no time at all.

The above doesn't really address your concern over the older profiles. You don't say exactly how many there are, just that there is a lot of text spread over 1400 to 3100 entries (both types combined). You could work through this older content by popularity, if you have that information, or by date entered, newest first. Regardless, the best way is to write a script that suspends PHP's time limit and batch-runs a load/process/save over all the posts. If each one takes about a second (probably much less, but worst case) you are talking 3100 seconds, which is a little under an hour.
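
Something along these lines would do it (a sketch; run_filter() stands in for whatever does your load/process/save, and the wp-load.php path depends on where the script lives):

    <?php
    # One-off backfill (sketch): filter every published item, newest first.
    require_once __DIR__ . '/wp-load.php';

    set_time_limit( 0 ); // suspend PHP's execution time limit

    $ids = get_posts( array(
        'post_type'   => array( 'species', 'post' ),
        'post_status' => 'publish',
        'numberposts' => -1,
        'fields'      => 'ids',  // fetch IDs only, to keep memory down
        'orderby'     => 'date',
        'order'       => 'DESC', // newest first
    ) );

    foreach ( $ids as $post_id ) {
        run_filter( $post_id );
    }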
