Automatically spell-checking words in text

Date: 2022-09-13 09:53:33

[EDIT] In short: How would you write an automatic spell checker? The idea is that the checker builds a list of words from a known good source (a dictionary) and automatically adds new words when they are used often enough. Words which haven't been used in a while should be phased out. So if I delete part of a scene which contains "Mungrohyperiofier", the checker should remember it for a while, and when I type "Mung<Ctrl+Space>" in another scene, it should offer it again. If I don't use the word for, say, a few days, it should forget about it.

At the same time, I'd like to avoid adding typos to the dictionary.[/EDIT]

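A minimal sketch of the timestamp-based aging I have in mind (all names are illustrative, not from any existing library):

    import time

    RETENTION_SECONDS = 3 * 24 * 3600  # "a few days"; tune to taste

    class WordStore:
        """Remembers words with a last-used timestamp; stale words age out."""

        def __init__(self):
            self.last_used = {}  # word -> unix timestamp of last use

        def touch(self, word):
            """Record that a word was just used (typed, completed, or kept on save)."""
            self.last_used[word] = time.time()

        def phase_out(self):
            """Forget words that fell outside the retention window."""
            cutoff = time.time() - RETENTION_SECONDS
            self.last_used = {w: t for w, t in self.last_used.items() if t >= cutoff}

        def complete(self, prefix):
            """'Mung' -> ['Mungrohyperiofier'], as long as the word is still remembered."""
            return sorted(w for w in self.last_used if w.startswith(prefix))
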
I want to write a text editor for SciFi stories. The editor should offer word completion for any word used anywhere in the current story. It will only offer a single scene of the story for editing (so you can easily move scenes around).

This means I have three sets:

  1. The set of all words in all other scenes

  2. The set of words in the current scene before I started editing it

  3. The set of words in the current editor

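Completion itself is then just a prefix filter over the union of these sets; a trivial sketch (the function name is illustrative):

    def completions(prefix, *word_sets):
        """Candidates for <Ctrl+Space>: every known word starting with the prefix."""
        known = set().union(*word_sets)
        return sorted(w for w in known if w.startswith(prefix))

    # e.g. completions("Mung", other_scenes, scene_before, editor_words)
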
I need to store the sets somewhere as it would be too expensive to build the list from scratch every time. I think a simple plain text file with one word per line is enough for that.

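Reading and writing that file takes only a few lines; a sketch, assuming UTF-8 and one word-list file per scene:

    def save_words(path, words):
        # One word per line, sorted so successive saves diff cleanly.
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(sorted(words)) + "\n")

    def load_words(path):
        try:
            with open(path, encoding="utf-8") as f:
                return {line.strip() for line in f if line.strip()}
        except FileNotFoundError:
            return set()  # this scene has no word list yet
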
As the user edits the scene, we have these situations (a bookkeeping sketch follows the list):

  1. She deletes a word. This word is not used anywhere else in the current scene.

  2. She types a word which is new.

  3. She types a word which already exists.

  4. She types a word which already exists but makes a typo.

  5. She corrects a typo in a word which is in set #2.

  6. She corrects a typo in a word which is in set #1 (i.e. the typo is elsewhere, too).

  7. She deletes a word which she plans to use again. After the deletion, the word is no longer in sets #1 and #3, though.

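Most of these cases reduce to set differences computed at save time: words in #3 but not in #2 were added, words in #2 but not in #3 were deleted, and a deleted word only leaves the story entirely if it is also absent from #1. A sketch of that bookkeeping (the graveyard dict is an illustrative name for the "remember it for a while" store):

    import time

    def reconcile(other_scenes, scene_before, scene_now, graveyard):
        """Update the word bookkeeping when a scene is saved.

        other_scenes -- set #1: words in all other scenes
        scene_before -- set #2: words in this scene before editing
        scene_now    -- set #3: words in the editor at save time
        graveyard    -- dict word -> deletion time, for "remember a while"
        """
        added = scene_now - scene_before
        deleted = scene_before - scene_now

        for word in deleted:
            if word not in other_scenes:
                # Gone from the whole story: keep it around for a while so
                # "Mung<Ctrl+Space>" still works, then let it age out (case 7).
                graveyard[word] = time.time()
        for word in added:
            graveyard.pop(word, None)  # used again, so it is alive (cases 2 and 3)

        return added, deleted
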
The obvious strategy would be to rebuild the word sets when a scene is saved and build set #1 from a word-list file per scene.

So my question is: Is there a clever strategy to keep words which aren't used anywhere anymore, but still be able to phase out typos? If possible, this strategy should work in the background without the user even noticing what is going on (i.e. I want to avoid having to grab the mouse to select "add word to dictionary" from the menu).

[EDIT] Based on a comment from grieve

3 solutions

#1 (2 votes)

So you want to write a spelling checker. Here's Peter Norvig's paper about writing a spelling corrector. It describes a simple and robust spelling corrector. You can use the already-written part of the book, plus a reference list (say, from a free dictionary) for the language model. I would also look at existing open-source spelling checkers, such as aspell and hunspell, to get some ideas.

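For reference, the heart of Norvig's corrector is only a few lines: generate every string within one edit of the typo and keep the known candidate with the highest corpus frequency. A condensed sketch in Python (here WORDS would be built from a dictionary plus the story text):

    from collections import Counter

    WORDS = Counter()  # word -> frequency, built from a dictionary plus the story

    def edits1(word, letters="abcdefghijklmnopqrstuvwxyz"):
        """All strings one edit (delete, transpose, replace, insert) away from `word`."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word):
        """Return the most frequent known candidate, or the word itself."""
        if word in WORDS:
            return word
        candidates = [w for w in edits1(word) if w in WORDS]
        return max(candidates, key=WORDS.__getitem__, default=word)
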
#2 (0 votes)

The structure you should use is a trie. Tail/suffix compression will help with memory. You can use a pseudo reference counting GC for keeping track of usage.

For the actual nodes, you would probably need no more than a 32-bit integer: 21 bits for Unicode and the rest for various other tags and information.

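In a high-level language, a dict-based version of the same idea, with a per-word reference count standing in for the GC bookkeeping, might look like this (a sketch; the packed 32-bit node above is an optimization of this structure):

    class TrieNode:
        __slots__ = ("children", "refcount")

        def __init__(self):
            self.children = {}  # character -> TrieNode
            self.refcount = 0   # > 0 marks the end of a word; counts its uses

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def add(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.refcount += 1

        def remove(self, word):
            """Decrement a word's count; pruning dead branches is omitted for brevity."""
            node = self.root
            for ch in word:
                node = node.children.get(ch)
                if node is None:
                    return
            node.refcount = max(0, node.refcount - 1)

        def complete(self, prefix):
            """Yield every stored word starting with `prefix`."""
            node = self.root
            for ch in prefix:
                node = node.children.get(ch)
                if node is None:
                    return
            stack = [(node, prefix)]
            while stack:
                n, word = stack.pop()
                if n.refcount > 0:
                    yield word
                for ch, child in n.children.items():
                    stack.append((child, word + ch))
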
#3 (0 votes)

Reminds me of what I have been told about garbage collection in modern LISP implementations:

  1. Data, when created, is put in "pool 1".

  2. When there is a need to garbage collect, the collector looks in pool 1 for unused entries and removes them.

  3. Any remaining entry is then moved to pool 2.

  4. Pool 2 is examined only when more memory is needed than pool 1 can release.

  5. Data from pool 2 that survives a garbage collection is put in pool 3, and so on.

The idea is to dynamically put the data in a pool corresponding to its lifetime...

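Carried over to words, the generational idea becomes: new words sit in a probation pool and are only promoted to a trusted pool after surviving repeated use, while typos (which tend to be corrected quickly) die young. A sketch with two pools and an invented promotion threshold:

    class GenerationalWords:
        """Words earn their way from a probation pool into the trusted pool."""

        PROMOTE_AFTER = 3  # uses before a word counts as real; pure guesswork

        def __init__(self):
            self.pool1 = {}     # young generation: word -> use count
            self.pool2 = set()  # old generation: trusted words

        def use(self, word):
            if word in self.pool2:
                return
            self.pool1[word] = self.pool1.get(word, 0) + 1
            if self.pool1[word] >= self.PROMOTE_AFTER:
                self.pool2.add(word)
                del self.pool1[word]

        def collect(self, words_still_in_story):
            """Run at save time: any probation word that no longer appears
            anywhere in the story is assumed to be a typo and dropped."""
            self.pool1 = {w: c for w, c in self.pool1.items()
                          if w in words_still_in_story}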