
Time: 2022-09-13 09:53:33

[EDIT] In short: How would you write an automatic spell checker? The idea is that the checker builds a list of words from a known good source (a dictionary) and automatically adds new words when they are used often enough. Words which haven't been used for a while should be phased out. So if I delete part of a scene which contains "Mungrohyperiofier", the checker should remember it for a while, and when I type "Mung<Ctrl+Space>" in another scene, it should offer it again. If I don't use the word for, say, a few days, it should forget about it.

At the same time, I'd like to avoid adding typos to the dictionary.[/EDIT]

I want to write a text editor for SciFi stories. The editor should offer word completion for any word used anywhere in the current story. It will only offer a single scene of the story for editing (so you can easily move scenes around).


This means I have three sets:


  1. The set of all words in all other scenes

  2. The set of words in the current scene before I started editing it

  3. The set of words in the current editor

I need to store the sets somewhere as it would be too expensive to build the list from scratch every time. I think a simple plain text file with one-word-per-line is enough for that.
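A minimal sketch of that storage, assuming one word-list file per scene in UTF-8. The helper names (`words_in`, `save_words`, `load_words`) and the regular expression used to decide what counts as a word are illustrative choices, not part of the question:

```python
import re
from pathlib import Path

# Assumption: a "word" is a run of letters, optionally joined by ' or -.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def words_in(text: str) -> set:
    """Return the set of distinct words found in a piece of scene text."""
    return {m.group(0) for m in WORD_RE.finditer(text)}


def save_words(path: Path, words: set) -> None:
    """Write the set as a plain text file, one word per line, sorted."""
    path.write_text("\n".join(sorted(words)) + "\n", encoding="utf-8")


def load_words(path: Path) -> set:
    """Read a one-word-per-line file; a missing file yields an empty set."""
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}
```

With files like these, set #1 could be built as the union of `load_words(...)` over all other scenes, and set #2 from the current scene's file.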


As the user edits the scene, we have these situations:


  1. She deletes a word. This word is not used anywhere else in the current scene.

  2. She types a word which is new.

  3. She types a word which already exists.

  4. She types a word which already exists but makes a typo.

  5. She corrects a typo in a word which is in set #2.

  6. She corrects a typo in a word which is in set #1 (i.e. the typo is elsewhere, too).

  7. She deletes a word which she plans to use again. After the deletion, the word is no longer in the sets #1 and #3, though.

The obvious strategy would be to rebuild the word sets when a scene is saved and build set #1 from a word-list file per scene.


So my question is: Is there a clever strategy to keep words which aren't used anywhere anymore but still be able to phase out typos? If possible, this strategy should work in the background without the user even noticing what is going on (i.e. I want to avoid having to grab the mouse to select "add word to dictionary" from the menu).


[EDIT] Based on a comment from grieve


3 solutions



So you want to write a spelling checker. Here's Peter Norvig's paper about writing a spelling corrector. It describes a simple and robust spelling corrector. You can use the already-written part of the book, plus a reference list (say from a free dictionary) for the language model. I would also go to existing open-source spelling checkers, such as aspell and hunspell, to get some ideas.
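As a rough illustration of how the edit-distance idea could help keep typos out of the word list, here is a sketch in the spirit of that approach. It is not Norvig's code: `likely_typo_of` and the `counts` frequency table are hypothetical names, and the lowercase ASCII alphabet is an assumption that would need widening for invented SciFi vocabulary.

```python
import string
from collections import Counter
from typing import Optional


def edits1(word: str, alphabet: str = string.ascii_lowercase) -> set:
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def likely_typo_of(word: str, counts: Counter) -> Optional[str]:
    """Return the most frequent known word one edit away, or None.

    A word that is already known, or that has no known neighbour,
    is not treated as a typo.
    """
    if word in counts:
        return None
    known = [w for w in edits1(word) if w in counts]
    return max(known, key=counts.__getitem__) if known else None
```

An editor could hold back a new word from the dictionary while `likely_typo_of` returns a candidate, and only accept it once it has been used often enough.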




The structure you should use is a trie. Tail/suffix compression will help with memory. You can use a pseudo reference counting GC for keeping track of usage.


For the actual nodes, you would probably need no more than a 32-bit integer: 21 bits for the Unicode code point, and the rest for various other tags and information.
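A small sketch of a trie with a per-word use count, to show how prefix completion and reference-count-style cleanup could fit together. It uses plain Python objects rather than the packed 32-bit nodes or tail compression described above, and the class and method names (`WordTrie`, `add`, `release`, `complete`) are made up for illustration:

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.count = 0      # number of live uses of the word ending here


class WordTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        """Record one use of word."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def release(self, word):
        """Drop one use of word and prune nodes that lead to no word."""
        path = [self.root]
        for ch in word:
            nxt = path[-1].children.get(ch)
            if nxt is None:
                return  # word was never stored
            path.append(nxt)
        if path[-1].count == 0:
            return
        path[-1].count -= 1
        # Walk back towards the root, removing dead leaf nodes.
        for i in range(len(word) - 1, -1, -1):
            child = path[i + 1]
            if child.count or child.children:
                break
            del path[i].children[word[i]]

    def complete(self, prefix):
        """Return all stored words starting with prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)
```

Here `complete("Mung")` plays the role of the Ctrl+Space lookup; delaying the call to `release` would be one way to keep a deleted word around for a while.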




This reminds me of what I have been told about garbage collection in modern LISP implementations:


Data, when created, is put in "pool 1".


When there is a need to garbage collect, the garbage collector looks in pool 1 for unused entries and removes them.


Then any remaining entry is moved to pool 2.


Pool 2 is examined only when there is a need for more memory than pool 1 can release.


Data from pool 2 that survives a garbage collection is put in pool 3, and so on.


The idea is to dynamically put the data in a pool corresponding to its lifetime...
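One possible way to carry the pool analogy over to words, sketched under several assumptions: a "collection" happens whenever a scene is saved, a word counts as "used" if it appears in any scene at that moment, and older pools tolerate more unused collections before a word is dropped. The class name `WordPools` and the `grace` values are invented for this example.

```python
class WordPools:
    def __init__(self, grace=(0, 3, 10)):
        # grace[g]: how many collections an unused word in pool g survives.
        self.grace = grace
        self.pool = {}  # word -> pool index (0 is the youngest pool)
        self.idle = {}  # word -> consecutive collections without use

    def collect(self, words_in_use):
        """Run one collection given the set of words currently in the story."""
        for word in words_in_use:
            if word in self.pool:
                # Survived another collection: promote to an older pool.
                self.pool[word] = min(self.pool[word] + 1, len(self.grace) - 1)
            else:
                self.pool[word] = 0
            self.idle[word] = 0
        for word in list(self.pool):
            if word not in words_in_use:
                self.idle[word] += 1
                if self.idle[word] > self.grace[self.pool[word]]:
                    del self.pool[word], self.idle[word]

    def known(self):
        """Words that should still be offered for completion."""
        return set(self.pool)
```

With these settings, a typo that is corrected before the next save disappears at once, while a word that has survived several saves lingers for a few more after it is deleted.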




