-rw-r--r-- runtime/doc/develop.txt 105
-rw-r--r-- runtime/doc/spell.txt 113
2 files changed, 108 insertions, 110 deletions
diff --git a/runtime/doc/develop.txt b/runtime/doc/develop.txt
index 8881845fdd..7ced8dd016 100644
--- a/runtime/doc/develop.txt
+++ b/runtime/doc/develop.txt
@@ -153,110 +153,5 @@ window View on a buffer. There can be several windows in Vim,
fit in the shell.
-Spell checking *develop-spell*
-
-When spell checking was going to be added to Vim a survey was done over the
-available spell checking libraries and programs. Unfortunately, the result
-was that none of them provided sufficient capabilities to be used as the spell
-checking engine in Vim, for various reasons:
-
-- Missing support for multi-byte encodings. At least UTF-8 must be supported,
- so that more than one language can be used in the same file.
- Doing on-the-fly conversion is not always possible (would require iconv
- support).
-- For the programs and libraries: Using them as-is would require installing
- them separately from Vim. That's mostly not impossible, but a drawback.
-- Performance: A few tests showed that it's possible to check spelling on the
- fly (while redrawing), just like syntax highlighting. But the mechanisms
- used by other code are much slower. Myspell uses a hashtable, for example.
- The affix compression that most spell checkers use makes it slower too.
-- For using an external program like aspell a communication mechanism would
- have to be setup. That's complicated to do in a portable way (Unix-only
- would be relatively simple, but that's not good enough). And performance
- will become a problem (lots of process switching involved).
-- Missing support for words with non-word characters, such as "Etten-Leur" and
- "et al.", would require marking the pieces of them OK, lowering the
- reliability.
-- Missing support for regions or dialects. Makes it difficult to accept
- all English words and highlight non-Canadian words differently.
-- Missing support for rare words. Many words are correct but hardly ever used
- and could be a misspelled often-used word.
-- For making suggestions the speed is less important and requiring to install
- another program or library would be acceptable. But the word lists probably
- differ, the suggestions may be wrong words.
-
-
-Spelling suggestions *develop-spell-suggestions*
-
-For making suggestions there are two basic mechanisms:
-1. Try changing the bad word a little bit and check for a match with a good
- word. Or go through the list of good words, change them a little bit and
- check for a match with the bad word. The changes are deleting a character,
- inserting a character, swapping two characters, etc.
-2. Perform soundfolding on both the bad word and the good words and then find
- matches, possibly with a few changes like with the first mechanism.
-
-The first is good for finding typing mistakes. After experimenting with
-hashtables and looking at solutions from other spell checkers the conclusion
-was that a trie (a kind of tree structure) is ideal for this. Both for
-reducing memory use and being able to try sensible changes. For example, when
-inserting a character only characters that lead to good words need to be
-tried. Other mechanisms (with hashtables) need to try all possible letters at
-every position in the word. Also, a hashtable has the requirement that word
-boundaries are identified separately, while a trie does not require this.
-That makes the mechanism a lot simpler.
-
-Soundfolding is useful when someone knows how the words sounds but doesn't
-know how it is spelled. For example, the word "dictionary" might be written
-as "daktonerie". The number of changes that the first method would need to
-try is very big, it's hard to find the good word that way. After soundfolding
-the words become "tktnr" and "tkxnry", these differ by only two letters.
-
-To find words by their soundfolded equivalent (soundalike word) we need a list
-of all soundfolded words. A few experiments have been done to find out what
-the best method is. Alternatives:
-1. Do the sound folding on the fly when looking for suggestions. This means
- walking through the trie of good words, soundfolding each word and
- checking how different it is from the bad word. This is very efficient for
- memory use, but takes a long time. On a fast PC it takes a couple of
- seconds for English, which can be acceptable for interactive use. But for
- some languages it takes more than ten seconds (e.g., German, Catalan),
- which is unacceptable slow. For batch processing (automatic corrections)
- it's too slow for all languages.
-2. Use a trie for the soundfolded words, so that searching can be done just
- like how it works without soundfolding. This requires remembering a list
- of good words for each soundfolded word. This makes finding matches very
- fast but requires quite a lot of memory, in the order of 1 to 10 Mbyte.
- For some languages more than the original word list.
-3. Like the second alternative, but reduce the amount of memory by using affix
- compression and store only the soundfolded basic word. This is what Aspell
- does. Disadvantage is that affixes need to be stripped from the bad word
- before soundfolding it, which means that mistakes at the start and/or end
- of the word will cause the mechanism to fail. Also, this becomes slow when
- the bad word is quite different from the good word.
-
-The choice made is to use the second mechanism and use a separate file. This
-way a user with sufficient memory can get very good suggestions while a user
-who is short of memory or just wants the spell checking and no suggestions
-doesn't use so much memory.
-
-
-Word frequency
-
-For sorting suggestions it helps to know which words are common. In theory we
-could store a word frequency with the word in the dictionary. However, this
-requires storing a count per word. That degrades word tree compression a lot.
-And maintaining the word frequency for all languages will be a heavy task.
-Also, it would be nice to prefer words that are already in the text. This way
-the words that appear in the specific text are preferred for suggestions.
-
-What has been implemented is to count words that have been seen during
-displaying. A hashtable is used to quickly find the word count. The count is
-initialized from words listed in COMMON items in the affix file, so that it
-also works when starting a new file.
-
-This isn't ideal, because the longer Vim is running the higher the counts
-become. But in practice it is a noticeable improvement over not using the word
-count.
vim:tw=78:ts=8:ft=help:norl:
diff --git a/runtime/doc/spell.txt b/runtime/doc/spell.txt
index a767f6cbbf..0902d5d10f 100644
--- a/runtime/doc/spell.txt
+++ b/runtime/doc/spell.txt
@@ -1,4 +1,4 @@
-*spell.txt* For Vim version 7.4. Last change: 2016 Jan 08
+*spell.txt*
VIM REFERENCE MANUAL by Bram Moolenaar
@@ -11,10 +11,6 @@ Spell checking *spell*
3. Generating a spell file |spell-mkspell|
4. Spell file format |spell-file-format|
-Note: There also is a vimspell plugin. If you have it you can do ":help
-vimspell" to find about it. But you will probably want to get rid of the
-plugin and use the 'spell' option instead, it works better.
-
==============================================================================
1. Quick start *spell-quickstart* *E756*
@@ -1633,4 +1629,111 @@ WORDCHARS (Hunspell) *spell-WORDCHARS*
is no need to separate words before checking them (using a
trie instead of a hashtable).
+==============================================================================
+5. Spell checker design *develop-spell*
+
+When spell checking was going to be added to Vim a survey was done over the
+available spell checking libraries and programs. Unfortunately, the result
+was that none of them provided sufficient capabilities to be used as the spell
+checking engine in Vim, for various reasons:
+
+- Missing support for multi-byte encodings. At least UTF-8 must be supported,
+ so that more than one language can be used in the same file.
+ Doing on-the-fly conversion is not always possible (would require iconv
+ support).
+- For the programs and libraries: Using them as-is would require installing
+ them separately from Vim. That's mostly not impossible, but a drawback.
+- Performance: A few tests showed that it's possible to check spelling on the
+ fly (while redrawing), just like syntax highlighting. But the mechanisms
+ used by other code are much slower. Myspell uses a hashtable, for example.
+ The affix compression that most spell checkers use makes it slower too.
+- For using an external program like aspell a communication mechanism would
+ have to be setup. That's complicated to do in a portable way (Unix-only
+ would be relatively simple, but that's not good enough). And performance
+ will become a problem (lots of process switching involved).
+- Missing support for words with non-word characters, such as "Etten-Leur" and
+ "et al.", would require marking the pieces of them OK, lowering the
+ reliability.
+- Missing support for regions or dialects. Makes it difficult to accept
+ all English words and highlight non-Canadian words differently.
+- Missing support for rare words. Many words are correct but hardly ever used
+ and could be a misspelled often-used word.
+- For making suggestions the speed is less important and requiring to install
+ another program or library would be acceptable. But the word lists probably
+ differ, the suggestions may be wrong words.
+
+
+Spelling suggestions *develop-spell-suggestions*
+
+For making suggestions there are two basic mechanisms:
+1. Try changing the bad word a little bit and check for a match with a good
+ word. Or go through the list of good words, change them a little bit and
+ check for a match with the bad word. The changes are deleting a character,
+ inserting a character, swapping two characters, etc.
+2. Perform soundfolding on both the bad word and the good words and then find
+ matches, possibly with a few changes like with the first mechanism.
+
+The first is good for finding typing mistakes. After experimenting with
+hashtables and looking at solutions from other spell checkers the conclusion
+was that a trie (a kind of tree structure) is ideal for this. Both for
+reducing memory use and being able to try sensible changes. For example, when
+inserting a character only characters that lead to good words need to be
+tried. Other mechanisms (with hashtables) need to try all possible letters at
+every position in the word. Also, a hashtable has the requirement that word
+boundaries are identified separately, while a trie does not require this.
+That makes the mechanism a lot simpler.
+
+Soundfolding is useful when someone knows how the words sounds but doesn't
+know how it is spelled. For example, the word "dictionary" might be written
+as "daktonerie". The number of changes that the first method would need to
+try is very big, it's hard to find the good word that way. After soundfolding
+the words become "tktnr" and "tkxnry", these differ by only two letters.
+
+To find words by their soundfolded equivalent (soundalike word) we need a list
+of all soundfolded words. A few experiments have been done to find out what
+the best method is. Alternatives:
+1. Do the sound folding on the fly when looking for suggestions. This means
+ walking through the trie of good words, soundfolding each word and
+ checking how different it is from the bad word. This is very efficient for
+ memory use, but takes a long time. On a fast PC it takes a couple of
+ seconds for English, which can be acceptable for interactive use. But for
+ some languages it takes more than ten seconds (e.g., German, Catalan),
+ which is unacceptable slow. For batch processing (automatic corrections)
+ it's too slow for all languages.
+2. Use a trie for the soundfolded words, so that searching can be done just
+ like how it works without soundfolding. This requires remembering a list
+ of good words for each soundfolded word. This makes finding matches very
+ fast but requires quite a lot of memory, in the order of 1 to 10 Mbyte.
+ For some languages more than the original word list.
+3. Like the second alternative, but reduce the amount of memory by using affix
+ compression and store only the soundfolded basic word. This is what Aspell
+ does. Disadvantage is that affixes need to be stripped from the bad word
+ before soundfolding it, which means that mistakes at the start and/or end
+ of the word will cause the mechanism to fail. Also, this becomes slow when
+ the bad word is quite different from the good word.
+
+The choice made is to use the second mechanism and use a separate file. This
+way a user with sufficient memory can get very good suggestions while a user
+who is short of memory or just wants the spell checking and no suggestions
+doesn't use so much memory.
+
+
+Word frequency
+
+For sorting suggestions it helps to know which words are common. In theory we
+could store a word frequency with the word in the dictionary. However, this
+requires storing a count per word. That degrades word tree compression a lot.
+And maintaining the word frequency for all languages will be a heavy task.
+Also, it would be nice to prefer words that are already in the text. This way
+the words that appear in the specific text are preferred for suggestions.
+
+What has been implemented is to count words that have been seen during
+displaying. A hashtable is used to quickly find the word count. The count is
+initialized from words listed in COMMON items in the affix file, so that it
+also works when starting a new file.
+
+This isn't ideal, because the longer Vim is running the higher the counts
+become. But in practice it is a noticeable improvement over not using the word
+count.
+
vim:tw=78:sw=4:ts=8:ft=help:norl:
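
Note on the moved text: the design section argues that a trie lets the checker try only insertions that can still lead to a good word, instead of trying every letter at every position. The following is a minimal Python sketch of that idea, not Vim's actual implementation (which is written in C and uses a compressed tree). The names `TrieNode`, `build_trie` and `insertion_suggestions` are hypothetical and chosen for illustration only.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next node
        self.is_word = False  # True when the path from the root spells a good word


def build_trie(words):
    """Build a plain (uncompressed) trie from an iterable of good words."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def insertion_suggestions(root, bad_word):
    """Return the good words obtained by inserting exactly one character.

    Only characters that exist as children of the current trie node are
    tried, so branches that cannot lead to a good word are never explored.
    """
    found = set()

    def walk(node, pos, inserted, prefix):
        if pos == len(bad_word):
            # All characters of the bad word are consumed.
            if inserted and node.is_word:
                found.add(prefix)
        else:
            # Follow the next character of the bad word unchanged.
            ch = bad_word[pos]
            if ch in node.children:
                walk(node.children[ch], pos + 1, inserted, prefix + ch)
        if not inserted:
            # Try inserting a character here, limited to the trie's children.
            for ch, child in node.children.items():
                walk(child, pos, True, prefix + ch)

    walk(root, 0, False, "")
    return sorted(found)


good_words = build_trie(["spell", "spelt", "spill", "sell", "shell"])
print(insertion_suggestions(good_words, "spel"))
print(insertion_suggestions(good_words, "sell"))
```

A hashtable-based checker would have to generate every candidate string (each letter of the alphabet at each position) and look each one up. Here the set of candidates is bounded by the branches present in the trie. Deletions and swaps could be handled with the same walk, but they are left out to keep the sketch short.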