Class WordFormGenerator
- java.lang.Object
- org.apache.lucene.analysis.hunspell.WordFormGenerator
public class WordFormGenerator extends java.lang.Object
A utility class used for generating possible word forms by adding affixes to stems (getAllWordForms(String, String, Runnable)), and for suggesting stems and flags to generate the given set of words (compress(List, Set, Runnable)).
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class):
private static class WordFormGenerator.AffixEntry
private static class WordFormGenerator.FlagSet
private static class WordFormGenerator.State
private class WordFormGenerator.WordCompressor
-
Field Summary
Fields (Modifier and Type, Field):
private CharObjectHashMap<java.util.List<WordFormGenerator.AffixEntry>> affixes
private Dictionary dictionary
private Stemmer stemmer
-
Constructor Summary
Constructors:
WordFormGenerator(Dictionary dictionary)
-
Method Summary
Methods (Modifier and Type, Method, Description):
private char[] appendFlags(WordFormGenerator.AffixEntry affix)
protected boolean canStemToOriginal(AffixedWord derived)
A sanity check that the word form generated by affixation in getAllWordForms(String, String, Runnable) is indeed accepted by the spell-checker and analyzed to be the form of the original dictionary entry.
EntrySuggestion compress(java.util.List<java.lang.String> words, java.util.Set<java.lang.String> forbidden, java.lang.Runnable checkCanceled)
Given a list of words, try to produce a smaller set of dictionary entries (with some flags) that would generate these words.
private AffixCondition condition(int affixId)
private static char[] deduplicate(char[] flags)
private java.util.List<AffixedWord> expand(AffixedWord stem, char[] flags, java.lang.Runnable checkCanceled)
private void fillAffixMap(FST<IntsRef> fst, AffixKind kind)
void generateAllSimpleWords(java.util.function.Consumer<AffixedWord> consumer, java.lang.Runnable checkCanceled)
Traverse the whole dictionary and derive all word forms via affixation (as in getAllWordForms(String, String, Runnable)) for each of the entries.
java.util.List<AffixedWord> getAllWordForms(java.lang.String root, java.lang.Runnable checkCanceled)
Generate all word forms for all dictionary entries with the given root word.
java.util.List<AffixedWord> getAllWordForms(java.lang.String stem, java.lang.String flags, java.lang.Runnable checkCanceled)
Generate all word forms for the given root pretending it has the given flags (in the same format as the dictionary uses).
private java.util.List<AffixedWord> getAllWordForms(DictEntry entry, char[] encodedFlags, java.lang.Runnable checkCanceled)
private boolean isCompatibleWithPreviousAffixes(AffixedWord stem, AffixKind kind, char flag)
private boolean isForbiddenWord(char[] chars, int offset, int length)
private boolean shouldConsiderAtAll(char[] flags)
private static char[] sortAndDeduplicate(char[] flags)
private java.lang.String strip(int affixId)
private java.lang.String toString(AffixKind kind, IntsRef input)
private char[] updateFlags(char[] flags, char toRemove, char[] toAppend)
-
-
-
Field Detail
-
dictionary
private final Dictionary dictionary
-
affixes
private final CharObjectHashMap<java.util.List<WordFormGenerator.AffixEntry>> affixes
-
stemmer
private final Stemmer stemmer
-
-
Constructor Detail
-
WordFormGenerator
public WordFormGenerator(Dictionary dictionary)
-
-
Method Detail
-
condition
private AffixCondition condition(int affixId)
-
strip
private java.lang.String strip(int affixId)
-
getAllWordForms
public java.util.List<AffixedWord> getAllWordForms(java.lang.String root, java.lang.Runnable checkCanceled)
Generate all word forms for all dictionary entries with the given root word. The result order is stable but not specified. This is equivalent to "unmunch" from the "hunspell-tools" package.
Parameters:
checkCanceled - an object that is called periodically, allowing the generation to be interrupted by throwing an exception
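The checkCanceled contract can be illustrated without Lucene itself. The following is a minimal sketch, assuming a hypothetical long-running loop (generate) that stands in for word-form generation; the class name CheckCanceledSketch and both of its methods are illustrative and not part of the Lucene API. The caller supplies a Runnable that throws an unchecked exception once a cancellation flag is set, and the worker calls it periodically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: a stand-in for a long-running generator that honors a checkCanceled callback.
class CheckCanceledSketch {

  // Hypothetical worker: produces `count` items, invoking checkCanceled before each one.
  static List<String> generate(int count, Runnable checkCanceled) {
    List<String> result = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      checkCanceled.run(); // throws if the caller wants to stop
      result.add("form" + i);
    }
    return result;
  }

  // Builds a callback that throws CancellationException once the flag is set.
  static Runnable cancelWhen(AtomicBoolean canceled) {
    return () -> {
      if (canceled.get()) {
        throw new CancellationException("generation canceled");
      }
    };
  }

  public static void main(String[] args) {
    AtomicBoolean canceled = new AtomicBoolean(false);
    System.out.println(generate(3, cancelWhen(canceled)).size());

    canceled.set(true);
    try {
      generate(3, cancelWhen(canceled));
    } catch (CancellationException e) {
      System.out.println("stopped: " + e.getMessage());
    }
  }
}
```

With the real class, a Runnable of the same shape would be passed as the last argument of getAllWordForms, generateAllSimpleWords, or compress. The choice of CancellationException here is an assumption; the documentation only states that the callback interrupts the work by throwing an exception.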
-
getAllWordForms
public java.util.List<AffixedWord> getAllWordForms(java.lang.String stem, java.lang.String flags, java.lang.Runnable checkCanceled)
Generate all word forms for the given stem, treating it as if it had the given flags (in the same format as the dictionary uses). The result order is stable but not specified. This is equivalent to "unmunch" from the "hunspell-tools" package.
Parameters:
checkCanceled - an object that is called periodically, allowing the generation to be interrupted by throwing an exception
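To make the idea of "unmunching" a stem with flags concrete, here is a toy sketch of suffix expansion. It is not Lucene's implementation: the Suffix type, the RULES table, and the expand method are hypothetical simplifications (suffixes only, one rule application per form, conditions written as plain regular expressions), and including the bare stem in the result is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy suffix expander, not Lucene's implementation.
class UnmunchSketch {

  // Hypothetical simplified suffix rule: the flag that enables it, the characters to strip
  // from the end of the stem, the characters to append, and a regex the stem must end with.
  static final class Suffix {
    final char flag;
    final String strip;
    final String append;
    final String condition;

    Suffix(char flag, String strip, String append, String condition) {
      this.flag = flag;
      this.strip = strip;
      this.append = append;
      this.condition = condition;
    }

    boolean appliesTo(String stem) {
      return stem.endsWith(strip) && stem.matches(".*" + condition);
    }

    String apply(String stem) {
      return stem.substring(0, stem.length() - strip.length()) + append;
    }
  }

  // Made-up English-like rules for the demo.
  static final List<Suffix> RULES = List.of(
      new Suffix('D', "", "ed", "[^ey]"),
      new Suffix('D', "", "d", "e"),
      new Suffix('G', "", "ing", "[^e]"),
      new Suffix('G', "e", "ing", "e"));

  // Returns the stem followed by every form produced by an applicable rule whose flag
  // occurs in `flags` (one character per flag in this sketch).
  static List<String> expand(String stem, String flags) {
    List<String> forms = new ArrayList<>();
    forms.add(stem);
    for (Suffix rule : RULES) {
      if (flags.indexOf(rule.flag) >= 0 && rule.appliesTo(stem)) {
        forms.add(rule.apply(stem));
      }
    }
    return forms;
  }

  public static void main(String[] args) {
    System.out.println(expand("work", "DG"));
    System.out.println(expand("bake", "DG"));
  }
}
```

In this sketch, the stem "bake" with flags "DG" yields bake, baked, and baking. The real method returns AffixedWord objects rather than strings and works from the loaded dictionary's affix rules.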
-
getAllWordForms
private java.util.List<AffixedWord> getAllWordForms(DictEntry entry, char[] encodedFlags, java.lang.Runnable checkCanceled)
-
sortAndDeduplicate
private static char[] sortAndDeduplicate(char[] flags)
-
deduplicate
private static char[] deduplicate(char[] flags)
-
canStemToOriginal
protected boolean canStemToOriginal(AffixedWord derived)
A sanity check that the word form generated by affixation in getAllWordForms(String, String, Runnable) is indeed accepted by the spell-checker and analyzed to be the form of the original dictionary entry. This can be overridden for cases where such a check is unnecessary or can be done more efficiently.
-
isForbiddenWord
private boolean isForbiddenWord(char[] chars, int offset, int length)
-
expand
private java.util.List<AffixedWord> expand(AffixedWord stem, char[] flags, java.lang.Runnable checkCanceled)
-
shouldConsiderAtAll
private boolean shouldConsiderAtAll(char[] flags)
-
updateFlags
private char[] updateFlags(char[] flags, char toRemove, char[] toAppend)
-
appendFlags
private char[] appendFlags(WordFormGenerator.AffixEntry affix)
-
generateAllSimpleWords
public void generateAllSimpleWords(java.util.function.Consumer<AffixedWord> consumer, java.lang.Runnable checkCanceled)
Traverse the whole dictionary and derive all word forms via affixation (as in getAllWordForms(String, String, Runnable)) for each of the entries. The iteration order is undefined. Only "simple" words are returned; no compounding flags are processed. Upper- and title-case variations are not returned, even if the spellchecker accepts them.
Parameters:
consumer - the object that receives each derived word form
checkCanceled - an object that is called periodically, allowing the traversal and generation to be interrupted by throwing an exception
-
compress
public EntrySuggestion compress(java.util.List<java.lang.String> words, java.util.Set<java.lang.String> forbidden, java.lang.Runnable checkCanceled)
Given a list of words, try to produce a smaller set of dictionary entries (with some flags) that would generate these words. This is equivalent to "munch" from the "hunspell-tools" package. The algorithm tries to minimize the number of dictionary entries to add or change, the number of flags involved, and the number of additionally generated words that were not requested. All the mentioned words are in the dictionary format and case: no ICONV/OCONV/IGNORE conversions are applied.
Parameters:
words - the list of words to generate
forbidden - the set of words to avoid generating
checkCanceled - an object that is called periodically, allowing the generation to be interrupted by throwing an exception
Returns:
the information about suggested dictionary entries and overgenerated words, or null if the algorithm couldn't generate anything
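The role of the words and forbidden parameters can be sketched with a toy greedy "munch". This is not Lucene's algorithm: the MunchSketch class, its SUFFIXES table (each flag simply appends a fixed set of suffixes), and the "stem/flags" string output are all assumptions made for illustration, and the sketch only approximates the goal of fewer entries without trying to minimize flags or overgenerated words.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: a toy greedy "munch", not Lucene's algorithm.
class MunchSketch {

  // Hypothetical affix table: each flag appends every listed suffix to the stem.
  static final Map<Character, List<String>> SUFFIXES =
      new TreeMap<>(Map.of('A', List.of("ed", "ing"), 'S', List.of("s")));

  // Repeatedly picks the remaining word that, used as a stem, covers the most remaining words.
  // A flag is usable for a stem if it yields at least one remaining word and no forbidden word.
  static List<String> compress(List<String> words, Set<String> forbidden) {
    Set<String> remaining = new LinkedHashSet<>(words);
    List<String> entries = new ArrayList<>();
    while (!remaining.isEmpty()) {
      String bestEntry = null;
      Set<String> bestCovered = Set.of();
      for (String stem : remaining) {
        StringBuilder flags = new StringBuilder();
        Set<String> covered = new LinkedHashSet<>();
        covered.add(stem);
        for (Map.Entry<Character, List<String>> e : SUFFIXES.entrySet()) {
          List<String> forms = new ArrayList<>();
          for (String suffix : e.getValue()) {
            forms.add(stem + suffix);
          }
          if (!Collections.disjoint(forms, remaining) && Collections.disjoint(forms, forbidden)) {
            flags.append(e.getKey());
            covered.addAll(forms);
          }
        }
        covered.retainAll(remaining); // ignore overgenerated forms when scoring
        if (covered.size() > bestCovered.size()) {
          bestCovered = covered;
          bestEntry = flags.length() == 0 ? stem : stem + "/" + flags;
        }
      }
      entries.add(bestEntry);
      remaining.removeAll(bestCovered);
    }
    return entries;
  }

  public static void main(String[] args) {
    List<String> words = List.of("walk", "walked", "walks");
    System.out.println(compress(words, Set.of()));
    System.out.println(compress(words, Set.of("walking")));
  }
}
```

In the sketch, forbidding "walking" rules out flag A for the stem "walk", because that flag would also generate the forbidden form; "walked" then has to be listed as its own entry. The real method returns an EntrySuggestion that also reports overgenerated words.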
-
isCompatibleWithPreviousAffixes
private boolean isCompatibleWithPreviousAffixes(AffixedWord stem, AffixKind kind, char flag)
-
-