Class Dictionary
- java.lang.Object
-
- org.apache.lucene.analysis.hunspell.Dictionary
-
public class Dictionary extends java.lang.Object
In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary.
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description)
(package private) static class
Dictionary.Breaks
Possible word breaks according to BREAK directives.
private static class
Dictionary.DefaultAsUtf8FlagParsingStrategy
Used to read flags as UTF-8 even if the rest of the file is in the default (8-bit) encoding.
private static class
Dictionary.DoubleASCIIFlagParsingStrategy
Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded as two ASCII characters whose codes must be combined into a single character.
(package private) static class
Dictionary.FlagParsingStrategy
Abstraction of the process of parsing flags taken from the affix and dic files.
private static class
Dictionary.NumFlagParsingStrategy
Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded in its numerical form.
private static class
Dictionary.SimpleFlagParsingStrategy
Simple implementation of Dictionary.FlagParsingStrategy that treats the chars in each String as individual flags.
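The three flag encodings described above can be illustrated with a small standalone sketch. It does not use the Lucene classes: the class and method names are hypothetical, and the way two ASCII codes are packed into one char (first code in the high byte, second in the low byte) and the comma-separated form of numeric flags are assumptions made for illustration.

```java
/** Hypothetical stand-ins for the three flag encodings; not Lucene's own classes. */
final class FlagEncodingSketch {

  /** Simple form: every char of the raw string is one flag. */
  static char[] parseSimple(String raw) {
    return raw.toCharArray();
  }

  /** Two-ASCII form: each pair of chars is packed into one char (assumed: high byte, low byte). */
  static char[] parseDoubleAscii(String raw) {
    if (raw.length() % 2 != 0) {
      throw new IllegalArgumentException("Expected an even number of characters: " + raw);
    }
    char[] flags = new char[raw.length() / 2];
    for (int i = 0; i < flags.length; i++) {
      char first = raw.charAt(2 * i);
      char second = raw.charAt(2 * i + 1);
      flags[i] = (char) ((first << 8) | second);
    }
    return flags;
  }

  /** Numeric form: comma-separated decimal numbers, one flag per number (assumed separator). */
  static char[] parseNumeric(String raw) {
    String[] parts = raw.split(",");
    char[] flags = new char[parts.length];
    for (int i = 0; i < parts.length; i++) {
      int value = Integer.parseInt(parts[i].trim());
      if (value < 0 || value > Character.MAX_VALUE) {
        throw new IllegalArgumentException("Flag out of char range: " + value);
      }
      flags[i] = (char) value;
    }
    return flags;
  }
}
```

In all three cases the result is a char[] of flags, which matches the char-typed flag fields listed in the field summary.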
-
Field Summary
Fields (Modifier and Type, Field, Description)
(package private) static int
AFFIX_APPEND
private static int
AFFIX_CONDITION
(package private) static int
AFFIX_FLAG
(package private) static int
AFFIX_STRIP_ORD
(package private) char[]
affixData
private int
aliasCount
private java.lang.String[]
aliases
private boolean
alternateCasing
private static byte[]
BOM_UTF8
(package private) Dictionary.Breaks
breaks
(package private) static java.util.Map<java.lang.String,java.lang.String>
CHARSET_ALIASES
(package private) boolean
checkCompoundCase
(package private) boolean
checkCompoundDup
(package private) java.util.List<CheckCompoundPattern>
checkCompoundPatterns
(package private) boolean
checkCompoundRep
(package private) boolean
checkCompoundTriple
(package private) boolean
checkSharpS
(package private) char
circumfix
(package private) boolean
complexPrefixes
(package private) char
compoundBegin
(package private) char
compoundEnd
(package private) char
compoundFlag
(package private) char
compoundForbid
(package private) int
compoundMax
(package private) char
compoundMiddle
(package private) int
compoundMin
(package private) char
compoundPermit
(package private) CompoundRule[]
compoundRules
private int
currentAffix
(package private) java.nio.charset.CharsetDecoder
decoder
(package private) static java.nio.charset.Charset
DEFAULT_CHARSET
private static int
DEFAULT_FLAGS
(package private) boolean
enableSplitSuggestions
private static char
FLAG_SEPARATOR
(package private) static char
FLAG_UNSET
(package private) FlagEnumerator.Lookup
flagLookup
The list of unique flagsets (wordforms).
(package private) Dictionary.FlagParsingStrategy
flagParsingStrategy
(package private) char
forbiddenword
(package private) char
forceUCase
(package private) boolean
fullStrip
(package private) boolean
hasCustomMorphData
We set this during sorting, so we know to add an extra int (index in morphData) to FST output.
(package private) static char
HIDDEN_FLAG
(package private) ConvTable
iconv
private char[]
ignore
(package private) boolean
ignoreCase
(package private) char
keepcase
(package private) java.lang.String
language
(package private) java.util.List<java.util.List<java.lang.String>>
mapTable
(package private) static int
MAX_PROLOGUE_SCAN_WINDOW
(package private) int
maxDiff
(package private) int
maxNGramSuggestions
private static char
MORPH_SEPARATOR
private int
morphAliasCount
private java.lang.String[]
morphAliases
(package private) java.util.List<java.lang.String>
morphData
(package private) char
needaffix
(package private) java.lang.String[]
neighborKeyGroups
(package private) static char[]
NOFLAGS
(package private) char
noSuggest
(package private) ConvTable
oconv
(package private) char
onlyincompound
(package private) boolean
onlyMaxDiff
(package private) java.util.ArrayList<AffixCondition>
patterns
All condition checks used by prefixes and suffixes.
(package private) FST<IntsRef>
prefixes
(package private) java.util.List<RepEntry>
repTable
private char[]
secondStagePrefixFlags
All flags used in affix continuation classes.
private char[]
secondStageSuffixFlags
All flags used in affix continuation classes.
(package private) boolean
simplifiedTriple
(package private) char[]
stripData
(package private) int[]
stripOffsets
(package private) char
subStandard
(package private) FST<IntsRef>
suffixes
(package private) java.lang.String
tryChars
(package private) java.lang.String
wordChars
(package private) WordStorage
words
The entries in the .dic file, mapping to their set of flags
-
Constructor Summary
Constructors (Constructor, Description)
Dictionary(java.io.InputStream affix, java.util.List<java.io.InputStream> dictionaries, boolean ignoreCase, SortingStrategy sortingStrategy)
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.io.InputStream dictionary)
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.util.List<java.io.InputStream> dictionaries, boolean ignoreCase)
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
-
Method Summary
Methods (Modifier and Type, Method, Description)
private void
addHiddenCapitalizedWord(java.lang.StringBuilder reuse, SortingStrategy.EntryAccumulator acc, java.lang.String word, java.lang.String afterSep)
private int
addMorphFields(java.util.Map<java.lang.String,java.lang.Integer> indices, java.lang.String morphFields)
private void
addPhoneticRepEntries(java.lang.String word, java.lang.String ph)
(package private) char
affixData(int affixIndex, int offset)
private FST<IntsRef>
affixFST(java.util.TreeMap<java.lang.String,IntArrayList> affixes)
(package private) char[]
allNonSuggestibleFlags()
(package private) char
caseFold(char c)
Folds a single character (according to LANG if present).
private void
checkCriticalDirectiveSame(java.lang.String directive, java.io.LineNumberReader reader, java.lang.Object expected, java.lang.Object actual)
(package private) java.lang.CharSequence
cleanInput(java.lang.CharSequence input, java.lang.StringBuilder reuse)
(package private) DictEntry
dictEntry(java.lang.String root, int flagId, int morphDataId)
(package private) static java.lang.String
extractLanguageCode(java.lang.String isoCode)
private java.lang.String
firstArgument(java.io.LineNumberReader reader, java.lang.String line)
(package private) int
formStep()
(package private) int
getAffixCondition(int affix)
private java.lang.String
getAliasValue(int id)
private java.nio.charset.CharsetDecoder
getDecoder(java.lang.String encoding)
Retrieves the CharsetDecoder for the given encoding.
(package private) static java.nio.file.Path
getDefaultTempDir()
Returns the default temporary directory pointed to by java.io.tmpdir.
(package private) static Dictionary.FlagParsingStrategy
getFlagParsingStrategy(java.lang.String flagLine, java.nio.charset.Charset charset)
Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file.
boolean
getIgnoreCase()
Returns true if this dictionary was constructed with the ignoreCase option.
(package private) boolean
hasFlag(int entryId, char flag)
(package private) boolean
hasFlag(IntsRef forms, char flag)
protected double
hashFactor()
The factor determining the size of the internal hash table used for storing the entries.
(package private) boolean
hasLanguage(java.lang.String... langCodes)
(package private) static int
indexOfSpaceOrTab(java.lang.String text, int start)
(package private) boolean
isCrossProduct(int affix)
(package private) boolean
isDotICaseChangeDisallowed(char[] word)
(package private) boolean
isFlagAppendedByAffix(int affixId, char flag)
(package private) boolean
isSecondStagePrefix(char flag)
(package private) boolean
isSecondStageSuffix(char flag)
private IntsRef
lookup(FST<IntsRef> fst, char[] word)
DictEntries
lookupEntries(java.lang.String root)
(package private) IntsRef
lookupPrefix(char[] word)
(package private) IntsRef
lookupSuffix(char[] word)
(package private) IntsRef
lookupWord(char[] word, int offset, int length)
Looks up Hunspell word forms from the dictionary.
private static boolean
maybeConsume(java.io.BufferedInputStream stream, byte[] bytes)
Consume the provided byte sequence in full, if present.
(package private) boolean
mayNeedInputCleaning()
private void
mergeDictionaries(java.util.List<java.io.InputStream> dictionaries, java.nio.charset.CharsetDecoder decoder, SortingStrategy.EntryAccumulator acc)
private static int
morphBoundary(java.lang.String line)
(package private) boolean
needsInputCleaning(java.lang.CharSequence input)
(package private) static IntsRef
nextArc(FST<IntsRef> fst, FST.Arc<IntsRef> arc, FST.BytesReader reader, IntsRef output, int ch)
private void
parseAffix(java.util.TreeMap<java.lang.String,IntArrayList> affixes, CharHashSet secondStageFlags, java.lang.String header, java.io.LineNumberReader reader, AffixKind kind, java.util.Map<java.lang.String,java.lang.Integer> seenPatterns, java.util.Map<java.lang.String,java.lang.Integer> seenStrips, FlagEnumerator flags)
Parses a specific affix rule, putting the result into the provided affix map.
private void
parseAlias(java.lang.String line)
private Dictionary.Breaks
parseBreaks(java.io.LineNumberReader reader, java.lang.String line)
private CompoundRule[]
parseCompoundRules(java.io.LineNumberReader reader, int num)
private ConvTable
parseConversions(java.io.LineNumberReader reader, int num)
private java.util.List<java.lang.String>
parseMapEntry(java.io.LineNumberReader reader, java.lang.String line)
private void
parseMorphAlias(java.lang.String line)
private int
parseNum(java.io.LineNumberReader reader, java.lang.String line)
private void
readAffixFile(java.io.InputStream affixStream, java.nio.charset.CharsetDecoder decoder, FlagEnumerator flags)
Reads the affix file through the provided InputStream, building up the prefix and suffix maps.
private void
readConfig(java.io.InputStream stream, java.nio.charset.Charset streamCharset)
Parses the encoding and flag format specified in the provided InputStream.
private java.util.List<java.lang.String>
readMorphFields(java.lang.String word, java.lang.String unparsed)
private WordStorage
readSortedDictionaries(FlagEnumerator flags, SortingStrategy.EntrySupplier sorted)
private static java.nio.charset.CharsetDecoder
replacingDecoder(java.nio.charset.Charset charset)
private static boolean
shouldSkipEscapedChar(char ch)
private java.lang.String
singleArgument(java.io.LineNumberReader reader, java.lang.String line)
private java.lang.String[]
splitBySpace(java.io.LineNumberReader reader, java.lang.String line, int expectedParts)
private java.lang.String[]
splitBySpace(java.io.LineNumberReader reader, java.lang.String line, int minParts, int maxParts)
private java.util.List<java.lang.String>
splitMorphData(java.lang.String morphData)
protected boolean
tolerateAffixRuleCountMismatches()
Whether incorrect PFX/SFX rule counts should be silently ignored.
protected boolean
tolerateDuplicateConversionMappings()
Whether duplicate ICONV/OCONV lines should be silently ignored.
(package private) java.lang.String
toLowerCase(java.lang.String word)
(package private) static char[]
toSortedCharArray(CharHashSet set)
(package private) java.lang.String
toTitleCase(java.lang.String word)
private java.lang.String
unescapeEntry(java.lang.String entry)
private void
writeNormalizedWordEntry(java.lang.StringBuilder reuse, java.lang.String line, SortingStrategy.EntryAccumulator acc)
-
-
-
Field Detail
-
MAX_PROLOGUE_SCAN_WINDOW
static final int MAX_PROLOGUE_SCAN_WINDOW
- See Also:
- Constant Field Values
-
NOFLAGS
static final char[] NOFLAGS
-
FLAG_UNSET
static final char FLAG_UNSET
- See Also:
- Constant Field Values
-
DEFAULT_FLAGS
private static final int DEFAULT_FLAGS
- See Also:
- Constant Field Values
-
HIDDEN_FLAG
static final char HIDDEN_FLAG
- See Also:
- Constant Field Values
-
DEFAULT_CHARSET
static final java.nio.charset.Charset DEFAULT_CHARSET
-
decoder
java.nio.charset.CharsetDecoder decoder
-
breaks
Dictionary.Breaks breaks
-
patterns
java.util.ArrayList<AffixCondition> patterns
All condition checks used by prefixes and suffixes. These are typically reused across many affix stripping rules, so they are deduplicated to save RAM.
-
words
WordStorage words
The entries in the .dic file, mapping to their set of flags
-
flagLookup
final FlagEnumerator.Lookup flagLookup
The list of unique flagsets (wordforms). Theoretically huge, but practically small (for Polish this is 756); otherwise humans wouldn't be able to deal with it either.
-
stripData
char[] stripData
-
stripOffsets
int[] stripOffsets
-
wordChars
java.lang.String wordChars
-
affixData
char[] affixData
-
currentAffix
private int currentAffix
-
AFFIX_FLAG
static final int AFFIX_FLAG
- See Also:
- Constant Field Values
-
AFFIX_STRIP_ORD
static final int AFFIX_STRIP_ORD
- See Also:
- Constant Field Values
-
AFFIX_CONDITION
private static final int AFFIX_CONDITION
- See Also:
- Constant Field Values
-
AFFIX_APPEND
static final int AFFIX_APPEND
- See Also:
- Constant Field Values
-
flagParsingStrategy
Dictionary.FlagParsingStrategy flagParsingStrategy
-
aliases
private java.lang.String[] aliases
-
aliasCount
private int aliasCount
-
morphAliases
private java.lang.String[] morphAliases
-
morphAliasCount
private int morphAliasCount
-
morphData
final java.util.List<java.lang.String> morphData
-
hasCustomMorphData
boolean hasCustomMorphData
We set this during sorting, so we know to add an extra int (index in morphData) to FST output.
-
ignoreCase
boolean ignoreCase
-
checkSharpS
boolean checkSharpS
-
complexPrefixes
boolean complexPrefixes
-
secondStagePrefixFlags
private char[] secondStagePrefixFlags
All flags used in affix continuation classes. If an outer affix's flag isn't here, there's no need to do 2-level affix stripping with it.
-
secondStageSuffixFlags
private char[] secondStageSuffixFlags
All flags used in affix continuation classes. If an outer affix's flag isn't here, there's no need to do 2-level affix stripping with it.
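The membership check implied above ("if an outer affix's flag isn't here, skip 2-level stripping") can be sketched with a sorted char array and a binary search. This is only an illustration with a hypothetical class name; the assumption that the flags are kept sorted is suggested by the toSortedCharArray helper but is not confirmed by this page.

```java
import java.util.Arrays;

/** Hypothetical holder for the flags that appear in affix continuation classes. */
final class SecondStageFlagsSketch {
  private final char[] sortedFlags;

  SecondStageFlagsSketch(char[] continuationFlags) {
    // Copy and sort once so that every later lookup is a binary search.
    char[] copy = continuationFlags.clone();
    Arrays.sort(copy);
    this.sortedFlags = copy;
  }

  /** Returns true if a second stripping pass could be needed for an affix with this flag. */
  boolean isSecondStage(char flag) {
    return Arrays.binarySearch(sortedFlags, flag) >= 0;
  }
}
```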
-
circumfix
char circumfix
-
keepcase
char keepcase
-
forceUCase
char forceUCase
-
needaffix
char needaffix
-
forbiddenword
char forbiddenword
-
onlyincompound
char onlyincompound
-
compoundBegin
char compoundBegin
-
compoundMiddle
char compoundMiddle
-
compoundEnd
char compoundEnd
-
compoundFlag
char compoundFlag
-
compoundPermit
char compoundPermit
-
compoundForbid
char compoundForbid
-
checkCompoundCase
boolean checkCompoundCase
-
checkCompoundDup
boolean checkCompoundDup
-
checkCompoundRep
boolean checkCompoundRep
-
checkCompoundTriple
boolean checkCompoundTriple
-
simplifiedTriple
boolean simplifiedTriple
-
compoundMin
int compoundMin
-
compoundMax
int compoundMax
-
compoundRules
CompoundRule[] compoundRules
-
checkCompoundPatterns
java.util.List<CheckCompoundPattern> checkCompoundPatterns
-
ignore
private char[] ignore
-
tryChars
java.lang.String tryChars
-
neighborKeyGroups
java.lang.String[] neighborKeyGroups
-
enableSplitSuggestions
boolean enableSplitSuggestions
-
repTable
java.util.List<RepEntry> repTable
-
mapTable
java.util.List<java.util.List<java.lang.String>> mapTable
-
maxDiff
int maxDiff
-
maxNGramSuggestions
int maxNGramSuggestions
-
onlyMaxDiff
boolean onlyMaxDiff
-
noSuggest
char noSuggest
-
subStandard
char subStandard
-
iconv
ConvTable iconv
-
oconv
ConvTable oconv
-
fullStrip
boolean fullStrip
-
language
java.lang.String language
-
alternateCasing
private boolean alternateCasing
-
BOM_UTF8
private static final byte[] BOM_UTF8
-
CHARSET_ALIASES
static final java.util.Map<java.lang.String,java.lang.String> CHARSET_ALIASES
-
FLAG_SEPARATOR
private static final char FLAG_SEPARATOR
- See Also:
- Constant Field Values
-
MORPH_SEPARATOR
private static final char MORPH_SEPARATOR
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
Dictionary
public Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.io.InputStream dictionary) throws java.io.IOException, java.text.ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionary - InputStream for reading the hunspell dictionary file (won't be closed).
- Throws:
java.io.IOException - Can be thrown while reading from the InputStreams
java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
-
Dictionary
public Dictionary(Directory tempDir, java.lang.String tempFileNamePrefix, java.io.InputStream affix, java.util.List<java.io.InputStream> dictionaries, boolean ignoreCase) throws java.io.IOException, java.text.ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionaries - InputStreams for reading the hunspell dictionary files (won't be closed).
- Throws:
java.io.IOException - Can be thrown while reading from the InputStreams
java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
-
Dictionary
public Dictionary(java.io.InputStream affix, java.util.List<java.io.InputStream> dictionaries, boolean ignoreCase, SortingStrategy sortingStrategy) throws java.io.IOException, java.text.ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionaries - InputStreams for reading the hunspell dictionary files (won't be closed).
sortingStrategy - the entry strategy for the dictionary loading
- Throws:
java.io.IOException - Can be thrown while reading from the InputStreams
java.text.ParseException - Can be thrown if the content of the files does not meet expected formats
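Because the constructors never close the streams they are given, the caller owns them. The following sketch shows that ownership pattern with try-with-resources. It deliberately avoids the Lucene classes: countBytes is a hypothetical stand-in for a constructor that reads the streams without closing them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

final class StreamOwnershipSketch {

  /** Hypothetical loader: reads every stream fully but never closes any of them. */
  static int countBytes(InputStream affix, List<InputStream> dictionaries) throws IOException {
    int total = affix.readAllBytes().length;
    for (InputStream dictionary : dictionaries) {
      total += dictionary.readAllBytes().length;
    }
    return total;
  }

  /** Caller side: try-with-resources closes both streams once loading has finished. */
  static int load(byte[] affData, byte[] dicData) throws IOException {
    try (InputStream affix = new ByteArrayInputStream(affData);
        InputStream dictionary = new ByteArrayInputStream(dicData)) {
      return countBytes(affix, List.of(dictionary));
    }
  }
}
```

With real files, the same shape applies: open the .aff and .dic streams in the try header and construct the Dictionary inside the block.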
-
-
Method Detail
-
formStep
int formStep()
-
lookupWord
IntsRef lookupWord(char[] word, int offset, int length)
Looks up Hunspell word forms from the dictionary
-
lookupPrefix
IntsRef lookupPrefix(char[] word)
-
lookupSuffix
IntsRef lookupSuffix(char[] word)
-
nextArc
static IntsRef nextArc(FST<IntsRef> fst, FST.Arc<IntsRef> arc, FST.BytesReader reader, IntsRef output, int ch)
-
readAffixFile
private void readAffixFile(java.io.InputStream affixStream, java.nio.charset.CharsetDecoder decoder, FlagEnumerator flags) throws java.io.IOException, java.text.ParseException
Reads the affix file through the provided InputStream, building up the prefix and suffix maps.
- Parameters:
affixStream - InputStream to read the content of the affix file from
decoder - CharsetDecoder to decode the content of the file
- Throws:
java.io.IOException - Can be thrown while reading from the InputStream
java.text.ParseException
-
checkCriticalDirectiveSame
private void checkCriticalDirectiveSame(java.lang.String directive, java.io.LineNumberReader reader, java.lang.Object expected, java.lang.Object actual) throws java.text.ParseException
- Throws:
java.text.ParseException
-
parseMapEntry
private java.util.List<java.lang.String> parseMapEntry(java.io.LineNumberReader reader, java.lang.String line) throws java.text.ParseException
- Throws:
java.text.ParseException
-
hasLanguage
boolean hasLanguage(java.lang.String... langCodes)
-
lookupEntries
public DictEntries lookupEntries(java.lang.String root)
- Parameters:
root - a string to look up in the dictionary. No case conversion or affix removal is performed. To get the possible roots of any word, you may call Hunspell.getRoots(String)
- Returns:
the dictionary entries for the given root, or null if there's none
-
dictEntry
DictEntry dictEntry(java.lang.String root, int flagId, int morphDataId)
-
extractLanguageCode
static java.lang.String extractLanguageCode(java.lang.String isoCode)
-
parseNum
private int parseNum(java.io.LineNumberReader reader, java.lang.String line) throws java.text.ParseException
- Throws:
java.text.ParseException
-
singleArgument
private java.lang.String singleArgument(java.io.LineNumberReader reader, java.lang.String line) throws java.text.ParseException
- Throws:
java.text.ParseException
-
firstArgument
private java.lang.String firstArgument(java.io.LineNumberReader reader, java.lang.String line) throws java.text.ParseException
- Throws:
java.text.ParseException
-
splitBySpace
private java.lang.String[] splitBySpace(java.io.LineNumberReader reader, java.lang.String line, int expectedParts) throws java.text.ParseException
- Throws:
java.text.ParseException
-
splitBySpace
private java.lang.String[] splitBySpace(java.io.LineNumberReader reader, java.lang.String line, int minParts, int maxParts) throws java.text.ParseException
- Throws:
java.text.ParseException
-
parseCompoundRules
private CompoundRule[] parseCompoundRules(java.io.LineNumberReader reader, int num) throws java.io.IOException, java.text.ParseException
- Throws:
java.io.IOException
java.text.ParseException
-
parseBreaks
private Dictionary.Breaks parseBreaks(java.io.LineNumberReader reader, java.lang.String line) throws java.io.IOException, java.text.ParseException
- Throws:
java.io.IOException
java.text.ParseException
-
affixFST
private FST<IntsRef> affixFST(java.util.TreeMap<java.lang.String,IntArrayList> affixes) throws java.io.IOException
- Throws:
java.io.IOException
-
parseAffix
private void parseAffix(java.util.TreeMap<java.lang.String,IntArrayList> affixes, CharHashSet secondStageFlags, java.lang.String header, java.io.LineNumberReader reader, AffixKind kind, java.util.Map<java.lang.String,java.lang.Integer> seenPatterns, java.util.Map<java.lang.String,java.lang.Integer> seenStrips, FlagEnumerator flags) throws java.io.IOException, java.text.ParseException
Parses a specific affix rule, putting the result into the provided affix map.
- Parameters:
affixes - Map where the result of the parsing will be put
header - Header line of the affix rule
reader - BufferedReader to read the content of the rule from
seenPatterns - map from condition -> index of patterns, for deduplication.
- Throws:
java.io.IOException - Can be thrown while reading the rule
java.text.ParseException
-
affixData
char affixData(int affixIndex, int offset)
-
isCrossProduct
boolean isCrossProduct(int affix)
-
getAffixCondition
int getAffixCondition(int affix)
-
parseConversions
private ConvTable parseConversions(java.io.LineNumberReader reader, int num) throws java.io.IOException, java.text.ParseException
- Throws:
java.io.IOException
java.text.ParseException
-
readConfig
private void readConfig(java.io.InputStream stream, java.nio.charset.Charset streamCharset) throws java.io.IOException, java.text.ParseException
Parses the encoding and flag format specified in the provided InputStream.
- Throws:
java.io.IOException
java.text.ParseException
-
maybeConsume
private static boolean maybeConsume(java.io.BufferedInputStream stream, byte[] bytes) throws java.io.IOException
Consume the provided byte sequence in full, if present. Otherwise leave the input stream intact.
- Returns:
true if the sequence matched and has been consumed.
- Throws:
java.io.IOException
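The consume-or-leave-intact contract can be sketched with the mark/reset support of BufferedInputStream. This is an illustrative reimplementation under a hypothetical class name, not the library's code; the BOM_UTF8 constant here is assumed to hold the standard UTF-8 byte order mark (EF BB BF).

```java
import java.io.BufferedInputStream;
import java.io.IOException;

final class BomSketch {
  /** The standard UTF-8 byte order mark. */
  static final byte[] BOM_UTF8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

  /** Consumes {@code bytes} if the stream starts with them; otherwise rewinds the stream. */
  static boolean maybeConsume(BufferedInputStream stream, byte[] bytes) throws IOException {
    // Remember the current position; reading up to bytes.length bytes keeps the mark valid.
    stream.mark(bytes.length);
    for (byte expected : bytes) {
      int actual = stream.read();
      if (actual != (expected & 0xFF)) { // a mismatch, or -1 at end of stream
        stream.reset();
        return false;
      }
    }
    return true;
  }
}
```

After a false return the next read() yields the same byte it would have yielded before the call, so the caller can go on parsing as if nothing had been attempted.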
-
getDecoder
private java.nio.charset.CharsetDecoder getDecoder(java.lang.String encoding)
Retrieves the CharsetDecoder for the given encoding. Note: this isn't perfect, as ISCII-DEVANAGARI, MICROSOFT-CP1251, etc. are probably allowed as well.
- Parameters:
encoding - Encoding to retrieve the CharsetDecoder for
- Returns:
CharsetDecoder for the given encoding
-
replacingDecoder
private static java.nio.charset.CharsetDecoder replacingDecoder(java.nio.charset.Charset charset)
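This page gives no description for replacingDecoder. Judging by its name alone, it plausibly returns a decoder that substitutes a replacement character for undecodable input instead of failing. A minimal sketch of such a decoder with the standard java.nio.charset API follows; the class name, the decode helper, and the choice to also replace unmappable characters are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class ReplacingDecoderSketch {

  /** Builds a decoder that emits a replacement character for bad input (assumed behavior). */
  static CharsetDecoder replacingDecoder(Charset charset) {
    return charset
        .newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
  }

  /** Hypothetical helper: decodes a whole byte array with the lenient decoder. */
  static String decode(byte[] bytes, Charset charset) {
    try {
      return replacingDecoder(charset).decode(ByteBuffer.wrap(bytes)).toString();
    } catch (CharacterCodingException e) {
      // Not expected with REPLACE configured for both error kinds.
      throw new IllegalStateException(e);
    }
  }
}
```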
-
getFlagParsingStrategy
static Dictionary.FlagParsingStrategy getFlagParsingStrategy(java.lang.String flagLine, java.nio.charset.Charset charset)
Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file.
- Parameters:
flagLine - Line containing the flag information
- Returns:
FlagParsingStrategy that handles parsing flags in the way specified in the FLAG definition
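In the Hunspell affix format, the FLAG directive takes one of the values long, num, or UTF-8, and single-character flags are used when the directive is absent. The sketch below maps a FLAG line to a mode under that reading. It is not the library's method: the enum, the class name, and the null-means-default convention are hypothetical.

```java
final class FlagModeSketch {

  /** Hypothetical modes corresponding to the flag encodings. */
  enum Mode { SIMPLE, LONG, NUM, UTF8 }

  /** Picks a mode from a line such as "FLAG long"; null stands for "no FLAG directive". */
  static Mode fromFlagLine(String flagLine) {
    if (flagLine == null) {
      return Mode.SIMPLE;
    }
    String[] parts = flagLine.trim().split("\\s+");
    if (parts.length != 2 || !parts[0].equals("FLAG")) {
      throw new IllegalArgumentException("Illegal FLAG specification: " + flagLine);
    }
    switch (parts[1]) {
      case "long":
        return Mode.LONG;
      case "num":
        return Mode.NUM;
      case "UTF-8":
        return Mode.UTF8;
      default:
        throw new IllegalArgumentException("Unknown flag type: " + parts[1]);
    }
  }
}
```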
-
unescapeEntry
private java.lang.String unescapeEntry(java.lang.String entry)
-
shouldSkipEscapedChar
private static boolean shouldSkipEscapedChar(char ch)
-
morphBoundary
private static int morphBoundary(java.lang.String line)
-
indexOfSpaceOrTab
static int indexOfSpaceOrTab(java.lang.String text, int start)
-
mergeDictionaries
private void mergeDictionaries(java.util.List<java.io.InputStream> dictionaries, java.nio.charset.CharsetDecoder decoder, SortingStrategy.EntryAccumulator acc) throws java.io.IOException
- Throws:
java.io.IOException
-
writeNormalizedWordEntry
private void writeNormalizedWordEntry(java.lang.StringBuilder reuse, java.lang.String line, SortingStrategy.EntryAccumulator acc) throws java.io.IOException
- Throws:
java.io.IOException
-
addHiddenCapitalizedWord
private void addHiddenCapitalizedWord(java.lang.StringBuilder reuse, SortingStrategy.EntryAccumulator acc, java.lang.String word, java.lang.String afterSep) throws java.io.IOException
- Throws:
java.io.IOException
-
toLowerCase
java.lang.String toLowerCase(java.lang.String word)
-
toTitleCase
java.lang.String toTitleCase(java.lang.String word)
-
readSortedDictionaries
private WordStorage readSortedDictionaries(FlagEnumerator flags, SortingStrategy.EntrySupplier sorted) throws java.io.IOException
- Throws:
java.io.IOException
-
hashFactor
protected double hashFactor()
The factor determining the size of the internal hash table used for storing the entries. The table size is entry_count * hashFactor. The default factor is 1.0. If there are too many hash collisions, the factor can be increased, resulting in faster access, but more memory usage.
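Since hashFactor() is protected, it is presumably meant to be overridden in a subclass. The following standalone sketch shows the sizing arithmetic and the override pattern without depending on Lucene. Both class names are hypothetical, and the truncation to int and the minimum size of 1 are assumptions.

```java
/** Hypothetical base class that sizes its table as entry_count * hashFactor. */
class HashTableSizingSketch {
  /** Default factor, as documented: 1.0. */
  protected double hashFactor() {
    return 1.0;
  }

  /** Table size for a given number of entries (rounding and the minimum are assumed). */
  int tableSize(int entryCount) {
    return Math.max(1, (int) (entryCount * hashFactor()));
  }
}

/** Trades memory for fewer collisions by enlarging the table by 50%. */
class SparserTableSketch extends HashTableSizingSketch {
  @Override
  protected double hashFactor() {
    return 1.5;
  }
}
```

For 1000 entries the default gives 1000 slots, while the subclass gives 1500.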
-
tolerateAffixRuleCountMismatches
protected boolean tolerateAffixRuleCountMismatches()
Whether incorrect PFX/SFX rule counts should be silently ignored. False by default: a ParseException is thrown.
-
tolerateDuplicateConversionMappings
protected boolean tolerateDuplicateConversionMappings()
Whether duplicate ICONV/OCONV lines should be silently ignored. False by default: an IllegalStateException is thrown.
-
allNonSuggestibleFlags
char[] allNonSuggestibleFlags()
-
readMorphFields
private java.util.List<java.lang.String> readMorphFields(java.lang.String word, java.lang.String unparsed)
-
addMorphFields
private int addMorphFields(java.util.Map<java.lang.String,java.lang.Integer> indices, java.lang.String morphFields)
-
addPhoneticRepEntries
private void addPhoneticRepEntries(java.lang.String word, java.lang.String ph)
-
isDotICaseChangeDisallowed
boolean isDotICaseChangeDisallowed(char[] word)
-
parseAlias
private void parseAlias(java.lang.String line)
-
getAliasValue
private java.lang.String getAliasValue(int id)
-
parseMorphAlias
private void parseMorphAlias(java.lang.String line)
-
splitMorphData
private java.util.List<java.lang.String> splitMorphData(java.lang.String morphData)
-
hasFlag
boolean hasFlag(IntsRef forms, char flag)
-
isFlagAppendedByAffix
boolean isFlagAppendedByAffix(int affixId, char flag)
-
hasFlag
boolean hasFlag(int entryId, char flag)
-
mayNeedInputCleaning
boolean mayNeedInputCleaning()
-
needsInputCleaning
boolean needsInputCleaning(java.lang.CharSequence input)
-
cleanInput
java.lang.CharSequence cleanInput(java.lang.CharSequence input, java.lang.StringBuilder reuse)
-
toSortedCharArray
static char[] toSortedCharArray(CharHashSet set)
-
isSecondStagePrefix
boolean isSecondStagePrefix(char flag)
-
isSecondStageSuffix
boolean isSecondStageSuffix(char flag)
-
caseFold
char caseFold(char c)
Folds a single character (according to LANG if present).
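One well-known reason case folding depends on language is the Turkic dotted and dotless i. The sketch below illustrates language-dependent folding under the assumption that Turkish (tr) and Azerbaijani (az) receive this special treatment. The class name, the explicit language parameter, and the exact set of languages are hypothetical and not taken from this page.

```java
final class CaseFoldSketch {

  /** Assumed set of languages that distinguish dotted and dotless i. */
  static boolean usesDotlessI(String language) {
    return "tr".equals(language) || "az".equals(language);
  }

  /** Folds one char to lower case, honoring the Turkic i rules when they apply. */
  static char caseFold(char c, String language) {
    if (usesDotlessI(language)) {
      if (c == 'I') {
        return '\u0131'; // LATIN SMALL LETTER DOTLESS I
      }
      if (c == '\u0130') { // LATIN CAPITAL LETTER I WITH DOT ABOVE
        return 'i';
      }
    }
    return Character.toLowerCase(c);
  }
}
```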
-
getIgnoreCase
public boolean getIgnoreCase()
Returns true if this dictionary was constructed with the ignoreCase option.
-
getDefaultTempDir
static java.nio.file.Path getDefaultTempDir() throws java.io.IOException
Returns the default temporary directory pointed to by java.io.tmpdir. If not accessible or not available, an IOException is thrown.
- Throws:
java.io.IOException
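The documented behavior (resolve java.io.tmpdir, fail with an IOException if it is unavailable or inaccessible) can be sketched with java.nio.file. This is an illustration under a hypothetical class name; in particular, treating "accessible" as "an existing, writable directory" is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TempDirSketch {

  /** Resolves the JVM temp directory or throws if it cannot be used. */
  static Path defaultTempDir() throws IOException {
    String property = System.getProperty("java.io.tmpdir");
    if (property == null) {
      throw new IOException("System property java.io.tmpdir is not set");
    }
    Path dir = Paths.get(property);
    // Assumed meaning of "accessible": the path is an existing directory we can write to.
    if (!Files.isDirectory(dir) || !Files.isWritable(dir)) {
      throw new IOException("Temp directory is not an accessible directory: " + dir);
    }
    return dir;
  }
}
```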
-
-