Class ICUTokenizerFactory
- java.lang.Object
  - org.apache.lucene.analysis.AbstractAnalysisFactory
    - org.apache.lucene.analysis.TokenizerFactory
      - org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory

All Implemented Interfaces:
ResourceLoaderAware

public class ICUTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware
Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the DefaultICUTokenizerConfig.
To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument containing a comma-separated list of code:rulefile pairs. Each pair consists of a four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
               rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
  </analyzer>
</fieldType>
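To make the "rulefiles" format concrete, here is a minimal sketch of how such a value could be split into script-code and resource-path pairs using only the JDK. It is not the factory's actual parsing code: the class `RuleFilesArgSketch`, the method `parse`, and the validation rules (exactly four characters for the code, a non-empty path) are assumptions made for illustration, and the real factory keys its map by ICU script code rather than by string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of Lucene.
final class RuleFilesArgSketch {

    /** Parses "Code:path,Code:path" into an ordered map of script code to resource path. */
    static Map<String, String> parse(String rulefiles) {
        Map<String, String> byScript = new LinkedHashMap<>();
        for (String pair : rulefiles.split(",")) {
            int colon = pair.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in pair: " + pair);
            }
            String code = pair.substring(0, colon).trim();
            String path = pair.substring(colon + 1).trim();
            // Assumption: ISO 15924 codes are four letters, e.g. "Latn" or "Cyrl".
            if (code.length() != 4 || path.isEmpty()) {
                throw new IllegalArgumentException("Expected code:rulefile, got: " + pair);
            }
            byScript.put(code, path);
        }
        return byScript;
    }

    public static void main(String[] args) {
        System.out.println(parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"));
    }
}
```

A LinkedHashMap is used here only so that the pairs come back in the order they were written, which makes the result easy to compare against the attribute value.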
Since:
3.1

Field Summary
Modifier and Type                            Field            Description
private boolean                              cjkAsWords
private ICUTokenizerConfig                   config
private boolean                              myanmarAsWords
static java.lang.String                      NAME             SPI name
(package private) static java.lang.String    RULEFILES
private IntObjectHashMap<java.lang.String>   tailored

Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion

Constructor Summary
Constructor                                                                   Description
ICUTokenizerFactory()                                                         Default ctor for compatibility with SPI
ICUTokenizerFactory(java.util.Map<java.lang.String,java.lang.String> args)    Creates a new ICUTokenizerFactory

Method Summary
Modifier and Type                         Method                                                      Description
ICUTokenizer                              create(AttributeFactory factory)                            Creates a TokenStream of the specified input using the given AttributeFactory
void                                      inform(ResourceLoader loader)                               Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
private com.ibm.icu.text.BreakIterator    parseRules(java.lang.String filename, ResourceLoader loader)

Methods inherited from class org.apache.lucene.analysis.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers

Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory
defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames

Field Detail

NAME
public static final java.lang.String NAME
SPI name
See Also:
Constant Field Values

RULEFILES
static final java.lang.String RULEFILES
See Also:
Constant Field Values

tailored
private final IntObjectHashMap<java.lang.String> tailored

config
private ICUTokenizerConfig config

cjkAsWords
private final boolean cjkAsWords

myanmarAsWords
private final boolean myanmarAsWords

Method Detail

inform
public void inform(ResourceLoader loader) throws java.io.IOException
Description copied from interface: ResourceLoaderAware
Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
Specified by:
inform in interface ResourceLoaderAware
Throws:
java.io.IOException

parseRules
private com.ibm.icu.text.BreakIterator parseRules(java.lang.String filename, ResourceLoader loader) throws java.io.IOException
Throws:
java.io.IOException

create
public ICUTokenizer create(AttributeFactory factory)
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
Specified by:
create in class TokenizerFactory