Package org.apache.lucene.analysis.query
Class QueryAutoStopWordAnalyzer
- java.lang.Object
-
- org.apache.lucene.analysis.Analyzer
-
- org.apache.lucene.analysis.AnalyzerWrapper
-
- org.apache.lucene.analysis.query.QueryAutoStopWordAnalyzer
-
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable
public final class QueryAutoStopWordAnalyzer extends AnalyzerWrapper
An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.
For very large indexes the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38 million document index in which one term occurred in around 50% of documents, causing TermQueries for that term to take 2 seconds.
- Since:
- 3.1
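The idea described above can be illustrated with a small, self-contained sketch. It does not use Lucene: the class name `CommonTermFilterSketch`, its two methods, and the sample document frequencies are all hypothetical, and the real analyzer reads frequencies from an IndexReader rather than from a map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the analyzer's behavior; not Lucene code.
final class CommonTermFilterSketch {

    // Collects every term whose document frequency is strictly greater than maxDocFreq.
    static Set<String> findStopWords(Map<String, Integer> docFreqByTerm, int maxDocFreq) {
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    // Drops stop words from the tokens produced for a query, keeping the original order.
    static List<String> filterQueryTokens(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for one field of a 1000-document index.
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("the", 950);
        docFreq.put("lucene", 40);
        docFreq.put("index", 120);

        Set<String> stopWords = findStopWords(docFreq, 400);
        System.out.println("stop words: " + stopWords);
        System.out.println("query tokens kept: "
                + filterQueryTokens(Arrays.asList("the", "lucene", "index"), stopWords));
    }
}
```

With these sample numbers only "the" exceeds the threshold, so it is the only token removed from the query.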
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.analysis.Analyzer
Analyzer.ReuseStrategy, Analyzer.TokenStreamComponents
-
-
Field Summary
Fields
static float defaultMaxDocFreqPercent
private Analyzer delegate
private java.util.Map<java.lang.String,java.util.Set<java.lang.String>> stopWordsPerField
-
Fields inherited from class org.apache.lucene.analysis.Analyzer
GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
-
-
Constructor Summary
Constructors
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, float maxPercentDocs)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, int maxDocFreq)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, java.util.Collection<java.lang.String> fields, float maxPercentDocs)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, java.util.Collection<java.lang.String> fields, int maxDocFreq)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
-
Method Summary
Methods
Term[] getStopWords()
Provides information on which stop words have been identified for all fields
java.lang.String[] getStopWords(java.lang.String fieldName)
Provides information on which stop words have been identified for a field
protected Analyzer getWrappedAnalyzer(java.lang.String fieldName)
Retrieves the wrapped Analyzer appropriate for analyzing the field with the given name
protected Analyzer.TokenStreamComponents wrapComponents(java.lang.String fieldName, Analyzer.TokenStreamComponents components)
Wraps / alters the given TokenStreamComponents, taken from the wrapped Analyzer, to form new components.
-
Methods inherited from class org.apache.lucene.analysis.AnalyzerWrapper
attributeFactory, createComponents, getOffsetGap, getPositionIncrementGap, initReader, initReaderForNormalization, normalize, wrapReader, wrapReaderForNormalization, wrapTokenStreamForNormalization
-
Methods inherited from class org.apache.lucene.analysis.Analyzer
close, getReuseStrategy, normalize, tokenStream, tokenStream
-
-
-
-
Field Detail
-
delegate
private final Analyzer delegate
-
stopWordsPerField
private final java.util.Map<java.lang.String,java.util.Set<java.lang.String>> stopWordsPerField
-
defaultMaxDocFreqPercent
public static final float defaultMaxDocFreqPercent
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader) throws java.io.IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
- Parameters:
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
- Throws:
java.io.IOException - Can be thrown while reading from the IndexReader
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, int maxDocFreq) throws java.io.IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq
- Parameters:
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
maxDocFreq - Document frequency that a term must exceed in order to be treated as a stopword
- Throws:
java.io.IOException - Can be thrown while reading from the IndexReader
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, float maxPercentDocs) throws java.io.IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
- Parameters:
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
maxPercentDocs - The maximum percentage (expressed as a fraction between 0.0 and 1.0) of index documents that may contain a term; a term found in a larger share of documents is considered a stop word
- Throws:
java.io.IOException - Can be thrown while reading from the IndexReader
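A percentage threshold has to be compared against per-term document counts at some point. The sketch below shows one plausible way to do that, under the assumption that the fraction is scaled by the number of documents in the index and truncated to an integer. The helper names, the range check, and the rounding choice are hypothetical and are not taken from the library.

```java
// Hypothetical helper; the names, validation, and truncating cast are assumptions of this sketch.
final class DocFreqThresholdSketch {

    // Converts a fraction of the index (0.0 to 1.0) into an absolute document-frequency limit.
    static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        // Written as a negated conjunction so that NaN is rejected as well.
        if (!(maxPercentDocs >= 0.0f && maxPercentDocs <= 1.0f)) {
            throw new IllegalArgumentException("maxPercentDocs must be between 0.0 and 1.0");
        }
        return (int) (numDocs * maxPercentDocs);
    }

    // A term counts as a stop word only when its document frequency is above the limit.
    static boolean isStopWord(int docFreq, int numDocs, float maxPercentDocs) {
        return docFreq > toMaxDocFreq(numDocs, maxPercentDocs);
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        System.out.println("limit: " + toMaxDocFreq(numDocs, 0.5f));
        System.out.println("docFreq 501 is stop word: " + isStopWord(501, numDocs, 0.5f));
        System.out.println("docFreq 500 is stop word: " + isStopWord(500, numDocs, 0.5f));
    }
}
```

Because the comparison is strict, a term that sits exactly at the limit is kept in this sketch, which matches the "greater than" wording of the constructor descriptions.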
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, java.util.Collection<java.lang.String> fields, float maxPercentDocs) throws java.io.IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
- Parameters:
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
fields - Selection of fields to calculate stopwords for
maxPercentDocs - The maximum percentage (expressed as a fraction between 0.0 and 1.0) of index documents that may contain a term; a term found in a larger share of documents is considered a stop word
- Throws:
java.io.IOException - Can be thrown while reading from the IndexReader
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(Analyzer delegate, IndexReader indexReader, java.util.Collection<java.lang.String> fields, int maxDocFreq) throws java.io.IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
- Parameters:
delegate - Analyzer whose TokenStream will be filtered
indexReader - IndexReader to identify the stopwords from
fields - Selection of fields to calculate stopwords for
maxDocFreq - Document frequency that a term must exceed in order to be treated as a stopword
- Throws:
java.io.IOException - Can be thrown while reading from the IndexReader
-
-
Method Detail
-
getWrappedAnalyzer
protected Analyzer getWrappedAnalyzer(java.lang.String fieldName)
Description copied from class: AnalyzerWrapper
Retrieves the wrapped Analyzer appropriate for analyzing the field with the given name
- Specified by:
getWrappedAnalyzer in class AnalyzerWrapper
- Parameters:
fieldName - Name of the field which is to be analyzed
- Returns:
- Analyzer for the field with the given name. Assumed to be non-null
-
wrapComponents
protected Analyzer.TokenStreamComponents wrapComponents(java.lang.String fieldName, Analyzer.TokenStreamComponents components)
Description copied from class: AnalyzerWrapper
Wraps / alters the given TokenStreamComponents, taken from the wrapped Analyzer, to form new components. It is through this method that new TokenFilters can be added by AnalyzerWrappers. By default, the given components are returned.
- Overrides:
wrapComponents in class AnalyzerWrapper
- Parameters:
fieldName - Name of the field which is to be analyzed
components - TokenStreamComponents taken from the wrapped Analyzer
- Returns:
- Wrapped / altered TokenStreamComponents.
-
getStopWords
public java.lang.String[] getStopWords(java.lang.String fieldName)
Provides information on which stop words have been identified for a field
- Parameters:
fieldName - The field for which the identified stop words will be returned
- Returns:
- the stop words identified for a field
-
getStopWords
public Term[] getStopWords()
Provides information on which stop words have been identified for all fields
- Returns:
- the stop words (as terms)
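The two getStopWords variants can be pictured as two views over a per-field map shaped like the stopWordsPerField field listed in Field Detail. The following is a minimal sketch of that reading, not the library's code: `StopWordLookupSketch` and `FieldTerm` are hypothetical names (`FieldTerm` merely stands in for a field/text term pair), and returning an empty array for a field with no recorded stop words is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the two lookups; not Lucene code.
final class StopWordLookupSketch {

    // Minimal (field, text) pair standing in for a term; hypothetical type.
    static final class FieldTerm {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public String toString() {
            return field + ":" + text;
        }
    }

    private final Map<String, Set<String>> stopWordsPerField;

    StopWordLookupSketch(Map<String, Set<String>> stopWordsPerField) {
        this.stopWordsPerField = stopWordsPerField;
    }

    // Per-field view: the stop words recorded for one field.
    // Assumption: an unknown field yields an empty array rather than null.
    String[] getStopWords(String fieldName) {
        Set<String> stopWords = stopWordsPerField.get(fieldName);
        return stopWords == null ? new String[0] : stopWords.toArray(new String[0]);
    }

    // All-fields view: every stop word paired with the field it belongs to.
    List<FieldTerm> getStopWords() {
        List<FieldTerm> all = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : stopWordsPerField.entrySet()) {
            for (String text : entry.getValue()) {
                all.add(new FieldTerm(entry.getKey(), text));
            }
        }
        return all;
    }
}
```

Pairing each word with its field in the all-fields view matters because the same word can be a stop word in one field and an ordinary term in another.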
-
-