Package org.apache.lucene.classification
Class CachingNaiveBayesClassifier
java.lang.Object
  org.apache.lucene.classification.SimpleNaiveBayesClassifier
    org.apache.lucene.classification.CachingNaiveBayesClassifier
All Implemented Interfaces:
Classifier<BytesRef>
public class CachingNaiveBayesClassifier extends SimpleNaiveBayesClassifier
A simplistic Lucene-based Naive Bayes classifier with a caching feature; see http://en.wikipedia.org/wiki/Naive_Bayes_classifier
This is NOT an online classifier.
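For orientation, the scoring rule behind a Naive Bayes text classifier can be sketched in plain Java. This is a minimal illustration under stated assumptions, not this class's implementation: the class name TinyNaiveBayes, its methods, and the add-one smoothing are all hypothetical choices made for the example, and class labels are plain strings rather than BytesRef.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, stdlib-only sketch of multinomial Naive Bayes scoring.
final class TinyNaiveBayes {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Records one labelled, already tokenized training document.
    void train(String label, String[] tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // log P(class) + sum of log P(token | class), with add-one smoothing (an assumption).
    // The label must have been seen during training.
    double logScore(String label, String[] tokens) {
        double score = Math.log((double) docsPerClass.get(label) / totalDocs);
        Map<String, Integer> counts = termCounts.get(label);
        int total = tokensPerClass.getOrDefault(label, 0);
        for (String token : tokens) {
            int count = counts.getOrDefault(token, 0);
            score += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    // Returns the trained label with the highest log score, or null if nothing was trained.
    String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = logScore(label, tokens);
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Working in log space keeps the product of many small probabilities from underflowing, which is why the scores are sums of logarithms.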
-
Field Summary
Fields
Modifier and Type    Field    Description
private java.util.ArrayList<BytesRef>    cclasses
private java.util.Map<BytesRef,java.lang.Double>    classTermFreq
private int    docsWithClassSize
private boolean    justCachedTerms
private java.util.Map<java.lang.String,java.util.Map<BytesRef,java.lang.Integer>>    termCClassHitCache
-
Fields inherited from class org.apache.lucene.classification.SimpleNaiveBayesClassifier
analyzer, classFieldName, indexReader, indexSearcher, query, textFieldNames
-
Constructor Summary
Constructors
Constructor    Description
CachingNaiveBayesClassifier(IndexReader indexReader, Analyzer analyzer, Query query, java.lang.String classFieldName, java.lang.String... textFieldNames)
Creates a new Naive Bayes classifier with internal caching.
-
Method Summary
Modifier and Type    Method    Description
protected java.util.List<ClassificationResult<BytesRef>>    assignClassNormalizedList(java.lang.String inputDocument)
Transforms values into a range between 0 and 1.
private java.util.List<ClassificationResult<BytesRef>>    calculateLogLikelihood(java.lang.String[] tokenizedText)
private java.util.Map<BytesRef,java.lang.Integer>    getWordFreqForClassess(java.lang.String word)
void    reInitCache(int minTermOccurrenceInCache, boolean justCachedTerms)
Builds the frame of the cache.
-
Methods inherited from class org.apache.lucene.classification.SimpleNaiveBayesClassifier
assignClass, countDocsWithClass, getClasses, getClasses, normClassificationResults, tokenize
-
Field Detail
-
cclasses
private final java.util.ArrayList<BytesRef> cclasses
-
termCClassHitCache
private final java.util.Map<java.lang.String,java.util.Map<BytesRef,java.lang.Integer>> termCClassHitCache
-
classTermFreq
private final java.util.Map<BytesRef,java.lang.Double> classTermFreq
-
justCachedTerms
private boolean justCachedTerms
-
docsWithClassSize
private int docsWithClassSize
-
Constructor Detail
-
CachingNaiveBayesClassifier
public CachingNaiveBayesClassifier(IndexReader indexReader, Analyzer analyzer, Query query, java.lang.String classFieldName, java.lang.String... textFieldNames)
Creates a new Naive Bayes classifier with internal caching. To reduce memory usage, call reInitCache().
- Parameters:
indexReader - the reader on the index to be used for classification
analyzer - an Analyzer used to analyze unseen text
query - a Query used to optionally filter the docs used for training the classifier, or null if all the indexed docs should be used
classFieldName - the name of the field used as the output for the classifier
textFieldNames - the names of the fields used as the inputs for the classifier
-
Method Detail
-
assignClassNormalizedList
protected java.util.List<ClassificationResult<BytesRef>> assignClassNormalizedList(java.lang.String inputDocument) throws java.io.IOException
Transforms values into a range between 0 and 1.
- Overrides:
assignClassNormalizedList in class SimpleNaiveBayesClassifier
- Parameters:
inputDocument - the input text as a String
- Returns:
a List of ClassificationResult, one for each existing class
- Throws:
java.io.IOException - if assigning probabilities fails
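One common way to map per-class log-likelihood scores into the range between 0 and 1 is to exponentiate them relative to their log-sum, subtracting the maximum score first for numerical stability. The sketch below assumes that approach; the class LogScoreNormalizer and its normalize method are hypothetical names for illustration and use String labels instead of BytesRef.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, stdlib-only sketch: turns log scores into values in [0, 1] that sum to 1.
final class LogScoreNormalizer {
    static Map<String, Double> normalize(Map<String, Double> logScores) {
        Map<String, Double> normalized = new LinkedHashMap<>();
        if (logScores.isEmpty()) {
            return normalized;
        }
        // The largest log score (closest to 0) is used as the stabilizing offset.
        double max = Double.NEGATIVE_INFINITY;
        for (double score : logScores.values()) {
            max = Math.max(max, score);
        }
        // log(sum(exp(x))) = max + log(sum(exp(x - max))), which avoids underflow.
        double sum = 0.0;
        for (double score : logScores.values()) {
            sum += Math.exp(score - max);
        }
        double logTotal = max + Math.log(sum);
        // exp(x - logTotal) equals exp(x) / sum(exp(x)).
        for (Map.Entry<String, Double> entry : logScores.entrySet()) {
            normalized.put(entry.getKey(), Math.exp(entry.getValue() - logTotal));
        }
        return normalized;
    }
}
```

Subtracting the maximum keeps very negative log scores (for example around -1000) from underflowing to zero before the division.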
-
calculateLogLikelihood
private java.util.List<ClassificationResult<BytesRef>> calculateLogLikelihood(java.lang.String[] tokenizedText) throws java.io.IOException
- Throws:
java.io.IOException
-
getWordFreqForClassess
private java.util.Map<BytesRef,java.lang.Integer> getWordFreqForClassess(java.lang.String word) throws java.io.IOException
- Throws:
java.io.IOException
-
reInitCache
public void reInitCache(int minTermOccurrenceInCache, boolean justCachedTerms) throws java.io.IOException
Builds the frame (skeleton) of the cache. After a word has been searched once, the cache keeps its occurrence counts in memory. Used properly, the cache can give a 2-100x speedup, but it can consume a lot of memory. To lower memory consumption, words with very low occurrence in the index can be filtered out of the cache. The second parameter controls term searching: if it is true, only the terms in the cache skeleton are searched; if it is false, terms that are not in the cache are searched as well (but not cached).
- Parameters:
minTermOccurrenceInCache - a higher value lowers the cache size.
justCachedTerms - the switch for fully excluding low-occurrence docs.
- Throws:
java.io.IOException - if there is a low-level I/O error.
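The interplay of the two parameters can be sketched with a small stand-alone model. This is an illustration under assumptions, not the class's code: TermCountCacheSketch, reInit, countsPerClass, and the indexSearch function (a stand-in for a real index lookup) are hypothetical, and the "at least minTermOccurrenceInCache" comparison is an assumed threshold rule.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical, stdlib-only model of a term -> (class -> count) cache with a skeleton.
final class TermCountCacheSketch {
    private final Map<String, Map<String, Integer>> cache = new HashMap<>();
    private final Function<String, Map<String, Integer>> indexSearch;
    private boolean justCachedTerms;
    int searches = 0; // number of (expensive) index lookups performed so far

    TermCountCacheSketch(Function<String, Map<String, Integer>> indexSearch) {
        this.indexSearch = indexSearch;
    }

    // Builds the skeleton: one empty slot per term that is frequent enough (assumed rule).
    void reInit(Map<String, Integer> termDocFreq, int minTermOccurrenceInCache, boolean justCachedTerms) {
        cache.clear();
        this.justCachedTerms = justCachedTerms;
        for (Map.Entry<String, Integer> entry : termDocFreq.entrySet()) {
            if (entry.getValue() >= minTermOccurrenceInCache) {
                cache.put(entry.getKey(), new HashMap<>());
            }
        }
    }

    Map<String, Integer> countsPerClass(String term) {
        Map<String, Integer> slot = cache.get(term);
        if (slot != null && !slot.isEmpty()) {
            return slot; // already searched once: served from memory
        }
        if (slot == null && justCachedTerms) {
            return new HashMap<>(); // term outside the skeleton is ignored
        }
        searches++;
        Map<String, Integer> found = indexSearch.apply(term);
        if (slot != null) {
            cache.put(term, found); // only skeleton terms are stored
        }
        return found;
    }
}
```

With justCachedTerms set to true, rare terms cost nothing but also contribute nothing to the score; with false, they are looked up on every call, which trades speed for completeness.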
-