- java.lang.Object
-
- org.apache.lucene.util.fst.FSTCompiler<T>
-
public class FSTCompiler<T> extends java.lang.Object
Builds a minimal FST (maps an IntsRef term to an arbitrary output) from pre-sorted terms with outputs. The FST becomes an FSA if you use NoOutputs. The FST is written on-the-fly into a compact serialized format byte array, which can be saved to / loaded from a Directory or used directly for traversal. The FST is always finite (no cycles).NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
The parameterized type T is the output type. See the subclasses of
Outputs
.FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.
It now supports 3 different workflows:
- Build FST and use it immediately entirely in RAM and then discard it
- Build FST and use it immediately entirely in RAM and also save it to other DataOutput, and load it later and use it
- Build FST but stream it immediately to disk (except the FSTMetaData, to be saved at the end). In order to use it, you need to construct the corresponding DataInput and use the FST constructor to read it.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description (package private) static class
FSTCompiler.Arc<T>
Expert: holds a pending (seen but not yet serialized) arc.static class
FSTCompiler.Builder<T>
Fluent-style constructor for FSTFSTCompiler
.(package private) static class
FSTCompiler.CompiledNode
(package private) static class
FSTCompiler.FixedLengthArcsBuffer
Reusable buffer for building nodes with fixed length arcs (binary search or direct addressing).(package private) static interface
FSTCompiler.Node
private static class
FSTCompiler.NullFSTReader
This class is used for FST backed by non-FSTReader DataOutput.(package private) static class
FSTCompiler.UnCompiledNode<T>
Expert: holds a pending (seen but not yet serialized) Node.
-
Field Summary
Fields Modifier and Type Field Description (package private) boolean
allowFixedLengthArcs
(package private) long
arcCount
(package private) long
binarySearchNodeCount
(package private) long
continuousNodeCount
(package private) DataOutput
dataOutput
private NodeHash<T>
dedupHash
private static float
DIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTOR
Maximum oversizing factor allowed for direct addressing compared to binary search when expansion credits allow the oversizing.(package private) static float
DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR
(package private) long
directAddressingExpansionCredit
(package private) float
directAddressingMaxOversizingFactor
(package private) long
directAddressingNodeCount
(package private) static int
FIXED_LENGTH_ARC_DEEP_NUM_ARCS
(package private) static int
FIXED_LENGTH_ARC_SHALLOW_DEPTH
(package private) static int
FIXED_LENGTH_ARC_SHALLOW_NUM_ARCS
(package private) FSTCompiler.FixedLengthArcsBuffer
fixedLengthArcsBuffer
private FSTCompiler.UnCompiledNode<T>[]
frontier
(package private) FST<T>
fst
(package private) long
lastFrozenNode
private IntsRefBuilder
lastInput
private T
NO_OUTPUT
(package private) long
nodeCount
private static FSTReader
NULL_FST_READER
(package private) int[]
numBytesPerArc
private long
numBytesWritten
(package private) int[]
numLabelBytesPerArc
private boolean
paddingBytePending
(package private) GrowableByteArrayDataOutput
scratchBytes
(package private) int
version
-
Constructor Summary
Constructors Modifier Constructor Description private
FSTCompiler(FST.INPUT_TYPE inputType, double suffixRAMLimitMB, Outputs<T> outputs, boolean allowFixedLengthArcs, DataOutput dataOutput, float directAddressingMaxOversizingFactor, int version)
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description void
add(IntsRef input, T output)
Add the next input/output pair.(package private) long
addNode(FSTCompiler.UnCompiledNode<T> nodeIn)
FST.FSTMetadata<T>
compile()
Returns the metadata of the final FST.private FSTCompiler.CompiledNode
compileNode(FSTCompiler.UnCompiledNode<T> nodeIn)
(package private) void
finish(long newStartNode)
private void
freezeTail(int prefixLenPlus1)
long
fstRamBytesUsed()
long
fstSizeInBytes()
long
getArcCount()
float
getDirectAddressingMaxOversizingFactor()
FSTReader
getFSTReader()
Get the respectiveFSTReader
of theDataOutput
.long
getNodeCount()
static DataOutput
getOnHeapReaderWriter(int blockBits)
Get an on-heap DataOutput that allows the FST to be read immediately after writing, and also optionally saved to an external DataOutput.private void
reverseScratchBytes()
Reverse the scratch bytes in place.(package private) void
setEmptyOutput(T v)
private boolean
shouldExpandNodeWithDirectAddressing(FSTCompiler.UnCompiledNode<T> nodeIn, int numBytesPerArc, int maxBytesPerArcWithoutLabel, int labelRange)
Returns whether the given node should be expanded with direct addressing instead of binary search.private boolean
shouldExpandNodeWithFixedLengthArcs(FSTCompiler.UnCompiledNode<T> node)
Returns whether the given node should be expanded with fixed length arcs.private boolean
validOutput(T output)
private void
writeLabel(DataOutput out, int v)
private void
writeNodeForBinarySearch(FSTCompiler.UnCompiledNode<T> nodeIn, int maxBytesPerArc)
private void
writeNodeForDirectAddressingOrContinuous(FSTCompiler.UnCompiledNode<T> nodeIn, int maxBytesPerArcWithoutLabel, int labelRange, boolean continuous)
private void
writePaddingByte()
Write the padding byte, ensure no node gets address 0 which is reserved to mean the stop state w/ no arcsprivate void
writePresenceBits(FSTCompiler.UnCompiledNode<T> nodeIn)
private void
writeScratchBytes(int destPos, byte[] bytes, int offset, int length)
Write bytes from a source byte[] to the scratch bytes.
-
-
-
Field Detail
-
DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR
static final float DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR
- See Also:
- Constant Field Values
-
FIXED_LENGTH_ARC_SHALLOW_DEPTH
static final int FIXED_LENGTH_ARC_SHALLOW_DEPTH
-
FIXED_LENGTH_ARC_SHALLOW_NUM_ARCS
static final int FIXED_LENGTH_ARC_SHALLOW_NUM_ARCS
-
FIXED_LENGTH_ARC_DEEP_NUM_ARCS
static final int FIXED_LENGTH_ARC_DEEP_NUM_ARCS
-
DIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTOR
private static final float DIRECT_ADDRESSING_MAX_OVERSIZE_WITH_CREDIT_FACTOR
Maximum oversizing factor allowed for direct addressing compared to binary search when expansion credits allow the oversizing. This factor prevents expansions that are obviously too costly even if there are sufficient credits.
-
NULL_FST_READER
private static final FSTReader NULL_FST_READER
-
NO_OUTPUT
private final T NO_OUTPUT
-
lastInput
private final IntsRefBuilder lastInput
-
paddingBytePending
private boolean paddingBytePending
-
frontier
private FSTCompiler.UnCompiledNode<T>[] frontier
-
lastFrozenNode
long lastFrozenNode
-
numBytesPerArc
int[] numBytesPerArc
-
numLabelBytesPerArc
int[] numLabelBytesPerArc
-
fixedLengthArcsBuffer
final FSTCompiler.FixedLengthArcsBuffer fixedLengthArcsBuffer
-
arcCount
long arcCount
-
nodeCount
long nodeCount
-
binarySearchNodeCount
long binarySearchNodeCount
-
directAddressingNodeCount
long directAddressingNodeCount
-
continuousNodeCount
long continuousNodeCount
-
allowFixedLengthArcs
final boolean allowFixedLengthArcs
-
directAddressingMaxOversizingFactor
final float directAddressingMaxOversizingFactor
-
version
final int version
-
directAddressingExpansionCredit
long directAddressingExpansionCredit
-
dataOutput
final DataOutput dataOutput
-
scratchBytes
final GrowableByteArrayDataOutput scratchBytes
-
numBytesWritten
private long numBytesWritten
-
-
Constructor Detail
-
FSTCompiler
private FSTCompiler(FST.INPUT_TYPE inputType, double suffixRAMLimitMB, Outputs<T> outputs, boolean allowFixedLengthArcs, DataOutput dataOutput, float directAddressingMaxOversizingFactor, int version)
-
-
Method Detail
-
getOnHeapReaderWriter
public static DataOutput getOnHeapReaderWriter(int blockBits)
Get an on-heap DataOutput that allows the FST to be read immediately after writing, and also optionally saved to an external DataOutput.- Parameters:
blockBits
- how many bits wide to make each block of the DataOutput- Returns:
- the DataOutput
-
getFSTReader
public FSTReader getFSTReader()
Get the respectiveFSTReader
of theDataOutput
. To call this method, you need to use the default DataOutput orgetOnHeapReaderWriter(int)
, otherwise we will throw an exception.- Returns:
- the DataOutput as FSTReader
- Throws:
java.lang.IllegalStateException
- if the DataOutput does not implement FSTReader
-
getDirectAddressingMaxOversizingFactor
public float getDirectAddressingMaxOversizingFactor()
-
getNodeCount
public long getNodeCount()
-
getArcCount
public long getArcCount()
-
compileNode
private FSTCompiler.CompiledNode compileNode(FSTCompiler.UnCompiledNode<T> nodeIn) throws java.io.IOException
- Throws:
java.io.IOException
-
addNode
long addNode(FSTCompiler.UnCompiledNode<T> nodeIn) throws java.io.IOException
- Throws:
java.io.IOException
-
writePaddingByte
private void writePaddingByte() throws java.io.IOException
Write the padding byte, ensure no node gets address 0 which is reserved to mean the stop state w/ no arcs- Throws:
java.io.IOException
-
writeLabel
private void writeLabel(DataOutput out, int v) throws java.io.IOException
- Throws:
java.io.IOException
-
shouldExpandNodeWithFixedLengthArcs
private boolean shouldExpandNodeWithFixedLengthArcs(FSTCompiler.UnCompiledNode<T> node)
Returns whether the given node should be expanded with fixed length arcs. Nodes will be expanded depending on their depth (distance from the root node) and their number of arcs.Nodes with fixed length arcs use more space, because they encode all arcs with a fixed number of bytes, but they allow either binary search or direct addressing on the arcs (instead of linear scan) on lookup by arc label.
-
shouldExpandNodeWithDirectAddressing
private boolean shouldExpandNodeWithDirectAddressing(FSTCompiler.UnCompiledNode<T> nodeIn, int numBytesPerArc, int maxBytesPerArcWithoutLabel, int labelRange)
Returns whether the given node should be expanded with direct addressing instead of binary search.Prefer direct addressing for performance if it does not oversize binary search byte size too much, so that the arcs can be directly addressed by label.
- See Also:
getDirectAddressingMaxOversizingFactor()
-
writeNodeForBinarySearch
private void writeNodeForBinarySearch(FSTCompiler.UnCompiledNode<T> nodeIn, int maxBytesPerArc)
-
reverseScratchBytes
private void reverseScratchBytes()
Reverse the scratch bytes in place. This operation does not affect scratchBytes.getPosition().
-
writeScratchBytes
private void writeScratchBytes(int destPos, byte[] bytes, int offset, int length)
Write bytes from a source byte[] to the scratch bytes. The written bytes must fit within what was already written in the scratch bytes.This operation does not affect scratchBytes.getPosition().
- Parameters:
destPos
- the position in the scratch bytesbytes
- the source byte[]offset
- the offset inside the source byte[]length
- the number of bytes to write
-
writeNodeForDirectAddressingOrContinuous
private void writeNodeForDirectAddressingOrContinuous(FSTCompiler.UnCompiledNode<T> nodeIn, int maxBytesPerArcWithoutLabel, int labelRange, boolean continuous)
-
writePresenceBits
private void writePresenceBits(FSTCompiler.UnCompiledNode<T> nodeIn)
-
freezeTail
private void freezeTail(int prefixLenPlus1) throws java.io.IOException
- Throws:
java.io.IOException
-
add
public void add(IntsRef input, T output) throws java.io.IOException
Add the next input/output pair. The provided input must be sorted after the previous one according toIntsRef.compareTo(org.apache.lucene.util.IntsRef)
. It's also OK to add the same input twice in a row with different outputs, as long asOutputs
implements theOutputs.merge(T, T)
method. Note that input is fully consumed after this method is returned (so caller is free to reuse), but output is not. So if your outputs are changeable (egByteSequenceOutputs
orIntSequenceOutputs
) then you cannot reuse across calls.- Throws:
java.io.IOException
-
setEmptyOutput
void setEmptyOutput(T v)
-
finish
void finish(long newStartNode)
-
validOutput
private boolean validOutput(T output)
-
compile
public FST.FSTMetadata<T> compile() throws java.io.IOException
Returns the metadata of the final FST. NOTE: this will return null if nothing is accepted by the FST themselves.To create the FST, you need to:
- If a FSTReader DataOutput was used, such as the one returned by
getOnHeapReaderWriter(int)
fstMetadata = fstCompiler.compile(); fst = FST.fromFSTReader(fstMetadata, fstCompiler.getFSTReader());
- If a non-FSTReader DataOutput was used, such as
IndexOutput
, you need to first create the correspondingDataInput
, such asIndexInput
then pass it to the FST constructfstMetadata = fstCompiler.compile(); fst = new FST<>(fstMetadata, dataInput, new OffHeapFSTStore());
- Throws:
java.io.IOException
-
fstRamBytesUsed
public long fstRamBytesUsed()
-
fstSizeInBytes
public long fstSizeInBytes()
-
-