Class UAX29URLEmailTokenizerImpl
- java.lang.Object
  - org.apache.lucene.analysis.email.UAX29URLEmailTokenizerImpl
public final class UAX29URLEmailTokenizerImpl extends java.lang.Object
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
Tokens produced are of the following types:
- <ALPHANUM>: A sequence of alphabetic and numeric characters
- <NUM>: A number
- <URL>: A URL
- <EMAIL>: An email address
- <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
- <IDEOGRAPHIC>: A single CJKV ideographic character
- <HIRAGANA>: A single hiragana character
- <KATAKANA>: A sequence of katakana characters
- <HANGUL>: A sequence of Hangul characters
- <EMOJI>: A sequence of Emoji characters
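The type labels above correspond to the integer `*_TYPE` constants listed in the Field Summary. The following is a minimal sketch of how a caller might turn a type code into its label. It is not Lucene code: the class name `TokenTypeLabels` and the numeric values are assumptions made for illustration, since this page does not list the real constant values (see Constant Field Values for those).

```java
final class TokenTypeLabels {
    // Hypothetical numeric codes: this page does not state the real constant values.
    static final int WORD_TYPE = 0;
    static final int NUMERIC_TYPE = 1;
    static final int URL_TYPE = 2;
    static final int EMAIL_TYPE = 3;

    // Labels copied from the token type list, indexed by the assumed codes above.
    private static final String[] LABELS = {"<ALPHANUM>", "<NUM>", "<URL>", "<EMAIL>"};

    /** Returns the label for a type code, or "<UNKNOWN>" when the code is out of range. */
    static String label(int tokenType) {
        return (tokenType >= 0 && tokenType < LABELS.length) ? LABELS[tokenType] : "<UNKNOWN>";
    }

    public static void main(String[] args) {
        System.out.println(label(EMAIL_TYPE));
    }
}
```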
-
-
Field Summary
Fields (modifier and type, field, description):
- static int AVOID_BAD_URL
- static int EMAIL_TYPE: Email token type
- static int EMOJI_TYPE: Emoji token type
- static int HANGUL_TYPE: Hangul token type
- static int HIRAGANA_TYPE: Hiragana token type
- static int IDEOGRAPHIC_TYPE: Ideographic token type
- static int KATAKANA_TYPE: Katakana token type
- static int NUMERIC_TYPE: Numbers
- static int SOUTH_EAST_ASIAN_TYPE: Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.).
- static int URL_TYPE: URL token type
- static int WORD_TYPE: Alphanumeric sequences
- private long yychar: Number of characters up to the start of the matched text.
- private int yycolumn: Number of characters from the last newline up to the start of the matched text.
- static int YYEOF: This character denotes the end of file.
- static int YYINITIAL: Lexical States.
- private int yyline: Number of newlines encountered up to the start of the matched text.
- private static int[] ZZ_ACTION: Translates DFA states to action switch labels.
- private static java.lang.String ZZ_ACTION_PACKED_0
- private static int[] ZZ_ATTRIBUTE: ZZ_ATTRIBUTE[aState] contains the attributes of state aState
- private static java.lang.String ZZ_ATTRIBUTE_PACKED_0
- private int ZZ_BUFFERSIZE: Initial size of the lookahead buffer.
- private static int[] ZZ_CMAP_BLOCKS: Second-level tables for translating characters to character classes
- private static java.lang.String ZZ_CMAP_BLOCKS_PACKED_0
- private static int[] ZZ_CMAP_TOP: Top-level table for translating characters to character classes
- private static java.lang.String ZZ_CMAP_TOP_PACKED_0
- private static java.lang.String[] ZZ_ERROR_MSG
- private static int[] ZZ_LEXSTATE: ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l; ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, k a non-negative integer.
- private static int ZZ_NO_MATCH: Error code for "could not match input".
- private static int ZZ_PUSHBACK_2BIG: Error code for "pushback value was too large".
- private static int[] ZZ_ROWMAP: Translates a state to a row index in the transition table
- private static java.lang.String ZZ_ROWMAP_PACKED_0
- private static int[] ZZ_TRANS: The transition table of the DFA
- private static java.lang.String ZZ_TRANS_PACKED_0
- private static java.lang.String ZZ_TRANS_PACKED_1
- private static java.lang.String ZZ_TRANS_PACKED_10
- private static java.lang.String ZZ_TRANS_PACKED_2
- private static java.lang.String ZZ_TRANS_PACKED_3
- private static java.lang.String ZZ_TRANS_PACKED_4
- private static java.lang.String ZZ_TRANS_PACKED_5
- private static java.lang.String ZZ_TRANS_PACKED_6
- private static java.lang.String ZZ_TRANS_PACKED_7
- private static java.lang.String ZZ_TRANS_PACKED_8
- private static java.lang.String ZZ_TRANS_PACKED_9
- private static int ZZ_UNKNOWN_ERROR: Error code for "Unknown internal scanner error".
- private boolean zzAtBOL: Whether the scanner is currently at the beginning of a line.
- private boolean zzAtEOF: Whether the scanner is at the end of file.
- private char[] zzBuffer: This buffer contains the current text to be matched and is the source of the yytext() string.
- private int zzCurrentPos: Current text position in the buffer.
- private int zzEndRead: Marks the last character in the buffer, that has been read from input.
- private boolean zzEOFDone: Whether the user-EOF-code has already been executed.
- private int zzFinalHighSurrogate
- private int zzLexicalState: Current lexical state.
- private int zzMarkedPos: Text position at the last accepting state.
- private java.io.Reader zzReader: Input device.
- private int zzStartRead: Marks the beginning of the yytext() string in the buffer.
- private int zzState: Current state of the DFA.
-
Constructor Summary
Constructors (constructor, description):
- UAX29URLEmailTokenizerImpl(java.io.Reader in): Creates a new scanner
-
Method Summary
Methods (modifier and type, method, description):
- int getNextToken(): Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
- void getText(CharTermAttribute t): Fills CharTermAttribute with the current token text.
- void setBufferSize(int numChars): Sets the scanner buffer size in chars
- boolean yyatEOF(): Returns whether the scanner has reached the end of the reader it reads from.
- void yybegin(int newState): Enters a new lexical state.
- int yychar(): Character count processed so far
- char yycharat(int position): Returns the character at the given position from the matched text.
- void yyclose(): Closes the input reader.
- int yylength(): How many characters were matched.
- void yypushback(int number): Pushes the specified amount of characters back into the input stream.
- void yyreset(java.io.Reader reader): Resets the scanner to read from a new input stream.
- private void yyResetPosition(): Resets the input position.
- int yystate(): Returns the current lexical state.
- java.lang.String yytext(): Returns the text matched by the current regular expression.
- private static int zzCMap(int input): Translates raw input code points to DFA table row
- private boolean zzRefill(): Refills the input buffer.
- private static void zzScanError(int errorCode): Reports an error that occurred while scanning.
- private static int[] zzUnpackAction()
- private static int zzUnpackAction(java.lang.String packed, int offset, int[] result)
- private static int[] zzUnpackAttribute()
- private static int zzUnpackAttribute(java.lang.String packed, int offset, int[] result)
- private static int[] zzUnpackcmap_blocks()
- private static int zzUnpackcmap_blocks(java.lang.String packed, int offset, int[] result)
- private static int[] zzUnpackcmap_top()
- private static int zzUnpackcmap_top(java.lang.String packed, int offset, int[] result)
- private static int[] zzUnpackRowMap()
- private static int zzUnpackRowMap(java.lang.String packed, int offset, int[] result)
- private static int[] zzUnpackTrans()
- private static int zzUnpackTrans(java.lang.String packed, int offset, int[] result)
-
-
-
Field Detail
-
YYEOF
public static final int YYEOF
This character denotes the end of file.
- See Also:
- Constant Field Values
-
ZZ_BUFFERSIZE
private int ZZ_BUFFERSIZE
Initial size of the lookahead buffer.
-
YYINITIAL
public static final int YYINITIAL
Lexical States.
- See Also:
- Constant Field Values
-
AVOID_BAD_URL
public static final int AVOID_BAD_URL
- See Also:
- Constant Field Values
-
ZZ_LEXSTATE
private static final int[] ZZ_LEXSTATE
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l; ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
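The indexing rule above can be sketched as follows. This is an illustration only: the table contents, the numeric values given to `YYINITIAL` and `AVOID_BAD_URL`, and the helper `startState` are assumptions, not values taken from this class. The only property carried over from the description is that lexical states are even numbers and that the entry at `l + 1` applies at the beginning of a line.

```java
final class LexStateSketch {
    // Assumed lexical state numbers of the form l = 2*k.
    static final int YYINITIAL = 0;
    static final int AVOID_BAD_URL = 2;

    // Hypothetical table: [l] is the DFA start state for lexical state l,
    // [l + 1] is the DFA start state for l at the beginning of a line.
    static final int[] LEXSTATE = {0, 3, 1, 4};

    /** Picks the DFA start state for a lexical state, honoring the beginning-of-line variant. */
    static int startState(int lexicalState, boolean atBeginningOfLine) {
        return LEXSTATE[lexicalState + (atBeginningOfLine ? 1 : 0)];
    }

    public static void main(String[] args) {
        System.out.println(startState(AVOID_BAD_URL, true));
    }
}
```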
-
ZZ_CMAP_TOP
private static final int[] ZZ_CMAP_TOP
Top-level table for translating characters to character classes
-
ZZ_CMAP_TOP_PACKED_0
private static final java.lang.String ZZ_CMAP_TOP_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_CMAP_BLOCKS
private static final int[] ZZ_CMAP_BLOCKS
Second-level tables for translating characters to character classes
-
ZZ_CMAP_BLOCKS_PACKED_0
private static final java.lang.String ZZ_CMAP_BLOCKS_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_ACTION
private static final int[] ZZ_ACTION
Translates DFA states to action switch labels.
-
ZZ_ACTION_PACKED_0
private static final java.lang.String ZZ_ACTION_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_ROWMAP
private static final int[] ZZ_ROWMAP
Translates a state to a row index in the transition table
-
ZZ_ROWMAP_PACKED_0
private static final java.lang.String ZZ_ROWMAP_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_TRANS
private static final int[] ZZ_TRANS
The transition table of the DFA
-
ZZ_TRANS_PACKED_0
private static final java.lang.String ZZ_TRANS_PACKED_0
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_1
private static final java.lang.String ZZ_TRANS_PACKED_1
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_2
private static final java.lang.String ZZ_TRANS_PACKED_2
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_3
private static final java.lang.String ZZ_TRANS_PACKED_3
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_4
private static final java.lang.String ZZ_TRANS_PACKED_4
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_5
private static final java.lang.String ZZ_TRANS_PACKED_5
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_6
private static final java.lang.String ZZ_TRANS_PACKED_6
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_7
private static final java.lang.String ZZ_TRANS_PACKED_7
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_8
private static final java.lang.String ZZ_TRANS_PACKED_8
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_9
private static final java.lang.String ZZ_TRANS_PACKED_9
- See Also:
- Constant Field Values
-
ZZ_TRANS_PACKED_10
private static final java.lang.String ZZ_TRANS_PACKED_10
- See Also:
- Constant Field Values
-
ZZ_UNKNOWN_ERROR
private static final int ZZ_UNKNOWN_ERROR
Error code for "Unknown internal scanner error".
- See Also:
- Constant Field Values
-
ZZ_NO_MATCH
private static final int ZZ_NO_MATCH
Error code for "could not match input".
- See Also:
- Constant Field Values
-
ZZ_PUSHBACK_2BIG
private static final int ZZ_PUSHBACK_2BIG
Error code for "pushback value was too large".
- See Also:
- Constant Field Values
-
ZZ_ERROR_MSG
private static final java.lang.String[] ZZ_ERROR_MSG
-
ZZ_ATTRIBUTE
private static final int[] ZZ_ATTRIBUTE
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
-
ZZ_ATTRIBUTE_PACKED_0
private static final java.lang.String ZZ_ATTRIBUTE_PACKED_0
- See Also:
- Constant Field Values
-
zzReader
private java.io.Reader zzReader
Input device.
-
zzState
private int zzState
Current state of the DFA.
-
zzLexicalState
private int zzLexicalState
Current lexical state.
-
zzBuffer
private char[] zzBuffer
This buffer contains the current text to be matched and is the source of the yytext() string.
-
zzMarkedPos
private int zzMarkedPos
Text position at the last accepting state.
-
zzCurrentPos
private int zzCurrentPos
Current text position in the buffer.
-
zzStartRead
private int zzStartRead
Marks the beginning of the yytext() string in the buffer.
-
zzEndRead
private int zzEndRead
Marks the last character in the buffer, that has been read from input.
-
zzAtEOF
private boolean zzAtEOF
Whether the scanner is at the end of file.
- See Also:
- yyatEOF()
-
zzFinalHighSurrogate
private int zzFinalHighSurrogate
-
yyline
private int yyline
Number of newlines encountered up to the start of the matched text.
-
yycolumn
private int yycolumn
Number of characters from the last newline up to the start of the matched text.
-
yychar
private long yychar
Number of characters up to the start of the matched text.
-
zzAtBOL
private boolean zzAtBOL
Whether the scanner is currently at the beginning of a line.
-
zzEOFDone
private boolean zzEOFDone
Whether the user-EOF-code has already been executed.
-
WORD_TYPE
public static final int WORD_TYPE
Alphanumeric sequences
- See Also:
- Constant Field Values
-
NUMERIC_TYPE
public static final int NUMERIC_TYPE
Numbers
- See Also:
- Constant Field Values
-
SOUTH_EAST_ASIAN_TYPE
public static final int SOUTH_EAST_ASIAN_TYPE
Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.
See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA
- See Also:
- Constant Field Values
-
IDEOGRAPHIC_TYPE
public static final int IDEOGRAPHIC_TYPE
Ideographic token type
- See Also:
- Constant Field Values
-
HIRAGANA_TYPE
public static final int HIRAGANA_TYPE
Hiragana token type
- See Also:
- Constant Field Values
-
KATAKANA_TYPE
public static final int KATAKANA_TYPE
Katakana token type
- See Also:
- Constant Field Values
-
HANGUL_TYPE
public static final int HANGUL_TYPE
Hangul token type
- See Also:
- Constant Field Values
-
EMAIL_TYPE
public static final int EMAIL_TYPE
Email token type
- See Also:
- Constant Field Values
-
URL_TYPE
public static final int URL_TYPE
URL token type
- See Also:
- Constant Field Values
-
EMOJI_TYPE
public static final int EMOJI_TYPE
Emoji token type
- See Also:
- Constant Field Values
-
-
Method Detail
-
zzUnpackcmap_top
private static int[] zzUnpackcmap_top()
-
zzUnpackcmap_top
private static int zzUnpackcmap_top(java.lang.String packed, int offset, int[] result)
-
zzUnpackcmap_blocks
private static int[] zzUnpackcmap_blocks()
-
zzUnpackcmap_blocks
private static int zzUnpackcmap_blocks(java.lang.String packed, int offset, int[] result)
-
zzUnpackAction
private static int[] zzUnpackAction()
-
zzUnpackAction
private static int zzUnpackAction(java.lang.String packed, int offset, int[] result)
-
zzUnpackRowMap
private static int[] zzUnpackRowMap()
-
zzUnpackRowMap
private static int zzUnpackRowMap(java.lang.String packed, int offset, int[] result)
-
zzUnpackTrans
private static int[] zzUnpackTrans()
-
zzUnpackTrans
private static int zzUnpackTrans(java.lang.String packed, int offset, int[] result)
-
zzUnpackAttribute
private static int[] zzUnpackAttribute()
-
zzUnpackAttribute
private static int zzUnpackAttribute(java.lang.String packed, int offset, int[] result)
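This page does not describe how the `*_PACKED_*` strings are encoded. The three-argument signatures (packed string, start offset, result array, returned index) are consistent with a run-length decoder, so the sketch below assumes the packed string is a sequence of (count, value) character pairs. Both the encoding and the name `UnpackSketch.unpack` are assumptions for illustration, not the generated code itself.

```java
import java.util.Arrays;

final class UnpackSketch {
    /**
     * Decodes (count, value) character pairs from packed into result, starting at offset.
     * Assumes packed has even length and every count is at least 1.
     * Returns the index just past the last element written, so that a further
     * packed string can continue filling the same array from that index.
     */
    static int unpack(String packed, int offset, int[] result) {
        int i = 0;       // index into the packed string
        int j = offset;  // index into the unpacked array
        while (i < packed.length()) {
            int count = packed.charAt(i++);
            int value = packed.charAt(i++);
            do {
                result[j++] = value;
            } while (--count > 0);
        }
        return j;
    }

    public static void main(String[] args) {
        int[] table = new int[5];
        // Two pairs: three copies of 7, then two copies of 1.
        int end = unpack("\3\7\2\1", 0, table);
        System.out.println(end + " " + Arrays.toString(table));
    }
}
```

Returning the end index is what would let a table split across several strings (such as ZZ_TRANS_PACKED_0 through ZZ_TRANS_PACKED_10) be unpacked piece by piece into one array.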
-
yychar
public final int yychar()
Character count processed so far
-
getText
public final void getText(CharTermAttribute t)
Fills CharTermAttribute with the current token text.
-
setBufferSize
public final void setBufferSize(int numChars)
Sets the scanner buffer size in chars
-
zzCMap
private static int zzCMap(int input)
Translates raw input code points to DFA table row
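ZZ_CMAP_TOP and ZZ_CMAP_BLOCKS are described as top-level and second-level tables, which suggests a two-step lookup. The following is a minimal sketch of such a lookup under assumed conventions: 256-entry blocks, and top-level entries that hold block offsets. The table contents and the name `CMapSketch.charClass` are hypothetical, and the real tables cover the full Unicode range rather than the 512 code points used here.

```java
import java.util.Arrays;

final class CMapSketch {
    // Hypothetical top-level table: maps (codePoint >> 8) to the offset of a 256-entry block.
    static final int[] TOP = {0, 256};

    // Hypothetical second-level table: two blocks of 256 character classes each.
    static final int[] BLOCKS = new int[512];

    static {
        // Block 0 (code points 0..255): ASCII digits are class 1, everything else class 0.
        for (int c = '0'; c <= '9'; c++) {
            BLOCKS[c] = 1;
        }
        // Block 1 (code points 256..511): every code point is class 2.
        Arrays.fill(BLOCKS, 256, 512, 2);
    }

    /** Looks up the character class of a code point in the range 0..511. */
    static int charClass(int codePoint) {
        int low = codePoint & 0xFF;               // position inside the block
        return BLOCKS[TOP[codePoint >> 8] | low]; // block offset combined with position
    }

    public static void main(String[] args) {
        System.out.println(charClass('7') + " " + charClass('a') + " " + charClass(0x100));
    }
}
```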
-
zzRefill
private boolean zzRefill() throws java.io.IOException
Refills the input buffer.
- Returns:
- false iff there was new input.
- Throws:
- java.io.IOException - if any I/O-Error occurs
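Note the inverted return value: false means new input arrived. A simplified sketch of a refill step with that contract is shown below. The class `RefillSketch`, its field names, and the doubling growth policy are assumptions for illustration; the generated method also has to handle details such as surrogate pairs and compacting the buffer, which are omitted here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;

final class RefillSketch {
    private final Reader reader;
    private char[] buffer = new char[4]; // deliberately tiny initial size
    private int endRead = 0;             // number of valid chars in the buffer

    RefillSketch(Reader reader) {
        this.reader = reader;
    }

    /** Returns false if new characters were read, true if no more input is available. */
    boolean refill() throws IOException {
        if (endRead == buffer.length) {
            // Buffer is full: grow it so there is room to read into.
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        int numRead = reader.read(buffer, endRead, buffer.length - endRead);
        if (numRead > 0) {
            endRead += numRead;
            return false;
        }
        // Simplification: a read of zero chars is treated like end of input here.
        return true;
    }

    String contents() {
        return new String(buffer, 0, endRead);
    }

    public static void main(String[] args) throws IOException {
        RefillSketch sketch = new RefillSketch(new StringReader("abcdef"));
        while (!sketch.refill()) {
            // keep reading until refill() reports that no new input arrived
        }
        System.out.println(sketch.contents());
    }
}
```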
-
yyclose
public final void yyclose() throws java.io.IOException
Closes the input reader.
- Throws:
- java.io.IOException - if the reader could not be closed.
-
yyreset
public final void yyreset(java.io.Reader reader)
Resets the scanner to read from a new input stream. Does not close the old reader.
All internal variables are reset, the old input stream cannot be reused (internal buffer is discarded and lost). Lexical state is set to YYINITIAL.
Internal scan buffer is resized down to its initial length, if it has grown.
- Parameters:
- reader - The new input stream.
-
yyResetPosition
private final void yyResetPosition()
Resets the input position.
-
yyatEOF
public final boolean yyatEOF()
Returns whether the scanner has reached the end of the reader it reads from.
- Returns:
- whether the scanner has reached EOF.
-
yystate
public final int yystate()
Returns the current lexical state.
- Returns:
- the current lexical state.
-
yybegin
public final void yybegin(int newState)
Enters a new lexical state.
- Parameters:
- newState - the new lexical state
-
yytext
public final java.lang.String yytext()
Returns the text matched by the current regular expression.
- Returns:
- the matched text.
-
yycharat
public final char yycharat(int position)
Returns the character at the given position from the matched text. It is equivalent to yytext().charAt(position), but faster.
- Parameters:
- position - the position of the character to fetch. A value from 0 to yylength()-1.
- Returns:
- the character at position.
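The speed difference follows from the field descriptions: the matched text is a window of the scan buffer, starting at zzStartRead and ending at zzMarkedPos, so one character can be read directly without building a String. The sketch below illustrates that relationship. The class `MatchWindowSketch` and its fixed window are hypothetical stand-ins, not the generated scanner.

```java
final class MatchWindowSketch {
    private final char[] buffer;
    private final int startRead; // start of the matched text in the buffer
    private final int markedPos; // end (exclusive) of the matched text

    MatchWindowSketch(String input, int startRead, int markedPos) {
        this.buffer = input.toCharArray();
        this.startRead = startRead;
        this.markedPos = markedPos;
    }

    /** Copies the window into a new String (an allocation on every call). */
    String text() {
        return new String(buffer, startRead, markedPos - startRead);
    }

    /** Reads one character straight from the buffer, with no allocation. */
    char charAt(int position) {
        return buffer[startRead + position];
    }

    /** Length of the matched window. */
    int length() {
        return markedPos - startRead;
    }

    public static void main(String[] args) {
        MatchWindowSketch match = new MatchWindowSketch("see http://a.b now", 4, 14);
        System.out.println(match.text() + " " + match.charAt(0) + " " + match.length());
    }
}
```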
-
yylength
public final int yylength()
How many characters were matched.
- Returns:
- the length of the matched text region.
-
zzScanError
private static void zzScanError(int errorCode)
Reports an error that occurred while scanning.
In a well-formed scanner (no or only correct usage of yypushback(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.).
Usual syntax/scanner level error handling should be done in error fallback rules.
- Parameters:
- errorCode - the code of the error message to display.
-
yypushback
public void yypushback(int number)
Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.
- Parameters:
- number - the number of characters to be read again. This number must not be greater than yylength().
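One way to picture pushback is as moving the end of the current match backwards, so that the next match starts on the returned characters. The sketch below models that with two positions. The class `PushbackSketch`, its `match` helper, and the exception type used for an oversized pushback are assumptions for illustration; the generated scanner reports that case through its own error path.

```java
final class PushbackSketch {
    private final char[] buffer;
    private int startRead; // start of the current match
    private int markedPos; // end (exclusive) of the current match

    PushbackSketch(String input) {
        this.buffer = input.toCharArray();
    }

    /** Pretends the scanner matched the next length characters after the previous match. */
    void match(int length) {
        startRead = markedPos;
        markedPos = startRead + length;
    }

    String text() {
        return new String(buffer, startRead, markedPos - startRead);
    }

    /** Gives back the last number characters of the current match. */
    void pushback(int number) {
        if (number > markedPos - startRead) {
            throw new IllegalArgumentException("pushback value was too large");
        }
        markedPos -= number;
    }

    public static void main(String[] args) {
        PushbackSketch scanner = new PushbackSketch("foo.bar");
        scanner.match(4);    // current match is "foo."
        scanner.pushback(1); // give the trailing '.' back
        System.out.println(scanner.text());
        scanner.match(1);    // the '.' is read again
        System.out.println(scanner.text());
    }
}
```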
-
getNextToken
public int getNextToken() throws java.io.IOException
Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
- Returns:
- the next token.
- Throws:
java.io.IOException
- if any I/O-Error occurs.
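A typical caller invokes getNextToken() repeatedly until it returns YYEOF, reading the matched text after each call. Since this class needs Lucene on the classpath, the sketch below uses a toy whitespace-splitting scanner as a stand-in to show the loop shape only. `ToyScanner`, the `WORD` type, and the value -1 for `YYEOF` are assumptions (this page does not list the constant values), and the toy's matching rules have nothing to do with UAX#29.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class ScanLoopSketch {
    static final int YYEOF = -1; // assumed end-of-input marker
    static final int WORD = 0;   // hypothetical token type

    /** Toy stand-in for the generated scanner: tokens are runs of non-whitespace. */
    static final class ToyScanner {
        private final Reader in;
        private final StringBuilder text = new StringBuilder();

        ToyScanner(Reader in) {
            this.in = in;
        }

        int getNextToken() throws IOException {
            text.setLength(0);
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (text.length() > 0) {
                        return WORD; // whitespace ends the current token
                    }
                } else {
                    text.append((char) c);
                }
            }
            return text.length() > 0 ? WORD : YYEOF;
        }

        String yytext() {
            return text.toString();
        }
    }

    /** Drains the scanner, collecting the text of every token until YYEOF. */
    static List<String> tokens(Reader in) throws IOException {
        ToyScanner scanner = new ToyScanner(in);
        List<String> out = new ArrayList<>();
        while (scanner.getNextToken() != YYEOF) {
            out.add(scanner.yytext());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokens(new StringReader("mail me  at home")));
    }
}
```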
-
-