Class AbstractDictionary
- java.lang.Object
-
- org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
-
- Direct Known Subclasses:
BigramDictionary, WordDictionary
abstract class AbstractDictionary extends java.lang.Object
SmartChineseAnalyzer abstract dictionary implementation. Contains methods for dealing with GB2312 encoding.
-
-
Field Summary
Fields (Modifier and Type, Field: Description)
static int CHAR_NUM_IN_FILE: Dictionary data contains 6768 Chinese characters with frequency statistics.
static int GB2312_CHAR_NUM: Last Chinese character in GB2312 (87 * 94).
static int GB2312_FIRST_CHAR: First Chinese character in GB2312 (15 * 94). Characters in GB2312 are arranged in a 94 * 94 grid; rows 0-14 are unassigned or punctuation.
-
Constructor Summary
Constructors (Constructor: Description)
AbstractDictionary()
-
Method Summary
Methods (Modifier and Type, Method: Description)
java.lang.String getCCByGB2312Id(int ccid): Transcode from GB2312 ID to Unicode.
short getGB2312Id(char ch): Transcode from Unicode to GB2312.
long hash1(char c): 32-bit FNV hash function.
long hash1(char[] carray): 32-bit FNV hash function.
int hash2(char c): djb2 hash algorithm (k=33), first reported by Dan Bernstein many years ago in comp.lang.c.
int hash2(char[] carray): djb2 hash algorithm (k=33), first reported by Dan Bernstein many years ago in comp.lang.c.
-
-
-
Field Detail
-
GB2312_FIRST_CHAR
public static final int GB2312_FIRST_CHAR
First Chinese character in GB2312 (15 * 94). Characters in GB2312 are arranged in a 94 * 94 grid; rows 0-14 are unassigned or punctuation.
- See Also:
- Constant Field Values
-
GB2312_CHAR_NUM
public static final int GB2312_CHAR_NUM
Last Chinese character in GB2312 (87 * 94). Characters in GB2312 are arranged in a 94 * 94 grid; rows 88-94 are unassigned.
- See Also:
- Constant Field Values
-
CHAR_NUM_IN_FILE
public static final int CHAR_NUM_IN_FILE
Dictionary data contains 6768 Chinese characters with frequency statistics.
- See Also:
- Constant Field Values
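The arithmetic behind the three constants can be checked with a short sketch. The class and helper names below are hypothetical and are not part of Lucene; only the factors 15 * 94 and 87 * 94 and the count 6768 come from the descriptions above.

```java
// Hypothetical illustration class; not part of the Lucene API.
final class Gb2312GridSketch {
    // GB2312 is laid out as a 94 * 94 grid.
    static final int GRID_SIDE = 94;
    // Zero-based position of the first Chinese character (row 15, column 0).
    static final int FIRST_CHAR = 15 * GRID_SIDE;
    // Position just past the last Chinese character (end of row 86).
    static final int CHAR_NUM = 87 * GRID_SIDE;

    // Zero-based row of a grid position.
    static int row(int id) {
        return id / GRID_SIDE;
    }

    // Zero-based column of a grid position.
    static int column(int id) {
        return id % GRID_SIDE;
    }

    public static void main(String[] args) {
        System.out.println(FIRST_CHAR);            // 1410
        System.out.println(CHAR_NUM);              // 8178
        System.out.println(CHAR_NUM - FIRST_CHAR); // 6768
    }
}
```

The span between the two positions is 72 rows of 94 cells, that is 6768 positions, which matches the figure given for CHAR_NUM_IN_FILE.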
-
-
Method Detail
-
getCCByGB2312Id
public java.lang.String getCCByGB2312Id(int ccid)
Transcode from GB2312 ID to Unicode. GB2312 is divided into a 94 * 94 grid containing 7445 characters: 6763 Chinese characters and 682 symbols. Some regions are unassigned (reserved).
- Parameters:
ccid - GB2312 ID
- Returns:
- Unicode String
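One way to go from a grid position to a Unicode string is to rebuild the two-byte EUC form of the character and decode it. The following is a minimal sketch under stated assumptions, not the Lucene implementation: the class name and method name are hypothetical, the ID is assumed to be row * 94 + column with zero-based row and column, and the JDK in use is assumed to ship the GB2312 charset.

```java
import java.nio.charset.Charset;

// Hypothetical illustration class; not part of the Lucene API.
final class Gb2312IdToUnicodeSketch {
    // Assumption: the runtime provides the "GB2312" charset.
    private static final Charset GB2312 = Charset.forName("GB2312");

    // Assumption: id = row * 94 + column, both zero-based.
    static String toUnicode(int id) {
        if (id < 0 || id >= 94 * 94) {
            return ""; // outside the grid
        }
        // Each byte of the two-byte form is the grid coordinate plus 0xA1.
        byte[] bytes = {(byte) (id / 94 + 0xA1), (byte) (id % 94 + 0xA1)};
        return new String(bytes, GB2312);
    }

    public static void main(String[] args) {
        // Position 15 * 94 is the first Chinese character in the grid.
        System.out.println(toUnicode(15 * 94));
    }
}
```

Unassigned positions have no mapping, so the decoder substitutes a replacement character for them in this sketch; a caller that needs to detect reserved regions would have to check for that separately.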
-
getGB2312Id
public short getGB2312Id(char ch)
Transcode from Unicode to GB2312.
- Parameters:
ch - input character in Unicode, or character in Basic Latin range.
- Returns:
- position in GB2312
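The reverse direction can be sketched by encoding the character and turning the two bytes back into grid coordinates. This is an illustration under assumptions, not the Lucene implementation: the class and method names are hypothetical, the JDK is assumed to provide the GB2312 charset, and returning -1 for anything that does not encode to exactly two bytes is a choice made for this sketch.

```java
import java.nio.charset.Charset;

// Hypothetical illustration class; not part of the Lucene API.
final class UnicodeToGb2312IdSketch {
    // Assumption: the runtime provides the "GB2312" charset.
    private static final Charset GB2312 = Charset.forName("GB2312");

    // Returns row * 94 + column (zero-based), or -1 if ch has no two-byte form.
    static int toId(char ch) {
        byte[] bytes = String.valueOf(ch).getBytes(GB2312);
        if (bytes.length != 2) {
            return -1; // single-byte (Basic Latin) or unmappable input
        }
        int row = (bytes[0] & 0xFF) - 0xA1;
        int column = (bytes[1] & 0xFF) - 0xA1;
        return row * 94 + column;
    }

    public static void main(String[] args) {
        System.out.println(toId('\u554A')); // 1410
        System.out.println(toId('A'));      // -1
    }
}
```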
-
hash1
public long hash1(char c)
32-bit FNV hash function.
- Parameters:
c - input character
- Returns:
- hashcode
-
hash1
public long hash1(char[] carray)
32-bit FNV hash function.
- Parameters:
carray - character array
- Returns:
- hashcode
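For readers unfamiliar with FNV, the following sketch shows the textbook 32-bit FNV-1a variant applied to a character array. It is not taken from the Lucene source, and the values hash1 actually returns may differ; the class name, the choice of the FNV-1a variant, and the low-byte-then-high-byte order are all assumptions made for illustration. The result is kept in a long, mirroring the long return type of hash1.

```java
// Hypothetical illustration class; not part of the Lucene API.
final class Fnv1a32Sketch {
    // Standard 32-bit FNV offset basis and prime.
    static final long OFFSET_BASIS = 0x811C9DC5L;
    static final long PRIME = 0x01000193L;

    // FNV-1a: XOR in one byte, then multiply by the prime, modulo 2^32.
    static long hash(char[] carray) {
        long h = OFFSET_BASIS;
        for (char c : carray) {
            // Assumption: feed the low byte first, then the high byte.
            h = ((h ^ (c & 0xFF)) * PRIME) & 0xFFFFFFFFL;
            h = ((h ^ (c >>> 8)) * PRIME) & 0xFFFFFFFFL;
        }
        return h;
    }

    // Single-character convenience overload.
    static long hash(char c) {
        return hash(new char[] {c});
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(hash("abc".toCharArray())));
    }
}
```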
-
hash2
public int hash2(char c)
djb2 hash algorithm (k=33), first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
- Parameters:
c - character
- Returns:
- hashcode
-
hash2
public int hash2(char[] carray)
djb2 hash algorithm (k=33), first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
- Parameters:
carray - character array
- Returns:
- hashcode
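The two djb2 variants described above can be sketched as follows. This is a generic illustration rather than the Lucene code: the class and method names are hypothetical, the seed 5381 is the value conventionally used with djb2 and is not stated in this page, and the sketch hashes whole char values where an implementation might instead process each char byte by byte.

```java
// Hypothetical illustration class; not part of the Lucene API.
final class Djb2Sketch {
    // Conventional djb2 seed (assumption; not stated in this page).
    static final int SEED = 5381;

    // Additive variant: hash(i) = hash(i - 1) * 33 + str[i]
    static int hashAdd(char[] carray) {
        int h = SEED;
        for (char c : carray) {
            h = h * 33 + c; // int overflow wraps, which is intended
        }
        return h;
    }

    // XOR variant: hash(i) = hash(i - 1) * 33 ^ str[i]
    static int hashXor(char[] carray) {
        int h = SEED;
        for (char c : carray) {
            h = (h * 33) ^ c; // multiply first, then XOR
        }
        return h;
    }

    public static void main(String[] args) {
        char[] a = {'a'};
        System.out.println(hashAdd(a)); // 177670
        System.out.println(hashXor(a)); // 177604
    }
}
```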
-
-