DataparkSearch Engine 4.31 reference manual: The Web searching software
Chapter 7. Languages support
Chinese, Japanese, Korean and Thai writing does not put spaces between the words of a phrase, as Western languages do. When indexing documents in these languages, phrases must therefore be additionally segmented into words.
For Japanese phrase segmenting, either ChaSen, a morphological system for the Japanese language, or MeCab, a Japanese morphological analyser, is used. One of these systems must therefore be installed before you configure and build DataparkSearch.
To enable Japanese phrase segmenting, pass the --enable-chasen or --enable-mecab switch to configure.
For Chinese phrase segmenting, a frequency dictionary of Chinese words is used. The segmenting itself is done by a dynamic programming method that maximizes the cumulative frequency of the produced words.
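The dynamic programming idea can be illustrated with a minimal Python sketch. This is not DataparkSearch's actual implementation; the function name segment, the max_word_len limit, the rule that an unknown single character counts as a word of frequency 0, and the toy frequencies in TOY_FREQ are all assumptions made for illustration only.

```python
def segment(text, freq, max_word_len=4):
    """Split text into words so that the sum of word frequencies is maximal.

    freq maps known words to non-negative frequencies (hypothetical data).
    A single character missing from freq is accepted with frequency 0,
    so every input can be segmented (an assumption of this sketch).
    """
    n = len(text)
    best = [0] * (n + 1)  # best[i]: highest cumulative frequency for text[:i]
    back = [0] * (n + 1)  # back[i]: start index of the last word in that split
    for i in range(1, n + 1):
        best[i] = -1  # lower than any reachable score
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in freq:
                score = best[j] + freq[word]
            elif i - j == 1:
                score = best[j]  # unknown single character, frequency 0
            else:
                continue  # unknown multi-character string is not a word
            if score > best[i]:
                best[i] = score
                back[i] = j
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        j = back[i]
        words.append(text[j:i])
        i = j
    words.reverse()
    return words


# Toy frequencies, invented for this example; real values come from
# a frequency dictionary file.
TOY_FREQ = {"北京": 50, "大学": 40, "学生": 30, "北京大学": 20, "生": 5}

print(segment("北京大学生", TOY_FREQ))  # ['北京', '大学', '生']
```

With these toy values, the split 北京 / 大学 / 生 has a cumulative frequency of 95, which beats both 北京 / 大 / 学生 (80) and 北京大学 / 生 (25), so it is the one returned.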
To enable Chinese phrase segmenting, enable support for Chinese charsets when configuring DataparkSearch, and specify the frequency dictionary of Chinese words with the LoadChineseList command in the indexer.conf file.
LoadChineseList [charset dictionaryfilename]
By default, the GB2312 charset and the mandarin.freq dictionary are used.
Note: You need to download the frequency dictionaries from our web site or from one of our mirrors; see Section 1.2.
For Thai phrase segmenting, a frequency dictionary of Thai words is used. The segmenting itself is done in the same way as for Chinese.
To enable Thai phrase segmenting, specify the frequency dictionary of Thai words with the LoadThaiList command in the indexer.conf file.
LoadThaiList [charset dictionaryfilename]
By default, the tis-620 charset and the thai.freq dictionary are used.
Note: You need to download the frequency dictionaries from our web site or from one of our mirrors; see Section 1.2.
For Korean phrase segmenting, a frequency dictionary of Korean words is used. The segmenting itself is done in the same way as for Chinese.
To enable Korean phrase segmenting, specify the frequency dictionary of Korean words with the LoadKoreanList command in the indexer.conf file.
LoadKoreanList [charset dictionaryfilename]
By default, the euc-kr charset and the korean.freq dictionary are used.
Note: You need to download the frequency dictionaries from our web site or from one of our mirrors; see Section 1.2.