DataparkSearch Engine 4.45 reference manual
The Web searching software
Copyright © 2003-2007 by Datapark corp.
Copyright © 2001-2003 by Lavtech.com corp.
Table of Contents
1. Introduction
1.1. DataparkSearch Features
1.2. Where to get DataparkSearch
1.3. Disclaimer
1.4. Authors
1.4.1. Contributors
2. Installation
2.1. SQL database requirements
2.2. Supported operating systems
2.3. Tools required for installation
2.4. Installing DataparkSearch
2.5. Possible installation problems
2.6. Installation registration
3. Indexing
3.1. Indexing in general
3.1.1. Configuration
3.1.2. Running indexer
3.1.3. How to create SQL table structure
3.1.4. How to drop SQL table structure
3.1.5. Subsection control
3.1.6. How to clear database
3.1.7. Database Statistics
3.1.8. Link validation
3.1.9. Parallel indexing
3.2. Supported HTTP response codes
3.3. Content-Encoding support
3.4. Specifying WEB space to be indexed
3.4.1. Server command
3.4.2. Realm command
3.4.3. Subnet command
3.4.4. Using different parameters for a server and its subsections
3.4.5. Default indexer behavior
3.4.6. Using indexer -f <filename>
3.5. Aliases
3.5.1. Alias indexer.conf command
3.5.2. Different aliases for server parts
3.5.3. Using alias in Server command
3.5.4. Using alias in Realm command
3.5.5. AliasProg command
3.5.6. ReverseAlias command
3.5.7. Alias in search.htm search template
3.6. ServerTable
3.6.1. Loading servers table
3.6.2. Server table structure
3.6.3. Flushing ServerTable
3.7. External parsers
3.7.1. Supported parser types
3.7.2. Setting up parsers
3.7.3. Avoid indexer hang on parser execution
3.7.4. Pipes in parser's command line
3.7.5. Charsets and parsers
3.7.6. DPS_URL environment variable
3.7.7. Some third-party parsers
3.8. Other commands used in indexer.conf
3.8.1. Include command
3.8.2. DBAddr command
3.8.3. VarDir command
3.8.4. NewsExtensions command
3.8.5. SyslogFacility command
3.8.6. LocalCharset command
3.8.7. ForceIISCharset1251 command
3.8.8. StopwordFile command
3.8.9. LangMapFile command
3.8.10. Word length commands
3.8.11. MaxDocSize command
3.8.12. MinDocSize command
3.8.13. IndexDocSizeLimit command
3.8.14. URLSelectCacheSize command
3.8.15. URLDumpCacheSize command
3.8.16. UseCRC32URLId command
3.8.17. HTTPHeader command
3.8.18. Allow command
3.8.19. Disallow command
3.8.20. CheckOnly command
3.8.21. HrefOnly command
3.8.22. CheckMp3 command
3.8.23. CheckMp3Only command
3.8.24. IndexIf command
3.8.25. NoIndexIf command
3.8.26. HoldBadHrefs command
3.8.27. DeleteOlder command
3.8.28. UseRemoteContentType command
3.8.29. AddType command
3.8.30. ParserTimeOut command
3.8.31. Period command
3.8.32. PeriodByHops command
3.8.33. ExpireAt command
3.8.34. UseDateHeader command
3.8.35. Tag command
3.8.36. TagIf command
3.8.37. Category command
3.8.38. CategoryIf command
3.8.39. DefaultLang command
3.8.40. MaxHops command
3.8.41. TrackHops command
3.8.42. MaxDepth command
3.8.43. MaxDocsPerServer command
3.8.44. MaxNetErrors command
3.8.45. ReadTimeOut command
3.8.46. DocTimeOut command
3.8.47. NetErrorDelayTime command
3.8.48. Cookies command
3.8.49. Robots command
3.8.50. RobotsPeriod command
3.8.51. CrawlDelay command
3.8.52. DetectClones command
3.8.53. Section command
3.8.54. HrefSection command
3.8.55. Index command
3.8.56. RemoteCharset command
3.8.57. URLCharset command
3.8.58. ProxyAuthBasic command
3.8.59. Proxy command
3.8.60. AuthBasic command
3.8.61. ServerWeight command
3.8.62. OptimizeAtUpdate command
3.8.63. SkipUnreferred command
3.8.64. Bind command
3.8.65. URL command
3.8.66. ServerDB, RealmDB, SubnetDB and URLDB commands
3.9. Extended indexing features
3.9.1. Indexing SQL database tables (htdb: virtual URL scheme)
3.9.2. Indexing binaries output (exec: and cgi: virtual URL schemes)
3.9.3. Mirroring
3.10. Using syslog
3.11. Storing compressed document copies
3.11.1. Configure stored
3.11.2. How stored works
3.11.3. Using stored during search
4. DataparkSearch HTML parser
4.1. Tag parser
4.2. Special characters
4.3. META tags
4.4. Links
4.5. Comments
4.6. Body patterns
5. Storing data
5.1. SQL storage types
5.1.1. General storage information
5.1.2. Various modes of words storage
5.1.3. Storage mode - single
5.1.4. Storage mode - multi
5.1.5. Storage mode - crc
5.1.6. Storage mode - crc-multi
5.1.7. Storage mode - cache
5.1.8. SQL structure notes
5.1.9. Additional features of non-CRC storage modes
5.2. Cache mode storage
5.2.1. Introduction
5.2.2. Cache mode word indexes structure
5.2.3. Cache mode tools
5.2.4. Starting cache mode
5.2.5. Optional usage of several splitters
5.2.6. Using run-splitter script
5.2.7. Doing search
5.2.8. Using search limits
5.3. DataparkSearch performance issues
5.3.1. searchd usage recommendation
5.3.2. Memory based filesystem (mfs) usage recommendation
5.3.3. URLInfoSQL command
5.3.4. MarkForIndex command
5.3.5. MySQL performance
5.3.6. Post-indexing optimization
5.3.7. Asynchronous resolver library
5.4. SearchD support
5.4.1. Why use searchd
5.4.2. Starting searchd
5.5. Oracle notes
5.5.1.
5.5.2. Compilation, Installation and Configuration
6. Subsections
6.1. Tags
6.1.1. Tags in SQL version
6.2. Categories
7. Languages support
7.1. Character sets
7.1.1. Supported character sets
7.1.2. Character sets aliases
7.1.3. Recoding
7.1.4. Recoding at search time
7.1.5. Document charset detection
7.1.6. Automatic charset guesser
7.1.7. Default charset
7.1.8. Default Language
7.1.9. Recoding during search
7.2. Making multi-language search pages
7.2.1. How does it work?
7.2.2. Possible troubles
7.3. Segmenters for Chinese, Japanese, Korean and Thai languages
7.3.1. Japanese language phrase segmenter
7.3.2. Chinese language phrase segmenter
7.3.3. Thai language phrase segmenter
7.3.4. Korean language phrase segmenter
7.4. Multilingual servers support
8. Searching documents
8.1. Using search front-ends
8.1.1. Performing search
8.1.2. Search parameters
8.1.3. Changing different document parts weights at search time
8.1.4. Using front-end with an shtml page
8.1.5. Using several templates
8.1.6. Advanced boolean search
8.1.7. The Verity Query Language, VQL
8.1.8. How search handles expired documents
8.2. mod_dpsearch module for Apache httpd
8.2.1. Why use mod_dpsearch
8.2.2. Configuring mod_dpsearch
8.3. How to write search result templates
8.3.1. Template sections
8.3.2. Variables section
8.3.3. Includes in templates
8.3.4. Conditional template operators
8.3.5. Security issues
8.4. Designing search.html
8.4.1. How the results page is created
8.4.2. Your HTML
8.4.3. Forms considerations
8.4.4. Relative links in search.htm
8.4.5. Adding Search form to other pages
8.5. Relevance
8.5.1. Ordering documents
8.5.2. Relevance calculation
8.5.3. Popularity rank
8.5.4. Boolean search
8.5.5. Crosswords
8.5.6. The Summary Extraction Algorithm (SEA)
8.6. Search queries tracking
8.7. Search results cache
8.8. Fuzzy search
8.8.1. Ispell
8.8.2. Aspell
8.8.3. Synonyms
8.8.4. Accent insensitive search
8.8.5. Acronyms and abbreviations
9. Miscellaneous
9.1. Reporting bugs
9.1.1. Core dump reports
9.2. Using libdpsearch library
9.2.1. dps-config script
9.2.2. DataparkSearch API
9.3. Database schema
A. Donations
Index
List of Tables
3-1. Verbose levels
5-1. Cache limit types
7-1. Language groups
7-2. Charsets aliases
8-1. Available search parameters
8-2. VQL operators supported by DataparkSearch
8-3. Configure-time parameters to tune relevance calculation (switches for configure)
9-1. server table schema
9-2. Several server parameter values in the srvinfo table