3.10. Other commands used in indexer.conf

3.10.1. Include command

You may include another configuration file at any place in indexer.conf using the Include <filename> command. The path is treated as absolute if <filename> starts with "/":


Include /usr/local/dpsearch/etc/inc1.conf

Otherwise the path is treated as relative:


Include inc1.conf

3.10.2. DBAddr command

The DBAddr command is a URL-style database description. It specifies the options (type, host, database name, port, user and password) used to connect to the SQL database. It should be used before any other commands. You may specify several DBAddr commands; in this case DataparkSearch will merge the results from every database specified. The command has global effect for the whole config file. Format:


DBAddr <DBType>:[//[DBUser[:DBPass]@]DBHost[:DBPort]]/DBName/[?[dbmode=mode]{&<parameter name>=<parameter value>}]

Note: ODBC related. Use DBName to specify the ODBC data source name (DSN). DBHost does not matter; use "localhost".

Note: Solid related. Use DBHost to specify the Solid server. DBName does not matter for Solid.

You may use CGI-like encoding for DBUser and DBPass if you need to use special characters in the user name or password. For example, if your password is ABC@DEF, you should write it as ABC%40DEF.
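
As an illustration of this encoding, a DBAddr line for the password ABC@DEF might look like the sketch below. The user name foo, the host localhost and the database name dpsearch are placeholders reused from the other examples in this section.

```
# The "@" in the password ABC@DEF is written as %40
DBAddr mysql://foo:ABC%40DEF@localhost/dpsearch/?dbmode=single
```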

The currently supported DBType values are mysql, pgsql, msql, solid, mssql, oracle, ibase and sqlite. The value does not actually matter when native library support is used, but ODBC users should specify one of the supported values. If your database type is not supported, you may use "unknown" instead.

MySQL and PostgreSQL users can specify the path to a Unix socket when connecting to localhost: mysql://foo:bar@localhost/dpsearch/?socket=/tmp/mysql.sock

If you are using PostgreSQL and do not specify a hostname, e.g. pgsql://user:password@/dbname/, then PostgreSQL will not connect via TCP but will use the default Unix socket.

dbmode parameter. You may also select the database mode of word storage. When "single" is specified, all words are stored in the same table (file). If "multi" is selected, words are located in different tables (files) depending on their lengths. "multi" mode is usually faster but requires more tables (files). If "crc" mode is selected, DataparkSearch stores 32-bit integer word IDs calculated by the HASH32 algorithm instead of the words. This mode requires less disk space and is faster than the "single" and "multi" modes; however, it does not support substring searches. "crc-multi" uses the same storage structure as the "crc" mode, but also stores words in different tables (files) depending on word length, like the "multi" mode. The default mode is "single".

stored parameter. Format: stored=StoredHost[:StoredPort]. This parameter specifies the host and, optionally, the port where the stored daemon is running. It is needed if you plan to use document excerpts and cached copies.

cached parameter. Format: cached=CachedHost[:CachedPort]. Use the cached daemon at the given host and, if specified, port. It is required for cache storage mode only (see Section 5.2). Each indexer will connect to cached at the given address at startup.

charset parameter. Format: charset=DBCharacterSet. This parameter can be used to specify the database connection charset. The charset specified by DBCharacterSet should match the charset specified by the LocalCharset command.

Example:


DBAddr          mysql://foo:bar@localhost/dpsearch/?dbmode=single
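
As the format above indicates, several parameters can be combined in one DBAddr line using "&". The following is only a sketch: the stored host and the charset name are hypothetical values chosen for illustration and should be replaced with those of your installation.

```
# Store word IDs in tables split by word length ("crc-multi"),
# use the stored daemon on localhost (port omitted, since it is optional),
# and set the connection charset (assumed here to match LocalCharset)
DBAddr mysql://foo:bar@localhost/dpsearch/?dbmode=crc-multi&stored=localhost&charset=utf-8
```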

3.10.3. VarDir command

You may choose an alternative working directory for cache mode:


VarDir /usr/local/dpsearch/var

3.10.4. NewsExtensions command

Whether to enable news extensions. Default value is no.


NewsExtensions yes

3.10.5. SyslogFacility command

This is used if DataparkSearch was compiled with syslog support and you do not want the default value. The argument is the same as that used in the syslog.conf file. For the list of possible facilities see syslog.conf(5).


SyslogFacility local7

3.10.6. Word length commands

You may change the default length range of words stored in the database. By default, words with a length from 1 to 32 are stored.


MinWordLength 1
MaxWordLength 32

3.10.7. MaxDocSize command

This command specifies the maximum document size. The default value is 1048576 (1 megabyte). It takes global effect for the whole config file.


MaxDocSize 1048576

3.10.8. MinDocSize command

This command is used to check only URLs with a content size less than the specified value. The default value is 0. It takes global effect for the whole config file.


MinDocSize 1024

3.10.9. IndexDocSizeLimit command

Use this command to specify the maximum amount of data stored in the index per document. The default value is 0, which means no limit. Takes effect till the next IndexDocSizeLimit command.


IndexDocSizeLimit 65536

3.10.10. URLSelectCacheSize command

Select number of targets to index at once. Default value is 1024.


URLSelectCacheSize 10240

3.10.11. URLDumpCacheSize command

The number of URLs to select at once when writing cache mode indexes, preloading URL data or calculating the Popularity Rank. The default value is 100000.


URLDumpCacheSize 10240

3.10.12. UseCRC32URLId command

Switches the generation of URL IDs using HASH32 on or off. The default value is "no".


UseCRC32URLId yes

3.10.13. HTTPHeader command

You may add your own headers to the indexer's HTTP request. You should not use the "If-Modified-Since" and "Accept-Charset" headers, because these headers are composed by the indexer itself. "User-Agent: DataparkSearch/version" is sent too, but you may override it. The command has global effect for the whole configuration file.


HTTPHeader "User-Agent: My_Own_Agent"
HTTPHeader "Accept-Language: ru, en"
HTTPHeader "From: webmaster@mysite.com"

3.10.14. Allow command


Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

Use this command to allow URLs that match (or do not match) the given arguments. The first three optional parameters describe the type of comparison; the default values are Match, NoCase and String. Use NoCase or Case to choose case-insensitive or case-sensitive comparison. Use Regex to choose regular expression comparison, or String to choose string comparison with wildcards. The wildcards are '*' for any number of characters and '?' for one character. Because '?' and '*' have a special meaning in the String match type, use Regex to describe documents with '?' and '*' signs in the URL. String match is much faster than Regex, so use String where possible. You may use several arguments in one Allow command, and you may use this command any number of times. It takes global effect for the config file. Note that DataparkSearch automatically adds one "Allow regex .*" command after reading the config file, which means that everything that is not disallowed is allowed.

Examples


#  Allow everything:
Allow *
#  Allow everything but .php .cgi .pl extensions case insensitively using regex:
Allow NoMatch Regex \.php$|\.cgi$|\.pl$
#  Allow .HTM extension case sensitively:
Allow Case *.HTM

3.10.15. Disallow command


Disallow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

Use this command to disallow URLs that match (or do not match) the given arguments. The meaning of the first three optional parameters is exactly the same as for the Allow command. You can use several arguments in one Disallow command. It takes global effect for the config file. Examples:


# Disallow URLs that are not in udm.net domains using "string" match:
Disallow NoMatch *.udm.net/*
# Disallow any except known extensions and directory index using "regex" match:
Disallow NoMatch Regex \/$|\.htm$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$
# Exclude cgi-bin and non-parsed-headers using "string" match:
Disallow */cgi-bin/* *.cgi */nph-*
# Exclude anything with '?' sign in URL. Note that '?' sign has a 
# special meaning in "string" match, so we have to use "regex" match here:
Disallow Regex  \?

3.10.16. CheckOnly command


CheckOnly [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

The meaning of the first three optional parameters is exactly the same as for the Allow command. The indexer will use the HEAD instead of the GET HTTP method for URLs that match (or do not match) the given patterns. This means that the file will only be checked for existence and will not be downloaded. This is useful for zip, exe, arj and other binary files. Note that you can instead disallow those files with the Disallow command (see Section 3.10.15). You may use several arguments in one CheckOnly command. The command is also useful, for example, for searching through URL names rather than contents (a la FTP-search). It takes global effect for the config file. Examples:


# Check some known non-text extensions using "string" match:
CheckOnly *.b	  *.sh   *.md5
# or check ANY except known text extensions using "regex" match:
CheckOnly NoMatch Regex \/$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$

3.10.17. HrefOnly command


HrefOnly [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

The meaning of the first three optional parameters is exactly the same as for the Allow command. Use this command to scan HTML pages whose URLs match (or do not match) the given arguments for the "href" attributes of tags, without indexing the contents of those pages. The command has global effect for the whole configuration file. When indexing large mailing list archives, for example, the index and thread index pages (like mail.10.html, thread.21.html, etc.) should be scanned for links but should not be indexed:


HrefOnly */mail*.html */thread*.html

3.10.18. CheckMp3 command


CheckMp3 [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]

The meaning of the first three optional parameters is exactly the same as for the Allow command. If a URL matches the given rules, the indexer will download only a small part of the document and try to find MP3 tags in it. On success, the indexer will parse the MP3 tags; otherwise it will download the whole document and parse it as usual. Note: this works only with servers that support the HTTP/1.1 protocol, because the "Range: bytes" header is used to download the MP3 tag.


CheckMp3 *.bin *.mp3

3.10.19. CheckMp3Only command


CheckMP3Only [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]

The meaning of the first three optional parameters is exactly the same as for the Allow command. If a URL matches the given rules, the indexer will, as with the CheckMp3 command, download only a small part of the document and try to find MP3 tags. On success, the indexer will parse the MP3 tags; otherwise it will NOT download the whole document.


CheckMP3Only *.bin *.mp3

3.10.20. IndexIf command


IndexIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Use this command to allow indexing if the value of the section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.10.14).

Example


IndexIf regex Title Manual
IndexIf body "*important detail*"

3.10.21. NoIndexIf command


NoIndexIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Use this command to disallow indexing if the value of the section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.10.14).

Example


NoIndexIf regex Title Sex
NoIndexIf body *xxx*

3.10.22. HoldBadHrefs command


HoldBadHrefs <time>

How long to hold URLs with an erroneous status before deleting them from the database. For example, if a host is down, the indexer will not delete pages from this site immediately, and search will use the previous content of these pages. However, if the site does not respond for a month, it is probably time to remove these pages from the database. For the <time> format see the description of the Period command in Section 3.10.26.


HoldBadHrefs 30d

3.10.23. DeleteOlder command


DeleteOlder <time>

How long to hold URLs before deleting them from the database. For example, when indexing news sites, you may automatically delete old news articles after the specified period. For the <time> format see the description of the Period command in Section 3.10.26. The default value is 0, which means "do not check". You may specify several DeleteOlder commands, for example one for every Server command.


DeleteOlder 7d

3.10.24. UseRemoteContentType command


UseRemoteContentType yes/no

This command specifies whether the indexer should get the content type from the HTTP server headers (yes) or from its AddType settings (no). If set to "no" and the indexer cannot determine the content type using its AddType settings, it will use the HTTP header. Default: yes.


UseRemoteContentType yes

3.10.25. AddType command


AddType [String|Regex] [Case|NoCase] <mime type> <arg> [<arg>...]

This command associates filename extensions (for services that don't automatically include them) with their MIME types. Currently the "file:" protocol uses these commands. Use the optional first two parameters to choose the comparison type. The default type is "String" "Case" (case-sensitive string match with '?' and '*' wildcards for one and several characters respectively).


AddType image/x-xpixmap	*.xpm

3.10.26. Period command


Period <time>

Sets the reindex period. <time> has the form 'xxxA[yyyB[zzzC]]' (spaces are allowed between xxx and A, between A and yyy, and so on), where xxx, yyy and zzz are numbers (which can be negative) and A, B and C can each be one of the following letters (the same as in the strptime/strftime functions):


 s - second
 M - minute
 h - hour
 d - day
 m - month
 y - year

Examples:


 15s - 15 seconds
 4h30M - 4 hours and 30 minutes
 1y6m-15d - 1 year and 6 months minus 15 days
 1h-10M+1s - 1 hour minus 10 minutes plus 1 second

If you specify only a number without any letter, the time is assumed to be given in seconds. Can be set many times before the Server command and takes effect till the end of the config file or till the next Period command.


Period 7d
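
Because a Period command stays in effect until the next Period command, different servers can be given different reindex periods. A minimal sketch, assuming two hypothetical servers and using the Server command as it appears elsewhere in this section:

```
# Reindex the news site every day
Period 1d
Server http://news.example.com/
# Reindex the archive once a month (lowercase "m" is month, uppercase "M" is minute)
Period 1m
Server http://archive.example.com/
```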

3.10.27. PeriodByHops command


PeriodByHops <hops> [ <time> ]

Sets the reindex period on a per-<hops> basis. The format of <time> is the same as for Period.

Can be set many times before the Server command and takes effect till the end of the config file or till the next PeriodByHops command with the same <hops> value. If the <time> parameter is omitted, the previously defined value is undefined.

If no PeriodByHops command is specified for a given <hops> value, the value defined by the Period command is used.
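
The following sketch shows how PeriodByHops might be combined with Period. The hops values, the periods and the server name are hypothetical and chosen only for illustration.

```
# Default reindex period for all documents
Period 7d
# Documents with a hops value of 1 are reindexed daily
PeriodByHops 1 1d
# Documents with a hops value of 2 are reindexed every 3 days
PeriodByHops 2 3d
Server http://www.example.com/
# Undefine the period set for hops value 2; the Period value is used again
PeriodByHops 2
```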

3.10.28. ExpireAt command


ExpireAt [ A [ B [ C [ D [ E ]]]]]

This command allows you to specify the exact expiration time for documents. It may be specified on a per Server/Realm basis and takes effect till the end of the config file or till the next ExpireAt command. ExpireAt specified without any arguments disables the previously specified value. The arguments are:

 A - minute, may be * or 0-59
 B - hour, may be * or 0-23
 C - day of month, may be * or 1-31
 D - month, may be * or 1-12
 E - day of week, may be * or 0-6, where 0 is Sunday

The ExpireAt command has higher priority than the Period or PeriodByHops commands.
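
A few ExpireAt lines are sketched below. The interpretation in the comments assumes that the five fields combine the way crontab fields do, which the description above suggests but does not state explicitly; the server names are hypothetical.

```
# Minute 30, hour 3, any day: documents expire at 03:30 every day
ExpireAt 30 3 * * *
Server http://news.example.com/
# Minute 0, hour 6, day of week 0: documents expire at 06:00 on Sundays
ExpireAt 0 6 * * 0
Server http://weekly.example.com/
# No arguments: disable the previously specified value
ExpireAt
Server http://www.example.com/
```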

3.10.29. UseDateHeader command


UseDateHeader yes|no

Use Date header if no Last-Modified header is sent by remote web-server. Default value: no.

3.10.30. Tag command


Tag <string>

Use this field for your own purposes, for example to group several servers into one group. During search you will be able to limit the URLs searched by their tags. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next Tag command. The default value is an empty string.
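
As a sketch of grouping servers by tag, the fragment below assigns one tag to a single server and another tag to two servers. The tag values and server names are hypothetical.

```
# One server tagged "sport"
Tag sport
Server http://sport.example.com/
# The next Tag command replaces the previous one for the servers that follow
Tag news
Server http://news.example.com/
Server http://daily.example.com/
```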

3.10.31. TagIf command


TagIf <tag> [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Marks the document with the <tag> tag if the value of the section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.10.14).

Example


TagIf Docs regex Title Manual

3.10.32. Category command


Category <string>

You may distribute documents between nested categories. Category is a string in hex number notation. You may have up to 6 levels with 256 members per level. Empty category means the root of category tree. Take a look into Section 6.2 for more information.


# This command means a category on first level:
Category AA
# This command means a category on 5th level:
Category FFAABBCCDD

3.10.33. CategoryIf command


CategoryIf <category> [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Marks the document with the <category> category if the value of the section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.10.14).

Example


CategoryIf 010F regex Title "JOB ID"

3.10.34. MaxHops command


MaxHops <number>

Maximum distance in "mouse clicks" from the start URL. The default value is 256. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next MaxHops command.


MaxHops 256

3.10.35. TrackHops command


TrackHops yes|no

This command enables or disables hops tracking during reindexing. The default value is no. If enabled, the hops value for a URL is recalculated when reindexing. Otherwise the hops value is calculated only once, when the URL is inserted into the database.


TrackHops yes

3.10.36. MaxDepth command


MaxDepth <number>

Maximum directory depth of url. Default value is 16. Can be set multiple times before "Server" command and takes effect till the end of config file or till next MaxDepth command.


MaxDepth 2

3.10.37. MaxDocsPerServer command


MaxDocsPerServer <number>

Limits the number of documents retrieved from a Server. The default value is -1, which means no limit. If set to a positive value, no more than the given number of pages will be indexed from one server during this run of the indexer. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next MaxDocsPerServer command.


MaxDocsPerServer 100

3.10.38. MaxNetErrors command


MaxNetErrors <number>

Maximum number of network errors for each server. The default value is 16. Use 0 for an unlimited number of errors. If there are too many network errors on some server (server is down, host unreachable, etc.), the indexer will make no more than <number> attempts to connect to this server. Takes effect till the end of the config file or till the next MaxNetErrors command.


MaxNetErrors 16

3.10.39. ReadTimeOut command


ReadTimeOut <time>

Connect timeout and stalled connection timeout. For the <time> format see Section 3.10.26. The default value is 30 seconds. Can be set any number of times before the Server command and takes effect till the end of the config file or till the next ReadTimeOut command.


ReadTimeOut 30s

3.10.40. DocTimeOut command


DocTimeOut <time>

Maximum amount of time the indexer spends downloading one document. For the <time> format see Section 3.10.26. The default value is 90 seconds. Can be set any number of times before the Server command and takes effect till the end of the config file or till the next DocTimeOut command.


DocTimeOut 1M30s

3.10.41. NetErrorDelayTime command


NetErrorDelayTime <time>

Specifies the document processing delay if a network error has occurred. For the <time> format see Section 3.10.26. The default value is one day.


NetErrorDelayTime 1d

3.10.42. Cookies command


Cookies yes/no

Enables/Disables the support for HTTP cookies. Command may be used several times before Server command and takes effect till the end of config file or till next Cookies command. Default value is "no".


Cookies yes

3.10.43. Robots command


Robots yes/no

Allows/disallows using robots.txt and <META NAME="robots" ...> exclusions. Use no, for example for link validation of your server(s). Command may be used several times before Server command and takes effect till the end of config file or till next Robots command. Default value is "yes".


Robots yes

3.10.44. RobotsPeriod command

By default, robots.txt data is held in the SQL database for one week. You may change this period using the RobotsPeriod command:


RobotsPeriod <time>

For the <time> format see the description of the Period command in Section 3.10.26.


RobotsPeriod 30d

3.10.45. CrawlDelay command

Use this command to specify the default pause in seconds between consecutive fetches from the same server. This is similar to the crawl-delay command in the robots.txt file, but it can be specified in the indexer.conf file on a per-server basis. If no crawl-delay value is specified in robots.txt, the value of CrawlDelay is used. If crawl-delay is specified in robots.txt, the maximum of CrawlDelay and crawl-delay is used as the interval between consecutive fetches.
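
The argument syntax is not shown here, so the sketch below assumes that CrawlDelay takes a single number of seconds and is placed before the Server command it applies to. The server name is hypothetical.

```
# Assumed syntax: CrawlDelay <seconds>
# Pause 2 seconds between consecutive fetches from this server
CrawlDelay 2
Server http://www.example.com/
```

Under the rule described above, if that server's robots.txt specified a crawl-delay of 5, the interval between fetches would be 5 seconds; if it specified 1, or nothing at all, the interval would be 2 seconds.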

3.10.46. Section command


Section <string> <number> <maxlen> [ <pattern> <replacement> ]

where <string> is a section name and <number> is section ID between 0 and 255. Use 0 if you don't want to index some of these sections. It is better to use different IDs for different sections. In this case during search time you'll be able to give different weight to each section or even disallow some sections at a search time. <maxlen> argument contains a maximum length of section which will be stored in database. Use 0 for <maxlen>, if you don't want to store this section. <pattern> and <replacement> are a regex-like pattern and replacement to extract section value from document content.


# Standard HTML sections: body, title
Section	body			1	256
Section title			2	128
Section GoodName                3       128 "<h1>([^<]*)</h1>" "<b>GoodName:</b> $1"

3.10.47. HrefSection command


HrefSection <string> [ <pattern> <replacement> ]

where <string> is a section name, <pattern> and <replacement> are a regex-like pattern and replacement to extract section value from document content. Use this command to extract links from document content.


# Sections used for link extraction: a plain one and one with a custom pattern
HrefSection	link
HrefSection     NewLink "<newlink>([^<]*)</newlink>" "$1"

3.10.48. Index command


Index yes/no

Use "Index no" to prevent the indexer from storing words in the database. This is useful, for example, for link validation. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next Index command. The default value is "yes".


Index no
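
For the link validation case mentioned for both the Index and Robots commands, a configuration fragment might look like the sketch below. The server name is a placeholder.

```
# Link validation only: ignore robots exclusions and store no words
Robots no
Index no
Server http://my.server.com/
```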

3.10.49. ProxyAuthBasic command


ProxyAuthBasic login:passwd

Use HTTP proxy basic authorization. Can be used before every Server command and takes effect only for the next Server command. It should also come before the Proxy command. Example:


ProxyAuthBasic somebody:something  
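
The ordering described above (ProxyAuthBasic before Proxy, and both before Server) can be sketched as follows. The second server name is hypothetical.

```
# ProxyAuthBasic comes first, then Proxy, then the Server they apply to
ProxyAuthBasic somebody:something
Proxy atoll.anywhere.com:3128
Server http://my.server.com/
# ProxyAuthBasic applied only to the Server above; the Proxy setting
# stays in effect until the next Proxy command
Server http://other.server.com/
```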

3.10.50. Proxy command


Proxy your.proxy.host[:port]

Use a proxy rather than connecting directly. One can index FTP servers when using a proxy. The default port value, if not specified, is 3128 (Squid). If the proxy host is not specified, a direct connection will be used. Can be set before every Server command and takes effect till the end of the config file or till the next Proxy command. If no Proxy command is specified, the indexer will use a direct connection. Examples:


#           Proxy on atoll.anywhere.com, port 3128:
Proxy atoll.anywhere.com
#           Proxy on lota.anywhere.com, port 8090:
Proxy lota.anywhere.com:8090
#           Disable proxy (direct connect):
Proxy

3.10.51. AuthBasic command


AuthBasic login:passwd

Use basic HTTP authorization. Can be set before every Server command and takes effect only for the next Server command. Examples:


AuthBasic somebody:something  

# If you have password-protected directories, but the whole server is open, use:
AuthBasic login1:passwd1
Server http://my.server.com/my/secure/directory1/
AuthBasic login2:passwd2
Server http://my.server.com/my/secure/directory2/
Server http://my.server.com/

3.10.52. ServerWeight command


ServerWeight <number>

Server weight for Popularity Rank calculation (see Section 8.5.3). Default value is 1.


ServerWeight 1

3.10.53. OptimizeAtUpdate command


OptimizeAtUpdate yes

Specifies the word index optimization strategy. Default value: no. If enabled, this saves disk space but slows down indexing. May be placed in indexer.conf and cached.conf.

3.10.54. SkipUnreferred command


SkipUnreferred yes

Default value: no. Use this command to skip reindexing of unreferred documents. An unreferred document is a document with no links to it. This command requires link collection to be enabled (see Section 8.5.3).

3.10.55. Bind command


Bind 127.0.0.1

You may use this command to specify the local IP address if your system has several network interfaces.

3.10.56. ProvideReferer command


ProvideReferer yes

Use this command to provide Referer: request header for HTTP and HTTPS connections.