5.3. DataparkSearch performance issues

The cache mode is DataparkSearch's fastest storage mode. Use it if you need maximal search speed.

If your /var directory does not change after indexing is finished, you may disable file locking with the "ColdVar yes" command placed in search.htm (or in searchd.conf, if searchd is used). This saves the time otherwise spent on file locking.
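
For example, the directive is simply placed on a line of its own in the search-time configuration (search.htm here; put it in searchd.conf instead when searchd is used):

    # /var is read-only once indexing has finished, so skip file locking
    ColdVar yes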

5.3.1. searchd usage recommendation

If you plan to use ispell data, synonym or stopword lists, it is recommended to set up the searchd daemon to speed up searches (see Section 5.4). The searchd daemon preloads all these data and lists and holds them in memory. This reduces the average search query execution time.

Also, searchd can preload URL info data (20 bytes per indexed URL) and cache mode limits (4 or 8 bytes per URL, depending on the limit type). This also reduces the average search time.
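
For instance, with an index of about two million URLs, like the one described in Section 5.3.7 below, preloading URL info takes roughly 2,000,000 * 20 bytes, or about 40 MB of RAM, plus roughly 8 to 16 MB more for the cache mode limits.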

5.3.2. Memory based filesystem (mfs) usage recommendation

If you use the cache storage mode and have enough RAM on your machine, you may place the /usr/local/dpsearch/var directory on a memory based filesystem (mfs). This speeds up both indexing and searching.

If you do not have enough RAM to fit the whole /usr/local/dpsearch/var, you may instead place any of the /usr/local/dpsearch/var/tree, /usr/local/dpsearch/var/url or /usr/local/dpsearch/var/store directories on a memory filesystem.
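
As a sketch, assuming about 512 MB can be spared for it (adjust the size to your index), the directory can be mounted on a memory filesystem as shown below. Keep in mind that a memory filesystem is emptied on reboot, so its contents must be re-created or copied back to persistent storage.

    # FreeBSD: memory backed filesystem via md(4)
    mdmfs -s 512m md /usr/local/dpsearch/var

    # Linux equivalent: tmpfs
    mount -t tmpfs -o size=512m tmpfs /usr/local/dpsearch/var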

5.3.3. URLInfoSQL command

For the cache dbmode, you may use the "URLInfoSQL no" command to disable storing URL info in the SQL database. Note that with this command you will be unable to use limits by language and by Content-Type.
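
A minimal example, assuming the directive goes into indexer.conf next to your other cache mode settings (check your configuration layout):

    # do not store URL info in the SQL database
    # (language and Content-Type limits become unavailable)
    URLInfoSQL no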

5.3.4. MarkForIndex command

By default, DataparkSearch marks all URLs selected for indexing as indexed for 4 hours. This prevents simultaneous indexing of the same URL by different running indexer instances. For huge installations, however, this marking can take a noticeable amount of time. You may switch it off using "MarkForIndex no" in your indexer.conf file.
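
For example, in indexer.conf:

    # do not mark selected URLs as "being indexed" for 4 hours;
    # only safe if a single indexer instance runs at a time
    MarkForIndex no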

5.3.5. CheckInsertSQL command

By default, DataparkSearch tries to insert data into the SQL database regardless of whether it is already present there. On some systems this produces error log entries. To avoid such errors, you may enable an additional check of whether the data being inserted is new by specifying the "CheckInsertSQL yes" command in your indexer.conf.
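
For example, in indexer.conf:

    # check whether a row already exists before inserting it,
    # trading extra queries for a cleaner SQL error log
    CheckInsertSQL yes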

5.3.6. MySQL performance

MySQL users may declare DataparkSearch tables with the DELAY_KEY_WRITE=1 option. This makes index updates faster, as key writes are not flushed to disk until the table file is closed; with DELAY_KEY_WRITE, indexes are not updated on disk at all during normal operation.

With this option, indexes are kept only in memory and written to disk only as a last resort, on a FLUSH TABLES command or at mysqld shutdown. That flush can take minutes, and an impatient user who kills the MySQL server with kill -9 can break the index files this way. Another downside is that you should run myisamchk on these tables before starting mysqld, to ensure they are okay in case something killed mysqld in the middle of a write.

Because of this we did not include this table option in the default table structure. However, as the key information can always be regenerated from the data, you should not lose anything by using DELAY_KEY_WRITE. Use this option at your own risk.
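
As a sketch, the option can be added to an existing MyISAM table with ALTER TABLE; the table name "dict" below is only an example, as the real table names depend on your storage mode and schema version:

    -- enable delayed key writes for one of the DataparkSearch tables
    ALTER TABLE dict DELAY_KEY_WRITE=1;

    -- force the delayed keys out to disk, e.g. before a backup or shutdown
    FLUSH TABLES;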

5.3.7. Post-indexing optimization

This article was supplied by Randy Winch.

I have some performance numbers that some of you might find interesting. I'm using RH 6.2 with the 2.2.14-6.1.1 kernel update (allows files larger than 2 GB) and MySQL 3.23.18-alpha. I have just indexed most of our site using mnoGoSearch 3.0.18:


          mnoGoSearch statistics

    Status    Expired      Total
   -----------------------------
       200     821178    2052579 OK
       301        797      29891 Moved Permanently
       302          3          3 Moved Temporarily
       304          0          7 Not Modified
       400          0         99 Bad Request
       403          0          7 Forbidden
       404      30690     100115 Not found
       500          0          1 Internal Server Error
       503          0          1 Service Unavailable
   -----------------------------
     Total     852668    2182703

I optimize the data by dumping it into a file using SELECT * INTO OUTFILE, sorting it into word (CRC) order with the system sort utility, and then reloading it into the database using the procedure described in the MySQL online manual.
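
A simplified sketch of that dump/sort/reload cycle, assuming a MyISAM word table named dict whose first column is the word CRC (both names are only illustrative; the MySQL manual's bulk-loading advice, such as disabling keys during the load, applies here as well):

    -- dump the table into a flat file
    SELECT * INTO OUTFILE '/tmp/dict.txt' FROM dict;

    -- sort the dump by the first (word CRC) column with the system sort,
    --   shell: sort -n /tmp/dict.txt > /tmp/dict.sorted

    -- reload the sorted data
    DELETE FROM dict;
    LOAD DATA INFILE '/tmp/dict.sorted' INTO TABLE dict;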

The performance is wonderful. My favorite test is searching for "John Smith". The optimized database version takes about 13 seconds. The raw version takes about 73 seconds.


Search results: john : 620241 smith : 177096
Displaying documents 1-20 of total 128656 found

5.3.8. Asynchronous resolver library

Using ares, an asynchronous resolver library (dns/ares in the FreeBSD ports collection), allows DNS queries to be performed without blocking in every indexing thread. Please note that this also increases the number of concurrent queries to your DNS server.