DataparkSearch Engine 4.45 reference manual: The Web searching software
Chapter 5. Storing data
Cache mode is DataparkSearch's fastest storage mode. Use it if you need maximum search speed.
If the contents of your /var directory do not change once indexing has finished, you may disable file locking by placing the "ColdVar yes" command in search.htm (or in searchd.conf, if searchd is used). This saves the time otherwise spent on file locking.
If you plan to use ispell data, synonym lists or stopword lists, it is recommended to set up the searchd daemon to speed up searches (see Section 5.4). The searchd daemon preloads all of this data and keeps it in memory, which reduces the average execution time of a search query.
searchd can also preload URL info data (20 bytes per indexed URL) and cache mode limits (4 or 8 bytes per URL, depending on the limit type). This further reduces the average search time.
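The per-URL figures above make it possible to estimate roughly how much memory this preloading needs. The following is a minimal sketch of that arithmetic only. The function name `preload_bytes` is hypothetical, and it assumes each preloaded limit costs its 4 or 8 bytes once per URL and ignores any other overhead searchd may have.

```python
URL_INFO_BYTES = 20  # per indexed URL, figure taken from the text above


def preload_bytes(n_urls, limit_sizes=()):
    """Rough estimate of the memory needed for preloaded data.

    n_urls      -- number of indexed URLs
    limit_sizes -- per-URL size in bytes (4 or 8) of each preloaded limit;
                   which limit types use which size is not modelled here
    """
    for size in limit_sizes:
        if size not in (4, 8):
            raise ValueError("a limit takes 4 or 8 bytes per URL")
    return n_urls * (URL_INFO_BYTES + sum(limit_sizes))


if __name__ == "__main__":
    # Two million URLs with one 4-byte and one 8-byte limit preloaded.
    estimate = preload_bytes(2_000_000, (4, 8))
    print(f"{estimate} bytes, about {estimate / 1024 ** 2:.1f} MiB")
```

Under these assumptions, URL info alone for one million URLs comes to 20,000,000 bytes, so preloading is usually cheap compared with the index itself.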
If you use cache storage mode and have enough RAM, you may place the /usr/local/dpsearch/var directory on a memory-based filesystem (mfs). This speeds up both indexing and searching.
If you do not have enough RAM to hold all of /usr/local/dpsearch/var, you may instead place any of the /usr/local/dpsearch/var/tree, /usr/local/dpsearch/var/url or /usr/local/dpsearch/var/store directories on a memory filesystem.
For dbmode cache, you may use the "URLInfoSQL no" command to disable storing URL info in the SQL database. With this command, however, you will be unable to use limits by language and by Content-Type.
By default, DataparkSearch marks all URLs selected for indexing as indexed for 4 hours. This prevents the same URL from being indexed simultaneously by different running indexer instances. On huge installations, however, this marking can take some time to process. You may switch it off with "MarkForIndex no" in your indexer.conf file.
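The idea behind this marking can be illustrated with a small lease model: a URL that has been claimed is treated as taken until the lease expires, so a second indexer skips it. This is only a toy sketch of the concept, not how DataparkSearch implements it. The class `UrlLeases`, its methods and the in-memory dictionary are all hypothetical.

```python
import time

LEASE_SECONDS = 4 * 60 * 60  # the 4-hour window mentioned in the text


class UrlLeases:
    """Toy in-memory model of 'mark as indexed for a while'."""

    def __init__(self, lease_seconds=LEASE_SECONDS, clock=time.time):
        self._lease = lease_seconds
        self._clock = clock      # injectable so the behaviour can be tested
        self._expires = {}       # url -> time at which the mark lapses

    def claim(self, url):
        """Return True if the caller may index url, False if it is marked."""
        now = self._clock()
        if self._expires.get(url, 0) > now:
            return False         # another indexer holds the mark
        self._expires[url] = now + self._lease
        return True


if __name__ == "__main__":
    leases = UrlLeases()
    print(leases.claim("http://example.com/"))  # first indexer gets the URL
    print(leases.claim("http://example.com/"))  # second one is turned away
```

Every claim costs a lookup and a write, which is the kind of bookkeeping that adds up on a huge installation and that "MarkForIndex no" avoids.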
MySQL users may declare the DataparkSearch tables with the DELAY_KEY_WRITE=1 option. This makes index updates faster, because they are not flushed to disk until the table file is closed. With DELAY_KEY_WRITE, indexes are not updated on disk during normal operation.
Instead, indexes are processed only in memory and written to disk as a last resort, on a FLUSH TABLES command or at mysqld shutdown. This can take minutes, and an impatient user may kill -9 the MySQL server and corrupt the index files as a result. Another downside is that you should run myisamchk on these tables before starting mysqld, to ensure that they are intact in case something killed mysqld while it was running.
For this reason we did not include this table option in the default table structure. However, since the key information can always be regenerated from the data, you should not lose anything by using DELAY_KEY_WRITE. Use this option at your own risk.
This article was supplied by Randy Winch <gumby@cafes.net>
I have some performance numbers that some of you might find interesting. I'm using RH 6.2 with the 2.2.14-6.1.1 kernel update (allows files larger than 2 gig) and mysql 3.23.18-alpha. I have just indexed most of our site using mnoGoSearch 3.0.18:
mnoGoSearch statistics

Status | Expired | Total | Description |
---|---|---|---|
200 | 821178 | 2052579 | OK |
301 | 797 | 29891 | Moved Permanently |
302 | 3 | 3 | Moved Temporarily |
304 | 0 | 7 | Not Modified |
400 | 0 | 99 | Bad Request |
403 | 0 | 7 | Forbidden |
404 | 30690 | 100115 | Not found |
500 | 0 | 1 | Internal Server Error |
503 | 0 | 1 | Service Unavailable |
Total | 852668 | 2182703 | |
I optimize the data by dumping it into a file using SELECT * INTO OUTFILE, sorting it into word (CRC) order with the system sort routine, and then reloading it into the database using the procedure described in the MySQL online manual.
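The sorting step can be sketched in a few lines. The example below is only an illustration and not the author's actual procedure, which used the system sort. It assumes the dump is tab-separated (the MySQL default for INTO OUTFILE) and that the word CRC is in the first column. The function name `sort_dump_by_crc` and the column position are hypothetical.

```python
def sort_dump_by_crc(lines, crc_column=0, sep="\t"):
    """Return dump lines ordered by the numeric word CRC.

    lines      -- iterable of text lines from the dump file
    crc_column -- index of the CRC field (an assumption, check your schema)
    sep        -- field separator; tab is the MySQL OUTFILE default
    """
    def crc_of(line):
        # CRC values may be negative, so compare them as integers.
        return int(line.rstrip("\n").split(sep)[crc_column])

    # sorted() is stable: rows for the same word keep their relative order.
    return sorted(lines, key=crc_of)


if __name__ == "__main__":
    dump = ["30\t1\t5\n", "-7\t2\t1\n", "12\t3\t9\n"]
    for row in sort_dump_by_crc(dump):
        print(row, end="")
```

Reading the whole dump into memory like this only suits small tables. For a dump of the size described here, an external sort such as the system sort routine is the more practical choice.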
The performance is wonderful. My favorite test is searching for "John Smith". The optimized database version takes about 13 seconds. The raw version takes about 73 seconds.
Search results: john : 620241 smith : 177096
Displaying documents 1-20 of total 128656 found
Using ares, an asynchronous resolver library (dns/ares in the FreeBSD ports collection), allows DNS queries to be performed without blocking in every indexing thread. Please note that this also increases the number of concurrent queries to your DNS server.
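The effect of non-blocking resolution on DNS load can be seen in a small simulation. This sketch does not use ares or perform real DNS queries. `fake_lookup` is a hypothetical stand-in that just waits briefly, and the returned address is a documentation placeholder. It only shows that when lookups do not block, all of them are in flight at once.

```python
import asyncio


async def fake_lookup(host, state, delay=0.01):
    """Pretend to resolve host; the wait stands in for network latency."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(delay)       # other lookups proceed during this wait
    state["in_flight"] -= 1
    return host, "192.0.2.1"         # placeholder address, not a real answer


async def resolve_all(hosts):
    """Start every lookup at once; return the results and peak concurrency."""
    state = {"in_flight": 0, "peak": 0}
    pairs = await asyncio.gather(*(fake_lookup(h, state) for h in hosts))
    return dict(pairs), state["peak"]


if __name__ == "__main__":
    hosts = ["a.example", "b.example", "c.example"]
    addresses, peak = asyncio.run(resolve_all(hosts))
    print(addresses)
    print("peak concurrent lookups:", peak)
```

The peak equals the number of hosts, which is the simulated counterpart of the extra concurrent queries your DNS server will receive.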