Speeding up Searches using Prefiltering

Webglimpse 2.7.4 and above have a new setting that allows prefiltering of PDF and even HTML files, which greatly increases search speed for large indexes. Indexing is also significantly faster.

Prefiltering works by keeping the plain-text version of each file that glimpse needs to index and search, rather than generating it on the fly each time it is needed. For local files, storage requirements increase by about 20% of the total size of the files indexed, because the filtered copy is stored in addition to the original. For remote files, storage needs are actually reduced, because the filtered version is stored rather than the original.

Extra meta information can now be stored in the filtered file (such as line numbers, for jump-to-line), which will allow administrators to add meta information about specific files even if they do not own those files.
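To make the idea concrete, here is a minimal shell sketch of "filter once, reuse the result". It is not Webglimpse's actual implementation: the function name prefilter, the cache directory, and the ".txt" naming are all hypothetical, and the filter is assumed to read the original on standard input and write plain text to standard output.

```shell
# Hypothetical sketch of prefiltering: run a filter once per file and keep
# the plain-text result, regenerating it only when the source has changed.
# Usage: prefilter SOURCE_FILE FILTER_COMMAND CACHE_DIR
# Prints the path of the filtered copy.
prefilter() {
    src=$1
    filter=$2
    cache_dir=$3
    # Simplification: name the copy after the source's basename, so two
    # sources with the same basename would collide.
    out="$cache_dir/$(basename "$src").txt"

    mkdir -p "$cache_dir"

    # Re-run the filter only if there is no filtered copy yet, or the
    # source is newer than the copy (-nt is supported by bash, dash, ksh).
    if [ ! -f "$out" ] || [ "$src" -nt "$out" ]; then
        "$filter" < "$src" > "$out"
    fi

    printf '%s\n' "$out"
}
```

A search would then read the file whose path prefilter prints instead of running the filter again on every query, which is where the speed gain comes from.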

If you just want to set it up, here is how:

  1. Make sure that you have the correct filters installed in .glimpse_filters. If you are already indexing PDF files, no change should be needed to the PDF filter. For HTML, please use htuml2txt.pl as the filter program so that <TITLE> tags and other essential information will be preserved. Your .glimpse_filters file may look something like this:
    *.pdf   /usr/local/bin/usexpdf.sh   <
    *.PDF   /usr/local/bin/usexpdf.sh   <
    *.html  /usr/local/wgdemo/lib/htuml2txt.pl <
    *.htm   /usr/local/wgdemo/lib/htuml2txt.pl <
    ...more filetypes here...
    
    Note: if you have not yet indexed PDF files, please see "How To Index PDF Documents using XPDF".

  2. Go into the archive management screen and enter 'all' in the new "Prefilter file types:" input area. This is the default for new archives.

  3. Rebuild your archive. You may do this either by running
    /path/to/your/archive/wgreindex
    manually or by pressing the 'Build Index' button in the web interface. Note: once you have rebuilt the archive manually, you may run into permission problems when rebuilding it from the web interface later, unless you reset ownership of the archive files to the web server user. Generally you should pick one method and use it consistently to avoid problems.
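Before rebuilding, it can help to confirm that every filter program named in step 1 is actually present and executable. The function below is a rough sketch, not part of Webglimpse: the name check_filters is hypothetical, and it assumes each line follows the layout shown above (pattern, then an absolute program path, then optional arguments or "<").

```shell
# Hypothetical helper: list filter programs from a .glimpse_filters-style
# file that are missing or not executable.
# Usage: check_filters FILTERS_FILE
# Returns 0 if every program is executable, 1 otherwise.
check_filters() {
    status=0
    # Each line is split into: pattern, program path, and the remainder.
    # The "|| [ -n ... ]" part handles a last line without a newline.
    while read -r pattern program _rest || [ -n "$pattern" ]; do
        # Skip blank lines.
        [ -z "$pattern" ] && continue
        if [ ! -x "$program" ]; then
            printf 'not executable: %s (for %s)\n' "$program" "$pattern"
            status=1
        fi
    done < "$1"
    return $status
}

# Example use before a manual rebuild, assuming .glimpse_filters lives in
# the archive directory (paths are placeholders):
#   check_filters /path/to/your/archive/.glimpse_filters \
#       && /path/to/your/archive/wgreindex
```

Running the check first means a mistyped filter path shows up as a one-line message rather than as files silently missing from the rebuilt index.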
Now search your rebuilt archive and see the speed improvement! (That is, unless it was already fast before, in which case you can still be happy that the CPU load on your server is lower.)