Filtering and Filetypes


 

Why we filter: index the text, not the tags

Webglimpse reads every document slated for indexing and gathers each word it contains. HTML markup does not need to be indexed, so it is excluded automatically. Filtering also lets us index formats such as PDF or Microsoft Word by using publicly available programs to convert those files to text. In addition, filtering gives us the opportunity to exclude text such as repeated headers and footers and to perform other custom processing. These operations require a custom filter script, but wgFilter.pm contains some subroutines that may help.

Controlling which filetypes are indexed with .glimpse_filters and .wgfilter-index

The actual filter script used for each type of document is set in the file .glimpse_filters in the archive directory. Note that each extension likely to be encountered in your site should have a corresponding entry in .glimpse_filters.

Example of a .glimpse_filters file that has entries added for .pdf and .PDF file extensions:

*.pdf   /home/sites/natlaw/scripts/usexpdf.sh <
*.PDF   /home/sites/natlaw/scripts/usexpdf.sh <
*.html  /home/sites/natlaw/scripts/htuml2txt.pl <
*.ht    /home/sites/natlaw/scripts/htuml2txt.pl <
*.shtml /home/sites/natlaw/scripts/htuml2txt.pl <
*.sht   /home/sites/natlaw/scripts/htuml2txt.pl <
*.php   /home/sites/natlaw/scripts/htuml2txt.pl <
*.PHP   /home/sites/natlaw/scripts/htuml2txt.pl <
*.asp   /home/sites/natlaw/scripts/htuml2txt.pl <
*.ASP   /home/sites/natlaw/scripts/htuml2txt.pl <
*.php3  /home/sites/natlaw/scripts/htuml2txt.pl <
*.php4  /home/sites/natlaw/scripts/htuml2txt.pl <
*.htm   /home/sites/natlaw/scripts/htuml2txt.pl <
*.HTM   /home/sites/natlaw/scripts/htuml2txt.pl <
*.HTML  /home/sites/natlaw/scripts/htuml2txt.pl <
*.jhtml /home/sites/natlaw/scripts/htuml2txt.pl <
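
To make the lookup concrete, here is a minimal Python sketch of how a filename could be matched against entries like those above. It is an illustration only, not glimpse's actual implementation: the function names parse_filters and find_filter are hypothetical, and it assumes case-sensitive glob matching with the first matching entry winning, which is why both *.pdf and *.PDF are listed.

```python
import fnmatch

# A few entries in the same format as the example .glimpse_filters above.
FILTERS_TEXT = """\
*.pdf   /home/sites/natlaw/scripts/usexpdf.sh <
*.PDF   /home/sites/natlaw/scripts/usexpdf.sh <
*.html  /home/sites/natlaw/scripts/htuml2txt.pl <
*.htm   /home/sites/natlaw/scripts/htuml2txt.pl <
"""


def parse_filters(text):
    """Return a list of (pattern, command) pairs, one per non-empty line."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        entries.append((parts[0], parts[1]))
    return entries


def find_filter(filename, entries):
    """Return the command of the first matching entry, or None.

    Assumption: matching is case-sensitive, so 'x.Pdf' matches neither
    '*.pdf' nor '*.PDF'.
    """
    for pattern, command in entries:
        if fnmatch.fnmatchcase(filename, pattern):
            return command
    return None


entries = parse_filters(FILTERS_TEXT)
print(find_filter("report.PDF", entries))
print(find_filter("notes.Pdf", entries))
```

Under these assumptions a file whose extension has no entry gets no filter at all, which is the reason to list every extension likely to appear on your site.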

In addition, whether a file is indexed at all is controlled by the file .wgfilter-index (also in the archive directory). .wgfilter-index can include or exclude subdirectories, files matching a pattern, and specific filetypes. For efficiency, processing of .wgfilter-index stops at the first match. A generally efficient ordering is to put Deny lines by directory first, then Allow lines by filetype, then a final line that denies everything else.

Example of a .wgfilter-index file that will index URLs ending in .htm, .html, .shtml, .asp, .txt, .php, .phpX (for example .php3), .pdf and /, except for anything in a /cgi-bin/ directory or the Webglimpse search forms themselves, and that will not attempt to follow mailto: or JavaScript links:

Deny (^|/)wgindex\.html$
Deny (^|/)wgall\.html$
Deny (^|/)wgany\.html$
Deny (^|/)wgsimple\.html$
Deny (^|/)wgverysimple\.html$
Deny (^|/)cgi-bin(/|$)
Deny ^mailto:
Deny (^|/)JavaScript:
Allow \.s?html?$
Allow \.php.?$
Allow \.pdf$
Allow \.asp$
Allow \.txt$
Allow \/$
Deny .

The default .wgfilter-index file automatically generated in your archive directory is longer and does not have the fallback Deny line, in order to support a wide variety of sites.
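
The first-match behaviour described above can be sketched in a few lines of Python. This is a hedged illustration rather than Webglimpse's own code: parse_rules and is_indexed are hypothetical names, the rules are a subset of the example file, and the default_allow fallback for URLs that match no rule is an assumption.

```python
import re

# A subset of the example .wgfilter-index rules, in the same order.
RULES_TEXT = r"""
Deny (^|/)wgindex\.html$
Deny (^|/)cgi-bin(/|$)
Deny ^mailto:
Allow \.s?html?$
Allow \.php.?$
Allow \.pdf$
Allow \/$
Deny .
"""


def parse_rules(text):
    """Turn 'Allow <regex>' / 'Deny <regex>' lines into (action, regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        action, pattern = line.split(None, 1)
        rules.append((action, re.compile(pattern)))
    return rules


def is_indexed(url, rules, default_allow=True):
    """Apply rules top to bottom and stop at the first one that matches.

    default_allow is an assumption about what happens when no rule matches.
    """
    for action, pattern in rules:
        if pattern.search(url):
            return action == "Allow"
    return default_allow


RULES = parse_rules(RULES_TEXT)
for url in ("http://example.com/docs/page.html",
            "http://example.com/cgi-bin/search.html",
            "http://example.com/logo.gif"):
    print(url, is_indexed(url, RULES))
```

Because evaluation stops at the first match, the /cgi-bin/ URL is denied even though it ends in .html, and the final "Deny ." line only catches URLs that no earlier rule matched.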

Speeding Up Searches Using Prefiltering

Webglimpse 2.7.4 and above have a setting that allows prefiltering of PDF and even HTML files, greatly increasing search speed for large indexes. Indexing is also significantly faster. Prefiltering works by keeping the pure-text version of each file that glimpse needs to index and search, rather than generating it on the fly as needed. For remote files, storage needs are actually reduced, because the filtered version is stored rather than the original. For local files, storage requirements increase by about 20% of the total size of the files indexed. Extra meta information can now be stored in the filtered file (such as line numbers, for jump-to-line), which will allow administrators to add meta information about specific files even if they do not own those files.

Be sure that you have the correct filters installed in .glimpse_filters. If you are already indexing PDF files, no change should be needed to the PDF filter. For HTML, please use htuml2txt.pl as the filter program so that <TITLE> tags and other essential information will be preserved. Your .glimpse_filters file may look something like this:

*.pdf /usr/local/bin/usexpdf.sh <
*.PDF /usr/local/bin/usexpdf.sh <
*.html /usr/local/wgdemo/lib/htuml2txt.pl <
*.htm /usr/local/wgdemo/lib/htuml2txt.pl <
...more filetypes here...

Note: if you have not yet indexed PDF files, please see How To Index PDF Documents using XPDF.

Go into the archive management screen and enter 'all' in the new "Prefilter file types:" input area. This is the default for new archives.

Rebuild your archive. You may do this either by running

    /path/to/your/archive/wgreindex

manually or by pressing the 'Build Index' button in the web interface. Note that once you rebuild the archive manually, rebuilding it from the web interface may later run into permissions problems unless you reset ownership to the web user. It is generally best to pick one method and use it consistently.

Now search your rebuilt archive, and see the speed improvement! (That is, unless it was already fast before - but you can still be happy that the CPU load on your server is less.)

Eliminating Repetitive Text

Often documents in a repository share a section of text that is repeated across many or all documents and that is not desirable to search on. Standard headers and footers are the most common example: they produce repetitive and uninformative search output when the user happens to query on a keyword that is part of the standard toolbar, for instance.

Filtering out such repetitive text requires some customization of the HTML filter. The easiest way to remove simple text blocks is probably to modify the htuml2txt.pl filter script to use the provided wgFilter.pm module and make use of the SkipTag or SkipSection routines. A detailed description of how to create a custom script is beyond the scope of this document; if you run into problems, try subscribing to the users mailing list.
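
As a rough illustration of the idea only, the Python sketch below removes everything between a pair of marker comments before the text would be indexed. It does not use or reproduce wgFilter.pm, SkipTag or SkipSection; the function strip_section and the marker strings are hypothetical, and it assumes your pages delimit the repeated block with such markers.

```python
import re

# Hypothetical markers; substitute whatever delimits the repeated block
# (for example a navigation toolbar) in your own pages.
START = "<!-- nav-start -->"
END = "<!-- nav-end -->"


def strip_section(html, start=START, end=END):
    """Remove every block from `start` through `end`, markers included."""
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    return pattern.sub("", html)


page = "<p>Intro</p><!-- nav-start --><a>Home</a><!-- nav-end --><p>Body</p>"
print(strip_section(page))
```

The non-greedy match keeps each removal confined to one marked block, so a page with both a header and a footer block loses both while the text between them is kept.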

Filetypes

Webglimpse does not actually have a fixed list of supported filetypes. Rather, any filetype that can be filtered to ASCII text by an external filter is supported. We've tested filters for HTML (or XML), PDF, Word, Excel, and Zip files, which are documented in the next section.

Now, let's talk about the details of setting up filtering for PDF, Word, Excel and Zip files.
