wgfilter-index


Common Problems With Indexes

Permissions issues and incorrect paths are the most common reasons for problems with indexing. Listed below are some of the more typical problems people run into when indexing for the first time, along with possible solutions.

Zero-byte files - There are a few possible reasons for getting zero-byte files when you index. For example, the path to your perl executable may be incorrect, or Webglimpse may not be able to find its library files.

Zero files indexed - Perhaps you've used an incorrect path for your executables, or the permissions are not set properly for the web user, e.g. 'apache' or 'nobody'. Try reindexing manually to view the error output.
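Both problems above usually come down to a path that does not exist or that the web user cannot execute. The following is a minimal Python sketch of that check, not part of Webglimpse. The function name check_executable and the path /usr/bin/perl are illustrative assumptions; substitute the paths from your own configuration, and run the script as the web user so the permission test reflects that account.

```python
import os
import shutil


def check_executable(path):
    """Return a short diagnosis for a configured executable path."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a file"
    # os.access tests the permissions of the user running this script.
    if not os.access(path, os.X_OK):
        return "not executable by this user"
    return "ok"


if __name__ == "__main__":
    # Hypothetical path; replace it with the perl path you configured.
    for candidate in ["/usr/bin/perl"]:
        print(candidate, "->", check_executable(candidate))
    # For comparison, show where perl is found on the current PATH, if at all.
    print("perl on PATH:", shutil.which("perl"))
```

If the result is anything other than "ok" for the web user, correct the path or the permissions before reindexing.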

Files To Examine

.wg_log - maintains a list of all collected files and URLs.

.wg_err - a list of error messages

If the log and error files do not give you the error output that you need, the best way to find out exactly where the errors occur is to run ./wgreindex on the command line or, if you must run it from cron, to remove the -q switch. This forces the errors to be displayed on the screen when you re-run wgreindex.
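Before reindexing by hand, it can help to look at the most recent entries in both files. The sketch below is a hedged illustration only: it assumes that .wg_log and .wg_err are plain text files located in the archive directory, and the helper names tail_lines and show_logs are made up for this example.

```python
import os
from collections import deque


def tail_lines(path, count=20):
    """Return the last `count` lines of a text file, or [] if it is missing."""
    if not os.path.isfile(path):
        return []
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the final `count` lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def show_logs(archive_dir, count=20):
    """Print the tail of .wg_log and .wg_err found in archive_dir."""
    for name in (".wg_log", ".wg_err"):
        path = os.path.join(archive_dir, name)
        lines = tail_lines(path, count)
        print("== %s (%d line(s) shown) ==" % (path, len(lines)))
        for line in lines:
            print(line)


if __name__ == "__main__":
    # Assumes the script is started from the archive directory.
    show_logs(".")
```

An empty or missing .wg_err together with an empty index is a hint to run ./wgreindex manually as described above.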

Most likely any problems you encounter will be due to a path or permission problem. The list below represents a step-by-step approach to isolating the source of the problem:

  1. Troubleshoot by running

    ./wgreindex

    on the command line from the archive directory as an appropriately privileged user. Examine the command output and the .wg_log and .wg_err files.

  2. Try turning off prefiltering (unless you are indexing PDFs or other binary files) by unchecking the prefiltering box in wgarcmin.

  3. Check that the paths and URLs in the .wg_toindex file are correct. If retrieving remote files, check the contents of files in the .remote subdirectory.

  4. Check the extensions of the files to be indexed against the list of allowed extensions in the file .wgfilter-index.

  5. If the files to be indexed are not plain text, make sure they have a working filter installed in the file .glimpse_filters.

  6. If you are configured to use an alternate file end mark (in order to index files with spaces in the names), make sure you have made the matching modification to both Glimpse and Webglimpse.

  7. From the web administration's main page, you can use the "Test Path Translation" program to ensure that the webglimpse path matches the web URL that you have specified. As seen below in Figure 21, you would enter the URL that you designated and click the "URL 2 File" button. If the URL that you specified is correct and accessible by your web server, the file path will be displayed in the File field. Similarly, if you want to test the file path translation, you would enter it into the File field and click the "File 2 URL" button. If it is accessible, the corresponding URL will be displayed in the URL field.

    You also can check the canonical aliasing of your domain name from this screen.

 

Format of .wgfilter-index

The file .wgfilter-index can be used to include and exclude files based on many patterns, not only filetype. It is probably most commonly used to exclude subdirectories from the indexing.

A copy of .wgfilter-index is placed in each archive directory by confarc. You can edit this file to selectively include/exclude files and directories from your index.

The format is:

	Allow|Deny regexp

Each line of the file should begin with Allow or Deny, followed by some whitespace, then the regular expression matching the files to be included or excluded from the index. So for example, the line

	Deny  (^|/)secret

excludes any file whose path contains the string "/secret" or whose path begins with "secret". For more information on regular expressions, see any Perl reference book, or type "man regexp", or see the on-line docs on Perl regular expressions.
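To see which paths the pattern (^|/)secret catches, it can be tried against a few sample paths. The sketch below uses Python's re module as a stand-in for Perl; for a pattern this simple the two behave the same, but that equivalence and the sample paths are assumptions of this example, not something Webglimpse provides.

```python
import re

# The pattern from the Deny line above, applied as an unanchored search.
DENY_SECRET = re.compile(r"(^|/)secret")


def is_denied(path):
    """Return True if the path would match the Deny pattern."""
    return DENY_SECRET.search(path) is not None


if __name__ == "__main__":
    # Hypothetical paths, chosen to show the edge cases.
    samples = [
        "secret/plan.html",       # begins with "secret"
        "docs/secret/plan.html",  # contains "/secret"
        "docs/secrets.html",      # also contains "/secret"
        "docs/topsecret.html",    # "secret" not at the start of a component
        "docs/public.html",
    ]
    for path in samples:
        print(path, "->", "Deny" if is_denied(path) else "no match")
```

Note that the pattern also matches names that merely start with "secret", such as secrets.html, while topsecret.html is not matched because "secret" there follows neither a slash nor the start of the path.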

The lines of .wgfilter-index are processed in order, with earlier lines having precedence over later ones. So to index only files ending in .htm or .html, you could use a .wgfilter-index file containing the lines

	Allow \.html?$
	Deny .*

Note: .wgfilter-index is read by makenh when creating the list of files to be indexed. It is not read by glimpse or glimpseindex directly.
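The first-match-wins ordering described above can be tried out with a small simulation. This is a sketch under stated assumptions, not the makenh implementation: the names parse_rules and is_indexed are invented here, Python's re module stands in for Perl regular expressions, lines that do not start with Allow or Deny are simply skipped, and a path that matches no rule is treated as allowed (the actual default is not stated in this document).

```python
import re


def parse_rules(text):
    """Parse 'Allow|Deny regexp' lines into (action, compiled_regex) pairs."""
    rules = []
    for raw in text.splitlines():
        parts = raw.strip().split(None, 1)
        # Skip blank or malformed lines (an assumption of this sketch).
        if len(parts) != 2 or parts[0] not in ("Allow", "Deny"):
            continue
        rules.append((parts[0], re.compile(parts[1])))
    return rules


def is_indexed(path, rules, default=True):
    """Apply the rules in order; the first matching line decides."""
    for action, regex in rules:
        if regex.search(path):
            return action == "Allow"
    # No rule matched; the default here is an assumption.
    return default


if __name__ == "__main__":
    rules = parse_rules("Allow \\.html?$\nDeny .*\n")
    for path in ["a/index.html", "a/index.htm", "a/logo.gif"]:
        print(path, "->", "indexed" if is_indexed(path, rules) else "skipped")
```

Because the Allow line comes first, .htm and .html files are accepted before the catch-all Deny line is reached. Reversing the two lines would exclude everything.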