Configuring and Customizing


Introduction

Webglimpse has two parts: Glimpse, the fast C engine that does the text indexing and pattern matching, and Webglimpse proper, the flexible Perl spider, archive manager and user interface script.

When you build an index, you first specify some rules for which files should be included. You might index by directory, by following links from a starting page, or a combination of both. The indexing script first makes a complete list of all files to be indexed, and retrieves any remote links. Then it feeds this list to the glimpseindex program, which builds a keyword index for fast searching.

When a user runs a search, the webglimpse.cgi script first gets the query, checks all the options and does some pre-processing. Then it sends the query to glimpse for a fast search of the index and files. Finally, it parses the raw results, formats them nicely according to the customized format on that site, ranks them, highlights keywords and presents the results back to the user.

Webglimpse 2.X has been out for almost two years and is in wide use, with new development continuing steadily and new releases about once a month.

Configuration File Overview

Starting in March of 2006, all users are allowed and encouraged to upgrade to the Advanced version of Webglimpse, which has several modules for customization. Educational, government and nonprofit users who installed prior to March 2006 may have the more limited version that was provided in the past, but are now free to upgrade to the full Advanced version.


Note: version 1.X is no longer recommended for new installs. It does not have any of the web management features and is missing many bug fixes.

Using wgoutput.cfg, you control each piece of text that Webglimpse outputs in the results page. A sample wgoutput.cfg file is shipped as the default output configuration in version 1.7.4. Note, however, that in versions prior to 2.1 the file is named .wgoutput.cfg, so it is hidden from the directory listing.

Each line in the wgoutput.cfg file starts with a special variable name or a + character to continue the previous record. The variables are used as follows to generate customized results output:

begin_html          output once at the beginning of results
neigh_msg           output once if this is a neighborhood search
noneigh_msg         output once if not a neighborhood search
lines_msg           output once if using jump to line
nolines_msg         output once if not using jump to line
begin_files         output once if any files match query
begin_file_marker   output at start of each file match, before the actual link to the file
begin_lines         output before first matching line of each file, but after the actual link to the file
begin_single_line   output before each matching line of each file
end_single_line     output after each matching line of each file
end_lines           output after all matching lines of each file
end_file_marker     output after end_lines
end_files           output after all matching files (if any)
end_html            output at end of results page
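
As a rough illustration of the record format described above, the following Python sketch reads wgoutput.cfg-style text into a dictionary. It is not Webglimpse's own parser (which is written in Perl). It assumes that the variable name is separated from its text by whitespace and that a continuation line is appended on a new line; neither detail is confirmed here, and the sample text is invented.

```python
def parse_output_config(text):
    """Parse wgoutput.cfg-style text into {variable_name: output_text}."""
    records = {}
    current = None  # name of the record a '+' line would continue
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("+"):
            # A '+' line continues the previous record.
            if current is None:
                raise ValueError("continuation line before any record")
            records[current] += "\n" + line[1:].strip()
        else:
            # Assumption: name and text are separated by whitespace.
            parts = line.split(None, 1)
            current = parts[0]
            records[current] = parts[1] if len(parts) > 1 else ""
    return records


# Invented sample, not the shipped default file.
sample = """begin_html <h1>Results for [QUERY]</h1>
+ <hr>
end_html <p>Done.</p>
"""
config = parse_output_config(sample)
print(config["begin_html"])
```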

Using patterns in Results output
Starting in version 1.7.4, it is also possible to substitute user-defined and pre-defined variables from the matched document into the results set. Variables are denoted by [VARNAME] in the wgoutput.cfg file.

Pre-defined variables are as follows:

[QUERY]           The original user query
[SEARCHTITLE]     Optional form variable in wgindex.html
[MATCHED_LINES]   Number of matching lines (can ONLY be used in the end_html variable above)
[MATCHED_FILES]   Number of matching files (can ONLY be used in the end_html variable above)

User-defined variables are set up in the .wgoutputfields file. The format is
TYPE<tab>NAME<tab>REGEXP

where TYPE may be FILE, to look for the value in the actual matching file, or PATH, to look for the value in the directory path.

NAME should match a variable [NAME] in the wgoutput.cfg file in the archive directory. TITLE is a special name, and will be used for the linked text.

The value is set to the text matched by the first parenthesized subexpression (the $1 variable after a match).

For example, this line in .wgoutputfields:
FILE REFNO ^Reference Number:\s+(.+)$

will cause each file returned by Glimpse to be parsed for the regular expression /^Reference Number:\s+(.+)$/. If a match occurs, the subexpression corresponding to the ()'s is substituted for [REFNO] wherever that variable appears in the output text specified in wgoutput.cfg. Note that only one subexpression should be enclosed in parentheses.

Here is an example of a path-based variable. This line in .wgoutputfields:
PATH LASTSUBDIR \/([^\/]+)\/[^\/]+$

will cause the path to each returned file to be parsed for the regular expression /\/([^\/]+)\/[^\/]+$/. If a match occurs, the subexpression corresponding to the ()'s, in this case the last subdirectory containing the file, is substituted for [LASTSUBDIR] wherever it appears in wgoutput.cfg.
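
The extraction and substitution steps can be sketched in Python as follows. This is an illustration only, not Webglimpse's Perl code: the function names, the sample path and the sample file content are hypothetical, and the sketch assumes that a FILE pattern is tried against each line of the file in turn. The two regular expressions are the ones from the examples above.

```python
import re


def load_field_rules(text):
    """Parse .wgoutputfields-style lines: TYPE<tab>NAME<tab>REGEXP."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, name, pattern = line.split("\t", 2)
        rules.append((kind, name, re.compile(pattern)))
    return rules


def extract_fields(rules, path, content):
    """Return {NAME: first captured group} for every rule that matches."""
    values = {}
    for kind, name, regex in rules:
        # Assumption: FILE patterns are tried line by line; PATH patterns
        # are tried once against the whole path.
        candidates = content.splitlines() if kind == "FILE" else [path]
        for candidate in candidates:
            match = regex.search(candidate)
            if match:
                values[name] = match.group(1)  # same role as Perl's $1
                break
    return values


def substitute(template, values):
    """Replace each [NAME] with its value; unknown names are left as-is."""
    return re.sub(r"\[([A-Z_]+)\]",
                  lambda m: values.get(m.group(1), m.group(0)),
                  template)


rules = load_field_rules("\n".join([
    "\t".join(["FILE", "REFNO", r"^Reference Number:\s+(.+)$"]),
    "\t".join(["PATH", "LASTSUBDIR", r"\/([^\/]+)\/[^\/]+$"]),
]))
values = extract_fields(
    rules,
    "/docs/reports/2006/summary.html",                 # hypothetical path
    "Title: Summary\nReference Number:   A-1234\n",    # hypothetical content
)
print(substitute("Ref [REFNO] in [LASTSUBDIR]", values))
```

With the hypothetical inputs shown, [REFNO] becomes A-1234 and [LASTSUBDIR] becomes 2006.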

Customized Search Results Order (ranking)
Four built-in ranking schemes are available to the user as they search your site: rank by most recent first, by matches in title and meta tags, by link popularity, or by a combination of all these (the default). As the administrator, you can create your own ranking formula using all the available information about the match, and make your own customized ranking schemes available to your users (or limit them to one scheme of your choice). Re-sorting and ranking of hits is an extremely powerful tool that allows users to find the hits most relevant to them.
META tag support allows you to include any meta tag explicitly in your ranking formulas. This gives you precise control over ordering of your hit results, if you so desire.

Glimpse itself does not use any ranking algorithm, except for ordering by most recent first. Webglimpse, however, allows the administrator to create a custom ranking formula based on the available variables:

# Available variables are:
#
# $N           # of times the word appears in the record
# $LineNo      Where in the file the word appears
# $TITLE       # of times the word appears in the TITLE tag
# $FILE        # of times the word appears in the file path
# $Days        Date (how many days old the file is)
# $META        Total # of times the word appears in any META tag
# $LinkPop     Link popularity in the site
# %MetaHash    Hash with the # of times the word appears in each META tag,
#              indexed by the NAME= parameter of the meta tag.
# $LinkString  Actual URL of link

By editing the .wgrankhits.cfg file in each archive, you can create one or more named ranking formulas. In effect, this lets the end user choose a specific named ranking formula for each search. (If you invent new ranking formulas rather than modifying the default, you also need to edit the search form to make the names match.)
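
To make the idea of named ranking formulas concrete, here is a minimal Python sketch. It does not reproduce the syntax of .wgrankhits.cfg or any of the built-in formulas; the formula names, the weights and the sample hits are all hypothetical. It only shows how a score computed from the variables listed above can be used to order hits.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """One search hit, with fields named after the ranking variables."""
    link_string: str            # $LinkString: URL of the hit
    n: int = 0                  # $N: occurrences in the record
    title: int = 0              # $TITLE: occurrences in the TITLE tag
    days: int = 0               # $Days: age of the file in days
    link_pop: int = 0           # $LinkPop: link popularity in the site
    meta_hash: dict = field(default_factory=dict)  # %MetaHash: counts per META tag


# Hypothetical named formulas; a higher score sorts first.
FORMULAS = {
    "recent": lambda h: -h.days,
    "title_keywords": lambda h: (10 * h.title
                                 + 5 * h.meta_hash.get("keywords", 0)
                                 + h.n),
    "popular": lambda h: h.link_pop,
}


def rank(hits, formula_name):
    """Return hits sorted best-first under the named formula."""
    return sorted(hits, key=FORMULAS[formula_name], reverse=True)


hits = [
    Hit("http://example.com/old.html", n=3, title=1, days=400, link_pop=7),
    Hit("http://example.com/new.html", n=1, days=2, link_pop=1,
        meta_hash={"keywords": 1}),
]
print([h.link_string for h in rank(hits, "recent")])
print([h.link_string for h in rank(hits, "title_keywords")])
```

The dictionary key plays the role of the formula name that the search form would pass along, which is why the names in the form and in the configuration have to match.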

Structured Input - Supporting metatags

Field-based searching allows you to make a query such as 'subject=Things' when you have defined "subject" as a field and have indexed a file containing the line 'subject: Things'. For example, search for "subject=Things" at the demo page http://iwhome.com/wgarchives/demo/fields/wgindex.html
You must check the box 'Use Filters'. Note that you can also do combination searches, such as 'subject=Things;New Test'. This will search for all files containing the text "New Test" and the line "subject: Things". The allowed # of spelling errors and partial match criteria apply only to the field value "Things" and not to the field name.
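
The structure of a combination query can be illustrated with a short Python sketch. This is not how Glimpse evaluates queries: it uses exact, case-sensitive matching and ignores the spelling-error and partial-match options, and the function names are hypothetical. It only shows how 'subject=Things;New Test' breaks down into a field condition and a free-text condition that must both hold.

```python
def split_query(query, known_fields):
    """Split 'field=value;free text' into ({field: value}, [text terms])."""
    fields, text_terms = {}, []
    for part in query.split(";"):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        if sep and name.strip() in known_fields:
            fields[name.strip()] = value.strip()
        else:
            text_terms.append(part)
    return fields, text_terms


def file_matches(content, fields, text_terms):
    """True if every field line and every text term is present."""
    lines = content.splitlines()
    for name, value in fields.items():
        prefix = name + ":"
        # The field name must start the line; the value must follow it.
        if not any(line.startswith(prefix) and value in line[len(prefix):]
                   for line in lines):
            return False
    return all(term in content for term in text_terms)


fields, text_terms = split_query("subject=Things;New Test", {"subject"})
print(fields, text_terms)
print(file_matches("subject: Things\nThis is a New Test page.\n",
                   fields, text_terms))
```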

Okay, now that you have a little background on structured input and how it works, how do you implement it in your Webglimpse configuration? It's fairly simple; read on!

First, be sure you have installed updated versions of both Webglimpse and Glimpse. For versions of glimpse 4.14 or above, build as follows:

       ./configure --enable-structured-queries
       make clean
       make
       make install

If you are upgrading from a version prior to 1.7.5, re-configure your archive. For 1.X, run confarc to configure (or re-configure) the archive. This should copy over the new distribution files, including .wginputfields. For 2.X, press the 'Save Changes' button in the manage archive screen.

Next, edit the .wginputfields file in your archive directory. Enter one field name per line, no spaces.

Check that the files you are indexing contain the field data in the format

       fieldName: fieldValue

where "fieldName" is at the very beginning of the line.
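
Before reindexing, it can help to confirm that your files really contain field lines in this form. The Python sketch below is a hypothetical checking aid, not part of Webglimpse: it reads field names in the one-per-line layout described for .wginputfields and reports the field lines it finds. Lines where the field name is indented are deliberately not counted, since the name must start the line. The sample content is invented.

```python
import re


def load_field_names(text):
    """Read field names, one per line, ignoring blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def find_field_lines(content, field_names):
    """Return (field, value) pairs for lines of the form 'fieldName: fieldValue'."""
    found = []
    for line in content.splitlines():
        # re.match anchors at the start of the line, so an indented
        # field name does not count.
        match = re.match(r"([^\s:]+):\s*(.*)$", line)
        if match and match.group(1) in field_names:
            found.append((match.group(1), match.group(2).strip()))
    return found


field_names = load_field_names("subject\nauthor\n")
document = (
    "subject: Things\n"
    "  subject: Indented, so not counted\n"
    "author: Someone\n"
    "Plain body text.\n"
)
for name, value in find_field_lines(document, field_names):
    print(name, "=", value)
```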

Now, edit the .glimpse_filters file. Add a line for each type of file you want to index with fields, such as:

       *.EXT /WEBGLIMPSE_HOME/lib/parsefields.pl <

Where "EXT" is the real extension on the files to be indexed this way, and "WEBGLIMPSE_HOME" is the real path on your system to the Webglimpse home directory.

Finally, edit the wgreindex file. Make sure the -z and -s options are present on both glimpseindex command lines. You may only need to add the -s option.

Run ./wgreindex to rebuild your archive.

Use the wgindex.html file to perform a field-based search. Make sure to check the 'Use Filters' checkbox.

There you have it, a structured input searchable archive!