Making A Searchable Site



This section assumes you have successfully installed Webglimpse and are now ready to configure an archive to be searched.

First, what is an archive? We use the term 'archive' for the collection of documents to be indexed, along with the index itself and the associated archive configuration files. By default, Webglimpse stores the indexes and archive-specific configuration files in the directory "/[webglimpse_home]/archives/[N]", where 'N' is the number of the archive. You can create multiple archives with a single installation of Webglimpse, and each archive may include local files from your server's hard drive, remote files retrieved from other servers, and/or files retrieved from your local server through its web interface.

Before creating your archive, it is a good idea to have a clear picture of exactly which documents you want indexed and have a plan in place. While this may sound obvious, it is not always so. You should ask yourself the following questions:

Indexing Your Website

Using the web interface to create an archive:

Login: Open your web browser and point it to your administration page, wgarcmin.cgi

(At the completion of the install, you should have been given the URL where you chose to install wgarcmin.cgi, most likely something like

http://yourserver.com/cgi-bin/wgarcmin.cgi   or   http://yourserver.com/cgi-bin/wg2/wgarcmin.cgi )

You will be prompted for the administrator name and password you set during the install. (If you've forgotten the administrator login name or password, you can view the troubleshooting information page.)

Once you have logged into the administration module, you will see the initial 'Manage Archives' screen. If this is your first time in the administration area, then the drop-down list will be empty.

Press the Add New Archive button at the top of the screen to configure a new archive.

Enter any Title, Category and Description you choose for the new archive. The Title you enter will be displayed in the search forms and result page. You may also choose a language at this time. (Figure 14)

Figure 14

Choose Your Indexing Method

You need to choose which indexing method you want: "Index by Directory", "Index by Site", or "Index by Traversing Links". This is the main decision you need to make when creating an archive: how to determine the files to be indexed. Here are some things to consider in making that determination:

Index by Directory - a good choice if all the files you want to index are in a local directory on your server, and if that directory does not also hold a lot of other files that you do NOT want indexed. Indexing by directory saves time and space, as static files do not need to be retrieved through the web interface. In another screen you can specify any dynamic filetypes, such as .php and .cgi extensions, that should be run through the web interface. You can also exclude specific files or subdirectories from the index by later editing the .wgfilter-index file in the archive directory.

Index by Site - probably the easiest choice for most websites, this is a good option if you want to index all the pages under a specific domain or website directory, and if all those pages are linked in some way from the starting page. For example, choose this option if you want to index everything under http://yoursite.com/ and a user could reach every page on your site by browsing from the first page.

Index by Traversing Links - if you want to index a selection of remote pages and sites, this is the option to choose. You will have flexible options on the next page for exactly which remote pages to retrieve - whether to get only those you specifically link to, or all pages on those sites you link to, or to just follow a set number of 'hops' regardless of where those hops lead.

If you chose Index by Directory, you will be presented with the screen as displayed in Figure 15:

Figure 15

Directory Path specifies the location in the filesystem of the files to be indexed. All files within the specified directory will be indexed, except those denied by .wgfilter-index. The list of files is generated by `find . -type f -follow -print`.

Equivalent URL tells Webglimpse what URL on your server corresponds to the directory being indexed. This should be a full URL starting with "http://" and ending with the directory or alias. It should not end with index.html or other html file name.
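
To make the relationship between Directory Path and Equivalent URL concrete, here is a minimal Python sketch of the mapping. It is only an illustration, not Webglimpse's own code; the function name file_to_url and the example paths are assumptions made up for this example.

```python
from pathlib import PurePosixPath

def file_to_url(file_path, directory_path, equivalent_url):
    """Map a file under directory_path to the URL it would have under equivalent_url.

    Hypothetical helper for illustration only; raises ValueError if the
    file does not live under directory_path.
    """
    relative = PurePosixPath(file_path).relative_to(directory_path)
    # The Equivalent URL names the directory itself, so the relative
    # path is simply appended to it.
    return equivalent_url.rstrip("/") + "/" + relative.as_posix()

print(file_to_url("/var/www/html/docs/guide/intro.html",
                  "/var/www/html/docs",
                  "http://yourserver.com/docs"))
# prints http://yourserver.com/docs/guide/intro.html
```

This is also why the Equivalent URL should end with the directory or alias rather than with index.html: every indexed file's path is appended to it.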

Max # Pages tells Webglimpse the maximum # of pages to index from this directory. For example, if this is set to 100, then only the first 100 pages returned by `find` will be indexed.
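
The effect of Max # Pages on a directory archive can be sketched in Python as follows. This only approximates the `find` command quoted above (regular files, symbolic links followed); the function name list_files is hypothetical, and the order in which files are visited may differ from what `find` returns on your system.

```python
import os

def list_files(directory, max_pages):
    """Collect regular files under directory, following symlinks,
    and stop once max_pages files have been found.

    Hypothetical illustration; like `find -follow`, following links
    can loop forever if a symlink points back at a parent directory.
    """
    found = []
    for root, _dirs, files in os.walk(directory, followlinks=True):
        for name in sorted(files):
            if len(found) >= max_pages:
                return found
            path = os.path.join(root, name)
            if os.path.isfile(path):  # skip sockets, broken links, etc.
                found.append(path)
    return found
```

Because the limit cuts the list off at a fixed count, which files are left out depends entirely on traversal order, so it is safer to set the limit above the number of files you expect.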

 

If you chose Index by Site, you will be presented with the screen as displayed in Figure 16:

Figure 16

Site URL: specifies the starting page for the index. The domain of this starting page will be used to limit the pages included in the index.

Max # Pages sets the maximum # of pages to index for this site. For example, if this is set to 100, then only the first 100 pages traversed will be indexed.

Max # Hops limits the depth to which links will be traversed. For example, if this is set to 1, then only pages directly linked to from the starting page will be indexed. For a Site type root, this can be safely set to a large value since pages on other servers will not be gathered or indexed.
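
The way Max # Pages and Max # Hops interact can be illustrated with a small Python sketch over an in-memory link table, so no network access is needed. It assumes a breadth-first traversal, which may not match the order Webglimpse actually uses; the function crawl, the LINKS table and the URLs in it are all made up for the example.

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, links, max_pages, max_hops):
    """Breadth-first traversal that stays on the start URL's host.

    links maps each page URL to the URLs it links to (a stand-in for
    fetching and parsing real pages).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue and len(order) < max_pages:
        url, hops = queue.popleft()
        order.append(url)
        if hops == max_hops:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(url, []):
            # Pages on other servers are never gathered for a Site archive.
            if target not in seen and urlparse(target).netloc == host:
                seen.add(target)
                queue.append((target, hops + 1))
    return order

LINKS = {
    "http://yoursite.com/": ["http://yoursite.com/a.html",
                             "http://other.example/x.html"],
    "http://yoursite.com/a.html": ["http://yoursite.com/b.html"],
}

print(crawl("http://yoursite.com/", LINKS, max_pages=100, max_hops=1))
# prints ['http://yoursite.com/', 'http://yoursite.com/a.html']
```

With max_hops=1 the page b.html is never reached because it is two links away from the start page, and the page on other.example is skipped regardless of the hop limit.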

 

If you chose Index by Traversing Links, you will be presented with the screen as displayed in Figure 17:

Figure 17

Start URL: specifies the starting page for the index.

Max # Hops limits the depth to which links will be traversed. For a Tree type root, this should NOT be set too high; usually a value of 1 or 2 is sufficient.

Follow links: these checkboxes give you fairly precise control over how the traversing spider should behave.

Max # Local Pages sets the maximum # of pages to index from the local server. The storage requirements are lower for indexing local pages because they do not need to be copied to the local machine before being indexed.

Max # Remote Pages sets the maximum # of pages to gather and index from remote servers. These pages are actually retrieved and stored on the local server as well as being indexed.

Reindex Freq is NOT actually functional as of version 2.0.04. It will be used to create a crontab fragment that can be manually included in a user's crontab to reindex regularly.

Now that you've set up an archive, you will automatically be placed in the archive management area. Most likely you will need to configure your local domain now. Clicking the "Update Status" link at the top of the Manage Archive screen will alert you if this should be your next step. To configure the local domain, click the "Back to WgMin Home" link towards the bottom of the screen.

 

Next, click on the "Configure Local Domain" button in the middle of the administrator screen.

Figure 18 displays the Configure Local Domain screen.

Figure 18

You must enter the proper Server Name. This should be your server name including your domain information. For example, if your server name is "myserver" and your domain is "home.com" the correct information to enter into the server name field is myserver.home.com

The Document Root must be the actual path to your web server's document root directory. The other settings are optional and can probably be left alone.

The UserDir is used only for sites of the form "http://yourdomain.com/~user" and corresponds to the Apache UserDir directive. (See the documentation at http://apache.org/ for further explanation.)

The Script Alias & Extensions setting is a way to make sure that dynamic pages are retrieved via http and not indexed on the filesystem. You can enter regular expressions corresponding to any dynamically generated pages on your site.
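
As a rough illustration of how such regular expressions separate dynamic pages from static ones, consider the Python sketch below. The patterns are hypothetical examples, not Webglimpse defaults, and Webglimpse may apply the expressions you enter differently (for instance with Perl regular expression syntax).

```python
import re

# Hypothetical patterns: anything under /cgi-bin/, plus .php and .cgi files.
DYNAMIC_PATTERNS = [r"/cgi-bin/", r"\.php$", r"\.cgi$"]

def is_dynamic(url_path, patterns=DYNAMIC_PATTERNS):
    """Return True if url_path matches any pattern, meaning the page
    should be fetched via http rather than read from the filesystem."""
    return any(re.search(pattern, url_path) for pattern in patterns)

print(is_dynamic("/shop/cart.php"))    # prints True
print(is_dynamic("/docs/index.html"))  # prints False
```

Note the escaped dot and the trailing `$` in the extension patterns: without them, a pattern such as `.php` would also match unrelated paths like /graphphotos/index.html.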

Directory Aliases are only needed if you are indexing files directly and some directories are aliased to a different URL than the DocumentRoot setting would suggest. If you don't know whether you are doing this, you probably are not and can leave it blank.

Once you are done, you should save the entries by pressing the "Save & Validate Changes" button.

Indexing A Set Of Remote Websites

If you are indexing remote site(s), you will now need to configure the remote locations by clicking on the "Configure Remote Domain" button on the administration page.

Figure 19 shows what the remote domain configuration screen looks like.

Figure 19

You must enter the URL of the remote domain. You do not need to include the "http://" in this field. The port is typically 80; if you know that is incorrect for this site, enter the correct port.
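
If you only have a full address for the remote site, the following Python sketch shows one way to work out what belongs in the domain and port fields. It is purely illustrative: the function split_remote_domain is hypothetical and has nothing to do with how Webglimpse itself parses the form.

```python
from urllib.parse import urlsplit

def split_remote_domain(value, default_port=80):
    """Return (host, port) for an address given with or without 'http://'."""
    # urlsplit only recognises the host part when a scheme or a
    # leading "//" is present, so add one if it is missing.
    if "//" not in value:
        value = "//" + value
    parts = urlsplit(value)
    return parts.hostname, parts.port or default_port

print(split_remote_domain("http://www.example.com/"))  # prints ('www.example.com', 80)
print(split_remote_domain("www.example.com:8080"))     # prints ('www.example.com', 8080)
```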

If you have a domain alias, enter it in the "Domain Alias" field. (This is not a required field.)

If the remote site requires a cookie-based login, you should place a check in the box and enter the additional information required.

Once finished, you should save the entries you made by pressing the "Save & Validate Changes" button.
