Web Glimpse Administration
Help for Add Link Tree
The Add Link Tree page adds a Tree-type root to an archive. A Tree-type root is an index of pages generated by traversing links from the specified start page, limited by the number of "hops" chosen and by the rules for traversing remote sites. The Tree root is the most flexible but also the most complex archive component; it has the potential to index an unexpectedly large number of pages, so the "Max # Remote Pages" limit should be chosen to restrain a runaway gatherer.

"Traversing links" means following <A HREF...> and similar tags, just as if the program were a user clicking on links on the page. <FRAME> and <IMAGEMAP> tags are also traversed.

Start URL specifies the starting page for the index.

Max # Hops limits the depth to which links will be traversed. For example, if this is set to 1, then only pages linked directly from the starting page will be indexed. For a Tree-type root this should NOT be set too high; a value of 1 or 2 is usually sufficient, and 3 hops out from yahoo.com would be a lot of pages...
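As a rough illustration (not WebGlimpse's implementation) of how the hop limit bounds the traversal, here is a breadth-first sketch in Python; get_links and index are placeholder functions supplied by the caller.

    from collections import deque

    def crawl(start_url, max_hops, get_links, index):
        # get_links(url) -> URLs linked from the page; index(url) adds the
        # page to the archive. Both are placeholders for this sketch.
        seen = {start_url}
        queue = deque([(start_url, 0)])      # (url, hops from the start page)
        while queue:
            url, hops = queue.popleft()
            index(url)                       # the page itself is indexed
            if hops == max_hops:
                continue                     # at the limit: do not follow further
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, hops + 1))

    # With max_hops=1, the start page and the pages it links to directly are
    # indexed, but links found on those pages are not followed.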

Follow links... these checkboxes give you fairly precise control over how the traversing spider should behave. In the examples below, let http://A.com/a.html be the start page; let A.com be a local server; let there be links from a.html to http://A.com/b.html and http://B.com/b.html; and from http://B.com/b.html to http://B.com/c.html and http://C.com/c.html. Make a drawing if this sounds confusing.

  • Even if no boxes are checked, http://A.com/a.html will be indexed, and the link to http://A.com/b.html will also be followed and that page indexed, because the server A.com is local.
  • ...to remote pages: If this box is checked, then links from the start page or any local page to a remote server will be followed. So http://B.com/b.html is now indexed.
  • ...on remote sites: If this box is checked, then links on remote servers that stay on the same server will be followed. So the link to http://B.com/c.html IS indexed (because it is on the same server as http://B.com/b.html), but the link to http://C.com/c.html is not followed unless the next box is also checked, because it leads from one remote server to a different one.
  • ...from remote sites to other remote sites: If this box is checked, it's a free-for-all; there are no restrictions on the type of link that will be followed. Now http://C.com/c.html is indexed (assuming Max # Hops was set to at least 2).
The above rules determine whether each individual link is accepted or rejected for traversal. The overall traversal is still bounded by the number of hops and by the maximum numbers of local and remote pages.
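One way to picture the three checkboxes is as a single yes/no test applied to each candidate link. The Python sketch below is purely illustrative and the argument names are invented here, but it encodes the same rules as the list above.

    # Hypothetical encoding of the three "Follow links..." checkboxes.
    def should_follow(from_is_local, to_is_local, same_server,
                      to_remote, on_remote, remote_to_remote):
        # from_is_local  -- the page holding the link is on a local server
        # to_is_local    -- the link target is on a local server
        # same_server    -- the target is on the same server as the page holding it
        # the last three arguments are the three checkboxes
        if to_is_local:
            return True               # links to local pages are always followed
        if from_is_local:
            return to_remote          # "...to remote pages"
        if same_server:
            return on_remote          # "...on remote sites"
        return remote_to_remote       # "...from remote sites to other remote sites"

    # Example: with only "...to remote pages" checked, the link from a.html to
    # http://B.com/b.html is followed, but b.html's link to http://C.com/c.html
    # is not.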

Max # Local Pages sets the maximum # of pages to index from the local server. The storage requirements are lower for indexing local pages because they do not need to be copied to the local machine before being indexed.

Max # Remote Pages sets the maximum # of pages to gather and index from remote servers. These pages are actually retrieved and stored on the local server as well as being indexed.
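As a rough sketch only (the class, names, and limit values below are invented for illustration), the two limits can be thought of as independent quotas checked before each page is indexed.

    # Illustrative quotas for local vs. remote pages; values are examples only.
    class PageQuota:
        def __init__(self, max_local, max_remote):
            self.max_local, self.max_remote = max_local, max_remote
            self.local_used = self.remote_used = 0

        def admit(self, is_local):
            # Return True if another page of this kind may still be indexed.
            if is_local:
                if self.local_used >= self.max_local:
                    return False
                self.local_used += 1
            else:
                # a remote page would also be fetched and stored locally here
                if self.remote_used >= self.max_remote:
                    return False
                self.remote_used += 1
            return True

    # quota = PageQuota(max_local=500, max_remote=50)   # example values only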

Reindex Freq is NOT actually functional as of version 2.0.04. It will be used to create a crontab fragment that can be manually included in a user's crontab to reindex regularly.
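Until that feature is implemented, reindexing can be scheduled by hand. A hypothetical crontab entry along these lines would reindex weekly; the command path is a placeholder and depends on your installation.

    # Hypothetical example only: run the site's reindexing script at 03:00
    # every Sunday. Replace the path with your installation's actual script.
    0 3 * * 0  /path/to/reindex-command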
