Comparison of Webglimpse vs HtDig

Prepared in 2003 at a client's request. Please notify us if you believe this comparison is in any way inaccurate or is now outdated.
Project Fundamentals
Feature | HtDig | Webglimpse
Last Modified | 2002-02-01 | 2003-05-16
Known Security Holes | Yes (beta)* | No
Security features | Not known | perl -T
Program language | C only | C for indexing and search; Perl for web presentation
Specification of Files to Index | Site* | Site, Directory or Tree*
File Types Supported | HTML, text* | HTML, text, PDF, MSWord, .gz, .zip*
Code availability, Licensing | GPL | Open code, but requires license
Tech Support | Mailing list | Yes. Guaranteed successful install and config with all purchased licenses.

User Interface
Feature | HtDig | Webglimpse
Boolean queries | Yes | Yes
Phrase searches | No | Yes
Fuzzy/approx matching | Yes | Yes
Easy search interface with "ALL, ANY or PHRASE" choice | No | Yes
Wildcard searches | No | Yes
Language Templates Available | English | Hebrew, German, Spanish, Italian, French, Finnish, Norwegian, Portuguese and Estonian (Russian just received 6/02/03)
Limit search by... | URL pattern | URL pattern or Subdirectory
Re-Rank Hits (user choice of criteria) | No | Yes
Keyword Highlighting | No | Yes
Combine results from multiple archives | No | Yes

Feature | HtDig | Webglimpse
Web administration interface | No | Yes
Customizable Search output | Yes | Yes
Customizable Ranking formulas | No | Yes
Query Log (what are users searching for?) | No | Yes
Statistics on gathered pages | No | Yes
Email to administrator on index failure | Not known | Yes
Meta tag support | Not known | Yes

Technical details
Feature | HtDig | Webglimpse
Indexing algorithm | Not known | Block-level inverted index
Options for Speed | Not known | Caching of search results; ability to limit the number of hits returned for extremely fast search (<1 s on 2 GB of data)
Options for Size | Not known, but depends on retrieving all files, even local ones. According to the HtDig site, the database may frequently be larger than the actual files indexed. | Tiny, medium or large index; pre-filtering of files. Index typically takes 5-15% of total file size. Local files do not need to be gathered.
Options for returned text | Not known | Find sentences; limit by chars; limit by lines
Platforms tested on | Linux, Solaris, SunOS, HP/UX, IRIX, FreeBSD, Mac OS X | Linux, Solaris, SunOS, HP/UX, FreeBSD, AIX, IRIX, OSF, Mach, Mac OS X
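
To illustrate what a block-level inverted index means in general, here is a minimal Python sketch. It is not Webglimpse's code: the function names, the fixed block size and the in-memory dictionaries are all assumptions made for this example. The idea shown is that the index records only which block (group of files) contains a word, which keeps the index small, and the search then scans just the files in the candidate blocks.

```python
import re

def build_block_index(files, block_size=2):
    """Group files into blocks and map each word to the blocks containing it.

    `files` is a dict of file name -> text (hypothetical in-memory corpus).
    """
    names = sorted(files)
    blocks = [names[i:i + block_size] for i in range(0, len(names), block_size)]
    index = {}
    for block_no, block in enumerate(blocks):
        for name in block:
            for word in re.findall(r"\w+", files[name].lower()):
                index.setdefault(word, set()).add(block_no)
    return blocks, index

def search(files, blocks, index, word):
    """Look up candidate blocks, then scan only the files in those blocks."""
    word = word.lower()
    pattern = re.compile(r"\b%s\b" % re.escape(word))
    hits = []
    for block_no in sorted(index.get(word, ())):
        for name in blocks[block_no]:
            if pattern.search(files[name].lower()):
                hits.append(name)
    return hits
```

Larger blocks give a smaller index at the cost of more scanning per query, which is one way to read the "tiny, medium or large index" choice listed above.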

Some Users
HtDig: NASA, Tennessee Valley Authority, Valley Internet, Together Networks, many Linux and GNU-related sites, many universities.
Webglimpse: NASA, Los Alamos National Labs, Altohiway, Texas Workforce Commission, Baystate Health System, Intel, Hewlett-Packard, AT&T, many small businesses, universities, and government agencies.


HtDig has a known security hole in the latest beta version, 3.2.0b3, which is currently downloadable from the site. There is a fix in the latest stable version, 3.1.16, and in the code snapshot. The previous stable version, 3.1.15, also had the security hole. The beta version with the known security problem has apparently been available for download since 2001-10-15. According to the project's notes, "This hole can allow remote users to read any file on your system that the UID running your webserver can read."

HtDig selects the files to index by gathering links from one or more starting URLs. It will gather links that are on the same site as the starting ones by matching a simple set of string patterns.
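
As a rough illustration of this kind of site-based selection, the following Python sketch resolves links found on a starting page and keeps only those on the same host, then drops any that contain an excluded substring. It is not HtDig's implementation; the function name `same_site_links` and the `exclude_substrings` parameter are assumptions made for the example.

```python
from urllib.parse import urljoin, urlparse

def same_site_links(start_url, hrefs, exclude_substrings=()):
    """Return absolute links from `hrefs` that stay on start_url's host."""
    start_host = urlparse(start_url).netloc.lower()
    kept = []
    for href in hrefs:
        absolute = urljoin(start_url, href)      # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue                             # skip mailto:, ftp:, etc.
        if parsed.netloc.lower() != start_host:
            continue                             # different site
        if any(s in absolute for s in exclude_substrings):
            continue                             # simple string-pattern exclusion
        kept.append(absolute)
    return kept
```

A file that exists on the server but is not linked from any gathered page would never be found this way, which is the limitation the Directory option described below addresses.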

Webglimpse can index files by Site, which is essentially the same as HtDig; by Directory (all files within a specified directory on the server, whether or not they are linked); and by Tree (all files within a certain number of 'mouse clicks' or 'hops' of one or more starting points). Webglimpse can also include or exclude files by regexp patterns, and can accept information about synonymous virtual domains and alias directories so that it does not gather duplicate links.
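
The Tree option and the regexp filters can be pictured with a short Python sketch. This is only an illustration under stated assumptions, not Webglimpse's code: the link graph is a hypothetical in-memory dictionary rather than pages fetched from a server, and the names `gather_tree`, `max_hops`, `include` and `exclude` are invented for the example.

```python
import re
from collections import deque

def gather_tree(links, starts, max_hops, include=None, exclude=None):
    """Breadth-first walk keeping pages within `max_hops` of any start page.

    `links` maps a page to the pages it links to. The include/exclude
    regexps filter the final list only; excluded pages are still followed.
    """
    include_re = re.compile(include) if include else None
    exclude_re = re.compile(exclude) if exclude else None
    hops = {s: 0 for s in starts}
    queue = deque(starts)
    while queue:
        page = queue.popleft()
        if hops[page] == max_hops:
            continue                      # hop limit reached, do not expand
        for target in links.get(page, []):
            if target not in hops:        # first visit is the shortest path
                hops[target] = hops[page] + 1
                queue.append(target)

    def wanted(page):
        if exclude_re and exclude_re.search(page):
            return False
        return include_re is None or bool(include_re.search(page))

    return sorted(p for p in hops if wanted(p))
```

For example, with a start page that links to `/a.html` and `/b.pdf`, a call with `max_hops=1` and `exclude=r"\.pdf$"` keeps the start page and `/a.html` only.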

According to the 'Features and Requirements' page on the HtDig website, "Both HTML documents and plain text files can be searched. Searching of other file types will be supported in future versions." However, the FAQ area refers to searching PDF files; this may apply only to the beta version, which is currently released with a security hole. It may be possible to index PDF files by getting the new beta code snapshot and using the xpdf add-on.

Webglimpse can index any file that an external program can filter to text. Free and reliable external programs are known for PDF, MSWord, and all compressed file formats. Because files are pre-filtered before indexing (and filtered on download), searches are quite fast even on these file types. Pre-filtering also saves a great deal of space when indexing remote files. Several scripts to filter HTML tags are provided, including ones which convert HTML character codes such as &aacute; to the corresponding character (á), for effective searching in non-English languages.
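
A minimal Python sketch of the pre-filtering idea follows. It is not Webglimpse's filter mechanism: the `FILTERS` table and the function names are assumptions for the example, and it presumes the `pdftotext` and `gzip` command-line tools are installed. The HTML filter strips tags and converts character references such as &aacute; before the text would be indexed.

```python
import html
import os
import re
import subprocess

# Hypothetical mapping from file extension to an external text filter.
# Assumes pdftotext (xpdf/poppler) and gzip are available on the PATH.
FILTERS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],   # "-" writes text to stdout
    ".gz": lambda path: ["gzip", "-dc", path],       # decompress to stdout
}

def filter_command(path):
    """Return the external command for `path`, or None if no filter applies."""
    ext = os.path.splitext(path)[1].lower()
    make = FILTERS.get(ext)
    return make(path) if make else None

def html_to_text(markup):
    """Strip tags and convert character references such as &aacute; to text."""
    without_tags = re.sub(r"<[^>]+>", " ", markup)
    text = html.unescape(without_tags)
    return re.sub(r"\s+", " ", text).strip()

def prefilter(path):
    """Run the external filter for `path` and return its output as text."""
    command = filter_command(path)
    if command is None:
        with open(path, encoding="utf-8", errors="replace") as handle:
            return handle.read()
    done = subprocess.run(command, capture_output=True, check=True)
    return done.stdout.decode("utf-8", errors="replace")
```

Storing the filtered text once means a search does not have to re-run the external program for every query, which is the speed benefit described above.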

