Comparison of Webglimpse vs HtDig
Prepared in 2003 at a client's request. Please notify us if you believe this comparison is in any way inaccurate or is now outdated.
Project Fundamentals
| Feature | HtDig | Webglimpse |
|---|---|---|
| Last Modified | 2002-02-01 | 2003-05-16 |
| Known Security Holes | Yes (beta) * | No |
| Security features | Not known | perl -T (taint mode) |
| Program language | C only | C for indexing and search; Perl for web presentation |
| Specification of Files to Index | Site * | Site, Directory or Tree * |
| File Types Supported | HTML, text * | HTML, text, PDF, MSWord, .gz, .zip * |
| Code availability, Licensing | GPL | Open code, but requires license |
| Tech Support | Mailing list | Yes. Guaranteed successful install and config with all purchased licenses. |

Entries marked * are explained in the Notes below.
User Interface
| Feature | HtDig | Webglimpse |
|---|---|---|
| Boolean queries | Yes | Yes |
| Phrase searches | No | Yes |
| Fuzzy/approx matching | Yes | Yes |
| Easy search interface with "ALL, ANY or PHRASE" choice | No | Yes |
| Wildcard searches | No | Yes |
| Language Templates Available | English | Hebrew, German, Spanish, Italian, French, Finnish, Norwegian, Portuguese and Estonian (Russian just received 6/02/03) |
| Limit search by... | URL pattern | URL pattern or Subdirectory |
| Re-Rank Hits (user choice of criteria) | No | Yes |
| Keyword Highlighting | No | Yes |
| Combine results from multiple archives | No | Yes |
Administration
| Feature | HtDig | Webglimpse |
|---|---|---|
| Web administration interface | No | Yes |
| Customizable Search output | Yes | Yes |
| Customizable Ranking formulas | No | Yes |
| Query Log (what are users searching for?) | No | Yes |
| Statistics on gathered pages | No | Yes |
| Email to administrator on index failure | Not known | Yes |
| Meta tag support | Not known | Yes |
Technical details
| Feature | HtDig | Webglimpse |
|---|---|---|
| Indexing algorithm | Not known | Block-level inverted index |
| Options for Speed | Not known | Caching of search results; ability to limit the number of hits returned for extremely fast search (<1 s on 2 GB of data) |
| Options for Size | Not known | Tiny, medium or large index; pre-filtering of files. The index typically takes 5-15% of total file size. Local files do not need to be gathered. |
| Options for returned text | Not known; but depends on retrieving all files, even local ones. According to the htdig site, the database may frequently be larger than the actual files indexed. | Find sentences; limit by chars; limit by lines |
| Platforms tested on | Linux, Solaris, SunOS, HP/UX, IRIX, FreeBSD, Mac OS X | Linux, Solaris, SunOS, HP/UX, FreeBSD, AIX, IRIX, OSF, Mach, Mac OS X |
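The comparison names Webglimpse's indexing algorithm as a block-level inverted index. The sketch below illustrates the general idea only and is not taken from the Webglimpse or Glimpse source: files are grouped into blocks, the index records which blocks contain each word rather than exact positions, and a search scans only the candidate blocks. The function names, the `block_size` parameter and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")

def build_block_index(docs, block_size=2):
    """Map each word to the set of block numbers that contain it.

    docs: dict of {file name: text}. Files are grouped into blocks of
    `block_size`, so the index stores block ids instead of positions.
    """
    names = sorted(docs)
    blocks = [names[i:i + block_size] for i in range(0, len(names), block_size)]
    index = defaultdict(set)
    for block_id, members in enumerate(blocks):
        for name in members:
            for word in WORD.findall(docs[name].lower()):
                index[word].add(block_id)
    return blocks, index

def search(docs, blocks, index, word):
    """Look up candidate blocks, then scan only the files in those blocks."""
    word = word.lower()
    hits = []
    for block_id in sorted(index.get(word, ())):
        for name in blocks[block_id]:
            if word in WORD.findall(docs[name].lower()):
                hits.append(name)
    return hits

# Hypothetical sample data, for illustration only.
docs = {
    "a.txt": "Glimpse builds a small index",
    "b.txt": "HtDig gathers links from a site",
    "c.txt": "A small index saves disk space",
    "d.txt": "Search results can be cached",
}
blocks, index = build_block_index(docs)
print(search(docs, blocks, index, "index"))  # ['a.txt', 'c.txt']
```

Storing block ids keeps the index small at the cost of a second scan step at query time. That trade-off is the usual motivation for this design, though the figures quoted in the table are the original author's, not something this sketch demonstrates.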
Some Users
| Product | Users |
|---|---|
| HtDig | NASA, Tennessee Valley Authority, Valley Internet, Together Networks, many Linux and GNU-related sites, many universities |
| Webglimpse | NASA, Los Alamos National Laboratory, Altohiway, Texas Workforce Commission, Baystate Health System, Intel, Hewlett-Packard, AT&T, many small businesses, universities, and government agencies |
Notes
HtDig has a known security hole in the latest beta version, 3.2.0b3, which is currently downloadable from
http://htdig.org. The hole is fixed in the latest stable version, 3.1.16, and in the code snapshot; the previous stable
version, 3.1.15, is also affected. The vulnerable beta has apparently been available for download since 2001-10-15.
According to the HtDig site's notes on the issue, "This hole can allow remote users to read any file on your system that
the UID running your webserver can read."
HtDig selects the files to index by gathering links from one or more starting URLs. It will gather links that are on the
same site as the starting ones by matching a simple set of string patterns.
Webglimpse can index files by Site, essentially the same as HtDig; by Directory (all files within a specified directory
on the server, whether or not they are linked); and by Tree (all files within a certain number of 'mouse clicks' or 'hops'
of one or more starting points). Webglimpse can also include or exclude files by regexp patterns, and it can accept
information about synonymous virtual domains and alias directories so that it does not gather duplicate links.
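The Site and Tree modes described above can be pictured as a breadth-first walk over links with a hop limit and regexp filters. The following is a minimal sketch under stated assumptions, not Webglimpse's or HtDig's actual code: the `links` dictionary stands in for fetching and parsing pages, and `select_urls`, `max_hops`, `include` and `exclude` are hypothetical names chosen for illustration.

```python
import re
from collections import deque
from urllib.parse import urlparse

def select_urls(links, starts, max_hops, include=None, exclude=None):
    """Breadth-first walk over a link graph, limited to `max_hops` clicks.

    links: dict mapping a URL to the URLs it links to (a stand-in for
    fetching and parsing pages). Only links on the same host as one of
    the starting URLs are followed.
    """
    hosts = {urlparse(u).netloc for u in starts}
    seen = set(starts)
    queue = deque((u, 0) for u in starts)
    selected = []
    while queue:
        url, hops = queue.popleft()
        keep = include is None or re.search(include, url)
        if keep and not (exclude and re.search(exclude, url)):
            selected.append(url)
        if hops == max_hops:
            continue  # do not follow links beyond the hop limit
        # Filtered-out pages are still followed; only their indexing is skipped.
        for nxt in links.get(url, []):
            if nxt not in seen and urlparse(nxt).netloc in hosts:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return selected

# Hypothetical link graph, for illustration only.
links = {
    "http://example.org/": ["http://example.org/docs/", "http://other.example.com/"],
    "http://example.org/docs/": ["http://example.org/docs/a.pdf", "http://example.org/cgi-bin/x"],
    "http://example.org/docs/a.pdf": ["http://example.org/deep.html"],
}
result = select_urls(links, ["http://example.org/"], max_hops=2, exclude=r"/cgi-bin/")
print(result)
```

Here the off-site link and the `/cgi-bin/` URL are left out, and `deep.html` is never reached because it is three hops from the starting point. A Directory-style selection would instead list files on disk and needs no link following at all.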
According to the 'Features and Requirements' page on the http://htdig.org website, "Both HTML documents and plain text
files can be searched. Searching of other file types will be supported in future versions." However, the FAQ area
refers to searching PDF files; this may apply only to the beta version, which is currently released
with a security hole. It may be possible to index PDF files by using the new beta code snapshot together with the xpdf
add-on.
Webglimpse can index any file that an external program can filter to text. Free and reliable external
programs are known for PDF, MSWord, and all compressed file formats. Because files are pre-filtered before indexing (and
filtered on download), searches are quite fast even on these file types. Pre-filtering also saves a great deal of space
when indexing remote files. Several scripts to filter HTML tags are provided, including ones that convert HTML
character codes such as &aacute; (á) for effective searching in non-English languages.
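To show what such an HTML filter does, here is a minimal sketch using only the Python standard library. It is not one of the scripts shipped with Webglimpse; `html_to_text` is a hypothetical name, and the regular expression is a deliberately crude tag stripper that ignores scripts, styles and comments.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")

def html_to_text(markup):
    """Strip tags, then decode character entities such as &aacute;."""
    text = TAG.sub(" ", markup)   # replace each tag with a space
    text = html.unescape(text)    # &aacute; becomes á, &oacute; becomes ó
    return " ".join(text.split()) # collapse runs of whitespace

print(html_to_text("<p>M&aacute;s <b>informaci&oacute;n</b></p>"))  # Más información
```

After filtering, a search for "información" can match the stored text directly, which is the benefit the note above describes for non-English content.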