
some comments on htuml2txt.pl



Hi Golda,

Well, I am back.  I was swamped with deadlines and have finally had
some extra time to look at webglimpse again, in particular the
new htuml2txt.pl -- very exciting.

So now probably you are busy and do not have time to look at
these things. (-:

Anyway, here is what I can report based on a couple of hours
of testing.  The short story is that the conversion seems to
work all right, but it exposes a series of problems in
the /cgi-bin/webglimpse script.  I have not tried to
investigate this systematically, so I will just report them
as I found them. They all come from "reasonable" searches, that is,
searches one could reasonably expect a user to make.  I have not tried (yet)
to intentionally give pathological cases.

If it makes any difference I use: glimpse-4.11 and perl 5.004_3 
and this was webglimpse_1.7b2

-------------------

For the Danish letters glimpse seems to be finding them all right if
they are indexed with htuml2txt.pl (that is, both the html versions and
the "actual" versions are found). 

-------------------------

But there is a problem with the /cgi-bin/webglimpse script because
it is not "case insensitive".

That is, if I search for

"Sørensen"  then I get all the instances with "Sørensen"

but if I search for "S#rensen" then I get all the
ones found by "Sørensen", but I also get SØRENSEN (because of the
capital Ø , which was not found by "Sørensen" )

Similarly if I search for "SØrensen" then I only get SØRENSEN
(again because of the capital "Ø"), and no "Sørensen"

Probably it is just a matter of adjusting what is fed to
glimpse in the webglimpse cgi-bin script?  I have not
studied the parameters, but probably you will know
them immediately, and it should not be too hard to fix.
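
To illustrate the kind of folding that would make "Sørensen" and
"SØRENSEN" compare equal, here is a minimal Perl sketch.  It assumes
the text is ISO-8859-1 (one byte per letter); fold_latin1 and
same_word are hypothetical helper names, not part of webglimpse or
glimpse, and this says nothing about how the cgi-bin script actually
builds its glimpse command line.

```perl
use strict;
use warnings;

# Fold ASCII and Latin-1 upper-case letters to lower case.
# Byte ranges assume ISO-8859-1: \xD8 is capital O-slash, \xF8 is small o-slash.
sub fold_latin1 {
    my ($s) = @_;
    $s =~ tr/A-Z\xC0-\xD6\xD8-\xDE/a-z\xE0-\xF6\xF8-\xFE/;
    return $s;
}

# Hypothetical helper: compare a query and a candidate word after folding both.
sub same_word {
    my ($query, $word) = @_;
    return fold_latin1($query) eq fold_latin1($word);
}

# "S\xF8rensen" is "Sørensen", "S\xD8RENSEN" is "SØRENSEN" in Latin-1.
print same_word("S\xF8rensen", "S\xD8RENSEN") ? "match\n" : "no match\n";  # prints "match"
```

Folding both sides the same way is the point; which side does the
folding (the script or the search engine) is left open here.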

----------------------

If I tried to search with "Sørensen" and one spelling mistake allowed
then I got the error:

/usr/bin/glimpse: size of pattern 's' must be > #of errors 1

It would work all right if I searched on "Sorensen" with one error.

That search did find SØRENSEN, but it returned only 36 cases, compared with
53 when I search on "Sørensen" with no errors.  (Not quite sure why.)

(None of them are marked in bold)

---------------------

Another point about  /cgi-bin/webglimpse

Usually the found term is placed in bold.  But I guess it
is more tricky with #  

Anyway, I tried to compare a search for:

"forståelse"

vs.

"forst#else"

(The idea with using the # is to check how well the
htuml2txt.pl script is finding the Danish letters, also
in the HTML version)

In the latter (wildcard) case, all the "forståelse" that were found
were also marked in bold, and also the following, where only
the beginning and end were marked, as underlined:

 forståelighed og sprogforståelse 
 -----                       ----

But:  "forstyrrelse"  which was also (correctly) found
by forst#else was not marked in bold.

That is probably wrong, no? But I cannot explain it.
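
As a rough sketch of how a highlighter could treat the # wildcard,
the Perl below turns the pattern into a regular expression and wraps
each whole match in bold tags.  It assumes # should stand for any run
of non-space characters inside one word; wildcard_to_regex and
highlight are hypothetical names, and this is not the code the
webglimpse script uses.

```perl
use strict;
use warnings;

# Turn a pattern with '#' wildcards into a regex source string.
# Literal pieces are escaped; each '#' becomes \S* (stays inside one word).
sub wildcard_to_regex {
    my ($pattern) = @_;
    return join '\S*', map { quotemeta } split /#/, $pattern, -1;
}

# Wrap every match of the wildcard pattern in <b>...</b>.
sub highlight {
    my ($pattern, $line) = @_;
    my $re = wildcard_to_regex($pattern);
    $line =~ s/($re)/<b>$1<\/b>/gi;
    return $line;
}

print highlight("forst#else", "en forstyrrelse her"), "\n";
# prints "en <b>forstyrrelse</b> her"
```

With this approach "forstyrrelse" is marked as a whole, instead of
only the literal pieces "forst" and "else" being marked.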

--------------------------

Next problem.

I tried to search on the Danish letter: Æ

I got an error that "Æ" did not contain any words and
would take too long to search.  Fair enough.  So then
I tried to search on "z"  Now I did not get any
error message, and what is more, it found:

KRISTINE MARIE JENSEN DE LOPÉZ 

(where the Z was marked in bold)  Here was the code from the file:

 LOP&Eacute;Z

(I understand the problem, in general, that the ; is used as a word
marker, hence the "Z" was found as a word.  But probably this
is not the right behavior, in the long run.)
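
The word-marker effect can be shown with a small Perl sketch.  It
assumes the file holds the entity-coded form (for example
LOP&Eacute;Z) and that words are split on anything that is not a
letter or digit.  The entity table is tiny and illustrative, and
decode_entities and words are hypothetical names, not functions from
htuml2txt.pl.

```perl
use strict;
use warnings;

# Tiny, incomplete entity table (Latin-1 byte values); illustrative only.
my %entity = (
    Eacute => "\xC9", aring => "\xE5", oslash => "\xF8", ouml => "\xF6",
);

# Replace known entities with their Latin-1 byte; leave unknown ones alone.
sub decode_entities {
    my ($text) = @_;
    $text =~ s/&(\w+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;
    return $text;
}

# Split into words; bytes \xC0-\xFF are treated as letters.
sub words {
    my ($text) = @_;
    return grep { length } split /[^A-Za-z0-9\xC0-\xFF]+/, $text;
}

my @raw     = words("LOP&Eacute;Z");                   # LOP, Eacute, Z
my @decoded = words(decode_entities("LOP&Eacute;Z"));  # one word
print scalar(@raw), " vs ", scalar(@decoded), "\n";    # prints "3 vs 1"
```

Splitting the raw text leaves a stray one-letter word "Z"; decoding
first keeps the name together as a single word.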

I also tried a search on the letter "b".

Again, it found all the cases with "uml" letters, such as:

 både 
 udløb.
 Böwadt 
 Småbørns

(and in each case, the B is set in bold)

---------------------

Along the way, I discovered that it would be a good idea for
.glimpse_filters to also have *.HTM and *.HTML
as well as *.htm and *.html.  (I had at least one user who
had some .HTM files --- the pleasures of moving files
from MS-DOS.)

And finally....here are some small patches / suggestions
relative to 1.7b2 (maybe these are already fixed in 1.7.1)

1. wginstall.pl

For the cgi-bin script, wginstall checks where perl is located
and puts the right value in for webglimpse, but it does not do so
for mfs.
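
A minimal sketch of the kind of fix-up meant here, assuming the
script starts with an ordinary #! line: rewrite the interpreter path
on the first line and leave the rest alone.  fix_shebang is a
hypothetical helper, not something taken from wginstall.pl.

```perl
use strict;
use warnings;

# Replace the perl path on a leading "#!" line, keeping any switches (-w etc.).
# Text without a "#!...perl" first line is returned unchanged.
sub fix_shebang {
    my ($text, $perl_path) = @_;
    $text =~ s/^#!\s*\S*perl\S*/#!$perl_path/;
    return $text;
}

my $script = "#!/usr/local/bin/perl -w\nprint 1;\n";
print fix_shebang($script, "/usr/bin/perl");
# first line of the output is "#!/usr/bin/perl -w"
```

The same routine could be applied to every installed cgi-bin script
in turn, so that none of them is missed.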

2. Just a small modification to give better information.
    I was getting "ERROR" and did not understand where to look.

--- webglimpse/lib/config.pl    Wed Nov 11 17:28:20 1998
+++ webglimpse.new/lib/config.pl        Fri Apr 16 18:31:02 1999
@@ -336,7 +336,7 @@
        # Now doing error checking -- check for parsing problem --GB 7/16/98
        if (!defined($protocol))        {
                print ERRFILE "Error parsing $url\n";
-               print "ERROR\n";
+               print "URL PARSING ERROR -- see .wg_err\n";
                return $URL_ERROR;
        }
               

3. Here are a few small things. It seems that a \ slipped
in by mistake.  And I added a couple more "quiets"

--- webglimpse.new/makenh	Fri Apr 16 18:31:01 1999
+++ webglimpse/makenh	Wed Nov 11 17:45:29 1998
@@ -202,7 +202,7 @@
 foreach $item (@configlist) 
 {
 	$value = '';
-	eval "\$value = \$$item";
+	eval "$value = \$$item";
 	print LOGFILE " $item: $value\n";
 }
 print LOGFILE " urllist: @urllist\n\n";
@@ -1361,9 +1357,9 @@
 	 # $urlstat = $siteconf::URL_REMOTE;
 	 $filename="";
 	 if(($urlstat==$URL_REMOTE)||($urlstat==$URL_TRAVERSE)||($urlstat==$URL_SCRIPT)){
-		 print "Url $link is remote...\n";
+	    if (!$quiet) { print "Url $link is remote...\n"; }
 	    if(($urlstat==$URL_REMOTE) && ($traverse_type!=1)){  # only do if we're allowing remote
-	       print "Skipping non-local url: $link.\n";
+	       if (!$quiet) { print "Skipping non-local url: $link.\n"; }
 	       next;
 	    }
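
Since the direction of the first hunk is easy to get backwards, here
is a minimal Perl sketch of how the two forms of the eval line
behave.  $indexdir and the two sub names are hypothetical stand-ins
for one entry of @configlist; the sketch only shows standard Perl
string interpolation, not which form a given makenh release contains.

```perl
use strict;
use warnings;

our $indexdir = "/tmp/idx";    # hypothetical config variable

# With the backslash, the eval'd string is '$value = $indexdir'.
sub read_with_escape {
    my ($item) = @_;
    my $value = '';
    eval "\$value = \$$item";
    return $value;
}

# Without it, $value (empty) is interpolated first, so the eval'd
# string is ' = $indexdir', which fails to compile and sets $@.
sub read_without_escape {
    my ($item) = @_;
    my $value = '';
    eval "$value = \$$item";
    return ($value, $@ ? 1 : 0);
}

print "with: '", read_with_escape("indexdir"), "'\n";   # prints "with: '/tmp/idx'"
my ($v, $failed) = read_without_escape("indexdir");
print "without: '$v' (eval failed: $failed)\n";         # prints "without: '' (eval failed: 1)"
```

So with $value set to '' just before, only the form with \$value
actually copies the config value into $value for the log line.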
 
 
Also with makenh, around line 1482 there is 
 print LOGFILE "Error with link: $link.  Cannot recognize as local *or* remote.\n";

but I think it should be print ERRFILE

4. For confarc:  Why not create a directory if the one that
is entered is not found?
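
Something along these lines would do it; a minimal sketch using the
standard File::Path module.  ensure_dir is a hypothetical helper
name, and where confarc would call it is an assumption.

```perl
use strict;
use warnings;
use File::Path qw(make_path);

# Create the directory (and any missing parents) if it does not exist.
# Returns 1 if the directory exists afterwards, 0 otherwise.
sub ensure_dir {
    my ($dir) = @_;
    return 1 if -d $dir;
    eval { make_path($dir) };   # make_path may die on failure
    return -d $dir ? 1 : 0;
}
```

A caller could ask the user for confirmation first and call
ensure_dir only if they agree, instead of rejecting the path.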

Ok.  I will keep playing around with it.  It is nice to see webglimpse
moving forward.

Cheers,
  Seth