
Re: replacing the monster regex





>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<

On 1/29/02, 2:17:10 AM, "Golda Velez" <golda@iwhome.com> wrote regarding 
replacing the monster regex:


> Ok, I think I've got the replacement done, just need to test some - one 
> question, though

> What did you mean by "the splits work much better when the substr() is 
> removed"?  Unless maxchars was set very low, the substr should not have 
> affected the part of the line that we're actually splitting up...did you 
> maybe have it set low, or what errors did you get?

What I got was consistently broken links, over and over.  I couldn't 
figure out why it wasn't parsing the filename correctly until I took 
out the substr().  I had not modified the default value, either.  
However, with a path like:

http://thiserver.thislongveryverylongdomainname.net/project/subproject/subsubproject/somefurtherdivision/andanother/oneortwo/forfurtherorganization/A ridiculously long file name that describes exactly what the use case is.html

I was finding that it, unsurprisingly, broke the splits || regex.
I hadn't gone back to see what the actual value for the substr was, since 
everything worked fine without it. :)
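
For anyone following along, here is a minimal sketch of the kind of failure I mean. It is not the real script: the tab separator, the $maxchars value of 60 and the sample record are all made up for illustration, and I still haven't checked what the actual default is. It only shows that if a record gets truncated before the split, the later fields (and part of the file name) can disappear:

```perl
use strict;
use warnings;

# Hypothetical values for illustration only; the real separator and the
# real default for $maxchars are not shown in this thread.
my $FILE_END_MARK = "\t";
my $maxchars      = 60;

# Made-up record: file, link, popularity, rest.
my $record = join $FILE_END_MARK,
    'http://example.net/project/subproject/A ridiculously long file name.html',
    'http://example.net/link',
    '3',
    ':2002-01-29:matched text';

# Split the full record, then split a copy truncated to $maxchars.
my @full      = split /\Q$FILE_END_MARK\E/, $record, 4;
my @truncated = split /\Q$FILE_END_MARK\E/, substr($record, 0, $maxchars), 4;

printf "full: %d fields, truncated: %d fields\n",
    scalar @full, scalar @truncated;
print "file name cut short\n" if $truncated[0] ne $full[0];
```

If the first field alone is longer than the cutoff, the truncated copy never reaches a separator, so split hands back a single, shortened field.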




* Will paste this into my version, and see what happens :)
> Here's the replacement code (which still needs to be tested), now off in 
> a subroutine:
> --------------------------

>         ################################################
>         # Now with simpler regexp, may not need to do this substr
>         # We don't need to process more than maxchars, this can
>         # speed things up a lot for files with very long records
>         # (e.g. no linebreaks)
>         if ($maxchars >= 10000) {
>                 $$glinesref[$i] = substr($$glinesref[$i],0,$maxchars);
>         }
>         ################################################

>         ($file, $link, $pop, $rest) = split(/$FILE_END_MARK/,$$glinesref[$i],4);

>         # Better check - if $pop is not simple numeric, we are
>         # probably using an older index that did not save link
>         # popularity values
>         if ($pop =~ /\D/) {
>                 $rest = $pop.$rest;
>                 $pop = 1;
>         }

>         # for html documents, there will be an extra space and tab,
>         # then the title or "No Title" and a colon
>         # colons in the title are escaped
>         # non-html documents do not have title section

>         if ($rest =~ /^:/) {
>                 $title = '';
>         } else {
>                 $rest =~ s/\s*$FILE_END_MARK*(.*[^\\]):(.+)$/$2/;
>                 $title = $1;
>         }
>         ($null, $date, $string) = split(':', $rest, 3);

> ------------------------------
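
One thought on the title step, offered as a rough sketch rather than a drop-in replacement: if colons inside titles are escaped as `\:`, splitting on unescaped colons with a negative lookbehind might be easier to test than the substitution. The sample $rest value and its layout (space, tab, title, date, string) are my assumptions based on the comments above, not something I've checked against a real index:

```perl
use strict;
use warnings;

# Hypothetical $rest: leading space and tab, a title with an escaped
# colon, then date and matched string. This layout is an assumption.
my $rest = " \tPerl\\: a primer:2002-01-29:some matched text";

my ($title, $date, $string);
if ($rest =~ /^:/) {
    # Non-HTML record: no title section, so the first field is empty.
    $title = '';
    (undef, $date, $string) = split /:/, $rest, 3;
} else {
    $rest =~ s/^\s+//;    # drop the leading space and tab
    # Split only on colons that are NOT preceded by a backslash.
    ($title, $date, $string) = split /(?<!\\):/, $rest, 3;
    $title =~ s/\\:/:/g;  # unescape colons in the title
}

print "title=$title date=$date string=$string\n";
```

The limit of 3 keeps any further colons inside $string, the same way the split(':', $rest, 3) in the quoted code does.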

> Thanks again for your suggestion, I need to do a bunch of testing and 
> integrate a few other changes, it will be about a week before the next 
> actual release, I think.

> --G