Re: replacing the monster regex
>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<
On 1/29/02, 2:17:10 AM, "Golda Velez" <golda@iwhome.com> wrote regarding
replacing the monster regex:
> Ok, I think I've got the replacement done, just need to test some - one
> question, though
> What did you mean by "the splits work much better when the substr() is
> removed"? Unless maxchars was set very low, the substr should not have
> affected the part of the line that we're actually splitting up...did you
> maybe have it set low, or what errors did you get?
What I got was consistently broken links, over and over. I couldn't
figure out why it wasn't parsing the filename correctly until I took
out the substr(). I had not modified the default value, either.
However, when you have a path like:
http://thiserver.thislongveryverylongdomainname.net/project/subproject/subsubproject/somefurtherdivision/andanother/oneortwo/forfurtherorganization/A ridiculously long file name that describes exactly what the use case is.html
I found that it was, unsurprisingly, breaking the splits and/or the regex.
I haven't gone back to check what the actual value for the substr was, since
everything worked fine without it. :)
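
To illustrate the failure mode, here is a minimal sketch of how truncating a
record before the split can lose fields. The tab marker, the tiny $maxchars
of 40, the shortened URL and the record layout are all made up for the
example; the real $FILE_END_MARK and the default maxchars are not shown in
this thread.

```perl
use strict;
use warnings;

# Hypothetical values for illustration only (not the script's real settings).
my $FILE_END_MARK = "\t";
my $maxchars      = 40;

# Assumed record layout: file, link, popularity, rest, joined by the marker.
my $line = join($FILE_END_MARK,
    'http://example.net/project/subproject/A ridiculously long file name.html',
    'link', 3, ':20020129:matching text');

# Truncating first can cut the record off inside the filename field,
# so the later fields (and the markers between them) never reach split().
my $truncated = substr($line, 0, $maxchars);
my @cut  = split(/\Q$FILE_END_MARK\E/, $truncated, 4);
my @full = split(/\Q$FILE_END_MARK\E/, $line, 4);

printf "fields after truncation:   %d\n", scalar @cut;
printf "fields without truncation: %d\n", scalar @full;
```

If the limit is shorter than the path plus filename, the filename itself
comes back cut short, which would look a lot like a broken link.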
* Will paste this into my version, and see what happens :)
> Here's the replacement code (which still needs to be tested), now off in
> a subroutine:
> --------------------------
> ################################################
> # Now with simpler regexp, may not need to do this substr
> # We don't need to process more than maxchars, this can
> # speed things up a lot
> # for files with very long records (e.g. no linebreaks)
> if ($maxchars >= 10000) {
>     $$glinesref[$i] = substr($$glinesref[$i],0,$maxchars);
> }
> ################################################
> ($file, $link, $pop, $rest) = split(/$FILE_END_MARK/,$$glinesref[$i],4);
> # Better check - if $pop is not simple numeric, we are
> # probably using an older index
> # that did not save link popularity values
> if ($pop =~ /\D/) {
>     $rest = $pop.$rest;
>     $pop = 1;
> }
> # for html documents, there will be an extra space and tab, then the
> # title or "No Title" and a colon
> # colons in the title are escaped
> # non-html documents do not have title section
> if ($rest =~ /^:/) {
>     $title = '';
> } else {
>     $rest =~ s/\s*$FILE_END_MARK*(.*[^\\]):(.+)$/$2/;
>     $title = $1;
> }
> ($null, $date, $string) = split(':', $rest, 3);
> ------------------------------
> Thanks again for your suggestion, I need to do a bunch of testing and
> integrate a few other changes, it will be about a week before the next
> actual release, I think.
> --G
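
One thing that may be worth covering when testing the title extraction: Perl
leaves $1 untouched when a match or substitution fails, so "$title = $1" can
pick up a stale capture if $rest has no unescaped colon. A minimal sketch of
guarding against that, assuming a hypothetical helper extract_title (not part
of the script) and leaving the $FILE_END_MARK* part out of the pattern for
simplicity:

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original script: strip a leading title
# (escaped colons allowed) from $rest and return (title, remainder).
sub extract_title {
    my ($rest) = @_;
    # Only trust $1 when the substitution actually matched.
    if ($rest =~ s/\s*(.*[^\\]):(.+)$/$2/) {
        return ($1, $rest);
    }
    # No unescaped colon found: report an empty title, leave $rest alone.
    return ('', $rest);
}

my ($title, $rest) = extract_title(" \tNo Title:20020129");
print "title=[$title] rest=[$rest]\n";
```

Checking the return value of s/// keeps the title empty instead of inheriting
whatever the previous successful match happened to capture.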