replacing the monster regex
OK, I think I've got the replacement done; it just needs some testing. One question, though:
What did you mean by "the splits work much better when the substr() is removed"? Unless maxchars was set very low, the substr should not have affected the part of the line that we're actually splitting up. Did you maybe have it set low? If not, what errors did you get?
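To make the question concrete, here is a minimal sketch of what I would expect. The tab separator, the record layout, the sample values, and the helper name leading_fields are all made up for illustration and are not the real index format. The only point is that truncating at a large maxchars leaves the leading fields alone, while a very small value cuts into them.

```perl
use strict;
use warnings;

# Hypothetical separator and record, for illustration only.
my $FILE_END_MARK = "\t";
my $record = join($FILE_END_MARK,
    '/docs/a.html', 'http://example.com/a.html', '3',
    'rest of the line ' . ('x' x 50000));

# Truncate a copy of the record, then split it into at most four fields.
sub leading_fields {
    my ($line, $maxchars) = @_;
    my $truncated = substr($line, 0, $maxchars);
    return split(/$FILE_END_MARK/, $truncated, 4);
}

my @large = leading_fields($record, 10000);  # cut lands deep inside the last field
my @small = leading_fields($record, 20);     # cut lands inside the link field

print "large: ", join(' | ', @large[0 .. 2]), "\n";
print "small: ", scalar(@small), " field(s): ", join(' | ', @small), "\n";
```

In this made-up case the first three fields survive intact at 10000 characters, but at 20 characters the link is cut short and the later fields never appear, which is the kind of breakage I would only expect with a very low setting.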
Here's the replacement code (which still needs to be tested), now off in a subroutine:
--------------------------
################################################
# Now with simpler regexp, may not need to do this substr
# We don't need to process more than maxchars, this can speed things up a lot
# for files with very long records (e.g. no linebreaks)
if ($maxchars >= 10000) {
    $$glinesref[$i] = substr($$glinesref[$i],0,$maxchars);
}
################################################
($file, $link, $pop, $rest) = split(/$FILE_END_MARK/,$$glinesref[$i],4);
# Better check - if $pop is not simple numeric, we are probably using an older index
# that did not save link popularity values
if ($pop =~ /\D/) {
    $rest = $pop.$rest;
    $pop = 1;
}
# for html documents, there will be an extra space and tab, then the title or "No Title" and a colon
# colons in the title are escaped
# non-html documents do not have title section
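if ($rest =~ /^:/) {
    $title = '';
} else {
    # only trust $1 if the substitution matched; after a failed match
    # it still holds the capture from an earlier successful match
    $title = ($rest =~ s/\s*$FILE_END_MARK*(.*[^\\]):(.+)$/$2/) ? $1 : '';
}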
($null, $date, $string) = split(':', $rest, 3);
------------------------------
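As a quick sanity check of the link popularity fallback above, here is a small standalone sketch. The tab separator, the sample lines, the date format, and the helper name split_record are assumptions for illustration only; the real index lines may look different.

```perl
use strict;
use warnings;

# Hypothetical separator, for illustration only.
my $FILE_END_MARK = "\t";

# Split a record into file, link, popularity and the remainder.
# If the third field is not purely numeric, assume an older index
# without popularity values and fold that field back into the remainder.
sub split_record {
    my ($line) = @_;
    my ($file, $link, $pop, $rest) = split(/$FILE_END_MARK/, $line, 4);
    $pop  = '' unless defined $pop;
    $rest = '' unless defined $rest;   # avoids an undef warning when concatenating
    if ($pop =~ /\D/) {
        $rest = $pop . $rest;
        $pop  = 1;
    }
    return ($file, $link, $pop, $rest);
}

# Made-up lines: one with a popularity field, one in the assumed older layout.
my @new = split_record("/docs/a.txt\thttp://example.com/a.txt\t7\t:2004-01-01:some text");
my @old = split_record("/docs/a.txt\thttp://example.com/a.txt\t:2004-01-01:some text");

print "new: pop=$new[2] rest=$new[3]\n";
print "old: pop=$old[2] rest=$old[3]\n";
```

Both calls should end up with the same remainder; only the popularity differs (the stored value for the newer line, the default of 1 for the older one).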
Thanks again for your suggestion. I need to do a bunch of testing and integrate a few other changes, so I think it will be about a week before the next actual release.
--G