Re: replacing the monster regex
At 04:18 PM 1/29/02 GMT, Derek Pomery wrote:
>However, when you have a path like:
>http://thiserver.thislongveryverylongdomainname.net/project/subproject/subsubproject/somefurtherdivision/andanother/oneortwo/forfurtherorganization/A ridiculously long file name that describes exactly what the use case is.html
>
>I was finding it was unsurprisingly breaking the splits || regex.
>Hadn't gone back to see what the actual value for the substr was, since
>everything worked fine without it. :)
>
Must have been a monster name indeed - the default limit to the substr was
10000 chars!
But, this is still a good point, and if it runs fast enough without the
substr we'll just drop it.
The code I sent you expects to be dropped into a new version of makenh that
does the %20 to ' ' replacements and a bunch of other stuff as well. I'll
try to get it off to you tonight so you don't have to spend time
integrating the fragment I sent into the old version.
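
For reference, the %20 to ' ' replacement could look something like the sketch below. This is not the actual makenh code; unescape_path is a hypothetical helper name, and it assumes we want to decode every %XX escape rather than only %20.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from makenh: decode %XX escapes in a path.
# Assumption: all two-digit hex escapes are decoded, not just %20.
sub unescape_path {
    my ($path) = @_;
    # Replace each %XX with the character whose code is hex XX.
    $path =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $path;
}

print unescape_path('A%20ridiculously%20long%20file%20name.html'), "\n";
```

If only spaces matter, a plain s/%20/ /g would do the same job for that one case.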
--G
>
>
>>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<
>
>On 1/29/02, 2:17:10 AM, "Golda Velez" <golda@iwhome.com> wrote regarding
>replacing the monster regex:
>
>
>> Ok, I think I've got the replacement done, just need to test some - one
>> question, though
>
>> What did you mean by "the splits work much better when the substr() is
>> removed"? Unless maxchars was set very low, the substr should not have
>> affected the part of the line that we're actually splitting up...did you
>> maybe have it set low, or what errors did you get?
>
>What I got was consistently broken links. Over n over. I couldn't
>figure out why it wasn't parsing the filename correctly. Until I took
>out the substr() - I had not modified the default value, either.
>
>
>
>* Will paste this into my version, and see what happens :)
>> Here's the replacement code (which still needs to be tested), now off in
>> a subroutine:
>> --------------------------
>
>> ################################################
>> # Now with simpler regexp, may not need to do this substr
>> # We don't need to process more than maxchars, this can speed things up a lot
>> # for files with very long records (e.g. no linebreaks)
>> if ($maxchars >= 10000) {
>>     $$glinesref[$i] = substr($$glinesref[$i],0,$maxchars);
>> }
>> ################################################
>
>> ($file, $link, $pop, $rest) = split(/$FILE_END_MARK/,$$glinesref[$i],4);
>
>> # Better check - if $pop is not simple numeric, we are probably using an older index
>> # that did not save link popularity values
>> if ($pop =~ /\D/) {
>>     $rest = $pop.$rest;
>>     $pop = 1;
>> }
>
>> # for html documents, there will be an extra space and tab, then the title or "No Title" and a colon
>> # colons in the title are escaped
>> # non-html documents do not have title section
>
>> if ($rest =~ /^:/) {
>>     $title = '';
>> } else {
>>     $rest =~ s/\s*$FILE_END_MARK*(.*[^\\]):(.+)$/$2/;
>>     $title = $1;
>> }
>> ($null, $date, $string) = split(':', $rest, 3);
>
>> ------------------------------
>
>> Thanks again for your suggestion, I need to do a bunch of testing and
>> integrate a few other changes, it will be about a week before the next
>> actual release, I think.
>
>> --G
>
>
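
To make the split and the popularity fallback from the quoted code easier to try in isolation, here is a minimal standalone sketch. It is not the Webglimpse source: split_record is a hypothetical wrapper, the tab used for $FILE_END_MARK is a stand-in for whatever the real separator is, and the sample records (including the date-like text) are made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical separator; the real $FILE_END_MARK is defined elsewhere.
my $FILE_END_MARK = "\t";

# Hypothetical wrapper around the split and the $pop check from the thread.
sub split_record {
    my ($line) = @_;
    # A limit of 4 keeps any later separators inside $rest.
    my ($file, $link, $pop, $rest) = split(/\Q$FILE_END_MARK\E/, $line, 4);
    $pop  = '' unless defined $pop;
    $rest = '' unless defined $rest;
    # Older indexes did not save link popularity: if the third field is
    # not purely numeric, treat it as the start of $rest and default $pop to 1.
    if ($pop =~ /\D/) {
        $rest = $pop . $rest;
        $pop  = 1;
    }
    return ($file, $link, $pop, $rest);
}

# Made-up sample records: one with a popularity field, one without.
my @new = split_record(join("\t", '/docs/a.html', 'http://example.net/a.html', '7', ':2002-01-29:text'));
my @old = split_record(join("\t", '/docs/a.html', 'http://example.net/a.html', ':2002-01-29:text'));

print join('|', @new), "\n";
print join('|', @old), "\n";
```

The \Q...\E quoting is an addition here so that a separator containing regex metacharacters is matched literally; the quoted code interpolates $FILE_END_MARK directly.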
------------------------------------------------------------
Golda Velez (use contact form) 626-792-9277
Internet Workshop http://iwhome.com
Webglimpse Search Software http://webglimpse.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Help organize the world - index your own corner of the web