Re: replacing the monster regex



At 04:18 PM 1/29/02 GMT, Derek Pomery wrote:
>However, when you have a path like:
>http://thiserver.thislongveryverylongdomainname.net/project/subproject/subsubproject/somefurtherdivision/andanother/oneortwo/forfurtherorganization/A ridiculously long file name that describes exactly what the use case is.html
>
>I was finding it was unsurprisingly breaking the splits || regex.
>Hadn't gone back to see what the actual value for the substr was, since 
>everything worked fine without it. :)
>

Must have been a monster name indeed - the default limit to the substr was
10000 chars!  

But, this is still a good point, and if it runs fast enough without the
substr we'll just drop it.  
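
To make the failure mode concrete, here is a minimal sketch of how
truncating a record before the split can lose fields. Everything in it
is an assumption for illustration: the tab used for $FILE_END_MARK, the
sample record, and the helper name split_record are hypothetical, and
the limit is set to a tiny 40 chars so the effect shows up (with the
real default of 10000 a record would have to be that long).

```perl
use strict;
use warnings;

# Hypothetical stand-ins: the real $FILE_END_MARK value and record
# layout are not shown in this thread.
my $FILE_END_MARK = "\t";
my $record = join($FILE_END_MARK,
    '/some/very/long/path/file.html',   # file
    'http://example.com/file.html',     # link
    3,                                  # link popularity
    ':20020129:matched text');          # rest of the record

# Split a record into at most 4 fields, optionally truncating it first.
sub split_record {
    my ($line, $maxchars) = @_;
    $line = substr($line, 0, $maxchars) if defined $maxchars;
    return split(/\Q$FILE_END_MARK\E/, $line, 4);
}

my @full = split_record($record);       # no truncation
my @cut  = split_record($record, 40);   # truncated inside the link field

print scalar(@full), " fields without truncation\n";
print scalar(@cut),  " fields after truncating to 40 chars\n";
```

When the cut lands inside the link field, the later fields never reach
the split, which would look like a broken link downstream.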

The code I sent you expects to be dropped into a new version of makenh that
does the %20 to ' ' replacements and a bunch of other stuff as well.  I'll
try to get it off to you tonight so you don't have to spend time
integrating the fragment I sent into the old version.
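
For reference, the %20 to ' ' replacement mentioned above can be
sketched in a couple of lines of Perl. This is not the makenh code
(which is not shown in this thread); the helper names unescape_spaces
and unescape_all are made up for the example, and the second one goes
further than what was described by decoding every %XX escape.

```perl
use strict;
use warnings;

# Hypothetical helper: replace only %20 with a literal space.
sub unescape_spaces {
    my ($path) = @_;
    $path =~ s/%20/ /g;
    return $path;
}

# More general variant (an assumption, not described in the thread):
# decode every %XX escape, not only %20.
sub unescape_all {
    my ($path) = @_;
    $path =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $path;
}

print unescape_spaces('A%20ridiculously%20long%20file%20name.html'), "\n";
```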

--G

>
>
>>>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<
>
>On 1/29/02, 2:17:10 AM, "Golda Velez" <golda@iwhome.com> wrote regarding 
>replacing the monster regex:
>
>
>> Ok, I think I've got the replacement done, just need to test some - one
>> question, though
>
>> What did you mean by "the splits work much better when the substr() is
>> removed"?  Unless maxchars was set very low, the substr should not have
>> affected the part of the line that we're actually splitting up...did you
>> maybe have it set low, or what errors did you get?
>
>What I got was consistently broken links.  Over n over.  I couldn't 
>figure out why it wasn't parsing the filename correctly.  Until I took 
>out the substr() -  I had not modified the default value, either.  
>
>
>
>* Will paste this into my version, and see what happens :)
>> Here's the replacement code (which still needs to be tested), now off in
>> a subroutine:
>> --------------------------
>
>>                 ################################################
>>                 # Now with simpler regexp, may not need to do this substr
>>                 # We don't need to process more than maxchars, this can
>>                 # speed things up a lot for files with very long records
>>                 # (e.g. no linebreaks)
>>                 if ($maxchars >= 10000) {
>>                         $$glinesref[$i] = substr($$glinesref[$i],0,$maxchars);
>>                 }
>>                 ################################################
>
>>                 ($file, $link, $pop, $rest) = split(/$FILE_END_MARK/,$$glinesref[$i],4);
>
>>                 # Better check - if $pop is not simple numeric, we are
>>                 # probably using an older index that did not save link
>>                 # popularity values
>>                 if ($pop =~ /\D/) {
>>                         $rest = $pop.$rest;
>>                         $pop = 1;
>>                 }
>
>>                 # for html documents, there will be an extra space and tab,
>>                 # then the title or "No Title" and a colon
>>                 # colons in the title are escaped
>>                 # non-html documents do not have title section
>
>>                 if ($rest =~ /^:/) {
>>                         $title = '';
>>                 } else {
>>                         $rest =~ s/\s*$FILE_END_MARK*(.*[^\\]):(.+)$/$2/;
>>                         $title = $1;
>>                 }
>>                 ($null, $date, $string) = split(':', $rest, 3);
>
>> ------------------------------
>
>> Thanks again for your suggestion, I need to do a bunch of testing and
>> integrate a few other changes, it will be about a week before the next
>> actual release, I think.
>
>> --G
>
>
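
A note on the title step in the quoted code: because the .* in
(.*[^\\]):(.+)$ is greedy, $1 will run up to the last unescaped colon
in the record rather than the first, and $1 is read even when the
substitution did not match. Below is one possible sketch of that step,
under an assumed layout for $rest that is only a guess from the
comments in the quoted code. The helper name parse_rest and the sample
layout are hypothetical, and this is not the code that went into the
release.

```perl
use strict;
use warnings;

# Assumed layout of $rest (a guess based on the quoted comments):
#   html:      " \tTitle with \: escaped colons:DATE:STRING"
#   non-html:  ":DATE:STRING"
sub parse_rest {
    my ($rest) = @_;
    my $title = '';

    # Title = everything up to the FIRST colon that is not
    # backslash-escaped. The leading colon is kept in $rest so the
    # split below sees the same shape for html and non-html records.
    if ($rest !~ /^:/ && $rest =~ s/^\s*((?:[^\\:]|\\.)*):/:/) {
        $title = $1;
        $title =~ s/\\:/:/g;    # undo the escaping for display
    }

    # First field is the empty string before the leading colon.
    my (undef, $date, $string) = split(/:/, $rest, 3);
    return ($title, $date, $string);
}

my ($title, $date, $string) =
    parse_rest(" \tFoo\\: Bar:20020129:some text: more");
print "title=$title date=$date string=$string\n";
```

Only using $1 inside the branch where the substitution succeeded avoids
picking up a stale capture from an earlier match.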
------------------------------------------------------------
Golda Velez         (use contact form)       626-792-9277
Internet Workshop                          http://iwhome.com
Webglimpse Search Software             http://webglimpse.net
		~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Help organize the world - index your own corner of the web